Commit 670e9f34ee3c7e052514c85014d2fdd99b672cdc

Authored by Paolo Ornati
Committed by Adrian Bunk
1 parent 53cb47268e

Documentation: remove duplicated words

Remove many duplicated words under Documentation/ and do other small
cleanups.

Examples:
        "and and" --> "and"
        "in in" --> "in"
        "the the" --> "the"
        "the the" --> "to the"
        ...

Signed-off-by: Paolo Ornati <ornati@fastwebnet.it>
Signed-off-by: Adrian Bunk <bunk@stusta.de>

Showing 52 changed files with 61 additions and 62 deletions

Documentation/DMA-mapping.txt
1 Dynamic DMA mapping 1 Dynamic DMA mapping
2 =================== 2 ===================
3 3
4 David S. Miller <davem@redhat.com> 4 David S. Miller <davem@redhat.com>
5 Richard Henderson <rth@cygnus.com> 5 Richard Henderson <rth@cygnus.com>
6 Jakub Jelinek <jakub@redhat.com> 6 Jakub Jelinek <jakub@redhat.com>
7 7
8 This document describes the DMA mapping system in terms of the pci_ 8 This document describes the DMA mapping system in terms of the pci_
9 API. For a similar API that works for generic devices, see 9 API. For a similar API that works for generic devices, see
10 DMA-API.txt. 10 DMA-API.txt.
11 11
12 Most of the 64bit platforms have special hardware that translates bus 12 Most of the 64bit platforms have special hardware that translates bus
13 addresses (DMA addresses) into physical addresses. This is similar to 13 addresses (DMA addresses) into physical addresses. This is similar to
14 how page tables and/or a TLB translates virtual addresses to physical 14 how page tables and/or a TLB translates virtual addresses to physical
15 addresses on a CPU. This is needed so that e.g. PCI devices can 15 addresses on a CPU. This is needed so that e.g. PCI devices can
16 access with a Single Address Cycle (32bit DMA address) any page in the 16 access with a Single Address Cycle (32bit DMA address) any page in the
17 64bit physical address space. Previously in Linux those 64bit 17 64bit physical address space. Previously in Linux those 64bit
18 platforms had to set artificial limits on the maximum RAM size in the 18 platforms had to set artificial limits on the maximum RAM size in the
19 system, so that the virt_to_bus() static scheme works (the DMA address 19 system, so that the virt_to_bus() static scheme works (the DMA address
20 translation tables were simply filled on bootup to map each bus 20 translation tables were simply filled on bootup to map each bus
21 address to the physical page __pa(bus_to_virt())). 21 address to the physical page __pa(bus_to_virt())).
22 22
23 So that Linux can use the dynamic DMA mapping, it needs some help from the 23 So that Linux can use the dynamic DMA mapping, it needs some help from the
24 drivers, namely it has to take into account that DMA addresses should be 24 drivers, namely it has to take into account that DMA addresses should be
25 mapped only for the time they are actually used and unmapped after the DMA 25 mapped only for the time they are actually used and unmapped after the DMA
26 transfer. 26 transfer.
27 27
28 The following API will work of course even on platforms where no such 28 The following API will work of course even on platforms where no such
29 hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on 29 hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on
30 top of the virt_to_bus interface. 30 top of the virt_to_bus interface.
31 31
32 First of all, you should make sure 32 First of all, you should make sure
33 33
34 #include <linux/pci.h> 34 #include <linux/pci.h>
35 35
36 is in your driver. This file will obtain for you the definition of the 36 is in your driver. This file will obtain for you the definition of the
37 dma_addr_t (which can hold any valid DMA address for the platform) 37 dma_addr_t (which can hold any valid DMA address for the platform)
38 type which should be used everywhere you hold a DMA (bus) address 38 type which should be used everywhere you hold a DMA (bus) address
39 returned from the DMA mapping functions. 39 returned from the DMA mapping functions.
40 40
41 What memory is DMA'able? 41 What memory is DMA'able?
42 42
43 The first piece of information you must know is what kernel memory can 43 The first piece of information you must know is what kernel memory can
44 be used with the DMA mapping facilities. There has been an unwritten 44 be used with the DMA mapping facilities. There has been an unwritten
45 set of rules regarding this, and this text is an attempt to finally 45 set of rules regarding this, and this text is an attempt to finally
46 write them down. 46 write them down.
47 47
48 If you acquired your memory via the page allocator 48 If you acquired your memory via the page allocator
49 (i.e. __get_free_page*()) or the generic memory allocators 49 (i.e. __get_free_page*()) or the generic memory allocators
50 (i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from 50 (i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
51 that memory using the addresses returned from those routines. 51 that memory using the addresses returned from those routines.
52 52
53 This means specifically that you may _not_ use the memory/addresses 53 This means specifically that you may _not_ use the memory/addresses
54 returned from vmalloc() for DMA. It is possible to DMA to the 54 returned from vmalloc() for DMA. It is possible to DMA to the
55 _underlying_ memory mapped into a vmalloc() area, but this requires 55 _underlying_ memory mapped into a vmalloc() area, but this requires
56 walking page tables to get the physical addresses, and then 56 walking page tables to get the physical addresses, and then
57 translating each of those pages back to a kernel address using 57 translating each of those pages back to a kernel address using
58 something like __va(). [ EDIT: Update this when we integrate 58 something like __va(). [ EDIT: Update this when we integrate
59 Gerd Knorr's generic code which does this. ] 59 Gerd Knorr's generic code which does this. ]
60 60
61 This rule also means that you may use neither kernel image addresses 61 This rule also means that you may use neither kernel image addresses
62 (items in data/text/bss segments), nor module image addresses, nor 62 (items in data/text/bss segments), nor module image addresses, nor
63 stack addresses for DMA. These could all be mapped somewhere entirely 63 stack addresses for DMA. These could all be mapped somewhere entirely
64 different than the rest of physical memory. Even if those classes of 64 different than the rest of physical memory. Even if those classes of
65 memory could physically work with DMA, you'd need to ensure the I/O 65 memory could physically work with DMA, you'd need to ensure the I/O
66 buffers were cacheline-aligned. Without that, you'd see cacheline 66 buffers were cacheline-aligned. Without that, you'd see cacheline
67 sharing problems (data corruption) on CPUs with DMA-incoherent caches. 67 sharing problems (data corruption) on CPUs with DMA-incoherent caches.
68 (The CPU could write to one word, DMA would write to a different one 68 (The CPU could write to one word, DMA would write to a different one
69 in the same cache line, and one of them could be overwritten.) 69 in the same cache line, and one of them could be overwritten.)
70 70
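The cacheline-sharing condition described above can be expressed as a simple check. This is a minimal user-space sketch, not kernel code: the 64-byte line size and the helper name toy_owns_cachelines() are assumptions introduced for illustration, and real line sizes vary by CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHELINE 64u	/* assumption: the real line size depends on the CPU */

/*
 * Hypothetical helper: returns 1 if a buffer starts and ends on cache line
 * boundaries, so no other data can share a line with it; returns 0 otherwise.
 */
static int toy_owns_cachelines(uintptr_t addr, size_t len)
{
	assert(len > 0);
	return (addr % TOY_CACHELINE == 0) && (len % TOY_CACHELINE == 0);
}
```

A buffer at 0x1000 with length 128 passes the check. The same buffer shifted by four bytes, or shortened to 100 bytes, leaves part of a line available to unrelated data and fails it.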
71 Also, this means that you cannot take the return of a kmap() 71 Also, this means that you cannot take the return of a kmap()
72 call and DMA to/from that. This is similar to vmalloc(). 72 call and DMA to/from that. This is similar to vmalloc().
73 73
74 What about block I/O and networking buffers? The block I/O and 74 What about block I/O and networking buffers? The block I/O and
75 networking subsystems make sure that the buffers they use are valid 75 networking subsystems make sure that the buffers they use are valid
76 for you to DMA from/to. 76 for you to DMA from/to.
77 77
78 DMA addressing limitations 78 DMA addressing limitations
79 79
80 Does your device have any DMA addressing limitations? For example, is 80 Does your device have any DMA addressing limitations? For example, is
81 your device only capable of driving the low order 24-bits of address 81 your device only capable of driving the low order 24-bits of address
82 on the PCI bus for SAC DMA transfers? If so, you need to inform the 82 on the PCI bus for SAC DMA transfers? If so, you need to inform the
83 PCI layer of this fact. 83 PCI layer of this fact.
84 84
85 By default, the kernel assumes that your device can address the full 85 By default, the kernel assumes that your device can address the full
86 32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs 86 32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs
87 to be increased. And for a device with limitations, as discussed in 87 to be increased. And for a device with limitations, as discussed in
88 the previous paragraph, it needs to be decreased. 88 the previous paragraph, it needs to be decreased.
89 89
90 pci_alloc_consistent() by default will return 32-bit DMA addresses. 90 pci_alloc_consistent() by default will return 32-bit DMA addresses.
91 PCI-X specification requires PCI-X devices to support 64-bit 91 PCI-X specification requires PCI-X devices to support 64-bit
92 addressing (DAC) for all transactions. And at least one platform (SGI 92 addressing (DAC) for all transactions. And at least one platform (SGI
93 SN2) requires 64-bit consistent allocations to operate correctly when 93 SN2) requires 64-bit consistent allocations to operate correctly when
94 the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(), 94 the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(),
95 it's good practice to call pci_set_consistent_dma_mask() to set the 95 it's good practice to call pci_set_consistent_dma_mask() to set the
96 appropriate mask even if your device only supports 32-bit DMA 96 appropriate mask even if your device only supports 32-bit DMA
97 (default) and especially if it's a PCI-X device. 97 (default) and especially if it's a PCI-X device.
98 98
99 For correct operation, you must interrogate the PCI layer in your 99 For correct operation, you must interrogate the PCI layer in your
100 device probe routine to see if the PCI controller on the machine can 100 device probe routine to see if the PCI controller on the machine can
101 properly support the DMA addressing limitation your device has. It is 101 properly support the DMA addressing limitation your device has. It is
102 good style to do this even if your device holds the default setting, 102 good style to do this even if your device holds the default setting,
103 because this shows that you did think about these issues wrt. your 103 because this shows that you did think about these issues wrt. your
104 device. 104 device.
105 105
106 The query is performed via a call to pci_set_dma_mask(): 106 The query is performed via a call to pci_set_dma_mask():
107 107
108 int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask); 108 int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask);
109 109
110 The query for consistent allocations is performed via a a call to 110 The query for consistent allocations is performed via a call to
111 pci_set_consistent_dma_mask(): 111 pci_set_consistent_dma_mask():
112 112
113 int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask); 113 int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask);
114 114
115 Here, pdev is a pointer to the PCI device struct of your device, and 115 Here, pdev is a pointer to the PCI device struct of your device, and
116 device_mask is a bit mask describing which bits of a PCI address your 116 device_mask is a bit mask describing which bits of a PCI address your
117 device supports. It returns zero if your card can perform DMA 117 device supports. It returns zero if your card can perform DMA
118 properly on the machine given the address mask you provided. 118 properly on the machine given the address mask you provided.
119 119
120 If it returns non-zero, your device cannot perform DMA properly on 120 If it returns non-zero, your device cannot perform DMA properly on
121 this platform, and attempting to do so will result in undefined 121 this platform, and attempting to do so will result in undefined
122 behavior. You must either use a different mask, or not use DMA. 122 behavior. You must either use a different mask, or not use DMA.
123 123
124 This means that in the failure case, you have three options: 124 This means that in the failure case, you have three options:
125 125
126 1) Use another DMA mask, if possible (see below). 126 1) Use another DMA mask, if possible (see below).
127 2) Use some non-DMA mode for data transfer, if possible. 127 2) Use some non-DMA mode for data transfer, if possible.
128 3) Ignore this device and do not initialize it. 128 3) Ignore this device and do not initialize it.
129 129
130 It is recommended that your driver print a kernel KERN_WARNING message 130 It is recommended that your driver print a kernel KERN_WARNING message
131 when you end up performing either #2 or #3. In this manner, if a user 131 when you end up performing either #2 or #3. In this manner, if a user
132 of your driver reports that performance is bad or that the device is not 132 of your driver reports that performance is bad or that the device is not
133 even detected, you can ask them for the kernel messages to find out 133 even detected, you can ask them for the kernel messages to find out
134 exactly why. 134 exactly why.
135 135
136 The standard 32-bit addressing PCI device would do something like 136 The standard 32-bit addressing PCI device would do something like
137 this: 137 this:
138 138
139 if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { 139 if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
140 printk(KERN_WARNING 140 printk(KERN_WARNING
141 "mydev: No suitable DMA available.\n"); 141 "mydev: No suitable DMA available.\n");
142 goto ignore_this_device; 142 goto ignore_this_device;
143 } 143 }
144 144
145 Another common scenario is a 64-bit capable device. The approach 145 Another common scenario is a 64-bit capable device. The approach
146 here is to try for 64-bit DAC addressing, but back down to a 146 here is to try for 64-bit DAC addressing, but back down to a
147 32-bit mask should that fail. The PCI platform code may fail the 147 32-bit mask should that fail. The PCI platform code may fail the
148 64-bit mask not because the platform is not capable of 64-bit 148 64-bit mask not because the platform is not capable of 64-bit
149 addressing. Rather, it may fail in this case simply because 149 addressing. Rather, it may fail in this case simply because
150 32-bit SAC addressing is done more efficiently than DAC addressing. 150 32-bit SAC addressing is done more efficiently than DAC addressing.
151 Sparc64 is one platform which behaves in this way. 151 Sparc64 is one platform which behaves in this way.
152 152
153 Here is how you would handle a 64-bit capable device which can drive 153 Here is how you would handle a 64-bit capable device which can drive
154 all 64-bits when accessing streaming DMA: 154 all 64-bits when accessing streaming DMA:
155 155
156 int using_dac; 156 int using_dac;
157 157
158 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { 158 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
159 using_dac = 1; 159 using_dac = 1;
160 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { 160 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
161 using_dac = 0; 161 using_dac = 0;
162 } else { 162 } else {
163 printk(KERN_WARNING 163 printk(KERN_WARNING
164 "mydev: No suitable DMA available.\n"); 164 "mydev: No suitable DMA available.\n");
165 goto ignore_this_device; 165 goto ignore_this_device;
166 } 166 }
167 167
168 If a card is capable of using 64-bit consistent allocations as well, 168 If a card is capable of using 64-bit consistent allocations as well,
169 the case would look like this: 169 the case would look like this:
170 170
171 int using_dac, consistent_using_dac; 171 int using_dac, consistent_using_dac;
172 172
173 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) { 173 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
174 using_dac = 1; 174 using_dac = 1;
175 consistent_using_dac = 1; 175 consistent_using_dac = 1;
176 pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK); 176 pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
177 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) { 177 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
178 using_dac = 0; 178 using_dac = 0;
179 consistent_using_dac = 0; 179 consistent_using_dac = 0;
180 pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK); 180 pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);
181 } else { 181 } else {
182 printk(KERN_WARNING 182 printk(KERN_WARNING
183 "mydev: No suitable DMA available.\n"); 183 "mydev: No suitable DMA available.\n");
184 goto ignore_this_device; 184 goto ignore_this_device;
185 } 185 }
186 186
187 pci_set_consistent_dma_mask() will always be able to set the same or a 187 pci_set_consistent_dma_mask() will always be able to set the same or a
188 smaller mask as pci_set_dma_mask(). However for the rare case that a 188 smaller mask as pci_set_dma_mask(). However for the rare case that a
189 device driver only uses consistent allocations, one would have to 189 device driver only uses consistent allocations, one would have to
190 check the return value from pci_set_consistent_dma_mask(). 190 check the return value from pci_set_consistent_dma_mask().
191 191
192 If your 64-bit device is going to be an enormous consumer of DMA 192 If your 64-bit device is going to be an enormous consumer of DMA
193 mappings, this can be problematic since the DMA mappings are a 193 mappings, this can be problematic since the DMA mappings are a
194 finite resource on many platforms. Please see the "DAC Addressing 194 finite resource on many platforms. Please see the "DAC Addressing
195 for Address Space Hungry Devices" section near the end of this 195 for Address Space Hungry Devices" section near the end of this
196 document for how to handle this case. 196 document for how to handle this case.
197 197
198 Finally, if your device can only drive the low 24-bits of 198 Finally, if your device can only drive the low 24-bits of
199 address during PCI bus mastering you might do something like: 199 address during PCI bus mastering you might do something like:
200 200
201 if (pci_set_dma_mask(pdev, DMA_24BIT_MASK)) { 201 if (pci_set_dma_mask(pdev, DMA_24BIT_MASK)) {
202 printk(KERN_WARNING 202 printk(KERN_WARNING
203 "mydev: 24-bit DMA addressing not available.\n"); 203 "mydev: 24-bit DMA addressing not available.\n");
204 goto ignore_this_device; 204 goto ignore_this_device;
205 } 205 }
206 [Better use DMA_24BIT_MASK instead of 0x00ffffff. 206 [Better use DMA_24BIT_MASK instead of 0x00ffffff.
207 See linux/include/dma-mapping.h for reference.] 207 See linux/include/dma-mapping.h for reference.]
208 208
209 When pci_set_dma_mask() is successful, and returns zero, the PCI layer 209 When pci_set_dma_mask() is successful, and returns zero, the PCI layer
210 saves away this mask you have provided. The PCI layer will use this 210 saves away this mask you have provided. The PCI layer will use this
211 information later when you make DMA mappings. 211 information later when you make DMA mappings.
212 212
213 There is a case which we are aware of at this time, which is worth 213 There is a case which we are aware of at this time, which is worth
214 mentioning in this documentation. If your device supports multiple 214 mentioning in this documentation. If your device supports multiple
215 functions (for example a sound card provides playback and record 215 functions (for example a sound card provides playback and record
216 functions) and the various different functions have _different_ 216 functions) and the various different functions have _different_
217 DMA addressing limitations, you may wish to probe each mask and 217 DMA addressing limitations, you may wish to probe each mask and
218 only provide the functionality which the machine can handle. It 218 only provide the functionality which the machine can handle. It
219 is important that the last call to pci_set_dma_mask() be for the 219 is important that the last call to pci_set_dma_mask() be for the
220 most specific mask. 220 most specific mask.
221 221
222 Here is pseudo-code showing how this might be done: 222 Here is pseudo-code showing how this might be done:
223 223
224 #define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK 224 #define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK
225 #define RECORD_ADDRESS_BITS 0x00ffffff 225 #define RECORD_ADDRESS_BITS 0x00ffffff
226 226
227 struct my_sound_card *card; 227 struct my_sound_card *card;
228 struct pci_dev *pdev; 228 struct pci_dev *pdev;
229 229
230 ... 230 ...
231 if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) { 231 if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) {
232 card->playback_enabled = 1; 232 card->playback_enabled = 1;
233 } else { 233 } else {
234 card->playback_enabled = 0; 234 card->playback_enabled = 0;
235 printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n", 235 printk(KERN_WARN "%s: Playback disabled due to DMA limitations.\n",
236 card->name); 236 card->name);
237 } 237 }
238 if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) { 238 if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) {
239 card->record_enabled = 1; 239 card->record_enabled = 1;
240 } else { 240 } else {
241 card->record_enabled = 0; 241 card->record_enabled = 0;
242 printk(KERN_WARN "%s: Record disabled due to DMA limitations.\n", 242 printk(KERN_WARN "%s: Record disabled due to DMA limitations.\n",
243 card->name); 243 card->name);
244 } 244 }
245 245
246 A sound card was used as an example here because this genre of PCI 246 A sound card was used as an example here because this genre of PCI
247 devices seems to be littered with ISA chips given a PCI front end, 247 devices seems to be littered with ISA chips given a PCI front end,
248 and thus retaining the 16MB DMA addressing limitations of ISA. 248 and thus retaining the 16MB DMA addressing limitations of ISA.
249 249
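As a worked check of the 16MB figure above: a mask with the low 24 bits set spans 2^24 bytes. The sketch below is plain user-space C rather than kernel code. The helpers dma_mask_for_bits() and addressable_mb() are hypothetical names introduced for illustration only; the resulting values match the 0x00ffffff mask used in the pseudo-code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with the low `bits` bits set (1 <= bits <= 64). */
static uint64_t dma_mask_for_bits(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	/* Shifting a 64-bit value by 64 is undefined in C, so treat it apart. */
	return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/*
 * Hypothetical helper: size in megabytes of the range covered by `mask`.
 * Assumes a contiguous low-bit mask of at least 20 bits.
 */
static uint64_t addressable_mb(uint64_t mask)
{
	return (mask >> 20) + 1;
}
```

With these definitions, dma_mask_for_bits(24) yields 0x00ffffff and addressable_mb() of that mask yields 16, which is the ISA limit the paragraph refers to.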
250 Types of DMA mappings 250 Types of DMA mappings
251 251
252 There are two types of DMA mappings: 252 There are two types of DMA mappings:
253 253
254 - Consistent DMA mappings which are usually mapped at driver 254 - Consistent DMA mappings which are usually mapped at driver
255 initialization, unmapped at the end and for which the hardware should 255 initialization, unmapped at the end and for which the hardware should
256 guarantee that the device and the CPU can access the data 256 guarantee that the device and the CPU can access the data
257 in parallel and will see updates made by each other without any 257 in parallel and will see updates made by each other without any
258 explicit software flushing. 258 explicit software flushing.
259 259
260 Think of "consistent" as "synchronous" or "coherent". 260 Think of "consistent" as "synchronous" or "coherent".
261 261
262 The current default is to return consistent memory in the low 32 262 The current default is to return consistent memory in the low 32
263 bits of the PCI bus space. However, for future compatibility you 263 bits of the PCI bus space. However, for future compatibility you
264 should set the consistent mask even if this default is fine for your 264 should set the consistent mask even if this default is fine for your
265 driver. 265 driver.
266 266
267 Good examples of what to use consistent mappings for are: 267 Good examples of what to use consistent mappings for are:
268 268
269 - Network card DMA ring descriptors. 269 - Network card DMA ring descriptors.
270 - SCSI adapter mailbox command data structures. 270 - SCSI adapter mailbox command data structures.
271 - Device firmware microcode executed out of 271 - Device firmware microcode executed out of
272 main memory. 272 main memory.
273 273
274 The invariant these examples all require is that any CPU store 274 The invariant these examples all require is that any CPU store
275 to memory is immediately visible to the device, and vice 275 to memory is immediately visible to the device, and vice
276 versa. Consistent mappings guarantee this. 276 versa. Consistent mappings guarantee this.
277 277
278 IMPORTANT: Consistent DMA memory does not preclude the usage of 278 IMPORTANT: Consistent DMA memory does not preclude the usage of
279 proper memory barriers. The CPU may reorder stores to 279 proper memory barriers. The CPU may reorder stores to
280 consistent memory just as it may normal memory. Example: 280 consistent memory just as it may normal memory. Example:
281 if it is important for the device to see the first word 281 if it is important for the device to see the first word
282 of a descriptor updated before the second, you must do 282 of a descriptor updated before the second, you must do
283 something like: 283 something like:
284 284
285 desc->word0 = address; 285 desc->word0 = address;
286 wmb(); 286 wmb();
287 desc->word1 = DESC_VALID; 287 desc->word1 = DESC_VALID;
288 288
289 in order to get correct behavior on all platforms. 289 in order to get correct behavior on all platforms.
290 290
291 Also, on some platforms your driver may need to flush CPU write 291 Also, on some platforms your driver may need to flush CPU write
292 buffers in much the same way as it needs to flush write buffers 292 buffers in much the same way as it needs to flush write buffers
293 found in PCI bridges (such as by reading a register's value 293 found in PCI bridges (such as by reading a register's value
294 after writing it). 294 after writing it).
295 295
296 - Streaming DMA mappings which are usually mapped for one DMA transfer, 296 - Streaming DMA mappings which are usually mapped for one DMA transfer,
297 unmapped right after it (unless you use pci_dma_sync_* below) and for which 297 unmapped right after it (unless you use pci_dma_sync_* below) and for which
298 hardware can optimize for sequential accesses. 298 hardware can optimize for sequential accesses.
299 299
300 This of "streaming" as "asynchronous" or "outside the coherency 300 This of "streaming" as "asynchronous" or "outside the coherency
301 domain". 301 domain".
302 302
303 Good examples of what to use streaming mappings for are: 303 Good examples of what to use streaming mappings for are:
304 304
305 - Networking buffers transmitted/received by a device. 305 - Networking buffers transmitted/received by a device.
306 - Filesystem buffers written/read by a SCSI device. 306 - Filesystem buffers written/read by a SCSI device.
307 307
308 The interfaces for using this type of mapping were designed in 308 The interfaces for using this type of mapping were designed in
309 such a way that an implementation can make whatever performance 309 such a way that an implementation can make whatever performance
310 optimizations the hardware allows. To this end, when using 310 optimizations the hardware allows. To this end, when using
311 such mappings you must be explicit about what you want to happen. 311 such mappings you must be explicit about what you want to happen.
312 312
313 Neither type of DMA mapping has alignment restrictions that come 313 Neither type of DMA mapping has alignment restrictions that come
314 from PCI, although some devices may have such restrictions. 314 from PCI, although some devices may have such restrictions.
315 Also, systems with caches that aren't DMA-coherent will work better 315 Also, systems with caches that aren't DMA-coherent will work better
316 when the underlying buffers don't share cache lines with other data. 316 when the underlying buffers don't share cache lines with other data.
317 317
318 318
319 Using Consistent DMA mappings. 319 Using Consistent DMA mappings.
320 320
321 To allocate and map large (PAGE_SIZE or so) consistent DMA regions, 321 To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
322 you should do: 322 you should do:
323 323
324 dma_addr_t dma_handle; 324 dma_addr_t dma_handle;
325 325
326 cpu_addr = pci_alloc_consistent(dev, size, &dma_handle); 326 cpu_addr = pci_alloc_consistent(dev, size, &dma_handle);
327 327
328 where dev is a struct pci_dev *. You should pass NULL for PCI like buses 328 where dev is a struct pci_dev *. You should pass NULL for PCI like buses
329 where devices don't have struct pci_dev (like ISA, EISA). This may be 329 where devices don't have struct pci_dev (like ISA, EISA). This may be
330 called in interrupt context. 330 called in interrupt context.
331 331
332 This argument is needed because the DMA translations may be bus 332 This argument is needed because the DMA translations may be bus
333 specific (and often is private to the bus which the device is attached 333 specific (and often is private to the bus which the device is attached
334 to). 334 to).
335 335
336 Size is the length of the region you want to allocate, in bytes. 336 Size is the length of the region you want to allocate, in bytes.
337 337
338 This routine will allocate RAM for that region, so it acts similarly to 338 This routine will allocate RAM for that region, so it acts similarly to
339 __get_free_pages (but takes size instead of a page order). If your 339 __get_free_pages (but takes size instead of a page order). If your
340 driver needs regions sized smaller than a page, you may prefer using 340 driver needs regions sized smaller than a page, you may prefer using
341 the pci_pool interface, described below. 341 the pci_pool interface, described below.
342 342
343 The consistent DMA mapping interfaces, for non-NULL dev, will by 343 The consistent DMA mapping interfaces, for non-NULL dev, will by
344 default return a DMA address which is SAC (Single Address Cycle) 344 default return a DMA address which is SAC (Single Address Cycle)
345 addressable. Even if the device indicates (via PCI dma mask) that it 345 addressable. Even if the device indicates (via PCI dma mask) that it
346 may address the upper 32-bits and thus perform DAC cycles, consistent 346 may address the upper 32-bits and thus perform DAC cycles, consistent
347 allocation will only return > 32-bit PCI addresses for DMA if the 347 allocation will only return > 32-bit PCI addresses for DMA if the
348 consistent dma mask has been explicitly changed via 348 consistent dma mask has been explicitly changed via
349 pci_set_consistent_dma_mask(). This is true of the pci_pool interface 349 pci_set_consistent_dma_mask(). This is true of the pci_pool interface
350 as well. 350 as well.
351 351
352 pci_alloc_consistent returns two values: the virtual address which you 352 pci_alloc_consistent returns two values: the virtual address which you
353 can use to access it from the CPU and dma_handle which you pass to the 353 can use to access it from the CPU and dma_handle which you pass to the
354 card. 354 card.
355 355
356 The cpu return address and the DMA bus master address are both 356 The cpu return address and the DMA bus master address are both
357 guaranteed to be aligned to the smallest PAGE_SIZE order which 357 guaranteed to be aligned to the smallest PAGE_SIZE order which
358 is greater than or equal to the requested size. This invariant 358 is greater than or equal to the requested size. This invariant
359 exists (for example) to guarantee that if you allocate a chunk 359 exists (for example) to guarantee that if you allocate a chunk
360 which is smaller than or equal to 64 kilobytes, the extent of the 360 which is smaller than or equal to 64 kilobytes, the extent of the
361 buffer you receive will not cross a 64K boundary. 361 buffer you receive will not cross a 64K boundary.
362 362
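The alignment rule above can be checked with a little arithmetic. This is a user-space sketch under stated assumptions, not the kernel's implementation: TOY_PAGE_SIZE is fixed at 4096 here (the real PAGE_SIZE is platform specific), and toy_alloc_alignment() and toy_crosses_boundary() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096UL	/* assumption: real PAGE_SIZE is platform specific */

/* Smallest TOY_PAGE_SIZE << order that is >= size (size must be non-zero). */
static unsigned long toy_alloc_alignment(unsigned long size)
{
	unsigned long align = TOY_PAGE_SIZE;

	assert(size > 0);
	while (align < size)
		align <<= 1;
	return align;
}

/* Non-zero if [start, start + size) spans a multiple of `boundary`. */
static int toy_crosses_boundary(uint64_t start, unsigned long size,
				uint64_t boundary)
{
	return (start / boundary) != ((start + size - 1) / boundary);
}
```

A 5000-byte request is aligned to 8192 bytes under these assumptions. Because 8192 divides 65536, a buffer that starts on an 8192-byte boundary and is no longer than 8192 bytes stays inside one 64K block, whereas an unaligned start such as 0xFF00 would let the same 5000 bytes straddle the boundary.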
363 To unmap and free such a DMA region, you call: 363 To unmap and free such a DMA region, you call:
364 364
365 pci_free_consistent(dev, size, cpu_addr, dma_handle); 365 pci_free_consistent(dev, size, cpu_addr, dma_handle);
366 366
367 where dev, size are the same as in the above call and cpu_addr and 367 where dev, size are the same as in the above call and cpu_addr and
368 dma_handle are the values pci_alloc_consistent returned to you. 368 dma_handle are the values pci_alloc_consistent returned to you.
369 This function may not be called in interrupt context. 369 This function may not be called in interrupt context.
370 370
371 If your driver needs lots of smaller memory regions, you can write 371 If your driver needs lots of smaller memory regions, you can write
372 custom code to subdivide pages returned by pci_alloc_consistent, 372 custom code to subdivide pages returned by pci_alloc_consistent,
373 or you can use the pci_pool API to do that. A pci_pool is like 373 or you can use the pci_pool API to do that. A pci_pool is like
374 a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages. 374 a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages.
375 Also, it understands common hardware constraints for alignment, 375 Also, it understands common hardware constraints for alignment,
376 like queue heads needing to be aligned on N byte boundaries. 376 like queue heads needing to be aligned on N byte boundaries.
377 377
378 Create a pci_pool like this: 378 Create a pci_pool like this:
379 379
380 struct pci_pool *pool; 380 struct pci_pool *pool;
381 381
382 pool = pci_pool_create(name, dev, size, align, alloc); 382 pool = pci_pool_create(name, dev, size, align, alloc);
383 383
384 The "name" is for diagnostics (like a kmem_cache name); dev and size 384 The "name" is for diagnostics (like a kmem_cache name); dev and size
385 are as above. The device's hardware alignment requirement for this 385 are as above. The device's hardware alignment requirement for this
386 type of data is "align" (which is expressed in bytes, and must be a 386 type of data is "align" (which is expressed in bytes, and must be a
387 power of two). If your device has no boundary crossing restrictions, 387 power of two). If your device has no boundary crossing restrictions,
388 pass 0 for alloc; passing 4096 says memory allocated from this pool 388 pass 0 for alloc; passing 4096 says memory allocated from this pool
389 must not cross 4KByte boundaries (but at that time it may be better to 389 must not cross 4KByte boundaries (but at that time it may be better to
390 go for pci_alloc_consistent directly instead). 390 go for pci_alloc_consistent directly instead).
391 391
392 Allocate memory from a pci pool like this: 392 Allocate memory from a pci pool like this:
393 393
394 cpu_addr = pci_pool_alloc(pool, flags, &dma_handle); 394 cpu_addr = pci_pool_alloc(pool, flags, &dma_handle);
395 395
396 flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor 396 flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor
397 holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent, 397 holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent,
398 this returns two values, cpu_addr and dma_handle. 398 this returns two values, cpu_addr and dma_handle.
399 399
400 Free memory that was allocated from a pci_pool like this: 400 Free memory that was allocated from a pci_pool like this:
401 401
402 pci_pool_free(pool, cpu_addr, dma_handle); 402 pci_pool_free(pool, cpu_addr, dma_handle);
403 403
404 where pool is what you passed to pci_pool_alloc, and cpu_addr and 404 where pool is what you passed to pci_pool_alloc, and cpu_addr and
405 dma_handle are the values pci_pool_alloc returned. This function 405 dma_handle are the values pci_pool_alloc returned. This function
406 may be called in interrupt context. 406 may be called in interrupt context.
407 407
408 Destroy a pci_pool by calling: 408 Destroy a pci_pool by calling:
409 409
410 pci_pool_destroy(pool); 410 pci_pool_destroy(pool);
411 411
412 Make sure you've called pci_pool_free for all memory allocated 412 Make sure you've called pci_pool_free for all memory allocated
413 from a pool before you destroy the pool. This function may not 413 from a pool before you destroy the pool. This function may not
414 be called in interrupt context. 414 be called in interrupt context.
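
The create/alloc/free/destroy life cycle can be modelled in userland. The sketch below is not the pci_pool implementation: it uses malloc for backing storage, a made-up bus address, a fixed block count (nblks is an extra hypothetical parameter), and it ignores the boundary argument. It only shows the two returned values and the "free everything before destroy" rule.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_pool {
	unsigned char *base;	/* stands in for a consistent buffer */
	uint64_t dma_base;	/* pretend bus address of 'base' */
	size_t blk_size;	/* size rounded up to the alignment */
	size_t nblks;
	unsigned char *used;	/* one flag per block */
	size_t outstanding;	/* blocks handed out and not yet freed */
};

static struct sketch_pool *sketch_pool_create(size_t size, size_t align,
					      size_t nblks)
{
	struct sketch_pool *p;

	/* align must be a nonzero power of two */
	if (!size || !nblks || !align || (align & (align - 1)))
		return NULL;
	p = calloc(1, sizeof(*p));
	if (!p)
		return NULL;
	p->blk_size = (size + align - 1) & ~(align - 1);
	p->nblks = nblks;
	p->base = malloc(p->blk_size * nblks);
	p->used = calloc(nblks, 1);
	if (!p->base || !p->used) {
		free(p->base);
		free(p->used);
		free(p);
		return NULL;
	}
	p->dma_base = 0x10000000;	/* arbitrary aligned value */
	return p;
}

/* Returns the cpu address; the bus address comes back via dma_handle. */
static void *sketch_pool_alloc(struct sketch_pool *p, uint64_t *dma_handle)
{
	size_t i;

	for (i = 0; i < p->nblks; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			p->outstanding++;
			*dma_handle = p->dma_base + i * p->blk_size;
			return p->base + i * p->blk_size;
		}
	}
	return NULL;
}

static void sketch_pool_free(struct sketch_pool *p, void *cpu_addr,
			     uint64_t dma_handle)
{
	size_t i = (size_t)((dma_handle - p->dma_base) / p->blk_size);

	(void)cpu_addr;
	p->used[i] = 0;
	p->outstanding--;
}

/* Refuses (returns -1) while blocks are still outstanding. */
static int sketch_pool_destroy(struct sketch_pool *p)
{
	if (p->outstanding)
		return -1;
	free(p->base);
	free(p->used);
	free(p);
	return 0;
}
```

The real pci_pool_destroy returns nothing; the sketch returns an error code only so that the ordering mistake is visible.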
415 415
416 DMA Direction 416 DMA Direction
417 417
418 The interfaces described in subsequent portions of this document 418 The interfaces described in subsequent portions of this document
419 take a DMA direction argument, which is an integer and takes on 419 take a DMA direction argument, which is an integer and takes on
420 one of the following values: 420 one of the following values:
421 421
422 PCI_DMA_BIDIRECTIONAL 422 PCI_DMA_BIDIRECTIONAL
423 PCI_DMA_TODEVICE 423 PCI_DMA_TODEVICE
424 PCI_DMA_FROMDEVICE 424 PCI_DMA_FROMDEVICE
425 PCI_DMA_NONE 425 PCI_DMA_NONE
426 426
427 You should provide the exact DMA direction if you know it. 427 You should provide the exact DMA direction if you know it.
428 428
429 PCI_DMA_TODEVICE means "from main memory to the PCI device" 429 PCI_DMA_TODEVICE means "from main memory to the PCI device"
430 PCI_DMA_FROMDEVICE means "from the PCI device to main memory" 430 PCI_DMA_FROMDEVICE means "from the PCI device to main memory"
431 It is the direction in which the data moves during the DMA 431 It is the direction in which the data moves during the DMA
432 transfer. 432 transfer.
433 433
434 You are _strongly_ encouraged to specify this as precisely 434 You are _strongly_ encouraged to specify this as precisely
435 as you possibly can. 435 as you possibly can.
436 436
437 If you absolutely cannot know the direction of the DMA transfer, 437 If you absolutely cannot know the direction of the DMA transfer,
438 specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in 438 specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in
439 either direction. The platform guarantees that you may legally 439 either direction. The platform guarantees that you may legally
440 specify this, and that it will work, but possibly at some 440 specify this, and that it will work, but possibly at some
441 cost in performance. 441 cost in performance.
442 442
443 The value PCI_DMA_NONE is to be used for debugging. One can 443 The value PCI_DMA_NONE is to be used for debugging. One can
444 hold this in a data structure before you come to know the 444 hold this in a data structure before you come to know the
445 precise direction, and this will help catch cases where your 445 precise direction, and this will help catch cases where your
446 direction tracking logic has failed to set things up properly. 446 direction tracking logic has failed to set things up properly.
447 447
448 Another advantage of specifying this value precisely (beyond 448 Another advantage of specifying this value precisely (beyond
449 potential platform-specific optimizations) is for debugging. 449 potential platform-specific optimizations) is for debugging.
450 Some platforms actually have a write permission boolean which DMA 450 Some platforms actually have a write permission boolean which DMA
451 mappings can be marked with, much like page protections in the user 451 mappings can be marked with, much like page protections in the user
452 program address space. Such platforms can and do report errors in the 452 program address space. Such platforms can and do report errors in the
453 kernel logs when the PCI controller hardware detects violation of the 453 kernel logs when the PCI controller hardware detects violation of the
454 permission setting. 454 permission setting.
455 455
456 Only streaming mappings specify a direction; consistent mappings 456 Only streaming mappings specify a direction; consistent mappings
457 implicitly have a direction attribute setting of 457 implicitly have a direction attribute setting of
458 PCI_DMA_BIDIRECTIONAL. 458 PCI_DMA_BIDIRECTIONAL.
459 459
460 The SCSI subsystem tells you the direction to use in the 460 The SCSI subsystem tells you the direction to use in the
461 'sc_data_direction' member of the SCSI command your driver is 461 'sc_data_direction' member of the SCSI command your driver is
462 working on. 462 working on.
463 463
464 For Networking drivers, it's a rather simple affair. For transmit 464 For Networking drivers, it's a rather simple affair. For transmit
465 packets, map/unmap them with the PCI_DMA_TODEVICE direction 465 packets, map/unmap them with the PCI_DMA_TODEVICE direction
466 specifier. For receive packets, just the opposite, map/unmap them 466 specifier. For receive packets, just the opposite, map/unmap them
467 with the PCI_DMA_FROMDEVICE direction specifier. 467 with the PCI_DMA_FROMDEVICE direction specifier.
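
The direction rules above can be summarized in a few lines. This is a userland sketch with hypothetical names (enum sketch_dma_dir and the sketch_* helpers are not kernel symbols); it models which side the device is allowed to act on for each direction, and the transmit/receive choice for networking drivers.

```c
/* Hypothetical mirror of the four PCI_DMA_* direction values. */
enum sketch_dma_dir {
	SKETCH_DMA_BIDIRECTIONAL,
	SKETCH_DMA_TODEVICE,
	SKETCH_DMA_FROMDEVICE,
	SKETCH_DMA_NONE
};

/* Transmit packets go to the device, receive packets come from it. */
static enum sketch_dma_dir sketch_net_direction(int is_transmit)
{
	return is_transmit ? SKETCH_DMA_TODEVICE : SKETCH_DMA_FROMDEVICE;
}

/* The device writes main memory only when data flows from the device. */
static int sketch_device_may_write(enum sketch_dma_dir dir)
{
	return dir == SKETCH_DMA_FROMDEVICE ||
	       dir == SKETCH_DMA_BIDIRECTIONAL;
}

/* The device reads main memory only when data flows to the device. */
static int sketch_device_may_read(enum sketch_dma_dir dir)
{
	return dir == SKETCH_DMA_TODEVICE ||
	       dir == SKETCH_DMA_BIDIRECTIONAL;
}
```

A platform with a write permission bit on its mappings could use a test like sketch_device_may_write() to decide how to set it; that is an illustration of the idea, not a description of any particular port.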
468 468
469 Using Streaming DMA mappings 469 Using Streaming DMA mappings
470 470
471 The streaming DMA mapping routines can be called from interrupt 471 The streaming DMA mapping routines can be called from interrupt
472 context. There are two versions of each map/unmap, one which will 472 context. There are two versions of each map/unmap, one which will
473 map/unmap a single memory region, and one which will map/unmap a 473 map/unmap a single memory region, and one which will map/unmap a
474 scatterlist. 474 scatterlist.
475 475
476 To map a single region, you do: 476 To map a single region, you do:
477 477
478 struct pci_dev *pdev = mydev->pdev; 478 struct pci_dev *pdev = mydev->pdev;
479 dma_addr_t dma_handle; 479 dma_addr_t dma_handle;
480 void *addr = buffer->ptr; 480 void *addr = buffer->ptr;
481 size_t size = buffer->len; 481 size_t size = buffer->len;
482 482
483 dma_handle = pci_map_single(pdev, addr, size, direction); 483 dma_handle = pci_map_single(pdev, addr, size, direction);
484 484
485 and to unmap it: 485 and to unmap it:
486 486
487 pci_unmap_single(pdev, dma_handle, size, direction); 487 pci_unmap_single(pdev, dma_handle, size, direction);
488 488
489 You should call pci_unmap_single when the DMA activity is finished, e.g. 489 You should call pci_unmap_single when the DMA activity is finished, e.g.
490 from the interrupt which told you that the DMA transfer is done. 490 from the interrupt which told you that the DMA transfer is done.
491 491
492 Using cpu pointers like this for single mappings has a disadvantage: 492 Using cpu pointers like this for single mappings has a disadvantage:
493 you cannot reference HIGHMEM memory in this way. Thus, there is a 493 you cannot reference HIGHMEM memory in this way. Thus, there is a
494 map/unmap interface pair akin to pci_{map,unmap}_single. These 494 map/unmap interface pair akin to pci_{map,unmap}_single. These
495 interfaces deal with page/offset pairs instead of cpu pointers. 495 interfaces deal with page/offset pairs instead of cpu pointers.
496 Specifically: 496 Specifically:
497 497
498 struct pci_dev *pdev = mydev->pdev; 498 struct pci_dev *pdev = mydev->pdev;
499 dma_addr_t dma_handle; 499 dma_addr_t dma_handle;
500 struct page *page = buffer->page; 500 struct page *page = buffer->page;
501 unsigned long offset = buffer->offset; 501 unsigned long offset = buffer->offset;
502 size_t size = buffer->len; 502 size_t size = buffer->len;
503 503
504 dma_handle = pci_map_page(pdev, page, offset, size, direction); 504 dma_handle = pci_map_page(pdev, page, offset, size, direction);
505 505
506 ... 506 ...
507 507
508 pci_unmap_page(pdev, dma_handle, size, direction); 508 pci_unmap_page(pdev, dma_handle, size, direction);
509 509
510 Here, "offset" means byte offset within the given page. 510 Here, "offset" means byte offset within the given page.
511 511
512 With scatterlists, you map a region gathered from several regions by: 512 With scatterlists, you map a region gathered from several regions by:
513 513
514 int i, count = pci_map_sg(dev, sglist, nents, direction); 514 int i, count = pci_map_sg(dev, sglist, nents, direction);
515 struct scatterlist *sg; 515 struct scatterlist *sg;
516 516
517 for (i = 0, sg = sglist; i < count; i++, sg++) { 517 for (i = 0, sg = sglist; i < count; i++, sg++) {
518 hw_address[i] = sg_dma_address(sg); 518 hw_address[i] = sg_dma_address(sg);
519 hw_len[i] = sg_dma_len(sg); 519 hw_len[i] = sg_dma_len(sg);
520 } 520 }
521 521
522 where nents is the number of entries in the sglist. 522 where nents is the number of entries in the sglist.
523 523
524 The implementation is free to merge several consecutive sglist entries 524 The implementation is free to merge several consecutive sglist entries
525 into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any 525 into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any
526 consecutive sglist entries can be merged into one provided the first one 526 consecutive sglist entries can be merged into one provided the first one
527 ends and the second one starts on a page boundary - in fact this is a huge 527 ends and the second one starts on a page boundary - in fact this is a huge
528 advantage for cards which either cannot do scatter-gather or have a very 528 advantage for cards which either cannot do scatter-gather or have a very
529 limited number of scatter-gather entries) and returns the actual number 529 limited number of scatter-gather entries) and returns the actual number
530 of sg entries it mapped them to. On failure 0 is returned. 530 of sg entries it mapped them to. On failure 0 is returned.
531 531
532 Then you should loop count times (note: this can be less than nents times) 532 Then you should loop count times (note: this can be less than nents times)
533 and use sg_dma_address() and sg_dma_len() macros where you previously 533 and use sg_dma_address() and sg_dma_len() macros where you previously
534 accessed sg->address and sg->length as shown above. 534 accessed sg->address and sg->length as shown above.
535 535
536 To unmap a scatterlist, just call: 536 To unmap a scatterlist, just call:
537 537
538 pci_unmap_sg(dev, sglist, nents, direction); 538 pci_unmap_sg(dev, sglist, nents, direction);
539 539
540 Again, make sure DMA activity has already finished. 540 Again, make sure DMA activity has already finished.
541 541
542 PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be 542 PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be
543 the _same_ one you passed into the pci_map_sg call, 543 the _same_ one you passed into the pci_map_sg call,
544 it should _NOT_ be the 'count' value _returned_ from the 544 it should _NOT_ be the 'count' value _returned_ from the
545 pci_map_sg call. 545 pci_map_sg call.
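
Why 'count' and 'nents' differ is easier to see with a toy model. The userland sketch below is not pci_map_sg: struct sketch_sg is a hypothetical stand-in for struct scatterlist, the bus address is assumed equal to the input address, and entries are merged only when they are exactly contiguous.

```c
#include <stdint.h>

/* Hypothetical stand-in for a scatterlist entry. */
struct sketch_sg {
	uint64_t addr;		/* input: where the chunk lives */
	uint32_t len;		/* input: chunk length in bytes */
	uint64_t dma_addr;	/* output: cf. sg_dma_address() */
	uint32_t dma_len;	/* output: cf. sg_dma_len() */
};

/*
 * Fill in the dma_* fields of the first 'count' entries, merging
 * neighbours that are contiguous, and return 'count'.  The input
 * fields of all 'nents' entries are left untouched.
 */
static int sketch_map_sg(struct sketch_sg *sg, int nents)
{
	int i, count = 0;

	for (i = 0; i < nents; i++) {
		if (count > 0 &&
		    sg[count - 1].dma_addr + sg[count - 1].dma_len ==
		    sg[i].addr) {
			/* extends the previous mapped segment */
			sg[count - 1].dma_len += sg[i].len;
		} else {
			sg[count].dma_addr = sg[i].addr;
			sg[count].dma_len = sg[i].len;
			count++;
		}
	}
	return count;
}
```

With three input entries of which the first two are contiguous, this returns 2: the hardware is programmed with two segments, but the list itself still has three entries, so the unmap call must be given the original nents (3), not the returned count.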
546 546
547 Every pci_map_{single,sg} call should have its pci_unmap_{single,sg} 547 Every pci_map_{single,sg} call should have its pci_unmap_{single,sg}
548 counterpart, because the bus address space is a shared resource (although 548 counterpart, because the bus address space is a shared resource (although
549 in some ports the mapping is per BUS so fewer devices contend for the 549 in some ports the mapping is per BUS so fewer devices contend for the
550 same bus address space) and you could render the machine unusable by eating 550 same bus address space) and you could render the machine unusable by eating
551 all bus addresses. 551 all bus addresses.
552 552
553 If you need to use the same streaming DMA region multiple times and touch 553 If you need to use the same streaming DMA region multiple times and touch
554 the data in between the DMA transfers, the buffer needs to be synced 554 the data in between the DMA transfers, the buffer needs to be synced
555 properly in order for the cpu and device to see the most up-to-date and 555 properly in order for the cpu and device to see the most up-to-date and
556 correct copy of the DMA buffer. 556 correct copy of the DMA buffer.
557 557
558 So, firstly, just map it with pci_map_{single,sg}, and after each DMA 558 So, firstly, just map it with pci_map_{single,sg}, and after each DMA
559 transfer call either: 559 transfer call either:
560 560
561 pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction); 561 pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction);
562 562
563 or: 563 or:
564 564
565 pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction); 565 pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction);
566 566
567 as appropriate. 567 as appropriate.
568 568
569 Then, if you wish to let the device get at the DMA area again, 569 Then, if you wish to let the device get at the DMA area again,
570 finish accessing the data with the cpu, and then before actually 570 finish accessing the data with the cpu, and then before actually
571 giving the buffer to the hardware call either: 571 giving the buffer to the hardware call either:
572 572
573 pci_dma_sync_single_for_device(dev, dma_handle, size, direction); 573 pci_dma_sync_single_for_device(dev, dma_handle, size, direction);
574 574
575 or: 575 or:
576 576
577 pci_dma_sync_sg_for_device(dev, sglist, nents, direction); 577 pci_dma_sync_sg_for_device(dev, sglist, nents, direction);
578 578
579 as appropriate. 579 as appropriate.
580 580
581 After the last DMA transfer, call one of the DMA unmap routines 581 After the last DMA transfer, call one of the DMA unmap routines
582 pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_* 582 pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_*
583 call till pci_unmap_*, then you don't have to call the pci_dma_sync_* 583 call till pci_unmap_*, then you don't have to call the pci_dma_sync_*
584 routines at all. 584 routines at all.
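
One way to think about the sync rules is as a hand-off of buffer ownership between the cpu and the device. The userland sketch below models that hand-off as a small state machine. All names are hypothetical, and the model is deliberately stricter than the real API: it returns -1 for any call made out of order so that protocol mistakes show up.

```c
enum sketch_owner {
	SKETCH_UNMAPPED,	/* no mapping exists, cpu may touch */
	SKETCH_OWNED_BY_DEVICE,	/* mapped, device may DMA */
	SKETCH_OWNED_BY_CPU	/* mapped, synced for the cpu */
};

struct sketch_stream_buf {
	enum sketch_owner owner;
};

/* cf. pci_map_single: the buffer now belongs to the device. */
static int sketch_map(struct sketch_stream_buf *b)
{
	if (b->owner != SKETCH_UNMAPPED)
		return -1;
	b->owner = SKETCH_OWNED_BY_DEVICE;
	return 0;
}

/* cf. pci_dma_sync_single_for_cpu: hand the buffer to the cpu. */
static int sketch_sync_for_cpu(struct sketch_stream_buf *b)
{
	if (b->owner != SKETCH_OWNED_BY_DEVICE)
		return -1;
	b->owner = SKETCH_OWNED_BY_CPU;
	return 0;
}

/* cf. pci_dma_sync_single_for_device: hand it back to the device. */
static int sketch_sync_for_device(struct sketch_stream_buf *b)
{
	if (b->owner != SKETCH_OWNED_BY_CPU)
		return -1;
	b->owner = SKETCH_OWNED_BY_DEVICE;
	return 0;
}

/* cf. pci_unmap_single: legal from either mapped state. */
static int sketch_unmap(struct sketch_stream_buf *b)
{
	if (b->owner == SKETCH_UNMAPPED)
		return -1;
	b->owner = SKETCH_UNMAPPED;
	return 0;
}

/* The cpu may look at the data unless the device currently owns it. */
static int sketch_cpu_may_touch(const struct sketch_stream_buf *b)
{
	return b->owner != SKETCH_OWNED_BY_DEVICE;
}
```

A buffer that is mapped, used for one transfer and unmapped never enters the SKETCH_OWNED_BY_CPU state, which corresponds to the case above where no pci_dma_sync_* call is needed.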
585 585
586 Here is pseudo code which shows a situation in which you would need 586 Here is pseudo code which shows a situation in which you would need
587 to use the pci_dma_sync_*() interfaces. 587 to use the pci_dma_sync_*() interfaces.
588 588
589 my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len) 589 my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
590 { 590 {
591 dma_addr_t mapping; 591 dma_addr_t mapping;
592 592
593 mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE); 593 mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE);
594 594
595 cp->rx_buf = buffer; 595 cp->rx_buf = buffer;
596 cp->rx_len = len; 596 cp->rx_len = len;
597 cp->rx_dma = mapping; 597 cp->rx_dma = mapping;
598 598
599 give_rx_buf_to_card(cp); 599 give_rx_buf_to_card(cp);
600 } 600 }
601 601
602 ... 602 ...
603 603
604 my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs) 604 my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs)
605 { 605 {
606 struct my_card *cp = devid; 606 struct my_card *cp = devid;
607 607
608 ... 608 ...
609 if (read_card_status(cp) == RX_BUF_TRANSFERRED) { 609 if (read_card_status(cp) == RX_BUF_TRANSFERRED) {
610 struct my_card_header *hp; 610 struct my_card_header *hp;
611 611
612 /* Examine the header to see if we wish 612 /* Examine the header to see if we wish
613 * to accept the data. But synchronize 613 * to accept the data. But synchronize
614 * the DMA transfer with the CPU first 614 * the DMA transfer with the CPU first
615 * so that we see updated contents. 615 * so that we see updated contents.
616 */ 616 */
617 pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma, 617 pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma,
618 cp->rx_len, 618 cp->rx_len,
619 PCI_DMA_FROMDEVICE); 619 PCI_DMA_FROMDEVICE);
620 620
621 /* Now it is safe to examine the buffer. */ 621 /* Now it is safe to examine the buffer. */
622 hp = (struct my_card_header *) cp->rx_buf; 622 hp = (struct my_card_header *) cp->rx_buf;
623 if (header_is_ok(hp)) { 623 if (header_is_ok(hp)) {
624 pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len, 624 pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len,
625 PCI_DMA_FROMDEVICE); 625 PCI_DMA_FROMDEVICE);
626 pass_to_upper_layers(cp->rx_buf); 626 pass_to_upper_layers(cp->rx_buf);
627 make_and_setup_new_rx_buf(cp); 627 make_and_setup_new_rx_buf(cp);
628 } else { 628 } else {
629 /* Just sync the buffer and give it back 629 /* Just sync the buffer and give it back
630 * to the card. 630 * to the card.
631 */ 631 */
632 pci_dma_sync_single_for_device(cp->pdev, 632 pci_dma_sync_single_for_device(cp->pdev,
633 cp->rx_dma, 633 cp->rx_dma,
634 cp->rx_len, 634 cp->rx_len,
635 PCI_DMA_FROMDEVICE); 635 PCI_DMA_FROMDEVICE);
636 give_rx_buf_to_card(cp); 636 give_rx_buf_to_card(cp);
637 } 637 }
638 } 638 }
639 } 639 }
640 640
641 Drivers converted fully to this interface should not use virt_to_bus any 641 Drivers converted fully to this interface should not use virt_to_bus any
642 longer, nor should they use bus_to_virt. Some drivers have to be changed a 642 longer, nor should they use bus_to_virt. Some drivers have to be changed a
643 little bit, because there is no longer an equivalent to bus_to_virt in the 643 little bit, because there is no longer an equivalent to bus_to_virt in the
644 dynamic DMA mapping scheme - you have to always store the DMA addresses 644 dynamic DMA mapping scheme - you have to always store the DMA addresses
645 returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single 645 returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single
646 calls (pci_map_sg stores them in the scatterlist itself if the platform 646 calls (pci_map_sg stores them in the scatterlist itself if the platform
647 supports dynamic DMA mapping in hardware) in your driver structures and/or 647 supports dynamic DMA mapping in hardware) in your driver structures and/or
648 in the card registers. 648 in the card registers.
649 649
650 All PCI drivers should be using these interfaces with no exceptions. 650 All PCI drivers should be using these interfaces with no exceptions.
651 It is planned to completely remove virt_to_bus() and bus_to_virt() as 651 It is planned to completely remove virt_to_bus() and bus_to_virt() as
652 they are entirely deprecated. Some ports already do not provide these 652 they are entirely deprecated. Some ports already do not provide these
653 as it is impossible to correctly support them. 653 as it is impossible to correctly support them.
654 654
655 64-bit DMA and DAC cycle support 655 64-bit DMA and DAC cycle support
656 656
657 Do you understand all of the text above? Great, then you already 657 Do you understand all of the text above? Great, then you already
658 know how to use 64-bit DMA addressing under Linux. Simply make 658 know how to use 64-bit DMA addressing under Linux. Simply make
659 the appropriate pci_set_dma_mask() calls based upon your card's 659 the appropriate pci_set_dma_mask() calls based upon your card's
660 capabilities, then use the mapping APIs above. 660 capabilities, then use the mapping APIs above.
661 661
662 It is that simple. 662 It is that simple.
663 663
664 Well, not for some odd devices. See the next section for information 664 Well, not for some odd devices. See the next section for information
665 about that. 665 about that.
666 666
667 DAC Addressing for Address Space Hungry Devices 667 DAC Addressing for Address Space Hungry Devices
668 668
669 There exists a class of devices which do not mesh well with the PCI 669 There exists a class of devices which do not mesh well with the PCI
670 DMA mapping API. By definition these "mappings" are a finite 670 DMA mapping API. By definition these "mappings" are a finite
671 resource. The number of total available mappings per bus is platform 671 resource. The number of total available mappings per bus is platform
672 specific, but there will always be a reasonable amount. 672 specific, but there will always be a reasonable amount.
673 673
674 What is "reasonable"? Reasonable means that networking and block I/O 674 What is "reasonable"? Reasonable means that networking and block I/O
675 devices need not worry about using too many mappings. 675 devices need not worry about using too many mappings.
676 676
677 As an example of a problematic device, consider compute cluster cards. 677 As an example of a problematic device, consider compute cluster cards.
678 They can potentially need to access gigabytes of memory at once via 678 They can potentially need to access gigabytes of memory at once via
679 DMA. Dynamic mappings are unsuitable for this kind of access pattern. 679 DMA. Dynamic mappings are unsuitable for this kind of access pattern.
680 680
681 To this end we've provided a small API by which a device driver 681 To this end we've provided a small API by which a device driver
682 may use DAC cycles to directly address all of physical memory. 682 may use DAC cycles to directly address all of physical memory.
683 Not all platforms support this, but most do. It is easy to determine 683 Not all platforms support this, but most do. It is easy to determine
684 whether the platform will work properly at probe time. 684 whether the platform will work properly at probe time.
685 685
686 First, understand that there may be a SEVERE performance penalty for 686 First, understand that there may be a SEVERE performance penalty for
687 using these interfaces on some platforms. Therefore, you MUST only 687 using these interfaces on some platforms. Therefore, you MUST only
688 use these interfaces if it is absolutely required. 99% of devices can 688 use these interfaces if it is absolutely required. 99% of devices can
689 use the normal APIs without any problems. 689 use the normal APIs without any problems.
690 690
691 Note that for streaming type mappings you must either use these 691 Note that for streaming type mappings you must either use these
692 interfaces, or the dynamic mapping interfaces above. You may not mix 692 interfaces, or the dynamic mapping interfaces above. You may not mix
693 usage of both for the same device. Such an act is illegal and is 693 usage of both for the same device. Such an act is illegal and is
694 guaranteed to put a banana in your tailpipe. 694 guaranteed to put a banana in your tailpipe.
695 695
696 However, consistent mappings may in fact be used in conjunction with 696 However, consistent mappings may in fact be used in conjunction with
697 these interfaces. Remember that, as defined, consistent mappings are 697 these interfaces. Remember that, as defined, consistent mappings are
698 always going to be SAC addressable. 698 always going to be SAC addressable.
699 699
700 The first thing your driver needs to do is query the PCI platform 700 The first thing your driver needs to do is query the PCI platform
701 layer if it is capable of handling your device's DAC addressing 701 layer if it is capable of handling your device's DAC addressing
702 capabilities: 702 capabilities:
703 703
704 int pci_dac_dma_supported(struct pci_dev *hwdev, u64 mask); 704 int pci_dac_dma_supported(struct pci_dev *hwdev, u64 mask);
705 705
706 You may not use the following interfaces if this routine fails. 706 You may not use the following interfaces if this routine fails.
707 707
708 Next, DMA addresses using this API are kept track of using the 708 Next, DMA addresses using this API are kept track of using the
709 dma64_addr_t type. It is guaranteed to be big enough to hold any 709 dma64_addr_t type. It is guaranteed to be big enough to hold any
710 DAC address the platform layer will give to you from the following 710 DAC address the platform layer will give to you from the following
711 routines. If you have consistent mappings as well, you still 711 routines. If you have consistent mappings as well, you still
712 use plain dma_addr_t to keep track of those. 712 use plain dma_addr_t to keep track of those.
713 713
714 All mappings obtained here will be direct. The mappings are not 714 All mappings obtained here will be direct. The mappings are not
715 translated, and this is the purpose of this dialect of the DMA API. 715 translated, and this is the purpose of this dialect of the DMA API.
716 716
717 All routines work with page/offset pairs. This is the _ONLY_ way to 717 All routines work with page/offset pairs. This is the _ONLY_ way to
718 portably refer to any piece of memory. If you have a cpu pointer 718 portably refer to any piece of memory. If you have a cpu pointer
719 (which may be validly DMA'd too) you may easily obtain the page 719 (which may be validly DMA'd too) you may easily obtain the page
720 and offset using something like this: 720 and offset using something like this:
721 721
722 struct page *page = virt_to_page(ptr); 722 struct page *page = virt_to_page(ptr);
723 unsigned long offset = offset_in_page(ptr); 723 unsigned long offset = offset_in_page(ptr);
724 724
725 Here are the interfaces: 725 Here are the interfaces:
726 726
727 dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev, 727 dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev,
728 struct page *page, 728 struct page *page,
729 unsigned long offset, 729 unsigned long offset,
730 int direction); 730 int direction);
731 731
732 The DAC address for the tuple PAGE/OFFSET is returned. The direction 732 The DAC address for the tuple PAGE/OFFSET is returned. The direction
733 argument is the same as for pci_{map,unmap}_single(). The same rules 733 argument is the same as for pci_{map,unmap}_single(). The same rules
734 for cpu/device access apply here as for the streaming mapping 734 for cpu/device access apply here as for the streaming mapping
735 interfaces. To reiterate: 735 interfaces. To reiterate:
736 736
737 The cpu may touch the buffer before pci_dac_page_to_dma. 737 The cpu may touch the buffer before pci_dac_page_to_dma.
738 The device may touch the buffer after pci_dac_page_to_dma 738 The device may touch the buffer after pci_dac_page_to_dma
739 is made, but the cpu may NOT. 739 is made, but the cpu may NOT.
740 740
741 When the DMA transfer is complete, invoke: 741 When the DMA transfer is complete, invoke:
742 742
743 void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev, 743 void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev,
744 dma64_addr_t dma_addr, 744 dma64_addr_t dma_addr,
745 size_t len, int direction); 745 size_t len, int direction);
746 746
747 This must be done before the CPU looks at the buffer again. 747 This must be done before the CPU looks at the buffer again.
748 This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu(). 748 This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu().
749 749
750 And likewise, if you wish to let the device get back at the buffer after 750 And likewise, if you wish to let the device get back at the buffer after
751 the cpu has read/written it, invoke: 751 the cpu has read/written it, invoke:
752 752
753 void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev, 753 void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev,
754 dma64_addr_t dma_addr, 754 dma64_addr_t dma_addr,
755 size_t len, int direction); 755 size_t len, int direction);
756 756
757 before letting the device access the DMA area again. 757 before letting the device access the DMA area again.
758 758
759 If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t 759 If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t
760 the following interfaces are provided: 760 the following interfaces are provided:
761 761
762 struct page *pci_dac_dma_to_page(struct pci_dev *pdev, 762 struct page *pci_dac_dma_to_page(struct pci_dev *pdev,
763 dma64_addr_t dma_addr); 763 dma64_addr_t dma_addr);
764 unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev, 764 unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev,
765 dma64_addr_t dma_addr); 765 dma64_addr_t dma_addr);
766 766
767 This is possible with the DAC interfaces purely because they are 767 This is possible with the DAC interfaces purely because they are
768 not translated in any way. 768 not translated in any way.
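
Because a direct mapping is plain arithmetic, the round trip is easy to model. The userland sketch below assumes 4KB pages, identifies a page by its frame number instead of a struct page, and assumes the DAC address is exactly the physical address. A real platform could well add a fixed offset; all sketch_* names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4KB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* cf. pci_dac_page_to_dma: page frame number plus byte offset. */
static uint64_t sketch_page_to_dma(uint64_t pfn, unsigned long offset)
{
	return (pfn << SKETCH_PAGE_SHIFT) + offset;
}

/* cf. pci_dac_dma_to_page: recover the page frame number. */
static uint64_t sketch_dma_to_pfn(uint64_t dma_addr)
{
	return dma_addr >> SKETCH_PAGE_SHIFT;
}

/* cf. pci_dac_dma_to_offset: recover the byte offset within the page. */
static unsigned long sketch_dma_to_offset(uint64_t dma_addr)
{
	return (unsigned long)(dma_addr & (SKETCH_PAGE_SIZE - 1));
}
```

With a translated (dynamic) mapping no such inverse exists without a lookup table, which is why the normal mapping API has no equivalent of bus_to_virt.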
769 769
770 Optimizing Unmap State Space Consumption 770 Optimizing Unmap State Space Consumption
771 771
772 On many platforms, pci_unmap_{single,page}() is simply a nop. 772 On many platforms, pci_unmap_{single,page}() is simply a nop.
773 Therefore, keeping track of the mapping address and length is a waste 773 Therefore, keeping track of the mapping address and length is a waste
774 of space. Instead of filling your drivers up with ifdefs and the like 774 of space. Instead of filling your drivers up with ifdefs and the like
775 to "work around" this (which would defeat the whole purpose of a 775 to "work around" this (which would defeat the whole purpose of a
776 portable API) the following facilities are provided. 776 portable API) the following facilities are provided.
777 777
778 Actually, instead of describing the macros one by one, we'll 778 Actually, instead of describing the macros one by one, we'll
779 transform some example code. 779 transform some example code.
780 780
781 1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures. 781 1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures.
782 Example, before: 782 Example, before:
783 783
784 struct ring_state { 784 struct ring_state {
785 struct sk_buff *skb; 785 struct sk_buff *skb;
786 dma_addr_t mapping; 786 dma_addr_t mapping;
787 __u32 len; 787 __u32 len;
788 }; 788 };
789 789
790 after: 790 after:
791 791
792 struct ring_state { 792 struct ring_state {
793 struct sk_buff *skb; 793 struct sk_buff *skb;
794 DECLARE_PCI_UNMAP_ADDR(mapping) 794 DECLARE_PCI_UNMAP_ADDR(mapping)
795 DECLARE_PCI_UNMAP_LEN(len) 795 DECLARE_PCI_UNMAP_LEN(len)
796 }; 796 };
797 797
798 NOTE: DO NOT put a semicolon at the end of the DECLARE_*() 798 NOTE: DO NOT put a semicolon at the end of the DECLARE_*()
799 macro. 799 macro.
800 800
801 2) Use pci_unmap_{addr,len}_set to set these values. 801 2) Use pci_unmap_{addr,len}_set to set these values.
802 Example, before: 802 Example, before:
803 803
804 ringp->mapping = FOO; 804 ringp->mapping = FOO;
805 ringp->len = BAR; 805 ringp->len = BAR;
806 806
807 after: 807 after:
808 808
809 pci_unmap_addr_set(ringp, mapping, FOO); 809 pci_unmap_addr_set(ringp, mapping, FOO);
810 pci_unmap_len_set(ringp, len, BAR); 810 pci_unmap_len_set(ringp, len, BAR);
811 811
812 3) Use pci_unmap_{addr,len} to access these values. 812 3) Use pci_unmap_{addr,len} to access these values.
813 Example, before: 813 Example, before:
814 814
815 pci_unmap_single(pdev, ringp->mapping, ringp->len, 815 pci_unmap_single(pdev, ringp->mapping, ringp->len,
816 PCI_DMA_FROMDEVICE); 816 PCI_DMA_FROMDEVICE);
817 817
818 after: 818 after:
819 819
820 pci_unmap_single(pdev, 820 pci_unmap_single(pdev,
821 pci_unmap_addr(ringp, mapping), 821 pci_unmap_addr(ringp, mapping),
822 pci_unmap_len(ringp, len), 822 pci_unmap_len(ringp, len),
823 PCI_DMA_FROMDEVICE); 823 PCI_DMA_FROMDEVICE);
824 824
825 It really should be self-explanatory. We treat the ADDR and LEN 825 It really should be self-explanatory. We treat the ADDR and LEN
826 separately, because it is possible for an implementation to only 826 separately, because it is possible for an implementation to only
827 need the address in order to perform the unmap operation. 827 need the address in order to perform the unmap operation.
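
To make the transformation above more concrete, here is a userspace mock of how a platform might define these macros. It is a sketch under stated assumptions, not any architecture's actual header: the SKETCH_UNMAP_IS_NOP switch, the dma_addr_t typedef, and the ring_state_record() helper are hypothetical and exist only for this example.

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;	/* stand-in for the kernel type */
struct sk_buff;			/* opaque in this sketch */

#ifdef SKETCH_UNMAP_IS_NOP
/* Platform where unmap needs no state: the fields vanish entirely. */
#define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME)
#define DECLARE_PCI_UNMAP_LEN(LEN_NAME)
#define pci_unmap_addr(PTR, ADDR_NAME)			(0)
#define pci_unmap_addr_set(PTR, ADDR_NAME, VAL)		do { } while (0)
#define pci_unmap_len(PTR, LEN_NAME)			(0)
#define pci_unmap_len_set(PTR, LEN_NAME, VAL)		do { } while (0)
#else
/* Platform where unmap needs both the address and the length. */
#define DECLARE_PCI_UNMAP_ADDR(ADDR_NAME)		dma_addr_t ADDR_NAME;
#define DECLARE_PCI_UNMAP_LEN(LEN_NAME)			uint32_t LEN_NAME;
#define pci_unmap_addr(PTR, ADDR_NAME)			((PTR)->ADDR_NAME)
#define pci_unmap_addr_set(PTR, ADDR_NAME, VAL)		(((PTR)->ADDR_NAME) = (VAL))
#define pci_unmap_len(PTR, LEN_NAME)			((PTR)->LEN_NAME)
#define pci_unmap_len_set(PTR, LEN_NAME, VAL)		(((PTR)->LEN_NAME) = (VAL))
#endif

struct ring_state {
	struct sk_buff *skb;
	DECLARE_PCI_UNMAP_ADDR(mapping)
	DECLARE_PCI_UNMAP_LEN(len)
};

/* Hypothetical helper: remember what a later unmap call will need. */
static inline void ring_state_record(struct ring_state *ringp,
				     dma_addr_t addr, uint32_t nbytes)
{
	pci_unmap_addr_set(ringp, mapping, addr);
	pci_unmap_len_set(ringp, len, nbytes);
}
```

In this mock the semicolon lives inside the DECLARE_*() expansion, which suggests why the caller must not add one: in the nop variant a stray semicolon would be left behind in the structure body.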
828 828
829 Platform Issues 829 Platform Issues
830 830
831 If you are just writing drivers for Linux and do not maintain 831 If you are just writing drivers for Linux and do not maintain
832 an architecture port for the kernel, you can safely skip down 832 an architecture port for the kernel, you can safely skip down
833 to "Closing". 833 to "Closing".
834 834
835 1) Struct scatterlist requirements. 835 1) Struct scatterlist requirements.
836 836
837 Struct scatterlist must contain, at a minimum, the following 837 Struct scatterlist must contain, at a minimum, the following
838 members: 838 members:
839 839
840 struct page *page; 840 struct page *page;
841 unsigned int offset; 841 unsigned int offset;
842 unsigned int length; 842 unsigned int length;
843 843
844 The base address is specified by a "page+offset" pair. 844 The base address is specified by a "page+offset" pair.
845 845
846 Previous versions of struct scatterlist contained a "void *address" 846 Previous versions of struct scatterlist contained a "void *address"
847 field that was sometimes used instead of page+offset. As of Linux 847 field that was sometimes used instead of page+offset. As of Linux
848 2.5, page+offset is always used, and the "address" field has been 848 2.5, page+offset is always used, and the "address" field has been
849 deleted. 849 deleted.
850 850
851 2) More to come... 851 2) More to come...
852 852
853 Handling Errors 853 Handling Errors
854 854
855 DMA address space is limited on some architectures and an allocation 855 DMA address space is limited on some architectures and an allocation
856 failure can be determined by: 856 failure can be determined by:
857 857
858 - checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0 858 - checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0
859 859
860 - checking the returned dma_addr_t of pci_map_single and pci_map_page 860 - checking the returned dma_addr_t of pci_map_single and pci_map_page
861 by using pci_dma_mapping_error(): 861 by using pci_dma_mapping_error():
862 862
863 dma_addr_t dma_handle; 863 dma_addr_t dma_handle;
864 864
865 dma_handle = pci_map_single(dev, addr, size, direction); 865 dma_handle = pci_map_single(dev, addr, size, direction);
866 if (pci_dma_mapping_error(dma_handle)) { 866 if (pci_dma_mapping_error(dma_handle)) {
867 /* 867 /*
868 * reduce current DMA mapping usage, 868 * reduce current DMA mapping usage,
869 * delay and try again later or 869 * delay and try again later or
870 * reset driver. 870 * reset driver.
871 */ 871 */
872 } 872 }
873 873
874 Closing 874 Closing
875 875
876 This document, and the API itself, would not be in its current 876 This document, and the API itself, would not be in its current
877 form without the feedback and suggestions from numerous individuals. 877 form without the feedback and suggestions from numerous individuals.
878 We would like to specifically mention, in no particular order, the 878 We would like to specifically mention, in no particular order, the
879 following people: 879 following people:
880 880
881 Russell King <rmk@arm.linux.org.uk> 881 Russell King <rmk@arm.linux.org.uk>
882 Leo Dagum <dagum@barrel.engr.sgi.com> 882 Leo Dagum <dagum@barrel.engr.sgi.com>
883 Ralf Baechle <ralf@oss.sgi.com> 883 Ralf Baechle <ralf@oss.sgi.com>
884 Grant Grundler <grundler@cup.hp.com> 884 Grant Grundler <grundler@cup.hp.com>
885 Jay Estabrook <Jay.Estabrook@compaq.com> 885 Jay Estabrook <Jay.Estabrook@compaq.com>
886 Thomas Sailer <sailer@ife.ee.ethz.ch> 886 Thomas Sailer <sailer@ife.ee.ethz.ch>
887 Andrea Arcangeli <andrea@suse.de> 887 Andrea Arcangeli <andrea@suse.de>
888 Jens Axboe <axboe@suse.de> 888 Jens Axboe <axboe@suse.de>
889 David Mosberger-Tang <davidm@hpl.hp.com> 889 David Mosberger-Tang <davidm@hpl.hp.com>
890 890
Documentation/DocBook/libata.tmpl
1 <?xml version="1.0" encoding="UTF-8"?> 1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> 3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
4 4
5 <book id="libataDevGuide"> 5 <book id="libataDevGuide">
6 <bookinfo> 6 <bookinfo>
7 <title>libATA Developer's Guide</title> 7 <title>libATA Developer's Guide</title>
8 8
9 <authorgroup> 9 <authorgroup>
10 <author> 10 <author>
11 <firstname>Jeff</firstname> 11 <firstname>Jeff</firstname>
12 <surname>Garzik</surname> 12 <surname>Garzik</surname>
13 </author> 13 </author>
14 </authorgroup> 14 </authorgroup>
15 15
16 <copyright> 16 <copyright>
17 <year>2003-2005</year> 17 <year>2003-2005</year>
18 <holder>Jeff Garzik</holder> 18 <holder>Jeff Garzik</holder>
19 </copyright> 19 </copyright>
20 20
21 <legalnotice> 21 <legalnotice>
22 <para> 22 <para>
23 The contents of this file are subject to the Open 23 The contents of this file are subject to the Open
24 Software License version 1.1 that can be found at 24 Software License version 1.1 that can be found at
25 <ulink url="http://www.opensource.org/licenses/osl-1.1.txt">http://www.opensource.org/licenses/osl-1.1.txt</ulink> and is included herein 25 <ulink url="http://www.opensource.org/licenses/osl-1.1.txt">http://www.opensource.org/licenses/osl-1.1.txt</ulink> and is included herein
26 by reference. 26 by reference.
27 </para> 27 </para>
28 28
29 <para> 29 <para>
30 Alternatively, the contents of this file may be used under the terms 30 Alternatively, the contents of this file may be used under the terms
31 of the GNU General Public License version 2 (the "GPL") as distributed 31 of the GNU General Public License version 2 (the "GPL") as distributed
32 in the kernel source COPYING file, in which case the provisions of 32 in the kernel source COPYING file, in which case the provisions of
33 the GPL are applicable instead of the above. If you wish to allow 33 the GPL are applicable instead of the above. If you wish to allow
34 the use of your version of this file only under the terms of the 34 the use of your version of this file only under the terms of the
35 GPL and not to allow others to use your version of this file under 35 GPL and not to allow others to use your version of this file under
36 the OSL, indicate your decision by deleting the provisions above and 36 the OSL, indicate your decision by deleting the provisions above and
37 replace them with the notice and other provisions required by the GPL. 37 replace them with the notice and other provisions required by the GPL.
38 If you do not delete the provisions above, a recipient may use your 38 If you do not delete the provisions above, a recipient may use your
39 version of this file under either the OSL or the GPL. 39 version of this file under either the OSL or the GPL.
40 </para> 40 </para>
41 41
42 </legalnotice> 42 </legalnotice>
43 </bookinfo> 43 </bookinfo>
44 44
45 <toc></toc> 45 <toc></toc>
46 46
47 <chapter id="libataIntroduction"> 47 <chapter id="libataIntroduction">
48 <title>Introduction</title> 48 <title>Introduction</title>
49 <para> 49 <para>
50 libATA is a library used inside the Linux kernel to support ATA host 50 libATA is a library used inside the Linux kernel to support ATA host
51 controllers and devices. libATA provides an ATA driver API, class 51 controllers and devices. libATA provides an ATA driver API, class
52 transports for ATA and ATAPI devices, and SCSI&lt;-&gt;ATA translation 52 transports for ATA and ATAPI devices, and SCSI&lt;-&gt;ATA translation
53 for ATA devices according to the T10 SAT specification. 53 for ATA devices according to the T10 SAT specification.
54 </para> 54 </para>
55 <para> 55 <para>
56 This Guide documents the libATA driver API, library functions, library 56 This Guide documents the libATA driver API, library functions, library
57 internals, and a couple of sample ATA low-level drivers. 57 internals, and a couple of sample ATA low-level drivers.
58 </para> 58 </para>
59 </chapter> 59 </chapter>
60 60
61 <chapter id="libataDriverApi"> 61 <chapter id="libataDriverApi">
62 <title>libata Driver API</title> 62 <title>libata Driver API</title>
63 <para> 63 <para>
64 struct ata_port_operations is defined for every low-level libata 64 struct ata_port_operations is defined for every low-level libata
65 hardware driver, and it controls how the low-level driver 65 hardware driver, and it controls how the low-level driver
66 interfaces with the ATA and SCSI layers. 66 interfaces with the ATA and SCSI layers.
67 </para> 67 </para>
68 <para> 68 <para>
69 FIS-based drivers will hook into the system with ->qc_prep() and 69 FIS-based drivers will hook into the system with ->qc_prep() and
70 ->qc_issue() high-level hooks. Hardware which behaves in a manner 70 ->qc_issue() high-level hooks. Hardware which behaves in a manner
71 similar to PCI IDE hardware may utilize several generic helpers, 71 similar to PCI IDE hardware may utilize several generic helpers,
72 defining at a bare minimum the bus I/O addresses of the ATA shadow 72 defining at a bare minimum the bus I/O addresses of the ATA shadow
73 register blocks. 73 register blocks.
74 </para> 74 </para>
75 <sect1> 75 <sect1>
76 <title>struct ata_port_operations</title> 76 <title>struct ata_port_operations</title>
77 77
78 <sect2><title>Disable ATA port</title> 78 <sect2><title>Disable ATA port</title>
79 <programlisting> 79 <programlisting>
80 void (*port_disable) (struct ata_port *); 80 void (*port_disable) (struct ata_port *);
81 </programlisting> 81 </programlisting>
82 82
83 <para> 83 <para>
84 Called from ata_bus_probe() and ata_bus_reset() error paths, 84 Called from ata_bus_probe() and ata_bus_reset() error paths,
85 as well as when unregistering from the SCSI module (rmmod, hot 85 as well as when unregistering from the SCSI module (rmmod, hot
86 unplug). 86 unplug).
87 This function should do whatever needs to be done to take the 87 This function should do whatever needs to be done to take the
88 port out of use. In most cases, ata_port_disable() can be used 88 port out of use. In most cases, ata_port_disable() can be used
89 as this hook. 89 as this hook.
90 </para> 90 </para>
91 <para> 91 <para>
92 Called from ata_bus_probe() on a failed probe. 92 Called from ata_bus_probe() on a failed probe.
93 Called from ata_bus_reset() on a failed bus reset. 93 Called from ata_bus_reset() on a failed bus reset.
94 Called from ata_scsi_release(). 94 Called from ata_scsi_release().
95 </para> 95 </para>
96 96
97 </sect2> 97 </sect2>
98 98
99 <sect2><title>Post-IDENTIFY device configuration</title> 99 <sect2><title>Post-IDENTIFY device configuration</title>
100 <programlisting> 100 <programlisting>
101 void (*dev_config) (struct ata_port *, struct ata_device *); 101 void (*dev_config) (struct ata_port *, struct ata_device *);
102 </programlisting> 102 </programlisting>
103 103
104 <para> 104 <para>
105 Called after IDENTIFY [PACKET] DEVICE is issued to each device 105 Called after IDENTIFY [PACKET] DEVICE is issued to each device
106 found. Typically used to apply device-specific fixups prior to 106 found. Typically used to apply device-specific fixups prior to
107 issue of SET FEATURES - XFER MODE, and prior to operation. 107 issue of SET FEATURES - XFER MODE, and prior to operation.
108 </para> 108 </para>
109 <para> 109 <para>
110 Called by ata_device_add() after ata_dev_identify() determines 110 Called by ata_device_add() after ata_dev_identify() determines
111 a device is present. 111 a device is present.
112 </para> 112 </para>
113 <para> 113 <para>
114 This entry may be specified as NULL in ata_port_operations. 114 This entry may be specified as NULL in ata_port_operations.
115 </para> 115 </para>
116 116
117 </sect2> 117 </sect2>
118 118
119 <sect2><title>Set PIO/DMA mode</title> 119 <sect2><title>Set PIO/DMA mode</title>
120 <programlisting> 120 <programlisting>
121 void (*set_piomode) (struct ata_port *, struct ata_device *); 121 void (*set_piomode) (struct ata_port *, struct ata_device *);
122 void (*set_dmamode) (struct ata_port *, struct ata_device *); 122 void (*set_dmamode) (struct ata_port *, struct ata_device *);
123 void (*post_set_mode) (struct ata_port *); 123 void (*post_set_mode) (struct ata_port *);
124 unsigned int (*mode_filter) (struct ata_port *, struct ata_device *, unsigned int); 124 unsigned int (*mode_filter) (struct ata_port *, struct ata_device *, unsigned int);
125 </programlisting> 125 </programlisting>
126 126
127 <para> 127 <para>
128 Hooks called prior to the issue of SET FEATURES - XFER MODE 128 Hooks called prior to the issue of SET FEATURES - XFER MODE
129 command. The optional ->mode_filter() hook is called when libata 129 command. The optional ->mode_filter() hook is called when libata
130 has built a mask of the possible modes. This is passed to the 130 has built a mask of the possible modes. This is passed to the
131 ->mode_filter() function which should return a mask of valid modes 131 ->mode_filter() function which should return a mask of valid modes
132 after filtering those unsuitable due to hardware limits. It is not 132 after filtering those unsuitable due to hardware limits. It is not
133 valid to use this interface to add modes. 133 valid to use this interface to add modes.
134 </para> 134 </para>
135 <para> 135 <para>
136 dev->pio_mode and dev->dma_mode are guaranteed to be valid when 136 dev->pio_mode and dev->dma_mode are guaranteed to be valid when
137 ->set_piomode() and when ->set_dmamode() is called. The timings for 137 ->set_piomode() and when ->set_dmamode() is called. The timings for
138 any other drive sharing the cable will also be valid at this point. 138 any other drive sharing the cable will also be valid at this point.
139 That is, the library records the decisions for the modes of each 139 That is, the library records the decisions for the modes of each
140 drive on a channel before it attempts to set any of them. 140 drive on a channel before it attempts to set any of them.
141 </para> 141 </para>
142 <para> 142 <para>
143 ->post_set_mode() is 143 ->post_set_mode() is
144 called unconditionally, after the SET FEATURES - XFER MODE 144 called unconditionally, after the SET FEATURES - XFER MODE
145 command completes successfully. 145 command completes successfully.
146 </para> 146 </para>
147 147
148 <para> 148 <para>
149 ->set_piomode() is always called (if present), but 149 ->set_piomode() is always called (if present), but
150 ->set_dmamode() is only called if DMA is possible. 150 ->set_dmamode() is only called if DMA is possible.
151 </para> 151 </para>
152 152
153 </sect2> 153 </sect2>
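
A minimal sketch of a ->mode_filter() style hook follows, assuming a controller that cannot go faster than UDMA2. The signature matches the listing above, but the bit layout of the mask (UDMA modes starting at bit 8), the SKETCH_ constants, and the function name are assumptions made for illustration only.

```c
struct ata_port;	/* opaque stand-ins for the libata types */
struct ata_device;

/* Assumption: UDMA mode n is represented by bit (8 + n) of the mask. */
#define SKETCH_SHIFT_UDMA	8
#define SKETCH_UDMA_ABOVE_2	(0xf8u << SKETCH_SHIFT_UDMA)

/* Remove modes the hardware cannot do; never add any. */
unsigned int sketch_mode_filter(struct ata_port *ap,
				struct ata_device *adev,
				unsigned int mask)
{
	(void)ap;	/* unused in this sketch */
	(void)adev;

	/* Only bits already present in mask can survive the AND. */
	return mask & ~SKETCH_UDMA_ABOVE_2;
}
```

Because the result is computed with a bitwise AND against the incoming mask, the filter can only clear bits, which is the property the text requires of this interface.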
154 154
155 <sect2><title>Taskfile read/write</title> 155 <sect2><title>Taskfile read/write</title>
156 <programlisting> 156 <programlisting>
157 void (*tf_load) (struct ata_port *ap, struct ata_taskfile *tf); 157 void (*tf_load) (struct ata_port *ap, struct ata_taskfile *tf);
158 void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf); 158 void (*tf_read) (struct ata_port *ap, struct ata_taskfile *tf);
159 </programlisting> 159 </programlisting>
160 160
161 <para> 161 <para>
162 ->tf_load() is called to load the given taskfile into hardware 162 ->tf_load() is called to load the given taskfile into hardware
163 registers / DMA buffers. ->tf_read() is called to read the 163 registers / DMA buffers. ->tf_read() is called to read the
164 hardware registers / DMA buffers, to obtain the current set of 164 hardware registers / DMA buffers, to obtain the current set of
165 taskfile register values. 165 taskfile register values.
166 Most drivers for taskfile-based hardware (PIO or MMIO) use 166 Most drivers for taskfile-based hardware (PIO or MMIO) use
167 ata_tf_load() and ata_tf_read() for these hooks. 167 ata_tf_load() and ata_tf_read() for these hooks.
168 </para> 168 </para>
169 169
170 </sect2> 170 </sect2>
171 171
172 <sect2><title>PIO data read/write</title> 172 <sect2><title>PIO data read/write</title>
173 <programlisting> 173 <programlisting>
174 void (*data_xfer) (struct ata_device *, unsigned char *, unsigned int, int); 174 void (*data_xfer) (struct ata_device *, unsigned char *, unsigned int, int);
175 </programlisting> 175 </programlisting>
176 176
177 <para> 177 <para>
178 All bmdma-style drivers must implement this hook. This is the low-level 178 All bmdma-style drivers must implement this hook. This is the low-level
179 operation that actually copies the data bytes during a PIO data 179 operation that actually copies the data bytes during a PIO data
180 transfer. 180 transfer.
181 Typically the driver 181 Typically the driver
182 will choose one of ata_pio_data_xfer_noirq(), ata_pio_data_xfer(), or 182 will choose one of ata_pio_data_xfer_noirq(), ata_pio_data_xfer(), or
183 ata_mmio_data_xfer(). 183 ata_mmio_data_xfer().
184 </para> 184 </para>
185 185
186 </sect2> 186 </sect2>
187 187
188 <sect2><title>ATA command execute</title> 188 <sect2><title>ATA command execute</title>
189 <programlisting> 189 <programlisting>
190 void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf); 190 void (*exec_command)(struct ata_port *ap, struct ata_taskfile *tf);
191 </programlisting> 191 </programlisting>
192 192
193 <para> 193 <para>
194 Causes an ATA command, previously loaded with 194 Causes an ATA command, previously loaded with
195 ->tf_load(), to be initiated in hardware. 195 ->tf_load(), to be initiated in hardware.
196 Most drivers for taskfile-based hardware use ata_exec_command() 196 Most drivers for taskfile-based hardware use ata_exec_command()
197 for this hook. 197 for this hook.
198 </para> 198 </para>
199 199
200 </sect2> 200 </sect2>
201 201
202 <sect2><title>Per-cmd ATAPI DMA capabilities filter</title> 202 <sect2><title>Per-cmd ATAPI DMA capabilities filter</title>
203 <programlisting> 203 <programlisting>
204 int (*check_atapi_dma) (struct ata_queued_cmd *qc); 204 int (*check_atapi_dma) (struct ata_queued_cmd *qc);
205 </programlisting> 205 </programlisting>
206 206
207 <para> 207 <para>
208 Allow low-level driver to filter ATA PACKET commands, returning a status 208 Allow low-level driver to filter ATA PACKET commands, returning a status
209 indicating whether or not it is OK to use DMA for the supplied PACKET 209 indicating whether or not it is OK to use DMA for the supplied PACKET
210 command. 210 command.
211 </para> 211 </para>
212 <para> 212 <para>
213 This hook may be specified as NULL, in which case libata will 213 This hook may be specified as NULL, in which case libata will
214 assume that ATAPI DMA can be supported. 214 assume that ATAPI DMA can be supported.
215 </para> 215 </para>
216 216
217 </sect2> 217 </sect2>
218 218
219 <sect2><title>Read specific ATA shadow registers</title> 219 <sect2><title>Read specific ATA shadow registers</title>
220 <programlisting> 220 <programlisting>
221 u8 (*check_status)(struct ata_port *ap); 221 u8 (*check_status)(struct ata_port *ap);
222 u8 (*check_altstatus)(struct ata_port *ap); 222 u8 (*check_altstatus)(struct ata_port *ap);
223 </programlisting> 223 </programlisting>
224 224
225 <para> 225 <para>
226 Reads the Status/AltStatus ATA shadow register from 226 Reads the Status/AltStatus ATA shadow register from
227 hardware. On some hardware, reading the Status register has 227 hardware. On some hardware, reading the Status register has
228 the side effect of clearing the interrupt condition. 228 the side effect of clearing the interrupt condition.
229 Most drivers for taskfile-based hardware use 229 Most drivers for taskfile-based hardware use
230 ata_check_status() for this hook. 230 ata_check_status() for this hook.
231 </para> 231 </para>
232 <para> 232 <para>
233 Note that because this is called from ata_device_add(), at 233 Note that because this is called from ata_device_add(), at
234 least a dummy function that clears device interrupts must be 234 least a dummy function that clears device interrupts must be
235 provided for all drivers, even if the controller doesn't 235 provided for all drivers, even if the controller doesn't
236 actually have a taskfile status register. 236 actually have a taskfile status register.
237 </para> 237 </para>
238 238
239 </sect2> 239 </sect2>
240 240
241 <sect2><title>Select ATA device on bus</title> 241 <sect2><title>Select ATA device on bus</title>
242 <programlisting> 242 <programlisting>
243 void (*dev_select)(struct ata_port *ap, unsigned int device); 243 void (*dev_select)(struct ata_port *ap, unsigned int device);
244 </programlisting> 244 </programlisting>
245 245
246 <para> 246 <para>
247 Issues the low-level hardware command(s) that cause one of N 247 Issues the low-level hardware command(s) that cause one of N
248 hardware devices to be considered 'selected' (active and 248 hardware devices to be considered 'selected' (active and
249 available for use) on the ATA bus. This generally has no 249 available for use) on the ATA bus. This generally has no
250 meaning on FIS-based devices. 250 meaning on FIS-based devices.
251 </para> 251 </para>
252 <para> 252 <para>
253 Most drivers for taskfile-based hardware use 253 Most drivers for taskfile-based hardware use
254 ata_std_dev_select() for this hook. Controllers which do not 254 ata_std_dev_select() for this hook. Controllers which do not
255 support second drives on a port (such as SATA controllers) will 255 support second drives on a port (such as SATA controllers) will
256 use ata_noop_dev_select(). 256 use ata_noop_dev_select().
257 </para> 257 </para>
258 258
259 </sect2> 259 </sect2>
260 260
261 <sect2><title>Private tuning method</title> 261 <sect2><title>Private tuning method</title>
262 <programlisting> 262 <programlisting>
263 void (*set_mode) (struct ata_port *ap); 263 void (*set_mode) (struct ata_port *ap);
264 </programlisting> 264 </programlisting>
265 265
266 <para> 266 <para>
267 By default libata performs drive and controller tuning in 267 By default libata performs drive and controller tuning in
268 accordance with the ATA timing rules and also applies blacklists 268 accordance with the ATA timing rules and also applies blacklists
269 and cable limits. Some controllers need special handling and have 269 and cable limits. Some controllers need special handling and have
270 custom tuning rules, typically RAID controllers that use ATA 270 custom tuning rules, typically RAID controllers that use ATA
271 commands but do not actually do drive timing. 271 commands but do not actually do drive timing.
272 </para> 272 </para>
273 273
274 <warning> 274 <warning>
275 <para> 275 <para>
276 This hook should not be used to replace the standard controller 276 This hook should not be used to replace the standard controller
277 tuning logic when a controller has quirks. Replacing the default 277 tuning logic when a controller has quirks. Replacing the default
278 tuning logic in that case would bypass handling for drive and 278 tuning logic in that case would bypass handling for drive and
279 bridge quirks that may be important to data reliability. If a 279 bridge quirks that may be important to data reliability. If a
280 controller needs to filter the mode selection it should use the 280 controller needs to filter the mode selection it should use the
281 mode_filter hook instead. 281 mode_filter hook instead.
282 </para> 282 </para>
283 </warning> 283 </warning>
284 284
285 </sect2> 285 </sect2>
286 286
287 <sect2><title>Control PCI IDE BMDMA engine</title> 287 <sect2><title>Control PCI IDE BMDMA engine</title>
288 <programlisting> 288 <programlisting>
289 void (*bmdma_setup) (struct ata_queued_cmd *qc); 289 void (*bmdma_setup) (struct ata_queued_cmd *qc);
290 void (*bmdma_start) (struct ata_queued_cmd *qc); 290 void (*bmdma_start) (struct ata_queued_cmd *qc);
291 void (*bmdma_stop) (struct ata_port *ap); 291 void (*bmdma_stop) (struct ata_port *ap);
292 u8 (*bmdma_status) (struct ata_port *ap); 292 u8 (*bmdma_status) (struct ata_port *ap);
293 </programlisting> 293 </programlisting>
294 294
295 <para> 295 <para>
296 When setting up an IDE BMDMA transaction, these hooks arm 296 When setting up an IDE BMDMA transaction, these hooks arm
297 (->bmdma_setup), fire (->bmdma_start), and halt (->bmdma_stop) 297 (->bmdma_setup), fire (->bmdma_start), and halt (->bmdma_stop)
298 the hardware's DMA engine. ->bmdma_status is used to read the standard 298 the hardware's DMA engine. ->bmdma_status is used to read the standard
299 PCI IDE DMA Status register. 299 PCI IDE DMA Status register.
300 </para> 300 </para>
301 301
302 <para> 302 <para>
303 These hooks are typically either no-ops, or simply not implemented, in 303 These hooks are typically either no-ops, or simply not implemented, in
304 FIS-based drivers. 304 FIS-based drivers.
305 </para> 305 </para>
306 <para> 306 <para>
307 Most legacy IDE drivers use ata_bmdma_setup() for the bmdma_setup() 307 Most legacy IDE drivers use ata_bmdma_setup() for the bmdma_setup()
308 hook. ata_bmdma_setup() will write the pointer to the PRD table to 308 hook. ata_bmdma_setup() will write the pointer to the PRD table to
309 the IDE PRD Table Address register, enable DMA in the DMA Command 309 the IDE PRD Table Address register, enable DMA in the DMA Command
310 register, and call exec_command() to begin the transfer. 310 register, and call exec_command() to begin the transfer.
311 </para> 311 </para>
312 <para> 312 <para>
313 Most legacy IDE drivers use ata_bmdma_start() for the bmdma_start() 313 Most legacy IDE drivers use ata_bmdma_start() for the bmdma_start()
314 hook. ata_bmdma_start() will write the ATA_DMA_START flag to the DMA 314 hook. ata_bmdma_start() will write the ATA_DMA_START flag to the DMA
315 Command register. 315 Command register.
316 </para> 316 </para>
317 <para> 317 <para>
318 Many legacy IDE drivers use ata_bmdma_stop() for the bmdma_stop() 318 Many legacy IDE drivers use ata_bmdma_stop() for the bmdma_stop()
319 hook. ata_bmdma_stop() clears the ATA_DMA_START flag in the DMA 319 hook. ata_bmdma_stop() clears the ATA_DMA_START flag in the DMA
320 command register. 320 command register.
321 </para> 321 </para>
322 <para> 322 <para>
323 Many legacy IDE drivers use ata_bmdma_status() as the bmdma_status() hook. 323 Many legacy IDE drivers use ata_bmdma_status() as the bmdma_status() hook.
324 </para> 324 </para>
325 325
326 </sect2> 326 </sect2>
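
The start and stop behaviour described above can be modelled in a few lines. This is a userspace sketch, not the libata helpers themselves: the DMA Command register is represented by a byte in memory, the start bit is assumed to be bit 0, and the sketch_ names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the start/stop bit is bit 0 of the DMA Command register. */
#define SKETCH_DMA_START	(1u << 0)

/* Hypothetical model of one port: the register is just a byte in memory. */
struct sketch_bmdma_port {
	uint8_t dma_cmd;
};

/* Fire the DMA engine: set the start bit, leave other bits untouched. */
void sketch_bmdma_start(struct sketch_bmdma_port *p)
{
	p->dma_cmd = (uint8_t)(p->dma_cmd | SKETCH_DMA_START);
}

/* Halt the DMA engine: clear the start bit, leave other bits untouched. */
void sketch_bmdma_stop(struct sketch_bmdma_port *p)
{
	p->dma_cmd = (uint8_t)(p->dma_cmd & ~SKETCH_DMA_START);
}
```

A real driver would perform these updates with port or memory-mapped I/O accessors against the controller's registers rather than a plain memory write.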
327 327
328 <sect2><title>High-level taskfile hooks</title> 328 <sect2><title>High-level taskfile hooks</title>
329 <programlisting> 329 <programlisting>
330 void (*qc_prep) (struct ata_queued_cmd *qc); 330 void (*qc_prep) (struct ata_queued_cmd *qc);
331 int (*qc_issue) (struct ata_queued_cmd *qc); 331 int (*qc_issue) (struct ata_queued_cmd *qc);
332 </programlisting> 332 </programlisting>
333 333
334 <para> 334 <para>
335 Higher-level hooks, these two hooks can potentially supersede 335 Higher-level hooks, these two hooks can potentially supersede
336 several of the above taskfile/DMA engine hooks. ->qc_prep is 336 several of the above taskfile/DMA engine hooks. ->qc_prep is
337 called after the buffers have been DMA-mapped, and is typically 337 called after the buffers have been DMA-mapped, and is typically
338 used to populate the hardware's DMA scatter-gather table. 338 used to populate the hardware's DMA scatter-gather table.
339 Most drivers use the standard ata_qc_prep() helper function, but 339 Most drivers use the standard ata_qc_prep() helper function, but
340 more advanced drivers roll their own. 340 more advanced drivers roll their own.
341 </para> 341 </para>
342 <para> 342 <para>
343 ->qc_issue is used to make a command active, once the hardware 343 ->qc_issue is used to make a command active, once the hardware
344 and S/G tables have been prepared. IDE BMDMA drivers use the 344 and S/G tables have been prepared. IDE BMDMA drivers use the
345 helper function ata_qc_issue_prot() for taskfile protocol-based 345 helper function ata_qc_issue_prot() for taskfile protocol-based
346 dispatch. More advanced drivers implement their own ->qc_issue. 346 dispatch. More advanced drivers implement their own ->qc_issue.
347 </para> 347 </para>
348 <para> 348 <para>
349 ata_qc_issue_prot() calls ->tf_load(), ->bmdma_setup(), and 349 ata_qc_issue_prot() calls ->tf_load(), ->bmdma_setup(), and
350 ->bmdma_start() as necessary to initiate a transfer. 350 ->bmdma_start() as necessary to initiate a transfer.
351 </para> 351 </para>
352 352
353 </sect2> 353 </sect2>
354 354
355 <sect2><title>Exception and probe handling (EH)</title> 355 <sect2><title>Exception and probe handling (EH)</title>
356 <programlisting> 356 <programlisting>
357 void (*eng_timeout) (struct ata_port *ap); 357 void (*eng_timeout) (struct ata_port *ap);
358 void (*phy_reset) (struct ata_port *ap); 358 void (*phy_reset) (struct ata_port *ap);
359 </programlisting> 359 </programlisting>
360 360
361 <para> 361 <para>
362 Deprecated. Use ->error_handler() instead. 362 Deprecated. Use ->error_handler() instead.
363 </para> 363 </para>
364 364
365 <programlisting> 365 <programlisting>
366 void (*freeze) (struct ata_port *ap); 366 void (*freeze) (struct ata_port *ap);
367 void (*thaw) (struct ata_port *ap); 367 void (*thaw) (struct ata_port *ap);
368 </programlisting> 368 </programlisting>
369 369
370 <para> 370 <para>
371 ata_port_freeze() is called when HSM violations or some other 371 ata_port_freeze() is called when HSM violations or some other
372 condition disrupts normal operation of the port. A frozen port 372 condition disrupts normal operation of the port. A frozen port
373 is not allowed to perform any operation until the port is 373 is not allowed to perform any operation until the port is
374 thawed, which usually follows a successful reset. 374 thawed, which usually follows a successful reset.
375 </para> 375 </para>
376 376
377 <para> 377 <para>
378 The optional ->freeze() callback can be used for freezing the port 378 The optional ->freeze() callback can be used for freezing the port
379 hardware-wise (e.g. mask interrupt and stop DMA engine). If a 379 hardware-wise (e.g. mask interrupt and stop DMA engine). If a
380 port cannot be frozen hardware-wise, the interrupt handler 380 port cannot be frozen hardware-wise, the interrupt handler
381 must ack and clear interrupts unconditionally while the port 381 must ack and clear interrupts unconditionally while the port
382 is frozen. 382 is frozen.
383 </para> 383 </para>
384 <para> 384 <para>
385 The optional ->thaw() callback is called to perform the opposite of ->freeze(): 385 The optional ->thaw() callback is called to perform the opposite of ->freeze():
386 prepare the port for normal operation once again. Unmask interrupts, 386 prepare the port for normal operation once again. Unmask interrupts,
387 start DMA engine, etc. 387 start DMA engine, etc.
388 </para> 388 </para>
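<para>
The freeze/thaw contract described above can be sketched in plain C.
This is only an illustration under stated assumptions: struct frz_port
and the frz_* functions are hypothetical stand-ins, not libata
structures or helpers, and hardware state is modelled with booleans.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-port state; not the real struct ata_port. */
struct frz_port {
    bool frozen;       /* set while the port is frozen */
    bool irq_masked;   /* models the controller's interrupt mask */
    bool dma_running;  /* models the DMA engine state */
    bool irq_pending;  /* models a latched interrupt status bit */
    int  handled;      /* interrupts processed on the normal path */
};

/* ->freeze() analogue: mask interrupts and stop the DMA engine. */
static void frz_freeze(struct frz_port *p)
{
    p->irq_masked = true;
    p->dma_running = false;
    p->frozen = true;
}

/* ->thaw() analogue: the opposite of freeze, ready for normal operation. */
static void frz_thaw(struct frz_port *p)
{
    p->irq_pending = false;  /* drop anything latched while frozen */
    p->irq_masked = false;
    p->dma_running = true;
    p->frozen = false;
}

/*
 * Interrupt handler analogue. The interrupt is always acked and cleared;
 * command processing only happens when the port is not frozen.
 * Returns true if the interrupt was processed on the normal path.
 */
static bool frz_irq(struct frz_port *p)
{
    if (!p->irq_pending)
        return false;
    p->irq_pending = false;  /* ack and clear unconditionally */
    if (p->frozen)
        return false;        /* no command processing while frozen */
    p->handled++;
    return true;
}
```

<para>
The unconditional ack in frz_irq() models the case where the hardware
cannot be frozen, so a stray interrupt must not be left asserted while
the port is frozen.
</para>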
389 389
390 <programlisting> 390 <programlisting>
391 void (*error_handler) (struct ata_port *ap); 391 void (*error_handler) (struct ata_port *ap);
392 </programlisting> 392 </programlisting>
393 393
394 <para> 394 <para>
395 ->error_handler() is a driver's hook into probe, hotplug, and recovery 395 ->error_handler() is a driver's hook into probe, hotplug, and recovery
396 and other exceptional conditions. The primary responsibility of an 396 and other exceptional conditions. The primary responsibility of an
397 implementation is to call ata_do_eh() or ata_bmdma_drive_eh() with a set 397 implementation is to call ata_do_eh() or ata_bmdma_drive_eh() with a set
398 of EH hooks as arguments: 398 of EH hooks as arguments:
399 </para> 399 </para>
400 400
401 <para> 401 <para>
402 'prereset' hook (may be NULL) is called during an EH reset, before any other actions 402 'prereset' hook (may be NULL) is called during an EH reset, before any other actions
403 are taken. 403 are taken.
404 </para> 404 </para>
405 405
406 <para> 406 <para>
407 'postreset' hook (may be NULL) is called after the EH reset is performed. 407 'postreset' hook (may be NULL) is called after the EH reset is performed.
408 </para> 408 </para>
409 409
410 <para> 410 <para>
411 Based on existing conditions, severity of the problem, and hardware capabilities, 411 Based on existing conditions, severity of the problem, and hardware capabilities,
412 either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be 412 either 'softreset' (may be NULL) or 'hardreset' (may be NULL) will be
413 called to perform the low-level EH reset. 413 called to perform the low-level EH reset.
414 </para> 414 </para>
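<para>
The order in which the four EH hooks run can be sketched as follows.
This is a simplified model, not the libata implementation: the hook
signature, struct ehr_hooks, the want_hard parameter, and the fallback
to the other reset method when one hook is NULL are all assumptions
made for illustration.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook type; the real libata hooks take other arguments. */
typedef int (*ehr_reset_fn)(void *port);

struct ehr_hooks {
    ehr_reset_fn prereset;   /* may be NULL */
    ehr_reset_fn softreset;  /* may be NULL */
    ehr_reset_fn hardreset;  /* may be NULL */
    ehr_reset_fn postreset;  /* may be NULL */
};

/*
 * Run one reset attempt. want_hard stands in for the decision the EH
 * core makes from the conditions, severity and hardware capabilities.
 * Returns 0 on success, nonzero on failure or if no reset hook exists.
 */
static int ehr_reset(const struct ehr_hooks *h, void *port, int want_hard)
{
    ehr_reset_fn reset;
    int rc;

    if (h->prereset) {
        rc = h->prereset(port);
        if (rc)
            return rc;
    }

    reset = want_hard ? h->hardreset : h->softreset;
    if (!reset)  /* assumed fallback: use whichever method is present */
        reset = want_hard ? h->softreset : h->hardreset;
    if (!reset)
        return -1;

    rc = reset(port);
    if (rc)
        return rc;

    if (h->postreset)
        h->postreset(port);
    return 0;
}

/* Example hooks that record the order in which they were called. */
struct ehr_trace { char log[8]; int n; };

static void ehr_note(void *port, char c)
{
    struct ehr_trace *t = port;

    if (t->n < (int)sizeof(t->log) - 1)
        t->log[t->n++] = c;
}

static int ehr_pre(void *port)  { ehr_note(port, 'p'); return 0; }
static int ehr_soft(void *port) { ehr_note(port, 's'); return 0; }
static int ehr_hard(void *port) { ehr_note(port, 'h'); return 0; }
static int ehr_post(void *port) { ehr_note(port, 'o'); return 0; }
```

<para>
With all four example hooks installed and a soft reset requested, the
trace records prereset, softreset, postreset in that order.
</para>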
415 415
416 <programlisting> 416 <programlisting>
417 void (*post_internal_cmd) (struct ata_queued_cmd *qc); 417 void (*post_internal_cmd) (struct ata_queued_cmd *qc);
418 </programlisting> 418 </programlisting>
419 419
420 <para> 420 <para>
421 Perform any hardware-specific actions necessary to finish processing 421 Perform any hardware-specific actions necessary to finish processing
422 after executing a probe-time or EH-time command via ata_exec_internal(). 422 after executing a probe-time or EH-time command via ata_exec_internal().
423 </para> 423 </para>
424 424
425 </sect2> 425 </sect2>
426 426
427 <sect2><title>Hardware interrupt handling</title> 427 <sect2><title>Hardware interrupt handling</title>
428 <programlisting> 428 <programlisting>
429 irqreturn_t (*irq_handler)(int, void *, struct pt_regs *); 429 irqreturn_t (*irq_handler)(int, void *, struct pt_regs *);
430 void (*irq_clear) (struct ata_port *); 430 void (*irq_clear) (struct ata_port *);
431 </programlisting> 431 </programlisting>
432 432
433 <para> 433 <para>
434 ->irq_handler is the interrupt handling routine registered with 434 ->irq_handler is the interrupt handling routine registered with
435 the system, by libata. ->irq_clear is called during probe just 435 the system, by libata. ->irq_clear is called during probe just
436 before the interrupt handler is registered, to be sure hardware 436 before the interrupt handler is registered, to be sure hardware
437 is quiet. 437 is quiet.
438 </para> 438 </para>
439 <para> 439 <para>
440 The second argument, dev_instance, should be cast to a pointer 440 The second argument, dev_instance, should be cast to a pointer
441 to struct ata_host_set. 441 to struct ata_host_set.
442 </para> 442 </para>
443 <para> 443 <para>
444 Most legacy IDE drivers use ata_interrupt() for the 444 Most legacy IDE drivers use ata_interrupt() for the
445 irq_handler hook, which scans all ports in the host_set, 445 irq_handler hook, which scans all ports in the host_set,
446 determines which queued command was active (if any), and calls 446 determines which queued command was active (if any), and calls
447 ata_host_intr(ap,qc). 447 ata_host_intr(ap,qc).
448 </para> 448 </para>
449 <para> 449 <para>
450 Most legacy IDE drivers use ata_bmdma_irq_clear() for the 450 Most legacy IDE drivers use ata_bmdma_irq_clear() for the
451 irq_clear() hook, which simply clears the interrupt and error 451 irq_clear() hook, which simply clears the interrupt and error
452 flags in the DMA status register. 452 flags in the DMA status register.
453 </para> 453 </para>
454 454
455 </sect2> 455 </sect2>
456 456
457 <sect2><title>SATA phy read/write</title> 457 <sect2><title>SATA phy read/write</title>
458 <programlisting> 458 <programlisting>
459 u32 (*scr_read) (struct ata_port *ap, unsigned int sc_reg); 459 u32 (*scr_read) (struct ata_port *ap, unsigned int sc_reg);
460 void (*scr_write) (struct ata_port *ap, unsigned int sc_reg, 460 void (*scr_write) (struct ata_port *ap, unsigned int sc_reg,
461 u32 val); 461 u32 val);
462 </programlisting> 462 </programlisting>
463 463
464 <para> 464 <para>
465 Read and write standard SATA phy registers. Currently only used 465 Read and write standard SATA phy registers. Currently only used
466 if ->phy_reset hook called the sata_phy_reset() helper function. 466 if ->phy_reset hook called the sata_phy_reset() helper function.
467 sc_reg is one of SCR_STATUS, SCR_CONTROL, SCR_ERROR, or SCR_ACTIVE. 467 sc_reg is one of SCR_STATUS, SCR_CONTROL, SCR_ERROR, or SCR_ACTIVE.
468 </para> 468 </para>
469 469
470 </sect2> 470 </sect2>
471 471
472 <sect2><title>Init and shutdown</title> 472 <sect2><title>Init and shutdown</title>
473 <programlisting> 473 <programlisting>
474 int (*port_start) (struct ata_port *ap); 474 int (*port_start) (struct ata_port *ap);
475 void (*port_stop) (struct ata_port *ap); 475 void (*port_stop) (struct ata_port *ap);
476 void (*host_stop) (struct ata_host_set *host_set); 476 void (*host_stop) (struct ata_host_set *host_set);
477 </programlisting> 477 </programlisting>
478 478
479 <para> 479 <para>
480 ->port_start() is called just after the data structures for each 480 ->port_start() is called just after the data structures for each
481 port are initialized. Typically this is used to alloc per-port 481 port are initialized. Typically this is used to alloc per-port
482 DMA buffers / tables / rings, enable DMA engines, and similar 482 DMA buffers / tables / rings, enable DMA engines, and similar
483 tasks. Some drivers also use this entry point as a chance to 483 tasks. Some drivers also use this entry point as a chance to
484 allocate driver-private memory for ap->private_data. 484 allocate driver-private memory for ap->private_data.
485 </para> 485 </para>
486 <para> 486 <para>
487 Many drivers use ata_port_start() as this hook or call 487 Many drivers use ata_port_start() as this hook or call
488 it from their own port_start() hooks. ata_port_start() 488 it from their own port_start() hooks. ata_port_start()
489 allocates space for a legacy IDE PRD table and returns. 489 allocates space for a legacy IDE PRD table and returns.
490 </para> 490 </para>
491 <para> 491 <para>
492 ->port_stop() is called after ->host_stop(). Its sole function 492 ->port_stop() is called after ->host_stop(). Its sole function
493 is to release DMA/memory resources, now that they are no longer 493 is to release DMA/memory resources, now that they are no longer
494 actively being used. Many drivers also free driver-private 494 actively being used. Many drivers also free driver-private
495 data from port at this time. 495 data from port at this time.
496 </para> 496 </para>
497 <para> 497 <para>
498 Many drivers use ata_port_stop() as this hook, which frees the 498 Many drivers use ata_port_stop() as this hook, which frees the
499 PRD table. 499 PRD table.
500 </para> 500 </para>
501 <para> 501 <para>
502 ->host_stop() is called after all ->port_stop() calls 502 ->host_stop() is called after all ->port_stop() calls
503 have completed. The hook must finalize hardware shutdown, release DMA 503 have completed. The hook must finalize hardware shutdown, release DMA
504 and other resources, etc. 504 and other resources, etc.
505 This hook may be specified as NULL, in which case it is not called. 505 This hook may be specified as NULL, in which case it is not called.
506 </para> 506 </para>
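<para>
The pairing of ->port_start() and ->port_stop() can be illustrated
with a user-space sketch. Everything here is an assumption made for
the example: the ps_* names are hypothetical, calloc()/free() stand in
for the kernel's DMA and memory allocators, and the table size is
arbitrary.
</para>

```c
#include <assert.h>
#include <stdlib.h>

#define PS_PRD_ENTRIES 128  /* hypothetical table size */

/* Stand-in for one PRD table entry. */
struct ps_prd { unsigned int addr; unsigned int flags_len; };

/* Stand-in for the per-port fields this example touches. */
struct ps_port {
    struct ps_prd *prd;   /* models the per-port PRD table */
    void *private_data;   /* models ap->private_data */
};

/* Driver-private per-port data, allocated in port_start. */
struct ps_priv { int cmd_slots; };

/* ->port_start() analogue: allocate tables and private data. */
static int ps_port_start(struct ps_port *ap)
{
    struct ps_priv *pp;

    ap->prd = calloc(PS_PRD_ENTRIES, sizeof(*ap->prd));
    if (!ap->prd)
        return -1;

    pp = calloc(1, sizeof(*pp));
    if (!pp) {
        free(ap->prd);   /* undo the first allocation on failure */
        ap->prd = NULL;
        return -1;
    }
    pp->cmd_slots = 1;
    ap->private_data = pp;
    return 0;
}

/* ->port_stop() analogue: release everything port_start allocated. */
static void ps_port_stop(struct ps_port *ap)
{
    free(ap->private_data);
    ap->private_data = NULL;
    free(ap->prd);
    ap->prd = NULL;
}
```

<para>
Clearing the pointers after freeing keeps a second ps_port_stop() call
harmless, which is a design choice of this sketch rather than a libata
requirement.
</para>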
507 507
508 </sect2> 508 </sect2>
509 509
510 </sect1> 510 </sect1>
511 </chapter> 511 </chapter>
512 512
513 <chapter id="libataEH"> 513 <chapter id="libataEH">
514 <title>Error handling</title> 514 <title>Error handling</title>
515 515
516 <para> 516 <para>
517 This chapter describes how errors are handled under libata. 517 This chapter describes how errors are handled under libata.
518 Readers are advised to read SCSI EH 518 Readers are advised to read SCSI EH
519 (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first. 519 (Documentation/scsi/scsi_eh.txt) and ATA exceptions doc first.
520 </para> 520 </para>
521 521
522 <sect1><title>Origins of commands</title> 522 <sect1><title>Origins of commands</title>
523 <para> 523 <para>
524 In libata, a command is represented with struct ata_queued_cmd 524 In libata, a command is represented with struct ata_queued_cmd
525 or qc. qc's are preallocated during port initialization and 525 or qc. qc's are preallocated during port initialization and
526 repetitively used for command executions. Currently only one 526 repetitively used for command executions. Currently only one
527 qc is allocated per port but the yet-to-be-merged NCQ branch 527 qc is allocated per port but the yet-to-be-merged NCQ branch
528 allocates one for each tag and maps each qc to NCQ tag 1-to-1. 528 allocates one for each tag and maps each qc to NCQ tag 1-to-1.
529 </para> 529 </para>
530 <para> 530 <para>
531 libata commands can originate from two sources - libata itself 531 libata commands can originate from two sources - libata itself
532 and SCSI midlayer. libata internal commands are used for 532 and SCSI midlayer. libata internal commands are used for
533 initialization and error handling. All normal blk requests 533 initialization and error handling. All normal blk requests
534 and commands for SCSI emulation are passed as SCSI commands 534 and commands for SCSI emulation are passed as SCSI commands
535 through queuecommand callback of SCSI host template. 535 through queuecommand callback of SCSI host template.
536 </para> 536 </para>
537 </sect1> 537 </sect1>
538 538
539 <sect1><title>How commands are issued</title> 539 <sect1><title>How commands are issued</title>
540 540
541 <variablelist> 541 <variablelist>
542 542
543 <varlistentry><term>Internal commands</term> 543 <varlistentry><term>Internal commands</term>
544 <listitem> 544 <listitem>
545 <para> 545 <para>
546 First, qc is allocated and initialized using 546 First, qc is allocated and initialized using
547 ata_qc_new_init(). Although ata_qc_new_init() doesn't 547 ata_qc_new_init(). Although ata_qc_new_init() doesn't
548 implement any wait or retry mechanism when qc is not 548 implement any wait or retry mechanism when qc is not
549 available, internal commands are currently issued only during 549 available, internal commands are currently issued only during
550 initialization and error recovery, so no other command is 550 initialization and error recovery, so no other command is
551 active and allocation is guaranteed to succeed. 551 active and allocation is guaranteed to succeed.
552 </para> 552 </para>
553 <para> 553 <para>
554 Once allocated, the qc's taskfile is initialized for the command to 554 Once allocated, the qc's taskfile is initialized for the command to
555 be executed. qc currently has two mechanisms to notify 555 be executed. qc currently has two mechanisms to notify
556 completion. One is via qc->complete_fn() callback and the 556 completion. One is via qc->complete_fn() callback and the
557 other is completion qc->waiting. qc->complete_fn() callback 557 other is completion qc->waiting. qc->complete_fn() callback
558 is the asynchronous path used by normal SCSI translated 558 is the asynchronous path used by normal SCSI translated
559 commands and qc->waiting is the synchronous (issuer sleeps in 559 commands and qc->waiting is the synchronous (issuer sleeps in
560 process context) path used by internal commands. 560 process context) path used by internal commands.
561 </para> 561 </para>
562 <para> 562 <para>
563 Once initialization is complete, host_set lock is acquired 563 Once initialization is complete, host_set lock is acquired
564 and the qc is issued. 564 and the qc is issued.
565 </para> 565 </para>
566 </listitem> 566 </listitem>
567 </varlistentry> 567 </varlistentry>
568 568
569 <varlistentry><term>SCSI commands</term> 569 <varlistentry><term>SCSI commands</term>
570 <listitem> 570 <listitem>
571 <para> 571 <para>
572 All libata drivers use ata_scsi_queuecmd() as 572 All libata drivers use ata_scsi_queuecmd() as
573 hostt->queuecommand callback. scmds can either be simulated 573 hostt->queuecommand callback. scmds can either be simulated
574 or translated. No qc is involved in processing a simulated 574 or translated. No qc is involved in processing a simulated
575 scmd. The result is computed right away and the scmd is 575 scmd. The result is computed right away and the scmd is
576 completed. 576 completed.
577 </para> 577 </para>
578 <para> 578 <para>
579 For a translated scmd, ata_qc_new_init() is invoked to 579 For a translated scmd, ata_qc_new_init() is invoked to
580 allocate a qc and the scmd is translated into the qc. SCSI 580 allocate a qc and the scmd is translated into the qc. SCSI
581 midlayer's completion notification function pointer is stored 581 midlayer's completion notification function pointer is stored
582 into qc->scsidone. 582 into qc->scsidone.
583 </para> 583 </para>
584 <para> 584 <para>
585 qc->complete_fn() callback is used for completion 585 qc->complete_fn() callback is used for completion
586 notification. ATA commands use ata_scsi_qc_complete() while 586 notification. ATA commands use ata_scsi_qc_complete() while
587 ATAPI commands use atapi_qc_complete(). Both functions end up 587 ATAPI commands use atapi_qc_complete(). Both functions end up
588 calling qc->scsidone to notify upper layer when the qc is 588 calling qc->scsidone to notify upper layer when the qc is
589 finished. After translation is completed, the qc is issued 589 finished. After translation is completed, the qc is issued
590 with ata_qc_issue(). 590 with ata_qc_issue().
591 </para> 591 </para>
592 <para> 592 <para>
593 Note that SCSI midlayer invokes hostt->queuecommand while 593 Note that SCSI midlayer invokes hostt->queuecommand while
594 holding host_set lock, so all of the above occurs while holding 594 holding host_set lock, so all of the above occurs while holding
595 host_set lock. 595 host_set lock.
596 </para> 596 </para>
597 </listitem> 597 </listitem>
598 </varlistentry> 598 </varlistentry>
599 599
600 </variablelist> 600 </variablelist>
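<para>
The two completion notification paths described above, the callback
used by translated SCSI commands and the waiter used by internal
commands, can be modelled roughly as follows. struct ntf_qc and the
ntf_* functions are hypothetical, and a plain boolean stands in for
the completion that the issuer sleeps on.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ntf_qc;
typedef int (*ntf_complete_fn)(struct ntf_qc *qc);

/* Hypothetical stand-in for struct ata_queued_cmd. */
struct ntf_qc {
    ntf_complete_fn complete_fn;          /* asynchronous notification */
    bool *waiting;                        /* models the issuer's completion */
    void (*scsidone)(struct ntf_qc *qc);  /* upper-layer notifier */
    int done_calls;                       /* counts scsidone invocations */
};

/* Upper-layer notifier analogue. */
static void ntf_scsidone(struct ntf_qc *qc)
{
    qc->done_calls++;
}

/* Completion callback analogue for a translated SCSI command. */
static int ntf_scsi_complete(struct ntf_qc *qc)
{
    if (qc->scsidone)
        qc->scsidone(qc);
    return 0;
}

/* Finish a qc: run the callback, then wake a synchronous waiter. */
static void ntf_finish(struct ntf_qc *qc)
{
    if (qc->complete_fn)
        qc->complete_fn(qc);
    if (qc->waiting) {
        bool *w = qc->waiting;

        qc->waiting = NULL;  /* clear first ... */
        *w = true;           /* ... then signal the waiter */
    }
}
```

<para>
A translated command sets complete_fn and scsidone and leaves waiting
NULL; an internal command does the reverse and checks its flag after
ntf_finish() returns.
</para>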
601 </sect1> 601 </sect1>
602 602
603 <sect1><title>How commands are processed</title> 603 <sect1><title>How commands are processed</title>
604 <para> 604 <para>
605 Depending on which protocol and which controller are used, 605 Depending on which protocol and which controller are used,
606 commands are processed differently. For the purpose of 606 commands are processed differently. For the purpose of
607 discussion, a controller which uses taskfile interface and all 607 discussion, a controller which uses taskfile interface and all
608 standard callbacks is assumed. 608 standard callbacks is assumed.
609 </para> 609 </para>
610 <para> 610 <para>
611 Currently 6 ATA command protocols are used. They can be 611 Currently 6 ATA command protocols are used. They can be
612 sorted into the following four categories according to how 612 sorted into the following four categories according to how
613 they are processed. 613 they are processed.
614 </para> 614 </para>
615 615
616 <variablelist> 616 <variablelist>
617 <varlistentry><term>ATA NO DATA or DMA</term> 617 <varlistentry><term>ATA NO DATA or DMA</term>
618 <listitem> 618 <listitem>
619 <para> 619 <para>
620 ATA_PROT_NODATA and ATA_PROT_DMA fall into this category. 620 ATA_PROT_NODATA and ATA_PROT_DMA fall into this category.
621 These types of commands don't require any software 621 These types of commands don't require any software
622 intervention once issued. Device will raise interrupt on 622 intervention once issued. Device will raise interrupt on
623 completion. 623 completion.
624 </para> 624 </para>
625 </listitem> 625 </listitem>
626 </varlistentry> 626 </varlistentry>
627 627
628 <varlistentry><term>ATA PIO</term> 628 <varlistentry><term>ATA PIO</term>
629 <listitem> 629 <listitem>
630 <para> 630 <para>
631 ATA_PROT_PIO is in this category. libata currently 631 ATA_PROT_PIO is in this category. libata currently
632 implements PIO with polling. ATA_NIEN bit is set to turn 632 implements PIO with polling. ATA_NIEN bit is set to turn
633 off interrupt and pio_task on ata_wq performs polling and 633 off interrupt and pio_task on ata_wq performs polling and
634 IO. 634 IO.
635 </para> 635 </para>
636 </listitem> 636 </listitem>
637 </varlistentry> 637 </varlistentry>
638 638
639 <varlistentry><term>ATAPI NODATA or DMA</term> 639 <varlistentry><term>ATAPI NODATA or DMA</term>
640 <listitem> 640 <listitem>
641 <para> 641 <para>
642 ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this 642 ATA_PROT_ATAPI_NODATA and ATA_PROT_ATAPI_DMA are in this
643 category. packet_task is used to poll BSY bit after 643 category. packet_task is used to poll BSY bit after
644 issuing PACKET command. Once BSY is turned off by the 644 issuing PACKET command. Once BSY is turned off by the
645 device, packet_task transfers CDB and hands off processing 645 device, packet_task transfers CDB and hands off processing
646 to interrupt handler. 646 to interrupt handler.
647 </para> 647 </para>
648 </listitem> 648 </listitem>
649 </varlistentry> 649 </varlistentry>
650 650
651 <varlistentry><term>ATAPI PIO</term> 651 <varlistentry><term>ATAPI PIO</term>
652 <listitem> 652 <listitem>
653 <para> 653 <para>
654 ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set 654 ATA_PROT_ATAPI is in this category. ATA_NIEN bit is set
655 and, as in ATAPI NODATA or DMA, packet_task submits cdb. 655 and, as in ATAPI NODATA or DMA, packet_task submits cdb.
656 However, after submitting cdb, further processing (data 656 However, after submitting cdb, further processing (data
657 transfer) is handed off to pio_task. 657 transfer) is handed off to pio_task.
658 </para> 658 </para>
659 </listitem> 659 </listitem>
660 </varlistentry> 660 </varlistentry>
661 </variablelist> 661 </variablelist>
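<para>
The mapping from the six protocols to the four processing categories
above can be written as a small classifier. The enum names and values
below are local stand-ins chosen for this sketch, not the definitions
from the libata headers.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins named after the protocols listed above. */
enum prot_id {
    PROT_NODATA,
    PROT_DMA,
    PROT_PIO,
    PROT_ATAPI_NODATA,
    PROT_ATAPI_DMA,
    PROT_ATAPI
};

/* The four processing categories. */
enum prot_category {
    CAT_ATA_NODATA_OR_DMA,
    CAT_ATA_PIO,
    CAT_ATAPI_NODATA_OR_DMA,
    CAT_ATAPI_PIO
};

static enum prot_category prot_classify(enum prot_id prot)
{
    switch (prot) {
    case PROT_NODATA:
    case PROT_DMA:
        return CAT_ATA_NODATA_OR_DMA;
    case PROT_PIO:
        return CAT_ATA_PIO;
    case PROT_ATAPI_NODATA:
    case PROT_ATAPI_DMA:
        return CAT_ATAPI_NODATA_OR_DMA;
    case PROT_ATAPI:
        return CAT_ATAPI_PIO;
    }
    return CAT_ATA_NODATA_OR_DMA;  /* not reached for valid input */
}

/* packet_task submits the CDB for both ATAPI categories. */
static bool prot_uses_packet_task(enum prot_category c)
{
    return c == CAT_ATAPI_NODATA_OR_DMA || c == CAT_ATAPI_PIO;
}

/* pio_task performs polled data transfer for both PIO categories. */
static bool prot_uses_pio_task(enum prot_category c)
{
    return c == CAT_ATA_PIO || c == CAT_ATAPI_PIO;
}

/* The interrupt handler finishes the command in the remaining cases. */
static bool prot_finishes_by_irq(enum prot_category c)
{
    return c == CAT_ATA_NODATA_OR_DMA || c == CAT_ATAPI_NODATA_OR_DMA;
}
```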
662 </sect1> 662 </sect1>
663 663
664 <sect1><title>How commands are completed</title> 664 <sect1><title>How commands are completed</title>
665 <para> 665 <para>
666 Once issued, all qc's are either completed with 666 Once issued, all qc's are either completed with
667 ata_qc_complete() or time out. For commands which are handled 667 ata_qc_complete() or time out. For commands which are handled
668 by interrupts, ata_host_intr() invokes ata_qc_complete(), and, 668 by interrupts, ata_host_intr() invokes ata_qc_complete(), and,
669 for PIO tasks, pio_task invokes ata_qc_complete(). In error 669 for PIO tasks, pio_task invokes ata_qc_complete(). In error
670 cases, packet_task may also complete commands. 670 cases, packet_task may also complete commands.
671 </para> 671 </para>
672 <para> 672 <para>
673 ata_qc_complete() does the following. 673 ata_qc_complete() does the following.
674 </para> 674 </para>
675 675
676 <orderedlist> 676 <orderedlist>
677 677
678 <listitem> 678 <listitem>
679 <para> 679 <para>
680 DMA memory is unmapped. 680 DMA memory is unmapped.
681 </para> 681 </para>
682 </listitem> 682 </listitem>
683 683
684 <listitem> 684 <listitem>
685 <para> 685 <para>
686 ATA_QCFLAG_ACTIVE is cleared from qc->flags. 686 ATA_QCFLAG_ACTIVE is cleared from qc->flags.
687 </para> 687 </para>
688 </listitem> 688 </listitem>
689 689
690 <listitem> 690 <listitem>
691 <para> 691 <para>
692 qc->complete_fn() callback is invoked. If the return value of 692 qc->complete_fn() callback is invoked. If the return value of
693 the callback is not zero, completion is short-circuited and 693 the callback is not zero, completion is short-circuited and
694 ata_qc_complete() returns. 694 ata_qc_complete() returns.
695 </para> 695 </para>
696 </listitem> 696 </listitem>
697 697
698 <listitem> 698 <listitem>
699 <para> 699 <para>
700 __ata_qc_complete() is called, which does 700 __ata_qc_complete() is called, which does
701 <orderedlist> 701 <orderedlist>
702 702
703 <listitem> 703 <listitem>
704 <para> 704 <para>
705 qc->flags is cleared to zero. 705 qc->flags is cleared to zero.
706 </para> 706 </para>
707 </listitem> 707 </listitem>
708 708
709 <listitem> 709 <listitem>
710 <para> 710 <para>
711 ap->active_tag and qc->tag are poisoned. 711 ap->active_tag and qc->tag are poisoned.
712 </para> 712 </para>
713 </listitem> 713 </listitem>
714 714
715 <listitem> 715 <listitem>
716 <para> 716 <para>
717 qc->waiting is cleared &amp; completed (in that order). 717 qc->waiting is cleared &amp; completed (in that order).
718 </para> 718 </para>
719 </listitem> 719 </listitem>
720 720
721 <listitem> 721 <listitem>
722 <para> 722 <para>
723 qc is deallocated by clearing appropriate bit in ap->qactive. 723 qc is deallocated by clearing appropriate bit in ap->qactive.
724 </para> 724 </para>
725 </listitem> 725 </listitem>
726 726
727 </orderedlist> 727 </orderedlist>
728 </para> 728 </para>
729 </listitem> 729 </listitem>
730 730
731 </orderedlist> 731 </orderedlist>
732 732
733 <para> 733 <para>
734 So, it basically notifies upper layer and deallocates qc. One 734 So, it basically notifies upper layer and deallocates qc. One
735 exception is short-circuit path in #3 which is used by 735 exception is short-circuit path in #3 which is used by
736 atapi_qc_complete(). 736 atapi_qc_complete().
737 </para> 737 </para>
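<para>
The completion sequence listed above, including the short-circuit in
step 3, can be sketched as follows. This is a model only: the cq_*
types, the poison value, and the boolean used for DMA mapping state
are assumptions, and the real functions do more than is shown here.
</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CQ_QCFLAG_ACTIVE (1u << 0)  /* stand-in for ATA_QCFLAG_ACTIVE */
#define CQ_TAG_POISON    0xfau      /* arbitrary poison value */

struct cq_qc;
typedef int (*cq_complete_fn)(struct cq_qc *qc);

/* Stand-in for the port fields this example touches. */
struct cq_port {
    unsigned long qactive;    /* one bit per allocated qc */
    unsigned int active_tag;
};

/* Stand-in for struct ata_queued_cmd. */
struct cq_qc {
    struct cq_port *ap;
    unsigned int flags;
    unsigned int tag;
    bool dma_mapped;          /* models the DMA mapping state */
    cq_complete_fn complete_fn;
    bool *waiting;            /* models the issuer's completion */
};

/* Step 4, the __ata_qc_complete() analogue: release the qc. */
static void cq_release(struct cq_qc *qc)
{
    unsigned int tag = qc->tag;
    bool *w = qc->waiting;

    qc->flags = 0;
    qc->ap->active_tag = CQ_TAG_POISON;
    qc->tag = CQ_TAG_POISON;
    qc->waiting = NULL;                /* cleared ... */
    if (w)
        *w = true;                     /* ... then completed */
    qc->ap->qactive &= ~(1ul << tag);  /* deallocate */
}

/* ata_qc_complete() analogue following steps 1 to 4. */
static void cq_complete(struct cq_qc *qc)
{
    qc->dma_mapped = false;                           /* step 1 */
    qc->flags &= ~CQ_QCFLAG_ACTIVE;                   /* step 2 */
    if (qc->complete_fn && qc->complete_fn(qc) != 0)  /* step 3 */
        return;                                       /* short circuit */
    cq_release(qc);                                   /* step 4 */
}

/* Example callbacks: normal completion and the short-circuit case. */
static int cq_done_ok(struct cq_qc *qc)    { (void)qc; return 0; }
static int cq_done_defer(struct cq_qc *qc) { (void)qc; return 1; }
```

<para>
When the callback returns nonzero, the qc keeps its tag and its bit in
qactive, so a later explicit cq_release() is needed to deallocate it.
</para>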
738 <para> 738 <para>
739 For all non-ATAPI commands, whether they fail or not, almost 739 For all non-ATAPI commands, whether they fail or not, almost
740 the same code path is taken and very little error handling 740 the same code path is taken and very little error handling
741 takes place. A qc is completed with success status if it 741 takes place. A qc is completed with success status if it
742 succeeded, with failed status otherwise. 742 succeeded, with failed status otherwise.
743 </para> 743 </para>
744 <para> 744 <para>
745 However, failed ATAPI commands require more handling as 745 However, failed ATAPI commands require more handling as
746 REQUEST SENSE is needed to acquire sense data. If an ATAPI 746 REQUEST SENSE is needed to acquire sense data. If an ATAPI
747 command fails, ata_qc_complete() is invoked with error status, 747 command fails, ata_qc_complete() is invoked with error status,
748 which in turn invokes atapi_qc_complete() via 748 which in turn invokes atapi_qc_complete() via
749 qc->complete_fn() callback. 749 qc->complete_fn() callback.
750 </para> 750 </para>
751 <para> 751 <para>
752 This makes atapi_qc_complete() set scmd->result to 752 This makes atapi_qc_complete() set scmd->result to
753 SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As 753 SAM_STAT_CHECK_CONDITION, complete the scmd and return 1. As
754 the sense data is empty but scmd->result is CHECK CONDITION, 754 the sense data is empty but scmd->result is CHECK CONDITION,
755 SCSI midlayer will invoke EH for the scmd, and returning 1 755 SCSI midlayer will invoke EH for the scmd, and returning 1
756 makes ata_qc_complete() return without deallocating the qc. 756 makes ata_qc_complete() return without deallocating the qc.
757 This leads us to ata_scsi_error() with a partially completed qc. 757 This leads us to ata_scsi_error() with a partially completed qc.
758 </para> 758 </para>
759 759
760 </sect1> 760 </sect1>
761 761
762 <sect1><title>ata_scsi_error()</title> 762 <sect1><title>ata_scsi_error()</title>
763 <para> 763 <para>
764 ata_scsi_error() is the current transportt->eh_strategy_handler() 764 ata_scsi_error() is the current transportt->eh_strategy_handler()
765 for libata. As discussed above, this will be entered in two 765 for libata. As discussed above, this will be entered in two
766 cases - timeout and ATAPI error completion. This function 766 cases - timeout and ATAPI error completion. This function
767 calls low level libata driver's eng_timeout() callback, the 767 calls low level libata driver's eng_timeout() callback, the
768 standard callback for which is ata_eng_timeout(). It checks 768 standard callback for which is ata_eng_timeout(). It checks
769 if a qc is active and calls ata_qc_timeout() on the qc if so. 769 if a qc is active and calls ata_qc_timeout() on the qc if so.
770 Actual error handling occurs in ata_qc_timeout(). 770 Actual error handling occurs in ata_qc_timeout().
771 </para> 771 </para>
772 <para> 772 <para>
773 If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and 773 If EH is invoked for timeout, ata_qc_timeout() stops BMDMA and
774 completes the qc. Note that as we're currently in EH, we 774 completes the qc. Note that as we're currently in EH, we
775 cannot call scsi_done. As described in SCSI EH doc, a 775 cannot call scsi_done. As described in SCSI EH doc, a
776 recovered scmd should be either retried with 776 recovered scmd should be either retried with
777 scsi_queue_insert() or finished with scsi_finish_command(). 777 scsi_queue_insert() or finished with scsi_finish_command().
778 Here, we override qc->scsidone with scsi_finish_command() and 778 Here, we override qc->scsidone with scsi_finish_command() and
779 call ata_qc_complete(). 779 call ata_qc_complete().
780 </para> 780 </para>
781 <para> 781 <para>
782 If EH is invoked due to a failed ATAPI qc, the qc here is 782 If EH is invoked due to a failed ATAPI qc, the qc here is
783 completed but not deallocated. The purpose of this 783 completed but not deallocated. The purpose of this
784 half-completion is to use the qc as a placeholder to make EH 784 half-completion is to use the qc as a placeholder to make EH
785 code reach this place. This is a bit hackish, but it works. 785 code reach this place. This is a bit hackish, but it works.
786 </para> 786 </para>
787 <para> 787 <para>
788 Once control reaches here, the qc is deallocated by invoking 788 Once control reaches here, the qc is deallocated by invoking
789 __ata_qc_complete() explicitly. Then, internal qc for REQUEST 789 __ata_qc_complete() explicitly. Then, internal qc for REQUEST
790 SENSE is issued. Once sense data is acquired, scmd is 790 SENSE is issued. Once sense data is acquired, scmd is
791 finished by directly invoking scsi_finish_command() on the 791 finished by directly invoking scsi_finish_command() on the
792 scmd. Note that as we already have completed and deallocated 792 scmd. Note that as we already have completed and deallocated
793 the qc which was associated with the scmd, we don't need 793 the qc which was associated with the scmd, we don't need
794 to/cannot call ata_qc_complete() again. 794 to/cannot call ata_qc_complete() again.
795 </para> 795 </para>
796 796
797 </sect1> 797 </sect1>
798 798
799 <sect1><title>Problems with the current EH</title> 799 <sect1><title>Problems with the current EH</title>
800 800
801 <itemizedlist> 801 <itemizedlist>
802 802
803 <listitem> 803 <listitem>
804 <para> 804 <para>
805 Error representation is too crude. Currently any and all 805 Error representation is too crude. Currently any and all
806 error conditions are represented with ATA STATUS and ERROR 806 error conditions are represented with ATA STATUS and ERROR
807 registers. Errors which aren't ATA device errors are treated 807 registers. Errors which aren't ATA device errors are treated
808 as ATA device errors by setting ATA_ERR bit. Better error 808 as ATA device errors by setting ATA_ERR bit. Better error
809 descriptor which can properly represent ATA and other 809 descriptor which can properly represent ATA and other
810 errors/exceptions is needed. 810 errors/exceptions is needed.
811 </para> 811 </para>
812 </listitem> 812 </listitem>
813 813
814 <listitem> 814 <listitem>
815 <para> 815 <para>
816 When handling timeouts, no action is taken to make the device 816 When handling timeouts, no action is taken to make the device
817 forget about the timed-out command and become ready for new commands. 817 forget about the timed-out command and become ready for new commands.
818 </para> 818 </para>
819 </listitem> 819 </listitem>
820 820
821 <listitem> 821 <listitem>
822 <para> 822 <para>
823 EH handling via ata_scsi_error() is not properly protected 823 EH handling via ata_scsi_error() is not properly protected
824 from usual command processing. On EH entrance, the device is 824 from usual command processing. On EH entrance, the device is
825 not in quiescent state. Timed out commands may succeed or 825 not in quiescent state. Timed out commands may succeed or
826 fail any time. pio_task and packet_task may still be running. 826 fail any time. pio_task and packet_task may still be running.
827 </para> 827 </para>
828 </listitem> 828 </listitem>
829 829
830 <listitem> 830 <listitem>
831 <para> 831 <para>
832 Too weak error recovery. Devices / controllers causing HSM 832 Too weak error recovery. Devices / controllers causing HSM
833 mismatch errors and other errors quite often require reset to 833 mismatch errors and other errors quite often require reset to
834 return to known state. Also, advanced error handling is 834 return to known state. Also, advanced error handling is
835 necessary to support features like NCQ and hotplug. 835 necessary to support features like NCQ and hotplug.
836 </para> 836 </para>
837 </listitem> 837 </listitem>
838 838
839 <listitem> 839 <listitem>
840 <para> 840 <para>
841 ATA errors are directly handled in the interrupt handler and 841 ATA errors are directly handled in the interrupt handler and
842 PIO errors in pio_task. This is problematic for advanced 842 PIO errors in pio_task. This is problematic for advanced
843 error handling for the following reasons. 843 error handling for the following reasons.
844 </para> 844 </para>
845 <para> 845 <para>
846 First, advanced error handling often requires context and 846 First, advanced error handling often requires context and
847 internal qc execution. 847 internal qc execution.
848 </para> 848 </para>
849 <para> 849 <para>
850 Second, even a simple failure (say, CRC error) needs 850 Second, even a simple failure (say, CRC error) needs
851 information gathering and could trigger complex error handling 851 information gathering and could trigger complex error handling
852 (say, resetting &amp; reconfiguring). Having multiple code 852 (say, resetting &amp; reconfiguring). Having multiple code
853 paths to gather information, enter EH and trigger actions 853 paths to gather information, enter EH and trigger actions
854 makes life painful. 854 makes life painful.
855 </para> 855 </para>
856 <para> 856 <para>
857 Third, scattered EH code makes implementing low level drivers 857 Third, scattered EH code makes implementing low level drivers
858 difficult. Low level drivers override libata callbacks. If 858 difficult. Low level drivers override libata callbacks. If
859 EH is scattered over several places, each affected callback 859 EH is scattered over several places, each affected callback
860 should perform its part of error handling. This can be error 860 should perform its part of error handling. This can be error
861 prone and painful. 861 prone and painful.
862 </para> 862 </para>
863 </listitem> 863 </listitem>
864 864
865 </itemizedlist> 865 </itemizedlist>
866 </sect1> 866 </sect1>
867 </chapter> 867 </chapter>
868 868
869 <chapter id="libataExt"> 869 <chapter id="libataExt">
870 <title>libata Library</title> 870 <title>libata Library</title>
871 !Edrivers/ata/libata-core.c 871 !Edrivers/ata/libata-core.c
872 </chapter> 872 </chapter>
873 873
874 <chapter id="libataInt"> 874 <chapter id="libataInt">
875 <title>libata Core Internals</title> 875 <title>libata Core Internals</title>
876 !Idrivers/ata/libata-core.c 876 !Idrivers/ata/libata-core.c
877 </chapter> 877 </chapter>
878 878
879 <chapter id="libataScsiInt"> 879 <chapter id="libataScsiInt">
880 <title>libata SCSI translation/emulation</title> 880 <title>libata SCSI translation/emulation</title>
881 !Edrivers/ata/libata-scsi.c 881 !Edrivers/ata/libata-scsi.c
882 !Idrivers/ata/libata-scsi.c 882 !Idrivers/ata/libata-scsi.c
883 </chapter> 883 </chapter>
884 884
885 <chapter id="ataExceptions"> 885 <chapter id="ataExceptions">
886 <title>ATA errors &amp; exceptions</title> 886 <title>ATA errors &amp; exceptions</title>
887 887
888 <para> 888 <para>
889 This chapter tries to identify what error/exception conditions exist 889 This chapter tries to identify what error/exception conditions exist
890 for ATA/ATAPI devices and describe how they should be handled in 890 for ATA/ATAPI devices and describe how they should be handled in
891 an implementation-neutral way. 891 an implementation-neutral way.
892 </para> 892 </para>
893 893
894 <para> 894 <para>
895 The term 'error' is used to describe conditions where either an 895 The term 'error' is used to describe conditions where either an
896 explicit error condition is reported from device or a command has 896 explicit error condition is reported from device or a command has
897 timed out. 897 timed out.
898 </para> 898 </para>
899 899
900 <para> 900 <para>
901 The term 'exception' is either used to describe exceptional 901 The term 'exception' is either used to describe exceptional
902 conditions which are not errors (say, power or hotplug events), or 902 conditions which are not errors (say, power or hotplug events), or
903 to describe both errors and non-error exceptional conditions. Where 903 to describe both errors and non-error exceptional conditions. Where
904 explicit distinction between error and exception is necessary, the 904 explicit distinction between error and exception is necessary, the
905 term 'non-error exception' is used. 905 term 'non-error exception' is used.
906 </para> 906 </para>
907 907
908 <sect1 id="excat"> 908 <sect1 id="excat">
909 <title>Exception categories</title> 909 <title>Exception categories</title>
910 <para> 910 <para>
911 Exceptions are described primarily with respect to legacy 911 Exceptions are described primarily with respect to legacy
912 taskfile + bus master IDE interface. If a controller provides 912 taskfile + bus master IDE interface. If a controller provides
913 other better mechanism for error reporting, mapping those into 913 other better mechanism for error reporting, mapping those into
914 categories described below shouldn't be difficult. 914 categories described below shouldn't be difficult.
915 </para> 915 </para>
916 916
917 <para> 917 <para>
918 In the following sections, two recovery actions - reset and 918 In the following sections, two recovery actions - reset and
919 reconfiguring transport - are mentioned. These are described 919 reconfiguring transport - are mentioned. These are described
920 further in <xref linkend="exrec"/>. 920 further in <xref linkend="exrec"/>.
921 </para> 921 </para>
922 922
923 <sect2 id="excatHSMviolation"> 923 <sect2 id="excatHSMviolation">
924 <title>HSM violation</title> 924 <title>HSM violation</title>
925 <para> 925 <para>
926 This error is indicated when STATUS value doesn't match HSM 926 This error is indicated when STATUS value doesn't match HSM
927 requirement during issuing or execution of any ATA/ATAPI command. 927 requirement during issuing or execution of any ATA/ATAPI command.
928 </para> 928 </para>
929 929
930 <itemizedlist> 930 <itemizedlist>
931 <title>Examples</title> 931 <title>Examples</title>
932 932
933 <listitem> 933 <listitem>
934 <para> 934 <para>
935 ATA_STATUS doesn't contain !BSY &amp;&amp; DRDY &amp;&amp; !DRQ while trying 935 ATA_STATUS doesn't contain !BSY &amp;&amp; DRDY &amp;&amp; !DRQ while trying
936 to issue a command. 936 to issue a command.
937 </para> 937 </para>
938 </listitem> 938 </listitem>
939 939
940 <listitem> 940 <listitem>
941 <para> 941 <para>
942 !BSY &amp;&amp; !DRQ during PIO data transfer. 942 !BSY &amp;&amp; !DRQ during PIO data transfer.
943 </para> 943 </para>
944 </listitem> 944 </listitem>
945 945
946 <listitem> 946 <listitem>
947 <para> 947 <para>
948 DRQ on command completion. 948 DRQ on command completion.
949 </para> 949 </para>
950 </listitem> 950 </listitem>
951 951
952 <listitem> 952 <listitem>
953 <para> 953 <para>
954 !BSY &amp;&amp; ERR after CDB transfer starts but before the 954 !BSY &amp;&amp; ERR after CDB transfer starts but before the
955 last byte of CDB is transferred. ATA/ATAPI standard states 955 last byte of CDB is transferred. ATA/ATAPI standard states
956 that &quot;The device shall not terminate the PACKET command 956 that &quot;The device shall not terminate the PACKET command
957 with an error before the last byte of the command packet has 957 with an error before the last byte of the command packet has
958 been written&quot; in the error outputs description of PACKET 958 been written&quot; in the error outputs description of PACKET
959 command and the state diagram doesn't include such 959 command and the state diagram doesn't include such
960 transitions. 960 transitions.
961 </para> 961 </para>
962 </listitem> 962 </listitem>
963 963
964 </itemizedlist> 964 </itemizedlist>
965 965
966 <para> 966 <para>
967 In these cases, HSM is violated and not much information 967 In these cases, HSM is violated and not much information
968 regarding the error can be acquired from STATUS or ERROR 968 regarding the error can be acquired from STATUS or ERROR
969 register. IOW, this error can be anything - driver bug, 969 register. IOW, this error can be anything - driver bug,
970 faulty device, controller and/or cable. 970 faulty device, controller and/or cable.
971 </para> 971 </para>
972 972
973 <para> 973 <para>
974 As HSM is violated, reset is necessary to restore known state. 974 As HSM is violated, reset is necessary to restore known state.
975 Reconfiguring transport for lower speed might be helpful too 975 Reconfiguring transport for lower speed might be helpful too
976 as transmission errors sometimes cause this kind of errors. 976 as transmission errors sometimes cause this kind of errors.
977 </para> 977 </para>
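As an illustration of the first example above (STATUS checked before issuing a command), here is a minimal C sketch. The bit masks are the standard ATA STATUS register bit positions; the helper name hsm_violation_on_issue() is hypothetical and is not a libata function.

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA STATUS register bits (standard bit positions). */
#define ST_BSY  0x80u  /* device is busy */
#define ST_DRDY 0x40u  /* device is ready to accept commands */
#define ST_DRQ  0x08u  /* device requests a data transfer */

/*
 * Hypothetical helper: returns true when STATUS does not satisfy
 * !BSY && DRDY && !DRQ, i.e. when issuing a command now would
 * violate the host state machine.
 */
static bool hsm_violation_on_issue(uint8_t status)
{
    bool idle = !(status & ST_BSY) &&
                (status & ST_DRDY) &&
                !(status & ST_DRQ);

    return !idle;
}
```

A caller that sees true here cannot learn much more from STATUS or ERROR, so it would typically proceed to reset as described above.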
978 </sect2> 978 </sect2>
979 979
980 <sect2 id="excatDevErr"> 980 <sect2 id="excatDevErr">
981 <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title> 981 <title>ATA/ATAPI device error (non-NCQ / non-CHECK CONDITION)</title>
982 982
983 <para> 983 <para>
984 These are errors detected and reported by ATA/ATAPI devices 984 These are errors detected and reported by ATA/ATAPI devices
985 indicating device problems. For this type of errors, STATUS 985 indicating device problems. For this type of errors, STATUS
986 and ERROR register values are valid and describe error 986 and ERROR register values are valid and describe error
987 condition. Note that some of ATA bus errors are detected by 987 condition. Note that some of ATA bus errors are detected by
988 ATA/ATAPI devices and reported using the same mechanism as 988 ATA/ATAPI devices and reported using the same mechanism as
989 device errors. Those cases are described later in this 989 device errors. Those cases are described later in this
990 section. 990 section.
991 </para> 991 </para>
992 992
993 <para> 993 <para>
994 For ATA commands, this type of errors are indicated by !BSY 994 For ATA commands, this type of errors are indicated by !BSY
995 &amp;&amp; ERR during command execution and on completion. 995 &amp;&amp; ERR during command execution and on completion.
996 </para> 996 </para>
997 997
998 <para>For ATAPI commands,</para> 998 <para>For ATAPI commands,</para>
999 999
1000 <itemizedlist> 1000 <itemizedlist>
1001 1001
1002 <listitem> 1002 <listitem>
1003 <para> 1003 <para>
1004 !BSY &amp;&amp; ERR &amp;&amp; ABRT right after issuing PACKET 1004 !BSY &amp;&amp; ERR &amp;&amp; ABRT right after issuing PACKET
1005 indicates that PACKET command is not supported and falls in 1005 indicates that PACKET command is not supported and falls in
1006 this category. 1006 this category.
1007 </para> 1007 </para>
1008 </listitem> 1008 </listitem>
1009 1009
1010 <listitem> 1010 <listitem>
1011 <para> 1011 <para>
1012 !BSY &amp;&amp; ERR(==CHK) &amp;&amp; !ABRT after the last 1012 !BSY &amp;&amp; ERR(==CHK) &amp;&amp; !ABRT after the last
1013 byte of CDB is transferred indicates CHECK CONDITION and 1013 byte of CDB is transferred indicates CHECK CONDITION and
1014 doesn't fall in this category. 1014 doesn't fall in this category.
1015 </para> 1015 </para>
1016 </listitem> 1016 </listitem>
1017 1017
1018 <listitem> 1018 <listitem>
1019 <para> 1019 <para>
1020 !BSY &amp;&amp; ERR(==CHK) &amp;&amp; ABRT after the last byte 1020 !BSY &amp;&amp; ERR(==CHK) &amp;&amp; ABRT after the last byte
1021 of CDB is transferred *probably* indicates CHECK CONDITION and 1021 of CDB is transferred *probably* indicates CHECK CONDITION and
1022 doesn't fall in this category. 1022 doesn't fall in this category.
1023 </para> 1023 </para>
1024 </listitem> 1024 </listitem>
1025 1025
1026 </itemizedlist> 1026 </itemizedlist>
1027 1027
1028 <para> 1028 <para>
1029 Of the errors detected as above, the following are not ATA/ATAPI 1029 Of the errors detected as above, the following are not ATA/ATAPI
1030 device errors but ATA bus errors and should be handled 1030 device errors but ATA bus errors and should be handled
1031 according to <xref linkend="excatATAbusErr"/>. 1031 according to <xref linkend="excatATAbusErr"/>.
1032 </para> 1032 </para>
1033 1033
1034 <variablelist> 1034 <variablelist>
1035 1035
1036 <varlistentry> 1036 <varlistentry>
1037 <term>CRC error during data transfer</term> 1037 <term>CRC error during data transfer</term>
1038 <listitem> 1038 <listitem>
1039 <para> 1039 <para>
1040 This is indicated by ICRC bit in the ERROR register and 1040 This is indicated by ICRC bit in the ERROR register and
1041 means that corruption occurred during data transfer. Up to 1041 means that corruption occurred during data transfer. Up to
1042 ATA/ATAPI-7, the standard specifies that this bit is only 1042 ATA/ATAPI-7, the standard specifies that this bit is only
1043 applicable to UDMA transfers but ATA/ATAPI-8 draft revision 1043 applicable to UDMA transfers but ATA/ATAPI-8 draft revision
1044 1f says that the bit may be applicable to multiword DMA and 1044 1f says that the bit may be applicable to multiword DMA and
1045 PIO. 1045 PIO.
1046 </para> 1046 </para>
1047 </listitem> 1047 </listitem>
1048 </varlistentry> 1048 </varlistentry>
1049 1049
1050 <varlistentry> 1050 <varlistentry>
1051 <term>ABRT error during data transfer or on completion</term> 1051 <term>ABRT error during data transfer or on completion</term>
1052 <listitem> 1052 <listitem>
1053 <para> 1053 <para>
1054 Up to ATA/ATAPI-7, the standard specifies that ABRT could be 1054 Up to ATA/ATAPI-7, the standard specifies that ABRT could be
1055 set on ICRC errors and on cases where a device is not able 1055 set on ICRC errors and on cases where a device is not able
1056 to complete a command. Combined with the fact that MWDMA 1056 to complete a command. Combined with the fact that MWDMA
1057 and PIO transfer errors aren't allowed to use ICRC bit up to 1057 and PIO transfer errors aren't allowed to use ICRC bit up to
1058 ATA/ATAPI-7, it seems to imply that ABRT bit alone could 1058 ATA/ATAPI-7, it seems to imply that ABRT bit alone could
1059 indicate transfer errors. 1059 indicate transfer errors.
1060 </para> 1060 </para>
1061 <para> 1061 <para>
1062 However, ATA/ATAPI-8 draft revision 1f removes the part 1062 However, ATA/ATAPI-8 draft revision 1f removes the part
1063 that ICRC errors can turn on ABRT. So, this is kind of 1063 that ICRC errors can turn on ABRT. So, this is kind of
1064 gray area. Some heuristics are needed here. 1064 gray area. Some heuristics are needed here.
1065 </para> 1065 </para>
1066 </listitem> 1066 </listitem>
1067 </varlistentry> 1067 </varlistentry>
1068 1068
1069 </variablelist> 1069 </variablelist>
1070 1070
1071 <para> 1071 <para>
1072 ATA/ATAPI device errors can be further categorized as follows. 1072 ATA/ATAPI device errors can be further categorized as follows.
1073 </para> 1073 </para>
1074 1074
1075 <variablelist> 1075 <variablelist>
1076 1076
1077 <varlistentry> 1077 <varlistentry>
1078 <term>Media errors</term> 1078 <term>Media errors</term>
1079 <listitem> 1079 <listitem>
1080 <para> 1080 <para>
1081 This is indicated by UNC bit in the ERROR register. ATA 1081 This is indicated by UNC bit in the ERROR register. ATA
1082 devices report a UNC error only after a certain number of 1082 devices report a UNC error only after a certain number of
1083 retries cannot recover the data, so there's nothing much 1083 retries cannot recover the data, so there's nothing much
1084 else to do other than notifying upper layer. 1084 else to do other than notifying upper layer.
1085 </para> 1085 </para>
1086 <para> 1086 <para>
1087 READ and WRITE commands report CHS or LBA of the first 1087 READ and WRITE commands report CHS or LBA of the first
1088 failed sector but ATA/ATAPI standard specifies that the 1088 failed sector but ATA/ATAPI standard specifies that the
1089 amount of transferred data on error completion is 1089 amount of transferred data on error completion is
1090 indeterminate, so we cannot assume that sectors preceding 1090 indeterminate, so we cannot assume that sectors preceding
1091 the failed sector have been transferred and thus cannot 1091 the failed sector have been transferred and thus cannot
1092 complete those sectors successfully as SCSI does. 1092 complete those sectors successfully as SCSI does.
1093 </para> 1093 </para>
1094 </listitem> 1094 </listitem>
1095 </varlistentry> 1095 </varlistentry>
1096 1096
1097 <varlistentry> 1097 <varlistentry>
1098 <term>Media changed / media change requested error</term> 1098 <term>Media changed / media change requested error</term>
1099 <listitem> 1099 <listitem>
1100 <para> 1100 <para>
1101 &lt;&lt;TODO: fill here&gt;&gt; 1101 &lt;&lt;TODO: fill here&gt;&gt;
1102 </para> 1102 </para>
1103 </listitem> 1103 </listitem>
1104 </varlistentry> 1104 </varlistentry>
1105 1105
1106 <varlistentry><term>Address error</term> 1106 <varlistentry><term>Address error</term>
1107 <listitem> 1107 <listitem>
1108 <para> 1108 <para>
1109 This is indicated by IDNF bit in the ERROR register. 1109 This is indicated by IDNF bit in the ERROR register.
1110 Report to upper layer. 1110 Report to upper layer.
1111 </para> 1111 </para>
1112 </listitem> 1112 </listitem>
1113 </varlistentry> 1113 </varlistentry>
1114 1114
1115 <varlistentry><term>Other errors</term> 1115 <varlistentry><term>Other errors</term>
1116 <listitem> 1116 <listitem>
1117 <para> 1117 <para>
1118 This can be invalid command or parameter indicated by ABRT 1118 This can be invalid command or parameter indicated by ABRT
1119 ERROR bit or some other error condition. Note that ABRT 1119 ERROR bit or some other error condition. Note that ABRT
1120 bit can indicate a lot of things including ICRC and Address 1120 bit can indicate a lot of things including ICRC and Address
1121 errors. Heuristics needed. 1121 errors. Heuristics needed.
1122 </para> 1122 </para>
1123 </listitem> 1123 </listitem>
1124 </varlistentry> 1124 </varlistentry>
1125 1125
1126 </variablelist> 1126 </variablelist>
1127 1127
1128 <para> 1128 <para>
1129 Depending on commands, not all STATUS/ERROR bits are 1129 Depending on commands, not all STATUS/ERROR bits are
1130 applicable. These non-applicable bits are marked with 1130 applicable. These non-applicable bits are marked with
1131 &quot;na&quot; in the output descriptions but up to ATA/ATAPI-7 1131 &quot;na&quot; in the output descriptions but up to ATA/ATAPI-7
1132 no definition of &quot;na&quot; can be found. However, 1132 no definition of &quot;na&quot; can be found. However,
1133 ATA/ATAPI-8 draft revision 1f describes &quot;N/A&quot; as 1133 ATA/ATAPI-8 draft revision 1f describes &quot;N/A&quot; as
1134 follows. 1134 follows.
1135 </para> 1135 </para>
1136 1136
1137 <blockquote> 1137 <blockquote>
1138 <variablelist> 1138 <variablelist>
1139 <varlistentry><term>3.2.3.3a N/A</term> 1139 <varlistentry><term>3.2.3.3a N/A</term>
1140 <listitem> 1140 <listitem>
1141 <para> 1141 <para>
1142 A keyword the indicates a field has no defined value in 1142 A keyword the indicates a field has no defined value in
1143 this standard and should not be checked by the host or 1143 this standard and should not be checked by the host or
1144 device. N/A fields should be cleared to zero. 1144 device. N/A fields should be cleared to zero.
1145 </para> 1145 </para>
1146 </listitem> 1146 </listitem>
1147 </varlistentry> 1147 </varlistentry>
1148 </variablelist> 1148 </variablelist>
1149 </blockquote> 1149 </blockquote>
1150 1150
1151 <para> 1151 <para>
1152 So, it seems reasonable to assume that &quot;na&quot; bits are 1152 So, it seems reasonable to assume that &quot;na&quot; bits are
1153 cleared to zero by devices and thus need no explicit masking. 1153 cleared to zero by devices and thus need no explicit masking.
1154 </para> 1154 </para>
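The categorization in this section can be sketched as a small C classifier over the ERROR register. The bit masks are the standard ATA ERROR register positions. The enum, the function name, and the order in which bits are tested are assumptions made for illustration only. ABRT is tested last because, as noted above, it can accompany ICRC and address errors and needs heuristics of its own.

```c
#include <stdint.h>

/* ATA ERROR register bits (standard bit positions). */
#define ER_ICRC 0x80u  /* interface CRC error */
#define ER_UNC  0x40u  /* uncorrectable media error */
#define ER_IDNF 0x10u  /* ID / address not found */
#define ER_ABRT 0x04u  /* command aborted */

/* Hypothetical categories mirroring the text above. */
enum err_class {
    ERR_CLASS_BUS,      /* treat as ATA bus error */
    ERR_CLASS_MEDIA,    /* report to upper layer */
    ERR_CLASS_ADDRESS,  /* report to upper layer */
    ERR_CLASS_ABORT,    /* ambiguous, heuristics needed */
    ERR_CLASS_OTHER
};

/*
 * Hypothetical classifier. "na" bits are assumed to be cleared
 * by the device, so no per-command masking is applied.
 */
static enum err_class classify_error(uint8_t error)
{
    if (error & ER_ICRC)
        return ERR_CLASS_BUS;
    if (error & ER_UNC)
        return ERR_CLASS_MEDIA;
    if (error & ER_IDNF)
        return ERR_CLASS_ADDRESS;
    if (error & ER_ABRT)
        return ERR_CLASS_ABORT;
    return ERR_CLASS_OTHER;
}
```

A real error handler would likely combine the ERR_CLASS_ABORT result with history (for example, repeated aborts of a command known to be supported) before deciding whether a bus error is the more plausible cause.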
1155 1155
1156 </sect2> 1156 </sect2>
1157 1157
1158 <sect2 id="excatATAPIcc"> 1158 <sect2 id="excatATAPIcc">
1159 <title>ATAPI device CHECK CONDITION</title> 1159 <title>ATAPI device CHECK CONDITION</title>
1160 1160
1161 <para> 1161 <para>
1162 ATAPI device CHECK CONDITION error is indicated by set CHK bit 1162 ATAPI device CHECK CONDITION error is indicated by set CHK bit
1163 (ERR bit) in the STATUS register after the last byte of CDB is 1163 (ERR bit) in the STATUS register after the last byte of CDB is
1164 transferred for a PACKET command. For this kind of errors, 1164 transferred for a PACKET command. For this kind of errors,
1165 sense data should be acquired to gather information regarding 1165 sense data should be acquired to gather information regarding
1166 the errors. REQUEST SENSE packet command should be used to 1166 the errors. REQUEST SENSE packet command should be used to
1167 acquire sense data. 1167 acquire sense data.
1168 </para> 1168 </para>
1169 1169
1170 <para> 1170 <para>
1171 Once sense data is acquired, this type of errors can be 1171 Once sense data is acquired, this type of errors can be
1172 handled similarly to other SCSI errors. Note that sense data 1172 handled similarly to other SCSI errors. Note that sense data
1173 may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR 1173 may indicate ATA bus error (e.g. Sense Key 04h HARDWARE ERROR
1174 &amp;&amp; ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such 1174 &amp;&amp; ASC/ASCQ 47h/00h SCSI PARITY ERROR). In such
1175 cases, the error should be considered as an ATA bus error and 1175 cases, the error should be considered as an ATA bus error and
1176 handled according to <xref linkend="excatATAbusErr"/>. 1176 handled according to <xref linkend="excatATAbusErr"/>.
1177 </para> 1177 </para>
1178 1178
1179 </sect2> 1179 </sect2>
1180 1180
1181 <sect2 id="excatNCQerr"> 1181 <sect2 id="excatNCQerr">
1182 <title>ATA device error (NCQ)</title> 1182 <title>ATA device error (NCQ)</title>
1183 1183
1184 <para> 1184 <para>
1185 NCQ command error is indicated by cleared BSY and set ERR bit 1185 NCQ command error is indicated by cleared BSY and set ERR bit
1186 during NCQ command phase (one or more NCQ commands 1186 during NCQ command phase (one or more NCQ commands
1187 outstanding). Although STATUS and ERROR registers will 1187 outstanding). Although STATUS and ERROR registers will
1188 contain valid values describing the error, READ LOG EXT is 1188 contain valid values describing the error, READ LOG EXT is
1189 required to clear the error condition, determine which command 1189 required to clear the error condition, determine which command
1190 has failed and acquire more information. 1190 has failed and acquire more information.
1191 </para> 1191 </para>
1192 1192
1193 <para> 1193 <para>
1194 READ LOG EXT Log Page 10h reports which tag has failed and 1194 READ LOG EXT Log Page 10h reports which tag has failed and
1195 taskfile register values describing the error. With this 1195 taskfile register values describing the error. With this
1196 information the failed command can be handled as a normal ATA 1196 information the failed command can be handled as a normal ATA
1197 command error as in <xref linkend="excatDevErr"/> and all 1197 command error as in <xref linkend="excatDevErr"/> and all
1198 other in-flight commands must be retried. Note that this 1198 other in-flight commands must be retried. Note that this
1199 retry should not be counted - it's likely that commands 1199 retry should not be counted - it's likely that commands
1200 retried this way would have completed normally if it were not 1200 retried this way would have completed normally if it were not
1201 for the failed command. 1201 for the failed command.
1202 </para> 1202 </para>
1203 1203
1204 <para> 1204 <para>
1205 Note that ATA bus errors can be reported as ATA device NCQ 1205 Note that ATA bus errors can be reported as ATA device NCQ
1206 errors. This should be handled as described in <xref 1206 errors. This should be handled as described in <xref
1207 linkend="excatATAbusErr"/>. 1207 linkend="excatATAbusErr"/>.
1208 </para> 1208 </para>
1209 1209
1210 <para> 1210 <para>
1211 If READ LOG EXT Log Page 10h fails or reports NQ, we're 1211 If READ LOG EXT Log Page 10h fails or reports NQ, we're
1212 thoroughly screwed. This condition should be treated 1212 thoroughly screwed. This condition should be treated
1213 according to <xref linkend="excatHSMviolation"/>. 1213 according to <xref linkend="excatHSMviolation"/>.
1214 </para> 1214 </para>
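A rough C sketch of interpreting the 512-byte log page 10h follows. It assumes the commonly documented layout in which byte 0 carries the NQ flag (bit 7) and the failed tag (bits 4:0), bytes 2 and 3 carry the STATUS and ERROR values, and the bytes of the page sum to zero modulo 256. The struct and function names are hypothetical, and rejecting a page with a bad checksum is a choice made for this sketch rather than required behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_10H_SIZE 512u

/* Hypothetical container for the fields this sketch extracts. */
struct ncq_err_info {
    uint8_t tag;     /* tag of the failed NCQ command */
    uint8_t status;  /* STATUS value describing the error */
    uint8_t error;   /* ERROR value describing the error */
};

/*
 * Returns true and fills *out when the page names a failed NCQ tag.
 * Returns false when the checksum is bad or NQ is set; per the text
 * above, the caller should then treat the situation as an HSM violation.
 */
static bool parse_log_10h(const uint8_t *buf, struct ncq_err_info *out)
{
    uint8_t sum = 0;

    for (size_t i = 0; i < LOG_10H_SIZE; i++)
        sum = (uint8_t)(sum + buf[i]);
    if (sum != 0)
        return false;

    if (buf[0] & 0x80)  /* NQ: the error is not for a queued command */
        return false;

    out->tag = buf[0] & 0x1f;
    out->status = buf[2];
    out->error = buf[3];
    return true;
}
```

With the tag known, the failed command can be completed using the reported STATUS and ERROR values, and the remaining in-flight commands retried without counting the retry against them.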
1215 1215
1216 </sect2> 1216 </sect2>
1217 1217
1218 <sect2 id="excatATAbusErr"> 1218 <sect2 id="excatATAbusErr">
1219 <title>ATA bus error</title> 1219 <title>ATA bus error</title>
1220 1220
1221 <para> 1221 <para>
1222 ATA bus error means that data corruption occurred during 1222 ATA bus error means that data corruption occurred during
1223 transmission over ATA bus (SATA or PATA). This type of errors 1223 transmission over ATA bus (SATA or PATA). This type of errors
1224 can be indicated by 1224 can be indicated by
1225 </para> 1225 </para>
1226 1226
1227 <itemizedlist> 1227 <itemizedlist>
1228 1228
1229 <listitem> 1229 <listitem>
1230 <para> 1230 <para>
1231 ICRC or ABRT error as described in <xref linkend="excatDevErr"/>. 1231 ICRC or ABRT error as described in <xref linkend="excatDevErr"/>.
1232 </para> 1232 </para>
1233 </listitem> 1233 </listitem>
1234 1234
1235 <listitem> 1235 <listitem>
1236 <para> 1236 <para>
1237 Controller-specific error completion with error information 1237 Controller-specific error completion with error information
1238 indicating transmission error. 1238 indicating transmission error.
1239 </para> 1239 </para>
1240 </listitem> 1240 </listitem>
1241 1241
1242 <listitem> 1242 <listitem>
1243 <para> 1243 <para>
1244 On some controllers, command timeout. In this case, there may 1244 On some controllers, command timeout. In this case, there may
1245 be a mechanism to determine that the timeout is due to 1245 be a mechanism to determine that the timeout is due to
1246 transmission error. 1246 transmission error.
1247 </para> 1247 </para>
1248 </listitem> 1248 </listitem>
1249 1249
1250 <listitem> 1250 <listitem>
1251 <para> 1251 <para>
1252 Unknown/random errors, timeouts and all sorts of weirdities. 1252 Unknown/random errors, timeouts and all sorts of weirdities.
1253 </para> 1253 </para>
1254 </listitem> 1254 </listitem>
1255 1255
1256 </itemizedlist> 1256 </itemizedlist>
1257 1257
1258 <para> 1258 <para>
1259 As described above, transmission errors can cause wide variety 1259 As described above, transmission errors can cause wide variety
1260 of symptoms ranging from device ICRC error to random device 1260 of symptoms ranging from device ICRC error to random device
1261 lockup, and, for many cases, there is no way to tell if an 1261 lockup, and, for many cases, there is no way to tell if an
1262 error condition is due to transmission error or not; 1262 error condition is due to transmission error or not;
1263 therefore, it's necessary to employ some kind of heuristic 1263 therefore, it's necessary to employ some kind of heuristic
1264 when dealing with errors and timeouts. For example, 1264 when dealing with errors and timeouts. For example,
1265 encountering repetitive ABRT errors for known supported 1265 encountering repetitive ABRT errors for known supported
1266 command is likely to indicate ATA bus error. 1266 command is likely to indicate ATA bus error.
1267 </para> 1267 </para>
1268 1268
1269 <para> 1269 <para>
1270 Once it's determined that ATA bus errors have possibly 1270 Once it's determined that ATA bus errors have possibly
1271 occurred, lowering ATA bus transmission speed is one of 1271 occurred, lowering ATA bus transmission speed is one of
1272 actions which may alleviate the problem. See <xref 1272 actions which may alleviate the problem. See <xref
1273 linkend="exrecReconf"/> for more information. 1273 linkend="exrecReconf"/> for more information.
1274 </para> 1274 </para>
1275 1275
1276 </sect2> 1276 </sect2>
1277 1277
1278 <sect2 id="excatPCIbusErr"> 1278 <sect2 id="excatPCIbusErr">
1279 <title>PCI bus error</title> 1279 <title>PCI bus error</title>
1280 1280
1281 <para> 1281 <para>
1282 Data corruption or other failures during transmission over PCI 1282 Data corruption or other failures during transmission over PCI
1283 (or other system bus). For standard BMDMA, this is indicated 1283 (or other system bus). For standard BMDMA, this is indicated
1284 by Error bit in the BMDMA Status register. This type of 1284 by Error bit in the BMDMA Status register. This type of
1285 errors must be logged as it indicates something is very wrong 1285 errors must be logged as it indicates something is very wrong
1286 with the system. Resetting host controller is recommended. 1286 with the system. Resetting host controller is recommended.
1287 </para> 1287 </para>
1288 1288
1289 </sect2> 1289 </sect2>
1290 1290
1291 <sect2 id="excatLateCompletion"> 1291 <sect2 id="excatLateCompletion">
1292 <title>Late completion</title> 1292 <title>Late completion</title>
1293 1293
1294 <para> 1294 <para>
1295 This occurs when timeout occurs and the timeout handler finds 1295 This occurs when timeout occurs and the timeout handler finds
1296 out that the timed out command has completed successfully or 1296 out that the timed out command has completed successfully or
1297 with error. This is usually caused by lost interrupts. This 1297 with error. This is usually caused by lost interrupts. This
1298 type of errors must be logged. Resetting host controller is 1298 type of errors must be logged. Resetting host controller is
1299 recommended. 1299 recommended.
1300 </para> 1300 </para>
1301 1301
1302 </sect2> 1302 </sect2>
1303 1303
1304 <sect2 id="excatUnknown"> 1304 <sect2 id="excatUnknown">
1305 <title>Unknown error (timeout)</title> 1305 <title>Unknown error (timeout)</title>
1306 1306
1307 <para> 1307 <para>
1308 This is when timeout occurs and the command is still 1308 This is when timeout occurs and the command is still
1309 processing or the host and device are in unknown state. When 1309 processing or the host and device are in unknown state. When
1310 this occurs, HSM could be in any valid or invalid state. To 1310 this occurs, HSM could be in any valid or invalid state. To
1311 bring the device to known state and make it forget about the 1311 bring the device to known state and make it forget about the
1312 timed out command, resetting is necessary. The timed out 1312 timed out command, resetting is necessary. The timed out
1313 command may be retried. 1313 command may be retried.
1314 </para> 1314 </para>
1315 1315
1316 <para> 1316 <para>
1317 Timeouts can also be caused by transmission errors. Refer to 1317 Timeouts can also be caused by transmission errors. Refer to
1318 <xref linkend="excatATAbusErr"/> for more details. 1318 <xref linkend="excatATAbusErr"/> for more details.
1319 </para> 1319 </para>
1320 1320
1321 </sect2> 1321 </sect2>
1322 1322
1323 <sect2 id="excatHoplugPM"> 1323 <sect2 id="excatHoplugPM">
1324 <title>Hotplug and power management exceptions</title> 1324 <title>Hotplug and power management exceptions</title>
1325 1325
1326 <para> 1326 <para>
1327 &lt;&lt;TODO: fill here&gt;&gt; 1327 &lt;&lt;TODO: fill here&gt;&gt;
1328 </para> 1328 </para>
1329 1329
1330 </sect2> 1330 </sect2>
1331 1331
1332 </sect1> 1332 </sect1>
1333 1333
1334 <sect1 id="exrec"> 1334 <sect1 id="exrec">
1335 <title>EH recovery actions</title> 1335 <title>EH recovery actions</title>
1336 1336
1337 <para> 1337 <para>
1338 This section discusses several important recovery actions. 1338 This section discusses several important recovery actions.
1339 </para> 1339 </para>
1340 1340
1341 <sect2 id="exrecClr"> 1341 <sect2 id="exrecClr">
1342 <title>Clearing error condition</title> 1342 <title>Clearing error condition</title>
1343 1343
1344 <para> 1344 <para>
1345 Many controllers require their error registers to be cleared by 1345 Many controllers require their error registers to be cleared by
1346 error handler. Different controllers may have different 1346 error handler. Different controllers may have different
1347 requirements. 1347 requirements.
1348 </para> 1348 </para>
1349 1349
1350 <para> 1350 <para>
1351 For SATA, it's strongly recommended to clear at least SError 1351 For SATA, it's strongly recommended to clear at least SError
1352 register during error handling. 1352 register during error handling.
1353 </para> 1353 </para>
1354 </sect2> 1354 </sect2>
1355 1355
1356 <sect2 id="exrecRst"> 1356 <sect2 id="exrecRst">
1357 <title>Reset</title> 1357 <title>Reset</title>
1358 1358
1359 <para> 1359 <para>
1360 During EH, resetting is necessary in the following cases. 1360 During EH, resetting is necessary in the following cases.
1361 </para> 1361 </para>
1362 1362
1363 <itemizedlist> 1363 <itemizedlist>
1364 1364
1365 <listitem> 1365 <listitem>
1366 <para> 1366 <para>
1367 HSM is in unknown or invalid state 1367 HSM is in unknown or invalid state
1368 </para> 1368 </para>
1369 </listitem> 1369 </listitem>
1370 1370
1371 <listitem> 1371 <listitem>
1372 <para> 1372 <para>
1373 HBA is in unknown or invalid state 1373 HBA is in unknown or invalid state
1374 </para> 1374 </para>
1375 </listitem> 1375 </listitem>
1376 1376
1377 <listitem> 1377 <listitem>
1378 <para> 1378 <para>
1379 EH needs to make HBA/device forget about in-flight commands 1379 EH needs to make HBA/device forget about in-flight commands
1380 </para> 1380 </para>
1381 </listitem> 1381 </listitem>
1382 1382
1383 <listitem> 1383 <listitem>
1384 <para> 1384 <para>
1385 HBA/device behaves weirdly 1385 HBA/device behaves weirdly
1386 </para> 1386 </para>
1387 </listitem> 1387 </listitem>
1388 1388
1389 </itemizedlist> 1389 </itemizedlist>
1390 1390
1391 <para> 1391 <para>
1392 Resetting during EH might be a good idea regardless of error 1392 Resetting during EH might be a good idea regardless of error
1393 condition to improve EH robustness. Whether to reset both or 1393 condition to improve EH robustness. Whether to reset both or
1394 either one of HBA and device depends on situation but the 1394 either one of HBA and device depends on situation but the
1395 following scheme is recommended. 1395 following scheme is recommended.
1396 </para> 1396 </para>
1397 1397
1398 <itemizedlist> 1398 <itemizedlist>
1399 1399
1400 <listitem> 1400 <listitem>
1401 <para> 1401 <para>
1402 When it's known that HBA is in ready state but ATA/ATAPI 1402 When it's known that HBA is in ready state but ATA/ATAPI
1403 device in in unknown state, reset only device. 1403 device is in unknown state, reset only device.
1404 </para> 1404 </para>
1405 </listitem> 1405 </listitem>
1406 1406
1407 <listitem> 1407 <listitem>
1408 <para> 1408 <para>
1409 If HBA is in unknown state, reset both HBA and device. 1409 If HBA is in unknown state, reset both HBA and device.
1410 </para> 1410 </para>
1411 </listitem> 1411 </listitem>
1412 1412
1413 </itemizedlist> 1413 </itemizedlist>
1414 1414
1415 <para> 1415 <para>
1416 HBA resetting is implementation specific. For a controller 1416 HBA resetting is implementation specific. For a controller
1417 complying to taskfile/BMDMA PCI IDE, stopping active DMA 1417 complying to taskfile/BMDMA PCI IDE, stopping active DMA
1418 transaction may be sufficient iff BMDMA state is the only HBA 1418 transaction may be sufficient iff BMDMA state is the only HBA
1419 context. But even mostly taskfile/BMDMA PCI IDE complying 1419 context. But even mostly taskfile/BMDMA PCI IDE complying
1420 controllers may have implementation specific requirements and 1420 controllers may have implementation specific requirements and
1421 mechanism to reset themselves. This must be addressed by 1421 mechanism to reset themselves. This must be addressed by
1422 specific drivers. 1422 specific drivers.
1423 </para> 1423 </para>
1424 1424
1425 <para> 1425 <para>
1426 OTOH, ATA/ATAPI standard describes in detail ways to reset 1426 OTOH, ATA/ATAPI standard describes in detail ways to reset
1427 ATA/ATAPI devices. 1427 ATA/ATAPI devices.
1428 </para> 1428 </para>
1429 1429
1430 <variablelist> 1430 <variablelist>
1431 1431
1432 <varlistentry><term>PATA hardware reset</term> 1432 <varlistentry><term>PATA hardware reset</term>
1433 <listitem> 1433 <listitem>
1434 <para> 1434 <para>
1435 This is hardware initiated device reset signalled with 1435 This is hardware initiated device reset signalled with
1436 asserted PATA RESET- signal. There is no standard way to 1436 asserted PATA RESET- signal. There is no standard way to
1437 initiate hardware reset from software although some 1437 initiate hardware reset from software although some
1438 hardware provides registers that allow driver to directly 1438 hardware provides registers that allow driver to directly
1439 tweak the RESET- signal. 1439 tweak the RESET- signal.
1440 </para> 1440 </para>
1441 </listitem> 1441 </listitem>
1442 </varlistentry> 1442 </varlistentry>
1443 1443
1444 <varlistentry><term>Software reset</term> 1444 <varlistentry><term>Software reset</term>
1445 <listitem> 1445 <listitem>
1446 <para> 1446 <para>
1447 This is achieved by turning CONTROL SRST bit on for at 1447 This is achieved by turning CONTROL SRST bit on for at
1448 least 5us. Both PATA and SATA support it but, in case of 1448 least 5us. Both PATA and SATA support it but, in case of
1449 SATA, this may require controller-specific support as the 1449 SATA, this may require controller-specific support as the
1450 second Register FIS to clear SRST should be transmitted 1450 second Register FIS to clear SRST should be transmitted
1451 while BSY bit is still set. Note that on PATA, this resets 1451 while BSY bit is still set. Note that on PATA, this resets
1452 both master and slave devices on a channel. 1452 both master and slave devices on a channel.
1453 </para> 1453 </para>
1454 </listitem> 1454 </listitem>
1455 </varlistentry> 1455 </varlistentry>
1456 1456
1457 <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term> 1457 <varlistentry><term>EXECUTE DEVICE DIAGNOSTIC command</term>
1458 <listitem> 1458 <listitem>
1459 <para> 1459 <para>
1460 Although ATA/ATAPI standard doesn't describe exactly, EDD 1460 Although ATA/ATAPI standard doesn't describe exactly, EDD
1461 implies some level of resetting, possibly similar level 1461 implies some level of resetting, possibly similar level
1462 with software reset. Host-side EDD protocol can be handled 1462 with software reset. Host-side EDD protocol can be handled
1463 with normal command processing and most SATA controllers 1463 with normal command processing and most SATA controllers
1464 should be able to handle EDD's just like other commands. 1464 should be able to handle EDD's just like other commands.
1465 As in software reset, EDD affects both devices on a PATA 1465 As in software reset, EDD affects both devices on a PATA
1466 bus. 1466 bus.
1467 </para> 1467 </para>
1468 <para> 1468 <para>
1469 Although EDD does reset devices, this doesn't suit error 1469 Although EDD does reset devices, this doesn't suit error
1470 handling as EDD cannot be issued while BSY is set and it's 1470 handling as EDD cannot be issued while BSY is set and it's
1471 unclear how it will act when device is in unknown/weird 1471 unclear how it will act when device is in unknown/weird
1472 state. 1472 state.
1473 </para> 1473 </para>
1474 </listitem> 1474 </listitem>
1475 </varlistentry> 1475 </varlistentry>
1476 1476
1477 <varlistentry><term>ATAPI DEVICE RESET command</term> 1477 <varlistentry><term>ATAPI DEVICE RESET command</term>
1478 <listitem> 1478 <listitem>
1479 <para> 1479 <para>
1480 This is very similar to software reset except that reset 1480 This is very similar to software reset except that reset
1481 can be restricted to the selected device without affecting 1481 can be restricted to the selected device without affecting
1482 the other device sharing the cable. 1482 the other device sharing the cable.
1483 </para> 1483 </para>
1484 </listitem> 1484 </listitem>
1485 </varlistentry> 1485 </varlistentry>
1486 1486
1487 <varlistentry><term>SATA phy reset</term> 1487 <varlistentry><term>SATA phy reset</term>
1488 <listitem> 1488 <listitem>
1489 <para> 1489 <para>
1490 This is the preferred way of resetting a SATA device. In 1490 This is the preferred way of resetting a SATA device. In
1491 effect, it's identical to PATA hardware reset. Note that 1491 effect, it's identical to PATA hardware reset. Note that
1492 this can be done with the standard SCR Control register. 1492 this can be done with the standard SCR Control register.
1493 As such, it's usually easier to implement than software 1493 As such, it's usually easier to implement than software
1494 reset. 1494 reset.
1495 </para> 1495 </para>
1496 </listitem> 1496 </listitem>
1497 </varlistentry> 1497 </varlistentry>
1498 1498
1499 </variablelist> 1499 </variablelist>
1500 1500
1501 <para> 1501 <para>
1502 One more thing to consider when resetting devices is that 1502 One more thing to consider when resetting devices is that
1503 resetting clears certain configuration parameters and they 1503 resetting clears certain configuration parameters and they
1504 need to be set to their previous or newly adjusted values 1504 need to be set to their previous or newly adjusted values
1505 after reset. 1505 after reset.
1506 </para> 1506 </para>
1507 1507
1508 <para> 1508 <para>
1509 Parameters affected are. 1509 Parameters affected are.
1510 </para> 1510 </para>
1511 1511
1512 <itemizedlist> 1512 <itemizedlist>
1513 1513
1514 <listitem> 1514 <listitem>
1515 <para> 1515 <para>
1516 CHS set up with INITIALIZE DEVICE PARAMETERS (seldomly used) 1516 CHS set up with INITIALIZE DEVICE PARAMETERS (seldomly used)
1517 </para> 1517 </para>
1518 </listitem> 1518 </listitem>
1519 1519
1520 <listitem> 1520 <listitem>
1521 <para> 1521 <para>
1522 Parameters set with SET FEATURES including transfer mode setting 1522 Parameters set with SET FEATURES including transfer mode setting
1523 </para> 1523 </para>
1524 </listitem> 1524 </listitem>
1525 1525
1526 <listitem> 1526 <listitem>
1527 <para> 1527 <para>
1528 Block count set with SET MULTIPLE MODE 1528 Block count set with SET MULTIPLE MODE
1529 </para> 1529 </para>
1530 </listitem> 1530 </listitem>
1531 1531
1532 <listitem> 1532 <listitem>
1533 <para> 1533 <para>
1534 Other parameters (SET MAX, MEDIA LOCK...) 1534 Other parameters (SET MAX, MEDIA LOCK...)
1535 </para> 1535 </para>
1536 </listitem> 1536 </listitem>
1537 1537
1538 </itemizedlist> 1538 </itemizedlist>
1539 1539
1540 <para> 1540 <para>
1541 ATA/ATAPI standard specifies that some parameters must be 1541 ATA/ATAPI standard specifies that some parameters must be
1542 maintained across hardware or software reset, but doesn't 1542 maintained across hardware or software reset, but doesn't
1543 strictly specify all of them. Always reconfiguring needed 1543 strictly specify all of them. Always reconfiguring needed
1544 parameters after reset is required for robustness. Note that 1544 parameters after reset is required for robustness. Note that
1545 this also applies when resuming from deep sleep (power-off). 1545 this also applies when resuming from deep sleep (power-off).
1546 </para> 1546 </para>
1547 1547
1548 <para> 1548 <para>
1549 Also, ATA/ATAPI standard requires that IDENTIFY DEVICE / 1549 Also, ATA/ATAPI standard requires that IDENTIFY DEVICE /
1550 IDENTIFY PACKET DEVICE is issued after any configuration 1550 IDENTIFY PACKET DEVICE is issued after any configuration
1551 parameter is updated or a hardware reset and the result used 1551 parameter is updated or a hardware reset and the result used
1552 for further operation. OS driver is required to implement 1552 for further operation. OS driver is required to implement
1553 revalidation mechanism to support this. 1553 revalidation mechanism to support this.
1554 </para> 1554 </para>
1555 1555
1556 </sect2> 1556 </sect2>
1557 1557
1558 <sect2 id="exrecReconf"> 1558 <sect2 id="exrecReconf">
1559 <title>Reconfigure transport</title> 1559 <title>Reconfigure transport</title>
1560 1560
1561 <para> 1561 <para>
1562 For both PATA and SATA, a lot of corners are cut for cheap 1562 For both PATA and SATA, a lot of corners are cut for cheap
1563 connectors, cables or controllers and it's quite common to see 1563 connectors, cables or controllers and it's quite common to see
1564 high transmission error rate. This can be mitigated by 1564 high transmission error rate. This can be mitigated by
1565 lowering transmission speed. 1565 lowering transmission speed.
1566 </para> 1566 </para>
1567 1567
1568 <para> 1568 <para>
1569 The following is a possible scheme Jeff Garzik suggested. 1569 The following is a possible scheme Jeff Garzik suggested.
1570 </para> 1570 </para>
1571 1571
1572 <blockquote> 1572 <blockquote>
1573 <para> 1573 <para>
1574 If more than $N (3?) transmission errors happen in 15 minutes, 1574 If more than $N (3?) transmission errors happen in 15 minutes,
1575 </para> 1575 </para>
1576 <itemizedlist> 1576 <itemizedlist>
1577 <listitem> 1577 <listitem>
1578 <para> 1578 <para>
1579 if SATA, decrease SATA PHY speed. if speed cannot be decreased, 1579 if SATA, decrease SATA PHY speed. if speed cannot be decreased,
1580 </para> 1580 </para>
1581 </listitem> 1581 </listitem>
1582 <listitem> 1582 <listitem>
1583 <para> 1583 <para>
1584 decrease UDMA xfer speed. if at UDMA0, switch to PIO4, 1584 decrease UDMA xfer speed. if at UDMA0, switch to PIO4,
1585 </para> 1585 </para>
1586 </listitem> 1586 </listitem>
1587 <listitem> 1587 <listitem>
1588 <para> 1588 <para>
1589 decrease PIO xfer speed. if at PIO3, complain, but continue 1589 decrease PIO xfer speed. if at PIO3, complain, but continue
1590 </para> 1590 </para>
1591 </listitem> 1591 </listitem>
1592 </itemizedlist> 1592 </itemizedlist>
1593 </blockquote> 1593 </blockquote>
1594 1594
1595 </sect2> 1595 </sect2>
1596 1596
1597 </sect1> 1597 </sect1>
1598 1598
1599 </chapter> 1599 </chapter>
1600 1600
1601 <chapter id="PiixInt"> 1601 <chapter id="PiixInt">
1602 <title>ata_piix Internals</title> 1602 <title>ata_piix Internals</title>
1603 !Idrivers/ata/ata_piix.c 1603 !Idrivers/ata/ata_piix.c
1604 </chapter> 1604 </chapter>
1605 1605
1606 <chapter id="SILInt"> 1606 <chapter id="SILInt">
1607 <title>sata_sil Internals</title> 1607 <title>sata_sil Internals</title>
1608 !Idrivers/ata/sata_sil.c 1608 !Idrivers/ata/sata_sil.c
1609 </chapter> 1609 </chapter>
1610 1610
1611 <chapter id="libataThanks"> 1611 <chapter id="libataThanks">
1612 <title>Thanks</title> 1612 <title>Thanks</title>
1613 <para> 1613 <para>
1614 The bulk of the ATA knowledge comes thanks to long conversations with 1614 The bulk of the ATA knowledge comes thanks to long conversations with
1615 Andre Hedrick (www.linux-ide.org), and long hours pondering the ATA 1615 Andre Hedrick (www.linux-ide.org), and long hours pondering the ATA
1616 and SCSI specifications. 1616 and SCSI specifications.
1617 </para> 1617 </para>
1618 <para> 1618 <para>
1619 Thanks to Alan Cox for pointing out similarities 1619 Thanks to Alan Cox for pointing out similarities
1620 between SATA and SCSI, and in general for motivation to hack on 1620 between SATA and SCSI, and in general for motivation to hack on
1621 libata. 1621 libata.
1622 </para> 1622 </para>
1623 <para> 1623 <para>
1624 libata's device detection 1624 libata's device detection
1625 method, ata_pio_devchk, and in general all the early probing was 1625 method, ata_pio_devchk, and in general all the early probing was
1626 based on extensive study of Hale Landis's probe/reset code in his 1626 based on extensive study of Hale Landis's probe/reset code in his
1627 ATADRVR driver (www.ata-atapi.com). 1627 ATADRVR driver (www.ata-atapi.com).
1628 </para> 1628 </para>
1629 </chapter> 1629 </chapter>
1630 1630
1631 </book> 1631 </book>
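Editorial sketch (not part of the commit): the recommended reset scheme in the "Reset" section of the libata text above can be written as a small decision function. All names here (reset_scope, choose_reset_scope, the two flags) are hypothetical and do not come from libata. The RESET_NONE case is an assumption, since the text also notes that resetting regardless of the error condition may improve EH robustness.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not libata API. */
enum reset_scope {
	RESET_NONE,		/* assumption: nothing is known to be wrong */
	RESET_DEVICE_ONLY,	/* HBA is ready, ATA/ATAPI device state unknown */
	RESET_HBA_AND_DEVICE,	/* HBA state unknown */
};

/*
 * Pick the reset scope following the two rules quoted in the text:
 *  - HBA known ready, device in unknown state -> reset only the device
 *  - HBA in unknown state                     -> reset both HBA and device
 */
static enum reset_scope choose_reset_scope(bool hba_state_known,
					   bool dev_state_known)
{
	if (!hba_state_known)
		return RESET_HBA_AND_DEVICE;
	if (!dev_state_known)
		return RESET_DEVICE_ONLY;
	return RESET_NONE;
}
```

A driver following the "reset anyway" advice could simply treat RESET_NONE as RESET_DEVICE_ONLY.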
1632 1632
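Editorial sketch (not part of the commit): the speed-down scheme attributed to Jeff Garzik in the "Reconfigure transport" section above, written as one possible step function. The struct, enum and function names are hypothetical, the threshold of 3 is the tentative "$N (3?)" from the text, and counting errors inside the 15 minute window is left to the caller. Treating PIO3 as the floor is one reading of "if at PIO3, complain, but continue".

```c
#include <stdbool.h>

/* Hypothetical types for illustration only; not libata API. */
enum xfer_class { XFER_UDMA, XFER_PIO };

struct link_state {
	bool is_sata;
	int sata_gen;		/* SATA PHY speed generation, 1 is the lowest */
	enum xfer_class xfer;
	int mode;		/* UDMA or PIO mode number */
};

enum speed_down_result { SPEED_LOWERED, SPEED_AT_FLOOR };

#define ERR_THRESHOLD	3		/* "$N (3?)" in the text */
#define ERR_WINDOW_SECS	(15 * 60)	/* window the caller counts over */

/* True when more than ERR_THRESHOLD errors were seen in the window. */
static bool should_speed_down(unsigned int nr_errors_in_window)
{
	return nr_errors_in_window > ERR_THRESHOLD;
}

/* Apply one step of the scheme; the caller reprograms the hardware. */
static enum speed_down_result speed_down(struct link_state *st)
{
	/* 1. If SATA, decrease the PHY speed while that is possible. */
	if (st->is_sata && st->sata_gen > 1) {
		st->sata_gen--;
		return SPEED_LOWERED;
	}

	/* 2. Decrease UDMA speed; from UDMA0 switch to PIO4. */
	if (st->xfer == XFER_UDMA) {
		if (st->mode > 0) {
			st->mode--;
		} else {
			st->xfer = XFER_PIO;
			st->mode = 4;
		}
		return SPEED_LOWERED;
	}

	/* 3. Decrease PIO speed; at PIO3 complain but continue. */
	if (st->mode > 3) {
		st->mode--;
		return SPEED_LOWERED;
	}
	return SPEED_AT_FLOOR;
}
```

A caller would invoke speed_down() once each time should_speed_down() fires, log a warning on SPEED_AT_FLOOR, and then reissue the transfer mode setting to the device, since the text notes such parameters have to be reconfigured after changes.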
Documentation/DocBook/usb.tmpl
1 <?xml version="1.0" encoding="UTF-8"?> 1 <?xml version="1.0" encoding="UTF-8"?>
2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" 2 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> 3 "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
4 4
5 <book id="Linux-USB-API"> 5 <book id="Linux-USB-API">
6 <bookinfo> 6 <bookinfo>
7 <title>The Linux-USB Host Side API</title> 7 <title>The Linux-USB Host Side API</title>
8 8
9 <legalnotice> 9 <legalnotice>
10 <para> 10 <para>
11 This documentation is free software; you can redistribute 11 This documentation is free software; you can redistribute
12 it and/or modify it under the terms of the GNU General Public 12 it and/or modify it under the terms of the GNU General Public
13 License as published by the Free Software Foundation; either 13 License as published by the Free Software Foundation; either
14 version 2 of the License, or (at your option) any later 14 version 2 of the License, or (at your option) any later
15 version. 15 version.
16 </para> 16 </para>
17 17
18 <para> 18 <para>
19 This program is distributed in the hope that it will be 19 This program is distributed in the hope that it will be
20 useful, but WITHOUT ANY WARRANTY; without even the implied 20 useful, but WITHOUT ANY WARRANTY; without even the implied
21 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 21 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
22 See the GNU General Public License for more details. 22 See the GNU General Public License for more details.
23 </para> 23 </para>
24 24
25 <para> 25 <para>
26 You should have received a copy of the GNU General Public 26 You should have received a copy of the GNU General Public
27 License along with this program; if not, write to the Free 27 License along with this program; if not, write to the Free
28 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, 28 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
29 MA 02111-1307 USA 29 MA 02111-1307 USA
30 </para> 30 </para>
31 31
32 <para> 32 <para>
33 For more details see the file COPYING in the source 33 For more details see the file COPYING in the source
34 distribution of Linux. 34 distribution of Linux.
35 </para> 35 </para>
36 </legalnotice> 36 </legalnotice>
37 </bookinfo> 37 </bookinfo>
38 38
39 <toc></toc> 39 <toc></toc>
40 40
41 <chapter id="intro"> 41 <chapter id="intro">
42 <title>Introduction to USB on Linux</title> 42 <title>Introduction to USB on Linux</title>
43 43
44 <para>A Universal Serial Bus (USB) is used to connect a host, 44 <para>A Universal Serial Bus (USB) is used to connect a host,
45 such as a PC or workstation, to a number of peripheral 45 such as a PC or workstation, to a number of peripheral
46 devices. USB uses a tree structure, with the host as the 46 devices. USB uses a tree structure, with the host as the
47 root (the system's master), hubs as interior nodes, and 47 root (the system's master), hubs as interior nodes, and
48 peripherals as leaves (and slaves). 48 peripherals as leaves (and slaves).
49 Modern PCs support several such trees of USB devices, usually 49 Modern PCs support several such trees of USB devices, usually
50 one USB 2.0 tree (480 Mbit/sec each) with 50 one USB 2.0 tree (480 Mbit/sec each) with
51 a few USB 1.1 trees (12 Mbit/sec each) that are used when you 51 a few USB 1.1 trees (12 Mbit/sec each) that are used when you
52 connect a USB 1.1 device directly to the machine's "root hub". 52 connect a USB 1.1 device directly to the machine's "root hub".
53 </para> 53 </para>
54 54
55 <para>That master/slave asymmetry was designed-in for a number of 55 <para>That master/slave asymmetry was designed-in for a number of
56 reasons, one being ease of use. It is not physically possible to 56 reasons, one being ease of use. It is not physically possible to
57 assemble (legal) USB cables incorrectly: all upstream "to the host" 57 assemble (legal) USB cables incorrectly: all upstream "to the host"
58 connectors are the rectangular type (matching the sockets on 58 connectors are the rectangular type (matching the sockets on
59 root hubs), and all downstream connectors are the squarish type 59 root hubs), and all downstream connectors are the squarish type
60 (or they are built into the peripheral). 60 (or they are built into the peripheral).
61 Also, the host software doesn't need to deal with distributed 61 Also, the host software doesn't need to deal with distributed
62 auto-configuration since the pre-designated master node manages all that. 62 auto-configuration since the pre-designated master node manages all that.
63 And finally, at the electrical level, bus protocol overhead is reduced by 63 And finally, at the electrical level, bus protocol overhead is reduced by
64 eliminating arbitration and moving scheduling into the host software. 64 eliminating arbitration and moving scheduling into the host software.
65 </para> 65 </para>
66 66
67 <para>USB 1.0 was announced in January 1996 and was revised 67 <para>USB 1.0 was announced in January 1996 and was revised
68 as USB 1.1 (with improvements in hub specification and 68 as USB 1.1 (with improvements in hub specification and
69 support for interrupt-out transfers) in September 1998. 69 support for interrupt-out transfers) in September 1998.
70 USB 2.0 was released in April 2000, adding high-speed 70 USB 2.0 was released in April 2000, adding high-speed
71 transfers and transaction-translating hubs (used for USB 1.1 71 transfers and transaction-translating hubs (used for USB 1.1
72 and 1.0 backward compatibility). 72 and 1.0 backward compatibility).
73 </para> 73 </para>
74 74
75 <para>Kernel developers added USB support to Linux early in the 2.2 kernel 75 <para>Kernel developers added USB support to Linux early in the 2.2 kernel
76 series, shortly before 2.3 development forked. Updates from 2.3 were 76 series, shortly before 2.3 development forked. Updates from 2.3 were
77 regularly folded back into 2.2 releases, which improved reliability and 77 regularly folded back into 2.2 releases, which improved reliability and
78 brought <filename>/sbin/hotplug</filename> support as well more drivers. 78 brought <filename>/sbin/hotplug</filename> support as well more drivers.
79 Such improvements were continued in the 2.5 kernel series, where they added 79 Such improvements were continued in the 2.5 kernel series, where they added
80 USB 2.0 support, improved performance, and made the host controller drivers 80 USB 2.0 support, improved performance, and made the host controller drivers
81 (HCDs) more consistent. They also simplified the API (to make bugs less 81 (HCDs) more consistent. They also simplified the API (to make bugs less
82 likely) and added internal "kerneldoc" documentation. 82 likely) and added internal "kerneldoc" documentation.
83 </para> 83 </para>
84 84
85 <para>Linux can run inside USB devices as well as on 85 <para>Linux can run inside USB devices as well as on
86 the hosts that control the devices. 86 the hosts that control the devices.
87 But USB device drivers running inside those peripherals 87 But USB device drivers running inside those peripherals
88 don't do the same things as the ones running inside hosts, 88 don't do the same things as the ones running inside hosts,
89 so they've been given a different name: 89 so they've been given a different name:
90 <emphasis>gadget drivers</emphasis>. 90 <emphasis>gadget drivers</emphasis>.
91 This document does not cover gadget drivers. 91 This document does not cover gadget drivers.
92 </para> 92 </para>
93 93
94 </chapter> 94 </chapter>
95 95
96 <chapter id="host"> 96 <chapter id="host">
97 <title>USB Host-Side API Model</title> 97 <title>USB Host-Side API Model</title>
98 98
99 <para>Host-side drivers for USB devices talk to the "usbcore" APIs. 99 <para>Host-side drivers for USB devices talk to the "usbcore" APIs.
100 There are two. One is intended for 100 There are two. One is intended for
101 <emphasis>general-purpose</emphasis> drivers (exposed through 101 <emphasis>general-purpose</emphasis> drivers (exposed through
102 driver frameworks), and the other is for drivers that are 102 driver frameworks), and the other is for drivers that are
103 <emphasis>part of the core</emphasis>. 103 <emphasis>part of the core</emphasis>.
104 Such core drivers include the <emphasis>hub</emphasis> driver 104 Such core drivers include the <emphasis>hub</emphasis> driver
105 (which manages trees of USB devices) and several different kinds 105 (which manages trees of USB devices) and several different kinds
106 of <emphasis>host controller drivers</emphasis>, 106 of <emphasis>host controller drivers</emphasis>,
107 which control individual busses. 107 which control individual busses.
108 </para> 108 </para>
109 109
110 <para>The device model seen by USB drivers is relatively complex. 110 <para>The device model seen by USB drivers is relatively complex.
111 </para> 111 </para>
112 112
113 <itemizedlist> 113 <itemizedlist>
114 114
115 <listitem><para>USB supports four kinds of data transfers 115 <listitem><para>USB supports four kinds of data transfers
116 (control, bulk, interrupt, and isochronous). Two of them (control 116 (control, bulk, interrupt, and isochronous). Two of them (control
117 and bulk) use bandwidth as it's available, 117 and bulk) use bandwidth as it's available,
118 while the other two (interrupt and isochronous) 118 while the other two (interrupt and isochronous)
119 are scheduled to provide guaranteed bandwidth. 119 are scheduled to provide guaranteed bandwidth.
120 </para></listitem> 120 </para></listitem>
121 121
122 <listitem><para>The device description model includes one or more 122 <listitem><para>The device description model includes one or more
123 "configurations" per device, only one of which is active at a time. 123 "configurations" per device, only one of which is active at a time.
124 Devices that are capable of high-speed operation must also support 124 Devices that are capable of high-speed operation must also support
125 full-speed configurations, along with a way to ask about the 125 full-speed configurations, along with a way to ask about the
126 "other speed" configurations which might be used. 126 "other speed" configurations which might be used.
127 </para></listitem> 127 </para></listitem>
128 128
129 <listitem><para>Configurations have one or more "interfaces", each 129 <listitem><para>Configurations have one or more "interfaces", each
130 of which may have "alternate settings". Interfaces may be 130 of which may have "alternate settings". Interfaces may be
131 standardized by USB "Class" specifications, or may be specific to 131 standardized by USB "Class" specifications, or may be specific to
132 a vendor or device.</para> 132 a vendor or device.</para>
133 133
134 <para>USB device drivers actually bind to interfaces, not devices. 134 <para>USB device drivers actually bind to interfaces, not devices.
135 Think of them as "interface drivers", though you 135 Think of them as "interface drivers", though you
136 may not see many devices where the distinction is important. 136 may not see many devices where the distinction is important.
137 <emphasis>Most USB devices are simple, with only one configuration, 137 <emphasis>Most USB devices are simple, with only one configuration,
138 one interface, and one alternate setting.</emphasis> 138 one interface, and one alternate setting.</emphasis>
139 </para></listitem> 139 </para></listitem>
140 140
141 <listitem><para>Interfaces have one or more "endpoints", each of 141 <listitem><para>Interfaces have one or more "endpoints", each of
142 which supports one type and direction of data transfer such as 142 which supports one type and direction of data transfer such as
143 "bulk out" or "interrupt in". The entire configuration may have 143 "bulk out" or "interrupt in". The entire configuration may have
144 up to sixteen endpoints in each direction, allocated as needed 144 up to sixteen endpoints in each direction, allocated as needed
145 among all the interfaces. 145 among all the interfaces.
146 </para></listitem> 146 </para></listitem>
147 147
148 <listitem><para>Data transfer on USB is packetized; each endpoint 148 <listitem><para>Data transfer on USB is packetized; each endpoint
149 has a maximum packet size. 149 has a maximum packet size.
150 Drivers must often be aware of conventions such as flagging the end 150 Drivers must often be aware of conventions such as flagging the end
151 of bulk transfers using "short" (including zero length) packets. 151 of bulk transfers using "short" (including zero length) packets.
152 </para></listitem> 152 </para></listitem>
153 153
154 <listitem><para>The Linux USB API supports synchronous calls for 154 <listitem><para>The Linux USB API supports synchronous calls for
155 control and bulk messages. 155 control and bulk messages.
156 It also supports asynchnous calls for all kinds of data transfer, 156 It also supports asynchnous calls for all kinds of data transfer,
157 using request structures called "URBs" (USB Request Blocks). 157 using request structures called "URBs" (USB Request Blocks).
158 </para></listitem> 158 </para></listitem>
159 159
160 </itemizedlist> 160 </itemizedlist>
161 161
162 <para>Accordingly, the USB Core API exposed to device drivers 162 <para>Accordingly, the USB Core API exposed to device drivers
163 covers quite a lot of territory. You'll probably need to consult 163 covers quite a lot of territory. You'll probably need to consult
164 the USB 2.0 specification, available online from www.usb.org at 164 the USB 2.0 specification, available online from www.usb.org at
165 no cost, as well as class or device specifications. 165 no cost, as well as class or device specifications.
166 </para> 166 </para>
167 167
168 <para>The only host-side drivers that actually touch hardware 168 <para>The only host-side drivers that actually touch hardware
169 (reading/writing registers, handling IRQs, and so on) are the HCDs. 169 (reading/writing registers, handling IRQs, and so on) are the HCDs.
170 In theory, all HCDs provide the same functionality through the same 170 In theory, all HCDs provide the same functionality through the same
171 API. In practice, that's becoming more true on the 2.5 kernels, 171 API. In practice, that's becoming more true on the 2.5 kernels,
172 but there are still differences that crop up especially with 172 but there are still differences that crop up especially with
173 fault handling. Different controllers don't necessarily report 173 fault handling. Different controllers don't necessarily report
174 the same aspects of failures, and recovery from faults (including 174 the same aspects of failures, and recovery from faults (including
175 software-induced ones like unlinking an URB) isn't yet fully 175 software-induced ones like unlinking an URB) isn't yet fully
176 consistent. 176 consistent.
177 Device driver authors should make a point of doing disconnect 177 Device driver authors should make a point of doing disconnect
178 testing (while the device is active) with each different host 178 testing (while the device is active) with each different host
179 controller driver, to make sure drivers don't have bugs of 179 controller driver, to make sure drivers don't have bugs of
180 their own as well as to make sure they aren't relying on some 180 their own as well as to make sure they aren't relying on some
181 HCD-specific behavior. 181 HCD-specific behavior.
182 (You will need external USB 1.1 and/or 182 (You will need external USB 1.1 and/or
183 USB 2.0 hubs to perform all those tests.) 183 USB 2.0 hubs to perform all those tests.)
184 </para> 184 </para>
185 185
186 </chapter> 186 </chapter>
187 187
188 <chapter><title>USB-Standard Types</title> 188 <chapter><title>USB-Standard Types</title>
189 189
190 <para>In <filename>&lt;linux/usb_ch9.h&gt;</filename> you will find 190 <para>In <filename>&lt;linux/usb_ch9.h&gt;</filename> you will find
191 the USB data types defined in chapter 9 of the USB specification. 191 the USB data types defined in chapter 9 of the USB specification.
192 These data types are used throughout USB, and in APIs including 192 These data types are used throughout USB, and in APIs including
193 this host side API, gadget APIs, and usbfs. 193 this host side API, gadget APIs, and usbfs.
194 </para> 194 </para>
195 195
196 !Iinclude/linux/usb_ch9.h 196 !Iinclude/linux/usb_ch9.h
197 197
198 </chapter> 198 </chapter>
199 199
200 <chapter><title>Host-Side Data Types and Macros</title> 200 <chapter><title>Host-Side Data Types and Macros</title>
201 201
202 <para>The host side API exposes several layers to drivers, some of 202 <para>The host side API exposes several layers to drivers, some of
203 which are more necessary than others. 203 which are more necessary than others.
204 These support lifecycle models for host side drivers 204 These support lifecycle models for host side drivers
205 and devices, and support passing buffers through usbcore to 205 and devices, and support passing buffers through usbcore to
206 some HCD that performs the I/O for the device driver. 206 some HCD that performs the I/O for the device driver.
207 </para> 207 </para>
208 208
209 209
210 !Iinclude/linux/usb.h 210 !Iinclude/linux/usb.h
211 211
212 </chapter> 212 </chapter>
213 213
214 <chapter><title>USB Core APIs</title> 214 <chapter><title>USB Core APIs</title>
215 215
<para>There are two basic I/O models in the USB API.
The most elemental one is asynchronous: drivers submit requests
in the form of an URB, and the URB's completion callback
handles the next step.
All USB transfer types support that model, although there
are special cases for control URBs (which always have setup
and status stages, but may not have a data stage) and
isochronous URBs (which allow large packets and include
per-packet fault reports).
Built on top of that is synchronous API support, where a
driver calls a routine that allocates one or more URBs,
submits them, and waits until they complete.
There are synchronous wrappers for single-buffer control
and bulk transfers (which are awkward to use in some
driver disconnect scenarios), and for scatterlist based
streaming I/O (bulk or interrupt).
</para>
233 233
<para>USB drivers need to provide buffers that can be
used for DMA, although they don't necessarily need to
provide the DMA mapping themselves.
There are APIs to use when allocating DMA buffers,
which can prevent use of bounce buffers on some systems.
In some cases, drivers may be able to rely on 64bit DMA
to eliminate another kind of bounce buffer.
</para>
242 242
243 !Edrivers/usb/core/urb.c 243 !Edrivers/usb/core/urb.c
244 !Edrivers/usb/core/message.c 244 !Edrivers/usb/core/message.c
245 !Edrivers/usb/core/file.c 245 !Edrivers/usb/core/file.c
246 !Edrivers/usb/core/driver.c 246 !Edrivers/usb/core/driver.c
247 !Edrivers/usb/core/usb.c 247 !Edrivers/usb/core/usb.c
248 !Edrivers/usb/core/hub.c 248 !Edrivers/usb/core/hub.c
249 </chapter> 249 </chapter>
250 250
251 <chapter><title>Host Controller APIs</title> 251 <chapter><title>Host Controller APIs</title>
252 252
253 <para>These APIs are only for use by host controller drivers, 253 <para>These APIs are only for use by host controller drivers,
254 most of which implement standard register interfaces such as 254 most of which implement standard register interfaces such as
255 EHCI, OHCI, or UHCI. 255 EHCI, OHCI, or UHCI.
256 UHCI was one of the first interfaces, designed by Intel and 256 UHCI was one of the first interfaces, designed by Intel and
257 also used by VIA; it doesn't do much in hardware. 257 also used by VIA; it doesn't do much in hardware.
258 OHCI was designed later, to have the hardware do more work 258 OHCI was designed later, to have the hardware do more work
259 (bigger transfers, tracking protocol state, and so on). 259 (bigger transfers, tracking protocol state, and so on).
260 EHCI was designed with USB 2.0; its design has features that 260 EHCI was designed with USB 2.0; its design has features that
261 resemble OHCI (hardware does much more work) as well as 261 resemble OHCI (hardware does much more work) as well as
262 UHCI (some parts of ISO support, TD list processing). 262 UHCI (some parts of ISO support, TD list processing).
263 </para> 263 </para>
264 264
265 <para>There are host controllers other than the "big three", 265 <para>There are host controllers other than the "big three",
266 although most PCI based controllers (and a few non-PCI based 266 although most PCI based controllers (and a few non-PCI based
267 ones) use one of those interfaces. 267 ones) use one of those interfaces.
268 Not all host controllers use DMA; some use PIO, and there 268 Not all host controllers use DMA; some use PIO, and there
269 is also a simulator. 269 is also a simulator.
270 </para> 270 </para>
271 271
272 <para>The same basic APIs are available to drivers for all 272 <para>The same basic APIs are available to drivers for all
273 those controllers. 273 those controllers.
274 For historical reasons they are in two layers: 274 For historical reasons they are in two layers:
275 <structname>struct usb_bus</structname> is a rather thin 275 <structname>struct usb_bus</structname> is a rather thin
276 layer that became available in the 2.2 kernels, while 276 layer that became available in the 2.2 kernels, while
277 <structname>struct usb_hcd</structname> is a more featureful 277 <structname>struct usb_hcd</structname> is a more featureful
278 layer (available in later 2.4 kernels and in 2.5) that 278 layer (available in later 2.4 kernels and in 2.5) that
279 lets HCDs share common code, to shrink driver size 279 lets HCDs share common code, to shrink driver size
280 and significantly reduce hcd-specific behaviors. 280 and significantly reduce hcd-specific behaviors.
281 </para> 281 </para>
282 282
283 !Edrivers/usb/core/hcd.c 283 !Edrivers/usb/core/hcd.c
284 !Edrivers/usb/core/hcd-pci.c 284 !Edrivers/usb/core/hcd-pci.c
285 !Idrivers/usb/core/buffer.c 285 !Idrivers/usb/core/buffer.c
286 </chapter> 286 </chapter>
287 287
288 <chapter> 288 <chapter>
289 <title>The USB Filesystem (usbfs)</title> 289 <title>The USB Filesystem (usbfs)</title>
290 290
291 <para>This chapter presents the Linux <emphasis>usbfs</emphasis>. 291 <para>This chapter presents the Linux <emphasis>usbfs</emphasis>.
292 You may prefer to avoid writing new kernel code for your 292 You may prefer to avoid writing new kernel code for your
293 USB driver; that's the problem that usbfs set out to solve. 293 USB driver; that's the problem that usbfs set out to solve.
294 User mode device drivers are usually packaged as applications 294 User mode device drivers are usually packaged as applications
295 or libraries, and may use usbfs through some programming library 295 or libraries, and may use usbfs through some programming library
296 that wraps it. Such libraries include 296 that wraps it. Such libraries include
297 <ulink url="http://libusb.sourceforge.net">libusb</ulink> 297 <ulink url="http://libusb.sourceforge.net">libusb</ulink>
298 for C/C++, and 298 for C/C++, and
299 <ulink url="http://jUSB.sourceforge.net">jUSB</ulink> for Java. 299 <ulink url="http://jUSB.sourceforge.net">jUSB</ulink> for Java.
300 </para> 300 </para>
301 301
302 <note><title>Unfinished</title> 302 <note><title>Unfinished</title>
303 <para>This particular documentation is incomplete, 303 <para>This particular documentation is incomplete,
304 especially with respect to the asynchronous mode. 304 especially with respect to the asynchronous mode.
305 As of kernel 2.5.66 the code and this (new) documentation 305 As of kernel 2.5.66 the code and this (new) documentation
306 need to be cross-reviewed. 306 need to be cross-reviewed.
307 </para> 307 </para>
308 </note> 308 </note>
309 309
310 <para>Configure usbfs into Linux kernels by enabling the 310 <para>Configure usbfs into Linux kernels by enabling the
311 <emphasis>USB filesystem</emphasis> option (CONFIG_USB_DEVICEFS), 311 <emphasis>USB filesystem</emphasis> option (CONFIG_USB_DEVICEFS),
312 and you get basic support for user mode USB device drivers. 312 and you get basic support for user mode USB device drivers.
313 Until relatively recently it was often (confusingly) called 313 Until relatively recently it was often (confusingly) called
314 <emphasis>usbdevfs</emphasis> although it wasn't solving what 314 <emphasis>usbdevfs</emphasis> although it wasn't solving what
315 <emphasis>devfs</emphasis> was. 315 <emphasis>devfs</emphasis> was.
316 Every USB device will appear in usbfs, regardless of whether or 316 Every USB device will appear in usbfs, regardless of whether or
317 not it has a kernel driver. 317 not it has a kernel driver.
318 </para> 318 </para>
319 319
320 <sect1> 320 <sect1>
321 <title>What files are in "usbfs"?</title> 321 <title>What files are in "usbfs"?</title>
322 322
323 <para>Conventionally mounted at 323 <para>Conventionally mounted at
324 <filename>/proc/bus/usb</filename>, usbfs 324 <filename>/proc/bus/usb</filename>, usbfs
325 features include: 325 features include:
326 <itemizedlist> 326 <itemizedlist>
<listitem><para><filename>/proc/bus/usb/devices</filename>
... a text file
showing each of the USB devices known to the kernel,
and their configuration descriptors.
You can also poll() this to learn about new devices.
</para></listitem>
<listitem><para><filename>/proc/bus/usb/BBB/DDD</filename>
... magic files
exposing each device's configuration descriptors, and
supporting a series of ioctls for making device requests,
including I/O to devices. (Purely for access by programs.)
</para></listitem>
339 </itemizedlist> 339 </itemizedlist>
340 </para> 340 </para>
341 341
342 <para> Each bus is given a number (BBB) based on when it was 342 <para> Each bus is given a number (BBB) based on when it was
343 enumerated; within each bus, each device is given a similar 343 enumerated; within each bus, each device is given a similar
344 number (DDD). 344 number (DDD).
345 Those BBB/DDD paths are not "stable" identifiers; 345 Those BBB/DDD paths are not "stable" identifiers;
346 expect them to change even if you always leave the devices 346 expect them to change even if you always leave the devices
347 plugged in to the same hub port. 347 plugged in to the same hub port.
348 <emphasis>Don't even think of saving these in application 348 <emphasis>Don't even think of saving these in application
349 configuration files.</emphasis> 349 configuration files.</emphasis>
350 Stable identifiers are available, for user mode applications 350 Stable identifiers are available, for user mode applications
351 that want to use them. HID and networking devices expose 351 that want to use them. HID and networking devices expose
352 these stable IDs, so that for example you can be sure that 352 these stable IDs, so that for example you can be sure that
353 you told the right UPS to power down its second server. 353 you told the right UPS to power down its second server.
354 "usbfs" doesn't (yet) expose those IDs. 354 "usbfs" doesn't (yet) expose those IDs.
355 </para> 355 </para>
356 356
357 </sect1> 357 </sect1>
358 358
359 <sect1> 359 <sect1>
360 <title>Mounting and Access Control</title> 360 <title>Mounting and Access Control</title>
361 361
<para>There are a number of mount options for usbfs, which will
be of most interest to you if you need to override the default
access control policy.
That policy is that only root may read or write device files
(<filename>/proc/bus/usb/BBB/DDD</filename>) although anyone may read
the <filename>devices</filename>
or <filename>drivers</filename> files.
I/O requests to the device also need the CAP_SYS_RAWIO capability.
</para>
371 371
<para>The significance of that is that by default, all user mode
device drivers need super-user privileges.
You can change modes or ownership in a driver setup
when the device hotplugs, or maybe just start the
driver right then, as a privileged server (or some activity
within one).
That's the most secure approach for multi-user systems,
but for single user systems ("trusted" by that user)
it's more convenient just to grant everyone all access
(using the <emphasis>devmode=0666</emphasis> option)
so the driver can start whenever it's needed.
</para>
384 384
385 <para>The mount options for usbfs, usable in /etc/fstab or 385 <para>The mount options for usbfs, usable in /etc/fstab or
386 in command line invocations of <emphasis>mount</emphasis>, are: 386 in command line invocations of <emphasis>mount</emphasis>, are:
387 387
388 <variablelist> 388 <variablelist>
389 <varlistentry> 389 <varlistentry>
390 <term><emphasis>busgid</emphasis>=NNNNN</term> 390 <term><emphasis>busgid</emphasis>=NNNNN</term>
391 <listitem><para>Controls the GID used for the 391 <listitem><para>Controls the GID used for the
392 /proc/bus/usb/BBB 392 /proc/bus/usb/BBB
393 directories. (Default: 0)</para></listitem></varlistentry> 393 directories. (Default: 0)</para></listitem></varlistentry>
394 <varlistentry><term><emphasis>busmode</emphasis>=MMM</term> 394 <varlistentry><term><emphasis>busmode</emphasis>=MMM</term>
395 <listitem><para>Controls the file mode used for the 395 <listitem><para>Controls the file mode used for the
396 /proc/bus/usb/BBB 396 /proc/bus/usb/BBB
397 directories. (Default: 0555) 397 directories. (Default: 0555)
398 </para></listitem></varlistentry> 398 </para></listitem></varlistentry>
399 <varlistentry><term><emphasis>busuid</emphasis>=NNNNN</term> 399 <varlistentry><term><emphasis>busuid</emphasis>=NNNNN</term>
400 <listitem><para>Controls the UID used for the 400 <listitem><para>Controls the UID used for the
401 /proc/bus/usb/BBB 401 /proc/bus/usb/BBB
402 directories. (Default: 0)</para></listitem></varlistentry> 402 directories. (Default: 0)</para></listitem></varlistentry>
403 403
404 <varlistentry><term><emphasis>devgid</emphasis>=NNNNN</term> 404 <varlistentry><term><emphasis>devgid</emphasis>=NNNNN</term>
405 <listitem><para>Controls the GID used for the 405 <listitem><para>Controls the GID used for the
406 /proc/bus/usb/BBB/DDD 406 /proc/bus/usb/BBB/DDD
407 files. (Default: 0)</para></listitem></varlistentry> 407 files. (Default: 0)</para></listitem></varlistentry>
408 <varlistentry><term><emphasis>devmode</emphasis>=MMM</term> 408 <varlistentry><term><emphasis>devmode</emphasis>=MMM</term>
409 <listitem><para>Controls the file mode used for the 409 <listitem><para>Controls the file mode used for the
410 /proc/bus/usb/BBB/DDD 410 /proc/bus/usb/BBB/DDD
411 files. (Default: 0644)</para></listitem></varlistentry> 411 files. (Default: 0644)</para></listitem></varlistentry>
412 <varlistentry><term><emphasis>devuid</emphasis>=NNNNN</term> 412 <varlistentry><term><emphasis>devuid</emphasis>=NNNNN</term>
413 <listitem><para>Controls the UID used for the 413 <listitem><para>Controls the UID used for the
414 /proc/bus/usb/BBB/DDD 414 /proc/bus/usb/BBB/DDD
415 files. (Default: 0)</para></listitem></varlistentry> 415 files. (Default: 0)</para></listitem></varlistentry>
416 416
417 <varlistentry><term><emphasis>listgid</emphasis>=NNNNN</term> 417 <varlistentry><term><emphasis>listgid</emphasis>=NNNNN</term>
418 <listitem><para>Controls the GID used for the 418 <listitem><para>Controls the GID used for the
419 /proc/bus/usb/devices and drivers files. 419 /proc/bus/usb/devices and drivers files.
420 (Default: 0)</para></listitem></varlistentry> 420 (Default: 0)</para></listitem></varlistentry>
421 <varlistentry><term><emphasis>listmode</emphasis>=MMM</term> 421 <varlistentry><term><emphasis>listmode</emphasis>=MMM</term>
422 <listitem><para>Controls the file mode used for the 422 <listitem><para>Controls the file mode used for the
423 /proc/bus/usb/devices and drivers files. 423 /proc/bus/usb/devices and drivers files.
424 (Default: 0444)</para></listitem></varlistentry> 424 (Default: 0444)</para></listitem></varlistentry>
425 <varlistentry><term><emphasis>listuid</emphasis>=NNNNN</term> 425 <varlistentry><term><emphasis>listuid</emphasis>=NNNNN</term>
426 <listitem><para>Controls the UID used for the 426 <listitem><para>Controls the UID used for the
427 /proc/bus/usb/devices and drivers files. 427 /proc/bus/usb/devices and drivers files.
428 (Default: 0)</para></listitem></varlistentry> 428 (Default: 0)</para></listitem></varlistentry>
429 </variablelist> 429 </variablelist>
430 430
431 </para> 431 </para>
432 432
433 <para>Note that many Linux distributions hard-wire the mount options 433 <para>Note that many Linux distributions hard-wire the mount options
434 for usbfs in their init scripts, such as 434 for usbfs in their init scripts, such as
435 <filename>/etc/rc.d/rc.sysinit</filename>, 435 <filename>/etc/rc.d/rc.sysinit</filename>,
436 rather than making it easy to set this per-system 436 rather than making it easy to set this per-system
437 policy in <filename>/etc/fstab</filename>. 437 policy in <filename>/etc/fstab</filename>.
438 </para> 438 </para>
439 439
440 </sect1> 440 </sect1>
441 441
442 <sect1> 442 <sect1>
443 <title>/proc/bus/usb/devices</title> 443 <title>/proc/bus/usb/devices</title>
444 444
445 <para>This file is handy for status viewing tools in user 445 <para>This file is handy for status viewing tools in user
446 mode, which can scan the text format and ignore most of it. 446 mode, which can scan the text format and ignore most of it.
447 More detailed device status (including class and vendor 447 More detailed device status (including class and vendor
448 status) is available from device-specific files. 448 status) is available from device-specific files.
449 For information about the current format of this file, 449 For information about the current format of this file,
450 see the 450 see the
451 <filename>Documentation/usb/proc_usb_info.txt</filename> 451 <filename>Documentation/usb/proc_usb_info.txt</filename>
452 file in your Linux kernel sources. 452 file in your Linux kernel sources.
453 </para> 453 </para>
454 454
<para>This file, in combination with the poll() system call, can
also be used to detect when devices are added or removed:
<programlisting>int fd;
struct pollfd pfd;

fd = open("/proc/bus/usb/devices", O_RDONLY);
pfd.fd = fd;
pfd.events = POLLIN;
pfd.revents = 0;
for (;;) {
	/* The first time through, this call will return immediately. */
	poll(&amp;pfd, 1, -1);

	/* To see what's changed, compare the file's previous and current
	   contents or scan the filesystem.  (Scanning is more precise.) */
}</programlisting>
Note that this behavior is intended to be used for informational
and debug purposes. It would be more appropriate to use programs
such as udev or HAL to initialize a device or start a user-mode
helper program, for instance.
</para>
474 </sect1> 474 </sect1>
475 475
476 <sect1> 476 <sect1>
477 <title>/proc/bus/usb/BBB/DDD</title> 477 <title>/proc/bus/usb/BBB/DDD</title>
478 478
479 <para>Use these files in one of these basic ways: 479 <para>Use these files in one of these basic ways:
480 </para> 480 </para>
481 481
482 <para><emphasis>They can be read,</emphasis> 482 <para><emphasis>They can be read,</emphasis>
483 producing first the device descriptor 483 producing first the device descriptor
484 (18 bytes) and then the descriptors for the current configuration. 484 (18 bytes) and then the descriptors for the current configuration.
485 See the USB 2.0 spec for details about those binary data formats. 485 See the USB 2.0 spec for details about those binary data formats.
486 You'll need to convert most multibyte values from little endian 486 You'll need to convert most multibyte values from little endian
487 format to your native host byte order, although a few of the 487 format to your native host byte order, although a few of the
488 fields in the device descriptor (both of the BCD-encoded fields, 488 fields in the device descriptor (both of the BCD-encoded fields,
489 and the vendor and product IDs) will be byteswapped for you. 489 and the vendor and product IDs) will be byteswapped for you.
490 Note that configuration descriptors include descriptors for 490 Note that configuration descriptors include descriptors for
491 interfaces, altsettings, endpoints, and maybe additional 491 interfaces, altsettings, endpoints, and maybe additional
492 class descriptors. 492 class descriptors.
493 </para> 493 </para>
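<para>As a rough illustration of the byte-order handling described above,
the following sketch decodes a few fields from the 18-byte device descriptor.
The <structname>struct dev_ids</structname> type and the function names are
hypothetical, invented for this example. The <varname>host_order</varname>
flag is likewise an assumption: it lets the caller say whether the 16-bit
fields were already byteswapped (as described above for usbfs reads) or are
still in little-endian wire order. The field offsets follow the device
descriptor layout in chapter 9 of the USB 2.0 specification.
</para>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEV_DESC_SIZE 18   /* size of a USB device descriptor */
#define DT_DEVICE     0x01 /* bDescriptorType for a device descriptor */

/* Hypothetical container for the fields this sketch decodes. */
struct dev_ids {
	uint16_t bcdUSB;
	uint16_t idVendor;
	uint16_t idProduct;
	uint16_t bcdDevice;
	uint8_t  bNumConfigurations;
};

/* Read a 16-bit little-endian value, independent of host byte order. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a 16-bit value that is already in host byte order. */
static uint16_t get_host16(const uint8_t *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/*
 * Decode selected device descriptor fields from buf.
 * host_order != 0 means the 16-bit fields were already byteswapped.
 * Returns 0 on success, -1 if buf does not hold a device descriptor.
 */
int parse_device_descriptor(const uint8_t *buf, size_t len,
			    int host_order, struct dev_ids *out)
{
	uint16_t (*get16)(const uint8_t *) = host_order ? get_host16 : get_le16;

	if (len < DEV_DESC_SIZE || buf[0] < DEV_DESC_SIZE || buf[1] != DT_DEVICE)
		return -1;

	out->bcdUSB = get16(buf + 2);
	out->idVendor = get16(buf + 8);
	out->idProduct = get16(buf + 10);
	out->bcdDevice = get16(buf + 12);
	out->bNumConfigurations = buf[17];
	return 0;
}
```

<para>A caller would typically <function>read()</function> at least 18 bytes
from the device file and pass them in; the descriptors for the current
configuration follow in the same stream and still need little-endian
conversion.
</para>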
494 494
495 <para><emphasis>Perform USB operations</emphasis> using 495 <para><emphasis>Perform USB operations</emphasis> using
496 <emphasis>ioctl()</emphasis> requests to make endpoint I/O 496 <emphasis>ioctl()</emphasis> requests to make endpoint I/O
497 requests (synchronously or asynchronously) or manage 497 requests (synchronously or asynchronously) or manage
498 the device. 498 the device.
499 These requests need the CAP_SYS_RAWIO capability, 499 These requests need the CAP_SYS_RAWIO capability,
500 as well as filesystem access permissions. 500 as well as filesystem access permissions.
501 Only one ioctl request can be made on one of these 501 Only one ioctl request can be made on one of these
502 device files at a time. 502 device files at a time.
503 This means that if you are synchronously reading an endpoint 503 This means that if you are synchronously reading an endpoint
504 from one thread, you won't be able to write to a different 504 from one thread, you won't be able to write to a different
505 endpoint from another thread until the read completes. 505 endpoint from another thread until the read completes.
506 This works for <emphasis>half duplex</emphasis> protocols, 506 This works for <emphasis>half duplex</emphasis> protocols,
507 but otherwise you'd use asynchronous i/o requests. 507 but otherwise you'd use asynchronous i/o requests.
508 </para> 508 </para>
509 509
510 </sect1> 510 </sect1>
511 511
512 512
513 <sect1> 513 <sect1>
514 <title>Life Cycle of User Mode Drivers</title> 514 <title>Life Cycle of User Mode Drivers</title>
515 515
516 <para>Such a driver first needs to find a device file 516 <para>Such a driver first needs to find a device file
517 for a device it knows how to handle. 517 for a device it knows how to handle.
518 Maybe it was told about it because a 518 Maybe it was told about it because a
519 <filename>/sbin/hotplug</filename> event handling agent 519 <filename>/sbin/hotplug</filename> event handling agent
520 chose that driver to handle the new device. 520 chose that driver to handle the new device.
521 Or maybe it's an application that scans all the 521 Or maybe it's an application that scans all the
522 /proc/bus/usb device files, and ignores most devices. 522 /proc/bus/usb device files, and ignores most devices.
523 In either case, it should <function>read()</function> all 523 In either case, it should <function>read()</function> all
524 the descriptors from the device file, 524 the descriptors from the device file,
525 and check them against what it knows how to handle. 525 and check them against what it knows how to handle.
526 It might just reject everything except a particular 526 It might just reject everything except a particular
527 vendor and product ID, or need a more complex policy. 527 vendor and product ID, or need a more complex policy.
528 </para> 528 </para>
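<para>One possible way to check descriptors against what a driver can
handle is to walk the descriptor data looking for an interface of a given
class. The sketch below does that over a buffer already read from the device
file. The function name <function>find_interface_by_class</function> is
hypothetical, and a real driver would likely apply a richer policy; the
offsets used come from the interface descriptor layout in chapter 9 of the
USB 2.0 specification.
</para>

```c
#include <stddef.h>
#include <stdint.h>

#define DT_INTERFACE 0x04 /* bDescriptorType for an interface descriptor */

/*
 * Walk a buffer of concatenated descriptors (each starts with bLength and
 * bDescriptorType) and return the bInterfaceNumber of the first interface
 * whose bInterfaceClass equals wanted_class.
 * Returns -1 if there is no such interface or the data is malformed.
 */
int find_interface_by_class(const uint8_t *buf, size_t len,
			    uint8_t wanted_class)
{
	size_t off = 0;

	while (len - off >= 2) {
		uint8_t dlen = buf[off];
		uint8_t dtype = buf[off + 1];

		if (dlen < 2 || dlen > len - off)
			return -1;	/* malformed or truncated descriptor */
		if (dtype == DT_INTERFACE && dlen >= 9 &&
		    buf[off + 5] == wanted_class)
			return buf[off + 2];
		off += dlen;
	}
	return -1;
}
```

<para>Skipping by <varname>bLength</varname> rather than by a fixed size lets
the walk step over class-specific descriptors it does not understand.
</para>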
529 529
530 <para>Never assume there will only be one such device 530 <para>Never assume there will only be one such device
531 on the system at a time! 531 on the system at a time!
532 If your code can't handle more than one device at 532 If your code can't handle more than one device at
533 a time, at least detect when there's more than one, and 533 a time, at least detect when there's more than one, and
534 have your users choose which device to use. 534 have your users choose which device to use.
535 </para> 535 </para>
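<para>A minimal sketch of detecting more than one candidate is to scan the
usbfs tree and count device files. The function names here are hypothetical,
and the code assumes that bus and device entries are named with three decimal
digits, as the BBB/DDD notation suggests. A real driver would also read each
file's descriptors to decide whether the device is one it handles.
</para>

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Return nonzero if name looks like a three-digit BBB or DDD component. */
static int is_three_digits(const char *name)
{
	return strlen(name) == 3 &&
	       isdigit((unsigned char)name[0]) &&
	       isdigit((unsigned char)name[1]) &&
	       isdigit((unsigned char)name[2]);
}

/*
 * Count BBB/DDD device files below root (conventionally /proc/bus/usb).
 * Returns the count, or -1 if root cannot be opened.
 */
int count_device_files(const char *root)
{
	DIR *busses, *devs;
	struct dirent *bus, *dev;
	struct stat st;
	char path[512];
	int count = 0;

	busses = opendir(root);
	if (!busses)
		return -1;
	while ((bus = readdir(busses)) != NULL) {
		if (!is_three_digits(bus->d_name))
			continue;	/* skips ".", ".." and the "devices" file */
		snprintf(path, sizeof(path), "%s/%s", root, bus->d_name);
		devs = opendir(path);
		if (!devs)
			continue;
		while ((dev = readdir(devs)) != NULL) {
			if (!is_three_digits(dev->d_name))
				continue;
			snprintf(path, sizeof(path), "%s/%s/%s",
				 root, bus->d_name, dev->d_name);
			if (stat(path, &st) == 0 && !S_ISDIR(st.st_mode))
				count++;
		}
		closedir(devs);
	}
	closedir(busses);
	return count;
}
```

<para>If the count of matching devices exceeds one, the driver can list the
candidates and let the user pick, rather than silently taking the first.
</para>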
536 536
537 <para>Once your user mode driver knows what device to use, 537 <para>Once your user mode driver knows what device to use,
538 it interacts with it in either of two styles. 538 it interacts with it in either of two styles.
539 The simple style is to make only control requests; some 539 The simple style is to make only control requests; some
540 devices don't need more complex interactions than those. 540 devices don't need more complex interactions than those.
541 (An example might be software using vendor-specific control 541 (An example might be software using vendor-specific control
542 requests for some initialization or configuration tasks, 542 requests for some initialization or configuration tasks,
543 with a kernel driver for the rest.) 543 with a kernel driver for the rest.)
544 </para> 544 </para>
545 545
546 <para>More likely, you need a more complex style driver: 546 <para>More likely, you need a more complex style driver:
547 one using non-control endpoints, reading or writing data 547 one using non-control endpoints, reading or writing data
548 and claiming exclusive use of an interface. 548 and claiming exclusive use of an interface.
549 <emphasis>Bulk</emphasis> transfers are easiest to use, 549 <emphasis>Bulk</emphasis> transfers are easiest to use,
550 but only their sibling <emphasis>interrupt</emphasis> transfers 550 but only their sibling <emphasis>interrupt</emphasis> transfers
551 work with low speed devices. 551 work with low speed devices.
552 Both interrupt and <emphasis>isochronous</emphasis> transfers 552 Both interrupt and <emphasis>isochronous</emphasis> transfers
553 offer service guarantees because their bandwidth is reserved. 553 offer service guarantees because their bandwidth is reserved.
554 Such "periodic" transfers are awkward to use through usbfs, 554 Such "periodic" transfers are awkward to use through usbfs,
555 unless you're using the asynchronous calls. However, interrupt 555 unless you're using the asynchronous calls. However, interrupt
556 transfers can also be used in a synchronous "one shot" style. 556 transfers can also be used in a synchronous "one shot" style.
557 </para> 557 </para>
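<para>The more complex style might look roughly like the sketch below: claim
an interface, perform a synchronous bulk transfer, then release the interface.
The wrapper function names are hypothetical. The ioctl request codes and
<structname>struct usbdevfs_bulktransfer</structname> come from
<filename>&lt;linux/usbdevice_fs.h&gt;</filename>; this sketch assumes that
header alone is enough for these particular requests, and it returns negative
errno values as a convention of its own.
</para>

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/usbdevice_fs.h>

/* Claim interface ifnum on an open usbfs device file; 0 or -errno. */
int claim_interface(int fd, unsigned int ifnum)
{
	if (ioctl(fd, USBDEVFS_CLAIMINTERFACE, &ifnum) < 0)
		return -errno;
	return 0;
}

/* Release a previously claimed interface; 0 or -errno. */
int release_interface(int fd, unsigned int ifnum)
{
	if (ioctl(fd, USBDEVFS_RELEASEINTERFACE, &ifnum) < 0)
		return -errno;
	return 0;
}

/*
 * Synchronous bulk transfer on endpoint address ep (bit 7 set means IN).
 * Blocks until completion or timeout_ms milliseconds have passed.
 * Returns the number of bytes transferred, or -errno.
 */
int bulk_transfer(int fd, unsigned int ep, void *data,
		  unsigned int len, unsigned int timeout_ms)
{
	struct usbdevfs_bulktransfer bulk;
	int n;

	bulk.ep = ep;
	bulk.len = len;
	bulk.timeout = timeout_ms;
	bulk.data = data;

	n = ioctl(fd, USBDEVFS_BULK, &bulk);
	return n < 0 ? -errno : n;
}
```

<para>For example, a driver might call
<function>claim_interface(fd, 0)</function>, then
<function>bulk_transfer(fd, 0x81, buf, sizeof(buf), 1000)</function> to read
from a (hypothetical) bulk IN endpoint 1, and finally
<function>release_interface(fd, 0)</function>.
</para>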
558 558
559 <para>Your user-mode driver should never need to worry 559 <para>Your user-mode driver should never need to worry
560 about cleaning up request state when the device is 560 about cleaning up request state when the device is
561 disconnected, although it should close its open file 561 disconnected, although it should close its open file
562 descriptors as soon as it starts seeing the ENODEV 562 descriptors as soon as it starts seeing the ENODEV
563 errors. 563 errors.
564 </para> 564 </para>
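<para>The disconnect handling described above can be sketched as a small
helper that a driver calls whenever a request fails. The name
<function>handle_disconnect</function> and its return convention are
assumptions made for this example.
</para>

```c
#include <errno.h>
#include <unistd.h>

/*
 * Call after a failed usbfs request, passing the saved errno value.
 * On ENODEV the device is gone: close *fd, set it to -1 and return 1.
 * For any other error leave *fd open and return 0.
 */
int handle_disconnect(int *fd, int err)
{
	if (err != ENODEV)
		return 0;
	if (*fd >= 0) {
		close(*fd);
		*fd = -1;
	}
	return 1;
}
```

<para>Setting the descriptor to -1 after closing it guards against a second
<function>close()</function> on a number that may since have been reused.
</para>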
565 565
566 </sect1> 566 </sect1>
567 567
568 <sect1><title>The ioctl() Requests</title> 568 <sect1><title>The ioctl() Requests</title>
569 569
570 <para>To use these ioctls, you need to include the following 570 <para>To use these ioctls, you need to include the following
571 headers in your userspace program: 571 headers in your userspace program:
572 <programlisting>#include &lt;linux/usb.h&gt; 572 <programlisting>#include &lt;linux/usb.h&gt;
573 #include &lt;linux/usbdevice_fs.h&gt; 573 #include &lt;linux/usbdevice_fs.h&gt;
574 #include &lt;asm/byteorder.h&gt;</programlisting> 574 #include &lt;asm/byteorder.h&gt;</programlisting>
575 The standard USB device model requests, from "Chapter 9" of 575 The standard USB device model requests, from "Chapter 9" of
576 the USB 2.0 specification, are automatically included from 576 the USB 2.0 specification, are automatically included from
577 the <filename>&lt;linux/usb_ch9.h&gt;</filename> header. 577 the <filename>&lt;linux/usb_ch9.h&gt;</filename> header.
578 </para> 578 </para>
579 579
580 <para>Unless noted otherwise, the ioctl requests 580 <para>Unless noted otherwise, the ioctl requests
581 described here will 581 described here will
582 update the modification time on the usbfs file to which 582 update the modification time on the usbfs file to which
583 they are applied (unless they fail). 583 they are applied (unless they fail).
584 A return of zero indicates success; otherwise, a 584 A return of zero indicates success; otherwise, a
585 standard USB error code is returned. (These are 585 standard USB error code is returned. (These are
586 documented in 586 documented in
587 <filename>Documentation/usb/error-codes.txt</filename> 587 <filename>Documentation/usb/error-codes.txt</filename>
588 in your kernel sources.) 588 in your kernel sources.)
589 </para> 589 </para>
590 590
<para>Each of these files multiplexes access to several
I/O streams, one per endpoint.
Each device has one control endpoint (endpoint zero)
which supports limited RPC-style access.
Devices are configured
by khubd (in the kernel) setting a device-wide
<emphasis>configuration</emphasis> that affects things
like power consumption and basic functionality.
The endpoints are part of USB <emphasis>interfaces</emphasis>,
which may have <emphasis>altsettings</emphasis>
affecting things like which endpoints are available.
Many devices only have a single configuration and interface,
so drivers for them will ignore configurations and altsettings.
</para>
605 605
606 606
607 <sect2> 607 <sect2>
608 <title>Management/Status Requests</title> 608 <title>Management/Status Requests</title>
609 609
610 <para>A number of usbfs requests don't deal very directly 610 <para>A number of usbfs requests don't deal very directly
611 with device I/O. 611 with device I/O.
612 They mostly relate to device management and status. 612 They mostly relate to device management and status.
613 These are all synchronous requests. 613 These are all synchronous requests.
614 </para> 614 </para>
615 615
616 <variablelist> 616 <variablelist>
617 617
618 <varlistentry><term>USBDEVFS_CLAIMINTERFACE</term> 618 <varlistentry><term>USBDEVFS_CLAIMINTERFACE</term>
619 <listitem><para>This is used to force usbfs to 619 <listitem><para>This is used to force usbfs to
620 claim a specific interface, 620 claim a specific interface,
621 which has not previously been claimed by usbfs or any other 621 which has not previously been claimed by usbfs or any other
622 kernel driver. 622 kernel driver.
623 The ioctl parameter is an integer holding the number of 623 The ioctl parameter is an integer holding the number of
624 the interface (bInterfaceNumber from descriptor). 624 the interface (bInterfaceNumber from descriptor).
625 </para><para> 625 </para><para>
626 Note that if your driver doesn't claim an interface 626 Note that if your driver doesn't claim an interface
627 before trying to use one of its endpoints, and no 627 before trying to use one of its endpoints, and no
628 other driver has bound to it, then the interface is 628 other driver has bound to it, then the interface is
629 automatically claimed by usbfs. 629 automatically claimed by usbfs.
630 </para><para> 630 </para><para>
631 This claim will be released by a RELEASEINTERFACE ioctl, 631 This claim will be released by a RELEASEINTERFACE ioctl,
632 or by closing the file descriptor. 632 or by closing the file descriptor.
633 File modification time is not updated by this request. 633 File modification time is not updated by this request.
634 </para></listitem></varlistentry> 634 </para></listitem></varlistentry>
635 635
636 <varlistentry><term>USBDEVFS_CONNECTINFO</term> 636 <varlistentry><term>USBDEVFS_CONNECTINFO</term>
637 <listitem><para>Says whether the device is lowspeed. 637 <listitem><para>Says whether the device is lowspeed.
638 The ioctl parameter points to a structure like this: 638 The ioctl parameter points to a structure like this:
639 <programlisting>struct usbdevfs_connectinfo { 639 <programlisting>struct usbdevfs_connectinfo {
640 unsigned int devnum; 640 unsigned int devnum;
641 unsigned char slow; 641 unsigned char slow;
642 }; </programlisting> 642 }; </programlisting>
643 File modification time is not updated by this request. 643 File modification time is not updated by this request.
644 </para><para> 644 </para><para>
645 <emphasis>You can't tell whether a "not slow" 645 <emphasis>You can't tell whether a "not slow"
646 device is connected at high speed (480 MBit/sec) 646 device is connected at high speed (480 MBit/sec)
647 or just full speed (12 MBit/sec).</emphasis> 647 or just full speed (12 MBit/sec).</emphasis>
648 You should know the devnum value already, 648 You should know the devnum value already,
649 it's the DDD value of the device file name. 649 it's the DDD value of the device file name.
650 </para></listitem></varlistentry> 650 </para></listitem></varlistentry>
651 651
652 <varlistentry><term>USBDEVFS_GETDRIVER</term> 652 <varlistentry><term>USBDEVFS_GETDRIVER</term>
653 <listitem><para>Returns the name of the kernel driver 653 <listitem><para>Returns the name of the kernel driver
654 bound to a given interface (a string). Parameter 654 bound to a given interface (a string). Parameter
655 is a pointer to this structure, which is modified: 655 is a pointer to this structure, which is modified:
656 <programlisting>struct usbdevfs_getdriver { 656 <programlisting>struct usbdevfs_getdriver {
657 unsigned int interface; 657 unsigned int interface;
658 char driver[USBDEVFS_MAXDRIVERNAME + 1]; 658 char driver[USBDEVFS_MAXDRIVERNAME + 1];
659 };</programlisting> 659 };</programlisting>
660 File modification time is not updated by this request. 660 File modification time is not updated by this request.
661 </para></listitem></varlistentry> 661 </para></listitem></varlistentry>
662 662
663 <varlistentry><term>USBDEVFS_IOCTL</term> 663 <varlistentry><term>USBDEVFS_IOCTL</term>
664 <listitem><para>Passes a request from userspace through 664 <listitem><para>Passes a request from userspace through
665 to a kernel driver that has an ioctl entry in the 665 to a kernel driver that has an ioctl entry in the
666 <emphasis>struct usb_driver</emphasis> it registered. 666 <emphasis>struct usb_driver</emphasis> it registered.
667 <programlisting>struct usbdevfs_ioctl { 667 <programlisting>struct usbdevfs_ioctl {
668 int ifno; 668 int ifno;
669 int ioctl_code; 669 int ioctl_code;
670 void *data; 670 void *data;
671 }; 671 };
672 672
673 /* user mode call looks like this. 673 /* user mode call looks like this.
674 * 'request' becomes the driver->ioctl() 'code' parameter. 674 * 'request' becomes the driver->ioctl() 'code' parameter.
675 * the size of 'param' is encoded in 'request', and that data 675 * the size of 'param' is encoded in 'request', and that data
676 * is copied to or from the driver->ioctl() 'buf' parameter. 676 * is copied to or from the driver->ioctl() 'buf' parameter.
677 */ 677 */
678 static int 678 static int
679 usbdev_ioctl (int fd, int ifno, unsigned request, void *param) 679 usbdev_ioctl (int fd, int ifno, unsigned request, void *param)
680 { 680 {
681 struct usbdevfs_ioctl wrapper; 681 struct usbdevfs_ioctl wrapper;
682 682
683 wrapper.ifno = ifno; 683 wrapper.ifno = ifno;
684 wrapper.ioctl_code = request; 684 wrapper.ioctl_code = request;
685 wrapper.data = param; 685 wrapper.data = param;
686 686
687 return ioctl (fd, USBDEVFS_IOCTL, &amp;wrapper); 687 return ioctl (fd, USBDEVFS_IOCTL, &amp;wrapper);
688 } </programlisting> 688 } </programlisting>
689 File modification time is not updated by this request. 689 File modification time is not updated by this request.
690 </para><para> 690 </para><para>
691 This request lets kernel drivers talk to user mode code 691 This request lets kernel drivers talk to user mode code
692 through filesystem operations even when they don't create 692 through filesystem operations even when they don't create
693 a character or block special device. 693 a character or block special device.
694 It's also been used to do things like ask devices what 694 It's also been used to do things like ask devices what
695 device special file should be used. 695 device special file should be used.
696 Two pre-defined ioctls are used 696 Two pre-defined ioctls are used
697 to disconnect and reconnect kernel drivers, so 697 to disconnect and reconnect kernel drivers, so
698 that user mode code can completely manage binding 698 that user mode code can completely manage binding
699 and configuration of devices. 699 and configuration of devices.
700 </para></listitem></varlistentry> 700 </para></listitem></varlistentry>
701 701
702 <varlistentry><term>USBDEVFS_RELEASEINTERFACE</term> 702 <varlistentry><term>USBDEVFS_RELEASEINTERFACE</term>
703 <listitem><para>This is used to release the claim usbfs 703 <listitem><para>This is used to release the claim usbfs
704 made on an interface, either implicitly or because of a 704 made on an interface, either implicitly or because of a
705 USBDEVFS_CLAIMINTERFACE call, before the file 705 USBDEVFS_CLAIMINTERFACE call, before the file
706 descriptor is closed. 706 descriptor is closed.
707 The ioctl parameter is an integer holding the number of 707 The ioctl parameter is an integer holding the number of
708 the interface (bInterfaceNumber from descriptor). 708 the interface (bInterfaceNumber from descriptor).
709 File modification time is not updated by this request. 709 File modification time is not updated by this request.
710 </para><warning><para> 710 </para><warning><para>
711 <emphasis>No security check is made to ensure 711 <emphasis>No security check is made to ensure
712 that the task which made the claim is the one 712 that the task which made the claim is the one
713 which is releasing it. 713 which is releasing it.
714 This means that a user mode driver may interfere 714 This means that a user mode driver may interfere
715 with other ones. </emphasis> 715 with other ones. </emphasis>
716 </para></warning></listitem></varlistentry> 716 </para></warning></listitem></varlistentry>
717 717
718 <varlistentry><term>USBDEVFS_RESETEP</term> 718 <varlistentry><term>USBDEVFS_RESETEP</term>
719 <listitem><para>Resets the data toggle value for an endpoint 719 <listitem><para>Resets the data toggle value for an endpoint
720 (bulk or interrupt) to DATA0. 720 (bulk or interrupt) to DATA0.
721 The ioctl parameter is an integer endpoint number 721 The ioctl parameter is an integer endpoint number
722 (1 to 15, as identified in the endpoint descriptor), 722 (1 to 15, as identified in the endpoint descriptor),
723 with USB_DIR_IN added if the device's endpoint sends 723 with USB_DIR_IN added if the device's endpoint sends
724 data to the host. 724 data to the host.
725 </para><warning><para> 725 </para><warning><para>
726 <emphasis>Avoid using this request. 726 <emphasis>Avoid using this request.
727 It should probably be removed.</emphasis> 727 It should probably be removed.</emphasis>
728 Using it typically means the device and driver will lose 728 Using it typically means the device and driver will lose
729 toggle synchronization. If you really lost synchronization, 729 toggle synchronization. If you really lost synchronization,
730 you likely need to completely handshake with the device, 730 you likely need to completely handshake with the device,
731 using a request like CLEAR_HALT 731 using a request like CLEAR_HALT
732 or SET_INTERFACE. 732 or SET_INTERFACE.
733 </para></warning></listitem></varlistentry> 733 </para></warning></listitem></varlistentry>
734 734
735 </variablelist> 735 </variablelist>
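Several requests described in this document (USBDEVFS_RESETEP, USBDEVFS_BULK, USBDEVFS_CLEAR_HALT) take an endpoint number combined with the USB_DIR_IN direction bit. The following is a minimal sketch of that combination, not code from usbfs itself. The helper name `usbfs_ep_addr` and the `SKETCH_` constants are hypothetical; the constants are defined locally, with the values given for bEndpointAddress in chapter 9 of the USB 2.0 specification, so that the example builds without kernel headers.

```c
#include <assert.h>

/* Direction bit of bEndpointAddress (USB 2.0 specification, chapter 9).
 * Defined locally for this sketch; real code would normally take
 * USB_DIR_IN and USB_DIR_OUT from the kernel's USB headers. */
#define SKETCH_USB_DIR_OUT 0x00
#define SKETCH_USB_DIR_IN  0x80

/* Hypothetical helper: build the endpoint value handed to usbfs.
 * 'num' is the endpoint number (1 to 15, from the endpoint descriptor);
 * 'to_host' is nonzero when the endpoint sends data to the host. */
unsigned int usbfs_ep_addr(unsigned int num, int to_host)
{
	assert(num >= 1 && num <= 15);
	return num | (to_host ? SKETCH_USB_DIR_IN : SKETCH_USB_DIR_OUT);
}
```

The resulting value is what would go, for example, in the "ep" field of struct usbdevfs_bulktransfer.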
736 736
737 </sect2> 737 </sect2>
738 738
739 <sect2> 739 <sect2>
740 <title>Synchronous I/O Support</title> 740 <title>Synchronous I/O Support</title>
741 741
742 <para>Synchronous requests involve the kernel blocking 742 <para>Synchronous requests involve the kernel blocking
743 until until the user mode request completes, either by 743 until the user mode request completes, either by
744 finishing successfully or by reporting an error. 744 finishing successfully or by reporting an error.
745 In most cases this is the simplest way to use usbfs, 745 In most cases this is the simplest way to use usbfs,
746 although as noted above it does prevent performing I/O 746 although as noted above it does prevent performing I/O
747 to more than one endpoint at a time. 747 to more than one endpoint at a time.
748 </para> 748 </para>
749 749
750 <variablelist> 750 <variablelist>
751 751
752 <varlistentry><term>USBDEVFS_BULK</term> 752 <varlistentry><term>USBDEVFS_BULK</term>
753 <listitem><para>Issues a bulk read or write request to the 753 <listitem><para>Issues a bulk read or write request to the
754 device. 754 device.
755 The ioctl parameter is a pointer to this structure: 755 The ioctl parameter is a pointer to this structure:
756 <programlisting>struct usbdevfs_bulktransfer { 756 <programlisting>struct usbdevfs_bulktransfer {
757 unsigned int ep; 757 unsigned int ep;
758 unsigned int len; 758 unsigned int len;
759 unsigned int timeout; /* in milliseconds */ 759 unsigned int timeout; /* in milliseconds */
760 void *data; 760 void *data;
761 };</programlisting> 761 };</programlisting>
762 </para><para>The "ep" value identifies a 762 </para><para>The "ep" value identifies a
763 bulk endpoint number (1 to 15, as identified in an endpoint 763 bulk endpoint number (1 to 15, as identified in an endpoint
764 descriptor), 764 descriptor),
765 ORed with USB_DIR_IN when referring to an endpoint which 765 ORed with USB_DIR_IN when referring to an endpoint which
766 sends data to the host from the device. 766 sends data to the host from the device.
767 The length of the data buffer is identified by "len". 767 The length of the data buffer is identified by "len".
768 Recent kernels support requests up to about 128KBytes. 768 Recent kernels support requests up to about 128KBytes.
769 <emphasis>FIXME say how read length is returned, 769 <emphasis>FIXME say how read length is returned,
770 and how short reads are handled.</emphasis> 770 and how short reads are handled.</emphasis>
771 </para></listitem></varlistentry> 771 </para></listitem></varlistentry>
772 772
773 <varlistentry><term>USBDEVFS_CLEAR_HALT</term> 773 <varlistentry><term>USBDEVFS_CLEAR_HALT</term>
774 <listitem><para>Clears endpoint halt (stall) and 774 <listitem><para>Clears endpoint halt (stall) and
775 resets the endpoint toggle. This is only 775 resets the endpoint toggle. This is only
776 meaningful for bulk or interrupt endpoints. 776 meaningful for bulk or interrupt endpoints.
777 The ioctl parameter is an integer endpoint number 777 The ioctl parameter is an integer endpoint number
778 (1 to 15, as identified in an endpoint descriptor), 778 (1 to 15, as identified in an endpoint descriptor),
779 ORed with USB_DIR_IN when referring to an endpoint which 779 ORed with USB_DIR_IN when referring to an endpoint which
780 sends data to the host from the device. 780 sends data to the host from the device.
781 </para><para> 781 </para><para>
782 Use this on bulk or interrupt endpoints which have 782 Use this on bulk or interrupt endpoints which have
783 stalled, returning <emphasis>-EPIPE</emphasis> status 783 stalled, returning <emphasis>-EPIPE</emphasis> status
784 to a data transfer request. 784 to a data transfer request.
785 Do not issue the control request directly, since 785 Do not issue the control request directly, since
786 that could invalidate the host's record of the 786 that could invalidate the host's record of the
787 data toggle. 787 data toggle.
788 </para></listitem></varlistentry> 788 </para></listitem></varlistentry>
789 789
790 <varlistentry><term>USBDEVFS_CONTROL</term> 790 <varlistentry><term>USBDEVFS_CONTROL</term>
791 <listitem><para>Issues a control request to the device. 791 <listitem><para>Issues a control request to the device.
792 The ioctl parameter points to a structure like this: 792 The ioctl parameter points to a structure like this:
793 <programlisting>struct usbdevfs_ctrltransfer { 793 <programlisting>struct usbdevfs_ctrltransfer {
794 __u8 bRequestType; 794 __u8 bRequestType;
795 __u8 bRequest; 795 __u8 bRequest;
796 __u16 wValue; 796 __u16 wValue;
797 __u16 wIndex; 797 __u16 wIndex;
798 __u16 wLength; 798 __u16 wLength;
799 __u32 timeout; /* in milliseconds */ 799 __u32 timeout; /* in milliseconds */
800 void *data; 800 void *data;
801 };</programlisting> 801 };</programlisting>
802 </para><para> 802 </para><para>
803 The first eight bytes of this structure are the contents 803 The first eight bytes of this structure are the contents
804 of the SETUP packet to be sent to the device; see the 804 of the SETUP packet to be sent to the device; see the
805 USB 2.0 specification for details. 805 USB 2.0 specification for details.
806 The bRequestType value is composed by combining a 806 The bRequestType value is composed by combining a
807 USB_TYPE_* value, a USB_DIR_* value, and a 807 USB_TYPE_* value, a USB_DIR_* value, and a
808 USB_RECIP_* value (from 808 USB_RECIP_* value (from
809 <emphasis>&lt;linux/usb.h&gt;</emphasis>). 809 <emphasis>&lt;linux/usb.h&gt;</emphasis>).
810 If wLength is nonzero, it describes the length of the data 810 If wLength is nonzero, it describes the length of the data
811 buffer, which is either written to the device 811 buffer, which is either written to the device
812 (USB_DIR_OUT) or read from the device (USB_DIR_IN). 812 (USB_DIR_OUT) or read from the device (USB_DIR_IN).
813 </para><para> 813 </para><para>
814 At this writing, you can't transfer more than 4 KBytes 814 At this writing, you can't transfer more than 4 KBytes
815 of data to or from a device; usbfs has a limit, and 815 of data to or from a device; usbfs has a limit, and
816 some host controller drivers have a limit. 816 some host controller drivers have a limit.
817 (That's not usually a problem.) 817 (That's not usually a problem.)
818 <emphasis>Also</emphasis> there's no way to say it's 818 <emphasis>Also</emphasis> there's no way to say it's
819 not OK to get a short read back from the device. 819 not OK to get a short read back from the device.
820 </para></listitem></varlistentry> 820 </para></listitem></varlistentry>
821 821
822 <varlistentry><term>USBDEVFS_RESET</term> 822 <varlistentry><term>USBDEVFS_RESET</term>
823 <listitem><para>Does a USB level device reset. 823 <listitem><para>Does a USB level device reset.
824 The ioctl parameter is ignored. 824 The ioctl parameter is ignored.
825 After the reset, this rebinds all device interfaces. 825 After the reset, this rebinds all device interfaces.
826 File modification time is not updated by this request. 826 File modification time is not updated by this request.
827 </para><warning><para> 827 </para><warning><para>
828 <emphasis>Avoid using this call</emphasis> 828 <emphasis>Avoid using this call</emphasis>
829 until some usbcore bugs get fixed, 829 until some usbcore bugs get fixed,
830 since it does not fully synchronize device, interface, 830 since it does not fully synchronize device, interface,
831 and driver (not just usbfs) state. 831 and driver (not just usbfs) state.
832 </para></warning></listitem></varlistentry> 832 </para></warning></listitem></varlistentry>
833 833
834 <varlistentry><term>USBDEVFS_SETINTERFACE</term> 834 <varlistentry><term>USBDEVFS_SETINTERFACE</term>
835 <listitem><para>Sets the alternate setting for an 835 <listitem><para>Sets the alternate setting for an
836 interface. The ioctl parameter is a pointer to a 836 interface. The ioctl parameter is a pointer to a
837 structure like this: 837 structure like this:
838 <programlisting>struct usbdevfs_setinterface { 838 <programlisting>struct usbdevfs_setinterface {
839 unsigned int interface; 839 unsigned int interface;
840 unsigned int altsetting; 840 unsigned int altsetting;
841 }; </programlisting> 841 }; </programlisting>
842 File modification time is not updated by this request. 842 File modification time is not updated by this request.
843 </para><para> 843 </para><para>
844 Those struct members are from some interface descriptor 844 Those struct members are from some interface descriptor
845 applying to the current configuration. 845 applying to the current configuration.
846 The interface number is the bInterfaceNumber value, and 846 The interface number is the bInterfaceNumber value, and
847 the altsetting number is the bAlternateSetting value. 847 the altsetting number is the bAlternateSetting value.
848 (This resets each endpoint in the interface.) 848 (This resets each endpoint in the interface.)
849 </para></listitem></varlistentry> 849 </para></listitem></varlistentry>
850 850
851 <varlistentry><term>USBDEVFS_SETCONFIGURATION</term> 851 <varlistentry><term>USBDEVFS_SETCONFIGURATION</term>
852 <listitem><para>Issues the 852 <listitem><para>Issues the
853 <function>usb_set_configuration</function> call 853 <function>usb_set_configuration</function> call
854 for the device. 854 for the device.
855 The parameter is an integer holding the number of 855 The parameter is an integer holding the number of
856 a configuration (bConfigurationValue from descriptor). 856 a configuration (bConfigurationValue from descriptor).
857 File modification time is not updated by this request. 857 File modification time is not updated by this request.
858 </para><warning><para> 858 </para><warning><para>
859 <emphasis>Avoid using this call</emphasis> 859 <emphasis>Avoid using this call</emphasis>
860 until some usbcore bugs get fixed, 860 until some usbcore bugs get fixed,
861 since it does not fully synchronize device, interface, 861 since it does not fully synchronize device, interface,
862 and driver (not just usbfs) state. 862 and driver (not just usbfs) state.
863 </para></warning></listitem></varlistentry> 863 </para></warning></listitem></varlistentry>
864 864
865 </variablelist> 865 </variablelist>
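As an illustration of the USBDEVFS_CONTROL description above, the following sketch fills in a control request that reads the device descriptor. It is an assumption-laden example rather than the way any particular program does it: the structure is a local mirror of the one shown in the text, the `SK_` constants are defined locally with the values from chapter 9 of the USB 2.0 specification, and the helper name `sk_fill_get_device_descriptor` is hypothetical. The 4 KByte check reflects the limit mentioned in the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values from chapter 9 of the USB 2.0 specification, defined locally
 * so that this sketch does not depend on kernel headers. */
#define SK_USB_DIR_IN             0x80
#define SK_USB_TYPE_STANDARD      (0x00 << 5)
#define SK_USB_RECIP_DEVICE       0x00
#define SK_USB_REQ_GET_DESCRIPTOR 0x06
#define SK_USB_DT_DEVICE          0x01

/* Local mirror of the control transfer structure shown in the text. */
struct sk_ctrltransfer {
	uint8_t  bRequestType;
	uint8_t  bRequest;
	uint16_t wValue;
	uint16_t wIndex;
	uint16_t wLength;
	uint32_t timeout;	/* in milliseconds */
	void    *data;
};

/* Hypothetical helper: prepare a GET_DESCRIPTOR request for the device
 * descriptor, reading up to 'len' bytes into 'buf'. */
void sk_fill_get_device_descriptor(struct sk_ctrltransfer *ct,
				   void *buf, uint16_t len,
				   uint32_t timeout_ms)
{
	assert(ct != NULL && buf != NULL);
	assert(len <= 4096);	/* limit mentioned in the text */

	memset(ct, 0, sizeof(*ct));
	/* direction | type | recipient */
	ct->bRequestType = SK_USB_DIR_IN | SK_USB_TYPE_STANDARD
			 | SK_USB_RECIP_DEVICE;
	ct->bRequest = SK_USB_REQ_GET_DESCRIPTOR;
	/* descriptor type in the high byte, descriptor index 0 in the low */
	ct->wValue = SK_USB_DT_DEVICE << 8;
	ct->wIndex = 0;
	ct->wLength = len;	/* length of the data buffer */
	ct->timeout = timeout_ms;
	ct->data = buf;
}
```

With the real structure from the usbfs header, a pointer to the filled-in structure would be the parameter of the USBDEVFS_CONTROL ioctl.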
866 </sect2> 866 </sect2>
867 867
868 <sect2> 868 <sect2>
869 <title>Asynchronous I/O Support</title> 869 <title>Asynchronous I/O Support</title>
870 870
871 <para>As mentioned above, there are situations where it may be 871 <para>As mentioned above, there are situations where it may be
872 important to initiate concurrent operations from user mode code. 872 important to initiate concurrent operations from user mode code.
873 This is particularly important for periodic transfers 873 This is particularly important for periodic transfers
874 (interrupt and isochronous), but it can be used for other 874 (interrupt and isochronous), but it can be used for other
875 kinds of USB requests too. 875 kinds of USB requests too.
876 In such cases, the asynchronous requests described here 876 In such cases, the asynchronous requests described here
877 are essential. Rather than submitting one request and having 877 are essential. Rather than submitting one request and having
878 the kernel block until it completes, the blocking is separate. 878 the kernel block until it completes, the blocking is separate.
879 </para> 879 </para>
880 880
881 <para>These requests are packaged into a structure that 881 <para>These requests are packaged into a structure that
882 resembles the URB used by kernel device drivers. 882 resembles the URB used by kernel device drivers.
883 (No POSIX Async I/O support here, sorry.) 883 (No POSIX Async I/O support here, sorry.)
884 It identifies the endpoint type (USBDEVFS_URB_TYPE_*), 884 It identifies the endpoint type (USBDEVFS_URB_TYPE_*),
885 endpoint (number, ORed with USB_DIR_IN as appropriate), 885 endpoint (number, ORed with USB_DIR_IN as appropriate),
886 buffer and length, and a user "context" value serving to 886 buffer and length, and a user "context" value serving to
887 uniquely identify each request. 887 uniquely identify each request.
888 (It's usually a pointer to per-request data.) 888 (It's usually a pointer to per-request data.)
889 Flags can modify requests (not as many as supported for 889 Flags can modify requests (not as many as supported for
890 kernel drivers). 890 kernel drivers).
891 </para> 891 </para>
892 892
893 <para>Each request can specify a realtime signal number 893 <para>Each request can specify a realtime signal number
894 (between SIGRTMIN and SIGRTMAX, inclusive) to request a 894 (between SIGRTMIN and SIGRTMAX, inclusive) to request a
895 signal be sent when the request completes. 895 signal be sent when the request completes.
896 </para> 896 </para>
897 897
898 <para>When usbfs returns these urbs, the status value 898 <para>When usbfs returns these urbs, the status value
899 is updated, and the buffer may have been modified. 899 is updated, and the buffer may have been modified.
900 Except for isochronous transfers, the actual_length is 900 Except for isochronous transfers, the actual_length is
901 updated to say how many bytes were transferred. If the 901 updated to say how many bytes were transferred. If the
902 USBDEVFS_URB_DISABLE_SPD flag is set 902 USBDEVFS_URB_DISABLE_SPD flag is set
903 ("short packets are not OK"), you get an error report 903 ("short packets are not OK"), you get an error report
904 when fewer bytes were read than were requested. 904 when fewer bytes were read than were requested.
905 </para> 905 </para>
906 906
907 <programlisting>struct usbdevfs_iso_packet_desc { 907 <programlisting>struct usbdevfs_iso_packet_desc {
908 unsigned int length; 908 unsigned int length;
909 unsigned int actual_length; 909 unsigned int actual_length;
910 unsigned int status; 910 unsigned int status;
911 }; 911 };
912 912
913 struct usbdevfs_urb { 913 struct usbdevfs_urb {
914 unsigned char type; 914 unsigned char type;
915 unsigned char endpoint; 915 unsigned char endpoint;
916 int status; 916 int status;
917 unsigned int flags; 917 unsigned int flags;
918 void *buffer; 918 void *buffer;
919 int buffer_length; 919 int buffer_length;
920 int actual_length; 920 int actual_length;
921 int start_frame; 921 int start_frame;
922 int number_of_packets; 922 int number_of_packets;
923 int error_count; 923 int error_count;
924 unsigned int signr; 924 unsigned int signr;
925 void *usercontext; 925 void *usercontext;
926 struct usbdevfs_iso_packet_desc iso_frame_desc[]; 926 struct usbdevfs_iso_packet_desc iso_frame_desc[];
927 };</programlisting> 927 };</programlisting>
928 928
929 <para> For these asynchronous requests, the file modification 929 <para> For these asynchronous requests, the file modification
930 time reflects when the request was initiated. 930 time reflects when the request was initiated.
931 This contrasts with the synchronous requests, 931 This contrasts with the synchronous requests,
932 where it reflects when requests complete. 932 where it reflects when requests complete.
933 </para> 933 </para>
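The fields of the URB structure above might be filled in for a bulk request roughly as follows. This is only a sketch under stated assumptions: `struct sk_urb` is a local mirror of the non-isochronous part of struct usbdevfs_urb, `SK_URB_TYPE_BULK` is a local stand-in for USBDEVFS_URB_TYPE_BULK, and the helper name `sk_fill_bulk_urb` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in for USBDEVFS_URB_TYPE_BULK; real code would take the
 * constant from the usbfs header instead. */
#define SK_URB_TYPE_BULK 3

/* Local mirror of struct usbdevfs_urb, without the trailing
 * iso_frame_desc[] array (not needed for bulk requests). */
struct sk_urb {
	unsigned char type;
	unsigned char endpoint;
	int status;
	unsigned int flags;
	void *buffer;
	int buffer_length;
	int actual_length;
	int start_frame;
	int number_of_packets;
	int error_count;
	unsigned int signr;
	void *usercontext;
};

/* Hypothetical helper: describe one bulk request.
 * 'endpoint' already has the direction bit applied;
 * 'signr' is 0 for no signal, or a realtime signal number;
 * 'context' identifies the request when it is returned. */
void sk_fill_bulk_urb(struct sk_urb *urb, unsigned char endpoint,
		      void *buf, int len,
		      unsigned int signr, void *context)
{
	assert(urb != NULL && len >= 0);

	/* Zero everything first: status, actual_length and error_count
	 * are outputs that are updated when the urb is returned. */
	memset(urb, 0, sizeof(*urb));
	urb->type = SK_URB_TYPE_BULK;
	urb->endpoint = endpoint;
	urb->buffer = buf;
	urb->buffer_length = len;
	urb->signr = signr;
	urb->usercontext = context;
}
```

Using a pointer to per-request data as the context, as the text suggests, lets the code that collects a completed request find its own bookkeeping without a lookup table.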
934 934
935 <variablelist> 935 <variablelist>
936 936
937 <varlistentry><term>USBDEVFS_DISCARDURB</term> 937 <varlistentry><term>USBDEVFS_DISCARDURB</term>
938 <listitem><para> 938 <listitem><para>
939 <emphasis>TBS</emphasis> 939 <emphasis>TBS</emphasis>
940 File modification time is not updated by this request. 940 File modification time is not updated by this request.
941 </para><para> 941 </para><para>
942 </para></listitem></varlistentry> 942 </para></listitem></varlistentry>
943 943
944 <varlistentry><term>USBDEVFS_DISCSIGNAL</term> 944 <varlistentry><term>USBDEVFS_DISCSIGNAL</term>
945 <listitem><para> 945 <listitem><para>
946 <emphasis>TBS</emphasis> 946 <emphasis>TBS</emphasis>
947 File modification time is not updated by this request. 947 File modification time is not updated by this request.
948 </para><para> 948 </para><para>
949 </para></listitem></varlistentry> 949 </para></listitem></varlistentry>
950 950
951 <varlistentry><term>USBDEVFS_REAPURB</term> 951 <varlistentry><term>USBDEVFS_REAPURB</term>
952 <listitem><para> 952 <listitem><para>
953 <emphasis>TBS</emphasis> 953 <emphasis>TBS</emphasis>
954 File modification time is not updated by this request. 954 File modification time is not updated by this request.
955 </para><para> 955 </para><para>
956 </para></listitem></varlistentry> 956 </para></listitem></varlistentry>
957 957
958 <varlistentry><term>USBDEVFS_REAPURBNDELAY</term> 958 <varlistentry><term>USBDEVFS_REAPURBNDELAY</term>
959 <listitem><para> 959 <listitem><para>
960 <emphasis>TBS</emphasis> 960 <emphasis>TBS</emphasis>
961 File modification time is not updated by this request. 961 File modification time is not updated by this request.
962 </para><para> 962 </para><para>
963 </para></listitem></varlistentry> 963 </para></listitem></varlistentry>
964 964
965 <varlistentry><term>USBDEVFS_SUBMITURB</term> 965 <varlistentry><term>USBDEVFS_SUBMITURB</term>
966 <listitem><para> 966 <listitem><para>
967 <emphasis>TBS</emphasis> 967 <emphasis>TBS</emphasis>
968 </para><para> 968 </para><para>
969 </para></listitem></varlistentry> 969 </para></listitem></varlistentry>
970 970
971 </variablelist> 971 </variablelist>
972 </sect2> 972 </sect2>
973 973
974 </sect1> 974 </sect1>
975 975
976 </chapter> 976 </chapter>
977 977
978 </book> 978 </book>
979 <!-- vim:syntax=sgml:sw=4 979 <!-- vim:syntax=sgml:sw=4
980 --> 980 -->
981 981
Documentation/RCU/whatisRCU.txt
1 What is RCU? 1 What is RCU?
2 2
3 RCU is a synchronization mechanism that was added to the Linux kernel 3 RCU is a synchronization mechanism that was added to the Linux kernel
4 during the 2.5 development effort that is optimized for read-mostly 4 during the 2.5 development effort that is optimized for read-mostly
5 situations. Although RCU is actually quite simple once you understand it, 5 situations. Although RCU is actually quite simple once you understand it,
6 getting there can sometimes be a challenge. Part of the problem is that 6 getting there can sometimes be a challenge. Part of the problem is that
7 most of the past descriptions of RCU have been written with the mistaken 7 most of the past descriptions of RCU have been written with the mistaken
8 assumption that there is "one true way" to describe RCU. Instead, 8 assumption that there is "one true way" to describe RCU. Instead,
9 the experience has been that different people must take different paths 9 the experience has been that different people must take different paths
10 to arrive at an understanding of RCU. This document provides several 10 to arrive at an understanding of RCU. This document provides several
11 different paths, as follows: 11 different paths, as follows:
12 12
13 1. RCU OVERVIEW 13 1. RCU OVERVIEW
14 2. WHAT IS RCU'S CORE API? 14 2. WHAT IS RCU'S CORE API?
15 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? 15 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
16 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? 16 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
17 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? 17 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
18 6. ANALOGY WITH READER-WRITER LOCKING 18 6. ANALOGY WITH READER-WRITER LOCKING
19 7. FULL LIST OF RCU APIs 19 7. FULL LIST OF RCU APIs
20 8. ANSWERS TO QUICK QUIZZES 20 8. ANSWERS TO QUICK QUIZZES
21 21
22 People who prefer starting with a conceptual overview should focus on 22 People who prefer starting with a conceptual overview should focus on
23 Section 1, though most readers will profit by reading this section at 23 Section 1, though most readers will profit by reading this section at
24 some point. People who prefer to start with an API that they can then 24 some point. People who prefer to start with an API that they can then
25 experiment with should focus on Section 2. People who prefer to start 25 experiment with should focus on Section 2. People who prefer to start
26 with example uses should focus on Sections 3 and 4. People who need to 26 with example uses should focus on Sections 3 and 4. People who need to
27 understand the RCU implementation should focus on Section 5, then dive 27 understand the RCU implementation should focus on Section 5, then dive
28 into the kernel source code. People who reason best by analogy should 28 into the kernel source code. People who reason best by analogy should
29 focus on Section 6. Section 7 serves as an index to the docbook API 29 focus on Section 6. Section 7 serves as an index to the docbook API
30 documentation, and Section 8 is the traditional answer key. 30 documentation, and Section 8 is the traditional answer key.
31 31
32 So, start with the section that makes the most sense to you and your 32 So, start with the section that makes the most sense to you and your
33 preferred method of learning. If you need to know everything about 33 preferred method of learning. If you need to know everything about
34 everything, feel free to read the whole thing -- but if you are really 34 everything, feel free to read the whole thing -- but if you are really
35 that type of person, you have perused the source code and will therefore 35 that type of person, you have perused the source code and will therefore
36 never need this document anyway. ;-) 36 never need this document anyway. ;-)
37 37
38 38
39 1. RCU OVERVIEW 39 1. RCU OVERVIEW
40 40
41 The basic idea behind RCU is to split updates into "removal" and 41 The basic idea behind RCU is to split updates into "removal" and
42 "reclamation" phases. The removal phase removes references to data items 42 "reclamation" phases. The removal phase removes references to data items
43 within a data structure (possibly by replacing them with references to 43 within a data structure (possibly by replacing them with references to
44 new versions of these data items), and can run concurrently with readers. 44 new versions of these data items), and can run concurrently with readers.
45 The reason that it is safe to run the removal phase concurrently with 45 The reason that it is safe to run the removal phase concurrently with
46 readers is that the semantics of modern CPUs guarantee that readers will see 46 readers is that the semantics of modern CPUs guarantee that readers will see
47 either the old or the new version of the data structure rather than a 47 either the old or the new version of the data structure rather than a
48 partially updated reference. The reclamation phase does the work of reclaiming 48 partially updated reference. The reclamation phase does the work of reclaiming
49 (e.g., freeing) the data items removed from the data structure during the 49 (e.g., freeing) the data items removed from the data structure during the
50 removal phase. Because reclaiming data items can disrupt any readers 50 removal phase. Because reclaiming data items can disrupt any readers
51 concurrently referencing those data items, the reclamation phase must 51 concurrently referencing those data items, the reclamation phase must
52 not start until readers no longer hold references to those data items. 52 not start until readers no longer hold references to those data items.
53 53
54 Splitting the update into removal and reclamation phases permits the 54 Splitting the update into removal and reclamation phases permits the
55 updater to perform the removal phase immediately, and to defer the 55 updater to perform the removal phase immediately, and to defer the
56 reclamation phase until all readers active during the removal phase have 56 reclamation phase until all readers active during the removal phase have
57 completed, either by blocking until they finish or by registering a 57 completed, either by blocking until they finish or by registering a
58 callback that is invoked after they finish. Only readers that are active 58 callback that is invoked after they finish. Only readers that are active
59 during the removal phase need be considered, because any reader starting 59 during the removal phase need be considered, because any reader starting
60 after the removal phase will be unable to gain a reference to the removed 60 after the removal phase will be unable to gain a reference to the removed
61 data items, and therefore cannot be disrupted by the reclamation phase. 61 data items, and therefore cannot be disrupted by the reclamation phase.
62 62
63 So the typical RCU update sequence goes something like the following: 63 So the typical RCU update sequence goes something like the following:
64 64
65 a. Remove pointers to a data structure, so that subsequent 65 a. Remove pointers to a data structure, so that subsequent
66 readers cannot gain a reference to it. 66 readers cannot gain a reference to it.
67 67
68 b. Wait for all previous readers to complete their RCU read-side 68 b. Wait for all previous readers to complete their RCU read-side
69 critical sections. 69 critical sections.
70 70
71 c. At this point, there cannot be any readers who hold references 71 c. At this point, there cannot be any readers who hold references
72 to the data structure, so it now may safely be reclaimed 72 to the data structure, so it now may safely be reclaimed
73 (e.g., kfree()d). 73 (e.g., kfree()d).
74 74
75 Step (b) above is the key idea underlying RCU's deferred destruction. 75 Step (b) above is the key idea underlying RCU's deferred destruction.
76 The ability to wait until all readers are done allows RCU readers to 76 The ability to wait until all readers are done allows RCU readers to
77 use much lighter-weight synchronization and, in some cases, absolutely no 77 use much lighter-weight synchronization and, in some cases, absolutely no
78 synchronization at all. In contrast, in more conventional lock-based 78 synchronization at all. In contrast, in more conventional lock-based
79 schemes, readers must use heavy-weight synchronization in order to 79 schemes, readers must use heavy-weight synchronization in order to
80 prevent an updater from deleting the data structure out from under them. 80 prevent an updater from deleting the data structure out from under them.
81 This is because lock-based updaters typically update data items in place, 81 This is because lock-based updaters typically update data items in place,
82 and must therefore exclude readers. In contrast, RCU-based updaters 82 and must therefore exclude readers. In contrast, RCU-based updaters
83 typically take advantage of the fact that writes to single aligned 83 typically take advantage of the fact that writes to single aligned
84 pointers are atomic on modern CPUs, allowing atomic insertion, removal, 84 pointers are atomic on modern CPUs, allowing atomic insertion, removal,
85 and replacement of data items in a linked structure without disrupting 85 and replacement of data items in a linked structure without disrupting
86 readers. Concurrent RCU readers can then continue accessing the old 86 readers. Concurrent RCU readers can then continue accessing the old
87 versions, and can dispense with the atomic operations, memory barriers, 87 versions, and can dispense with the atomic operations, memory barriers,
88 and communications cache misses that are so expensive on present-day 88 and communications cache misses that are so expensive on present-day
89 SMP computer systems, even in the absence of lock contention. 89 SMP computer systems, even in the absence of lock contention.
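
The atomic-replacement idea in the paragraph above can be sketched in
userspace C. This is only an illustration under stated assumptions: it uses
C11 <stdatomic.h> instead of the kernel primitives, and struct item,
current_item, replace_item() and read_value() are hypothetical names that
do not appear in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct item {
	int key;
	int value;
};

/* Hypothetical shared pointer: readers see either the old or the new item. */
static _Atomic(struct item *) current_item;

/*
 * Removal phase: build a fully initialized replacement, then swap it in
 * with a single atomic pointer exchange.  The old item is handed back
 * through *old; it must not be freed until all pre-existing readers are
 * done.  Returns 0 on success, -1 if allocation fails.
 */
int replace_item(int key, int value, struct item **old)
{
	struct item *new_item = malloc(sizeof(*new_item));

	if (!new_item)
		return -1;
	new_item->key = key;
	new_item->value = value;
	*old = atomic_exchange_explicit(&current_item, new_item,
					memory_order_acq_rel);
	return 0;
}

/* Reader side: a single load of the pointer, with no locks taken. */
int read_value(int fallback)
{
	struct item *p = atomic_load_explicit(&current_item,
					      memory_order_acquire);

	return p ? p->value : fallback;
}
```

The sketch deliberately leaves out the wait for readers: the caller receives
the old item but has no safe way to free it while readers may still run,
which is exactly the gap that step (b) fills.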
90 90
91 In the three-step procedure shown above, the updater is performing both 91 In the three-step procedure shown above, the updater is performing both
92 the removal and the reclamation step, but it is often helpful for an 92 the removal and the reclamation step, but it is often helpful for an
93 entirely different thread to do the reclamation, as is in fact the case 93 entirely different thread to do the reclamation, as is in fact the case
94 in the Linux kernel's directory-entry cache (dcache). Even if the same 94 in the Linux kernel's directory-entry cache (dcache). Even if the same
95 thread performs both the update step (step (a) above) and the reclamation 95 thread performs both the update step (step (a) above) and the reclamation
96 step (step (c) above), it is often helpful to think of them separately. 96 step (step (c) above), it is often helpful to think of them separately.
97 For example, RCU readers and updaters need not communicate at all, 97 For example, RCU readers and updaters need not communicate at all,
98 but RCU provides implicit low-overhead communication between readers 98 but RCU provides implicit low-overhead communication between readers
99 and reclaimers, namely, in step (b) above. 99 and reclaimers, namely, in step (b) above.
100 100
101 So how the heck can a reclaimer tell when a reader is done, given 101 So how the heck can a reclaimer tell when a reader is done, given
102 that readers are not doing any sort of synchronization operations??? 102 that readers are not doing any sort of synchronization operations???
103 Read on to learn about how RCU's API makes this easy. 103 Read on to learn about how RCU's API makes this easy.
104 104
105 105
106 2. WHAT IS RCU'S CORE API? 106 2. WHAT IS RCU'S CORE API?
107 107
108 The core RCU API is quite small: 108 The core RCU API is quite small:
109 109
110 a. rcu_read_lock() 110 a. rcu_read_lock()
111 b. rcu_read_unlock() 111 b. rcu_read_unlock()
112 c. synchronize_rcu() / call_rcu() 112 c. synchronize_rcu() / call_rcu()
113 d. rcu_assign_pointer() 113 d. rcu_assign_pointer()
114 e. rcu_dereference() 114 e. rcu_dereference()
115 115
116 There are many other members of the RCU API, but the rest can be 116 There are many other members of the RCU API, but the rest can be
117 expressed in terms of these five, though most implementations instead 117 expressed in terms of these five, though most implementations instead
118 express synchronize_rcu() in terms of the call_rcu() callback API. 118 express synchronize_rcu() in terms of the call_rcu() callback API.
119 119
120 The five core RCU APIs are described below; the other 18 will be enumerated 120 The five core RCU APIs are described below; the other 18 will be enumerated
121 later. See the kernel docbook documentation for more info, or look directly 121 later. See the kernel docbook documentation for more info, or look directly
122 at the function header comments. 122 at the function header comments.
123 123
124 rcu_read_lock() 124 rcu_read_lock()
125 125
126 void rcu_read_lock(void); 126 void rcu_read_lock(void);
127 127
128 Used by a reader to inform the reclaimer that the reader is 128 Used by a reader to inform the reclaimer that the reader is
129 entering an RCU read-side critical section. It is illegal 129 entering an RCU read-side critical section. It is illegal
130 to block while in an RCU read-side critical section, though 130 to block while in an RCU read-side critical section, though
131 kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side 131 kernels built with CONFIG_PREEMPT_RCU can preempt RCU read-side
132 critical sections. Any RCU-protected data structure accessed 132 critical sections. Any RCU-protected data structure accessed
133 during an RCU read-side critical section is guaranteed to remain 133 during an RCU read-side critical section is guaranteed to remain
134 unreclaimed for the full duration of that critical section. 134 unreclaimed for the full duration of that critical section.
135 Reference counts may be used in conjunction with RCU to maintain 135 Reference counts may be used in conjunction with RCU to maintain
136 longer-term references to data structures. 136 longer-term references to data structures.
137 137
138 rcu_read_unlock() 138 rcu_read_unlock()
139 139
140 void rcu_read_unlock(void); 140 void rcu_read_unlock(void);
141 141
142 Used by a reader to inform the reclaimer that the reader is 142 Used by a reader to inform the reclaimer that the reader is
143 exiting an RCU read-side critical section. Note that RCU 143 exiting an RCU read-side critical section. Note that RCU
144 read-side critical sections may be nested and/or overlapping. 144 read-side critical sections may be nested and/or overlapping.
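
Nesting of read-side critical sections can be pictured with a per-thread
depth counter. The following is a toy userspace sketch, not the kernel's
implementation; the toy_ names are hypothetical, and it assumes a C11
compiler with _Thread_local support.

```c
#include <assert.h>

/* Hypothetical per-thread nesting depth of read-side critical sections. */
static _Thread_local int toy_rcu_nesting;

/* Entering a (possibly nested) read-side critical section. */
void toy_rcu_read_lock(void)
{
	toy_rcu_nesting++;
}

/* Leaving one level; only the outermost unlock ends the critical section. */
void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_nesting > 0);
	toy_rcu_nesting--;
}

/* Nonzero while the calling thread is inside any read-side section. */
int toy_rcu_read_lock_held(void)
{
	return toy_rcu_nesting > 0;
}
```

In this model the reader is still considered active after an inner unlock,
and becomes quiescent only when the depth returns to zero.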
145 145
146 synchronize_rcu() 146 synchronize_rcu()
147 147
148 void synchronize_rcu(void); 148 void synchronize_rcu(void);
149 149
150 Marks the end of updater code and the beginning of reclaimer 150 Marks the end of updater code and the beginning of reclaimer
151 code. It does this by blocking until all pre-existing RCU 151 code. It does this by blocking until all pre-existing RCU
152 read-side critical sections on all CPUs have completed. 152 read-side critical sections on all CPUs have completed.
153 Note that synchronize_rcu() will -not- necessarily wait for 153 Note that synchronize_rcu() will -not- necessarily wait for
154 any subsequent RCU read-side critical sections to complete. 154 any subsequent RCU read-side critical sections to complete.
155 For example, consider the following sequence of events: 155 For example, consider the following sequence of events:
156 156
157 CPU 0 CPU 1 CPU 2 157 CPU 0 CPU 1 CPU 2
158 ----------------- ------------------------- --------------- 158 ----------------- ------------------------- ---------------
159 1. rcu_read_lock() 159 1. rcu_read_lock()
160 2. enters synchronize_rcu() 160 2. enters synchronize_rcu()
161 3. rcu_read_lock() 161 3. rcu_read_lock()
162 4. rcu_read_unlock() 162 4. rcu_read_unlock()
163 5. exits synchronize_rcu() 163 5. exits synchronize_rcu()
164 6. rcu_read_unlock() 164 6. rcu_read_unlock()
165 165
166 To reiterate, synchronize_rcu() waits only for ongoing RCU 166 To reiterate, synchronize_rcu() waits only for ongoing RCU
167 read-side critical sections to complete, not necessarily for 167 read-side critical sections to complete, not necessarily for
168 any that begin after synchronize_rcu() is invoked. 168 any that begin after synchronize_rcu() is invoked.
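
The timeline above can be replayed with a small single-threaded model. This
is a sketch under simplifying assumptions (one non-nested reader per CPU, no
real concurrency), and every toy_ name is hypothetical. It snapshots the
readers that are already running when synchronize_rcu() is entered and
ignores any that start later.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 3

static bool toy_in_reader[TOY_NR_CPUS];	 /* CPU is inside a read-side section */
static bool toy_waiting_for[TOY_NR_CPUS]; /* readers this grace period waits for */

void toy_read_lock(int cpu)
{
	toy_in_reader[cpu] = true;
}

void toy_read_unlock(int cpu)
{
	toy_in_reader[cpu] = false;
	toy_waiting_for[cpu] = false;
}

/* Entry to synchronize_rcu(): snapshot only the readers already running. */
void toy_sync_enter(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_waiting_for[cpu] = toy_in_reader[cpu];
}

/* True once every pre-existing reader has finished. */
bool toy_sync_may_exit(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_waiting_for[cpu])
			return false;
	return true;
}
```

Replaying steps 1 through 6, toy_sync_may_exit() becomes true as soon as
CPU 0 unlocks in step 4, even though CPU 2 is still inside the critical
section it entered in step 3.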
169 169
170 Of course, synchronize_rcu() does not necessarily return 170 Of course, synchronize_rcu() does not necessarily return
171 -immediately- after the last pre-existing RCU read-side critical 171 -immediately- after the last pre-existing RCU read-side critical
172 section completes. For one thing, there might well be scheduling 172 section completes. For one thing, there might well be scheduling
173 delays. For another thing, many RCU implementations process 173 delays. For another thing, many RCU implementations process
174 requests in batches in order to improve efficiency, which can 174 requests in batches in order to improve efficiency, which can
175 further delay synchronize_rcu(). 175 further delay synchronize_rcu().
176 176
177 Since synchronize_rcu() is the API that must figure out when 177 Since synchronize_rcu() is the API that must figure out when
178 readers are done, its implementation is key to RCU. For RCU 178 readers are done, its implementation is key to RCU. For RCU
179 to be useful in all but the most read-intensive situations, 179 to be useful in all but the most read-intensive situations,
180 synchronize_rcu()'s overhead must also be quite small. 180 synchronize_rcu()'s overhead must also be quite small.
181 181
182 The call_rcu() API is a callback form of synchronize_rcu(), 182 The call_rcu() API is a callback form of synchronize_rcu(),
183 and is described in more detail in a later section. Instead of 183 and is described in more detail in a later section. Instead of
184 blocking, it registers a function and argument which are invoked 184 blocking, it registers a function and argument which are invoked
185 after all ongoing RCU read-side critical sections have completed. 185 after all ongoing RCU read-side critical sections have completed.
186 This callback variant is particularly useful in situations where 186 This callback variant is particularly useful in situations where
187 it is illegal to block or where update-side performance is 187 it is illegal to block or where update-side performance is
188 critically important. 188 critically important.
189 189
190 However, the call_rcu() API should not be used lightly, as use 190 However, the call_rcu() API should not be used lightly, as use
191 of the synchronize_rcu() API generally results in simpler code. 191 of the synchronize_rcu() API generally results in simpler code.
192 In addition, the synchronize_rcu() API has the nice property 192 In addition, the synchronize_rcu() API has the nice property
193 of automatically limiting update rate should grace periods 193 of automatically limiting update rate should grace periods
194 be delayed. This property results in system resilience in the face 194 be delayed. This property results in system resilience in the face
195 of denial-of-service attacks. Code using call_rcu() should limit 195 of denial-of-service attacks. Code using call_rcu() should limit
196 update rate in order to gain this same sort of resilience. See 196 update rate in order to gain this same sort of resilience. See
197 checklist.txt for some approaches to limiting the update rate. 197 checklist.txt for some approaches to limiting the update rate.
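
One possible way to limit the update rate is to cap the number of
outstanding callbacks and make the updater wait synchronously once the cap
is reached. The sketch below is an assumption made for illustration, not a
quotation from checklist.txt; TOY_MAX_PENDING and the toy_ functions are
hypothetical, and a real version would need atomic or locked counter
updates.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PENDING 4	/* hypothetical limit, chosen only for illustration */

static int toy_pending_callbacks;

/*
 * Returns true if the update may be deferred via a callback, false if the
 * caller should instead wait synchronously for a grace period.
 */
bool toy_may_defer(void)
{
	if (toy_pending_callbacks >= TOY_MAX_PENDING)
		return false;
	toy_pending_callbacks++;
	return true;
}

/* Called from the (hypothetical) callback once its grace period has ended. */
void toy_callback_done(void)
{
	toy_pending_callbacks--;
}
```

An updater would call toy_may_defer() before registering a callback and
fall back to the blocking path when it returns false, which restores the
self-limiting behavior described above.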
198 198
199 rcu_assign_pointer() 199 rcu_assign_pointer()
200 200
201 typeof(p) rcu_assign_pointer(p, typeof(p) v); 201 typeof(p) rcu_assign_pointer(p, typeof(p) v);
202 202
203 Yes, rcu_assign_pointer() -is- implemented as a macro, though it 203 Yes, rcu_assign_pointer() -is- implemented as a macro, though it
204 would be cool to be able to declare a function in this manner. 204 would be cool to be able to declare a function in this manner.
205 (Compiler experts will no doubt disagree.) 205 (Compiler experts will no doubt disagree.)
206 206
207 The updater uses this function to assign a new value to an 207 The updater uses this function to assign a new value to an
208 RCU-protected pointer, in order to safely communicate the change 208 RCU-protected pointer, in order to safely communicate the change
209 in value from the updater to the reader. This function returns 209 in value from the updater to the reader. This function returns
210 the new value, and also executes any memory-barrier instructions 210 the new value, and also executes any memory-barrier instructions
211 required for a given CPU architecture. 211 required for a given CPU architecture.
212 212
213 Perhaps just as important, it serves to document (1) which 213 Perhaps just as important, it serves to document (1) which
214 pointers are protected by RCU and (2) the point at which a 214 pointers are protected by RCU and (2) the point at which a
215 given structure becomes accessible to other CPUs. That said, 215 given structure becomes accessible to other CPUs. That said,
216 rcu_assign_pointer() is most frequently used indirectly, via 216 rcu_assign_pointer() is most frequently used indirectly, via
217 the _rcu list-manipulation primitives such as list_add_rcu(). 217 the _rcu list-manipulation primitives such as list_add_rcu().
218 218
219 rcu_dereference() 219 rcu_dereference()
220 220
221 typeof(p) rcu_dereference(p); 221 typeof(p) rcu_dereference(p);
222 222
223 Like rcu_assign_pointer(), rcu_dereference() must be implemented 223 Like rcu_assign_pointer(), rcu_dereference() must be implemented
224 as a macro. 224 as a macro.
225 225
226 The reader uses rcu_dereference() to fetch an RCU-protected 226 The reader uses rcu_dereference() to fetch an RCU-protected
227 pointer, which returns a value that may then be safely 227 pointer, which returns a value that may then be safely
228 dereferenced. Note that rcu_dereference() does not actually 228 dereferenced. Note that rcu_dereference() does not actually
229 dereference the pointer; instead, it protects the pointer for 229 dereference the pointer; instead, it protects the pointer for
230 later dereferencing. It also executes any needed memory-barrier 230 later dereferencing. It also executes any needed memory-barrier
231 instructions for a given CPU architecture. Currently, only Alpha 231 instructions for a given CPU architecture. Currently, only Alpha
232 needs memory barriers within rcu_dereference() -- on other CPUs, 232 needs memory barriers within rcu_dereference() -- on other CPUs,
233 it compiles to nothing, not even a compiler directive. 233 it compiles to nothing, not even a compiler directive.
234 234
235 Common coding practice uses rcu_dereference() to copy an 235 Common coding practice uses rcu_dereference() to copy an
236 RCU-protected pointer to a local variable, then dereferences 236 RCU-protected pointer to a local variable, then dereferences
237 this local variable, for example as follows: 237 this local variable, for example as follows:
238 238
239 p = rcu_dereference(head.next); 239 p = rcu_dereference(head.next);
240 return p->data; 240 return p->data;
241 241
242 However, in this case, one could just as easily combine these 242 However, in this case, one could just as easily combine these
243 into one statement: 243 into one statement:
244 244
245 return rcu_dereference(head.next)->data; 245 return rcu_dereference(head.next)->data;
246 246
247 If you are going to be fetching multiple fields from the 247 If you are going to be fetching multiple fields from the
248 RCU-protected structure, using the local variable is of 248 RCU-protected structure, using the local variable is of
249 course preferred. Repeated rcu_dereference() calls look 249 course preferred. Repeated rcu_dereference() calls look
250 ugly and incur unnecessary overhead on Alpha CPUs. 250 ugly and incur unnecessary overhead on Alpha CPUs.
251 251
252 Note that the value returned by rcu_dereference() is valid 252 Note that the value returned by rcu_dereference() is valid
253 only within the enclosing RCU read-side critical section. 253 only within the enclosing RCU read-side critical section.
254 For example, the following is -not- legal: 254 For example, the following is -not- legal:
255 255
256 rcu_read_lock(); 256 rcu_read_lock();
257 p = rcu_dereference(head.next); 257 p = rcu_dereference(head.next);
258 rcu_read_unlock(); 258 rcu_read_unlock();
259 x = p->address; 259 x = p->address;
260 rcu_read_lock(); 260 rcu_read_lock();
261 y = p->data; 261 y = p->data;
262 rcu_read_unlock(); 262 rcu_read_unlock();
263 263
264 Holding a reference from one RCU read-side critical section 264 Holding a reference from one RCU read-side critical section
265 to another is just as illegal as holding a reference from 265 to another is just as illegal as holding a reference from
266 one lock-based critical section to another! Similarly, 266 one lock-based critical section to another! Similarly,
267 using a reference outside of the critical section in which 267 using a reference outside of the critical section in which
268 it was acquired is just as illegal as doing so with normal 268 it was acquired is just as illegal as doing so with normal
269 locking. 269 locking.
270 270
271 As with rcu_assign_pointer(), an important function of 271 As with rcu_assign_pointer(), an important function of
272 rcu_dereference() is to document which pointers are protected by 272 rcu_dereference() is to document which pointers are protected by
273 RCU, in particular, flagging a pointer that is subject to changing 273 RCU, in particular, flagging a pointer that is subject to changing
274 at any time, including immediately after the rcu_dereference(). 274 at any time, including immediately after the rcu_dereference().
275 And, again like rcu_assign_pointer(), rcu_dereference() is 275 And, again like rcu_assign_pointer(), rcu_dereference() is
276 typically used indirectly, via the _rcu list-manipulation 276 typically used indirectly, via the _rcu list-manipulation
277 primitives, such as list_for_each_entry_rcu(). 277 primitives, such as list_for_each_entry_rcu().
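
For readers more familiar with C11 atomics, the pairing of
rcu_assign_pointer() and rcu_dereference() can be approximated in userspace
as a release store matched with an acquire load. This is a rough analogy
under stated assumptions, not the kernel's implementation: the acquire load
is a conservative stand-in for the dependency ordering that
rcu_dereference() relies on, and struct node, toy_head, toy_publish() and
toy_read() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
	long address;
	int data;
};

static _Atomic(struct node *) toy_head;

/* Rough analog of rcu_assign_pointer(): initialize first, then publish. */
void toy_publish(struct node *n, long address, int data)
{
	n->address = address;
	n->data = data;
	atomic_store_explicit(&toy_head, n, memory_order_release);
}

/*
 * Rough analog of the rcu_dereference() idiom: fetch the pointer once into
 * a local variable, then read several fields through that local.
 * Returns 0 and fills *address and *data, or -1 if nothing is published.
 */
int toy_read(long *address, int *data)
{
	struct node *p = atomic_load_explicit(&toy_head, memory_order_acquire);

	if (!p)
		return -1;
	*address = p->address;
	*data = p->data;
	return 0;
}
```

Fetching the pointer once and reading both fields through the local copy
ensures that the two values come from the same version of the structure,
even if an updater publishes a replacement in between.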
278 278
279 The following diagram shows how each API communicates among the 279 The following diagram shows how each API communicates among the
280 reader, updater, and reclaimer. 280 reader, updater, and reclaimer.
281 281
282 282
283 rcu_assign_pointer() 283 rcu_assign_pointer()
284 +--------+ 284 +--------+
285 +---------------------->| reader |---------+ 285 +---------------------->| reader |---------+
286 | +--------+ | 286 | +--------+ |
287 | | | 287 | | |
288 | | | Protect: 288 | | | Protect:
289 | | | rcu_read_lock() 289 | | | rcu_read_lock()
290 | | | rcu_read_unlock() 290 | | | rcu_read_unlock()
291 | rcu_dereference() | | 291 | rcu_dereference() | |
292 +---------+ | | 292 +---------+ | |
293 | updater |<---------------------+ | 293 | updater |<---------------------+ |
294 +---------+ V 294 +---------+ V
295 | +-----------+ 295 | +-----------+
296 +----------------------------------->| reclaimer | 296 +----------------------------------->| reclaimer |
297 +-----------+ 297 +-----------+
298 Defer: 298 Defer:
299 synchronize_rcu() & call_rcu() 299 synchronize_rcu() & call_rcu()
300 300
301 301
302 The RCU infrastructure observes the time sequence of rcu_read_lock(), 302 The RCU infrastructure observes the time sequence of rcu_read_lock(),
303 rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in 303 rcu_read_unlock(), synchronize_rcu(), and call_rcu() invocations in
304 order to determine when (1) synchronize_rcu() invocations may return 304 order to determine when (1) synchronize_rcu() invocations may return
305 to their callers and (2) call_rcu() callbacks may be invoked. Efficient 305 to their callers and (2) call_rcu() callbacks may be invoked. Efficient
306 implementations of the RCU infrastructure make heavy use of batching in 306 implementations of the RCU infrastructure make heavy use of batching in
307 order to amortize their overhead over many uses of the corresponding APIs. 307 order to amortize their overhead over many uses of the corresponding APIs.
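
Batching can be pictured as a list of callbacks that all wait for the same
grace period. The following single-threaded sketch is a toy model rather
than the kernel's data structures: struct toy_head stands in for struct
rcu_head, and toy_call(), toy_grace_period_end(), struct toy_obj and
toy_obj_reclaim() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct toy_head {
	struct toy_head *next;
	void (*func)(struct toy_head *head);
};

/* Callbacks waiting for the same (hypothetical) grace period. */
static struct toy_head *toy_batch;

/* Queue a callback instead of waiting for a grace period right away. */
void toy_call(struct toy_head *head, void (*func)(struct toy_head *head))
{
	head->func = func;
	head->next = toy_batch;
	toy_batch = head;
}

/* One grace-period end services the whole batch; returns callbacks run. */
int toy_grace_period_end(void)
{
	int n = 0;

	while (toy_batch) {
		struct toy_head *head = toy_batch;

		toy_batch = head->next;
		head->func(head);
		n++;
	}
	return n;
}

/* Example object that embeds the callback head. */
struct toy_obj {
	int freed;
	struct toy_head head;
};

/* Recover the enclosing object from its embedded head and "reclaim" it. */
void toy_obj_reclaim(struct toy_head *head)
{
	struct toy_obj *obj = (struct toy_obj *)((char *)head -
						 offsetof(struct toy_obj, head));

	obj->freed = 1;
}
```

However many callbacks are queued, the cost of detecting the end of the
grace period is paid once per batch rather than once per callback.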
308 308
309 There are no fewer than three RCU mechanisms in the Linux kernel; the 309 There are no fewer than three RCU mechanisms in the Linux kernel; the
310 diagram above shows the first one, which is by far the most commonly used. 310 diagram above shows the first one, which is by far the most commonly used.
311 The rcu_dereference() and rcu_assign_pointer() primitives are used for 311 The rcu_dereference() and rcu_assign_pointer() primitives are used for
312 all three mechanisms, but different defer and protect primitives are 312 all three mechanisms, but different defer and protect primitives are
313 used as follows: 313 used as follows:
314 314
315 Defer Protect 315 Defer Protect
316 316
317 a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock() 317 a. synchronize_rcu() rcu_read_lock() / rcu_read_unlock()
318 call_rcu() 318 call_rcu()
319 319
320 b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh() 320 b. call_rcu_bh() rcu_read_lock_bh() / rcu_read_unlock_bh()
321 321
322 c. synchronize_sched() preempt_disable() / preempt_enable() 322 c. synchronize_sched() preempt_disable() / preempt_enable()
323 local_irq_save() / local_irq_restore() 323 local_irq_save() / local_irq_restore()
324 hardirq enter / hardirq exit 324 hardirq enter / hardirq exit
325 NMI enter / NMI exit 325 NMI enter / NMI exit
326 326
327 These three mechanisms are used as follows: 327 These three mechanisms are used as follows:
328 328
329 a. RCU applied to normal data structures. 329 a. RCU applied to normal data structures.
330 330
331 b. RCU applied to networking data structures that may be subjected 331 b. RCU applied to networking data structures that may be subjected
332 to remote denial-of-service attacks. 332 to remote denial-of-service attacks.
333 333
334 c. RCU applied to scheduler and interrupt/NMI-handler tasks. 334 c. RCU applied to scheduler and interrupt/NMI-handler tasks.
335 335
336 Again, most uses will be of (a). The (b) and (c) cases are important 336 Again, most uses will be of (a). The (b) and (c) cases are important
337 for specialized uses, but are relatively uncommon. 337 for specialized uses, but are relatively uncommon.
338 338
339 339
340 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API? 340 3. WHAT ARE SOME EXAMPLE USES OF CORE RCU API?
341 341
342 This section shows a simple use of the core RCU API to protect a 342 This section shows a simple use of the core RCU API to protect a
343 global pointer to a dynamically allocated structure. More-typical 343 global pointer to a dynamically allocated structure. More-typical
344 uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt. 344 uses of RCU may be found in listRCU.txt, arrayRCU.txt, and NMI-RCU.txt.
345 345
346 struct foo { 346 struct foo {
347 int a; 347 int a;
348 char b; 348 char b;
349 long c; 349 long c;
350 }; 350 };
351 DEFINE_SPINLOCK(foo_mutex); 351 DEFINE_SPINLOCK(foo_mutex);
352 352
353 struct foo *gbl_foo; 353 struct foo *gbl_foo;
354 354
355 /* 355 /*
356 * Create a new struct foo that is the same as the one currently 356 * Create a new struct foo that is the same as the one currently
357 * pointed to by gbl_foo, except that field "a" is replaced 357 * pointed to by gbl_foo, except that field "a" is replaced
358 * with "new_a". Points gbl_foo to the new structure, and 358 * with "new_a". Points gbl_foo to the new structure, and
359 * frees up the old structure after a grace period. 359 * frees up the old structure after a grace period.
360 * 360 *
361 * Uses rcu_assign_pointer() to ensure that concurrent readers 361 * Uses rcu_assign_pointer() to ensure that concurrent readers
362 * see the initialized version of the new structure. 362 * see the initialized version of the new structure.
363 * 363 *
364 * Uses synchronize_rcu() to ensure that any readers that might 364 * Uses synchronize_rcu() to ensure that any readers that might
365 * have references to the old structure complete before freeing 365 * have references to the old structure complete before freeing
366 * the old structure. 366 * the old structure.
367 */ 367 */
368 void foo_update_a(int new_a) 368 void foo_update_a(int new_a)
369 { 369 {
370 struct foo *new_fp; 370 struct foo *new_fp;
371 struct foo *old_fp; 371 struct foo *old_fp;
372 372
373 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); 373 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
374 spin_lock(&foo_mutex); 374 spin_lock(&foo_mutex);
375 old_fp = gbl_foo; 375 old_fp = gbl_foo;
376 *new_fp = *old_fp; 376 *new_fp = *old_fp;
377 new_fp->a = new_a; 377 new_fp->a = new_a;
378 rcu_assign_pointer(gbl_foo, new_fp); 378 rcu_assign_pointer(gbl_foo, new_fp);
379 spin_unlock(&foo_mutex); 379 spin_unlock(&foo_mutex);
380 synchronize_rcu(); 380 synchronize_rcu();
381 kfree(old_fp); 381 kfree(old_fp);
382 } 382 }
383 383
384 /* 384 /*
385 * Return the value of field "a" of the current gbl_foo 385 * Return the value of field "a" of the current gbl_foo
386 * structure. Use rcu_read_lock() and rcu_read_unlock() 386 * structure. Use rcu_read_lock() and rcu_read_unlock()
387 * to ensure that the structure does not get deleted out 387 * to ensure that the structure does not get deleted out
388 * from under us, and use rcu_dereference() to ensure that 388 * from under us, and use rcu_dereference() to ensure that
389 * we see the initialized version of the structure (important 389 * we see the initialized version of the structure (important
390 * for DEC Alpha and for people reading the code). 390 * for DEC Alpha and for people reading the code).
391 */ 391 */
392 int foo_get_a(void) 392 int foo_get_a(void)
393 { 393 {
394 int retval; 394 int retval;
395 395
396 rcu_read_lock(); 396 rcu_read_lock();
397 retval = rcu_dereference(gbl_foo)->a; 397 retval = rcu_dereference(gbl_foo)->a;
398 rcu_read_unlock(); 398 rcu_read_unlock();
399 return retval; 399 return retval;
400 } 400 }
401 401
402 So, to sum up: 402 So, to sum up:
403 403
404 o Use rcu_read_lock() and rcu_read_unlock() to guard RCU 404 o Use rcu_read_lock() and rcu_read_unlock() to guard RCU
405 read-side critical sections. 405 read-side critical sections.
406 406
407 o Within an RCU read-side critical section, use rcu_dereference() 407 o Within an RCU read-side critical section, use rcu_dereference()
408 to fetch RCU-protected pointers before dereferencing them. 408 to fetch RCU-protected pointers before dereferencing them.
409 409
410 o Use some solid scheme (such as locks or semaphores) to 410 o Use some solid scheme (such as locks or semaphores) to
411 keep concurrent updates from interfering with each other. 411 keep concurrent updates from interfering with each other.
412 412
413 o Use rcu_assign_pointer() to update an RCU-protected pointer. 413 o Use rcu_assign_pointer() to update an RCU-protected pointer.
414 This primitive protects concurrent readers from the updater, 414 This primitive protects concurrent readers from the updater,
415 -not- concurrent updates from each other! You therefore still 415 -not- concurrent updates from each other! You therefore still
416 need to use locking (or something similar) to keep concurrent 416 need to use locking (or something similar) to keep concurrent
417 rcu_assign_pointer() primitives from interfering with each other. 417 rcu_assign_pointer() primitives from interfering with each other.
418 418
419 o Use synchronize_rcu() -after- removing a data element from an 419 o Use synchronize_rcu() -after- removing a data element from an
420 RCU-protected data structure, but -before- reclaiming/freeing 420 RCU-protected data structure, but -before- reclaiming/freeing
421 the data element, in order to wait for the completion of all 421 the data element, in order to wait for the completion of all
422 RCU read-side critical sections that might be referencing that 422 RCU read-side critical sections that might be referencing that
423 data item. 423 data item.
424 424
425 See checklist.txt for additional rules to follow when using RCU. 425 See checklist.txt for additional rules to follow when using RCU.
426 And again, more-typical uses of RCU may be found in listRCU.txt, 426 And again, more-typical uses of RCU may be found in listRCU.txt,
427 arrayRCU.txt, and NMI-RCU.txt. 427 arrayRCU.txt, and NMI-RCU.txt.
428 428
429 429
430 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK? 430 4. WHAT IF MY UPDATING THREAD CANNOT BLOCK?
431 431
432 In the example above, foo_update_a() blocks until a grace period elapses. 432 In the example above, foo_update_a() blocks until a grace period elapses.
433 This is quite simple, but in some cases one cannot afford to wait so 433 This is quite simple, but in some cases one cannot afford to wait so
434 long -- there might be other high-priority work to be done. 434 long -- there might be other high-priority work to be done.
435 435
436 In such cases, one uses call_rcu() rather than synchronize_rcu(). 436 In such cases, one uses call_rcu() rather than synchronize_rcu().
437 The call_rcu() API is as follows: 437 The call_rcu() API is as follows:
438 438
439 void call_rcu(struct rcu_head * head, 439 void call_rcu(struct rcu_head * head,
440 void (*func)(struct rcu_head *head)); 440 void (*func)(struct rcu_head *head));
441 441
442 This function invokes func(head) after a grace period has elapsed. 442 This function invokes func(head) after a grace period has elapsed.
443 This invocation might happen from either softirq or process context, 443 This invocation might happen from either softirq or process context,
444 so the function is not permitted to block. The foo struct needs to 444 so the function is not permitted to block. The foo struct needs to
445 have an rcu_head structure added, perhaps as follows: 445 have an rcu_head structure added, perhaps as follows:
446 446
447 struct foo { 447 struct foo {
448 int a; 448 int a;
449 char b; 449 char b;
450 long c; 450 long c;
451 struct rcu_head rcu; 451 struct rcu_head rcu;
452 }; 452 };
453 453
454 The foo_update_a() function might then be written as follows: 454 The foo_update_a() function might then be written as follows:
455 455
456 /* 456 /*
457 * Create a new struct foo that is the same as the one currently 457 * Create a new struct foo that is the same as the one currently
458 * pointed to by gbl_foo, except that field "a" is replaced 458 * pointed to by gbl_foo, except that field "a" is replaced
459 * with "new_a". Points gbl_foo to the new structure, and 459 * with "new_a". Points gbl_foo to the new structure, and
460 * frees up the old structure after a grace period. 460 * frees up the old structure after a grace period.
461 * 461 *
462 * Uses rcu_assign_pointer() to ensure that concurrent readers 462 * Uses rcu_assign_pointer() to ensure that concurrent readers
463 * see the initialized version of the new structure. 463 * see the initialized version of the new structure.
464 * 464 *
465 * Uses call_rcu() to ensure that any readers that might have 465 * Uses call_rcu() to ensure that any readers that might have
466 * references to the old structure complete before freeing the 466 * references to the old structure complete before freeing the
467 * old structure. 467 * old structure.
468 */ 468 */
469 void foo_update_a(int new_a) 469 void foo_update_a(int new_a)
470 { 470 {
471 struct foo *new_fp; 471 struct foo *new_fp;
472 struct foo *old_fp; 472 struct foo *old_fp;
473 473
474 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL); 474 new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
475 spin_lock(&foo_mutex); 475 spin_lock(&foo_mutex);
476 old_fp = gbl_foo; 476 old_fp = gbl_foo;
477 *new_fp = *old_fp; 477 *new_fp = *old_fp;
478 new_fp->a = new_a; 478 new_fp->a = new_a;
479 rcu_assign_pointer(gbl_foo, new_fp); 479 rcu_assign_pointer(gbl_foo, new_fp);
480 spin_unlock(&foo_mutex); 480 spin_unlock(&foo_mutex);
481 call_rcu(&old_fp->rcu, foo_reclaim); 481 call_rcu(&old_fp->rcu, foo_reclaim);
482 } 482 }
483 483
484 The foo_reclaim() function might appear as follows: 484 The foo_reclaim() function might appear as follows:
485 485
486 void foo_reclaim(struct rcu_head *rp) 486 void foo_reclaim(struct rcu_head *rp)
487 { 487 {
488 struct foo *fp = container_of(rp, struct foo, rcu); 488 struct foo *fp = container_of(rp, struct foo, rcu);
489 489
490 kfree(fp); 490 kfree(fp);
491 } 491 }
492 492
493 The container_of() primitive is a macro that, given a pointer into a 493 The container_of() primitive is a macro that, given a pointer into a
494 struct, the type of the struct, and the pointed-to field within the 494 struct, the type of the struct, and the pointed-to field within the
495 struct, returns a pointer to the beginning of the struct. 495 struct, returns a pointer to the beginning of the struct.
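
The pointer arithmetic behind container_of() can be tried out in userspace.
The following is a minimal sketch, not the kernel's definition: the macro
below is a simplified stand-in with no type checking, the rcu_head layout
is illustrative only, and foo_from_rcu() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel macro (no type checking). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Illustrative stand-in for the kernel's struct rcu_head. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

/*
 * Hypothetical helper: recover the enclosing struct foo from a pointer
 * to its embedded "rcu" member by subtracting the member's offset.
 */
static struct foo *foo_from_rcu(struct rcu_head *rp)
{
	assert(rp != NULL);
	return container_of(rp, struct foo, rcu);
}
```

Given a struct foo named f, foo_from_rcu(&f.rcu) yields &f, which is how
foo_reclaim() gets from the rcu_head it is handed back to the structure
that must be freed.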
496 496
497 The use of call_rcu() permits the caller of foo_update_a() to 497 The use of call_rcu() permits the caller of foo_update_a() to
498 immediately regain control, without needing to worry further about the 498 immediately regain control, without needing to worry further about the
499 old version of the newly updated element. It also clearly shows the 499 old version of the newly updated element. It also clearly shows the
500 RCU distinction between updater, namely foo_update_a(), and reclaimer, 500 RCU distinction between updater, namely foo_update_a(), and reclaimer,
501 namely foo_reclaim(). 501 namely foo_reclaim().
502 502
503 The summary of advice is the same as for the previous section, except 503 The summary of advice is the same as for the previous section, except
504 that we are now using call_rcu() rather than synchronize_rcu(): 504 that we are now using call_rcu() rather than synchronize_rcu():
505 505
506 o Use call_rcu() -after- removing a data element from an 506 o Use call_rcu() -after- removing a data element from an
507 RCU-protected data structure in order to register a callback 507 RCU-protected data structure in order to register a callback
508 function that will be invoked after the completion of all RCU 508 function that will be invoked after the completion of all RCU
509 read-side critical sections that might be referencing that 509 read-side critical sections that might be referencing that
510 data item. 510 data item.
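
The register-now, reclaim-later behavior of call_rcu() can be modeled in a
few lines of single-threaded userspace C. This is only a sketch under loud
assumptions: toy_call_rcu(), toy_grace_period_elapsed(), toy_cb_list and
toy_reclaimed are hypothetical names, there is no concurrency, and the
"grace period" is simply whenever the caller says it has elapsed. It is
not how the kernel implements call_rcu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Illustrative stand-in for the kernel's struct rcu_head. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct foo {
	int a;
	char b;
	long c;
	struct rcu_head rcu;
};

static struct rcu_head *toy_cb_list;	/* callbacks awaiting a grace period */
static int toy_reclaimed;		/* number of callbacks invoked so far */

/* Queue func(head) for later invocation.  Never blocks. */
static void toy_call_rcu(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	assert(head != NULL && func != NULL);
	head->func = func;
	head->next = toy_cb_list;
	toy_cb_list = head;
}

/* Model the end of a grace period: invoke every queued callback once. */
static void toy_grace_period_elapsed(void)
{
	struct rcu_head *list = toy_cb_list;

	toy_cb_list = NULL;
	while (list) {
		/* Fetch ->next first, since func() may free the element. */
		struct rcu_head *next = list->next;

		list->func(list);
		list = next;
	}
}

/* Reclaimer, in the same shape as foo_reclaim() above. */
static void foo_reclaim(struct rcu_head *rp)
{
	struct foo *fp = container_of(rp, struct foo, rcu);

	free(fp);
	toy_reclaimed++;
}
```

The updater returns as soon as toy_call_rcu() has queued the old structure.
The free() happens only later, in foo_reclaim(), once the modeled grace
period has ended.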
511 511
512 Again, see checklist.txt for additional rules governing the use of RCU. 512 Again, see checklist.txt for additional rules governing the use of RCU.
513 513
514 514
515 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU? 515 5. WHAT ARE SOME SIMPLE IMPLEMENTATIONS OF RCU?
516 516
517 One of the nice things about RCU is that it has extremely simple "toy" 517 One of the nice things about RCU is that it has extremely simple "toy"
518 implementations that are a good first step towards understanding the 518 implementations that are a good first step towards understanding the
519 production-quality implementations in the Linux kernel. This section 519 production-quality implementations in the Linux kernel. This section
520 presents two such "toy" implementations of RCU, one that is implemented 520 presents two such "toy" implementations of RCU, one that is implemented
521 in terms of familiar locking primitives, and another that more closely 521 in terms of familiar locking primitives, and another that more closely
522 resembles "classic" RCU. Both are way too simple for real-world use, 522 resembles "classic" RCU. Both are way too simple for real-world use,
523 lacking both functionality and performance. However, they are useful 523 lacking both functionality and performance. However, they are useful
524 in getting a feel for how RCU works. See kernel/rcupdate.c for a 524 in getting a feel for how RCU works. See kernel/rcupdate.c for a
525 production-quality implementation, and see: 525 production-quality implementation, and see:
526 526
527 http://www.rdrop.com/users/paulmck/RCU 527 http://www.rdrop.com/users/paulmck/RCU
528 528
529 for papers describing the Linux kernel RCU implementation. The OLS'01 529 for papers describing the Linux kernel RCU implementation. The OLS'01
530 and OLS'02 papers are a good introduction, and the dissertation provides 530 and OLS'02 papers are a good introduction, and the dissertation provides
531 more details on the current implementation as of early 2004. 531 more details on the current implementation as of early 2004.
532 532
533 533
534 5A. "TOY" IMPLEMENTATION #1: LOCKING 534 5A. "TOY" IMPLEMENTATION #1: LOCKING
535 535
536 This section presents a "toy" RCU implementation that is based on 536 This section presents a "toy" RCU implementation that is based on
537 familiar locking primitives. Its overhead makes it a non-starter for 537 familiar locking primitives. Its overhead makes it a non-starter for
538 real-life use, as does its lack of scalability. It is also unsuitable 538 real-life use, as does its lack of scalability. It is also unsuitable
539 for realtime use, since it allows scheduling latency to "bleed" from 539 for realtime use, since it allows scheduling latency to "bleed" from
540 one read-side critical section to another. 540 one read-side critical section to another.
541 541
542 However, it is probably the easiest implementation to relate to, so is 542 However, it is probably the easiest implementation to relate to, so is
543 a good starting point. 543 a good starting point.
544 544
545 It is extremely simple: 545 It is extremely simple:
546 546
547 static DEFINE_RWLOCK(rcu_gp_mutex); 547 static DEFINE_RWLOCK(rcu_gp_mutex);
548 548
549 void rcu_read_lock(void) 549 void rcu_read_lock(void)
550 { 550 {
551 read_lock(&rcu_gp_mutex); 551 read_lock(&rcu_gp_mutex);
552 } 552 }
553 553
554 void rcu_read_unlock(void) 554 void rcu_read_unlock(void)
555 { 555 {
556 read_unlock(&rcu_gp_mutex); 556 read_unlock(&rcu_gp_mutex);
557 } 557 }
558 558
559 void synchronize_rcu(void) 559 void synchronize_rcu(void)
560 { 560 {
561 write_lock(&rcu_gp_mutex); 561 write_lock(&rcu_gp_mutex);
562 write_unlock(&rcu_gp_mutex); 562 write_unlock(&rcu_gp_mutex);
563 } 563 }
564 564
565 [You can ignore rcu_assign_pointer() and rcu_dereference() without 565 [You can ignore rcu_assign_pointer() and rcu_dereference() without
566 missing much. But here they are anyway. And whatever you do, don't 566 missing much. But here they are anyway. And whatever you do, don't
567 forget about them when submitting patches making use of RCU!] 567 forget about them when submitting patches making use of RCU!]
568 568
569 #define rcu_assign_pointer(p, v) ({ \ 569 #define rcu_assign_pointer(p, v) ({ \
570 smp_wmb(); \ 570 smp_wmb(); \
571 (p) = (v); \ 571 (p) = (v); \
572 }) 572 })
573 573
574 #define rcu_dereference(p) ({ \ 574 #define rcu_dereference(p) ({ \
575 typeof(p) _________p1 = p; \ 575 typeof(p) _________p1 = p; \
576 smp_read_barrier_depends(); \ 576 smp_read_barrier_depends(); \
577 (_________p1); \ 577 (_________p1); \
578 }) 578 })
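
For readers more familiar with C11 than with kernel barriers, the same
publish/subscribe ordering can be approximated with <stdatomic.h>. This is
a rough analogy only, not the kernel's implementation: gbl_foo_c11,
publish_foo() and read_foo_a() are hypothetical names, and the acquire load
used here is stronger than the dependency ordering that rcu_dereference()
relies on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
	char b;
	long c;
};

/* Hypothetical global pointer, NULL until something is published. */
static _Atomic(struct foo *) gbl_foo_c11;

/*
 * Updater side, roughly rcu_assign_pointer(): initialize the structure
 * first, then publish the pointer with a release store so that the
 * initialization cannot be reordered after the publication.
 */
static void publish_foo(struct foo *new_fp, int new_a)
{
	assert(new_fp != NULL);
	new_fp->a = new_a;
	atomic_store_explicit(&gbl_foo_c11, new_fp, memory_order_release);
}

/*
 * Reader side, roughly rcu_dereference(): fetch the pointer once, with
 * ordering that guarantees the fields read through it are initialized.
 * Returns -1 if nothing has been published yet.
 */
static int read_foo_a(void)
{
	struct foo *fp;

	fp = atomic_load_explicit(&gbl_foo_c11, memory_order_acquire);
	return fp ? fp->a : -1;
}
```

As with the macros above, this orders readers against the updater only.
Concurrent callers of publish_foo() would still need a lock of their own.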
579 579
580 580
581 The rcu_read_lock() and rcu_read_unlock() primitives read-acquire 581 The rcu_read_lock() and rcu_read_unlock() primitives read-acquire
582 and release a global reader-writer lock. The synchronize_rcu() 582 and release a global reader-writer lock. The synchronize_rcu()
583 primitive write-acquires this same lock, then immediately releases 583 primitive write-acquires this same lock, then immediately releases
584 it. This means that once synchronize_rcu() exits, all RCU read-side 584 it. This means that once synchronize_rcu() exits, all RCU read-side
585 critical sections that were in progress before synchronize_rcu() was 585 critical sections that were in progress before synchronize_rcu() was
586 called are guaranteed to have completed -- there is no way that 586 called are guaranteed to have completed -- there is no way that
587 synchronize_rcu() would have been able to write-acquire the lock 587 synchronize_rcu() would have been able to write-acquire the lock
588 otherwise. 588 otherwise.
589 589
590 It is possible to nest rcu_read_lock(), since reader-writer locks may 590 It is possible to nest rcu_read_lock(), since reader-writer locks may
591 be recursively acquired. Note also that rcu_read_lock() is immune 591 be recursively acquired. Note also that rcu_read_lock() is immune
592 from deadlock (an important property of RCU). The reason for this is 592 from deadlock (an important property of RCU). The reason for this is
593 that the only thing that can block rcu_read_lock() is a synchronize_rcu(). 593 that the only thing that can block rcu_read_lock() is a synchronize_rcu().
594 But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex, 594 But synchronize_rcu() does not acquire any locks while holding rcu_gp_mutex,
595 so there can be no deadlock cycle. 595 so there can be no deadlock cycle.
596 596
597 Quick Quiz #1: Why is this argument naive? How could a deadlock 597 Quick Quiz #1: Why is this argument naive? How could a deadlock
598 occur when using this algorithm in a real-world Linux 598 occur when using this algorithm in a real-world Linux
599 kernel? How could this deadlock be avoided? 599 kernel? How could this deadlock be avoided?
600 600
601 601
602 5B. "TOY" EXAMPLE #2: CLASSIC RCU 602 5B. "TOY" EXAMPLE #2: CLASSIC RCU
603 603
604 This section presents a "toy" RCU implementation that is based on 604 This section presents a "toy" RCU implementation that is based on
605 "classic RCU". It is also short on performance (but only for updates) and 605 "classic RCU". It is also short on performance (but only for updates) and
606 on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT 606 on features such as hotplug CPU and the ability to run in CONFIG_PREEMPT
607 kernels. The definitions of rcu_dereference() and rcu_assign_pointer() 607 kernels. The definitions of rcu_dereference() and rcu_assign_pointer()
608 are the same as those shown in the preceding section, so they are omitted. 608 are the same as those shown in the preceding section, so they are omitted.
609 609
610 void rcu_read_lock(void) { } 610 void rcu_read_lock(void) { }
611 611
612 void rcu_read_unlock(void) { } 612 void rcu_read_unlock(void) { }
613 613
614 void synchronize_rcu(void) 614 void synchronize_rcu(void)
615 { 615 {
616 int cpu; 616 int cpu;
617 617
618 for_each_possible_cpu(cpu) 618 for_each_possible_cpu(cpu)
619 run_on(cpu); 619 run_on(cpu);
620 } 620 }
621 621
622 Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing. 622 Note that rcu_read_lock() and rcu_read_unlock() do absolutely nothing.
623 This is the great strength of classic RCU in a non-preemptive kernel: 623 This is the great strength of classic RCU in a non-preemptive kernel:
624 read-side overhead is precisely zero, at least on non-Alpha CPUs. 624 read-side overhead is precisely zero, at least on non-Alpha CPUs.
625 And there is absolutely no way that rcu_read_lock() can possibly 625 And there is absolutely no way that rcu_read_lock() can possibly
626 participate in a deadlock cycle! 626 participate in a deadlock cycle!
627 627
628 The implementation of synchronize_rcu() simply schedules itself on each 628 The implementation of synchronize_rcu() simply schedules itself on each
629 CPU in turn. The run_on() primitive can be implemented straightforwardly 629 CPU in turn. The run_on() primitive can be implemented straightforwardly
630 in terms of the sched_setaffinity() primitive. Of course, a somewhat less 630 in terms of the sched_setaffinity() primitive. Of course, a somewhat less
631 "toy" implementation would restore the affinity upon completion rather 631 "toy" implementation would restore the affinity upon completion rather
632 than just leaving all tasks running on the last CPU, but when I said 632 than just leaving all tasks running on the last CPU, but when I said
633 "toy", I meant -toy-! 633 "toy", I meant -toy-!
634 634
635 So how the heck is this supposed to work??? 635 So how the heck is this supposed to work???
636 636
637 Remember that it is illegal to block while in an RCU read-side critical 637 Remember that it is illegal to block while in an RCU read-side critical
638 section. Therefore, if a given CPU executes a context switch, we know 638 section. Therefore, if a given CPU executes a context switch, we know
639 that it must have completed all preceding RCU read-side critical sections. 639 that it must have completed all preceding RCU read-side critical sections.
640 Once -all- CPUs have executed a context switch, then -all- preceding 640 Once -all- CPUs have executed a context switch, then -all- preceding
641 RCU read-side critical sections will have completed. 641 RCU read-side critical sections will have completed.
642 642
643 So, suppose that we remove a data item from its structure and then invoke 643 So, suppose that we remove a data item from its structure and then invoke
644 synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed 644 synchronize_rcu(). Once synchronize_rcu() returns, we are guaranteed
645 that there are no RCU read-side critical sections holding a reference 645 that there are no RCU read-side critical sections holding a reference
646 to that data item, so we can safely reclaim it. 646 to that data item, so we can safely reclaim it.
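
The context-switch argument above can be modeled with per-CPU counters.
The sketch below is a single-threaded userspace simulation under stated
assumptions: TOY_NR_CPUS, toy_context_switch(), toy_gp_start() and
toy_gp_done() are hypothetical names, and instead of forcing a context
switch on each CPU as run_on() does, it merely observes whether every CPU
has switched since the grace period began.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Per-CPU count of context switches, bumped by the (simulated) scheduler. */
static unsigned long toy_ctxt_switches[TOY_NR_CPUS];

/* Snapshot of the counters taken when the grace period started. */
static unsigned long toy_snapshot[TOY_NR_CPUS];

/* Simulate a context switch on the given CPU. */
static void toy_context_switch(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_ctxt_switches[cpu]++;
}

/* Begin a grace period: remember where every CPU's counter stands. */
static void toy_gp_start(void)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		toy_snapshot[cpu] = toy_ctxt_switches[cpu];
}

/*
 * The grace period is over once -every- CPU has context-switched at
 * least once since toy_gp_start().  Readers cannot block, so any
 * read-side critical section that was running at the start has ended.
 */
static bool toy_gp_done(void)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_ctxt_switches[cpu] == toy_snapshot[cpu])
			return false;
	return true;
}
```

If CPUs 0 through 2 switch but CPU 3 does not, toy_gp_done() keeps
returning false, because a reader might still be running on CPU 3. One
context switch there completes the grace period.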
647 647
648 Quick Quiz #2: Give an example where Classic RCU's read-side 648 Quick Quiz #2: Give an example where Classic RCU's read-side
649 overhead is -negative-. 649 overhead is -negative-.
650 650
651 Quick Quiz #3: If it is illegal to block in an RCU read-side 651 Quick Quiz #3: If it is illegal to block in an RCU read-side
652 critical section, what the heck do you do in 652 critical section, what the heck do you do in
653 PREEMPT_RT, where normal spinlocks can block??? 653 PREEMPT_RT, where normal spinlocks can block???
654 654
655 655
656 6. ANALOGY WITH READER-WRITER LOCKING 656 6. ANALOGY WITH READER-WRITER LOCKING
657 657
658 Although RCU can be used in many different ways, a very common use of 658 Although RCU can be used in many different ways, a very common use of
659 RCU is analogous to reader-writer locking. The following unified 659 RCU is analogous to reader-writer locking. The following unified
660 diff shows how closely related RCU and reader-writer locking can be. 660 diff shows how closely related RCU and reader-writer locking can be.
661 661
662 @@ -13,15 +14,15 @@ 662 @@ -13,15 +14,15 @@
663 struct list_head *lp; 663 struct list_head *lp;
664 struct el *p; 664 struct el *p;
665 665
666 - read_lock(); 666 - read_lock();
667 - list_for_each_entry(p, head, lp) { 667 - list_for_each_entry(p, head, lp) {
668 + rcu_read_lock(); 668 + rcu_read_lock();
669 + list_for_each_entry_rcu(p, head, lp) { 669 + list_for_each_entry_rcu(p, head, lp) {
670 if (p->key == key) { 670 if (p->key == key) {
671 *result = p->data; 671 *result = p->data;
672 - read_unlock(); 672 - read_unlock();
673 + rcu_read_unlock(); 673 + rcu_read_unlock();
674 return 1; 674 return 1;
675 } 675 }
676 } 676 }
677 - read_unlock(); 677 - read_unlock();
678 + rcu_read_unlock(); 678 + rcu_read_unlock();
679 return 0; 679 return 0;
680 } 680 }
681 681
682 @@ -29,15 +30,16 @@ 682 @@ -29,15 +30,16 @@
683 { 683 {
684 struct el *p; 684 struct el *p;
685 685
686 - write_lock(&listmutex); 686 - write_lock(&listmutex);
687 + spin_lock(&listmutex); 687 + spin_lock(&listmutex);
688 list_for_each_entry(p, head, lp) { 688 list_for_each_entry(p, head, lp) {
689 if (p->key == key) { 689 if (p->key == key) {
690 - list_del(&p->list); 690 - list_del(&p->list);
691 - write_unlock(&listmutex); 691 - write_unlock(&listmutex);
692 + list_del_rcu(&p->list); 692 + list_del_rcu(&p->list);
693 + spin_unlock(&listmutex); 693 + spin_unlock(&listmutex);
694 + synchronize_rcu(); 694 + synchronize_rcu();
695 kfree(p); 695 kfree(p);
696 return 1; 696 return 1;
697 } 697 }
698 } 698 }
699 - write_unlock(&listmutex); 699 - write_unlock(&listmutex);
700 + spin_unlock(&listmutex); 700 + spin_unlock(&listmutex);
701 return 0; 701 return 0;
702 } 702 }
703 703
704 Or, for those who prefer a side-by-side listing: 704 Or, for those who prefer a side-by-side listing:
705 705
706 1 struct el { 1 struct el { 706 1 struct el { 1 struct el {
707 2 struct list_head list; 2 struct list_head list; 707 2 struct list_head list; 2 struct list_head list;
708 3 long key; 3 long key; 708 3 long key; 3 long key;
709 4 spinlock_t mutex; 4 spinlock_t mutex; 709 4 spinlock_t mutex; 4 spinlock_t mutex;
710 5 int data; 5 int data; 710 5 int data; 5 int data;
711 6 /* Other data fields */ 6 /* Other data fields */ 711 6 /* Other data fields */ 6 /* Other data fields */
712 7 }; 7 }; 712 7 }; 7 };
713 8 spinlock_t listmutex; 8 spinlock_t listmutex; 713 8 spinlock_t listmutex; 8 spinlock_t listmutex;
714 9 struct el head; 9 struct el head; 714 9 struct el head; 9 struct el head;
715 715
716 1 int search(long key, int *result) 1 int search(long key, int *result) 716 1 int search(long key, int *result) 1 int search(long key, int *result)
717 2 { 2 { 717 2 { 2 {
718 3 struct list_head *lp; 3 struct list_head *lp; 718 3 struct list_head *lp; 3 struct list_head *lp;
719 4 struct el *p; 4 struct el *p; 719 4 struct el *p; 4 struct el *p;
720 5 5 720 5 5
721 6 read_lock(); 6 rcu_read_lock(); 721 6 read_lock(); 6 rcu_read_lock();
722 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) { 722 7 list_for_each_entry(p, head, lp) { 7 list_for_each_entry_rcu(p, head, lp) {
723 8 if (p->key == key) { 8 if (p->key == key) { 723 8 if (p->key == key) { 8 if (p->key == key) {
724 9 *result = p->data; 9 *result = p->data; 724 9 *result = p->data; 9 *result = p->data;
725 10 read_unlock(); 10 rcu_read_unlock(); 725 10 read_unlock(); 10 rcu_read_unlock();
726 11 return 1; 11 return 1; 726 11 return 1; 11 return 1;
727 12 } 12 } 727 12 } 12 }
728 13 } 13 } 728 13 } 13 }
729 14 read_unlock(); 14 rcu_read_unlock(); 729 14 read_unlock(); 14 rcu_read_unlock();
730 15 return 0; 15 return 0; 730 15 return 0; 15 return 0;
731 16 } 16 } 731 16 } 16 }
732 732
733 1 int delete(long key) 1 int delete(long key) 733 1 int delete(long key) 1 int delete(long key)
734 2 { 2 { 734 2 { 2 {
735 3 struct el *p; 3 struct el *p; 735 3 struct el *p; 3 struct el *p;
736 4 4 736 4 4
737 5 write_lock(&listmutex); 5 spin_lock(&listmutex); 737 5 write_lock(&listmutex); 5 spin_lock(&listmutex);
738 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) { 738 6 list_for_each_entry(p, head, lp) { 6 list_for_each_entry(p, head, lp) {
739 7 if (p->key == key) { 7 if (p->key == key) { 739 7 if (p->key == key) { 7 if (p->key == key) {
740 8 list_del(&p->list); 8 list_del_rcu(&p->list); 740 8 list_del(&p->list); 8 list_del_rcu(&p->list);
741 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex); 741 9 write_unlock(&listmutex); 9 spin_unlock(&listmutex);
742 10 synchronize_rcu(); 742 10 synchronize_rcu();
743 10 kfree(p); 11 kfree(p); 743 10 kfree(p); 11 kfree(p);
744 11 return 1; 12 return 1; 744 11 return 1; 12 return 1;
745 12 } 13 } 745 12 } 13 }
746 13 } 14 } 746 13 } 14 }
747 14 write_unlock(&listmutex); 15 spin_unlock(&listmutex); 747 14 write_unlock(&listmutex); 15 spin_unlock(&listmutex);
748 15 return 0; 16 return 0; 748 15 return 0; 16 return 0;
749 16 } 17 } 749 16 } 17 }
750 750
751 Either way, the differences are quite small. Read-side locking moves 751 Either way, the differences are quite small. Read-side locking moves
752 to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from 752 to rcu_read_lock() and rcu_read_unlock(), update-side locking moves from
753 from a reader-writer lock to a simple spinlock, and a synchronize_rcu() 753 a reader-writer lock to a simple spinlock, and a synchronize_rcu()
754 precedes the kfree(). 754 precedes the kfree().
755 755
756 However, there is one potential catch: the read-side and update-side 756 However, there is one potential catch: the read-side and update-side
757 critical sections can now run concurrently. In many cases, this will 757 critical sections can now run concurrently. In many cases, this will
758 not be a problem, but it is necessary to check carefully regardless. 758 not be a problem, but it is necessary to check carefully regardless.
759 For example, if multiple independent list updates must be seen as 759 For example, if multiple independent list updates must be seen as
760 a single atomic update, converting to RCU will require special care. 760 a single atomic update, converting to RCU will require special care.
761 761
762 Also, the presence of synchronize_rcu() means that the RCU version of 762 Also, the presence of synchronize_rcu() means that the RCU version of
763 delete() can now block. If this is a problem, there is a callback-based 763 delete() can now block. If this is a problem, there is a callback-based
764 mechanism that never blocks, namely call_rcu(), that can be used in 764 mechanism that never blocks, namely call_rcu(), that can be used in
765 place of synchronize_rcu(). 765 place of synchronize_rcu().
766 766
767 767
768 7. FULL LIST OF RCU APIs 768 7. FULL LIST OF RCU APIs
769 769
770 The RCU APIs are documented in docbook-format header comments in the 770 The RCU APIs are documented in docbook-format header comments in the
771 Linux-kernel source code, but it helps to have a full list of the 771 Linux-kernel source code, but it helps to have a full list of the
772 APIs, since there does not appear to be a way to categorize them 772 APIs, since there does not appear to be a way to categorize them
773 in docbook. Here is the list, by category. 773 in docbook. Here is the list, by category.
774 774
775 Markers for RCU read-side critical sections: 775 Markers for RCU read-side critical sections:
776 776
777 rcu_read_lock 777 rcu_read_lock
778 rcu_read_unlock 778 rcu_read_unlock
779 rcu_read_lock_bh 779 rcu_read_lock_bh
780 rcu_read_unlock_bh 780 rcu_read_unlock_bh
781 781
782 RCU pointer/list traversal: 782 RCU pointer/list traversal:
783 783
784 rcu_dereference 784 rcu_dereference
785 list_for_each_rcu (to be deprecated in favor of 785 list_for_each_rcu (to be deprecated in favor of
786 list_for_each_entry_rcu) 786 list_for_each_entry_rcu)
787 list_for_each_entry_rcu 787 list_for_each_entry_rcu
788 list_for_each_continue_rcu (to be deprecated in favor of new 788 list_for_each_continue_rcu (to be deprecated in favor of new
789 list_for_each_entry_continue_rcu) 789 list_for_each_entry_continue_rcu)
790 hlist_for_each_entry_rcu 790 hlist_for_each_entry_rcu
791 791
792 RCU pointer update: 792 RCU pointer update:
793 793
794 rcu_assign_pointer 794 rcu_assign_pointer
795 list_add_rcu 795 list_add_rcu
796 list_add_tail_rcu 796 list_add_tail_rcu
797 list_del_rcu 797 list_del_rcu
798 list_replace_rcu 798 list_replace_rcu
799 hlist_del_rcu 799 hlist_del_rcu
800 hlist_add_head_rcu 800 hlist_add_head_rcu
801 801
802 RCU grace period: 802 RCU grace period:
803 803
804 synchronize_net 804 synchronize_net
805 synchronize_sched 805 synchronize_sched
806 synchronize_rcu 806 synchronize_rcu
807 call_rcu 807 call_rcu
808 call_rcu_bh 808 call_rcu_bh
809 809
810 See the comment headers in the source code (or the docbook generated 810 See the comment headers in the source code (or the docbook generated
811 from them) for more information. 811 from them) for more information.
812 812
813 813
814 8. ANSWERS TO QUICK QUIZZES 814 8. ANSWERS TO QUICK QUIZZES
815 815
816 Quick Quiz #1: Why is this argument naive? How could a deadlock 816 Quick Quiz #1: Why is this argument naive? How could a deadlock
817 occur when using this algorithm in a real-world Linux 817 occur when using this algorithm in a real-world Linux
818 kernel? [Referring to the lock-based "toy" RCU 818 kernel? [Referring to the lock-based "toy" RCU
819 algorithm.] 819 algorithm.]
820 820
821 Answer: Consider the following sequence of events: 821 Answer: Consider the following sequence of events:
822 822
823 1. CPU 0 acquires some unrelated lock, call it 823 1. CPU 0 acquires some unrelated lock, call it
824 "problematic_lock", disabling irq via 824 "problematic_lock", disabling irq via
825 spin_lock_irqsave(). 825 spin_lock_irqsave().
826 826
827 2. CPU 1 enters synchronize_rcu(), write-acquiring 827 2. CPU 1 enters synchronize_rcu(), write-acquiring
828 rcu_gp_mutex. 828 rcu_gp_mutex.
829 829
830 3. CPU 0 enters rcu_read_lock(), but must wait 830 3. CPU 0 enters rcu_read_lock(), but must wait
831 because CPU 1 holds rcu_gp_mutex. 831 because CPU 1 holds rcu_gp_mutex.
832 832
833 4. CPU 1 is interrupted, and the irq handler 833 4. CPU 1 is interrupted, and the irq handler
834 attempts to acquire problematic_lock. 834 attempts to acquire problematic_lock.
835 835
836 The system is now deadlocked. 836 The system is now deadlocked.
837 837
838 One way to avoid this deadlock is to use an approach like 838 One way to avoid this deadlock is to use an approach like
839 that of CONFIG_PREEMPT_RT, where all normal spinlocks 839 that of CONFIG_PREEMPT_RT, where all normal spinlocks
840 become blocking locks, and all irq handlers execute in 840 become blocking locks, and all irq handlers execute in
841 the context of special tasks. In this case, in step 4 841 the context of special tasks. In this case, in step 4
842 above, the irq handler would block, allowing CPU 1 to 842 above, the irq handler would block, allowing CPU 1 to
843 release rcu_gp_mutex, avoiding the deadlock. 843 release rcu_gp_mutex, avoiding the deadlock.
844 844
845 Even in the absence of deadlock, this RCU implementation 845 Even in the absence of deadlock, this RCU implementation
846 allows latency to "bleed" from readers to other 846 allows latency to "bleed" from readers to other
847 readers through synchronize_rcu(). To see this, 847 readers through synchronize_rcu(). To see this,
848 consider task A in an RCU read-side critical section 848 consider task A in an RCU read-side critical section
849 (thus read-holding rcu_gp_mutex), task B blocked 849 (thus read-holding rcu_gp_mutex), task B blocked
850 attempting to write-acquire rcu_gp_mutex, and 850 attempting to write-acquire rcu_gp_mutex, and
851 task C blocked in rcu_read_lock() attempting to 851 task C blocked in rcu_read_lock() attempting to
852 read-acquire rcu_gp_mutex. Task A's RCU read-side 852 read-acquire rcu_gp_mutex. Task A's RCU read-side
853 latency is holding up task C, albeit indirectly via 853 latency is holding up task C, albeit indirectly via
854 task B. 854 task B.
855 855
856 Realtime RCU implementations therefore use a counter-based 856 Realtime RCU implementations therefore use a counter-based
857 approach where tasks in RCU read-side critical sections 857 approach where tasks in RCU read-side critical sections
858 cannot be blocked by tasks executing synchronize_rcu(). 858 cannot be blocked by tasks executing synchronize_rcu().
859 859
860 Quick Quiz #2: Give an example where Classic RCU's read-side 860 Quick Quiz #2: Give an example where Classic RCU's read-side
861 overhead is -negative-. 861 overhead is -negative-.
862 862
863 Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT 863 Answer: Imagine a single-CPU system with a non-CONFIG_PREEMPT
864 kernel where a routing table is used by process-context 864 kernel where a routing table is used by process-context
865 code, but can be updated by irq-context code (for example, 865 code, but can be updated by irq-context code (for example,
866 by an "ICMP REDIRECT" packet). The usual way of handling 866 by an "ICMP REDIRECT" packet). The usual way of handling
867 this would be to have the process-context code disable 867 this would be to have the process-context code disable
868 interrupts while searching the routing table. Use of 868 interrupts while searching the routing table. Use of
869 RCU allows such interrupt-disabling to be dispensed with. 869 RCU allows such interrupt-disabling to be dispensed with.
870 Thus, without RCU, you pay the cost of disabling interrupts, 870 Thus, without RCU, you pay the cost of disabling interrupts,
871 and with RCU you don't. 871 and with RCU you don't.
872 872
873 One can argue that the overhead of RCU in this 873 One can argue that the overhead of RCU in this
874 case is negative with respect to the single-CPU 874 case is negative with respect to the single-CPU
875 interrupt-disabling approach. Others might argue that 875 interrupt-disabling approach. Others might argue that
876 the overhead of RCU is merely zero, and that replacing 876 the overhead of RCU is merely zero, and that replacing
877 the positive overhead of the interrupt-disabling scheme 877 the positive overhead of the interrupt-disabling scheme
878 with the zero-overhead RCU scheme does not constitute 878 with the zero-overhead RCU scheme does not constitute
879 negative overhead. 879 negative overhead.
880 880
881 In real life, of course, things are more complex. But 881 In real life, of course, things are more complex. But
882 even the theoretical possibility of negative overhead for 882 even the theoretical possibility of negative overhead for
883 a synchronization primitive is a bit unexpected. ;-) 883 a synchronization primitive is a bit unexpected. ;-)
884 884
885 Quick Quiz #3: If it is illegal to block in an RCU read-side 885 Quick Quiz #3: If it is illegal to block in an RCU read-side
886 critical section, what the heck do you do in 886 critical section, what the heck do you do in
887 PREEMPT_RT, where normal spinlocks can block??? 887 PREEMPT_RT, where normal spinlocks can block???
888 888
889 Answer: Just as PREEMPT_RT permits preemption of spinlock 889 Answer: Just as PREEMPT_RT permits preemption of spinlock
890 critical sections, it permits preemption of RCU 890 critical sections, it permits preemption of RCU
891 read-side critical sections. It also permits 891 read-side critical sections. It also permits
892 spinlocks blocking while in RCU read-side critical 892 spinlocks blocking while in RCU read-side critical
893 sections. 893 sections.
894 894
895 Why the apparent inconsistency? Because it is 895 Why the apparent inconsistency? Because it is
896 possible to use priority boosting to keep the RCU 896 possible to use priority boosting to keep the RCU
897 grace periods short if need be (for example, if running 897 grace periods short if need be (for example, if running
898 short of memory). In contrast, if blocking waiting 898 short of memory). In contrast, if blocking waiting
899 for (say) network reception, there is no way to know 899 for (say) network reception, there is no way to know
900 what should be boosted. Especially given that the 900 what should be boosted. Especially given that the
901 process we need to boost might well be a human being 901 process we need to boost might well be a human being
902 who just went out for a pizza or something. And although 902 who just went out for a pizza or something. And although
903 a computer-operated cattle prod might arouse serious 903 a computer-operated cattle prod might arouse serious
904 interest, it might also provoke serious objections. 904 interest, it might also provoke serious objections.
905 Besides, how does the computer know what pizza parlor 905 Besides, how does the computer know what pizza parlor
906 the human being went to??? 906 the human being went to???
907 907
908 908
909 ACKNOWLEDGEMENTS 909 ACKNOWLEDGEMENTS
910 910
911 My thanks to the people who helped make this human-readable, including 911 My thanks to the people who helped make this human-readable, including
912 Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern. 912 Jon Walpole, Josh Triplett, Serge Hallyn, Suzanne Wood, and Alan Stern.
913 913
914 914
915 For more information, see http://www.rdrop.com/users/paulmck/RCU. 915 For more information, see http://www.rdrop.com/users/paulmck/RCU.
916 916
Documentation/block/biodoc.txt
1 Notes on the Generic Block Layer Rewrite in Linux 2.5 1 Notes on the Generic Block Layer Rewrite in Linux 2.5
2 ===================================================== 2 =====================================================
3 3
4 Notes Written on Jan 15, 2002: 4 Notes Written on Jan 15, 2002:
5 Jens Axboe <axboe@suse.de> 5 Jens Axboe <axboe@suse.de>
6 Suparna Bhattacharya <suparna@in.ibm.com> 6 Suparna Bhattacharya <suparna@in.ibm.com>
7 7
8 Last Updated May 2, 2002 8 Last Updated May 2, 2002
9 September 2003: Updated I/O Scheduler portions 9 September 2003: Updated I/O Scheduler portions
10 Nick Piggin <piggin@cyberone.com.au> 10 Nick Piggin <piggin@cyberone.com.au>
11 11
12 Introduction: 12 Introduction:
13 13
14 These are some notes describing some aspects of the 2.5 block layer in the 14 These are some notes describing some aspects of the 2.5 block layer in the
15 context of the bio rewrite. The idea is to bring out some of the key 15 context of the bio rewrite. The idea is to bring out some of the key
16 changes and a glimpse of the rationale behind those changes. 16 changes and a glimpse of the rationale behind those changes.
17 17
18 Please mail corrections & suggestions to suparna@in.ibm.com. 18 Please mail corrections & suggestions to suparna@in.ibm.com.
19 19
20 Credits: 20 Credits:
21 --------- 21 ---------
22 22
23 2.5 bio rewrite: 23 2.5 bio rewrite:
24 Jens Axboe <axboe@suse.de> 24 Jens Axboe <axboe@suse.de>
25 25
26 Many aspects of the generic block layer redesign were driven by and evolved 26 Many aspects of the generic block layer redesign were driven by and evolved
27 over discussions, prior patches and the collective experience of several 27 over discussions, prior patches and the collective experience of several
28 people. See sections 8 and 9 for a list of some related references. 28 people. See sections 8 and 9 for a list of some related references.
29 29
30 The following people helped with review comments and inputs for this 30 The following people helped with review comments and inputs for this
31 document: 31 document:
32 Christoph Hellwig <hch@infradead.org> 32 Christoph Hellwig <hch@infradead.org>
33 Arjan van de Ven <arjanv@redhat.com> 33 Arjan van de Ven <arjanv@redhat.com>
34 Randy Dunlap <rdunlap@xenotime.net> 34 Randy Dunlap <rdunlap@xenotime.net>
35 Andre Hedrick <andre@linux-ide.org> 35 Andre Hedrick <andre@linux-ide.org>
36 36
37 The following people helped with fixes/contributions to the bio patches 37 The following people helped with fixes/contributions to the bio patches
38 while it was still work-in-progress: 38 while it was still work-in-progress:
39 David S. Miller <davem@redhat.com> 39 David S. Miller <davem@redhat.com>
40 40
41 41
42 Description of Contents: 42 Description of Contents:
43 ------------------------ 43 ------------------------
44 44
45 1. Scope for tuning of logic to various needs 45 1. Scope for tuning of logic to various needs
46 1.1 Tuning based on device or low level driver capabilities 46 1.1 Tuning based on device or low level driver capabilities
47 - Per-queue parameters 47 - Per-queue parameters
48 - Highmem I/O support 48 - Highmem I/O support
49 - I/O scheduler modularization 49 - I/O scheduler modularization
50 1.2 Tuning based on high level requirements/capabilities 50 1.2 Tuning based on high level requirements/capabilities
51 1.2.1 I/O Barriers 51 1.2.1 I/O Barriers
52 1.2.2 Request Priority/Latency 52 1.2.2 Request Priority/Latency
53 1.3 Direct access/bypass to lower layers for diagnostics and special 53 1.3 Direct access/bypass to lower layers for diagnostics and special
54 device operations 54 device operations
55 1.3.1 Pre-built commands 55 1.3.1 Pre-built commands
56 2. New flexible and generic but minimalist i/o structure or descriptor 56 2. New flexible and generic but minimalist i/o structure or descriptor
57 (instead of using buffer heads at the i/o layer) 57 (instead of using buffer heads at the i/o layer)
58 2.1 Requirements/Goals addressed 58 2.1 Requirements/Goals addressed
59 2.2 The bio struct in detail (multi-page io unit) 59 2.2 The bio struct in detail (multi-page io unit)
60 2.3 Changes in the request structure 60 2.3 Changes in the request structure
61 3. Using bios 61 3. Using bios
62 3.1 Setup/teardown (allocation, splitting) 62 3.1 Setup/teardown (allocation, splitting)
63 3.2 Generic bio helper routines 63 3.2 Generic bio helper routines
64 3.2.1 Traversing segments and completion units in a request 64 3.2.1 Traversing segments and completion units in a request
65 3.2.2 Setting up DMA scatterlists 65 3.2.2 Setting up DMA scatterlists
66 3.2.3 I/O completion 66 3.2.3 I/O completion
67 3.2.4 Implications for drivers that do not interpret bios (don't handle 67 3.2.4 Implications for drivers that do not interpret bios (don't handle
68 multiple segments) 68 multiple segments)
69 3.2.5 Request command tagging 69 3.2.5 Request command tagging
70 3.3 I/O submission 70 3.3 I/O submission
71 4. The I/O scheduler 71 4. The I/O scheduler
72 5. Scalability related changes 72 5. Scalability related changes
73 5.1 Granular locking: Removal of io_request_lock 73 5.1 Granular locking: Removal of io_request_lock
74 5.2 Prepare for transition to 64 bit sector_t 74 5.2 Prepare for transition to 64 bit sector_t
75 6. Other Changes/Implications 75 6. Other Changes/Implications
76 6.1 Partition re-mapping handled by the generic block layer 76 6.1 Partition re-mapping handled by the generic block layer
77 7. A few tips on migration of older drivers 77 7. A few tips on migration of older drivers
78 8. A list of prior/related/impacted patches/ideas 78 8. A list of prior/related/impacted patches/ideas
79 9. Other References/Discussion Threads 79 9. Other References/Discussion Threads
80 80
81 --------------------------------------------------------------------------- 81 ---------------------------------------------------------------------------
82 82
83 Bio Notes 83 Bio Notes
84 -------- 84 --------
85 85
86 Let us discuss the changes in the context of how some overall goals for the 86 Let us discuss the changes in the context of how some overall goals for the
87 block layer are addressed. 87 block layer are addressed.
88 88
89 1. Scope for tuning the generic logic to satisfy various requirements 89 1. Scope for tuning the generic logic to satisfy various requirements
90 90
91 The block layer design supports adaptable abstractions to handle common 91 The block layer design supports adaptable abstractions to handle common
92 processing with the ability to tune the logic to an appropriate extent 92 processing with the ability to tune the logic to an appropriate extent
93 depending on the nature of the device and the requirements of the caller. 93 depending on the nature of the device and the requirements of the caller.
94 One of the objectives of the rewrite was to increase the degree of tunability 94 One of the objectives of the rewrite was to increase the degree of tunability
95 and to enable higher level code to utilize underlying device/driver 95 and to enable higher level code to utilize underlying device/driver
96 capabilities to the maximum extent for better i/o performance. This is 96 capabilities to the maximum extent for better i/o performance. This is
97 important especially in the light of ever improving hardware capabilities 97 important especially in the light of ever improving hardware capabilities
98 and application/middleware software designed to take advantage of these 98 and application/middleware software designed to take advantage of these
99 capabilities. 99 capabilities.
100 100
101 1.1 Tuning based on low level device / driver capabilities 101 1.1 Tuning based on low level device / driver capabilities
102 102
103 Sophisticated devices with large built-in caches, intelligent i/o scheduling 103 Sophisticated devices with large built-in caches, intelligent i/o scheduling
104 optimizations, high memory DMA support, etc may find some of the 104 optimizations, high memory DMA support, etc may find some of the
105 generic processing an overhead, while for less capable devices the 105 generic processing an overhead, while for less capable devices the
106 generic functionality is essential for performance or correctness reasons. 106 generic functionality is essential for performance or correctness reasons.
107 Knowledge of some of the capabilities or parameters of the device should be 107 Knowledge of some of the capabilities or parameters of the device should be
108 used at the generic block layer to take the right decisions on 108 used at the generic block layer to take the right decisions on
109 behalf of the driver. 109 behalf of the driver.
110 110
111 How is this achieved ? 111 How is this achieved ?
112 112
113 Tuning at a per-queue level: 113 Tuning at a per-queue level:
114 114
115 i. Per-queue limits/values exported to the generic layer by the driver 115 i. Per-queue limits/values exported to the generic layer by the driver
116 116
117 Various parameters that the generic i/o scheduler logic uses are set at 117 Various parameters that the generic i/o scheduler logic uses are set at
118 a per-queue level (e.g maximum request size, maximum number of segments in 118 a per-queue level (e.g maximum request size, maximum number of segments in
119 a scatter-gather list, hardsect size) 119 a scatter-gather list, hardsect size)
120 120
121 Some parameters that were earlier available as global arrays indexed by 121 Some parameters that were earlier available as global arrays indexed by
122 major/minor are now directly associated with the queue. Some of these may 122 major/minor are now directly associated with the queue. Some of these may
123 move into the block device structure in the future. Some characteristics 123 move into the block device structure in the future. Some characteristics
124 have been incorporated into a queue flags field rather than separate fields 124 have been incorporated into a queue flags field rather than separate fields
125 in themselves. There are blk_queue_xxx functions to set the parameters, 125 in themselves. There are blk_queue_xxx functions to set the parameters,
126 rather than update the fields directly. 126 rather than update the fields directly.
127 127
128 Some new queue property settings: 128 Some new queue property settings:
129 129
130 blk_queue_bounce_limit(q, u64 dma_address) 130 blk_queue_bounce_limit(q, u64 dma_address)
131 Enable I/O to highmem pages, dma_address being the 131 Enable I/O to highmem pages, dma_address being the
132 limit. No highmem default. 132 limit. No highmem default.
133 133
134 blk_queue_max_sectors(q, max_sectors) 134 blk_queue_max_sectors(q, max_sectors)
135 Sets two variables that limit the size of the request. 135 Sets two variables that limit the size of the request.
136 136
137 - The request queue's max_sectors, which is a soft size in 137 - The request queue's max_sectors, which is a soft size in
138 in units of 512 byte sectors, and could be dynamically varied 138 units of 512 byte sectors, and could be dynamically varied
139 by the core kernel. 139 by the core kernel.
140 140
141 - The request queue's max_hw_sectors, which is a hard limit 141 - The request queue's max_hw_sectors, which is a hard limit
142 and reflects the maximum size request a driver can handle 142 and reflects the maximum size request a driver can handle
143 in units of 512 byte sectors. 143 in units of 512 byte sectors.
144 144
145 The default for both max_sectors and max_hw_sectors is 145 The default for both max_sectors and max_hw_sectors is
146 255. The upper limit of max_sectors is 1024. 146 255. The upper limit of max_sectors is 1024.
147 147
148 blk_queue_max_phys_segments(q, max_segments) 148 blk_queue_max_phys_segments(q, max_segments)
149 Maximum physical segments you can handle in a request. 128 149 Maximum physical segments you can handle in a request. 128
150 default (driver limit). (See 3.2.2) 150 default (driver limit). (See 3.2.2)
151 151
152 blk_queue_max_hw_segments(q, max_segments) 152 blk_queue_max_hw_segments(q, max_segments)
153 Maximum dma segments the hardware can handle in a request. 128 153 Maximum dma segments the hardware can handle in a request. 128
154 default (host adapter limit, after dma remapping). 154 default (host adapter limit, after dma remapping).
155 (See 3.2.2) 155 (See 3.2.2)
156 156
157 blk_queue_max_segment_size(q, max_seg_size) 157 blk_queue_max_segment_size(q, max_seg_size)
158 Maximum size of a clustered segment, 64kB default. 158 Maximum size of a clustered segment, 64kB default.
159 159
160 blk_queue_hardsect_size(q, hardsect_size) 160 blk_queue_hardsect_size(q, hardsect_size)
161 Lowest possible sector size that the hardware can operate 161 Lowest possible sector size that the hardware can operate
162 on, 512 bytes default. 162 on, 512 bytes default.
163 163
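The relationship between the soft and hard sector limits described above can be sketched in plain user-space C. This is a toy model, not the kernel's blk_queue_max_sectors(): the struct, the toy_* functions and the macro names are hypothetical, and the only values taken from the text are the default of 255 and the upper limit of 1024 for max_sectors.

```c
#include <assert.h>

#define TOY_DEF_SECTORS		255u	/* default for both limits, per the text */
#define TOY_MAX_SOFT_SECTORS	1024u	/* upper limit of max_sectors, per the text */

/* Hypothetical stand-in for the two per-queue size limits. */
struct toy_queue_limits {
	unsigned int max_sectors;	/* soft limit, 512-byte sectors */
	unsigned int max_hw_sectors;	/* hard limit the driver can handle */
};

static void toy_limits_init(struct toy_queue_limits *l)
{
	l->max_sectors = TOY_DEF_SECTORS;
	l->max_hw_sectors = TOY_DEF_SECTORS;
}

/* One call sets both variables: the hard limit takes the driver's
 * value as given, the soft limit is capped at the upper bound. */
static void toy_set_max_sectors(struct toy_queue_limits *l, unsigned int n)
{
	assert(n > 0);
	l->max_hw_sectors = n;
	l->max_sectors = n < TOY_MAX_SOFT_SECTORS ? n : TOY_MAX_SOFT_SECTORS;
}

/* Largest request, in bytes, allowed by the soft limit. */
static unsigned long toy_max_request_bytes(const struct toy_queue_limits *l)
{
	return (unsigned long)l->max_sectors * 512ul;
}
```

With these assumptions, a driver that reports 2048 sectors ends up with a hard limit of 2048 and a soft limit of 1024, which corresponds to 512 KiB requests.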
164 New queue flags: 164 New queue flags:
165 165
166 QUEUE_FLAG_CLUSTER (see 3.2.2) 166 QUEUE_FLAG_CLUSTER (see 3.2.2)
167 QUEUE_FLAG_QUEUED (see 3.2.4) 167 QUEUE_FLAG_QUEUED (see 3.2.4)
168 168
169 169
170 ii. High-mem i/o capabilities are now considered the default 170 ii. High-mem i/o capabilities are now considered the default
171 171
172 The generic bounce buffer logic, present in 2.4, where the block layer would 172 The generic bounce buffer logic, present in 2.4, where the block layer would
173 by default copyin/out i/o requests on high-memory buffers to low-memory buffers 173 by default copyin/out i/o requests on high-memory buffers to low-memory buffers
174 assuming that the driver wouldn't be able to handle it directly, has been 174 assuming that the driver wouldn't be able to handle it directly, has been
175 changed in 2.5. The bounce logic is now applied only for memory ranges 175 changed in 2.5. The bounce logic is now applied only for memory ranges
176 for which the device cannot handle i/o. A driver can specify this by 176 for which the device cannot handle i/o. A driver can specify this by
177 setting the queue bounce limit for the request queue for the device 177 setting the queue bounce limit for the request queue for the device
178 (blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out 178 (blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
179 where a device is capable of handling high memory i/o. 179 where a device is capable of handling high memory i/o.
180 180
181 In order to enable high-memory i/o where the device is capable of supporting 181 In order to enable high-memory i/o where the device is capable of supporting
182 it, the pci dma mapping routines and associated data structures have now been 182 it, the pci dma mapping routines and associated data structures have now been
183 modified to accomplish a direct page -> bus translation, without requiring 183 modified to accomplish a direct page -> bus translation, without requiring
184 a virtual address mapping (unlike the earlier scheme of virtual address 184 a virtual address mapping (unlike the earlier scheme of virtual address
185 -> bus translation). So this works uniformly for high-memory pages (which 185 -> bus translation). So this works uniformly for high-memory pages (which
186 do not have a corresponding kernel virtual address space mapping) and 186 do not have a corresponding kernel virtual address space mapping) and
187 low-memory pages. 187 low-memory pages.
188 188
189 Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA 189 Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA
190 aspects and mapping of scatter gather lists, and support for 64 bit PCI. 190 aspects and mapping of scatter gather lists, and support for 64 bit PCI.
191 191
192 Special handling is required only for cases where i/o needs to happen on 192 Special handling is required only for cases where i/o needs to happen on
193 pages at physical memory addresses beyond what the device can support. In these 193 pages at physical memory addresses beyond what the device can support. In these
194 cases, a bounce bio representing a buffer from the supported memory range 194 cases, a bounce bio representing a buffer from the supported memory range
195 is used for performing the i/o with copyin/copyout as needed depending on 195 is used for performing the i/o with copyin/copyout as needed depending on
196 the type of the operation. For example, in case of a read operation, the 196 the type of the operation. For example, in case of a read operation, the
197 data read has to be copied to the original buffer on i/o completion, so a 197 data read has to be copied to the original buffer on i/o completion, so a
198 callback routine is set up to do this, while for write, the data is copied 198 callback routine is set up to do this, while for write, the data is copied
199 from the original buffer to the bounce buffer prior to issuing the 199 from the original buffer to the bounce buffer prior to issuing the
200 operation. Since an original buffer may be in a high memory area that's not 200 operation. Since an original buffer may be in a high memory area that's not
201 mapped in kernel virtual addr, a kmap operation may be required for 201 mapped in kernel virtual addr, a kmap operation may be required for
202 performing the copy, and special care may be needed in the completion path 202 performing the copy, and special care may be needed in the completion path
203 as it may not be in irq context. Special care is also required (by way of 203 as it may not be in irq context. Special care is also required (by way of
204 GFP flags) when allocating bounce buffers, to avoid certain highmem 204 GFP flags) when allocating bounce buffers, to avoid certain highmem
205 deadlock possibilities. 205 deadlock possibilities.
206 206
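The copy direction described above (copy before submission for a write, copy on completion for a read) can be modelled with ordinary buffers. This is a minimal sketch with hypothetical toy_* names. It ignores kmap, GFP flags and irq context entirely, and it assumes that the bounce limit is inclusive, which the text does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum toy_dir { TOY_READ, TOY_WRITE };

/* Hypothetical pairing of an original buffer with its bounce buffer. */
struct toy_bounce {
	void *orig;		/* buffer the caller asked i/o on */
	void *bounce;		/* buffer inside the device's reachable range */
	size_t len;
	enum toy_dir dir;
};

/* A page needs bouncing only if it lies beyond what the device can
 * address. Treating the limit as inclusive is an assumption. */
static bool toy_needs_bounce(uint64_t page_addr, uint64_t bounce_limit)
{
	return page_addr > bounce_limit;
}

/* Before issuing the i/o: a write must see the caller's data. */
static void toy_bounce_submit(struct toy_bounce *b)
{
	if (b->dir == TOY_WRITE)
		memcpy(b->bounce, b->orig, b->len);
}

/* Completion callback: a read must deliver the data to the caller. */
static void toy_bounce_complete(struct toy_bounce *b)
{
	if (b->dir == TOY_READ)
		memcpy(b->orig, b->bounce, b->len);
}
```

The device in this model only ever touches the bounce buffer; the original buffer is read once before a write and written once after a read.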
207 It is also possible that a bounce buffer may be allocated from high-memory 207 It is also possible that a bounce buffer may be allocated from high-memory
208 area that's not mapped in kernel virtual addr, but within the range that the 208 area that's not mapped in kernel virtual addr, but within the range that the
209 device can use directly; so the bounce page may need to be kmapped during 209 device can use directly; so the bounce page may need to be kmapped during
210 copy operations. [Note: This does not hold in the current implementation, 210 copy operations. [Note: This does not hold in the current implementation,
211 though] 211 though]
212 212
213 There are some situations when pages from high memory may need to 213 There are some situations when pages from high memory may need to
214 be kmapped, even if bounce buffers are not necessary. For example a device 214 be kmapped, even if bounce buffers are not necessary. For example a device
215 may need to abort DMA operations and revert to PIO for the transfer, in 215 may need to abort DMA operations and revert to PIO for the transfer, in
216 which case a virtual mapping of the page is required. For SCSI it is also 216 which case a virtual mapping of the page is required. For SCSI it is also
217 done in some scenarios where the low level driver cannot be trusted to 217 done in some scenarios where the low level driver cannot be trusted to
218 handle a single sg entry correctly. The driver is expected to perform the 218 handle a single sg entry correctly. The driver is expected to perform the
219 kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq 219 kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq
220 routines as appropriate. A driver could also use the blk_queue_bounce() 220 routines as appropriate. A driver could also use the blk_queue_bounce()
221 routine on its own to bounce highmem i/o to low memory for specific requests 221 routine on its own to bounce highmem i/o to low memory for specific requests
222 if so desired. 222 if so desired.
223 223
224 iii. The i/o scheduler algorithm itself can be replaced/set as appropriate 224 iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
225 225
226 As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular 226 As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
227 queue or pick from (copy) existing generic schedulers and replace/override 227 queue or pick from (copy) existing generic schedulers and replace/override
228 certain portions of it. The 2.5 rewrite provides improved modularization 228 certain portions of it. The 2.5 rewrite provides improved modularization
229 of the i/o scheduler. There are more pluggable callbacks, e.g for init, 229 of the i/o scheduler. There are more pluggable callbacks, e.g for init,
230 add request, extract request, which makes it possible to abstract specific 230 add request, extract request, which makes it possible to abstract specific
231 i/o scheduling algorithm aspects and details outside of the generic loop. 231 i/o scheduling algorithm aspects and details outside of the generic loop.
232 It also makes it possible to completely hide the implementation details of 232 It also makes it possible to completely hide the implementation details of
233 the i/o scheduler from block drivers. 233 the i/o scheduler from block drivers.
234 234
235 I/O scheduler wrappers are to be used instead of accessing the queue directly. 235 I/O scheduler wrappers are to be used instead of accessing the queue directly.
236 See section 4. The I/O scheduler for details. 236 See section 4. The I/O scheduler for details.
237 237
238 1.2 Tuning Based on High level code capabilities 238 1.2 Tuning Based on High level code capabilities
239 239
240 i. Application capabilities for raw i/o 240 i. Application capabilities for raw i/o
241 241
242 This comes from some of the high-performance database/middleware 242 This comes from some of the high-performance database/middleware
243 requirements where an application prefers to make its own i/o scheduling 243 requirements where an application prefers to make its own i/o scheduling
244 decisions based on an understanding of the access patterns and i/o 244 decisions based on an understanding of the access patterns and i/o
245 characteristics 245 characteristics
246 246
247 ii. High performance filesystems or other higher level kernel code's 247 ii. High performance filesystems or other higher level kernel code's
248 capabilities 248 capabilities
249 249
250 Kernel components like filesystems could also take their own i/o scheduling 250 Kernel components like filesystems could also take their own i/o scheduling
251 decisions for optimizing performance. Journalling filesystems may need 251 decisions for optimizing performance. Journalling filesystems may need
252 some control over i/o ordering. 252 some control over i/o ordering.
253 253
254 What kind of support exists at the generic block layer for this ? 254 What kind of support exists at the generic block layer for this ?
255 255
256 The flags and rw fields in the bio structure can be used for some tuning 256 The flags and rw fields in the bio structure can be used for some tuning
257 from above e.g indicating that an i/o is just a readahead request, or for 257 from above e.g indicating that an i/o is just a readahead request, or for
258 marking barrier requests (discussed next), or priority settings (currently 258 marking barrier requests (discussed next), or priority settings (currently
259 unused). As far as user applications are concerned they would need an 259 unused). As far as user applications are concerned they would need an
260 additional mechanism either via open flags or ioctls, or some other upper 260 additional mechanism either via open flags or ioctls, or some other upper
261 level mechanism to communicate such settings to block. 261 level mechanism to communicate such settings to block.
262 262
263 1.2.1 I/O Barriers 263 1.2.1 I/O Barriers
264 264
265 There is a way to enforce strict ordering for i/os through barriers. 265 There is a way to enforce strict ordering for i/os through barriers.
266 All requests before a barrier point must be serviced before the barrier 266 All requests before a barrier point must be serviced before the barrier
267 request and any other requests arriving after the barrier will not be 267 request and any other requests arriving after the barrier will not be
268 serviced until after the barrier has completed. This is useful for higher 268 serviced until after the barrier has completed. This is useful for higher
269 level control on write ordering, e.g flushing a log of committed updates 269 level control on write ordering, e.g flushing a log of committed updates
270 to disk before the corresponding updates themselves. 270 to disk before the corresponding updates themselves.
271 271
272 A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o. 272 A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
273 The generic i/o scheduler would make sure that it places the barrier request and 273 The generic i/o scheduler would make sure that it places the barrier request and
274 all other requests coming after it after all the previous requests in the 274 all other requests coming after it after all the previous requests in the
275 queue. Barriers may be implemented in different ways depending on the 275 queue. Barriers may be implemented in different ways depending on the
276 driver. For more details regarding I/O barriers, please read barrier.txt 276 driver. For more details regarding I/O barriers, please read barrier.txt
277 in this directory. 277 in this directory.
278 278
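The ordering rule above can be illustrated with a toy scheduler that sorts requests by sector but never moves a request across a barrier. This is a sketch only: struct toy_req and toy_schedule() are hypothetical, sorting by sector merely stands in for whatever reordering a real i/o scheduler performs, and real barrier handling is driver dependent as noted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical request: a start sector plus a barrier marker. */
struct toy_req {
	unsigned long sector;
	bool barrier;
};

/* qsort comparator: ascending by sector. */
static int toy_cmp_sector(const void *a, const void *b)
{
	const struct toy_req *x = a;
	const struct toy_req *y = b;

	return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Reorder requests only within the runs delimited by barriers.
 * A barrier request keeps its position, so everything queued before
 * it is serviced first and everything after it is serviced later. */
static void toy_schedule(struct toy_req *q, size_t n)
{
	size_t start = 0;

	for (size_t i = 0; i <= n; i++) {
		if (i == n || q[i].barrier) {
			if (i > start)
				qsort(q + start, i - start, sizeof(*q),
				      toy_cmp_sector);
			start = i + 1;
		}
	}
}
```

For a queue holding sectors 50, 10, a barrier at 30, then 5 and 40, the result is 10, 50, the barrier, 5, 40: sector 5 is not pulled ahead of the barrier even though it has the lowest sector number.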
279 1.2.2 Request Priority/Latency 279 1.2.2 Request Priority/Latency
280 280
281 Todo/Under discussion: 281 Todo/Under discussion:
282 Arjan's proposed request priority scheme allows higher levels some broad 282 Arjan's proposed request priority scheme allows higher levels some broad
283 control (high/med/low) over the priority of an i/o request vs other pending 283 control (high/med/low) over the priority of an i/o request vs other pending
284 requests in the queue. For example it allows reads for bringing in an 284 requests in the queue. For example it allows reads for bringing in an
285 executable page on demand to be given a higher priority over pending write 285 executable page on demand to be given a higher priority over pending write
286 requests which haven't aged too much on the queue. Potentially this priority 286 requests which haven't aged too much on the queue. Potentially this priority
287 could even be exposed to applications in some manner, providing higher level 287 could even be exposed to applications in some manner, providing higher level
288 tunability. Time based aging avoids starvation of lower priority 288 tunability. Time based aging avoids starvation of lower priority
289 requests. Some bits in the bi_rw flags field in the bio structure are 289 requests. Some bits in the bi_rw flags field in the bio structure are
290 intended to be used for this priority information. 290 intended to be used for this priority information.
291 291
292 292
293 1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) 293 1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
294 (e.g Diagnostics, Systems Management) 294 (e.g Diagnostics, Systems Management)
295 295
296 There are situations where high-level code needs to have direct access to 296 There are situations where high-level code needs to have direct access to
297 the low level device capabilities or requires the ability to issue commands 297 the low level device capabilities or requires the ability to issue commands
298 to the device bypassing some of the intermediate i/o layers. 298 to the device bypassing some of the intermediate i/o layers.
299 These could, for example, be special control commands issued through ioctl 299 These could, for example, be special control commands issued through ioctl
300 interfaces, or could be raw read/write commands that stress the drive's 300 interfaces, or could be raw read/write commands that stress the drive's
301 capabilities for certain kinds of fitness tests. Having direct interfaces at 301 capabilities for certain kinds of fitness tests. Having direct interfaces at
302 multiple levels without having to pass through upper layers makes 302 multiple levels without having to pass through upper layers makes
303 it possible to perform bottom up validation of the i/o path, layer by 303 it possible to perform bottom up validation of the i/o path, layer by
304 layer, starting from the media. 304 layer, starting from the media.
305 305
306 The normal i/o submission interfaces, e.g submit_bio, could be bypassed 306 The normal i/o submission interfaces, e.g submit_bio, could be bypassed
307 for specially crafted requests which such ioctl or diagnostics 307 for specially crafted requests which such ioctl or diagnostics
308 interfaces would typically use, and the elevator add_request routine 308 interfaces would typically use, and the elevator add_request routine
309 can instead be used to directly insert such requests in the queue or preferably 309 can instead be used to directly insert such requests in the queue or preferably
310 the blk_do_rq routine can be used to place the request on the queue and 310 the blk_do_rq routine can be used to place the request on the queue and
311 wait for completion. Alternatively, sometimes the caller might just 311 wait for completion. Alternatively, sometimes the caller might just
312 invoke a lower level driver specific interface with the request as a 312 invoke a lower level driver specific interface with the request as a
313 parameter. 313 parameter.
314 314
315 If the request is a means for passing on special information associated with 315 If the request is a means for passing on special information associated with
316 the command, then such information is associated with the request->special 316 the command, then such information is associated with the request->special
317 field (rather than misusing the request->buffer field, which is meant for the 317 field (rather than misusing the request->buffer field, which is meant for the
318 request data buffer's virtual mapping). 318 request data buffer's virtual mapping).
319 319
320 For passing request data, the caller must build up a bio descriptor 320 For passing request data, the caller must build up a bio descriptor
321 representing the concerned memory buffer if the underlying driver interprets 321 representing the concerned memory buffer if the underlying driver interprets
322 bio segments or uses the block layer end*request* functions for i/o 322 bio segments or uses the block layer end*request* functions for i/o
323 completion. Alternatively one could directly use the request->buffer field to 323 completion. Alternatively one could directly use the request->buffer field to
324 specify the virtual address of the buffer, if the driver expects buffer 324 specify the virtual address of the buffer, if the driver expects buffer
325 addresses passed in this way and ignores bio entries for the request type 325 addresses passed in this way and ignores bio entries for the request type
326 involved. In the latter case, the driver would modify and manage the 326 involved. In the latter case, the driver would modify and manage the
327 request->buffer, request->sector and request->nr_sectors or 327 request->buffer, request->sector and request->nr_sectors or
328 request->current_nr_sectors fields itself rather than using the block layer 328 request->current_nr_sectors fields itself rather than using the block layer
329 end_request or end_that_request_first completion interfaces. 329 end_request or end_that_request_first completion interfaces.
330 (See 2.3 or Documentation/block/request.txt for a brief explanation of 330 (See 2.3 or Documentation/block/request.txt for a brief explanation of
331 the request structure fields) 331 the request structure fields)
332 332
333 [TBD: end_that_request_last should be usable even in this case; 333 [TBD: end_that_request_last should be usable even in this case;
334 Perhaps an end_that_direct_request_first routine could be implemented to make 334 Perhaps an end_that_direct_request_first routine could be implemented to make
335 handling direct requests easier for such drivers; Also for drivers that 335 handling direct requests easier for such drivers; Also for drivers that
336 expect bios, a helper function could be provided for setting up a bio 336 expect bios, a helper function could be provided for setting up a bio
337 corresponding to a data buffer] 337 corresponding to a data buffer]
338 338
339 <JENS: I dont understand the above, why is end_that_request_first() not 339 <JENS: I dont understand the above, why is end_that_request_first() not
340 usable? Or _last for that matter. I must be missing something> 340 usable? Or _last for that matter. I must be missing something>
341 <SUP: What I meant here was that if the request doesn't have a bio, then 341 <SUP: What I meant here was that if the request doesn't have a bio, then
342 end_that_request_first doesn't modify nr_sectors or current_nr_sectors, 342 end_that_request_first doesn't modify nr_sectors or current_nr_sectors,
343 and hence can't be used for advancing request state settings on the 343 and hence can't be used for advancing request state settings on the
344 completion of partial transfers. The driver has to modify these fields 344 completion of partial transfers. The driver has to modify these fields
345 directly by hand. 345 directly by hand.
346 This is because end_that_request_first only iterates over the bio list, 346 This is because end_that_request_first only iterates over the bio list,
347 and always returns 0 if there are none associated with the request. 347 and always returns 0 if there are none associated with the request.
348 _last works OK in this case, and is not a problem, as I mentioned earlier 348 _last works OK in this case, and is not a problem, as I mentioned earlier
349 > 349 >
350 350
351 1.3.1 Pre-built Commands 351 1.3.1 Pre-built Commands
352 352
353 A request can be created with a pre-built custom command to be sent directly 353 A request can be created with a pre-built custom command to be sent directly
354 to the device. The cmd block in the request structure has room for filling 354 to the device. The cmd block in the request structure has room for filling
355 in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for 355 in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
356 command pre-building, and the type of the request is now indicated 356 command pre-building, and the type of the request is now indicated
357 through rq->flags instead of via rq->cmd) 357 through rq->flags instead of via rq->cmd)
358 358
359 The request structure flags can be set up to indicate the type of request 359 The request structure flags can be set up to indicate the type of request
360 in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC: 360 in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
361 packet command issued via blk_do_rq, REQ_SPECIAL: special request). 361 packet command issued via blk_do_rq, REQ_SPECIAL: special request).
362 362
363 It can help to pre-build device commands for requests in advance. 363 It can help to pre-build device commands for requests in advance.
364 Drivers can now specify a request prepare function (q->prep_rq_fn) that the 364 Drivers can now specify a request prepare function (q->prep_rq_fn) that the
365 block layer would invoke to pre-build device commands for a given request, 365 block layer would invoke to pre-build device commands for a given request,
366 or perform other preparatory processing for the request. This routine is 366 or perform other preparatory processing for the request. This routine is
367 called by elv_next_request(), i.e. typically just before servicing a request. 367 called by elv_next_request(), i.e. typically just before servicing a request.
368 (The prepare function would not be called for requests that have REQ_DONTPREP 368 (The prepare function would not be called for requests that have REQ_DONTPREP
369 enabled) 369 enabled)
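
The prepare hook described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the model_* names, the single-slot queue, the flag bit value and the choice to have the prepare routine set the "don't prep" flag itself are all hypothetical and are not taken from the kernel sources. The command layout used is the SCSI READ(10) form (opcode 0x28, big-endian LBA in bytes 2..5), chosen purely for illustration.

```c
#include <string.h>

#define MODEL_REQ_DONTPREP 0x1ul	/* hypothetical bit value, not the kernel's */

struct model_request {
	unsigned char cmd[16];		/* pre-built command bytes */
	unsigned long flags;
	unsigned long sector;
};

typedef int (model_prep_rq_fn)(struct model_request *rq);

struct model_queue {
	model_prep_rq_fn *prep_rq_fn;	/* optional prepare hook */
	struct model_request *head;	/* single-slot stand-in for the queue */
};

/* Example prepare routine: build a READ(10)-style command for rq->sector. */
int model_prep(struct model_request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd[0] = 0x28;				/* READ(10) opcode */
	rq->cmd[2] = (rq->sector >> 24) & 0xff;		/* LBA, big-endian */
	rq->cmd[3] = (rq->sector >> 16) & 0xff;
	rq->cmd[4] = (rq->sector >> 8) & 0xff;
	rq->cmd[5] = rq->sector & 0xff;
	/* Assumption of this sketch: mark the request so it is built once. */
	rq->flags |= MODEL_REQ_DONTPREP;
	return 0;
}

/*
 * Stand-in for the "fetch next request" step: invoke the prepare hook
 * just before handing the request to the driver, unless the request
 * is flagged as not needing preparation.
 */
struct model_request *model_next_request(struct model_queue *q)
{
	struct model_request *rq = q->head;

	if (rq && q->prep_rq_fn && !(rq->flags & MODEL_REQ_DONTPREP))
		q->prep_rq_fn(rq);
	return rq;
}
```

In this model a second fetch of the same request leaves the command bytes untouched, which is the point of the flag: a requeued request is not rebuilt.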
370 370
371 Aside: 371 Aside:
372 Pre-building could possibly even be done early, i.e before placing the 372 Pre-building could possibly even be done early, i.e before placing the
373 request on the queue, rather than construct the command on the fly in the 373 request on the queue, rather than construct the command on the fly in the
374 driver while servicing the request queue when it may affect latencies in 374 driver while servicing the request queue when it may affect latencies in
375 interrupt context or responsiveness in general. One way to add early 375 interrupt context or responsiveness in general. One way to add early
376 pre-building would be to do it whenever we fail to merge on a request. 376 pre-building would be to do it whenever we fail to merge on a request.
377 Now REQ_NOMERGE is set in the request flags to skip this one in the future, 377 Now REQ_NOMERGE is set in the request flags to skip this one in the future,
378 which means that it will not change before we feed it to the device. So 378 which means that it will not change before we feed it to the device. So
379 the pre-builder hook can be invoked there. 379 the pre-builder hook can be invoked there.
380 380
381 381
382 2. Flexible and generic but minimalist i/o structure/descriptor. 382 2. Flexible and generic but minimalist i/o structure/descriptor.
383 383
384 2.1 Reason for a new structure and requirements addressed 384 2.1 Reason for a new structure and requirements addressed
385 385
386 Prior to 2.5, buffer heads were used as the unit of i/o at the generic block 386 Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
387 layer, and the low level request structure was associated with a chain of 387 layer, and the low level request structure was associated with a chain of
388 buffer heads for a contiguous i/o request. This led to certain inefficiencies 388 buffer heads for a contiguous i/o request. This led to certain inefficiencies
389 when it came to large i/o requests and readv/writev style operations, as it 389 when it came to large i/o requests and readv/writev style operations, as it
390 forced such requests to be broken up into small chunks before being passed 390 forced such requests to be broken up into small chunks before being passed
391 on to the generic block layer, only to be merged by the i/o scheduler 391 on to the generic block layer, only to be merged by the i/o scheduler
392 when the underlying device was capable of handling the i/o in one shot. 392 when the underlying device was capable of handling the i/o in one shot.
393 Also, using the buffer head as an i/o structure for i/os that didn't originate 393 Also, using the buffer head as an i/o structure for i/os that didn't originate
394 from the buffer cache unnecessarily added to the weight of the descriptors 394 from the buffer cache unnecessarily added to the weight of the descriptors
395 which were generated for each such chunk. 395 which were generated for each such chunk.
396 396
397 The following were some of the goals and expectations considered in the 397 The following were some of the goals and expectations considered in the
398 redesign of the block i/o data structure in 2.5. 398 redesign of the block i/o data structure in 2.5.
399 399
400 i. Should be appropriate as a descriptor for both raw and buffered i/o - 400 i. Should be appropriate as a descriptor for both raw and buffered i/o -
401 avoid cache related fields which are irrelevant in the direct/page i/o path, 401 avoid cache related fields which are irrelevant in the direct/page i/o path,
402 or filesystem block size alignment restrictions which may not be relevant 402 or filesystem block size alignment restrictions which may not be relevant
403 for raw i/o. 403 for raw i/o.
404 ii. Ability to represent high-memory buffers (which do not have a virtual 404 ii. Ability to represent high-memory buffers (which do not have a virtual
405 address mapping in kernel address space). 405 address mapping in kernel address space).
406 iii. Ability to represent large i/os w/o unnecessarily breaking them up (i.e 406 iii. Ability to represent large i/os w/o unnecessarily breaking them up (i.e
407 greater than PAGE_SIZE chunks in one shot) 407 greater than PAGE_SIZE chunks in one shot)
408 iv. At the same time, ability to retain independent identity of i/os from 408 iv. At the same time, ability to retain independent identity of i/os from
409 different sources or i/o units requiring individual completion (e.g. for 409 different sources or i/o units requiring individual completion (e.g. for
410 latency reasons) 410 latency reasons)
411 v. Ability to represent an i/o involving multiple physical memory segments 411 v. Ability to represent an i/o involving multiple physical memory segments
412 (including non-page aligned page fragments, as specified via readv/writev) 412 (including non-page aligned page fragments, as specified via readv/writev)
413 without unnecessarily breaking it up, if the underlying device is capable of 413 without unnecessarily breaking it up, if the underlying device is capable of
414 handling it. 414 handling it.
415 vi. Preferably should be based on a memory descriptor structure that can be 415 vi. Preferably should be based on a memory descriptor structure that can be
416 passed around different types of subsystems or layers, maybe even 416 passed around different types of subsystems or layers, maybe even
417 networking, without duplication or extra copies of data/descriptor fields 417 networking, without duplication or extra copies of data/descriptor fields
418 themselves in the process 418 themselves in the process
419 vii. Ability to handle the possibility of splits/merges as the structure passes 419 vii. Ability to handle the possibility of splits/merges as the structure passes
420 through layered drivers (lvm, md, evms), with minimal overhead. 420 through layered drivers (lvm, md, evms), with minimal overhead.
421 421
422 The solution was to define a new structure (bio) for the block layer, 422 The solution was to define a new structure (bio) for the block layer,
423 instead of using the buffer head structure (bh) directly, the idea being 423 instead of using the buffer head structure (bh) directly, the idea being
424 avoidance of some associated baggage and limitations. The bio structure 424 avoidance of some associated baggage and limitations. The bio structure
425 is uniformly used for all i/o at the block layer; it forms a part of the 425 is uniformly used for all i/o at the block layer; it forms a part of the
426 bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are 426 bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
427 mapped to bio structures. 427 mapped to bio structures.
428 428
429 2.2 The bio struct 429 2.2 The bio struct
430 430
431 The bio structure uses a vector representation pointing to an array of tuples 431 The bio structure uses a vector representation pointing to an array of tuples
432 of <page, offset, len> to describe the i/o buffer, and has various other 432 of <page, offset, len> to describe the i/o buffer, and has various other
433 fields describing i/o parameters and state that needs to be maintained for 433 fields describing i/o parameters and state that needs to be maintained for
434 performing the i/o. 434 performing the i/o.
435 435
436 Notice that this representation means that a bio has no virtual address 436 Notice that this representation means that a bio has no virtual address
437 mapping at all (unlike buffer heads). 437 mapping at all (unlike buffer heads).
438 438
439 struct bio_vec { 439 struct bio_vec {
440 struct page *bv_page; 440 struct page *bv_page;
441 unsigned short bv_len; 441 unsigned short bv_len;
442 unsigned short bv_offset; 442 unsigned short bv_offset;
443 }; 443 };
444 444
445 /* 445 /*
446 * main unit of I/O for the block layer and lower layers (ie drivers) 446 * main unit of I/O for the block layer and lower layers (ie drivers)
447 */ 447 */
448 struct bio { 448 struct bio {
449 sector_t bi_sector; 449 sector_t bi_sector;
450 struct bio *bi_next; /* request queue link */ 450 struct bio *bi_next; /* request queue link */
451 struct block_device *bi_bdev; /* target device */ 451 struct block_device *bi_bdev; /* target device */
452 unsigned long bi_flags; /* status, command, etc */ 452 unsigned long bi_flags; /* status, command, etc */
453 unsigned long bi_rw; /* low bits: r/w, high: priority */ 453 unsigned long bi_rw; /* low bits: r/w, high: priority */
454 454
455 unsigned int bi_vcnt; /* how many bio_vecs */ 455 unsigned int bi_vcnt; /* how many bio_vecs */
456 unsigned int bi_idx; /* current index into bio_vec array */ 456 unsigned int bi_idx; /* current index into bio_vec array */
457 457
458 unsigned int bi_size; /* total size in bytes */ 458 unsigned int bi_size; /* total size in bytes */
459 unsigned short bi_phys_segments; /* segments after physaddr coalesce*/ 459 unsigned short bi_phys_segments; /* segments after physaddr coalesce*/
460 unsigned short bi_hw_segments; /* segments after DMA remapping */ 460 unsigned short bi_hw_segments; /* segments after DMA remapping */
461 unsigned int bi_max; /* max bio_vecs we can hold 461 unsigned int bi_max; /* max bio_vecs we can hold
462 used as index into pool */ 462 used as index into pool */
463 struct bio_vec *bi_io_vec; /* the actual vec list */ 463 struct bio_vec *bi_io_vec; /* the actual vec list */
464 bio_end_io_t *bi_end_io; /* bi_end_io (bio) */ 464 bio_end_io_t *bi_end_io; /* bi_end_io (bio) */
465 atomic_t bi_cnt; /* pin count: free when it hits zero */ 465 atomic_t bi_cnt; /* pin count: free when it hits zero */
466 void *bi_private; 466 void *bi_private;
467 bio_destructor_t *bi_destructor; /* bi_destructor (bio) */ 467 bio_destructor_t *bi_destructor; /* bi_destructor (bio) */
468 }; 468 };
469 469
470 With this multipage bio design: 470 With this multipage bio design:
471 471
472 - Large i/os can be sent down in one go using a bio_vec list consisting 472 - Large i/os can be sent down in one go using a bio_vec list consisting
473 of an array of <page, offset, len> fragments (similar to the way fragments 473 of an array of <page, offset, len> fragments (similar to the way fragments
474 are represented in the zero-copy network code) 474 are represented in the zero-copy network code)
475 - Splitting of an i/o request across multiple devices (as in the case of 475 - Splitting of an i/o request across multiple devices (as in the case of
476 lvm or raid) is achieved by cloning the bio (where the clone points to 476 lvm or raid) is achieved by cloning the bio (where the clone points to
477 the same bi_io_vec array, but with the index and size accordingly modified) 477 the same bi_io_vec array, but with the index and size accordingly modified)
478 - A linked list of bios is used as before for unrelated merges (*) - this 478 - A linked list of bios is used as before for unrelated merges (*) - this
479 avoids reallocs and makes independent completions easier to handle. 479 avoids reallocs and makes independent completions easier to handle.
480 - Code that traverses the req list needs to make a distinction between 480 - Code that traverses the req list needs to make a distinction between
481 segments of a request (bio_for_each_segment) and the distinct completion 481 segments of a request (bio_for_each_segment) and the distinct completion
482 units/bios (rq_for_each_bio). 482 units/bios (rq_for_each_bio).
483 - Drivers which can't process a large bio in one shot can use the bi_idx 483 - Drivers which can't process a large bio in one shot can use the bi_idx
484 field to keep track of the next bio_vec entry to process. 484 field to keep track of the next bio_vec entry to process.
485 (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE) 485 (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
486 [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying 486 [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
487 bv_offset and bv_len fields] 487 bv_offset and bv_len fields]
488 488
489 (*) unrelated merges -- a request ends up containing two or more bios that 489 (*) unrelated merges -- a request ends up containing two or more bios that
490 didn't originate from the same place. 490 didn't originate from the same place.
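
As an illustration of the bi_idx point in the list above, the following userspace sketch walks a vector of fragments in bounded chunks. It is a simplified model, not kernel code: the model_* types are stand-ins for the structures quoted earlier (struct page is reduced to an opaque pointer), and model_bio_take_chunk is a hypothetical helper invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for struct bio_vec. */
struct model_bio_vec {
	void *bv_page;			/* opaque here; a struct page * in the kernel */
	unsigned short bv_len;
	unsigned short bv_offset;
};

/* Only the fields needed for the illustration. */
struct model_bio {
	unsigned int bi_vcnt;		/* number of entries in bi_io_vec */
	unsigned int bi_idx;		/* next entry to process */
	unsigned int bi_size;		/* bytes still to be transferred */
	struct model_bio_vec *bi_io_vec;
};

/*
 * Take whole entries starting at bi_idx until adding the next one would
 * exceed max_bytes, then advance bi_idx and bi_size. Returns the number
 * of bytes taken, or 0 once the vector is exhausted. At least one entry
 * is always taken so that the caller makes progress.
 */
unsigned int model_bio_take_chunk(struct model_bio *bio, unsigned int max_bytes)
{
	unsigned int taken = 0;

	while (bio->bi_idx < bio->bi_vcnt) {
		unsigned int len = bio->bi_io_vec[bio->bi_idx].bv_len;

		if (taken != 0 && taken + len > max_bytes)
			break;
		taken += len;
		bio->bi_idx++;
	}
	assert(taken <= bio->bi_size);
	bio->bi_size -= taken;
	return taken;
}
```

A driver-like caller would loop on this helper, issuing one device command per returned chunk, until it returns 0.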
491 491
492 bi_end_io() i/o callback gets called on i/o completion of the entire bio. 492 bi_end_io() i/o callback gets called on i/o completion of the entire bio.
493 493
494 At a lower level, drivers build a scatter gather list from the merged bios. 494 At a lower level, drivers build a scatter gather list from the merged bios.
495 The scatter gather list is in the form of an array of <page, offset, len> 495 The scatter gather list is in the form of an array of <page, offset, len>
496 entries with their corresponding dma address mappings filled in at the 496 entries with their corresponding dma address mappings filled in at the
497 appropriate time. As an optimization, contiguous physical pages can be 497 appropriate time. As an optimization, contiguous physical pages can be
498 covered by a single entry where <page> refers to the first page and <len> 498 covered by a single entry where <page> refers to the first page and <len>
499 covers the range of pages (up to 16 contiguous pages could be covered this 499 covers the range of pages (up to 16 contiguous pages could be covered this
500 way). There is a helper routine (blk_rq_map_sg) which drivers can use to build 500 way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
501 the sg list. 501 the sg list.
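
The coalescing of physically contiguous fragments into one scatter gather entry can be sketched as follows. This is a userspace model under stated assumptions, not the blk_rq_map_sg implementation: pages are represented by page frame numbers, MODEL_PAGE_SIZE is fixed at 4096, and no device limits (maximum segment size, segment count, boundary masks) are modelled.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096ul	/* assumed page size for the model */

/* One <page, offset, len> fragment; pfn stands in for struct page *. */
struct model_seg {
	unsigned long pfn;
	unsigned int offset;
	unsigned int len;
};

/*
 * Copy fragments from in[] to out[], merging a fragment into the
 * previous entry when it starts exactly where that entry ends in
 * physical memory. out[] must have room for n entries.
 * Returns the number of entries written.
 */
size_t model_map_sg(const struct model_seg *in, size_t n, struct model_seg *out)
{
	size_t nsg = 0;

	for (size_t i = 0; i < n; i++) {
		if (nsg > 0) {
			struct model_seg *prev = &out[nsg - 1];
			unsigned long prev_end =
				prev->pfn * MODEL_PAGE_SIZE + prev->offset + prev->len;
			unsigned long cur_start =
				in[i].pfn * MODEL_PAGE_SIZE + in[i].offset;

			if (prev_end == cur_start) {
				prev->len += in[i].len;	/* extend the previous entry */
				continue;
			}
		}
		out[nsg++] = in[i];
	}
	return nsg;
}
```

With three page-sized fragments at frames 10, 11 and 20, the first two collapse into a single 8192-byte entry and the third stays separate.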
502 502
503 Note: Right now the only user of bios with more than one page is ll_rw_kio, 503 Note: Right now the only user of bios with more than one page is ll_rw_kio,
504 which in turn means that only raw I/O uses it (direct i/o may not work 504 which in turn means that only raw I/O uses it (direct i/o may not work
505 right now). The intent however is to enable clustering of pages etc to 505 right now). The intent however is to enable clustering of pages etc to
506 become possible. The pagebuf abstraction layer from SGI also uses multi-page 506 become possible. The pagebuf abstraction layer from SGI also uses multi-page
507 bios, but that is currently not included in the stock development kernels. 507 bios, but that is currently not included in the stock development kernels.
508 The same is true of Andrew Morton's work-in-progress multipage bio writeout 508 The same is true of Andrew Morton's work-in-progress multipage bio writeout
509 and readahead patches. 509 and readahead patches.
510 510
511 2.3 Changes in the Request Structure 511 2.3 Changes in the Request Structure
512 512
513 The request structure is the structure that gets passed down to low level 513 The request structure is the structure that gets passed down to low level
514 drivers. The block layer make_request function builds up a request structure, 514 drivers. The block layer make_request function builds up a request structure,
515 places it on the queue and invokes the driver's request_fn. The driver makes 515 places it on the queue and invokes the driver's request_fn. The driver makes
516 use of the block layer helper routine elv_next_request to pull the next request 516 use of the block layer helper routine elv_next_request to pull the next request
517 off the queue. Control or diagnostic functions might bypass block and directly 517 off the queue. Control or diagnostic functions might bypass block and directly
518 invoke underlying driver entry points passing in a specially constructed 518 invoke underlying driver entry points passing in a specially constructed
519 request structure. 519 request structure.
520 520
521 Only some relevant fields (mainly those which changed or may be referred 521 Only some relevant fields (mainly those which changed or may be referred
522 to in some of the discussion here) are listed below, not necessarily in 522 to in some of the discussion here) are listed below, not necessarily in
523 the order in which they occur in the structure (see include/linux/blkdev.h) 523 the order in which they occur in the structure (see include/linux/blkdev.h)
524 Refer to Documentation/block/request.txt for details about all the request 524 Refer to Documentation/block/request.txt for details about all the request
525 structure fields and a quick reference about the layers which are 525 structure fields and a quick reference about the layers which are
526 supposed to use or modify those fields. 526 supposed to use or modify those fields.
527 527
528 struct request { 528 struct request {
529 struct list_head queuelist; /* Not meant to be directly accessed by 529 struct list_head queuelist; /* Not meant to be directly accessed by
530 the driver. 530 the driver.
531 Used by q->elv_next_request_fn 531 Used by q->elv_next_request_fn
532 rq->queue is gone 532 rq->queue is gone
533 */ 533 */
534 . 534 .
535 . 535 .
536 unsigned char cmd[16]; /* prebuilt command data block */ 536 unsigned char cmd[16]; /* prebuilt command data block */
537 unsigned long flags; /* also includes earlier rq->cmd settings */ 537 unsigned long flags; /* also includes earlier rq->cmd settings */
538 . 538 .
539 . 539 .
540 sector_t sector; /* this field is now of type sector_t instead of int 540 sector_t sector; /* this field is now of type sector_t instead of int
541 in preparation for 64 bit sectors */ 541 in preparation for 64 bit sectors */
542 . 542 .
543 . 543 .
544 544
545 /* Number of scatter-gather DMA addr+len pairs after 545 /* Number of scatter-gather DMA addr+len pairs after
546 * physical address coalescing is performed. 546 * physical address coalescing is performed.
547 */ 547 */
548 unsigned short nr_phys_segments; 548 unsigned short nr_phys_segments;
549 549
550 /* Number of scatter-gather addr+len pairs after 550 /* Number of scatter-gather addr+len pairs after
551 * physical and DMA remapping hardware coalescing is performed. 551 * physical and DMA remapping hardware coalescing is performed.
552 * This is the number of scatter-gather entries the driver 552 * This is the number of scatter-gather entries the driver
553 * will actually have to deal with after DMA mapping is done. 553 * will actually have to deal with after DMA mapping is done.
554 */ 554 */
555 unsigned short nr_hw_segments; 555 unsigned short nr_hw_segments;
556 556
557 /* Various sector counts */ 557 /* Various sector counts */
558 unsigned long nr_sectors; /* no. of sectors left: driver modifiable */ 558 unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
559 unsigned long hard_nr_sectors; /* block internal copy of above */ 559 unsigned long hard_nr_sectors; /* block internal copy of above */
560 unsigned int current_nr_sectors; /* no. of sectors left in the 560 unsigned int current_nr_sectors; /* no. of sectors left in the
561 current segment:driver modifiable */ 561 current segment:driver modifiable */
562 unsigned long hard_cur_sectors; /* block internal copy of the above */ 562 unsigned long hard_cur_sectors; /* block internal copy of the above */
563 . 563 .
564 . 564 .
565 int tag; /* command tag associated with request */ 565 int tag; /* command tag associated with request */
566 void *special; /* same as before */ 566 void *special; /* same as before */
567 char *buffer; /* valid only for low memory buffers up to 567 char *buffer; /* valid only for low memory buffers up to
568 current_nr_sectors */ 568 current_nr_sectors */
569 . 569 .
570 . 570 .
571 struct bio *bio, *biotail; /* bio list instead of bh */ 571 struct bio *bio, *biotail; /* bio list instead of bh */
572 struct request_list *rl; 572 struct request_list *rl;
573 }; 573 };
574 574
575 See the rq_flag_bits definitions for an explanation of the various flags 575 See the rq_flag_bits definitions for an explanation of the various flags
576 available. Some bits are used by the block layer or i/o scheduler. 576 available. Some bits are used by the block layer or i/o scheduler.
577 577
578 The behaviour of the various sector counts is almost the same as before, 578 The behaviour of the various sector counts is almost the same as before,
579 except that since we have multi-segment bios, current_nr_sectors refers 579 except that since we have multi-segment bios, current_nr_sectors refers
580 to the number of sectors in the current segment being processed which could 580 to the number of sectors in the current segment being processed which could
581 be one of the many segments in the current bio (i.e i/o completion unit). 581 be one of the many segments in the current bio (i.e i/o completion unit).
582 The nr_sectors value refers to the total number of sectors in the whole 582 The nr_sectors value refers to the total number of sectors in the whole
583 request that remain to be transferred (no change). The purpose of the 583 request that remain to be transferred (no change). The purpose of the
584 hard_xxx values is for block to remember these counts every time it hands 584 hard_xxx values is for block to remember these counts every time it hands
585 over the request to the driver. These values are updated by block on 585 over the request to the driver. These values are updated by block on
586 end_that_request_first, i.e. every time the driver completes a part of the 586 end_that_request_first, i.e. every time the driver completes a part of the
587 transfer and invokes block end*request helpers to mark this. The 587 transfer and invokes block end*request helpers to mark this. The
588 driver should not modify these values. The block layer sets up the 588 driver should not modify these values. The block layer sets up the
589 nr_sectors and current_nr_sectors fields (based on the corresponding 589 nr_sectors and current_nr_sectors fields (based on the corresponding
590 hard_xxx values and the number of bytes transferred) and updates them on 590 hard_xxx values and the number of bytes transferred) and updates them on
591 every transfer that invokes end_that_request_first. It does the same for the 591 every transfer that invokes end_that_request_first. It does the same for the
592 buffer, bio, bio->bi_idx fields too. 592 buffer, bio, bio->bi_idx fields too.
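
The bookkeeping described in this paragraph can be modelled in a few lines of userspace C. This is a simplified sketch and not the kernel's end_that_request_first: the model_* names are hypothetical, segments are represented only by their sector counts, and the return convention (nonzero while sectors remain, 0 when the request is complete) is an assumption of the model.

```c
struct model_sect_rq {
	unsigned long nr_sectors;		/* sectors left: driver-visible */
	unsigned long hard_nr_sectors;		/* block-internal copy */
	unsigned int current_nr_sectors;	/* left in current segment */
	unsigned long hard_cur_sectors;		/* block-internal copy */
	const unsigned int *seg_sectors;	/* sectors per segment */
	unsigned int nr_segs;
	unsigned int cur_seg;
};

/* Set up the counts from a list of per-segment sector counts. */
void model_rq_init(struct model_sect_rq *rq, const unsigned int *seg_sectors,
		   unsigned int nr_segs)
{
	unsigned long total = 0;

	for (unsigned int i = 0; i < nr_segs; i++)
		total += seg_sectors[i];
	rq->seg_sectors = seg_sectors;
	rq->nr_segs = nr_segs;
	rq->cur_seg = 0;
	rq->hard_nr_sectors = rq->nr_sectors = total;
	rq->hard_cur_sectors = nr_segs ? seg_sectors[0] : 0;
	rq->current_nr_sectors = (unsigned int)rq->hard_cur_sectors;
}

/*
 * Account for nsect completed sectors: update the hard_xxx copies,
 * advance to later segments as needed, then refresh the driver-visible
 * fields from the hard_xxx values.
 */
int model_end_request_first(struct model_sect_rq *rq, unsigned long nsect)
{
	if (nsect > rq->hard_nr_sectors)
		nsect = rq->hard_nr_sectors;
	rq->hard_nr_sectors -= nsect;

	while (nsect > 0) {
		if (nsect >= rq->hard_cur_sectors) {
			/* current segment finished: move to the next one */
			nsect -= rq->hard_cur_sectors;
			rq->cur_seg++;
			rq->hard_cur_sectors = rq->cur_seg < rq->nr_segs ?
				rq->seg_sectors[rq->cur_seg] : 0;
		} else {
			rq->hard_cur_sectors -= nsect;
			nsect = 0;
		}
	}

	rq->nr_sectors = rq->hard_nr_sectors;
	rq->current_nr_sectors = (unsigned int)rq->hard_cur_sectors;
	return rq->hard_nr_sectors != 0;
}
```

For a request with segments of 8, 8 and 16 sectors, completing 4 sectors leaves nr_sectors at 28 and current_nr_sectors at 4; completing 4 more moves on to the second segment with 8 sectors left in it.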
593 593
594 The buffer field is just a virtual address mapping of the current segment 594 The buffer field is just a virtual address mapping of the current segment
595 of the i/o buffer in cases where the buffer resides in low-memory. For high 595 of the i/o buffer in cases where the buffer resides in low-memory. For high
596 memory i/o, this field is not valid and must not be used by drivers. 596 memory i/o, this field is not valid and must not be used by drivers.
597 597
598 Code that sets up its own request structures and passes them down to 598 Code that sets up its own request structures and passes them down to
599 a driver needs to be careful about interoperation with the block layer helper 599 a driver needs to be careful about interoperation with the block layer helper
600 functions which the driver uses. (Section 1.3) 600 functions which the driver uses. (Section 1.3)
601 601
602 3. Using bios 602 3. Using bios
603 603
604 3.1 Setup/Teardown 604 3.1 Setup/Teardown
605 605
606 There are routines for managing the allocation, reference counting, and 606 There are routines for managing the allocation, reference counting, and
607 freeing of bios (bio_alloc, bio_get, bio_put). 607 freeing of bios (bio_alloc, bio_get, bio_put).
608 608
609 This makes use of Ingo Molnar's mempool implementation, which enables 609 This makes use of Ingo Molnar's mempool implementation, which enables
610 subsystems like bio to maintain their own reserve memory pools for guaranteed 610 subsystems like bio to maintain their own reserve memory pools for guaranteed
611 deadlock-free allocations during extreme VM load. For example, the VM 611 deadlock-free allocations during extreme VM load. For example, the VM
612 subsystem makes use of the block layer to write out dirty pages in order to be 612 subsystem makes use of the block layer to write out dirty pages in order to be
613 able to free up memory space, a case which needs careful handling. The 613 able to free up memory space, a case which needs careful handling. The
614 allocation logic draws from the preallocated emergency reserve in situations 614 allocation logic draws from the preallocated emergency reserve in situations
615 where it cannot allocate through normal means. If the pool is empty and it 615 where it cannot allocate through normal means. If the pool is empty and it
616 can wait, then it would trigger action that would help free up memory or 616 can wait, then it would trigger action that would help free up memory or
617 replenish the pool (without deadlocking) and wait for availability in the pool. 617 replenish the pool (without deadlocking) and wait for availability in the pool.
618 If it is in IRQ context, and hence not in a position to do this, allocation 618 If it is in IRQ context, and hence not in a position to do this, allocation
619 could fail if the pool is empty. In general, mempool always first tries to 619 could fail if the pool is empty. In general, mempool always first tries to
620 perform allocation without having to wait, even if it means digging into the 620 perform allocation without having to wait, even if it means digging into the
621 pool as long as it is not less than 50% full. 621 pool as long as it is not less than 50% full.
622 622
623 On a free, memory is released to the pool or directly freed depending on 623 On a free, memory is released to the pool or directly freed depending on
624 the current availability in the pool. The mempool interface lets the 624 the current availability in the pool. The mempool interface lets the
625 subsystem specify the routines to be used for normal alloc and free. In the 625 subsystem specify the routines to be used for normal alloc and free. In the
626 case of bio, these routines make use of the standard slab allocator. 626 case of bio, these routines make use of the standard slab allocator.
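
The reserve-pool behaviour described above can be modelled in a few lines of
userspace C. This is only a sketch under stated assumptions, not the kernel's
mempool code: the names toy_pool, POOL_MIN and fail_normal are hypothetical,
malloc/free stand in for the slab allocator, and neither the "wait for
availability" path nor the 50% rule is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4                 /* hypothetical size of the reserve */

struct toy_pool {
    void  *reserve[POOL_MIN];      /* preallocated emergency elements */
    int    curr;                   /* elements currently held in the reserve */
    size_t elem_size;
    int    fail_normal;            /* knob: pretend normal allocation fails */
};

/* Release every element still held in the reserve. */
static void toy_pool_destroy(struct toy_pool *p)
{
    while (p->curr > 0)
        free(p->reserve[--p->curr]);
}

/* Fill the reserve up front; returns 0 on success, -1 on failure. */
static int toy_pool_init(struct toy_pool *p, size_t elem_size)
{
    p->curr = 0;
    p->elem_size = elem_size;
    p->fail_normal = 0;
    while (p->curr < POOL_MIN) {
        void *e = malloc(elem_size);
        if (!e) {
            toy_pool_destroy(p);
            return -1;
        }
        p->reserve[p->curr++] = e;
    }
    return 0;
}

/* Try the normal allocator first, then fall back to the reserve. */
static void *toy_pool_alloc(struct toy_pool *p)
{
    void *e = p->fail_normal ? NULL : malloc(p->elem_size);

    if (e)
        return e;
    if (p->curr > 0)
        return p->reserve[--p->curr];
    return NULL;   /* a caller that may sleep would wait here instead */
}

/* Refill the reserve if it is short, otherwise free directly. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
    if (p->curr < POOL_MIN)
        p->reserve[p->curr++] = e;
    else
        free(e);
}
```

The free path is what keeps the guarantee alive: an element handed back while
the reserve is short goes into the reserve rather than to the allocator.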
627 627
628 The caller of bio_alloc is expected to take certain steps to avoid 628 The caller of bio_alloc is expected to take certain steps to avoid
629 deadlocks, e.g. avoid trying to allocate more memory from the pool while 629 deadlocks, e.g. avoid trying to allocate more memory from the pool while
630 already holding memory obtained from the pool. 630 already holding memory obtained from the pool.
631 [TBD: This is a potential issue, though a rare possibility 631 [TBD: This is a potential issue, though a rare possibility
632 in the bounce bio allocation that happens in the current code, since 632 in the bounce bio allocation that happens in the current code, since
633 it ends up allocating a second bio from the same pool while 633 it ends up allocating a second bio from the same pool while
634 holding the original bio ] 634 holding the original bio ]
635 635
636 Memory allocated from the pool should be released back within a limited 636 Memory allocated from the pool should be released back within a limited
637 amount of time (in the case of bio, that would be after the i/o is completed). 637 amount of time (in the case of bio, that would be after the i/o is completed).
638 This ensures that if part of the pool has been used up, some work (in this 638 This ensures that if part of the pool has been used up, some work (in this
639 case i/o) must already be in progress and memory would be available when it 639 case i/o) must already be in progress and memory would be available when it
640 is over. If allocating from multiple pools in the same code path, the order 640 is over. If allocating from multiple pools in the same code path, the order
641 or hierarchy of allocation needs to be consistent, just the way one deals 641 or hierarchy of allocation needs to be consistent, just the way one deals
642 with multiple locks. 642 with multiple locks.
643 643
644 The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc()) 644 The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
645 for a non-clone bio. There are 6 pools set up for different size biovecs, 645 for a non-clone bio. There are 6 pools set up for different size biovecs,
646 so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the 646 so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
647 given size from these slabs. 647 given size from these slabs.
648 648
649 The bi_destructor() routine takes into account the possibility of the bio 649 The bi_destructor() routine takes into account the possibility of the bio
650 having originated from a different source (see later discussions on 650 having originated from a different source (see later discussions on
651 n/w to block transfers and kvec_cb). 651 n/w to block transfers and kvec_cb).
652 652
653 The bio_get() routine may be used to hold an extra reference on a bio prior 653 The bio_get() routine may be used to hold an extra reference on a bio prior
654 to i/o submission, if the bio fields are likely to be accessed after the 654 to i/o submission, if the bio fields are likely to be accessed after the
655 i/o is issued (since the bio may otherwise get freed in case i/o completion 655 i/o is issued (since the bio may otherwise get freed in case i/o completion
656 happens in the meantime). 656 happens in the meantime).
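
Why the extra reference matters can be shown with a small userspace model.
This is a sketch only: toy_bio and its helpers are hypothetical stand-ins for
bio_alloc/bio_get/bio_put, and the freed flag exists purely so the example can
observe when the object goes away.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_bio {
    int  refcnt;    /* stands in for the bio reference count */
    int *freed;     /* set to 1 when the last reference is dropped */
};

/* A freshly allocated object starts with one reference. */
static struct toy_bio *toy_bio_alloc(int *freed_flag)
{
    struct toy_bio *b = malloc(sizeof(*b));

    if (!b)
        return NULL;
    b->refcnt = 1;
    b->freed = freed_flag;
    *freed_flag = 0;
    return b;
}

/* Hold an extra reference, e.g. before submitting the i/o. */
static void toy_bio_get(struct toy_bio *b)
{
    b->refcnt++;
}

/* Drop a reference; the object is freed when the count reaches zero. */
static void toy_bio_put(struct toy_bio *b)
{
    assert(b->refcnt > 0);
    if (--b->refcnt == 0) {
        *b->freed = 1;
        free(b);
    }
}
```

If the submitter calls toy_bio_get() before issuing the i/o, the put done on
completion leaves the object alive until the submitter does its own put.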
657 657
658 The bio_clone() routine may be used to duplicate a bio, where the clone 658 The bio_clone() routine may be used to duplicate a bio, where the clone
659 shares the bio_vec_list with the original bio (i.e. both point to the 659 shares the bio_vec_list with the original bio (i.e. both point to the
660 same bio_vec_list). This would typically be used for splitting i/o requests 660 same bio_vec_list). This would typically be used for splitting i/o requests
661 in lvm or md. 661 in lvm or md.
662 662
663 3.2 Generic bio helper Routines 663 3.2 Generic bio helper Routines
664 664
665 3.2.1 Traversing segments and completion units in a request 665 3.2.1 Traversing segments and completion units in a request
666 666
667 The macros bio_for_each_segment() and rq_for_each_bio() should be used for 667 The macros bio_for_each_segment() and rq_for_each_bio() should be used for
668 traversing the bios in the request list (drivers should avoid directly 668 traversing the bios in the request list (drivers should avoid directly
669 trying to do it themselves). Using these helpers should also make it easier 669 trying to do it themselves). Using these helpers should also make it easier
670 to cope with block changes in the future. 670 to cope with block changes in the future.
671 671
672 rq_for_each_bio(bio, rq) 672 rq_for_each_bio(bio, rq)
673 bio_for_each_segment(bio_vec, bio, i) 673 bio_for_each_segment(bio_vec, bio, i)
674 /* bio_vec is now current segment */ 674 /* bio_vec is now current segment */
675 675
676 I/O completion callbacks are per-bio rather than per-segment, so drivers 676 I/O completion callbacks are per-bio rather than per-segment, so drivers
677 that traverse bio chains on completion need to keep that in mind. Drivers 677 that traverse bio chains on completion need to keep that in mind. Drivers
678 which don't make a distinction between segments and completion units would 678 which don't make a distinction between segments and completion units would
679 need to be reorganized to support multi-segment bios. 679 need to be reorganized to support multi-segment bios.
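
The nesting of the two traversal macros can be illustrated with simplified
structures. Everything below is a hypothetical userspace model (toy_rq,
toy_bio, toy_vec and the toy_* macros are not the kernel definitions); it only
shows the shape of the loop: bios are chained within a request, and each bio
is walked from its current index.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vec { unsigned int len; };          /* one segment */

struct toy_bio {
    struct toy_bio *next;                      /* next bio in the request */
    struct toy_vec *vecs;                      /* segment array */
    int idx;                                   /* first pending segment */
    int cnt;                                   /* number of segments */
};

struct toy_rq { struct toy_bio *bio; };

/* Walk every bio chained to the request. */
#define toy_rq_for_each_bio(b, rq) \
    for ((b) = (rq)->bio; (b) != NULL; (b) = (b)->next)

/* Walk the segments of one bio, starting at its current index. */
#define toy_bio_for_each_segment(v, b, i) \
    for ((i) = (b)->idx, (v) = &(b)->vecs[(i)]; (i) < (b)->cnt; (i)++, (v)++)

/* Sum the bytes still pending in a request. */
static unsigned int toy_rq_bytes(struct toy_rq *rq)
{
    struct toy_bio *bio;
    struct toy_vec *vec;
    unsigned int total = 0;
    int i;

    toy_rq_for_each_bio(bio, rq)
        toy_bio_for_each_segment(vec, bio, i)
            total += vec->len;
    return total;
}
```

Because the inner loop starts at idx rather than 0, segments that have already
been completed are skipped without the caller tracking them.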
680 680
681 3.2.2 Setting up DMA scatterlists 681 3.2.2 Setting up DMA scatterlists
682 682
683 The blk_rq_map_sg() helper routine would be used for setting up scatter 683 The blk_rq_map_sg() helper routine would be used for setting up scatter
684 gather lists from a request, so a driver need not do it on its own. 684 gather lists from a request, so a driver need not do it on its own.
685 685
686 nr_segments = blk_rq_map_sg(q, rq, scatterlist); 686 nr_segments = blk_rq_map_sg(q, rq, scatterlist);
687 687
688 The helper routine provides a level of abstraction which makes it easier 688 The helper routine provides a level of abstraction which makes it easier
689 to modify the internals of request to scatterlist conversion down the line 689 to modify the internals of request to scatterlist conversion down the line
690 without breaking drivers. The blk_rq_map_sg routine takes care of several 690 without breaking drivers. The blk_rq_map_sg routine takes care of several
691 things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER 691 things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
692 is set) and correct segment accounting to avoid exceeding the limits which 692 is set) and correct segment accounting to avoid exceeding the limits which
693 the i/o hardware can handle, based on various queue properties. 693 the i/o hardware can handle, based on various queue properties.
694 694
695 - Prevents a clustered segment from crossing a 4GB mem boundary 695 - Prevents a clustered segment from crossing a 4GB mem boundary
696 - Avoids building segments that would exceed the number of physical 696 - Avoids building segments that would exceed the number of physical
697 memory segments that the driver can handle (phys_segments) and the 697 memory segments that the driver can handle (phys_segments) and the
698 number that the underlying hardware can handle at once, accounting for 698 number that the underlying hardware can handle at once, accounting for
699 DMA remapping (hw_segments) (i.e. IOMMU aware limits). 699 DMA remapping (hw_segments) (i.e. IOMMU aware limits).
700 700
701 Routines which the low level driver can use to set up the segment limits: 701 Routines which the low level driver can use to set up the segment limits:
702 702
703 blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of 703 blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
704 hw data segments in a request (i.e. the maximum number of address/length 704 hw data segments in a request (i.e. the maximum number of address/length
705 pairs the host adapter can actually hand to the device at once) 705 pairs the host adapter can actually hand to the device at once)
706 706
707 blk_queue_max_phys_segments() : Sets an upper limit on the maximum number 707 blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
708 of physical data segments in a request (i.e. the largest sized scatter list 708 of physical data segments in a request (i.e. the largest sized scatter list
709 a driver could handle) 709 a driver could handle)
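
The clustering rules above (merge physically contiguous segments, never across
a 4GB boundary, never beyond the segment limit) can be sketched as follows.
This is a userspace illustration with hypothetical names (toy_seg, toy_map_sg),
not the blk_rq_map_sg() implementation. Returning -1 when the limit would be
exceeded is an assumption made here to keep the example checkable; the text
describes the block layer as avoiding building such requests in the first
place. Segment lengths are assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

struct toy_seg {
    uint64_t addr;    /* physical start address */
    uint32_t len;     /* length in bytes, assumed > 0 */
};

#define TOY_4GB_MASK 0xffffffffULL

/*
 * Copy segments from in[] to out[], merging neighbours when clustering is
 * enabled. Returns the number of output segments, or -1 if more than
 * max_out would be needed.
 */
static int toy_map_sg(const struct toy_seg *in, int n_in,
                      struct toy_seg *out, int max_out, int cluster)
{
    int n = 0;

    for (int i = 0; i < n_in; i++) {
        if (cluster && n > 0) {
            struct toy_seg *prev = &out[n - 1];
            uint64_t end  = prev->addr + prev->len;
            uint64_t last = in[i].addr + in[i].len - 1;

            /* contiguous, and merged range stays inside one 4GB window */
            if (end == in[i].addr &&
                (prev->addr | TOY_4GB_MASK) == (last | TOY_4GB_MASK)) {
                prev->len += in[i].len;
                continue;
            }
        }
        if (n == max_out)
            return -1;
        out[n++] = in[i];
    }
    return n;
}
```

With clustering disabled, the output is a one-to-one copy of the input
segments, which is why the segment limits matter even for simple hardware.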
710 710
711 3.2.3 I/O completion 711 3.2.3 I/O completion
712 712
713 The existing generic block layer helper routines end_request, 713 The existing generic block layer helper routines end_request,
714 end_that_request_first and end_that_request_last can be used for i/o 714 end_that_request_first and end_that_request_last can be used for i/o
715 completion (and setting things up so the rest of the i/o or the next 715 completion (and setting things up so the rest of the i/o or the next
716 request can be kicked off) as before. With the introduction of multi-page 716 request can be kicked off) as before. With the introduction of multi-page
717 bio support, end_that_request_first requires an additional argument indicating 717 bio support, end_that_request_first requires an additional argument indicating
718 the number of sectors completed. 718 the number of sectors completed.
719 719
720 3.2.4 Implications for drivers that do not interpret bios (don't handle 720 3.2.4 Implications for drivers that do not interpret bios (don't handle
721 multiple segments) 721 multiple segments)
722 722
723 Drivers that do not interpret bios e.g. those which do not handle multiple 723 Drivers that do not interpret bios e.g. those which do not handle multiple
724 segments and do not support i/o into high memory addresses (require bounce 724 segments and do not support i/o into high memory addresses (require bounce
725 buffers) and expect only virtually mapped buffers, can access the rq->buffer 725 buffers) and expect only virtually mapped buffers, can access the rq->buffer
726 field. As before the driver should use current_nr_sectors to determine the 726 field. As before the driver should use current_nr_sectors to determine the
727 size of remaining data in the current segment (that is the maximum it can 727 size of remaining data in the current segment (that is the maximum it can
728 transfer in one go unless it interprets segments), and rely on the block layer 728 transfer in one go unless it interprets segments), and rely on the block layer
729 end_request, or end_that_request_first/last to take care of all accounting 729 end_request, or end_that_request_first/last to take care of all accounting
730 and transparent mapping of the next bio segment when a segment boundary 730 and transparent mapping of the next bio segment when a segment boundary
731 is crossed on completion of a transfer. (The end*request* functions should 731 is crossed on completion of a transfer. (The end*request* functions should
732 be used if and only if the request has come down from block/bio path, not for 732 be used if and only if the request has come down from block/bio path, not for
733 direct access requests which only specify rq->buffer without a valid rq->bio) 733 direct access requests which only specify rq->buffer without a valid rq->bio)
734 734
735 3.2.5 Generic request command tagging 735 3.2.5 Generic request command tagging
736 736
737 3.2.5.1 Tag helpers 737 3.2.5.1 Tag helpers
738 738
739 Block now offers some simple generic functionality to help support command 739 Block now offers some simple generic functionality to help support command
740 queueing (typically known as tagged command queueing), i.e. manage more than 740 queueing (typically known as tagged command queueing), i.e. manage more than
741 one outstanding command on a queue at any given time. 741 one outstanding command on a queue at any given time.
742 742
743 blk_queue_init_tags(request_queue_t *q, int depth) 743 blk_queue_init_tags(request_queue_t *q, int depth)
744 744
745 Initialize internal command tagging structures for a maximum 745 Initialize internal command tagging structures for a maximum
746 depth of 'depth'. 746 depth of 'depth'.
747 747
748 blk_queue_free_tags(request_queue_t *q) 748 blk_queue_free_tags(request_queue_t *q)
749 749
750 Teardown tag info associated with the queue. This will be done 750 Teardown tag info associated with the queue. This will be done
751 automatically by block if blk_queue_cleanup() is called on a queue 751 automatically by block if blk_queue_cleanup() is called on a queue
752 that is using tagging. 752 that is using tagging.
753 753
754 The above are initialization and exit management; the main helpers during 754 The above are initialization and exit management; the main helpers during
755 normal operations are: 755 normal operations are:
756 756
757 blk_queue_start_tag(request_queue_t *q, struct request *rq) 757 blk_queue_start_tag(request_queue_t *q, struct request *rq)
758 758
759 Start tagged operation for this request. A free tag number between 759 Start tagged operation for this request. A free tag number between
760 0 and 'depth' is assigned to the request (rq->tag holds this number), 760 0 and 'depth' is assigned to the request (rq->tag holds this number),
761 and 'rq' is added to the internal tag management. If the maximum depth 761 and 'rq' is added to the internal tag management. If the maximum depth
762 for this queue is already achieved (or if the tag wasn't started for 762 for this queue is already achieved (or if the tag wasn't started for
763 some other reason), 1 is returned. Otherwise 0 is returned. 763 some other reason), 1 is returned. Otherwise 0 is returned.
764 764
765 blk_queue_end_tag(request_queue_t *q, struct request *rq) 765 blk_queue_end_tag(request_queue_t *q, struct request *rq)
766 766
767 End tagged operation on this request. 'rq' is removed from the internal 767 End tagged operation on this request. 'rq' is removed from the internal
768 book keeping structures. 768 book keeping structures.
769 769
770 To minimize struct request and queue overhead, the tag helpers utilize some 770 To minimize struct request and queue overhead, the tag helpers utilize some
771 of the same request members that are used for normal request queue management. 771 of the same request members that are used for normal request queue management.
772 This means that a request cannot both be an active tag and be on the queue 772 This means that a request cannot both be an active tag and be on the queue
773 list at the same time. blk_queue_start_tag() will remove the request, but 773 list at the same time. blk_queue_start_tag() will remove the request, but
774 the driver must remember to call blk_queue_end_tag() before signalling 774 the driver must remember to call blk_queue_end_tag() before signalling
775 completion of the request to the block layer. This means ending tag 775 completion of the request to the block layer. This means ending tag
776 operations before calling end_that_request_last()! For an example of a user 776 operations before calling end_that_request_last()! For an example of a user
777 of these helpers, see the IDE tagged command queueing support. 777 of these helpers, see the IDE tagged command queueing support.
778 778
779 Certain hardware conditions may dictate a need to invalidate the block tag 779 Certain hardware conditions may dictate a need to invalidate the block tag
780 queue. For instance, on IDE any tagged request error needs to clear both 780 queue. For instance, on IDE any tagged request error needs to clear both
781 the hardware and software block queue and enable the driver to sanely restart 781 the hardware and software block queue and enable the driver to sanely restart
782 all the outstanding requests. There's a third helper to do that: 782 all the outstanding requests. There's a third helper to do that:
783 783
784 blk_queue_invalidate_tags(request_queue_t *q) 784 blk_queue_invalidate_tags(request_queue_t *q)
785 785
786 Clear the internal block tag queue and re-add all the pending requests 786 Clear the internal block tag queue and re-add all the pending requests
787 to the request queue. The driver will receive them again on the 787 to the request queue. The driver will receive them again on the
788 next request_fn run, just like it did the first time it encountered 788 next request_fn run, just like it did the first time it encountered
789 them. 789 them.
790 790
791 3.2.5.2 Tag info 791 3.2.5.2 Tag info
792 792
793 Some block functions exist to query current tag status or to go from a 793 Some block functions exist to query current tag status or to go from a
794 tag number to the associated request. These are, in no particular order: 794 tag number to the associated request. These are, in no particular order:
795 795
796 blk_queue_tagged(q) 796 blk_queue_tagged(q)
797 797
798 Returns 1 if the queue 'q' is using tagging, 0 if not. 798 Returns 1 if the queue 'q' is using tagging, 0 if not.
799 799
800 blk_queue_tag_request(q, tag) 800 blk_queue_tag_request(q, tag)
801 801
802 Returns a pointer to the request associated with tag 'tag'. 802 Returns a pointer to the request associated with tag 'tag'.
803 803
804 blk_queue_tag_depth(q) 804 blk_queue_tag_depth(q)
805 805
806 Return current queue depth. 806 Return current queue depth.
807 807
808 blk_queue_tag_queue(q) 808 blk_queue_tag_queue(q)
809 809
810 Returns 1 if the queue can accept a new queued command, 0 if we are 810 Returns 1 if the queue can accept a new queued command, 0 if we are
811 at the maximum depth already. 811 at the maximum depth already.
812 812
813 blk_queue_rq_tagged(rq) 813 blk_queue_rq_tagged(rq)
814 814
815 Returns 1 if the request 'rq' is tagged. 815 Returns 1 if the request 'rq' is tagged.
816 816
817 3.2.5.3 Internal structure 817 3.2.5.3 Internal structure
818 818
819 Internally, block manages tags in the blk_queue_tag structure: 819 Internally, block manages tags in the blk_queue_tag structure:
820 820
821 struct blk_queue_tag { 821 struct blk_queue_tag {
822 struct request **tag_index; /* array of pointers to rq */ 822 struct request **tag_index; /* array of pointers to rq */
823 unsigned long *tag_map; /* bitmap of free tags */ 823 unsigned long *tag_map; /* bitmap of free tags */
824 struct list_head busy_list; /* fifo list of busy tags */ 824 struct list_head busy_list; /* fifo list of busy tags */
825 int busy; /* queue depth */ 825 int busy; /* queue depth */
826 int max_depth; /* max queue depth */ 826 int max_depth; /* max queue depth */
827 }; 827 };
828 828
829 Most of the above is simple and straightforward; however, busy_list may need 829 Most of the above is simple and straightforward; however, busy_list may need
830 a bit of explaining. Normally we don't care too much about request ordering, 830 a bit of explaining. Normally we don't care too much about request ordering,
831 but in the event of any barrier requests in the tag queue we need to ensure 831 but in the event of any barrier requests in the tag queue we need to ensure
832 that requests are restarted in the order they were queued. This may happen 832 that requests are restarted in the order they were queued. This may happen
833 if the driver needs to use blk_queue_invalidate_tags(). 833 if the driver needs to use blk_queue_invalidate_tags().
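
How tag_index, the tag bitmap, busy and max_depth work together can be
sketched in userspace C. This is a simplified model, not the block layer code:
the toy_* names are hypothetical, the bitmap is a single unsigned long in
which a set bit means "tag in use", and the busy_list ordering described above
is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEPTH 32

struct toy_rq { int tag; };            /* tag == -1 means "not tagged" */

struct toy_tags {
    struct toy_rq *tag_index[TOY_MAX_DEPTH];  /* tag number -> request */
    unsigned long  tag_map;                   /* bit set = tag in use */
    int            busy;                      /* tags currently in use */
    int            max_depth;                 /* max queue depth */
};

static void toy_init_tags(struct toy_tags *t, int depth)
{
    assert(depth > 0 && depth <= TOY_MAX_DEPTH);
    for (int i = 0; i < TOY_MAX_DEPTH; i++)
        t->tag_index[i] = NULL;
    t->tag_map = 0;
    t->busy = 0;
    t->max_depth = depth;
}

/* Returns 0 if a tag was assigned, 1 if the queue is at maximum depth. */
static int toy_start_tag(struct toy_tags *t, struct toy_rq *rq)
{
    for (int tag = 0; tag < t->max_depth; tag++) {
        if (!(t->tag_map & (1UL << tag))) {
            t->tag_map |= 1UL << tag;
            t->tag_index[tag] = rq;
            rq->tag = tag;
            t->busy++;
            return 0;
        }
    }
    return 1;
}

/* Release the tag held by rq so that it can be reused. */
static void toy_end_tag(struct toy_tags *t, struct toy_rq *rq)
{
    assert(rq->tag >= 0 && t->tag_index[rq->tag] == rq);
    t->tag_map &= ~(1UL << rq->tag);
    t->tag_index[rq->tag] = NULL;
    rq->tag = -1;
    t->busy--;
}
```

Looking up a request from a tag number is then a plain array index into
tag_index, which is the job the text assigns to blk_queue_tag_request().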
834 834
835 Tagging also defines a new request flag, REQ_QUEUED. This is set whenever 835 Tagging also defines a new request flag, REQ_QUEUED. This is set whenever
836 a request is currently tagged. You should not use this flag directly; 836 a request is currently tagged. You should not use this flag directly;
837 blk_rq_tagged(rq) is the portable way to test it. 837 blk_rq_tagged(rq) is the portable way to test it.
838 838
839 3.3 I/O Submission 839 3.3 I/O Submission
840 840
841 The routine submit_bio() is used to submit a single io. Higher level i/o 841 The routine submit_bio() is used to submit a single io. Higher level i/o
842 routines make use of this: 842 routines make use of this:
843 843
844 (a) Buffered i/o: 844 (a) Buffered i/o:
845 The routine submit_bh() invokes submit_bio() on a bio corresponding to the 845 The routine submit_bh() invokes submit_bio() on a bio corresponding to the
846 bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before. 846 bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
847 847
848 (b) Kiobuf i/o (for raw/direct i/o): 848 (b) Kiobuf i/o (for raw/direct i/o):
849 The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and 849 The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
850 maps the array to one or more multi-page bios, issuing submit_bio() to 850 maps the array to one or more multi-page bios, issuing submit_bio() to
851 perform the i/o on each of these. 851 perform the i/o on each of these.
852 852
853 The embedded bh array in the kiobuf structure has been removed and no 853 The embedded bh array in the kiobuf structure has been removed and no
854 preallocation of bios is done for kiobufs. [The intent is to remove the 854 preallocation of bios is done for kiobufs. [The intent is to remove the
855 blocks array as well, but it's currently in there to kludge around direct i/o.] 855 blocks array as well, but it's currently in there to kludge around direct i/o.]
856 Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc. 856 Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
857 857
858 Todo/Observation: 858 Todo/Observation:
859 859
860 A single kiobuf structure is assumed to correspond to a contiguous range 860 A single kiobuf structure is assumed to correspond to a contiguous range
861 of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec. 861 of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
862 So right now it wouldn't work for direct i/o on non-contiguous blocks. 862 So right now it wouldn't work for direct i/o on non-contiguous blocks.
863 This is to be resolved. The eventual direction is to replace kiobuf 863 This is to be resolved. The eventual direction is to replace kiobuf
864 by kvec's. 864 by kvec's.
865 865
866 Badari Pulavarty has a patch to implement direct i/o correctly using 866 Badari Pulavarty has a patch to implement direct i/o correctly using
867 bio and kvec. 867 bio and kvec.
868 868
869 869
870 (c) Page i/o: 870 (c) Page i/o:
871 Todo/Under discussion: 871 Todo/Under discussion:
872 872
873 Andrew Morton's multi-page bio patches attempt to issue multi-page 873 Andrew Morton's multi-page bio patches attempt to issue multi-page
874 writeouts (and reads) from the page cache, by directly building up 874 writeouts (and reads) from the page cache, by directly building up
875 large bios for submission completely bypassing the usage of buffer 875 large bios for submission completely bypassing the usage of buffer
876 heads. This work is still in progress. 876 heads. This work is still in progress.
877 877
878 Christoph Hellwig had some code that uses bios for page-io (rather than 878 Christoph Hellwig had some code that uses bios for page-io (rather than
879 bh). This isn't included in bio as yet. Christoph was also working on a 879 bh). This isn't included in bio as yet. Christoph was also working on a
880 design for representing virtual/real extents as an entity and modifying 880 design for representing virtual/real extents as an entity and modifying
881 some of the address space ops interfaces to utilize this abstraction rather 881 some of the address space ops interfaces to utilize this abstraction rather
882 than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf 882 than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
883 abstraction, but intended to be as lightweight as possible). 883 abstraction, but intended to be as lightweight as possible).
884 884
885 (d) Direct access i/o: 885 (d) Direct access i/o:
886 Direct access requests that do not contain bios would be submitted differently 886 Direct access requests that do not contain bios would be submitted differently
887 as discussed earlier in section 1.3. 887 as discussed earlier in section 1.3.
888 888
889 Aside: 889 Aside:
890 890
891 Kvec i/o: 891 Kvec i/o:
892 892
893 Ben LaHaise's aio code uses a slightly different structure instead 893 Ben LaHaise's aio code uses a slightly different structure instead
894 of kiobufs, called a kvec_cb. This contains an array of <page, offset, len> 894 of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
895 tuples (very much like the networking code), together with a callback function 895 tuples (very much like the networking code), together with a callback function
896 and data pointer. This is embedded into a brw_cb structure when passed 896 and data pointer. This is embedded into a brw_cb structure when passed
897 to brw_kvec_async(). 897 to brw_kvec_async().
898 898
899 Now it should be possible to directly map these kvecs to a bio. Just as while 899 Now it should be possible to directly map these kvecs to a bio. Just as while
900 cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec 900 cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
901 array pointer to point to the veclet array in kvecs. 901 array pointer to point to the veclet array in kvecs.
902 902
903 TBD: In order for this to work, some changes are needed in the way multi-page 903 TBD: In order for this to work, some changes are needed in the way multi-page
904 bios are handled today. The values of the tuples in such a vector passed in 904 bios are handled today. The values of the tuples in such a vector passed in
905 from higher level code should not be modified by the block layer in the course 905 from higher level code should not be modified by the block layer in the course
906 of its request processing, since that would make it hard for the higher layer 906 of its request processing, since that would make it hard for the higher layer
907 to continue to use the vector descriptor (kvec) after i/o completes. Instead, 907 to continue to use the vector descriptor (kvec) after i/o completes. Instead,
908 all such transient state should be maintained in the request structure, 908 all such transient state should be maintained in the request structure,
909 and passed on in some way to the endio completion routine. 909 and passed on in some way to the endio completion routine.
910 910
911 911
912 4. The I/O scheduler 912 4. The I/O scheduler
913 The I/O scheduler, a.k.a. elevator, is implemented in two layers: the generic 913 The I/O scheduler, a.k.a. elevator, is implemented in two layers: the generic
914 dispatch queue and specific I/O schedulers. Unless stated otherwise, elevator 914 dispatch queue and specific I/O schedulers. Unless stated otherwise, elevator
915 is used to refer to both parts and I/O scheduler to specific I/O schedulers. 915 is used to refer to both parts and I/O scheduler to specific I/O schedulers.
916 916
917 The block layer implements the generic dispatch queue in ll_rw_blk.c and elevator.c. 917 The block layer implements the generic dispatch queue in ll_rw_blk.c and elevator.c.
918 The generic dispatch queue is responsible for properly ordering barrier 918 The generic dispatch queue is responsible for properly ordering barrier
919 requests, requeueing, handling non-fs requests and all other subtleties. 919 requests, requeueing, handling non-fs requests and all other subtleties.
920 920
921 Specific I/O schedulers are responsible for ordering normal filesystem 921 Specific I/O schedulers are responsible for ordering normal filesystem
922 requests. They can also choose to delay certain requests to improve 922 requests. They can also choose to delay certain requests to improve
923 throughput or whatever purpose. As the plural form indicates, there are 923 throughput or whatever purpose. As the plural form indicates, there are
924 multiple I/O schedulers. They can be built as modules but at least one should 924 multiple I/O schedulers. They can be built as modules but at least one should
925 be built inside the kernel. Each queue can choose a different one and can also 925 be built inside the kernel. Each queue can choose a different one and can also
926 change to another one dynamically. 926 change to another one dynamically.
927 927
928 A block layer call to the i/o scheduler follows the convention elv_xxx(). This 928 A block layer call to the i/o scheduler follows the convention elv_xxx(). This
929 calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh, 929 calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh,
930 xxx and xxx might not match exactly, but use your imagination. If an elevator 930 xxx and xxx might not match exactly, but use your imagination. If an elevator
931 doesn't implement a function, the switch does nothing or some minimal house 931 doesn't implement a function, the switch does nothing or some minimal house
932 keeping work. 932 keeping work.
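
The elv_xxx() to elevator_xxx_fn convention amounts to a table of function
pointers with wrappers that tolerate missing entries. The sketch below is a
userspace model with hypothetical names (toy_elevator_ops and friends), not
the contents of elevator.c. Treating add_req_fn as always present and
reporting "empty" when no queue_empty_fn is supplied are assumptions made for
the example.

```c
#include <assert.h>
#include <stddef.h>

struct toy_rq { int sector; };

/* The "elevator switch": hooks a scheduler may provide. */
struct toy_elevator_ops {
    void (*add_req_fn)(void *data, struct toy_rq *rq);   /* always set here */
    int  (*queue_empty_fn)(void *data);                  /* optional */
    void (*merged_fn)(void *data, struct toy_rq *rq);    /* optional */
};

struct toy_elevator {
    const struct toy_elevator_ops *ops;
    void *data;                       /* scheduler private data */
};

/* elv_xxx() style wrappers called by the block layer. */
static void toy_elv_add_request(struct toy_elevator *e, struct toy_rq *rq)
{
    e->ops->add_req_fn(e->data, rq);
}

static int toy_elv_queue_empty(struct toy_elevator *e)
{
    if (e->ops->queue_empty_fn)
        return e->ops->queue_empty_fn(e->data);
    return 1;                         /* no hook: minimal default answer */
}

static void toy_elv_merged_request(struct toy_elevator *e, struct toy_rq *rq)
{
    if (e->ops->merged_fn)            /* no hook: do nothing */
        e->ops->merged_fn(e->data, rq);
}

/* A trivial scheduler that only counts the requests it is given. */
static void count_add(void *data, struct toy_rq *rq)
{
    (void)rq;
    ++*(int *)data;
}

static const struct toy_elevator_ops count_ops = {
    .add_req_fn = count_add,
};
```

A scheduler fills in only the hooks it cares about; the wrappers keep callers
in the block layer free of NULL checks.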
933 933
934 4.1. I/O scheduler API 934 4.1. I/O scheduler API
935 935
936 The functions an elevator may implement are: (* are mandatory) 936 The functions an elevator may implement are: (* are mandatory)
937 elevator_merge_fn called to query requests for merge with a bio 937 elevator_merge_fn called to query requests for merge with a bio
938 938
939 elevator_merge_req_fn called when two requests get merged. The one 939 elevator_merge_req_fn called when two requests get merged. The one
940 which gets merged into the other one will 940 which gets merged into the other one will
941 never be seen by the I/O scheduler again. IOW, after 941 never be seen by the I/O scheduler again. IOW, after
942 being merged, the request is gone. 942 being merged, the request is gone.
943 943
944 elevator_merged_fn called when a request in the scheduler has been 944 elevator_merged_fn called when a request in the scheduler has been
945 involved in a merge. It is used in the deadline 945 involved in a merge. It is used in the deadline
946 scheduler for example, to reposition the request 946 scheduler for example, to reposition the request
947 if its sorting order has changed. 947 if its sorting order has changed.
948 948
949 elevator_dispatch_fn fills the dispatch queue with ready requests. 949 elevator_dispatch_fn fills the dispatch queue with ready requests.
950 I/O schedulers are free to postpone requests by 950 I/O schedulers are free to postpone requests by
951 not filling the dispatch queue unless @force 951 not filling the dispatch queue unless @force
952 is non-zero. Once dispatched, I/O schedulers 952 is non-zero. Once dispatched, I/O schedulers
953 are not allowed to manipulate the requests - 953 are not allowed to manipulate the requests -
954 they belong to the generic dispatch queue. 954 they belong to the generic dispatch queue.
955 955
956 elevator_add_req_fn called to add a new request into the scheduler 956 elevator_add_req_fn called to add a new request into the scheduler
957 957
958 elevator_queue_empty_fn returns true if the merge queue is empty. 958 elevator_queue_empty_fn returns true if the merge queue is empty.
959 Drivers shouldn't use this, but rather check 959 Drivers shouldn't use this, but rather check
960 if elv_next_request is NULL (without losing the 960 if elv_next_request is NULL (without losing the
961 request if one exists!) 961 request if one exists!)
962 962
963 elevator_former_req_fn 963 elevator_former_req_fn
964 elevator_latter_req_fn These return the request before or after the 964 elevator_latter_req_fn These return the request before or after the
965 one specified in disk sort order. Used by the 965 one specified in disk sort order. Used by the
966 block layer to find merge possibilities. 966 block layer to find merge possibilities.
967 967
968 elevator_completed_req_fn called when a request is completed. 968 elevator_completed_req_fn called when a request is completed.
969 969
970 elevator_may_queue_fn returns true if the scheduler wants to allow the 970 elevator_may_queue_fn returns true if the scheduler wants to allow the
971 current context to queue a new request even if 971 current context to queue a new request even if
972 it is over the queue limit. This must be used 972 it is over the queue limit. This must be used
973 very carefully!! 973 very carefully!!
974 974
975 elevator_set_req_fn 975 elevator_set_req_fn
976 elevator_put_req_fn Must be used to allocate and free any elevator 976 elevator_put_req_fn Must be used to allocate and free any elevator
977 specific storage for a request. 977 specific storage for a request.
978 978
979 elevator_activate_req_fn Called when device driver first sees a request. 979 elevator_activate_req_fn Called when device driver first sees a request.
980 I/O schedulers can use this callback to 980 I/O schedulers can use this callback to
981 determine when actual execution of a request 981 determine when actual execution of a request
982 starts. 982 starts.
983 elevator_deactivate_req_fn Called when device driver decides to delay 983 elevator_deactivate_req_fn Called when device driver decides to delay
984 a request by requeueing it. 984 a request by requeueing it.
985 985
986 elevator_init_fn 986 elevator_init_fn
987 elevator_exit_fn Allocate and free any elevator specific storage 987 elevator_exit_fn Allocate and free any elevator specific storage
988 for a queue. 988 for a queue.
989 989
990 4.2 Request flows seen by I/O schedulers 990 4.2 Request flows seen by I/O schedulers
991 All requests seen by I/O schedulers strictly follow one of the following three 991 All requests seen by I/O schedulers strictly follow one of the following three
992 flows. 992 flows.
993 993
994 set_req_fn -> 994 set_req_fn ->
995 995
996 i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn -> 996 i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
997 (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn 997 (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
998 ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn 998 ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn
999 iii. [none] 999 iii. [none]
1000 1000
1001 -> put_req_fn 1001 -> put_req_fn
1002 1002
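The three flows above can be read as a small state machine. The following is
a minimal user-space sketch, not kernel code: the state and event names
(ST_*, EV_*) and the helpers rq_step() and rq_flow_is_valid() are hypothetical
and exist only for illustration. It assumes set_req_fn has already run and
checks that a sequence of callbacks ends in put_req_fn along one of the
documented flows.

```c
#include <stddef.h>

/* Hypothetical states and events; illustrative names, not kernel symbols. */
enum rq_state { ST_SET, ST_ADDED, ST_DISPATCHED, ST_ACTIVE, ST_REQUEUED,
		ST_COMPLETED, ST_MERGED_AWAY, ST_PUT, ST_INVALID };
enum rq_event { EV_ADD, EV_MERGED, EV_DISPATCH, EV_ACTIVATE, EV_DEACTIVATE,
		EV_COMPLETED, EV_MERGE_REQ, EV_PUT };

/* Advance one step; any transition not listed in section 4.2 is invalid. */
static enum rq_state rq_step(enum rq_state s, enum rq_event e)
{
	switch (s) {
	case ST_SET:
		if (e == EV_ADD) return ST_ADDED;
		if (e == EV_PUT) return ST_PUT;			/* flow iii */
		break;
	case ST_ADDED:
		if (e == EV_MERGED) return ST_ADDED;		/* (merged_fn ->)* */
		if (e == EV_DISPATCH) return ST_DISPATCHED;
		if (e == EV_MERGE_REQ) return ST_MERGED_AWAY;	/* flow ii */
		break;
	case ST_DISPATCHED:
	case ST_REQUEUED:
		if (e == EV_ACTIVATE) return ST_ACTIVE;
		break;
	case ST_ACTIVE:
		if (e == EV_DEACTIVATE) return ST_REQUEUED;
		if (e == EV_COMPLETED) return ST_COMPLETED;
		break;
	case ST_COMPLETED:
	case ST_MERGED_AWAY:
		if (e == EV_PUT) return ST_PUT;
		break;
	default:
		break;
	}
	return ST_INVALID;
}

/* Returns 1 if the events, applied after set_req_fn, end in put_req_fn. */
static int rq_flow_is_valid(const enum rq_event *ev, size_t n)
{
	enum rq_state s = ST_SET;
	size_t i;

	for (i = 0; i < n; i++) {
		s = rq_step(s, ev[i]);
		if (s == ST_INVALID)
			return 0;
	}
	return s == ST_PUT;
}
```

A scheduler author could use a check of this kind in a debugging harness to
catch a callback that arrives out of order, for example completed_req_fn on a
request that was never activated.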
1003 4.3 I/O scheduler implementation 1003 4.3 I/O scheduler implementation
1004 The generic i/o scheduler algorithm attempts to sort/merge/batch requests for 1004 The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
1005 optimal disk scan and request servicing performance (based on generic 1005 optimal disk scan and request servicing performance (based on generic
1006 principles and device capabilities), optimized for: 1006 principles and device capabilities), optimized for:
1007 i. improved throughput 1007 i. improved throughput
1008 ii. improved latency 1008 ii. improved latency
1009 iii. better utilization of h/w & CPU time 1009 iii. better utilization of h/w & CPU time
1010 1010
1011 Characteristics: 1011 Characteristics:
1012 1012
1013 i. Binary tree 1013 i. Binary tree
1014 AS and deadline i/o schedulers use red black binary trees for disk position 1014 AS and deadline i/o schedulers use red black binary trees for disk position
1015 sorting and searching, and a fifo linked list for time-based searching. This 1015 sorting and searching, and a fifo linked list for time-based searching. This
1016 gives good scalability and good availability of information. Requests are 1016 gives good scalability and good availability of information. Requests are
1017 almost always dispatched in disk sort order, so a cache is kept of the next 1017 almost always dispatched in disk sort order, so a cache is kept of the next
1018 request in sort order to prevent binary tree lookups. 1018 request in sort order to prevent binary tree lookups.
1019 1019
1020 This arrangement is not a generic block layer characteristic however, so 1020 This arrangement is not a generic block layer characteristic however, so
1021 elevators may implement queues as they please. 1021 elevators may implement queues as they please.
1022 1022
1023 ii. Merge hash 1023 ii. Merge hash
1024 AS and deadline use a hash table indexed by the last sector of a request. This 1024 AS and deadline use a hash table indexed by the last sector of a request. This
1025 enables merging code to quickly look up "back merge" candidates, even when 1025 enables merging code to quickly look up "back merge" candidates, even when
1026 multiple I/O streams are being performed at once on one disk. 1026 multiple I/O streams are being performed at once on one disk.
1027 1027
1028 "Front merges", a new request being merged at the front of an existing request, 1028 "Front merges", a new request being merged at the front of an existing request,
1029 are far less common than "back merges" due to the nature of most I/O patterns. 1029 are far less common than "back merges" due to the nature of most I/O patterns.
1030 Front merges are handled by the binary trees in AS and deadline schedulers. 1030 Front merges are handled by the binary trees in AS and deadline schedulers.
1031 1031
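The back-merge lookup described above can be sketched in user space as
follows. This is only an illustration, not the AS or deadline code: struct
toy_request, struct merge_hash, the bucket count and the helper names are all
hypothetical. The sketch assumes the hash key is the sector just past the end
of a request, so that a bio starting at that sector is a back-merge candidate.

```c
#include <stddef.h>

#define MERGE_HASH_BUCKETS 64	/* hypothetical table size */

struct toy_request {
	unsigned long long sector;	/* first sector of the request */
	unsigned long nr_sectors;	/* length in sectors */
	struct toy_request *hash_next;	/* chain within one bucket */
};

struct merge_hash {
	struct toy_request *bucket[MERGE_HASH_BUCKETS];
};

/* Sector immediately following the request; used as the hash key. */
static unsigned long long rq_end_sector(const struct toy_request *rq)
{
	return rq->sector + rq->nr_sectors;
}

static size_t merge_hash_slot(unsigned long long sector)
{
	return (size_t)(sector % MERGE_HASH_BUCKETS);
}

/* Insert a request, keyed by where it ends. */
static void merge_hash_add(struct merge_hash *h, struct toy_request *rq)
{
	size_t slot = merge_hash_slot(rq_end_sector(rq));

	rq->hash_next = h->bucket[slot];
	h->bucket[slot] = rq;
}

/* Find a request that ends exactly where a new bio begins, or NULL. */
static struct toy_request *merge_hash_find_back(struct merge_hash *h,
						unsigned long long bio_sector)
{
	struct toy_request *rq;

	for (rq = h->bucket[merge_hash_slot(bio_sector)]; rq; rq = rq->hash_next)
		if (rq_end_sector(rq) == bio_sector)
			return rq;
	return NULL;
}
```

Because the key depends on where a request ends, a request that grows through
a back merge would have to be removed and re-added under its new end sector;
the sketch leaves that step out.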
1032 iii. Plugging the queue to batch requests in anticipation of opportunities for 1032 iii. Plugging the queue to batch requests in anticipation of opportunities for
1033 merge/sort optimizations 1033 merge/sort optimizations
1034 1034
1035 This is just the same as in 2.4 so far, though per-device unplugging 1035 This is just the same as in 2.4 so far, though per-device unplugging
1036 support is anticipated for 2.5. Also with a priority-based i/o scheduler, 1036 support is anticipated for 2.5. Also with a priority-based i/o scheduler,
1037 such decisions could be based on request priorities. 1037 such decisions could be based on request priorities.
1038 1038
1039 Plugging is an approach that the current i/o scheduling algorithm resorts to so 1039 Plugging is an approach that the current i/o scheduling algorithm resorts to so
1040 that it collects up enough requests in the queue to be able to take 1040 that it collects up enough requests in the queue to be able to take
1041 advantage of the sorting/merging logic in the elevator. If the 1041 advantage of the sorting/merging logic in the elevator. If the
1042 queue is empty when a request comes in, then it plugs the request queue 1042 queue is empty when a request comes in, then it plugs the request queue
1043 (sort of like plugging the bottom of a vessel to get fluid to build up) 1043 (sort of like plugging the bottom of a vessel to get fluid to build up)
1044 till it fills up with a few more requests, before starting to service 1044 till it fills up with a few more requests, before starting to service
1045 the requests. This provides an opportunity to merge/sort the requests before 1045 the requests. This provides an opportunity to merge/sort the requests before
1046 passing them down to the device. There are various conditions when the queue is 1046 passing them down to the device. There are various conditions when the queue is
1047 unplugged (to open up the flow again), either through a scheduled task or 1047 unplugged (to open up the flow again), either through a scheduled task or
1048 on demand. For example, wait_on_buffer sets the unplugging going 1048 on demand. For example, wait_on_buffer sets the unplugging going
1049 (by running tq_disk) so the read gets satisfied soon. So in the read case, 1049 (by running tq_disk) so the read gets satisfied soon. So in the read case,
1050 the queue gets explicitly unplugged as part of waiting for completion, 1050 the queue gets explicitly unplugged as part of waiting for completion,
1051 in fact all queues get unplugged as a side-effect. 1051 in fact all queues get unplugged as a side-effect.
1052 1052
1053 Aside: 1053 Aside:
1054 This is kind of controversial territory, as it's not clear if plugging is 1054 This is kind of controversial territory, as it's not clear if plugging is
1055 always the right thing to do. Devices typically have their own queues, 1055 always the right thing to do. Devices typically have their own queues,
1056 and allowing a big queue to build up in software, while letting the device be 1056 and allowing a big queue to build up in software, while letting the device be
1057 idle for a while may not always make sense. The trick is to handle the fine 1057 idle for a while may not always make sense. The trick is to handle the fine
1058 balance between when to plug and when to open up. Also now that we have 1058 balance between when to plug and when to open up. Also now that we have
1059 multi-page bios being queued in one shot, we may not need to wait to merge 1059 multi-page bios being queued in one shot, we may not need to wait to merge
1060 a big request from the broken up pieces coming by. 1060 a big request from the broken up pieces coming by.
1061 1061
1062 Per-queue granularity unplugging (still a Todo) may help reduce some of the 1062 Per-queue granularity unplugging (still a Todo) may help reduce some of the
1063 concerns with just a single tq_disk flush approach. Something like 1063 concerns with just a single tq_disk flush approach. Something like
1064 blk_kick_queue() to unplug a specific queue (right away ?) 1064 blk_kick_queue() to unplug a specific queue (right away ?)
1065 or optionally, all queues, is in the plan. 1065 or optionally, all queues, is in the plan.
1066 1066
1067 4.4 I/O contexts 1067 4.4 I/O contexts
1068 I/O contexts provide a dynamically allocated per process data area. They may 1068 I/O contexts provide a dynamically allocated per process data area. They may
1069 be used in I/O schedulers, and in the block layer (could be used for I/O statistics, 1069 be used in I/O schedulers, and in the block layer (could be used for I/O statistics,
1070 priorities for example). See *io_context in block/ll_rw_blk.c, and as-iosched.c 1070 priorities for example). See *io_context in block/ll_rw_blk.c, and as-iosched.c
1071 for an example of usage in an i/o scheduler. 1071 for an example of usage in an i/o scheduler.
1072 1072
1073 1073
1074 5. Scalability related changes 1074 5. Scalability related changes
1075 1075
1076 5.1 Granular Locking: io_request_lock replaced by a per-queue lock 1076 5.1 Granular Locking: io_request_lock replaced by a per-queue lock
1077 1077
1078 The global io_request_lock has been removed as of 2.5, to avoid 1078 The global io_request_lock has been removed as of 2.5, to avoid
1079 the scalability bottleneck it was causing, and has been replaced by more 1079 the scalability bottleneck it was causing, and has been replaced by more
1080 granular locking. The request queue structure has a pointer to the 1080 granular locking. The request queue structure has a pointer to the
1081 lock to be used for that queue. As a result, locking can now be 1081 lock to be used for that queue. As a result, locking can now be
1082 per-queue, with a provision for sharing a lock across queues if 1082 per-queue, with a provision for sharing a lock across queues if
1083 necessary (e.g. the scsi layer sets the queue lock pointers to the 1083 necessary (e.g. the scsi layer sets the queue lock pointers to the
1084 corresponding adapter lock, which results in a per host locking 1084 corresponding adapter lock, which results in a per host locking
1085 granularity). The locking semantics are the same, i.e. locking is 1085 granularity). The locking semantics are the same, i.e. locking is
1086 still imposed by the block layer, grabbing the lock before 1086 still imposed by the block layer, grabbing the lock before
1087 request_fn execution, which means that lots of older drivers 1087 request_fn execution, which means that lots of older drivers
1088 should still be SMP safe. Drivers are free to drop the queue 1088 should still be SMP safe. Drivers are free to drop the queue
1089 lock themselves, if required. Drivers that explicitly used the 1089 lock themselves, if required. Drivers that explicitly used the
1090 io_request_lock for serialization need to be modified accordingly. 1090 io_request_lock for serialization need to be modified accordingly.
1091 Usually it's as easy as adding a global lock: 1091 Usually it's as easy as adding a global lock:
1092 1092
1093 static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED; 1093 static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED;
1094 1094
1095 and passing the address of that lock to blk_init_queue(). 1095 and passing the address of that lock to blk_init_queue().
1096 1096
1097 5.2 64 bit sector numbers (sector_t prepares for 64 bit support) 1097 5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
1098 1098
1099 The sector number used in the bio structure has been changed to sector_t, 1099 The sector number used in the bio structure has been changed to sector_t,
1100 which could be defined as 64 bit in preparation for 64 bit sector support. 1100 which could be defined as 64 bit in preparation for 64 bit sector support.
1101 1101
1102 6. Other Changes/Implications 1102 6. Other Changes/Implications
1103 1103
1104 6.1 Partition re-mapping handled by the generic block layer 1104 6.1 Partition re-mapping handled by the generic block layer
1105 1105
1106 In 2.5 some of the gendisk/partition related code has been reorganized. 1106 In 2.5 some of the gendisk/partition related code has been reorganized.
1107 Now the generic block layer performs partition-remapping early and thus 1107 Now the generic block layer performs partition-remapping early and thus
1108 provides drivers with a sector number relative to whole device, rather than 1108 provides drivers with a sector number relative to whole device, rather than
1109 having to take partition number into account in order to arrive at the true 1109 having to take partition number into account in order to arrive at the true
1110 sector number. The routine blk_partition_remap() is invoked by 1110 sector number. The routine blk_partition_remap() is invoked by
1111 generic_make_request even before invoking the queue specific make_request_fn, 1111 generic_make_request even before invoking the queue specific make_request_fn,
1112 so the i/o scheduler also gets to operate on whole disk sector numbers. This 1112 so the i/o scheduler also gets to operate on whole disk sector numbers. This
1113 should typically not require changes to block drivers, it just never gets 1113 should typically not require changes to block drivers, it just never gets
1114 to invoke its own partition sector offset calculations since all bios 1114 to invoke its own partition sector offset calculations since all bios
1115 sent are offset from the beginning of the device. 1115 sent are offset from the beginning of the device.
1116 1116
1117 1117
1118 7. A Few Tips on Migration of older drivers 1118 7. A Few Tips on Migration of older drivers
1119 1119
1120 Old-style drivers that just use CURRENT and ignore clustered requests 1120 Old-style drivers that just use CURRENT and ignore clustered requests
1121 may not need much change. The generic layer will automatically handle 1121 may not need much change. The generic layer will automatically handle
1122 clustered requests, multi-page bios, etc for the driver. 1122 clustered requests, multi-page bios, etc for the driver.
1123 1123
1124 For a low performance driver or hardware that is PIO driven or just doesn't 1124 For a low performance driver or hardware that is PIO driven or just doesn't
1125 support scatter-gather, changes should be minimal too. 1125 support scatter-gather, changes should be minimal too.
1126 1126
1127 The following are some points to keep in mind when converting old drivers 1127 The following are some points to keep in mind when converting old drivers
1128 to bio. 1128 to bio.
1129 1129
1130 Drivers should use elv_next_request to pick up requests and are no longer 1130 Drivers should use elv_next_request to pick up requests and are no longer
1131 supposed to handle looping directly over the request list. 1131 supposed to handle looping directly over the request list.
1132 (struct request->queue has been removed) 1132 (struct request->queue has been removed)
1133 1133
1134 Now end_that_request_first takes an additional number_of_sectors argument. 1134 Now end_that_request_first takes an additional number_of_sectors argument.
1135 It used to handle always just the first buffer_head in a request, now 1135 It used to handle always just the first buffer_head in a request, now
1136 it will loop and handle as many sectors (on a bio-segment granularity) 1136 it will loop and handle as many sectors (on a bio-segment granularity)
1137 as specified. 1137 as specified.
1138 1138
1139 Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the 1139 Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
1140 right thing to use is bio_endio(bio, uptodate) instead. 1140 right thing to use is bio_endio(bio, uptodate) instead.
1141 1141
1142 If the driver is dropping the io_request_lock from its request_fn strategy, 1142 If the driver is dropping the io_request_lock from its request_fn strategy,
1143 then it just needs to replace that with q->queue_lock instead. 1143 then it just needs to replace that with q->queue_lock instead.
1144 1144
1145 As described in Sec 1.1, drivers can set max sector size, max segment size 1145 As described in Sec 1.1, drivers can set max sector size, max segment size
1146 etc per queue now. Drivers that used to define their own merge functions 1146 etc per queue now. Drivers that used to define their own merge functions
1147 to handle things like this can now just use the blk_queue_* functions at 1147 to handle things like this can now just use the blk_queue_* functions at
1148 blk_init_queue time. 1148 blk_init_queue time.
1149 1149
1150 Drivers no longer have to map a {partition, sector offset} into the 1150 Drivers no longer have to map a {partition, sector offset} into the
1151 correct absolute location anymore, this is done by the block layer, so 1151 correct absolute location anymore, this is done by the block layer, so
1152 where a driver received a request ala this before: 1152 where a driver received a request ala this before:
1153 1153
1154 rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */ 1154 rq->rq_dev = mk_kdev(3, 5); /* /dev/hda5 */
1155 rq->sector = 0; /* first sector on hda5 */ 1155 rq->sector = 0; /* first sector on hda5 */
1156 1156
1157 it will now see 1157 it will now see
1158 1158
1159 rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */ 1159 rq->rq_dev = mk_kdev(3, 0); /* /dev/hda */
1160 rq->sector = 123128; /* offset from start of disk */ 1160 rq->sector = 123128; /* offset from start of disk */
1161 1161
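The arithmetic behind the two snippets above can be sketched in user space as
follows. This is an illustration only and not the blk_partition_remap()
implementation: struct toy_partition, struct toy_io and toy_partition_remap()
are hypothetical names, and the partition table is assumed to be a plain array
indexed by minor number minus one.

```c
/* Hypothetical partition table entry: where the partition starts and its size. */
struct toy_partition {
	unsigned long long start_sect;
	unsigned long long nr_sects;
};

/* Hypothetical I/O descriptor: minor 0 is the whole disk, N > 0 is partition N. */
struct toy_io {
	unsigned int minor;
	unsigned long long sector;
};

/*
 * Rewrite a partition-relative sector into a whole-disk sector.
 * Returns 0 on success, -1 if the minor or the sector is out of range.
 */
static int toy_partition_remap(struct toy_io *io,
			       const struct toy_partition *parts,
			       unsigned int nr_parts)
{
	const struct toy_partition *p;

	if (io->minor == 0)
		return 0;		/* already relative to the whole disk */
	if (io->minor > nr_parts)
		return -1;

	p = &parts[io->minor - 1];
	if (io->sector >= p->nr_sects)
		return -1;		/* beyond the end of the partition */

	io->sector += p->start_sect;	/* offset from start of disk */
	io->minor = 0;			/* now addressed to the whole disk */
	return 0;
}
```

With a fifth partition that starts at sector 123128, a request for sector 0 of
that partition comes out as sector 123128 of the whole disk, which matches the
values in the example above.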
1162 As mentioned, there is no virtual mapping of a bio. For DMA, this is 1162 As mentioned, there is no virtual mapping of a bio. For DMA, this is
1163 not a problem as the driver probably never will need a virtual mapping. 1163 not a problem as the driver probably never will need a virtual mapping.
1164 Instead it needs a bus mapping (pci_map_page for a single segment or 1164 Instead it needs a bus mapping (pci_map_page for a single segment or
1165 use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For 1165 use blk_rq_map_sg for scatter gather) to be able to ship it to the driver. For
1166 PIO drivers (or drivers that need to revert to PIO transfer once in a 1166 PIO drivers (or drivers that need to revert to PIO transfer once in a
1167 while (IDE for example)), where the CPU is doing the actual data 1167 while (IDE for example)), where the CPU is doing the actual data
1168 transfer, a virtual mapping is needed. If the driver supports highmem I/O, 1168 transfer, a virtual mapping is needed. If the driver supports highmem I/O,
1169 (Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to 1169 (Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to
1170 temporarily map a bio into the virtual address space. 1170 temporarily map a bio into the virtual address space.
1171 1171
1172 1172
1173 8. Prior/Related/Impacted patches 1173 8. Prior/Related/Impacted patches
1174 1174
1175 8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp) 1175 8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
1176 - orig kiobuf & raw i/o patches (now in 2.4 tree) 1176 - orig kiobuf & raw i/o patches (now in 2.4 tree)
1177 - direct kiobuf based i/o to devices (no intermediate bh's) 1177 - direct kiobuf based i/o to devices (no intermediate bh's)
1178 - page i/o using kiobuf 1178 - page i/o using kiobuf
1179 - kiobuf splitting for lvm (mkp) 1179 - kiobuf splitting for lvm (mkp)
1180 - elevator support for kiobuf request merging (axboe) 1180 - elevator support for kiobuf request merging (axboe)
1181 8.2. Zero-copy networking (Dave Miller) 1181 8.2. Zero-copy networking (Dave Miller)
1182 8.3. SGI XFS - pagebuf patches - use of kiobufs 1182 8.3. SGI XFS - pagebuf patches - use of kiobufs
1183 8.4. Multi-page pioent patch for bio (Christoph Hellwig) 1183 8.4. Multi-page pioent patch for bio (Christoph Hellwig)
1184 8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11 1184 8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
1185 8.6. Async i/o implementation patch (Ben LaHaise) 1185 8.6. Async i/o implementation patch (Ben LaHaise)
1186 8.7. EVMS layering design (IBM EVMS team) 1186 8.7. EVMS layering design (IBM EVMS team)
1187 8.8. Larger page cache size patch (Ben LaHaise) and 1187 8.8. Larger page cache size patch (Ben LaHaise) and
1188 Large page size (Daniel Phillips) 1188 Large page size (Daniel Phillips)
1189 => larger contiguous physical memory buffers 1189 => larger contiguous physical memory buffers
1190 8.9. VM reservations patch (Ben LaHaise) 1190 8.9. VM reservations patch (Ben LaHaise)
1191 8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?) 1191 8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
1192 8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+ 1192 8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
1193 8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, 1193 8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
1194 Badari) 1194 Badari)
1195 8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven) 1195 8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
1196 8.14 IDE Taskfile i/o patch (Andre Hedrick) 1196 8.14 IDE Taskfile i/o patch (Andre Hedrick)
1197 8.15 Multi-page writeout and readahead patches (Andrew Morton) 1197 8.15 Multi-page writeout and readahead patches (Andrew Morton)
1198 8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy) 1198 8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)
1199 1199
1200 9. Other References: 1200 9. Other References:
1201 1201
1202 9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml, 1202 9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
1203 and Linus' comments - Jan 2001) 1203 and Linus' comments - Jan 2001)
1204 9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan 1204 9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
1205 et al - Feb-March 2001 (many of the initial thoughts that led to bio were 1205 et al - Feb-March 2001 (many of the initial thoughts that led to bio were
1206 brought up in this discussion thread) 1206 brought up in this discussion thread)
1207 9.3 Discussions on mempool on lkml - Dec 2001. 1207 9.3 Discussions on mempool on lkml - Dec 2001.
1208 1208
1209 1209
Documentation/driver-model/overview.txt
1 The Linux Kernel Device Model 1 The Linux Kernel Device Model
2 2
3 Patrick Mochel <mochel@digitalimplant.org> 3 Patrick Mochel <mochel@digitalimplant.org>
4 4
5 Drafted 26 August 2002 5 Drafted 26 August 2002
6 Updated 31 January 2006 6 Updated 31 January 2006
7 7
8 8
9 Overview 9 Overview
10 ~~~~~~~~ 10 ~~~~~~~~
11 11
12 The Linux Kernel Driver Model is a unification of all the disparate driver 12 The Linux Kernel Driver Model is a unification of all the disparate driver
13 models that were previously used in the kernel. It is intended to augment the 13 models that were previously used in the kernel. It is intended to augment the
14 bus-specific drivers for bridges and devices by consolidating a set of data 14 bus-specific drivers for bridges and devices by consolidating a set of data
15 and operations into globally accessible data structures. 15 and operations into globally accessible data structures.
16 16
17 Traditional driver models implemented some sort of tree-like structure 17 Traditional driver models implemented some sort of tree-like structure
18 (sometimes just a list) for the devices they control. There wasn't any 18 (sometimes just a list) for the devices they control. There wasn't any
19 uniformity across the different bus types. 19 uniformity across the different bus types.
20 20
21 The current driver model provides a common, uniform data model for describing 21 The current driver model provides a common, uniform data model for describing
22 a bus and the devices that can appear under the bus. The unified bus 22 a bus and the devices that can appear under the bus. The unified bus
23 model includes a set of common attributes which all busses carry, and a set 23 model includes a set of common attributes which all busses carry, and a set
24 of common callbacks, such as device discovery during bus probing, bus 24 of common callbacks, such as device discovery during bus probing, bus
25 shutdown, bus power management, etc. 25 shutdown, bus power management, etc.
26 26
27 The common device and bridge interface reflects the goals of the modern 27 The common device and bridge interface reflects the goals of the modern
28 computer: namely the ability to do seamless device "plug and play", power 28 computer: namely the ability to do seamless device "plug and play", power
29 management, and hot plug. In particular, the model dictated by Intel and 29 management, and hot plug. In particular, the model dictated by Intel and
30 Microsoft (namely ACPI) ensures that almost every device on almost any bus 30 Microsoft (namely ACPI) ensures that almost every device on almost any bus
31 on an x86-compatible system can work within this paradigm. Of course, 31 on an x86-compatible system can work within this paradigm. Of course,
32 not every bus is able to support all such operations, although most 32 not every bus is able to support all such operations, although most
33 buses support most of those operations. 33 buses support most of those operations.
34 34
35 35
36 Downstream Access 36 Downstream Access
37 ~~~~~~~~~~~~~~~~~ 37 ~~~~~~~~~~~~~~~~~
38 38
39 Common data fields have been moved out of individual bus layers into a common 39 Common data fields have been moved out of individual bus layers into a common
40 data structure. These fields must still be accessed by the bus layers, 40 data structure. These fields must still be accessed by the bus layers,
41 and sometimes by the device-specific drivers. 41 and sometimes by the device-specific drivers.
42 42
43 Other bus layers are encouraged to do what has been done for the PCI layer. 43 Other bus layers are encouraged to do what has been done for the PCI layer.
44 struct pci_dev now looks like this: 44 struct pci_dev now looks like this:
45 45
46 struct pci_dev { 46 struct pci_dev {
47 ... 47 ...
48 48
49 struct device dev; 49 struct device dev;
50 }; 50 };
51 51
52 Note first that it is statically allocated. This means only one allocation on 52 Note first that it is statically allocated. This means only one allocation on
53 device discovery. Note also that it is at the _end_ of struct pci_dev. This is 53 device discovery. Note also that it is at the _end_ of struct pci_dev. This is
54 to make people think about what they're doing when switching between the bus 54 to make people think about what they're doing when switching between the bus
55 driver and the global driver; and to guard against mindless casts between 55 driver and the global driver; and to guard against mindless casts between
56 the two. 56 the two.
57 57
58 The PCI bus layer freely accesses the fields of struct device. It knows about 58 The PCI bus layer freely accesses the fields of struct device. It knows about
59 the structure of struct pci_dev, and it should know the structure of struct 59 the structure of struct pci_dev, and it should know the structure of struct
60 device. Individual PCI device drivers that have been converted the the current 60 device. Individual PCI device drivers that have been converted to the current
61 driver model generally do not and should not touch the fields of struct device, 61 driver model generally do not and should not touch the fields of struct device,
62 unless there is a strong compelling reason to do so. 62 unless there is a strong compelling reason to do so.
63 63
64 This abstraction prevents unnecessary pain during transitional phases. 64 This abstraction prevents unnecessary pain during transitional phases.
65 If the name of the field changes or is removed, then every downstream driver 65 If the name of the field changes or is removed, then every downstream driver
66 will break. On the other hand, if only the bus layer (and not the device 66 will break. On the other hand, if only the bus layer (and not the device
67 layer) accesses struct device, it is only that layer that needs to change. 67 layer) accesses struct device, it is only that layer that needs to change.
68 68
69 69
70 User Interface 70 User Interface
71 ~~~~~~~~~~~~~~ 71 ~~~~~~~~~~~~~~
72 72
73 By virtue of having a complete hierarchical view of all the devices in the 73 By virtue of having a complete hierarchical view of all the devices in the
74 system, exporting a complete hierarchical view to userspace becomes relatively 74 system, exporting a complete hierarchical view to userspace becomes relatively
75 easy. This has been accomplished by implementing a special purpose virtual 75 easy. This has been accomplished by implementing a special purpose virtual
76 file system named sysfs. It is hence possible for the user to mount the 76 file system named sysfs. It is hence possible for the user to mount the
77 whole sysfs filesystem anywhere in userspace. 77 whole sysfs filesystem anywhere in userspace.
78 78
79 This can be done permanently by providing the following entry into the 79 This can be done permanently by providing the following entry into the
80 /etc/fstab (under the provision that the mount point does exist, of course): 80 /etc/fstab (under the provision that the mount point does exist, of course):
81 81
82 none /sys sysfs defaults 0 0 82 none /sys sysfs defaults 0 0
83 83
84 Or by hand on the command line: 84 Or by hand on the command line:
85 85
86 # mount -t sysfs sysfs /sys 86 # mount -t sysfs sysfs /sys
87 87
88 Whenever a device is inserted into the tree, a directory is created for it. 88 Whenever a device is inserted into the tree, a directory is created for it.
89 This directory may be populated at each layer of discovery - the global layer, 89 This directory may be populated at each layer of discovery - the global layer,
90 the bus layer, or the device layer. 90 the bus layer, or the device layer.
91 91
92 The global layer currently creates two files - 'name' and 'power'. The 92 The global layer currently creates two files - 'name' and 'power'. The
93 former only reports the name of the device. The latter reports the 93 former only reports the name of the device. The latter reports the
94 current power state of the device. It will also be used to set the current 94 current power state of the device. It will also be used to set the current
95 power state. 95 power state.
96 96
97 The bus layer may also create files for the devices it finds while probing the 97 The bus layer may also create files for the devices it finds while probing the
98 bus. For example, the PCI layer currently creates 'irq' and 'resource' files 98 bus. For example, the PCI layer currently creates 'irq' and 'resource' files
99 for each PCI device. 99 for each PCI device.
100 100
101 A device-specific driver may also export files in its directory to expose 101 A device-specific driver may also export files in its directory to expose
102 device-specific data or tunable interfaces. 102 device-specific data or tunable interfaces.
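
Since every exported file is an ordinary text file from userspace's point of
view, reading one needs nothing beyond standard I/O. The sketch below is an
illustration only: read_attr is a hypothetical helper, and the path is left
to the caller because the directory layout depends on the system.

```c
#include <stdio.h>
#include <string.h>

/* Read the first line of an attribute file into buf.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_attr(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip the trailing newline */
	return 0;
}
```

A caller would pass the full path of, for example, a device's 'name' file
under the sysfs mount point and a buffer large enough for one line.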
103 103
104 More information about the sysfs directory layout can be found in 104 More information about the sysfs directory layout can be found in
105 the other documents in this directory and in the file 105 the other documents in this directory and in the file
106 Documentation/filesystems/sysfs.txt. 106 Documentation/filesystems/sysfs.txt.
107 107
108 108
Documentation/exception.txt
1 Kernel level exception handling in Linux 2.1.8 1 Kernel level exception handling in Linux 2.1.8
2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com> 2 Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
3 3
4 When a process runs in kernel mode, it often has to access user 4 When a process runs in kernel mode, it often has to access user
5 mode memory whose address has been passed by an untrusted program. 5 mode memory whose address has been passed by an untrusted program.
6 To protect itself the kernel has to verify this address. 6 To protect itself the kernel has to verify this address.
7 7
8 In older versions of Linux this was done with the 8 In older versions of Linux this was done with the
9 int verify_area(int type, const void * addr, unsigned long size) 9 int verify_area(int type, const void * addr, unsigned long size)
10 function (which has since been replaced by access_ok()). 10 function (which has since been replaced by access_ok()).
11 11
12 This function verified that the memory area starting at address 12 This function verified that the memory area starting at address
13 addr and of size size was accessible for the operation specified 13 'addr' and of size 'size' was accessible for the operation specified
14 in type (read or write). To do this, verify_area had to look up the 14 in type (read or write). To do this, verify_area had to look up the
15 virtual memory area (vma) that contained the address addr. In the 15 virtual memory area (vma) that contained the address addr. In the
16 normal case (correctly working program), this test was successful. 16 normal case (correctly working program), this test was successful.
17 It only failed for a few buggy programs. In some kernel profiling 17 It only failed for a few buggy programs. In some kernel profiling
18 tests, this normally unneeded verification used up a considerable 18 tests, this normally unneeded verification used up a considerable
19 amount of time. 19 amount of time.
20 20
21 To overcome this situation, Linus decided to let the virtual memory 21 To overcome this situation, Linus decided to let the virtual memory
22 hardware present in every Linux-capable CPU handle this test. 22 hardware present in every Linux-capable CPU handle this test.
23 23
24 How does this work? 24 How does this work?
25 25
26 Whenever the kernel tries to access an address that is currently not 26 Whenever the kernel tries to access an address that is currently not
27 accessible, the CPU generates a page fault exception and calls the 27 accessible, the CPU generates a page fault exception and calls the
28 page fault handler 28 page fault handler
29 29
30 void do_page_fault(struct pt_regs *regs, unsigned long error_code) 30 void do_page_fault(struct pt_regs *regs, unsigned long error_code)
31 31
32 in arch/i386/mm/fault.c. The parameters on the stack are set up by 32 in arch/i386/mm/fault.c. The parameters on the stack are set up by
33 the low level assembly glue in arch/i386/kernel/entry.S. The parameter 33 the low level assembly glue in arch/i386/kernel/entry.S. The parameter
34 regs is a pointer to the saved registers on the stack, error_code 34 regs is a pointer to the saved registers on the stack, error_code
35 contains a reason code for the exception. 35 contains a reason code for the exception.
36 36
37 do_page_fault first obtains the inaccessible address from the CPU 37 do_page_fault first obtains the inaccessible address from the CPU
38 control register CR2. If the address is within the virtual address 38 control register CR2. If the address is within the virtual address
39 space of the process, the fault probably occurred because the page 39 space of the process, the fault probably occurred because the page
40 was not swapped in, was write protected or something similar. However, 40 was not swapped in, was write protected or something similar. However,
41 we are interested in the other case: the address is not valid, there 41 we are interested in the other case: the address is not valid, there
42 is no vma that contains this address. In this case, the kernel jumps 42 is no vma that contains this address. In this case, the kernel jumps
43 to the bad_area label. 43 to the bad_area label.
44 44
45 There it uses the address of the instruction that caused the exception 45 There it uses the address of the instruction that caused the exception
46 (i.e. regs->eip) to find an address where the execution can continue 46 (i.e. regs->eip) to find an address where the execution can continue
47 (fixup). If this search is successful, the fault handler modifies the 47 (fixup). If this search is successful, the fault handler modifies the
48 return address (again regs->eip) and returns. The execution will 48 return address (again regs->eip) and returns. The execution will
49 continue at the address in fixup. 49 continue at the address in fixup.
50 50
51 Where does fixup point to? 51 Where does fixup point to?
52 52
53 Since we jump to the contents of fixup, fixup obviously points 53 Since we jump to the contents of fixup, fixup obviously points
54 to executable code. This code is hidden inside the user access macros. 54 to executable code. This code is hidden inside the user access macros.
55 I have picked the get_user macro defined in include/asm/uaccess.h as an 55 I have picked the get_user macro defined in include/asm/uaccess.h as an
56 example. The definition is somewhat hard to follow, so let's peek at 56 example. The definition is somewhat hard to follow, so let's peek at
57 the code generated by the preprocessor and the compiler. I selected 57 the code generated by the preprocessor and the compiler. I selected
58 the get_user call in drivers/char/console.c for a detailed examination. 58 the get_user call in drivers/char/console.c for a detailed examination.
59 59
60 The original code in console.c line 1405: 60 The original code in console.c line 1405:
61 get_user(c, buf); 61 get_user(c, buf);
62 62
63 The preprocessor output (edited to become somewhat readable): 63 The preprocessor output (edited to become somewhat readable):
64 64
65 ( 65 (
66 { 66 {
67 long __gu_err = - 14 , __gu_val = 0; 67 long __gu_err = - 14 , __gu_val = 0;
68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf)); 68 const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) || 69 if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
70 (((sizeof(*(buf))) <= 0xC0000000UL) && 70 (((sizeof(*(buf))) <= 0xC0000000UL) &&
71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf))))))) 71 ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
72 do { 72 do {
73 __gu_err = 0; 73 __gu_err = 0;
74 switch ((sizeof(*(buf)))) { 74 switch ((sizeof(*(buf)))) {
75 case 1: 75 case 1:
76 __asm__ __volatile__( 76 __asm__ __volatile__(
77 "1: mov" "b" " %2,%" "b" "1\n" 77 "1: mov" "b" " %2,%" "b" "1\n"
78 "2:\n" 78 "2:\n"
79 ".section .fixup,\"ax\"\n" 79 ".section .fixup,\"ax\"\n"
80 "3: movl %3,%0\n" 80 "3: movl %3,%0\n"
81 " xor" "b" " %" "b" "1,%" "b" "1\n" 81 " xor" "b" " %" "b" "1,%" "b" "1\n"
82 " jmp 2b\n" 82 " jmp 2b\n"
83 ".section __ex_table,\"a\"\n" 83 ".section __ex_table,\"a\"\n"
84 " .align 4\n" 84 " .align 4\n"
85 " .long 1b,3b\n" 85 " .long 1b,3b\n"
86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *) 86 ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ; 87 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
88 break; 88 break;
89 case 2: 89 case 2:
90 __asm__ __volatile__( 90 __asm__ __volatile__(
91 "1: mov" "w" " %2,%" "w" "1\n" 91 "1: mov" "w" " %2,%" "w" "1\n"
92 "2:\n" 92 "2:\n"
93 ".section .fixup,\"ax\"\n" 93 ".section .fixup,\"ax\"\n"
94 "3: movl %3,%0\n" 94 "3: movl %3,%0\n"
95 " xor" "w" " %" "w" "1,%" "w" "1\n" 95 " xor" "w" " %" "w" "1,%" "w" "1\n"
96 " jmp 2b\n" 96 " jmp 2b\n"
97 ".section __ex_table,\"a\"\n" 97 ".section __ex_table,\"a\"\n"
98 " .align 4\n" 98 " .align 4\n"
99 " .long 1b,3b\n" 99 " .long 1b,3b\n"
100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 100 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )); 101 ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
102 break; 102 break;
103 case 4: 103 case 4:
104 __asm__ __volatile__( 104 __asm__ __volatile__(
105 "1: mov" "l" " %2,%" "" "1\n" 105 "1: mov" "l" " %2,%" "" "1\n"
106 "2:\n" 106 "2:\n"
107 ".section .fixup,\"ax\"\n" 107 ".section .fixup,\"ax\"\n"
108 "3: movl %3,%0\n" 108 "3: movl %3,%0\n"
109 " xor" "l" " %" "" "1,%" "" "1\n" 109 " xor" "l" " %" "" "1,%" "" "1\n"
110 " jmp 2b\n" 110 " jmp 2b\n"
111 ".section __ex_table,\"a\"\n" 111 ".section __ex_table,\"a\"\n"
112 " .align 4\n" " .long 1b,3b\n" 112 " .align 4\n" " .long 1b,3b\n"
113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *) 113 ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err)); 114 ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
115 break; 115 break;
116 default: 116 default:
117 (__gu_val) = __get_user_bad(); 117 (__gu_val) = __get_user_bad();
118 } 118 }
119 } while (0) ; 119 } while (0) ;
120 ((c)) = (__typeof__(*((buf))))__gu_val; 120 ((c)) = (__typeof__(*((buf))))__gu_val;
121 __gu_err; 121 __gu_err;
122 } 122 }
123 ); 123 );
124 124
125 WOW! Black GCC/assembly magic. This is impossible to follow, so let's 125 WOW! Black GCC/assembly magic. This is impossible to follow, so let's
126 see what code gcc generates: 126 see what code gcc generates:
127 127
128 > xorl %edx,%edx 128 > xorl %edx,%edx
129 > movl current_set,%eax 129 > movl current_set,%eax
130 > cmpl $24,788(%eax) 130 > cmpl $24,788(%eax)
131 > je .L1424 131 > je .L1424
132 > cmpl $-1073741825,64(%esp) 132 > cmpl $-1073741825,64(%esp)
133 > ja .L1423 133 > ja .L1423
134 > .L1424: 134 > .L1424:
135 > movl %edx,%eax 135 > movl %edx,%eax
136 > movl 64(%esp),%ebx 136 > movl 64(%esp),%ebx
137 > #APP 137 > #APP
138 > 1: movb (%ebx),%dl /* this is the actual user access */ 138 > 1: movb (%ebx),%dl /* this is the actual user access */
139 > 2: 139 > 2:
140 > .section .fixup,"ax" 140 > .section .fixup,"ax"
141 > 3: movl $-14,%eax 141 > 3: movl $-14,%eax
142 > xorb %dl,%dl 142 > xorb %dl,%dl
143 > jmp 2b 143 > jmp 2b
144 > .section __ex_table,"a" 144 > .section __ex_table,"a"
145 > .align 4 145 > .align 4
146 > .long 1b,3b 146 > .long 1b,3b
147 > .text 147 > .text
148 > #NO_APP 148 > #NO_APP
149 > .L1423: 149 > .L1423:
150 > movzbl %dl,%esi 150 > movzbl %dl,%esi
151 151
152 The optimizer does a good job and gives us something we can actually 152 The optimizer does a good job and gives us something we can actually
153 understand. Can we? The actual user access is quite obvious. Thanks 153 understand. Can we? The actual user access is quite obvious. Thanks
154 to the unified address space we can just access the address in user 154 to the unified address space we can just access the address in user
155 memory. But what does the .section stuff do????? 155 memory. But what does the .section stuff do?????
156 156
157 To understand this we have to look at the final kernel: 157 To understand this we have to look at the final kernel:
158 158
159 > objdump --section-headers vmlinux 159 > objdump --section-headers vmlinux
160 > 160 >
161 > vmlinux: file format elf32-i386 161 > vmlinux: file format elf32-i386
162 > 162 >
163 > Sections: 163 > Sections:
164 > Idx Name Size VMA LMA File off Algn 164 > Idx Name Size VMA LMA File off Algn
165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4 165 > 0 .text 00098f40 c0100000 c0100000 00001000 2**4
166 > CONTENTS, ALLOC, LOAD, READONLY, CODE 166 > CONTENTS, ALLOC, LOAD, READONLY, CODE
167 > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0 167 > 1 .fixup 000016bc c0198f40 c0198f40 00099f40 2**0
168 > CONTENTS, ALLOC, LOAD, READONLY, CODE 168 > CONTENTS, ALLOC, LOAD, READONLY, CODE
169 > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2 169 > 2 .rodata 0000f127 c019a5fc c019a5fc 0009b5fc 2**2
170 > CONTENTS, ALLOC, LOAD, READONLY, DATA 170 > CONTENTS, ALLOC, LOAD, READONLY, DATA
171 > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2 171 > 3 __ex_table 000015c0 c01a9724 c01a9724 000aa724 2**2
172 > CONTENTS, ALLOC, LOAD, READONLY, DATA 172 > CONTENTS, ALLOC, LOAD, READONLY, DATA
173 > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4 173 > 4 .data 0000ea58 c01abcf0 c01abcf0 000abcf0 2**4
174 > CONTENTS, ALLOC, LOAD, DATA 174 > CONTENTS, ALLOC, LOAD, DATA
175 > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2 175 > 5 .bss 00018e21 c01ba748 c01ba748 000ba748 2**2
176 > ALLOC 176 > ALLOC
177 > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0 177 > 6 .comment 00000ec4 00000000 00000000 000ba748 2**0
178 > CONTENTS, READONLY 178 > CONTENTS, READONLY
179 > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0 179 > 7 .note 00001068 00000ec4 00000ec4 000bb60c 2**0
180 > CONTENTS, READONLY 180 > CONTENTS, READONLY
181 181
182 There are obviously two non-standard ELF sections in the generated object 182 There are obviously two non-standard ELF sections in the generated object
183 file. But first we want to find out what happened to our code in the 183 file. But first we want to find out what happened to our code in the
184 final kernel executable: 184 final kernel executable:
185 185
186 > objdump --disassemble --section=.text vmlinux 186 > objdump --disassemble --section=.text vmlinux
187 > 187 >
188 > c017e785 <do_con_write+c1> xorl %edx,%edx 188 > c017e785 <do_con_write+c1> xorl %edx,%edx
189 > c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax 189 > c017e787 <do_con_write+c3> movl 0xc01c7bec,%eax
190 > c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax) 190 > c017e78c <do_con_write+c8> cmpl $0x18,0x314(%eax)
191 > c017e793 <do_con_write+cf> je c017e79f <do_con_write+db> 191 > c017e793 <do_con_write+cf> je c017e79f <do_con_write+db>
192 > c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1) 192 > c017e795 <do_con_write+d1> cmpl $0xbfffffff,0x40(%esp,1)
193 > c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3> 193 > c017e79d <do_con_write+d9> ja c017e7a7 <do_con_write+e3>
194 > c017e79f <do_con_write+db> movl %edx,%eax 194 > c017e79f <do_con_write+db> movl %edx,%eax
195 > c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx 195 > c017e7a1 <do_con_write+dd> movl 0x40(%esp,1),%ebx
196 > c017e7a5 <do_con_write+e1> movb (%ebx),%dl 196 > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
197 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi 197 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
198 198
199 The whole user memory access is reduced to 10 x86 machine instructions. 199 The whole user memory access is reduced to 10 x86 machine instructions.
200 The instructions bracketed in the .section directives are no longer 200 The instructions bracketed in the .section directives are no longer
201 in the normal execution path. They are located in a different section 201 in the normal execution path. They are located in a different section
202 of the executable file: 202 of the executable file:
203 203
204 > objdump --disassemble --section=.fixup vmlinux 204 > objdump --disassemble --section=.fixup vmlinux
205 > 205 >
206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax 206 > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
207 > c0199ffa <.fixup+10ba> xorb %dl,%dl 207 > c0199ffa <.fixup+10ba> xorb %dl,%dl
208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3> 208 > c0199ffc <.fixup+10bc> jmp c017e7a7 <do_con_write+e3>
209 209
210 And finally: 210 And finally:
211 > objdump --full-contents --section=__ex_table vmlinux 211 > objdump --full-contents --section=__ex_table vmlinux
212 > 212 >
213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................ 213 > c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0 ................
214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................ 214 > c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0 ................
215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................ 215 > c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0 ................
216 216
217 or in human readable byte order: 217 or in human readable byte order:
218 218
219 > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................ 219 > c01aa7c4 c017c093 c0199fe0 c017c097 c017c099 ................
220 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ 220 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
221 ^^^^^^^^^^^^^^^^^ 221 ^^^^^^^^^^^^^^^^^
222 this is the interesting part! 222 this is the interesting part!
223 > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................ 223 > c01aa7e4 c0180a08 c019a001 c0180a0a c019a004 ................
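
The step from the raw dump to the "human readable" one is a plain
little-endian conversion. As a small sketch (le32 and le32_demo are names
chosen here, not kernel functions), the two interesting words can be
reassembled from their bytes like this:

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant first,
 * which is how an i386 image stores each .long. */
static uint32_t le32(const unsigned char b[4])
{
	return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
	       (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

static void le32_demo(void)
{
	/* Third and fourth words of the row at c01aa7d4 in the raw dump. */
	const unsigned char insn[4]  = { 0xa5, 0xe7, 0x17, 0xc0 };
	const unsigned char fixup[4] = { 0xf5, 0x9f, 0x19, 0xc0 };

	assert(le32(insn) == 0xc017e7a5u);
	assert(le32(fixup) == 0xc0199ff5u);
}
```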
224 224
225 What happened? The assembly directives 225 What happened? The assembly directives
226 226
227 .section .fixup,"ax" 227 .section .fixup,"ax"
228 .section __ex_table,"a" 228 .section __ex_table,"a"
229 229
230 told the assembler to move the following code to the specified 230 told the assembler to move the following code to the specified
231 sections in the ELF object file. So the instructions 231 sections in the ELF object file. So the instructions
232 3: movl $-14,%eax 232 3: movl $-14,%eax
233 xorb %dl,%dl 233 xorb %dl,%dl
234 jmp 2b 234 jmp 2b
235 ended up in the .fixup section of the object file and the addresses 235 ended up in the .fixup section of the object file and the addresses
236 .long 1b,3b 236 .long 1b,3b
237 ended up in the __ex_table section of the object file. 1b and 3b 237 ended up in the __ex_table section of the object file. 1b and 3b
238 are local labels. The local label 1b (1b stands for next label 1 238 are local labels. The local label 1b (1b stands for next label 1
239 backward) is the address of the instruction that might fault, i.e. 239 backward) is the address of the instruction that might fault, i.e.
240 in our case the address of the label 1 is c017e7a5: 240 in our case the address of the label 1 is c017e7a5:
241 the original assembly code: > 1: movb (%ebx),%dl 241 the original assembly code: > 1: movb (%ebx),%dl
242 and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl 242 and linked in vmlinux : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
243 243
244 The local label 3 (backwards again) is the address of the code to handle 244 The local label 3 (backwards again) is the address of the code to handle
245 the fault, in our case the actual value is c0199ff5: 245 the fault, in our case the actual value is c0199ff5:
246 the original assembly code: > 3: movl $-14,%eax 246 the original assembly code: > 3: movl $-14,%eax
247 and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax 247 and linked in vmlinux : > c0199ff5 <.fixup+10b5> movl $0xfffffff2,%eax
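
The compiler output and the objdump listing print the same constants in
different notations. A short sketch (assuming 32-bit two's complement
values, as on i386) that checks the correspondence:

```c
#include <assert.h>
#include <stdint.h>

static void twos_complement_demo(void)
{
	/* -EFAULT as written in the assembly source and as shown by objdump. */
	assert((uint32_t)-14 == 0xfffffff2u);
	/* Comparison constant: decimal in gcc's output, hex in objdump. */
	assert((uint32_t)-1073741825 == 0xbfffffffu);
	/* For the one-byte access, the limit in the macro minus the size. */
	assert(0xbfffffffu == 0xC0000000u - 1u);
}
```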
248 248
249 The assembly code 249 The assembly code
250 > .section __ex_table,"a" 250 > .section __ex_table,"a"
251 > .align 4 251 > .align 4
252 > .long 1b,3b 252 > .long 1b,3b
253 253
254 becomes the value pair 254 becomes the value pair
255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................ 255 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5 ................
256 ^this is ^this is 256 ^this is ^this is
257 1b 3b 257 1b 3b
258 c017e7a5,c0199ff5 in the exception table of the kernel. 258 c017e7a5,c0199ff5 in the exception table of the kernel.
259 259
260 So, what actually happens if a fault from kernel mode with no suitable 260 So, what actually happens if a fault from kernel mode with no suitable
261 vma occurs? 261 vma occurs?
262 262
263 1.) access to invalid address: 263 1.) access to invalid address:
264 > c017e7a5 <do_con_write+e1> movb (%ebx),%dl 264 > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
265 2.) MMU generates exception 265 2.) MMU generates exception
266 3.) CPU calls do_page_fault 266 3.) CPU calls do_page_fault
267 4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5); 267 4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
268 5.) search_exception_table looks up the address c017e7a5 in the 268 5.) search_exception_table looks up the address c017e7a5 in the
269 exception table (i.e. the contents of the ELF section __ex_table) 269 exception table (i.e. the contents of the ELF section __ex_table)
270 and returns the address of the associated fault handling code c0199ff5. 270 and returns the address of the associated fault handling code c0199ff5.
271 6.) do_page_fault modifies its own return address to point to the fault 271 6.) do_page_fault modifies its own return address to point to the fault
272 handling code and returns. 272 handling code and returns.
273 7.) execution continues in the fault handling code. 273 7.) execution continues in the fault handling code.
274 8.) 8a) EAX becomes -EFAULT (== -14) 274 8.) 8a) EAX becomes -EFAULT (== -14)
275 8b) DL becomes zero (the value we "read" from user space) 275 8b) DL becomes zero (the value we "read" from user space)
276 8c) execution continues at local label 2 (address of the 276 8c) execution continues at local label 2 (address of the
277 instruction immediately after the faulting user access). 277 instruction immediately after the faulting user access).
278 278
279 The steps 8a to 8c in a certain way emulate the faulting instruction. 279 The steps 8a to 8c in a certain way emulate the faulting instruction.
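
Steps 4 and 5 can be modelled in userspace as follows. This is only a
sketch: the table holds the six pairs visible in the dump above, the names
ex_entry and search_sketch are invented here, and the use of a binary search
is an assumption (it relies on the table being sorted by instruction
address), not a statement about the kernel's actual search_exception_table.

```c
#include <assert.h>
#include <stddef.h>

struct ex_entry {
	unsigned long insn;	/* address of the instruction that may fault */
	unsigned long fixup;	/* address to continue at if it does */
};

/* The six pairs visible in the __ex_table dump, already sorted by insn. */
static const struct ex_entry ex_table[] = {
	{ 0xc017c093UL, 0xc0199fe0UL },
	{ 0xc017c097UL, 0xc017c099UL },
	{ 0xc017c2f6UL, 0xc0199fe9UL },
	{ 0xc017e7a5UL, 0xc0199ff5UL },
	{ 0xc0180a08UL, 0xc019a001UL },
	{ 0xc0180a0aUL, 0xc019a004UL },
};

/* Binary search over the sorted table; returns the fixup address, or 0
 * when the faulting address has no entry. */
static unsigned long search_sketch(unsigned long addr)
{
	size_t lo = 0, hi = sizeof(ex_table) / sizeof(ex_table[0]);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ex_table[mid].insn == addr)
			return ex_table[mid].fixup;
		if (ex_table[mid].insn < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

static void search_demo(void)
{
	/* The faulting movb of our get_user maps to its fixup code. */
	assert(search_sketch(0xc017e7a5UL) == 0xc0199ff5UL);
	/* An address without an entry yields no fixup. */
	assert(search_sketch(0xc017e7a7UL) == 0);
}
```

A return value of 0 stands for the case in which no fixup exists and the
fault cannot be recovered this way.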
280 280
281 That's it, mostly. If you look at our example, you might ask why 281 That's it, mostly. If you look at our example, you might ask why
282 we set EAX to -EFAULT in the exception handler code. Well, the 282 we set EAX to -EFAULT in the exception handler code. Well, the
283 get_user macro actually returns a value: 0, if the user access was 283 get_user macro actually returns a value: 0, if the user access was
284 successful, -EFAULT on failure. Our original code did not test this 284 successful, -EFAULT on failure. Our original code did not test this
285 return value, however the inline assembly code in get_user tries to 285 return value, however the inline assembly code in get_user tries to
286 return -EFAULT. GCC selected EAX to return this value. 286 return -EFAULT. GCC selected EAX to return this value.
287 287
288 NOTE: 288 NOTE:
289 Due to the way that the exception table is built and needs to be ordered, 289 Due to the way that the exception table is built and needs to be ordered,
290 only use exceptions for code in the .text section. Any other section 290 only use exceptions for code in the .text section. Any other section
291 will cause the exception table to not be sorted correctly, and the 291 will cause the exception table to not be sorted correctly, and the
292 exceptions will fail. 292 exceptions will fail.
293 293
Documentation/fb/fbcon.txt
1 The Framebuffer Console 1 The Framebuffer Console
2 ======================= 2 =======================
3 3
4 The framebuffer console (fbcon), as its name implies, is a text 4 The framebuffer console (fbcon), as its name implies, is a text
5 console running on top of the framebuffer device. It has the functionality of 5 console running on top of the framebuffer device. It has the functionality of
6 any standard text console driver, such as the VGA console, with the added 6 any standard text console driver, such as the VGA console, with the added
7 features that can be attributed to the graphical nature of the framebuffer. 7 features that can be attributed to the graphical nature of the framebuffer.
8 8
9 In the x86 architecture, the framebuffer console is optional, and 9 In the x86 architecture, the framebuffer console is optional, and
10 some even treat it as a toy. For other architectures, it is the only available 10 some even treat it as a toy. For other architectures, it is the only available
11 display device, text or graphical. 11 display device, text or graphical.
12 12
13 What are the features of fbcon? The framebuffer console supports 13 What are the features of fbcon? The framebuffer console supports
14 high resolutions, varying font types, display rotation, primitive multihead, 14 high resolutions, varying font types, display rotation, primitive multihead,
15 etc. Theoretically, multi-colored fonts, blending, aliasing, and any feature 15 etc. Theoretically, multi-colored fonts, blending, aliasing, and any feature
16 made available by the underlying graphics card are also possible. 16 made available by the underlying graphics card are also possible.
17 17
18 A. Configuration 18 A. Configuration
19 19
20 The framebuffer console can be enabled by using your favorite kernel 20 The framebuffer console can be enabled by using your favorite kernel
21 configuration tool. It is under Device Drivers->Graphics Support->Support for 21 configuration tool. It is under Device Drivers->Graphics Support->Support for
22 framebuffer devices->Framebuffer Console Support. Select 'y' to compile 22 framebuffer devices->Framebuffer Console Support. Select 'y' to compile
23 support statically, or 'm' for module support. The module will be fbcon. 23 support statically, or 'm' for module support. The module will be fbcon.
24 24
25 In order for fbcon to activate, at least one framebuffer driver is 25 In order for fbcon to activate, at least one framebuffer driver is
26 required, so choose from any of the numerous drivers available. x86 26 required, so choose from any of the numerous drivers available. x86
27 systems almost universally have VGA cards, so vga16fb and vesafb will 27 systems almost universally have VGA cards, so vga16fb and vesafb will
28 always be available. However, using a chipset-specific driver will give you 28 always be available. However, using a chipset-specific driver will give you
29 more speed and features, such as the ability to change the video mode 29 more speed and features, such as the ability to change the video mode
30 dynamically. 30 dynamically.
31 31
32 To display the penguin logo, choose any logo available in Logo 32 To display the penguin logo, choose any logo available in Logo
33 Configuration->Boot up logo. 33 Configuration->Boot up logo.
34 34
35 Also, you will need to select at least one compiled-in font, but if 35 Also, you will need to select at least one compiled-in font, but if
36 you don't do anything, the kernel configuration tool will select one for you, 36 you don't do anything, the kernel configuration tool will select one for you,
37 usually an 8x16 font. 37 usually an 8x16 font.
38 38
39 GOTCHA: A common bug report is enabling the framebuffer without enabling the 39 GOTCHA: A common bug report is enabling the framebuffer without enabling the
40 framebuffer console. Depending on the driver, you may get a blanked or 40 framebuffer console. Depending on the driver, you may get a blanked or
41 garbled display, but the system still boots to completion. If you are 41 garbled display, but the system still boots to completion. If you are
42 fortunate to have a driver that does not alter the graphics chip, then you 42 fortunate to have a driver that does not alter the graphics chip, then you
43 will still get a VGA console. 43 will still get a VGA console.
44 44
45 B. Loading 45 B. Loading
46 46
47 Possible scenarios: 47 Possible scenarios:
48 48
49 1. Driver and fbcon are compiled statically 49 1. Driver and fbcon are compiled statically
50 50
51 Usually, fbcon will automatically take over your console. The notable 51 Usually, fbcon will automatically take over your console. The notable
52 exception is vesafb. It needs to be explicitly activated with the 52 exception is vesafb. It needs to be explicitly activated with the
53 vga= boot option parameter. 53 vga= boot option parameter.
54 54
55 2. Driver is compiled statically, fbcon is compiled as a module 55 2. Driver is compiled statically, fbcon is compiled as a module
56 56
57 Depending on the driver, you either get a standard console, or a 57 Depending on the driver, you either get a standard console, or a
58 garbled display, as mentioned above. To get a framebuffer console, 58 garbled display, as mentioned above. To get a framebuffer console,
59 do a 'modprobe fbcon'. 59 do a 'modprobe fbcon'.
60 60
61 3. Driver is compiled as a module, fbcon is compiled statically 61 3. Driver is compiled as a module, fbcon is compiled statically
62 62
63 You get your standard console. Once the driver is loaded with 63 You get your standard console. Once the driver is loaded with
64 'modprobe xxxfb', fbcon automatically takes over the console with 64 'modprobe xxxfb', fbcon automatically takes over the console with
65 the possible exception of using the fbcon=map:n option. See below. 65 the possible exception of using the fbcon=map:n option. See below.
66 66
67 4. Driver and fbcon are compiled as modules. 67 4. Driver and fbcon are compiled as modules.
68 68
69 You can load them in any order. Once both are loaded, fbcon will take 69 You can load them in any order. Once both are loaded, fbcon will take
70 over the console. 70 over the console.
71 71
72 C. Boot options 72 C. Boot options
73 73
74 The framebuffer console has several, largely unknown, boot options 74 The framebuffer console has several, largely unknown, boot options
75 that can change its behavior. 75 that can change its behavior.
76 76
77 1. fbcon=font:<name> 77 1. fbcon=font:<name>
78 78
79 Select the initial font to use. The value 'name' can be any of the 79 Select the initial font to use. The value 'name' can be any of the
80 compiled-in fonts: VGA8x16, 7x14, 10x18, VGA8x8, MINI4x6, RomanLarge, 80 compiled-in fonts: VGA8x16, 7x14, 10x18, VGA8x8, MINI4x6, RomanLarge,
81 SUN8x16, SUN12x22, ProFont6x11, Acorn8x8, PEARL8x8. 81 SUN8x16, SUN12x22, ProFont6x11, Acorn8x8, PEARL8x8.
82 82
83 Note, not all drivers can handle fonts with widths not divisible by 8; 83 Note, not all drivers can handle fonts with widths not divisible by 8;
84 vga16fb, for example, cannot. 84 vga16fb, for example, cannot.
85 85
86 2. fbcon=scrollback:<value>[k] 86 2. fbcon=scrollback:<value>[k]
87 87
88 The scrollback buffer is memory that is used to preserve display 88 The scrollback buffer is memory that is used to preserve display
89 contents that have already scrolled past your view. This is accessed 89 contents that have already scrolled past your view. This is accessed
90 by using the Shift-PageUp key combination. The value 'value' is any 90 by using the Shift-PageUp key combination. The value 'value' is any
91 integer. It defaults to 32KB. The 'k' suffix is optional, and will 91 integer. It defaults to 32KB. The 'k' suffix is optional, and will
92 multiply the 'value' by 1024. 92 multiply the 'value' by 1024.
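
The size rule for fbcon=scrollback can be sketched in shell as below. This is only an illustration of the arithmetic described above, not the kernel's parser; the function name scrollback_bytes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: turn an fbcon=scrollback:<value>[k] argument into bytes.
# Only the lowercase 'k' suffix described in the text is handled.
scrollback_bytes() {
    local arg="$1" num mult
    case $arg in
        *k) num="${arg%k}"; mult=1024 ;;   # 'k' suffix: multiply by 1024
        *)  num="$arg";     mult=1 ;;      # no suffix: value is already bytes
    esac
    case $num in
        ''|*[!0-9]*) echo "invalid scrollback value: $arg" >&2; return 1 ;;
    esac
    # Note: shell arithmetic reads a leading 0 as octal.
    echo $(( num * mult ))
}

scrollback_bytes 32k     # prints 32768, the documented default size
scrollback_bytes 4096    # prints 4096
```

So booting with fbcon=scrollback:64k requests a 65536-byte buffer.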
93 93
94 3. fbcon=map:<0123> 94 3. fbcon=map:<0123>
95 95
96 This is an interesting option. It tells which driver gets mapped to 96 This is an interesting option. It tells which driver gets mapped to
97 which console. The value '0123' is a sequence that gets repeated until 97 which console. The value '0123' is a sequence that gets repeated until
98 the total length is 64 which is the number of consoles available. In 98 the total length is 64 which is the number of consoles available. In
99 the above example, it is expanded to 012301230123... and the mapping 99 the above example, it is expanded to 012301230123... and the mapping
100 will be: 100 will be:
101 101
102 tty | 1 2 3 4 5 6 7 8 9 ... 102 tty | 1 2 3 4 5 6 7 8 9 ...
103 fb | 0 1 2 3 0 1 2 3 0 ... 103 fb | 0 1 2 3 0 1 2 3 0 ...
104 104
105 ('cat /proc/fb' should tell you what the fb numbers are) 105 ('cat /proc/fb' should tell you what the fb numbers are)
106 106
107 One side effect that may be useful is using a map value that exceeds 107 One side effect that may be useful is using a map value that exceeds
108 the number of loaded fb drivers. For example, if only one driver is 108 the number of loaded fb drivers. For example, if only one driver is
109 available, fb0, adding fbcon=map:1 tells fbcon not to take over the 109 available, fb0, adding fbcon=map:1 tells fbcon not to take over the
110 console. 110 console.
111 111
112 Later on, when you want to map the console to the framebuffer 112 Later on, when you want to map the console to the framebuffer
113 device, you can use the con2fbmap utility. 113 device, you can use the con2fbmap utility.
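
The expansion of the map value can be reproduced with a short shell sketch. It assumes nothing beyond the rule stated above (repeat the sequence until it is 64 characters long); expand_map and show_mapping are hypothetical names, and the code does not query the kernel.

```shell
#!/bin/bash
# Hypothetical helper: repeat the map sequence until it covers 64 consoles.
expand_map() {
    local pattern="$1" out=""
    case $pattern in
        ''|*[!0-9]*) echo "map must be a string of digits" >&2; return 1 ;;
    esac
    while [ "${#out}" -lt 64 ]; do
        out="$out$pattern"
    done
    printf '%.64s\n' "$out"      # truncate to exactly 64 characters
}

# Print the framebuffer assigned to the first nine ttys.
show_mapping() {
    local expanded tty=1 fb
    expanded=$(expand_map "$1") || return 1
    while [ "$tty" -le 9 ]; do
        fb=$(printf '%s\n' "$expanded" | cut -c"$tty")
        echo "tty$tty -> fb$fb"
        tty=$(( tty + 1 ))
    done
}

show_mapping 0123
```

For fbcon=map:0123 this prints tty1 -> fb0 through tty4 -> fb3 and then wraps around, matching the table above. With fbcon=map:1 every console maps to fb1, which is why fbcon stays off the console when only fb0 exists.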
114 114
115 4. fbcon=vc:<n1>-<n2> 115 4. fbcon=vc:<n1>-<n2>
116 116
117 This option tells fbcon to take over only a range of consoles as 117 This option tells fbcon to take over only a range of consoles as
118 specified by the values 'n1' and 'n2'. The rest of the consoles 118 specified by the values 'n1' and 'n2'. The rest of the consoles
119 outside the given range will still be controlled by the standard 119 outside the given range will still be controlled by the standard
120 console driver. 120 console driver.
121 121
122 NOTE: For x86 machines, the standard console is the VGA console which 122 NOTE: For x86 machines, the standard console is the VGA console which
123 is typically located on the same video card. Thus, the consoles that 123 is typically located on the same video card. Thus, the consoles that
124 are controlled by the VGA console will be garbled. 124 are controlled by the VGA console will be garbled.
125 125
126 5. fbcon=rotate:<n> 126 5. fbcon=rotate:<n>
127 127
128 This option changes the orientation angle of the console display. The 128 This option changes the orientation angle of the console display. The
129 value 'n' accepts the following: 129 value 'n' accepts the following:
130 130
131 0 - normal orientation (0 degree) 131 0 - normal orientation (0 degree)
132 1 - clockwise orientation (90 degrees) 132 1 - clockwise orientation (90 degrees)
133 2 - upside down orientation (180 degrees) 133 2 - upside down orientation (180 degrees)
134 3 - counterclockwise orientation (270 degrees) 134 3 - counterclockwise orientation (270 degrees)
135 135
136 The angle can be changed anytime afterwards by 'echoing' the same 136 The angle can be changed anytime afterwards by 'echoing' the same
137 numbers to any one of the 2 attributes found in 137 numbers to any one of the 2 attributes found in
138 /sys/class/graphics/fbcon 138 /sys/class/graphics/fbcon
139 139
140 rotate - rotate the display of the active console 140 rotate - rotate the display of the active console
141 rotate_all - rotate the display of all consoles 141 rotate_all - rotate the display of all consoles
142 142
143 Console rotation will only become available if Console Rotation 143 Console rotation will only become available if Console Rotation
144 Support is compiled in your kernel. 144 Support is compiled in your kernel.
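
Changing the angle at run time might look like the sketch below. It assumes the two attributes live under /sys/class/graphics/fbcon as stated above; the function set_rotation and the FBCON_SYSFS override are hypothetical and exist only so the sketch can be tried against a scratch directory.

```shell
#!/bin/bash
# Assumed location of the attributes; FBCON_SYSFS is a hypothetical override.
FBCON_SYSFS=${FBCON_SYSFS:-/sys/class/graphics/fbcon}

# set_rotation <n> [rotate|rotate_all]
set_rotation() {
    local n="$1" attr="${2:-rotate}"
    case $n in
        0|1|2|3) ;;
        *) echo "rotation must be 0, 1, 2 or 3" >&2; return 1 ;;
    esac
    case $attr in
        rotate|rotate_all) ;;
        *) echo "attribute must be rotate or rotate_all" >&2; return 1 ;;
    esac
    # The attribute is absent without rotation support and needs root to write.
    if [ ! -w "$FBCON_SYSFS/$attr" ]; then
        echo "$FBCON_SYSFS/$attr is not writable" >&2
        return 1
    fi
    echo "$n" > "$FBCON_SYSFS/$attr"
}

# Usage (as root):
#   set_rotation 1              # active console, 90 degrees clockwise
#   set_rotation 2 rotate_all   # all consoles, upside down
```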
145 145
146 NOTE: This is purely console rotation. Any other applications that 146 NOTE: This is purely console rotation. Any other applications that
147 use the framebuffer will remain at their 'normal' orientation. 147 use the framebuffer will remain at their 'normal' orientation.
148 Actually, the underlying fb driver is totally ignorant of console 148 Actually, the underlying fb driver is totally ignorant of console
149 rotation. 149 rotation.
150 150
151 D. Attaching, Detaching and Unloading 151 D. Attaching, Detaching and Unloading
152 152
153 Before going on to how to attach, detach and unload the framebuffer console, an 153 Before going on to how to attach, detach and unload the framebuffer console, an
154 illustration of the dependencies may help. 154 illustration of the dependencies may help.
155 155
156 The console layer, as with most subsystems, needs a driver that interfaces with 156 The console layer, as with most subsystems, needs a driver that interfaces with
157 the hardware. Thus, in a VGA console: 157 the hardware. Thus, in a VGA console:
158 158
159 console ---> VGA driver ---> hardware. 159 console ---> VGA driver ---> hardware.
160 160
161 Assuming the VGA driver can be unloaded, one must first unbind the VGA driver 161 Assuming the VGA driver can be unloaded, one must first unbind the VGA driver
162 from the console layer before unloading the driver. The VGA driver cannot be 162 from the console layer before unloading the driver. The VGA driver cannot be
163 unloaded if it is still bound to the console layer. (See 163 unloaded if it is still bound to the console layer. (See
164 Documentation/console/console.txt for more information). 164 Documentation/console/console.txt for more information).
165 165
166 This is more complicated in the case of the the framebuffer console (fbcon), 166 This is more complicated in the case of the framebuffer console (fbcon),
167 because fbcon is an intermediate layer between the console and the drivers: 167 because fbcon is an intermediate layer between the console and the drivers:
168 168
169 console ---> fbcon ---> fbdev drivers ---> hardware 169 console ---> fbcon ---> fbdev drivers ---> hardware
170 170
171 The fbdev drivers cannot be unloaded if they are bound to fbcon, and fbcon cannot 171 The fbdev drivers cannot be unloaded if they are bound to fbcon, and fbcon cannot
172 be unloaded if it's bound to the console layer. 172 be unloaded if it's bound to the console layer.
173 173
174 So to unload the fbdev drivers, one must first unbind fbcon from the console, 174 So to unload the fbdev drivers, one must first unbind fbcon from the console,
175 then unbind the fbdev drivers from fbcon. Fortunately, unbinding fbcon from 175 then unbind the fbdev drivers from fbcon. Fortunately, unbinding fbcon from
176 the console layer will automatically unbind framebuffer drivers from 176 the console layer will automatically unbind framebuffer drivers from
177 fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from 177 fbcon. Thus, there is no need to explicitly unbind the fbdev drivers from
178 fbcon. 178 fbcon.
179 179
180 So, how do we unbind fbcon from the console? Part of the answer is in 180 So, how do we unbind fbcon from the console? Part of the answer is in
181 Documentation/console/console.txt. To summarize: 181 Documentation/console/console.txt. To summarize:
182 182
183 Echo a value to the bind file that represents the framebuffer console 183 Echo a value to the bind file that represents the framebuffer console
184 driver. So assuming vtcon1 represents fbcon, then: 184 driver. So assuming vtcon1 represents fbcon, then:
185 185
186 echo 1 > /sys/class/vtconsole/vtcon1/bind - attach framebuffer console to 186 echo 1 > /sys/class/vtconsole/vtcon1/bind - attach framebuffer console to
187 console layer 187 console layer
188 echo 0 > /sys/class/vtconsole/vtcon1/bind - detach framebuffer console from 188 echo 0 > /sys/class/vtconsole/vtcon1/bind - detach framebuffer console from
189 console layer 189 console layer
190 190
191 If fbcon is detached from the console layer, your boot console driver (which is 191 If fbcon is detached from the console layer, your boot console driver (which is
192 usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will 192 usually VGA text mode) will take over. A few drivers (rivafb and i810fb) will
193 restore VGA text mode for you. With the rest, before detaching fbcon, you 193 restore VGA text mode for you. With the rest, before detaching fbcon, you
194 must take a few additional steps to make sure that your VGA text mode is 194 must take a few additional steps to make sure that your VGA text mode is
195 restored properly. The following is one of several methods that you can use: 195 restored properly. The following is one of several methods that you can use:
196 196
197 1. Download or install vbetool. This utility is included with most 197 1. Download or install vbetool. This utility is included with most
198 distributions nowadays, and is usually part of the suspend/resume tool. 198 distributions nowadays, and is usually part of the suspend/resume tool.
199 199
200 2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set 200 2. In your kernel configuration, ensure that CONFIG_FRAMEBUFFER_CONSOLE is set
201 to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers. 201 to 'y' or 'm'. Enable one or more of your favorite framebuffer drivers.
202 202
203 3. Boot into text mode and as root run: 203 3. Boot into text mode and as root run:
204 204
205 vbetool vbestate save > <vga state file> 205 vbetool vbestate save > <vga state file>
206 206
207 The above command saves the register contents of your graphics 207 The above command saves the register contents of your graphics
208 hardware to <vga state file>. You need to do this step only once as 208 hardware to <vga state file>. You need to do this step only once as
209 the state file can be reused. 209 the state file can be reused.
210 210
211 4. If fbcon is compiled as a module, load fbcon by doing: 211 4. If fbcon is compiled as a module, load fbcon by doing:
212 212
213 modprobe fbcon 213 modprobe fbcon
214 214
215 5. Now to detach fbcon: 215 5. Now to detach fbcon:
216 216
217 vbetool vbestate restore < <vga state file> && \ 217 vbetool vbestate restore < <vga state file> && \
218 echo 0 > /sys/class/vtconsole/vtcon1/bind 218 echo 0 > /sys/class/vtconsole/vtcon1/bind
219 219
220 6. That's it, you're back to VGA mode. And if you compiled fbcon as a module, 220 6. That's it, you're back to VGA mode. And if you compiled fbcon as a module,
221 you can unload it by 'rmmod fbcon' 221 you can unload it by 'rmmod fbcon'
222 222
223 7. To reattach fbcon: 223 7. To reattach fbcon:
224 224
225 echo 1 > /sys/class/vtconsole/vtcon1/bind 225 echo 1 > /sys/class/vtconsole/vtcon1/bind
226 226
227 8. Once fbcon is unbound, all drivers registered to the system will also 227 8. Once fbcon is unbound, all drivers registered to the system will also
228 become unbound. This means that fbcon and individual framebuffer drivers 228 become unbound. This means that fbcon and individual framebuffer drivers
229 can be unloaded or reloaded at will. Reloading the drivers or fbcon will 229 can be unloaded or reloaded at will. Reloading the drivers or fbcon will
230 automatically bind the console, fbcon and the drivers together. Unloading 230 automatically bind the console, fbcon and the drivers together. Unloading
231 all the drivers without unloading fbcon will make it impossible for the 231 all the drivers without unloading fbcon will make it impossible for the
232 console to bind fbcon. 232 console to bind fbcon.
233 233
234 Notes for vesafb users: 234 Notes for vesafb users:
235 ======================= 235 =======================
236 236
237 Unfortunately, if your bootline includes a vga=xxx parameter that sets the 237 Unfortunately, if your bootline includes a vga=xxx parameter that sets the
238 hardware in graphics mode, such as when loading vesafb, vgacon will not load. 238 hardware in graphics mode, such as when loading vesafb, vgacon will not load.
239 Instead, vgacon will replace the default boot console with dummycon, and you 239 Instead, vgacon will replace the default boot console with dummycon, and you
240 won't get any display after detaching fbcon. Your machine is still alive, so 240 won't get any display after detaching fbcon. Your machine is still alive, so
241 you can reattach vesafb. However, to reattach vesafb, you need to do one of 241 you can reattach vesafb. However, to reattach vesafb, you need to do one of
242 the following: 242 the following:
243 243
244 Variation 1: 244 Variation 1:
245 245
246 a. Before detaching fbcon, do 246 a. Before detaching fbcon, do
247 247
248 vbetool vbestate save > <vesa state file> # do once for each vesafb mode, 248 vbetool vbestate save > <vesa state file> # do once for each vesafb mode,
249 # the file can be reused 249 # the file can be reused
250 250
251 b. Detach fbcon as in step 5. 251 b. Detach fbcon as in step 5.
252 252
253 c. Attach fbcon 253 c. Attach fbcon
254 254
255 vbetool vbestate restore < <vesa state file> && \ 255 vbetool vbestate restore < <vesa state file> && \
256 echo 1 > /sys/class/vtconsole/vtcon1/bind 256 echo 1 > /sys/class/vtconsole/vtcon1/bind
257 257
258 Variation 2: 258 Variation 2:
259 259
260 a. Before detaching fbcon, do: 260 a. Before detaching fbcon, do:
261 echo <ID> > /sys/class/tty/console/bind 261 echo <ID> > /sys/class/tty/console/bind
262 262
263 263
264 vbetool vbemode get 264 vbetool vbemode get
265 265
266 b. Take note of the mode number 266 b. Take note of the mode number
267 267
268 c. Detach fbcon as in step 5. 268 c. Detach fbcon as in step 5.
269 269
270 d. Attach fbcon: 270 d. Attach fbcon:
271 271
272 vbetool vbemode set <mode number> && \ 272 vbetool vbemode set <mode number> && \
273 echo 1 > /sys/class/vtconsole/vtcon1/bind 273 echo 1 > /sys/class/vtconsole/vtcon1/bind
274 274
275 Samples: 275 Samples:
276 ======== 276 ========
277 277
278 Here are 2 sample bash scripts that you can use to bind or unbind the 278 Here are 2 sample bash scripts that you can use to bind or unbind the
279 framebuffer console driver if you are on an x86 box: 279 framebuffer console driver if you are on an x86 box:
280 280
281 --------------------------------------------------------------------------- 281 ---------------------------------------------------------------------------
282 #!/bin/bash 282 #!/bin/bash
283 # Unbind fbcon 283 # Unbind fbcon
284 284
285 # Change this to where your actual vgastate file is located 285 # Change this to where your actual vgastate file is located
286 # Or Use VGASTATE=$1 to indicate the state file at runtime 286 # Or Use VGASTATE=$1 to indicate the state file at runtime
287 VGASTATE=/tmp/vgastate 287 VGASTATE=/tmp/vgastate
288 288
289 # path to vbetool 289 # path to vbetool
290 VBETOOL=/usr/local/bin 290 VBETOOL=/usr/local/bin
291 291
292 292
293 for (( i = 0; i < 16; i++)) 293 for (( i = 0; i < 16; i++))
294 do 294 do
295 if test -x /sys/class/vtconsole/vtcon$i; then 295 if test -x /sys/class/vtconsole/vtcon$i; then
296 if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ 296 if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \
297 = 1 ]; then 297 = 1 ]; then
298 if test -x $VBETOOL/vbetool; then 298 if test -x $VBETOOL/vbetool; then
299 echo Unbinding vtcon$i 299 echo Unbinding vtcon$i
300 $VBETOOL/vbetool vbestate restore < $VGASTATE 300 $VBETOOL/vbetool vbestate restore < $VGASTATE
301 echo 0 > /sys/class/vtconsole/vtcon$i/bind 301 echo 0 > /sys/class/vtconsole/vtcon$i/bind
302 fi 302 fi
303 fi 303 fi
304 fi 304 fi
305 done 305 done
306 306
307 --------------------------------------------------------------------------- 307 ---------------------------------------------------------------------------
308 #!/bin/bash 308 #!/bin/bash
309 # Bind fbcon 309 # Bind fbcon
310 310
311 for (( i = 0; i < 16; i++)) 311 for (( i = 0; i < 16; i++))
312 do 312 do
313 if test -x /sys/class/vtconsole/vtcon$i; then 313 if test -x /sys/class/vtconsole/vtcon$i; then
314 if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \ 314 if [ `cat /sys/class/vtconsole/vtcon$i/name | grep -c "frame buffer"` \
315 = 1 ]; then 315 = 1 ]; then
316 echo Binding vtcon$i 316 echo Binding vtcon$i
317 echo 1 > /sys/class/vtconsole/vtcon$i/bind 317 echo 1 > /sys/class/vtconsole/vtcon$i/bind
318 fi 318 fi
319 fi 319 fi
320 done 320 done
321 --------------------------------------------------------------------------- 321 ---------------------------------------------------------------------------
322 322
323 -- 323 --
324 Antonino Daplas <adaplas@pol.net> 324 Antonino Daplas <adaplas@pol.net>
325 325
Documentation/filesystems/directory-locking
1 Locking scheme used for directory operations is based on two 1 Locking scheme used for directory operations is based on two
2 kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). 2 kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).
3 3
4 For our purposes all operations fall in 6 classes: 4 For our purposes all operations fall in 6 classes:
5 5
6 1) read access. Locking rules: caller locks directory we are accessing. 6 1) read access. Locking rules: caller locks directory we are accessing.
7 7
8 2) object creation. Locking rules: same as above. 8 2) object creation. Locking rules: same as above.
9 9
10 3) object removal. Locking rules: caller locks parent, finds victim, 10 3) object removal. Locking rules: caller locks parent, finds victim,
11 locks victim and calls the method. 11 locks victim and calls the method.
12 12
13 4) rename() that is _not_ cross-directory. Locking rules: caller locks 13 4) rename() that is _not_ cross-directory. Locking rules: caller locks
14 the parent, finds source and target, if target already exists - locks it 14 the parent, finds source and target, if target already exists - locks it
15 and then calls the method. 15 and then calls the method.
16 16
17 5) link creation. Locking rules: 17 5) link creation. Locking rules:
18 * lock parent 18 * lock parent
19 * check that source is not a directory 19 * check that source is not a directory
20 * lock source 20 * lock source
21 * call the method. 21 * call the method.
22 22
23 6) cross-directory rename. The trickiest in the whole bunch. Locking 23 6) cross-directory rename. The trickiest in the whole bunch. Locking
24 rules: 24 rules:
25 * lock the filesystem 25 * lock the filesystem
26 * lock parents in "ancestors first" order. 26 * lock parents in "ancestors first" order.
27 * find source and target. 27 * find source and target.
28 * if old parent is equal to or is a descendent of target 28 * if old parent is equal to or is a descendent of target
29 fail with -ENOTEMPTY 29 fail with -ENOTEMPTY
30 * if new parent is equal to or is a descendent of source 30 * if new parent is equal to or is a descendent of source
31 fail with -ELOOP 31 fail with -ELOOP
32 * if target exists - lock it. 32 * if target exists - lock it.
33 * call the method. 33 * call the method.
34 34
35 35
36 The rules above obviously guarantee that all directories that are going to be 36 The rules above obviously guarantee that all directories that are going to be
37 read, modified or removed by method will be locked by caller. 37 read, modified or removed by method will be locked by caller.
38 38
39 39
40 If no directory is its own ancestor, the scheme above is deadlock-free. 40 If no directory is its own ancestor, the scheme above is deadlock-free.
41 Proof: 41 Proof:
42 42
43 First of all, at any moment we have a partial ordering of the 43 First of all, at any moment we have a partial ordering of the
44 objects - A < B iff A is an ancestor of B. 44 objects - A < B iff A is an ancestor of B.
45 45
46 That ordering can change. However, the following is true: 46 That ordering can change. However, the following is true:
47 47
48 (1) if object removal or non-cross-directory rename holds lock on A and 48 (1) if object removal or non-cross-directory rename holds lock on A and
49 attempts to acquire lock on B, A will remain the parent of B until we 49 attempts to acquire lock on B, A will remain the parent of B until we
50 acquire the lock on B. (Proof: only cross-directory rename can change 50 acquire the lock on B. (Proof: only cross-directory rename can change
51 the parent of object and it would have to lock the parent). 51 the parent of object and it would have to lock the parent).
52 52
53 (2) if cross-directory rename holds the lock on filesystem, order will not 53 (2) if cross-directory rename holds the lock on filesystem, order will not
54 change until rename acquires all locks. (Proof: other cross-directory 54 change until rename acquires all locks. (Proof: other cross-directory
55 renames will be blocked on filesystem lock and we don't start changing 55 renames will be blocked on filesystem lock and we don't start changing
56 the order until we had acquired all locks). 56 the order until we had acquired all locks).
57 57
58 (3) any operation holds at most one lock on non-directory object and 58 (3) any operation holds at most one lock on non-directory object and
59 that lock is acquired after all other locks. (Proof: see descriptions 59 that lock is acquired after all other locks. (Proof: see descriptions
60 of operations). 60 of operations).
61 61
62 Now consider the minimal deadlock. Each process is blocked on 62 Now consider the minimal deadlock. Each process is blocked on
63 attempt to acquire some lock and already holds at least one lock. Let's 63 attempt to acquire some lock and already holds at least one lock. Let's
64 consider the set of contended locks. First of all, filesystem lock is 64 consider the set of contended locks. First of all, filesystem lock is
65 not contended, since any process blocked on it is not holding any locks. 65 not contended, since any process blocked on it is not holding any locks.
66 Thus all processes are blocked on ->i_sem. 66 Thus all processes are blocked on ->i_sem.
67 67
68 Non-directory objects are not contended due to (3). Thus link 68 Non-directory objects are not contended due to (3). Thus link
69 creation can't be a part of deadlock - it can't be blocked on source 69 creation can't be a part of deadlock - it can't be blocked on source
70 and it means that it doesn't hold any locks. 70 and it means that it doesn't hold any locks.
71 71
72 Any contended object is either held by cross-directory rename or 72 Any contended object is either held by cross-directory rename or
73 has a child that is also contended. Indeed, suppose that it is held by 73 has a child that is also contended. Indeed, suppose that it is held by
74 operation other than cross-directory rename. Then the lock this operation 74 operation other than cross-directory rename. Then the lock this operation
75 is blocked on belongs to child of that object due to (1). 75 is blocked on belongs to child of that object due to (1).
76 76
77 It means that one of the operations is cross-directory rename. 77 It means that one of the operations is cross-directory rename.
78 Otherwise the set of contended objects would be infinite - each of them 78 Otherwise the set of contended objects would be infinite - each of them
79 would have a contended child and we had assumed that no object is its 79 would have a contended child and we had assumed that no object is its
80 own descendent. Moreover, there is exactly one cross-directory rename 80 own descendent. Moreover, there is exactly one cross-directory rename
81 (see above). 81 (see above).
82 82
83 Consider the object blocking the cross-directory rename. One 83 Consider the object blocking the cross-directory rename. One
84 of its descendents is locked by cross-directory rename (otherwise we 84 of its descendents is locked by cross-directory rename (otherwise we
85 would again have an infinite set of of contended objects). But that 85 would again have an infinite set of contended objects). But that
86 means that cross-directory rename is taking locks out of order. Due 86 means that cross-directory rename is taking locks out of order. Due
87 to (2) the order hadn't changed since we had acquired filesystem lock. 87 to (2) the order hadn't changed since we had acquired filesystem lock.
88 But locking rules for cross-directory rename guarantee that we do not 88 But locking rules for cross-directory rename guarantee that we do not
89 try to acquire lock on descendent before the lock on ancestor. 89 try to acquire lock on descendent before the lock on ancestor.
90 Contradiction. I.e. deadlock is impossible. Q.E.D. 90 Contradiction. I.e. deadlock is impossible. Q.E.D.
91 91
92 92
93 These operations are guaranteed to avoid loop creation. Indeed, 93 These operations are guaranteed to avoid loop creation. Indeed,
94 the only operation that could introduce loops is cross-directory rename. 94 the only operation that could introduce loops is cross-directory rename.
95 Since the only new (parent, child) pair added by rename() is (new parent, 95 Since the only new (parent, child) pair added by rename() is (new parent,
96 source), such loop would have to contain these objects and the rest of it 96 source), such loop would have to contain these objects and the rest of it
97 would have to exist before rename(). I.e. at the moment of loop creation 97 would have to exist before rename(). I.e. at the moment of loop creation
98 rename() responsible for that would be holding filesystem lock and new parent 98 rename() responsible for that would be holding filesystem lock and new parent
99 would have to be equal to or a descendent of source. But that means that 99 would have to be equal to or a descendent of source. But that means that
100 new parent had been equal to or a descendent of source since the moment when 100 new parent had been equal to or a descendent of source since the moment when
101 we had acquired filesystem lock and rename() would fail with -ELOOP in that 101 we had acquired filesystem lock and rename() would fail with -ELOOP in that
102 case. 102 case.
103 103
104 While this locking scheme works for arbitrary DAGs, it relies on 104 While this locking scheme works for arbitrary DAGs, it relies on
105 ability to check that directory is a descendent of another object. Current 105 ability to check that directory is a descendent of another object. Current
106 implementation assumes that directory graph is a tree. This assumption is 106 implementation assumes that directory graph is a tree. This assumption is
107 also preserved by all operations (cross-directory rename on a tree that would 107 also preserved by all operations (cross-directory rename on a tree that would
108 not introduce a cycle will leave it a tree and link() fails for directories). 108 not introduce a cycle will leave it a tree and link() fails for directories).
109 109
110 Notice that "directory" in the above == "anything that might have 110 Notice that "directory" in the above == "anything that might have
111 children", so if we are going to introduce hybrid objects we will need 111 children", so if we are going to introduce hybrid objects we will need
112 either to make sure that link(2) doesn't work for them or to make changes 112 either to make sure that link(2) doesn't work for them or to make changes
113 in is_subdir() that would make it work even in presence of such beasts. 113 in is_subdir() that would make it work even in presence of such beasts.
114 114
Documentation/filesystems/files.txt
1 File management in the Linux kernel 1 File management in the Linux kernel
2 ----------------------------------- 2 -----------------------------------
3 3
4 This document describes how locking for files (struct file) 4 This document describes how locking for files (struct file)
5 and the file descriptor table (struct files_struct) works. 5 and the file descriptor table (struct files_struct) works.
6 6
7 Up until 2.6.12, the file descriptor table has been protected 7 Up until 2.6.12, the file descriptor table has been protected
8 with a lock (files->file_lock) and reference count (files->count). 8 with a lock (files->file_lock) and reference count (files->count).
9 ->file_lock protected accesses to all the file related fields 9 ->file_lock protected accesses to all the file related fields
10 of the table. ->count was used for sharing the file descriptor 10 of the table. ->count was used for sharing the file descriptor
11 table between tasks cloned with CLONE_FILES flag. Typically 11 table between tasks cloned with CLONE_FILES flag. Typically
12 this would be the case for posix threads. As with the common 12 this would be the case for posix threads. As with the common
13 refcounting model in the kernel, the last task doing 13 refcounting model in the kernel, the last task doing
14 a put_files_struct() frees the file descriptor (fd) table. 14 a put_files_struct() frees the file descriptor (fd) table.
15 The files (struct file) themselves are protected using 15 The files (struct file) themselves are protected using
16 reference count (->f_count). 16 reference count (->f_count).
17 17
18 In the new lock-free model of file descriptor management, 18 In the new lock-free model of file descriptor management,
19 the reference counting is similar, but the locking is 19 the reference counting is similar, but the locking is
20 based on RCU. The file descriptor table contains multiple 20 based on RCU. The file descriptor table contains multiple
21 elements - the fd sets (open_fds and close_on_exec, the 21 elements - the fd sets (open_fds and close_on_exec, the
22 array of file pointers, the sizes of the sets and the array 22 array of file pointers, the sizes of the sets and the array
23 etc.). In order for the updates to appear atomic to 23 etc.). In order for the updates to appear atomic to
24 a lock-free reader, all the elements of the file descriptor 24 a lock-free reader, all the elements of the file descriptor
25 table are in a separate structure - struct fdtable. 25 table are in a separate structure - struct fdtable.
26 files_struct contains a pointer to struct fdtable through 26 files_struct contains a pointer to struct fdtable through
27 which the actual fd table is accessed. Initially the 27 which the actual fd table is accessed. Initially the
28 fdtable is embedded in files_struct itself. On a subsequent 28 fdtable is embedded in files_struct itself. On a subsequent
29 expansion of fdtable, a new fdtable structure is allocated 29 expansion of fdtable, a new fdtable structure is allocated
30 and files->fdtab points to the new structure. The fdtable 30 and files->fdtab points to the new structure. The fdtable
31 structure is freed with RCU and lock-free readers either 31 structure is freed with RCU and lock-free readers either
32 see the old fdtable or the new fdtable making the update 32 see the old fdtable or the new fdtable making the update
33 appear atomic. Here are the locking rules for 33 appear atomic. Here are the locking rules for
34 the fdtable structure - 34 the fdtable structure -
35 35
36 1. All references to the fdtable must be done through 36 1. All references to the fdtable must be done through
37 the files_fdtable() macro : 37 the files_fdtable() macro :
38 38
39 struct fdtable *fdt; 39 struct fdtable *fdt;
40 40
41 rcu_read_lock(); 41 rcu_read_lock();
42 42
43 fdt = files_fdtable(files); 43 fdt = files_fdtable(files);
44 .... 44 ....
45 if (n <= fdt->max_fds) 45 if (n <= fdt->max_fds)
46 .... 46 ....
47 ... 47 ...
48 rcu_read_unlock(); 48 rcu_read_unlock();
49 49
50 files_fdtable() uses rcu_dereference() macro which takes care of 50 files_fdtable() uses rcu_dereference() macro which takes care of
51 the memory barrier requirements for lock-free dereference. 51 the memory barrier requirements for lock-free dereference.
52 The fdtable pointer must be read within the read-side 52 The fdtable pointer must be read within the read-side
53 critical section. 53 critical section.
54 54
55 2. Reading of the fdtable as described above must be protected 55 2. Reading of the fdtable as described above must be protected
56 by rcu_read_lock()/rcu_read_unlock(). 56 by rcu_read_lock()/rcu_read_unlock().
57 57
58 3. For any update to the the fd table, files->file_lock must 58 3. For any update to the fd table, files->file_lock must
59 be held. 59 be held.
60 60
61 4. To look up the file structure given an fd, a reader 61 4. To look up the file structure given an fd, a reader
62 must use either fcheck() or fcheck_files() APIs. These 62 must use either fcheck() or fcheck_files() APIs. These
63 take care of barrier requirements due to lock-free lookup. 63 take care of barrier requirements due to lock-free lookup.
64 An example : 64 An example :
65 65
66 struct file *file; 66 struct file *file;
67 67
68 rcu_read_lock(); 68 rcu_read_lock();
69 file = fcheck(fd); 69 file = fcheck(fd);
70 if (file) { 70 if (file) {
71 ... 71 ...
72 } 72 }
73 .... 73 ....
74 rcu_read_unlock(); 74 rcu_read_unlock();
75 75
76 5. Handling of the file structures is special. Since the look-up 76 5. Handling of the file structures is special. Since the look-up
77 of the fd (fget()/fget_light()) are lock-free, it is possible 77 of the fd (fget()/fget_light()) are lock-free, it is possible
78 that look-up may race with the last put() operation on the 78 that look-up may race with the last put() operation on the
79 file structure. This is avoided using the rcuref APIs 79 file structure. This is avoided using the rcuref APIs
80 on ->f_count : 80 on ->f_count :
81 81
82 rcu_read_lock(); 82 rcu_read_lock();
83 file = fcheck_files(files, fd); 83 file = fcheck_files(files, fd);
84 if (file) { 84 if (file) {
85 if (rcuref_inc_lf(&file->f_count)) 85 if (rcuref_inc_lf(&file->f_count))
86 *fput_needed = 1; 86 *fput_needed = 1;
87 else 87 else
88 /* Didn't get the reference, someone's freed */ 88 /* Didn't get the reference, someone's freed */
89 file = NULL; 89 file = NULL;
90 } 90 }
91 rcu_read_unlock(); 91 rcu_read_unlock();
92 .... 92 ....
93 return file; 93 return file;
94 94
95 rcuref_inc_lf() detects if the refcount is already zero or 95 rcuref_inc_lf() detects if the refcount is already zero or
96 goes to zero during increment. If it does, we fail 96 goes to zero during increment. If it does, we fail
97 fget()/fget_light(). 97 fget()/fget_light().
98 98
99 6. Since both fdtable and file structures can be looked up 99 6. Since both fdtable and file structures can be looked up
100 lock-free, they must be installed using rcu_assign_pointer() 100 lock-free, they must be installed using rcu_assign_pointer()
101 API. If they are looked up lock-free, rcu_dereference() 101 API. If they are looked up lock-free, rcu_dereference()
102 must be used. However it is advisable to use files_fdtable() 102 must be used. However it is advisable to use files_fdtable()
103 and fcheck()/fcheck_files() which take care of these issues. 103 and fcheck()/fcheck_files() which take care of these issues.
104 104
105 7. While updating, the fdtable pointer must be looked up while 105 7. While updating, the fdtable pointer must be looked up while
106 holding files->file_lock. If ->file_lock is dropped, then 106 holding files->file_lock. If ->file_lock is dropped, then
107 another thread may expand the files thereby creating a new 107 another thread may expand the files thereby creating a new
108 fdtable and making the earlier fdtable pointer stale. 108 fdtable and making the earlier fdtable pointer stale.
109 For example : 109 For example :
110 110
111 spin_lock(&files->file_lock); 111 spin_lock(&files->file_lock);
112 fd = locate_fd(files, file, start); 112 fd = locate_fd(files, file, start);
113 if (fd >= 0) { 113 if (fd >= 0) {
114 /* locate_fd() may have expanded fdtable, load the ptr */ 114 /* locate_fd() may have expanded fdtable, load the ptr */
115 fdt = files_fdtable(files); 115 fdt = files_fdtable(files);
116 FD_SET(fd, fdt->open_fds); 116 FD_SET(fd, fdt->open_fds);
117 FD_CLR(fd, fdt->close_on_exec); 117 FD_CLR(fd, fdt->close_on_exec);
118 spin_unlock(&files->file_lock); 118 spin_unlock(&files->file_lock);
119 ..... 119 .....
120 120
121 Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), 121 Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
122 the fdtable pointer (fdt) must be loaded after locate_fd(). 122 the fdtable pointer (fdt) must be loaded after locate_fd().
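
Rule 5 above depends on a reference-count increment that refuses to succeed once the count has reached zero. The following is a minimal user-space sketch of that idea using C11 atomics. It is not the kernel's rcuref_inc_lf() implementation; struct obj and obj_get_unless_zero() are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a ->f_count style refcount. */
struct obj {
    atomic_int refcount;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if the last reference is already gone.
 */
static bool obj_get_unless_zero(struct obj *o)
{
    int old = atomic_load(&o->refcount);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}
```

A lock-free reader that gets false back treats the lookup as failed, which mirrors the "file = NULL" branch in the example for rule 5.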
123 123
124 124
Documentation/filesystems/spufs.txt
1 SPUFS(2) Linux Programmer's Manual SPUFS(2) 1 SPUFS(2) Linux Programmer's Manual SPUFS(2)
2 2
3 3
4 4
5 NAME 5 NAME
6 spufs - the SPU file system 6 spufs - the SPU file system
7 7
8 8
9 DESCRIPTION 9 DESCRIPTION
10 The SPU file system is used on PowerPC machines that implement the Cell 10 The SPU file system is used on PowerPC machines that implement the Cell
11 Broadband Engine Architecture in order to access Synergistic Processor 11 Broadband Engine Architecture in order to access Synergistic Processor
12 Units (SPUs). 12 Units (SPUs).
13 13
14 The file system provides a name space similar to posix shared memory or 14 The file system provides a name space similar to posix shared memory or
15 message queues. Users that have write permissions on the file system 15 message queues. Users that have write permissions on the file system
16 can use spu_create(2) to establish SPU contexts in the spufs root. 16 can use spu_create(2) to establish SPU contexts in the spufs root.
17 17
18 Every SPU context is represented by a directory containing a predefined 18 Every SPU context is represented by a directory containing a predefined
19 set of files. These files can be used for manipulating the state of the 19 set of files. These files can be used for manipulating the state of the
20 logical SPU. Users can change permissions on those files, but not actu- 20 logical SPU. Users can change permissions on those files, but not actu-
21 ally add or remove files. 21 ally add or remove files.
22 22
23 23
24 MOUNT OPTIONS 24 MOUNT OPTIONS
25 uid=<uid> 25 uid=<uid>
26 set the user owning the mount point, the default is 0 (root). 26 set the user owning the mount point, the default is 0 (root).
27 27
28 gid=<gid> 28 gid=<gid>
29 set the group owning the mount point, the default is 0 (root). 29 set the group owning the mount point, the default is 0 (root).
30 30
31 31
32 FILES 32 FILES
33 The files in spufs mostly follow the standard behavior for regular sys- 33 The files in spufs mostly follow the standard behavior for regular sys-
34 tem calls like read(2) or write(2), but often support only a subset of 34 tem calls like read(2) or write(2), but often support only a subset of
35 the operations supported on regular file systems. This list details the 35 the operations supported on regular file systems. This list details the
36 supported operations and the deviations from the behaviour in the 36 supported operations and the deviations from the behaviour in the
37 respective man pages. 37 respective man pages.
38 38
39 All files that support the read(2) operation also support readv(2) and 39 All files that support the read(2) operation also support readv(2) and
40 all files that support the write(2) operation also support writev(2). 40 all files that support the write(2) operation also support writev(2).
41 All files support the access(2) and stat(2) family of operations, but 41 All files support the access(2) and stat(2) family of operations, but
42 only the st_mode, st_nlink, st_uid and st_gid fields of struct stat 42 only the st_mode, st_nlink, st_uid and st_gid fields of struct stat
43 contain reliable information. 43 contain reliable information.
44 44
45 All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- 45 All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera-
46 tions, but will not be able to grant permissions that contradict the 46 tions, but will not be able to grant permissions that contradict the
47 possible operations, e.g. read access on the wbox file. 47 possible operations, e.g. read access on the wbox file.
48 48
49 The current set of files is: 49 The current set of files is:
50 50
51 51
52 /mem 52 /mem
53 the contents of the local storage memory of the SPU. This can be 53 the contents of the local storage memory of the SPU. This can be
54 accessed like a regular shared memory file and contains both code and 54 accessed like a regular shared memory file and contains both code and
55 data in the address space of the SPU. The possible operations on an 55 data in the address space of the SPU. The possible operations on an
56 open mem file are: 56 open mem file are:
57 57
58 read(2), pread(2), write(2), pwrite(2), lseek(2) 58 read(2), pread(2), write(2), pwrite(2), lseek(2)
59 These operate as documented, with the exception that lseek(2), 59 These operate as documented, with the exception that lseek(2),
60 write(2) and pwrite(2) are not supported beyond the end of the 60 write(2) and pwrite(2) are not supported beyond the end of the
61 file. The file size is the size of the local storage of the SPU, 61 file. The file size is the size of the local storage of the SPU,
62 which normally is 256 kilobytes. 62 which normally is 256 kilobytes.
63 63
64 mmap(2) 64 mmap(2)
65 Mapping mem into the process address space gives access to the 65 Mapping mem into the process address space gives access to the
66 SPU local storage within the process address space. Only 66 SPU local storage within the process address space. Only
67 MAP_SHARED mappings are allowed. 67 MAP_SHARED mappings are allowed.
68 68
69 69
70 /mbox 70 /mbox
71 The first SPU to CPU communication mailbox. This file is read-only and 71 The first SPU to CPU communication mailbox. This file is read-only and
72 can be read in units of 32 bits. The file can only be used in non- 72 can be read in units of 32 bits. The file can only be used in non-
73 blocking mode and even poll() will not block on it. The possible 73 blocking mode and even poll() will not block on it. The possible
74 operations on an open mbox file are: 74 operations on an open mbox file are:
75 75
76 read(2) 76 read(2)
77 If a count smaller than four is requested, read returns -1 and 77 If a count smaller than four is requested, read returns -1 and
78 sets errno to EINVAL. If there is no data available in the mail 78 sets errno to EINVAL. If there is no data available in the mail
79 box, the return value is set to -1 and errno becomes EAGAIN. 79 box, the return value is set to -1 and errno becomes EAGAIN.
80 When data has been read successfully, four bytes are placed in 80 When data has been read successfully, four bytes are placed in
81 the data buffer and the value four is returned. 81 the data buffer and the value four is returned.
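
The read(2) behaviour described for mbox (four-byte reads, EAGAIN when the mailbox is empty) could be wrapped roughly as follows. This is a sketch only: mbox_try_read() is a hypothetical helper, and fd is assumed to be a descriptor already opened on an mbox file.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to read one 32-bit mailbox word from fd.
 * Returns 1 if a word was stored in *word, 0 if no data was available
 * (EAGAIN), and -1 on any other failure. For a short read or end of
 * file, errno is set to EIO; otherwise errno is whatever read() set.
 */
static int mbox_try_read(int fd, uint32_t *word)
{
    ssize_t n = read(fd, word, sizeof(*word));

    if (n == (ssize_t)sizeof(*word))
        return 1;
    if (n < 0 && errno == EAGAIN)
        return 0;
    if (n >= 0)
        errno = EIO;
    return -1;
}
```

Passing a buffer of exactly four bytes avoids the EINVAL case that the text describes for smaller counts.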
82 82
83 83
84 /ibox 84 /ibox
85 The second SPU to CPU communication mailbox. This file is similar to 85 The second SPU to CPU communication mailbox. This file is similar to
86 the first mailbox file, but can be read in blocking I/O mode, and the 86 the first mailbox file, but can be read in blocking I/O mode, and the
87 poll family of system calls can be used to wait for it. The possible 87 poll family of system calls can be used to wait for it. The possible
88 operations on an open ibox file are: 88 operations on an open ibox file are:
89 89
90 read(2) 90 read(2)
91 If a count smaller than four is requested, read returns -1 and 91 If a count smaller than four is requested, read returns -1 and
92 sets errno to EINVAL. If there is no data available in the mail 92 sets errno to EINVAL. If there is no data available in the mail
93 box and the file descriptor has been opened with O_NONBLOCK, the 93 box and the file descriptor has been opened with O_NONBLOCK, the
94 return value is set to -1 and errno becomes EAGAIN. 94 return value is set to -1 and errno becomes EAGAIN.
95 95
96 If there is no data available in the mail box and the file 96 If there is no data available in the mail box and the file
97 descriptor has been opened without O_NONBLOCK, the call will 97 descriptor has been opened without O_NONBLOCK, the call will
98 block until the SPU writes to its interrupt mailbox channel. 98 block until the SPU writes to its interrupt mailbox channel.
99 When data has been read successfully, four bytes are placed in 99 When data has been read successfully, four bytes are placed in
100 the data buffer and the value four is returned. 100 the data buffer and the value four is returned.
101 101
102 poll(2) 102 poll(2)
103 Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever 103 Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever
104 data is available for reading. 104 data is available for reading.
105 105
106 106
107 /wbox 107 /wbox
108 The CPU to SPU communication mailbox. It is write-only can can be written 108 The CPU to SPU communication mailbox. It is write-only and can be written
109 in units of 32 bits. If the mailbox is full, write() will block and 109 in units of 32 bits. If the mailbox is full, write() will block and
110 poll can be used to wait for it becoming empty again. The possible 110 poll can be used to wait for it becoming empty again. The possible
111 operations on an open wbox file are: write(2) If a count smaller than 111 operations on an open wbox file are: write(2) If a count smaller than
112 four is requested, write returns -1 and sets errno to EINVAL. If there 112 four is requested, write returns -1 and sets errno to EINVAL. If there
113 is no space available in the mail box and the file descriptor has been 113 is no space available in the mail box and the file descriptor has been
114 opened with O_NONBLOCK, the return value is set to -1 and errno becomes 114 opened with O_NONBLOCK, the return value is set to -1 and errno becomes
115 EAGAIN. 115 EAGAIN.
116 116
117 If there is no space available in the mail box and the file descriptor 117 If there is no space available in the mail box and the file descriptor
118 has been opened without O_NONBLOCK, the call will block until the SPU 118 has been opened without O_NONBLOCK, the call will block until the SPU
119 reads from its PPE mailbox channel. When data has been written success- 119 reads from its PPE mailbox channel. When data has been written success-
120 fully, four bytes are copied from the data buffer and the value four is 120 fully, four bytes are copied from the data buffer and the value four is
121 returned. 121 returned.
122 122
123 poll(2) 123 poll(2)
124 Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever 124 Poll on the wbox file returns (POLLOUT | POLLWRNORM) whenever
125 space is available for writing. 125 space is available for writing.
126 126
127 127
128 /mbox_stat 128 /mbox_stat
129 /ibox_stat 129 /ibox_stat
130 /wbox_stat 130 /wbox_stat
131 Read-only files that contain the length of the current queue, i.e. how 131 Read-only files that contain the length of the current queue, i.e. how
132 many words can be read from mbox or ibox or how many words can be 132 many words can be read from mbox or ibox or how many words can be
133 written to wbox without blocking. The files can be read only in 4-byte 133 written to wbox without blocking. The files can be read only in 4-byte
134 units and return a big-endian binary integer number. The possible 134 units and return a big-endian binary integer number. The possible
135 operations on an open *box_stat file are: 135 operations on an open *box_stat file are:
136 136
137 read(2) 137 read(2)
138 If a count smaller than four is requested, read returns -1 and 138 If a count smaller than four is requested, read returns -1 and
139 sets errno to EINVAL. Otherwise, a four byte value is placed in 139 sets errno to EINVAL. Otherwise, a four byte value is placed in
140 the data buffer, containing the number of elements that can be 140 the data buffer, containing the number of elements that can be
141 read from (for mbox_stat and ibox_stat) or written to (for 141 read from (for mbox_stat and ibox_stat) or written to (for
142 wbox_stat) the respective mail box without blocking or resulting 142 wbox_stat) the respective mail box without blocking or resulting
143 in EAGAIN. 143 in EAGAIN.
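
Since the *box_stat files return a big-endian binary integer, a reader that wants to stay independent of host byte order can decode the four bytes explicitly. The sketch below assumes the buffer was filled by a successful four-byte read(2); be32_decode() is a hypothetical helper name.

```c
#include <stdint.h>

/* Decode a 4-byte big-endian buffer into a host-order 32-bit value. */
static uint32_t be32_decode(const unsigned char buf[4])
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```

On a big-endian PowerPC host the result equals a plain 32-bit load of the buffer; the explicit shifts only matter if the same code is built elsewhere.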
144 144
145 145
146 /npc 146 /npc
147 /decr 147 /decr
148 /decr_status 148 /decr_status
149 /spu_tag_mask 149 /spu_tag_mask
150 /event_mask 150 /event_mask
151 /srr0 151 /srr0
152 Internal registers of the SPU. The representation is an ASCII string 152 Internal registers of the SPU. The representation is an ASCII string
153 with the numeric value of the next instruction to be executed. These 153 with the numeric value of the next instruction to be executed. These
154 can be used in read/write mode for debugging, but normal operation of 154 can be used in read/write mode for debugging, but normal operation of
155 programs should not rely on them because access to any of them except 155 programs should not rely on them because access to any of them except
156 npc requires an SPU context save and is therefore very inefficient. 156 npc requires an SPU context save and is therefore very inefficient.
157 157
158 The contents of these files are: 158 The contents of these files are:
159 159
160 npc Next Program Counter 160 npc Next Program Counter
161 161
162 decr SPU Decrementer 162 decr SPU Decrementer
163 163
164 decr_status Decrementer Status 164 decr_status Decrementer Status
165 165
166 spu_tag_mask MFC tag mask for SPU DMA 166 spu_tag_mask MFC tag mask for SPU DMA
167 167
168 event_mask Event mask for SPU interrupts 168 event_mask Event mask for SPU interrupts
169 169
170 srr0 Interrupt Return address register 170 srr0 Interrupt Return address register
171 171
172 172
173 The possible operations on an open npc, decr, decr_status, 173 The possible operations on an open npc, decr, decr_status,
174 spu_tag_mask, event_mask or srr0 file are: 174 spu_tag_mask, event_mask or srr0 file are:
175 175
176 read(2) 176 read(2)
177 When the count supplied to the read call is shorter than the 177 When the count supplied to the read call is shorter than the
178 required length for the pointer value plus a newline character, 178 required length for the pointer value plus a newline character,
179 subsequent reads from the same file descriptor will result in 179 subsequent reads from the same file descriptor will result in
180 completing the string, regardless of changes to the register by 180 completing the string, regardless of changes to the register by
181 a running SPU task. When a complete string has been read, all 181 a running SPU task. When a complete string has been read, all
182 subsequent read operations will return zero bytes and a new file 182 subsequent read operations will return zero bytes and a new file
183 descriptor needs to be opened to read the value again. 183 descriptor needs to be opened to read the value again.
184 184
185 write(2) 185 write(2)
186 A write operation on the file results in setting the register to 186 A write operation on the file results in setting the register to
187 the value given in the string. The string is parsed from the 187 the value given in the string. The string is parsed from the
188 beginning to the first non-numeric character or the end of the 188 beginning to the first non-numeric character or the end of the
189 buffer. Subsequent writes to the same file descriptor overwrite 189 buffer. Subsequent writes to the same file descriptor overwrite
190 the previous setting. 190 the previous setting.
191 191
192 192
193 /fpcr 193 /fpcr
194 This file gives access to the Floating Point Status and Control Regis- 194 This file gives access to the Floating Point Status and Control Regis-
195 ter as a four byte long file. The operations on the fpcr file are: 195 ter as a four byte long file. The operations on the fpcr file are:
196 196
197 read(2) 197 read(2)
198 If a count smaller than four is requested, read returns -1 and 198 If a count smaller than four is requested, read returns -1 and
199 sets errno to EINVAL. Otherwise, a four byte value is placed in 199 sets errno to EINVAL. Otherwise, a four byte value is placed in
200 the data buffer, containing the current value of the fpcr regis- 200 the data buffer, containing the current value of the fpcr regis-
201 ter. 201 ter.
202 202
203 write(2) 203 write(2)
204 If a count smaller than four is requested, write returns -1 and 204 If a count smaller than four is requested, write returns -1 and
205 sets errno to EINVAL. Otherwise, a four byte value is copied 205 sets errno to EINVAL. Otherwise, a four byte value is copied
206 from the data buffer, updating the value of the fpcr register. 206 from the data buffer, updating the value of the fpcr register.
207 207
208 208
209 /signal1 209 /signal1
210 /signal2 210 /signal2
211 The two signal notification channels of an SPU. These are read-write 211 The two signal notification channels of an SPU. These are read-write
212 files that operate on a 32 bit word. Writing to one of these files 212 files that operate on a 32 bit word. Writing to one of these files
213 triggers an interrupt on the SPU. The value written to the signal 213 triggers an interrupt on the SPU. The value written to the signal
214 files can be read from the SPU through a channel read or from host user 214 files can be read from the SPU through a channel read or from host user
215 space through the file. After the value has been read by the SPU, it 215 space through the file. After the value has been read by the SPU, it
216 is reset to zero. The possible operations on an open signal1 or sig- 216 is reset to zero. The possible operations on an open signal1 or sig-
217 nal2 file are: 217 nal2 file are:
218 218
219 read(2) 219 read(2)
220 If a count smaller than four is requested, read returns -1 and 220 If a count smaller than four is requested, read returns -1 and
221 sets errno to EINVAL. Otherwise, a four byte value is placed in 221 sets errno to EINVAL. Otherwise, a four byte value is placed in
222 the data buffer, containing the current value of the specified 222 the data buffer, containing the current value of the specified
223 signal notification register. 223 signal notification register.
224 224
225 write(2) 225 write(2)
226 If a count smaller than four is requested, write returns -1 and 226 If a count smaller than four is requested, write returns -1 and
227 sets errno to EINVAL. Otherwise, a four byte value is copied 227 sets errno to EINVAL. Otherwise, a four byte value is copied
228 from the data buffer, updating the value of the specified signal 228 from the data buffer, updating the value of the specified signal
229 notification register. The signal notification register will 229 notification register. The signal notification register will
230 either be replaced with the input data or will be updated to the 230 either be replaced with the input data or will be updated to the
231 bitwise OR of the old value and the input data, depending on the 231 bitwise OR of the old value and the input data, depending on the
232 contents of the signal1_type, or signal2_type respectively, 232 contents of the signal1_type, or signal2_type respectively,
233 file. 233 file.
234 234
235 235
236 /signal1_type 236 /signal1_type
237 /signal2_type 237 /signal2_type
238 These two files change the behavior of the signal1 and signal2 notifi- 238 These two files change the behavior of the signal1 and signal2 notifi-
239 cation files. They contain a numerical ASCII string which is read as 239 cation files. They contain a numerical ASCII string which is read as
240 either "1" or "0". In mode 0 (overwrite), the hardware replaces the 240 either "1" or "0". In mode 0 (overwrite), the hardware replaces the
241 contents of the signal channel with the data that is written to it. In 241 contents of the signal channel with the data that is written to it. In
242 mode 1 (logical OR), the hardware accumulates the bits that are subse- 242 mode 1 (logical OR), the hardware accumulates the bits that are subse-
243 quently written to it. The possible operations on an open signal1_type 243 quently written to it. The possible operations on an open signal1_type
244 or signal2_type file are: 244 or signal2_type file are:
245 245
246 read(2) 246 read(2)
247 When the count supplied to the read call is shorter than the 247 When the count supplied to the read call is shorter than the
248 required length for the digit plus a newline character, subse- 248 required length for the digit plus a newline character, subse-
249 quent reads from the same file descriptor will result in com- 249 quent reads from the same file descriptor will result in com-
250 pleting the string. When a complete string has been read, all 250 pleting the string. When a complete string has been read, all
251 subsequent read operations will return zero bytes and a new file 251 subsequent read operations will return zero bytes and a new file
252 descriptor needs to be opened to read the value again. 252 descriptor needs to be opened to read the value again.
253 253
254 write(2) 254 write(2)
255 A write operation on the file results in setting the register to 255 A write operation on the file results in setting the register to
256 the value given in the string. The string is parsed from the 256 the value given in the string. The string is parsed from the
257 beginning to the first non-numeric character or the end of the 257 beginning to the first non-numeric character or the end of the
258 buffer. Subsequent writes to the same file descriptor overwrite 258 buffer. Subsequent writes to the same file descriptor overwrite
259 the previous setting. 259 the previous setting.
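
The two modes selected through signal1_type and signal2_type can be modelled as a small pure function. This is only an illustration of the semantics described above, not of how the hardware or spufs implements them; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Mode values as described in the text: 0 = overwrite, 1 = logical OR. */
enum signal_mode {
    SIGNAL_MODE_OVERWRITE = 0,
    SIGNAL_MODE_OR = 1
};

/* Return the new signal register value after writing 'data' to it. */
static uint32_t signal_apply(uint32_t old, uint32_t data,
                             enum signal_mode mode)
{
    return mode == SIGNAL_MODE_OR ? (old | data) : data;
}
```

In OR mode successive writes accumulate bits until the SPU reads the channel, at which point the text says the value is reset to zero.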
260 260
261 261
262 EXAMPLES 262 EXAMPLES
263 /etc/fstab entry 263 /etc/fstab entry
264 none /spu spufs gid=spu 0 0 264 none /spu spufs gid=spu 0 0
265 265
266 266
267 AUTHORS 267 AUTHORS
268 Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, 268 Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>,
269 Ulrich Weigand <Ulrich.Weigand@de.ibm.com> 269 Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
270 270
271 SEE ALSO 271 SEE ALSO
272 capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) 272 capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7)
273 273
274 274
275 275
276 Linux 2005-09-28 SPUFS(2) 276 Linux 2005-09-28 SPUFS(2)
277 277
278 ------------------------------------------------------------------------------ 278 ------------------------------------------------------------------------------
279 279
280 SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) 280 SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2)
281 281
282 282
283 283
284 NAME 284 NAME
285 spu_run - execute an spu context 285 spu_run - execute an spu context
286 286
287 287
288 SYNOPSIS 288 SYNOPSIS
289 #include <sys/spu.h> 289 #include <sys/spu.h>
290 290
291 int spu_run(int fd, unsigned int *npc, unsigned int *event); 291 int spu_run(int fd, unsigned int *npc, unsigned int *event);
292 292
293 DESCRIPTION 293 DESCRIPTION
294 The spu_run system call is used on PowerPC machines that implement the 294 The spu_run system call is used on PowerPC machines that implement the
295 Cell Broadband Engine Architecture in order to access Synergistic Pro- 295 Cell Broadband Engine Architecture in order to access Synergistic Pro-
296 cessor Units (SPUs). It uses the fd that was returned from spu_cre- 296 cessor Units (SPUs). It uses the fd that was returned from spu_cre-
297 ate(2) to address a specific SPU context. When the context gets sched- 297 ate(2) to address a specific SPU context. When the context gets sched-
298 uled to a physical SPU, it starts execution at the instruction pointer 298 uled to a physical SPU, it starts execution at the instruction pointer
299 passed in npc. 299 passed in npc.
300 300
301 Execution of SPU code happens synchronously, meaning that spu_run does 301 Execution of SPU code happens synchronously, meaning that spu_run does
302 not return while the SPU is still running. If there is a need to exe- 302 not return while the SPU is still running. If there is a need to exe-
303 cute SPU code in parallel with other code on either the main CPU or 303 cute SPU code in parallel with other code on either the main CPU or
304 other SPUs, you need to create a new thread of execution first, e.g. 304 other SPUs, you need to create a new thread of execution first, e.g.
305 using the pthread_create(3) call. 305 using the pthread_create(3) call.
306 306
307 When spu_run returns, the current value of the SPU instruction pointer 307 When spu_run returns, the current value of the SPU instruction pointer
308 is written back to npc, so you can call spu_run again without updating 308 is written back to npc, so you can call spu_run again without updating
309 the pointers. 309 the pointers.
310 310
311 event can be a NULL pointer or point to an extended status code that 311 event can be a NULL pointer or point to an extended status code that
312 gets filled when spu_run returns. It can be one of the following con- 312 gets filled when spu_run returns. It can be one of the following con-
313 stants: 313 stants:
314 314
315 SPE_EVENT_DMA_ALIGNMENT 315 SPE_EVENT_DMA_ALIGNMENT
316 A DMA alignment error 316 A DMA alignment error
317 317
318 SPE_EVENT_SPE_DATA_SEGMENT 318 SPE_EVENT_SPE_DATA_SEGMENT
319 A DMA segmentation error 319 A DMA segmentation error
320 320
321 SPE_EVENT_SPE_DATA_STORAGE 321 SPE_EVENT_SPE_DATA_STORAGE
322 A DMA storage error 322 A DMA storage error
323 323
324 If NULL is passed as the event argument, these errors will result in a 324 If NULL is passed as the event argument, these errors will result in a
325 signal delivered to the calling process. 325 signal delivered to the calling process.
326 326
327 RETURN VALUE 327 RETURN VALUE
328 spu_run returns the value of the spu_status register or -1 to indicate 328 spu_run returns the value of the spu_status register or -1 to indicate
329 an error and set errno to one of the error codes listed below. The 329 an error and set errno to one of the error codes listed below. The
330 spu_status register value contains a bit mask of status codes and 330 spu_status register value contains a bit mask of status codes and
331 optionally a 14 bit code returned from the stop-and-signal instruction 331 optionally a 14 bit code returned from the stop-and-signal instruction
332 on the SPU. The bit masks for the status codes are: 332 on the SPU. The bit masks for the status codes are:
333 333
334 0x02 SPU was stopped by stop-and-signal. 334 0x02 SPU was stopped by stop-and-signal.
335 335
336 0x04 SPU was stopped by halt. 336 0x04 SPU was stopped by halt.
337 337
338 0x08 SPU is waiting for a channel. 338 0x08 SPU is waiting for a channel.
339 339
340 0x10 SPU is in single-step mode. 340 0x10 SPU is in single-step mode.
341 341
342 0x20 SPU has tried to execute an invalid instruction. 342 0x20 SPU has tried to execute an invalid instruction.
343 343
344 0x40 SPU has tried to access an invalid channel. 344 0x40 SPU has tried to access an invalid channel.
345 345
346 0x3fff0000 346 0x3fff0000
347 The bits masked with this value contain the code returned from 347 The bits masked with this value contain the code returned from
348 stop-and-signal. 348 stop-and-signal.
349 349
350 There are always one or more of the lower eight bits set or an error 350 There are always one or more of the lower eight bits set or an error
351 code is returned from spu_run. 351 code is returned from spu_run.
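
As an illustration of the bit masks above, a minimal C sketch for decoding a
spu_run return value could look as follows. The macro and function names are
hypothetical and do not come from any kernel or libc header; only the numeric
masks are taken from the table above.

```c
/* Bit masks copied from the RETURN VALUE table. The names are
 * illustrative assumptions, not identifiers from a real header. */
#define SPU_STOPPED_BY_STOP   0x02u
#define SPU_STOPPED_BY_HALT   0x04u
#define SPU_WAITING_CHANNEL   0x08u
#define SPU_SINGLE_STEP       0x10u
#define SPU_INVALID_INSTR     0x20u
#define SPU_INVALID_CHANNEL   0x40u
#define SPU_STOP_CODE_MASK    0x3fff0000u

/* Return the 14 bit stop-and-signal code held in bits 16..29, or -1
 * if the status does not report a stop-and-signal. */
static int spu_stop_code(unsigned int status)
{
    if (!(status & SPU_STOPPED_BY_STOP))
        return -1;
    return (int)((status & SPU_STOP_CODE_MASK) >> 16);
}
```

A caller would first check for a return value of -1 from spu_run and consult
errno, and only then pass the value to a helper like this one.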
352 352
353 ERRORS 353 ERRORS
354 EAGAIN or EWOULDBLOCK 354 EAGAIN or EWOULDBLOCK
355 fd is in non-blocking mode and spu_run would block. 355 fd is in non-blocking mode and spu_run would block.
356 356
357 EBADF fd is not a valid file descriptor. 357 EBADF fd is not a valid file descriptor.
358 358
359 EFAULT npc is not a valid pointer or status is neither NULL nor a valid 359 EFAULT npc is not a valid pointer or status is neither NULL nor a valid
360 pointer. 360 pointer.
361 361
362 EINTR A signal occurred while spu_run was in progress. The npc value 362 EINTR A signal occurred while spu_run was in progress. The npc value
363 has been updated to the new program counter value if necessary. 363 has been updated to the new program counter value if necessary.
364 364
365 EINVAL fd is not a file descriptor returned from spu_create(2). 365 EINVAL fd is not a file descriptor returned from spu_create(2).
366 366
367 ENOMEM Insufficient memory was available to handle a page fault result- 367 ENOMEM Insufficient memory was available to handle a page fault result-
368 ing from an MFC direct memory access. 368 ing from an MFC direct memory access.
369 369
370 ENOSYS the functionality is not provided by the current system, because 370 ENOSYS the functionality is not provided by the current system, because
371 either the hardware does not provide SPUs or the spufs module is 371 either the hardware does not provide SPUs or the spufs module is
372 not loaded. 372 not loaded.
373 373
374 374
375 NOTES 375 NOTES
376 spu_run is meant to be used from libraries that implement a more 376 spu_run is meant to be used from libraries that implement a more
377 abstract interface to SPUs, not to be used from regular applications. 377 abstract interface to SPUs, not to be used from regular applications.
378 See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 378 See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
379 ommended libraries. 379 ommended libraries.
380 380
381 381
382 CONFORMING TO 382 CONFORMING TO
383 This call is Linux specific and only implemented by the ppc64 architec- 383 This call is Linux specific and only implemented by the ppc64 architec-
384 ture. Programs using this system call are not portable. 384 ture. Programs using this system call are not portable.
385 385
386 386
387 BUGS 387 BUGS
388 The code does not yet fully implement all features outlined here. 388 The code does not yet fully implement all features outlined here.
389 389
390 390
391 AUTHOR 391 AUTHOR
392 Arnd Bergmann <arndb@de.ibm.com> 392 Arnd Bergmann <arndb@de.ibm.com>
393 393
394 SEE ALSO 394 SEE ALSO
395 capabilities(7), close(2), spu_create(2), spufs(7) 395 capabilities(7), close(2), spu_create(2), spufs(7)
396 396
397 397
398 398
399 Linux 2005-09-28 SPU_RUN(2) 399 Linux 2005-09-28 SPU_RUN(2)
400 400
401 ------------------------------------------------------------------------------ 401 ------------------------------------------------------------------------------
402 402
403 SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) 403 SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2)
404 404
405 405
406 406
407 NAME 407 NAME
408 spu_create - create a new spu context 408 spu_create - create a new spu context
409 409
410 410
411 SYNOPSIS 411 SYNOPSIS
412 #include <sys/types.h> 412 #include <sys/types.h>
413 #include <sys/spu.h> 413 #include <sys/spu.h>
414 414
415 int spu_create(const char *pathname, int flags, mode_t mode); 415 int spu_create(const char *pathname, int flags, mode_t mode);
416 416
417 DESCRIPTION 417 DESCRIPTION
418 The spu_create system call is used on PowerPC machines that implement 418 The spu_create system call is used on PowerPC machines that implement
419 the Cell Broadband Engine Architecture in order to access Synergistic 419 the Cell Broadband Engine Architecture in order to access Synergistic
420 Processor Units (SPUs). It creates a new logical context for an SPU in 420 Processor Units (SPUs). It creates a new logical context for an SPU in
421 pathname and returns a handle associated with it. pathname must 421 pathname and returns a handle associated with it. pathname must
422 point to a non-existing directory in the mount point of the SPU file 422 point to a non-existing directory in the mount point of the SPU file
423 system (spufs). When spu_create is successful, a directory gets cre- 423 system (spufs). When spu_create is successful, a directory gets cre-
424 ated on pathname and it is populated with files. 424 ated on pathname and it is populated with files.
425 425
426 The returned file handle can only be passed to spu_run(2) or closed, 426 The returned file handle can only be passed to spu_run(2) or closed,
427 other operations are not defined on it. When it is closed, all associ- 427 other operations are not defined on it. When it is closed, all associ-
428 ated directory entries in spufs are removed. When the last file handle 428 ated directory entries in spufs are removed. When the last file handle
429 pointing either inside of the context directory or to this file 429 pointing either inside of the context directory or to this file
430 descriptor is closed, the logical SPU context is destroyed. 430 descriptor is closed, the logical SPU context is destroyed.
431 431
432 The parameter flags can be zero or any bitwise or'd combination of the 432 The parameter flags can be zero or any bitwise or'd combination of the
433 following constants: 433 following constants:
434 434
435 SPU_RAWIO 435 SPU_RAWIO
436 Allow mapping of some of the hardware registers of the SPU into 436 Allow mapping of some of the hardware registers of the SPU into
437 user space. This flag requires the CAP_SYS_RAWIO capability, see 437 user space. This flag requires the CAP_SYS_RAWIO capability, see
438 capabilities(7). 438 capabilities(7).
439 439
440 The mode parameter specifies the permissions used for creating the new 440 The mode parameter specifies the permissions used for creating the new
441 directory in spufs. mode is modified with the user's umask(2) value 441 directory in spufs. mode is modified with the user's umask(2) value
442 and then used for both the directory and the files contained in it. The 442 and then used for both the directory and the files contained in it. The
443 file permissions mask out some more bits of mode because they typically 443 file permissions mask out some more bits of mode because they typically
444 support only read or write access. See stat(2) for a full list of the 444 support only read or write access. See stat(2) for a full list of the
445 possible mode values. 445 possible mode values.
446 446
447 447
448 RETURN VALUE 448 RETURN VALUE
449 spu_create returns a new file descriptor. It may return -1 to indicate 449 spu_create returns a new file descriptor. It may return -1 to indicate
450 an error condition and set errno to one of the error codes listed 450 an error condition and set errno to one of the error codes listed
451 below. 451 below.
452 452
453 453
454 ERRORS 454 ERRORS
455 EACCES 455 EACCES
456 The current user does not have write access on the spufs mount 456 The current user does not have write access on the spufs mount
457 point. 457 point.
458 458
459 EEXIST An SPU context already exists at the given path name. 459 EEXIST An SPU context already exists at the given path name.
460 460
461 EFAULT pathname is not a valid string pointer in the current address 461 EFAULT pathname is not a valid string pointer in the current address
462 space. 462 space.
463 463
464 EINVAL pathname is not a directory in the spufs mount point. 464 EINVAL pathname is not a directory in the spufs mount point.
465 465
466 ELOOP Too many symlinks were found while resolving pathname. 466 ELOOP Too many symlinks were found while resolving pathname.
467 467
468 EMFILE The process has reached its maximum open file limit. 468 EMFILE The process has reached its maximum open file limit.
469 469
470 ENAMETOOLONG 470 ENAMETOOLONG
471 pathname was too long. 471 pathname was too long.
472 472
473 ENFILE The system has reached the global open file limit. 473 ENFILE The system has reached the global open file limit.
474 474
475 ENOENT Part of pathname could not be resolved. 475 ENOENT Part of pathname could not be resolved.
476 476
477 ENOMEM The kernel could not allocate all resources required. 477 ENOMEM The kernel could not allocate all resources required.
478 478
479 ENOSPC There are not enough SPU resources available to create a new 479 ENOSPC There are not enough SPU resources available to create a new
480 context or the user specific limit for the number of SPU con- 480 context or the user specific limit for the number of SPU con-
481 texts has been reached. 481 texts has been reached.
482 482
483 ENOSYS the functionality is not provided by the current system, because 483 ENOSYS the functionality is not provided by the current system, because
484 either the hardware does not provide SPUs or the spufs module is 484 either the hardware does not provide SPUs or the spufs module is
485 not loaded. 485 not loaded.
486 486
487 ENOTDIR 487 ENOTDIR
488 A part of pathname is not a directory. 488 A part of pathname is not a directory.
489 489
490 490
491 491
492 NOTES 492 NOTES
493 spu_create is meant to be used from libraries that implement a more 493 spu_create is meant to be used from libraries that implement a more
494 abstract interface to SPUs, not to be used from regular applications. 494 abstract interface to SPUs, not to be used from regular applications.
495 See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- 495 See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec-
496 ommended libraries. 496 ommended libraries.
497 497
498 498
499 FILES 499 FILES
500 pathname must point to a location beneath the mount point of spufs. By 500 pathname must point to a location beneath the mount point of spufs. By
501 convention, it gets mounted in /spu. 501 convention, it gets mounted in /spu.
502 502
503 503
504 CONFORMING TO 504 CONFORMING TO
505 This call is Linux specific and only implemented by the ppc64 architec- 505 This call is Linux specific and only implemented by the ppc64 architec-
506 ture. Programs using this system call are not portable. 506 ture. Programs using this system call are not portable.
507 507
508 508
509 BUGS 509 BUGS
510 The code does not yet fully implement all features outlined here. 510 The code does not yet fully implement all features outlined here.
511 511
512 512
513 AUTHOR 513 AUTHOR
514 Arnd Bergmann <arndb@de.ibm.com> 514 Arnd Bergmann <arndb@de.ibm.com>
515 515
516 SEE ALSO 516 SEE ALSO
517 capabilities(7), close(2), spu_run(2), spufs(7) 517 capabilities(7), close(2), spu_run(2), spufs(7)
518 518
519 519
520 520
521 Linux 2005-09-28 SPU_CREATE(2) 521 Linux 2005-09-28 SPU_CREATE(2)
522 522
Documentation/filesystems/tmpfs.txt
1 Tmpfs is a file system which keeps all files in virtual memory. 1 Tmpfs is a file system which keeps all files in virtual memory.
2 2
3 3
4 Everything in tmpfs is temporary in the sense that no files will be 4 Everything in tmpfs is temporary in the sense that no files will be
5 created on your hard drive. If you unmount a tmpfs instance, 5 created on your hard drive. If you unmount a tmpfs instance,
6 everything stored therein is lost. 6 everything stored therein is lost.
7 7
8 tmpfs puts everything into the kernel internal caches and grows and 8 tmpfs puts everything into the kernel internal caches and grows and
9 shrinks to accommodate the files it contains and is able to swap 9 shrinks to accommodate the files it contains and is able to swap
10 unneeded pages out to swap space. It has maximum size limits which can 10 unneeded pages out to swap space. It has maximum size limits which can
11 be adjusted on the fly via 'mount -o remount ...' 11 be adjusted on the fly via 'mount -o remount ...'
12 12
13 If you compare it to ramfs (which was the template to create tmpfs) 13 If you compare it to ramfs (which was the template to create tmpfs)
14 you gain swapping and limit checking. Another similar thing is the RAM 14 you gain swapping and limit checking. Another similar thing is the RAM
15 disk (/dev/ram*), which simulates a fixed size hard disk in physical 15 disk (/dev/ram*), which simulates a fixed size hard disk in physical
16 RAM, where you have to create an ordinary filesystem on top. Ramdisks 16 RAM, where you have to create an ordinary filesystem on top. Ramdisks
17 cannot swap and you do not have the possibility to resize them. 17 cannot swap and you do not have the possibility to resize them.
18 18
19 Since tmpfs lives completely in the page cache and on swap, all tmpfs 19 Since tmpfs lives completely in the page cache and on swap, all tmpfs
20 pages currently in memory will show up as cached. It will not show up 20 pages currently in memory will show up as cached. It will not show up
21 as shared or something like that. Further on you can check the actual 21 as shared or something like that. Further on you can check the actual
22 RAM+swap use of a tmpfs instance with df(1) and du(1). 22 RAM+swap use of a tmpfs instance with df(1) and du(1).
23 23
24 24
25 tmpfs has the following uses: 25 tmpfs has the following uses:
26 26
27 1) There is always a kernel internal mount which you will not see at 27 1) There is always a kernel internal mount which you will not see at
28 all. This is used for shared anonymous mappings and SYSV shared 28 all. This is used for shared anonymous mappings and SYSV shared
29 memory. 29 memory.
30 30
31 This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not 31 This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not
32 set, the user visible part of tmpfs is not built. But the internal 32 set, the user visible part of tmpfs is not built. But the internal
33 mechanisms are always present. 33 mechanisms are always present.
34 34
35 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for 35 2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for
36 POSIX shared memory (shm_open, shm_unlink). Adding the following 36 POSIX shared memory (shm_open, shm_unlink). Adding the following
37 line to /etc/fstab should take care of this: 37 line to /etc/fstab should take care of this:
38 38
39 tmpfs /dev/shm tmpfs defaults 0 0 39 tmpfs /dev/shm tmpfs defaults 0 0
40 40
41 Remember to create the directory that you intend to mount tmpfs on 41 Remember to create the directory that you intend to mount tmpfs on
42 if necessary. 42 if necessary.
43 43
44 This mount is _not_ needed for SYSV shared memory. The internal 44 This mount is _not_ needed for SYSV shared memory. The internal
45 mount is used for that. (In the 2.3 kernel versions it was 45 mount is used for that. (In the 2.3 kernel versions it was
46 necessary to mount the predecessor of tmpfs (shm fs) to use SYSV 46 necessary to mount the predecessor of tmpfs (shm fs) to use SYSV
47 shared memory) 47 shared memory)
48 48
49 3) Some people (including me) find it very convenient to mount it 49 3) Some people (including me) find it very convenient to mount it
50 e.g. on /tmp and /var/tmp and have a big swap partition. And now 50 e.g. on /tmp and /var/tmp and have a big swap partition. And now
51 loop mounts of tmpfs files do work, so mkinitrd shipped by most 51 loop mounts of tmpfs files do work, so mkinitrd shipped by most
52 distributions should succeed with a tmpfs /tmp. 52 distributions should succeed with a tmpfs /tmp.
53 53
54 4) And probably a lot more I do not know about :-) 54 4) And probably a lot more I do not know about :-)
55 55
56 56
57 tmpfs has three mount options for sizing: 57 tmpfs has three mount options for sizing:
58 58
59 size: The limit of allocated bytes for this tmpfs instance. The 59 size: The limit of allocated bytes for this tmpfs instance. The
60 default is half of your physical RAM without swap. If you 60 default is half of your physical RAM without swap. If you
61 oversize your tmpfs instances the machine will deadlock 61 oversize your tmpfs instances the machine will deadlock
62 since the OOM handler will not be able to free that memory. 62 since the OOM handler will not be able to free that memory.
63 nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE. 63 nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE.
64 nr_inodes: The maximum number of inodes for this instance. The default 64 nr_inodes: The maximum number of inodes for this instance. The default
65 is half of the number of your physical RAM pages, or (on a 65 is half of the number of your physical RAM pages, or (on a
66 a machine with highmem) the number of lowmem RAM pages, 66 machine with highmem) the number of lowmem RAM pages,
67 whichever is the lower. 67 whichever is the lower.
68 68
69 These parameters accept a suffix k, m or g for kilo, mega and giga and 69 These parameters accept a suffix k, m or g for kilo, mega and giga and
70 can be changed on remount. The size parameter also accepts a suffix % 70 can be changed on remount. The size parameter also accepts a suffix %
71 to limit this tmpfs instance to that percentage of your physical RAM: 71 to limit this tmpfs instance to that percentage of your physical RAM:
72 the default, when neither size nor nr_blocks is specified, is size=50% 72 the default, when neither size nor nr_blocks is specified, is size=50%
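
The k, m and g suffixes can be illustrated with a small C sketch. It assumes
binary multiples (k = 1024), which matches the example in this file where
nr_inodes=10k yields 10240 inodes. The function name parse_suffixed is
hypothetical, the % suffix is not handled, and this is not the kernel's own
parser.

```c
#include <stdlib.h>

/* Parse a decimal number with an optional k, m or g suffix
 * (either case). Each suffix step multiplies by 1024. */
static unsigned long long parse_suffixed(const char *s)
{
    char *end;
    unsigned long long v = strtoull(s, &end, 10);

    switch (*end) {
    case 'g': case 'G':
        v <<= 10;
        /* fall through */
    case 'm': case 'M':
        v <<= 10;
        /* fall through */
    case 'k': case 'K':
        v <<= 10;
        break;
    default:
        break;
    }
    return v;
}
```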
73 73
74 If nr_blocks=0 (or size=0), blocks will not be limited in that instance; 74 If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
75 if nr_inodes=0, inodes will not be limited. It is generally unwise to 75 if nr_inodes=0, inodes will not be limited. It is generally unwise to
76 mount with such options, since it allows any user with write access to 76 mount with such options, since it allows any user with write access to
77 use up all the memory on the machine; but enhances the scalability of 77 use up all the memory on the machine; but enhances the scalability of
78 that instance in a system with many cpus making intensive use of it. 78 that instance in a system with many cpus making intensive use of it.
79 79
80 80
81 tmpfs has a mount option to set the NUMA memory allocation policy for 81 tmpfs has a mount option to set the NUMA memory allocation policy for
82 all files in that instance (if CONFIG_NUMA is enabled) - which can be 82 all files in that instance (if CONFIG_NUMA is enabled) - which can be
83 adjusted on the fly via 'mount -o remount ...' 83 adjusted on the fly via 'mount -o remount ...'
84 84
85 mpol=default prefers to allocate memory from the local node 85 mpol=default prefers to allocate memory from the local node
86 mpol=prefer:Node prefers to allocate memory from the given Node 86 mpol=prefer:Node prefers to allocate memory from the given Node
87 mpol=bind:NodeList allocates memory only from nodes in NodeList 87 mpol=bind:NodeList allocates memory only from nodes in NodeList
88 mpol=interleave prefers to allocate from each node in turn 88 mpol=interleave prefers to allocate from each node in turn
89 mpol=interleave:NodeList allocates from each node of NodeList in turn 89 mpol=interleave:NodeList allocates from each node of NodeList in turn
90 90
91 NodeList format is a comma-separated list of decimal numbers and ranges, 91 NodeList format is a comma-separated list of decimal numbers and ranges,
92 a range being two hyphen-separated decimal numbers, the smallest and 92 a range being two hyphen-separated decimal numbers, the smallest and
93 largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 93 largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
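
The NodeList format can be sketched in C as a parser that turns the list into
a bit mask of node numbers. The function name parse_nodelist and the limit of
64 nodes are assumptions made for this example; the real limit is
MAX_NUMNODES of the running kernel, and the kernel's own parser may differ in
detail.

```c
#include <stdlib.h>

/* Parse a comma-separated list of decimal numbers and hyphenated
 * ranges into a bit mask. Returns 0 on success, -1 on malformed
 * input, a reversed range, or a node number >= 64. */
static int parse_nodelist(const char *s, unsigned long long *mask)
{
    *mask = 0;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = hi = strtoul(s, &end, 10);
        if (*end == '-') {
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi >= 64)
            return -1;
        for (unsigned long n = lo; n <= hi; n++)
            *mask |= 1ULL << n;
        if (*end == '\0')
            return 0;
        if (*end != ',')
            return -1;
        s = end + 1;
    }
}
```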
94 94
95 Note that trying to mount a tmpfs with an mpol option will fail if the 95 Note that trying to mount a tmpfs with an mpol option will fail if the
96 running kernel does not support NUMA; and will fail if its nodelist 96 running kernel does not support NUMA; and will fail if its nodelist
97 specifies a node >= MAX_NUMNODES. If your system relies on that tmpfs 97 specifies a node >= MAX_NUMNODES. If your system relies on that tmpfs
98 being mounted, but from time to time runs a kernel built without NUMA 98 being mounted, but from time to time runs a kernel built without NUMA
99 capability (perhaps a safe recovery kernel), or configured to support 99 capability (perhaps a safe recovery kernel), or configured to support
100 fewer nodes, then it is advisable to omit the mpol option from automatic 100 fewer nodes, then it is advisable to omit the mpol option from automatic
101 mount options. It can be added later, when the tmpfs is already mounted 101 mount options. It can be added later, when the tmpfs is already mounted
102 on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. 102 on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'.
103 103
104 104
105 To specify the initial root directory you can use the following mount 105 To specify the initial root directory you can use the following mount
106 options: 106 options:
107 107
108 mode: The permissions as an octal number 108 mode: The permissions as an octal number
109 uid: The user id 109 uid: The user id
110 gid: The group id 110 gid: The group id
111 111
112 These options do not have any effect on remount. You can change these 112 These options do not have any effect on remount. You can change these
113 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. 113 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
114 114
115 115
116 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' 116 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
117 will give you a tmpfs instance on /mytmpfs which can allocate 10GB 117 will give you a tmpfs instance on /mytmpfs which can allocate 10GB
118 RAM/SWAP in 10240 inodes and it is only accessible by root. 118 RAM/SWAP in 10240 inodes and it is only accessible by root.
119 119
120 120
121 Author: 121 Author:
122 Christoph Rohland <cr@sap.com>, 1.12.01 122 Christoph Rohland <cr@sap.com>, 1.12.01
123 Updated: 123 Updated:
124 Hugh Dickins <hugh@veritas.com>, 19 February 2006 124 Hugh Dickins <hugh@veritas.com>, 19 February 2006
125 125
Documentation/filesystems/vfat.txt
1 USING VFAT 1 USING VFAT
2 ---------------------------------------------------------------------- 2 ----------------------------------------------------------------------
3 To use the vfat filesystem, use the filesystem type 'vfat'. i.e. 3 To use the vfat filesystem, use the filesystem type 'vfat'. i.e.
4 mount -t vfat /dev/fd0 /mnt 4 mount -t vfat /dev/fd0 /mnt
5 5
6 No special partition formatter is required. mkdosfs will work fine 6 No special partition formatter is required. mkdosfs will work fine
7 if you want to format from within Linux. 7 if you want to format from within Linux.
8 8
9 VFAT MOUNT OPTIONS 9 VFAT MOUNT OPTIONS
10 ---------------------------------------------------------------------- 10 ----------------------------------------------------------------------
11 umask=### -- The permission mask (for files and directories, see umask(1)). 11 umask=### -- The permission mask (for files and directories, see umask(1)).
12 The default is the umask of current process. 12 The default is the umask of current process.
13 13
14 dmask=### -- The permission mask for the directory. 14 dmask=### -- The permission mask for the directory.
15 The default is the umask of current process. 15 The default is the umask of current process.
16 16
17 fmask=### -- The permission mask for files. 17 fmask=### -- The permission mask for files.
18 The default is the umask of current process. 18 The default is the umask of current process.
19 19
20 codepage=### -- Sets the codepage number for converting to shortname 20 codepage=### -- Sets the codepage number for converting to shortname
21 characters on FAT filesystem. 21 characters on FAT filesystem.
22 By default, FAT_DEFAULT_CODEPAGE setting is used. 22 By default, FAT_DEFAULT_CODEPAGE setting is used.
23 23
24 iocharset=name -- Character set to use for converting between the 24 iocharset=name -- Character set to use for converting between the
25 encoding used for user visible filenames and 16 bit 25 encoding used for user visible filenames and 16 bit
26 Unicode characters. Long filenames are stored on disk 26 Unicode characters. Long filenames are stored on disk
27 in Unicode format, but Unix for the most part doesn't 27 in Unicode format, but Unix for the most part doesn't
28 know how to deal with Unicode. 28 know how to deal with Unicode.
29 By default, FAT_DEFAULT_IOCHARSET setting is used. 29 By default, FAT_DEFAULT_IOCHARSET setting is used.
30 30
31 There is also an option of doing UTF-8 translations 31 There is also an option of doing UTF-8 translations
32 with the utf8 option. 32 with the utf8 option.
33 33
34 NOTE: "iocharset=utf8" is not recommended. If unsure, 34 NOTE: "iocharset=utf8" is not recommended. If unsure,
35 you should consider the following option instead. 35 you should consider the following option instead.
36 36
37 utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that 37 utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that
38 is used by the console. It can be be enabled for the 38 is used by the console. It can be enabled for the
39 filesystem with this option. If 'uni_xlate' gets set, 39 filesystem with this option. If 'uni_xlate' gets set,
40 UTF-8 gets disabled. 40 UTF-8 gets disabled.
41 41
42 uni_xlate=<bool> -- Translate unhandled Unicode characters to special 42 uni_xlate=<bool> -- Translate unhandled Unicode characters to special
43 escaped sequences. This would let you backup and 43 escaped sequences. This would let you backup and
44 restore filenames that are created with any Unicode 44 restore filenames that are created with any Unicode
45 characters. Until Linux supports Unicode for real, 45 characters. Until Linux supports Unicode for real,
46 this gives you an alternative. Without this option, 46 this gives you an alternative. Without this option,
47 a '?' is used when no translation is possible. The 47 a '?' is used when no translation is possible. The
48 escape character is ':' because it is otherwise 48 escape character is ':' because it is otherwise
49 illegal on the vfat filesystem. The escape sequence 49 illegal on the vfat filesystem. The escape sequence
50 that gets used is ':' and the four digits of hexadecimal 50 that gets used is ':' and the four digits of hexadecimal
51 unicode. 51 unicode.
52 52
53 nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will 53 nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will
54 end in '~1' or tilde followed by some number. If this 54 end in '~1' or tilde followed by some number. If this
55 option is set, then if the filename is 55 option is set, then if the filename is
56 "longfilename.txt" and "longfile.txt" does not 56 "longfilename.txt" and "longfile.txt" does not
57 currently exist in the directory, 'longfile.txt' will 57 currently exist in the directory, 'longfile.txt' will
58 be the short alias instead of 'longfi~1.txt'. 58 be the short alias instead of 'longfi~1.txt'.
59 59
60 quiet -- Stops printing certain warning messages. 60 quiet -- Stops printing certain warning messages.
61 61
62 check=s|r|n -- Case sensitivity checking setting. 62 check=s|r|n -- Case sensitivity checking setting.
63 s: strict, case sensitive 63 s: strict, case sensitive
64 r: relaxed, case insensitive 64 r: relaxed, case insensitive
65 n: normal, default setting, currently case insensitive 65 n: normal, default setting, currently case insensitive
66 66
67 shortname=lower|win95|winnt|mixed 67 shortname=lower|win95|winnt|mixed
68 -- Shortname display/create setting. 68 -- Shortname display/create setting.
69 lower: convert to lowercase for display, 69 lower: convert to lowercase for display,
70 emulate the Windows 95 rule for create. 70 emulate the Windows 95 rule for create.
71 win95: emulate the Windows 95 rule for display/create. 71 win95: emulate the Windows 95 rule for display/create.
72 winnt: emulate the Windows NT rule for display/create. 72 winnt: emulate the Windows NT rule for display/create.
73 mixed: emulate the Windows NT rule for display, 73 mixed: emulate the Windows NT rule for display,
74 emulate the Windows 95 rule for create. 74 emulate the Windows 95 rule for create.
75 Default setting is `lower'. 75 Default setting is `lower'.
76 76
77 <bool>: 0,1,yes,no,true,false 77 <bool>: 0,1,yes,no,true,false
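
Two of the options above, uni_xlate and nonumtail, can be illustrated with a
short C sketch. Both helper names are hypothetical. The use of lowercase
hexadecimal digits in the escape sequence is an assumption, and short_alias
skips the check that the alias does not already exist in the directory as
well as the character and case conversions a real implementation needs.

```c
#include <stdio.h>
#include <string.h>

/* uni_xlate: ':' followed by the four hexadecimal digits of the
 * 16 bit Unicode value (lowercase digits assumed here). */
static void uni_escape(unsigned short code, char out[6])
{
    snprintf(out, 6, ":%04x", (unsigned int)code);
}

/* nonumtail: build an 8.3 alias by truncating the base name to 8
 * characters, or to 6 characters plus "~1" when the option is off. */
static void short_alias(const char *name, int nonumtail, char out[13])
{
    const char *dot = strrchr(name, '.');
    size_t base = dot ? (size_t)(dot - name) : strlen(name);
    size_t keep = nonumtail ? 8 : 6;
    int n;

    if (base < keep)
        keep = base;
    n = snprintf(out, 13, "%.*s%s", (int)keep, name,
                 nonumtail ? "" : "~1");
    if (dot)
        snprintf(out + n, 13 - (size_t)n, ".%.3s", dot + 1);
}
```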
78 78
79 TODO 79 TODO
80 ---------------------------------------------------------------------- 80 ----------------------------------------------------------------------
81 * Need to get rid of the raw scanning stuff. Instead, always use 81 * Need to get rid of the raw scanning stuff. Instead, always use
82 a get next directory entry approach. The only thing left that uses 82 a get next directory entry approach. The only thing left that uses
83 raw scanning is the directory renaming code. 83 raw scanning is the directory renaming code.
84 84
85 85
86 POSSIBLE PROBLEMS 86 POSSIBLE PROBLEMS
87 ---------------------------------------------------------------------- 87 ----------------------------------------------------------------------
88 * vfat_valid_longname does not properly check reserved names. 88 * vfat_valid_longname does not properly check reserved names.
89 * When a volume name is the same as a directory name in the root 89 * When a volume name is the same as a directory name in the root
90 directory of the filesystem, the directory name sometimes shows 90 directory of the filesystem, the directory name sometimes shows
91 up as an empty file. 91 up as an empty file.
92 * autoconv option does not work correctly. 92 * autoconv option does not work correctly.
93 93
94 BUG REPORTS 94 BUG REPORTS
95 ---------------------------------------------------------------------- 95 ----------------------------------------------------------------------
96 If you have trouble with the VFAT filesystem, mail bug reports to 96 If you have trouble with the VFAT filesystem, mail bug reports to
97 chaffee@bmrc.cs.berkeley.edu. Please specify the filename 97 chaffee@bmrc.cs.berkeley.edu. Please specify the filename
98 and the operation that gave you trouble. 98 and the operation that gave you trouble.
99 99
100 TEST SUITE 100 TEST SUITE
101 ---------------------------------------------------------------------- 101 ----------------------------------------------------------------------
102 If you plan to make any modifications to the vfat filesystem, please 102 If you plan to make any modifications to the vfat filesystem, please
103 get the test suite that comes with the vfat distribution at 103 get the test suite that comes with the vfat distribution at
104 104
105 http://bmrc.berkeley.edu/people/chaffee/vfat.html 105 http://bmrc.berkeley.edu/people/chaffee/vfat.html
106 106
107 This tests quite a few parts of the vfat filesystem and additional 107 This tests quite a few parts of the vfat filesystem and additional
108 tests for new features or untested features would be appreciated. 108 tests for new features or untested features would be appreciated.
109 109
110 NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM 110 NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM
111 ---------------------------------------------------------------------- 111 ----------------------------------------------------------------------
112 (This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu> 112 (This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu>
113 and lightly annotated by Gordon Chaffee). 113 and lightly annotated by Gordon Chaffee).
114 114
115 This document presents a very rough, technical overview of my 115 This document presents a very rough, technical overview of my
116 knowledge of the extended FAT file system used in Windows NT 3.5 and 116 knowledge of the extended FAT file system used in Windows NT 3.5 and
117 Windows 95. I don't guarantee that any of the following is correct, 117 Windows 95. I don't guarantee that any of the following is correct,
118 but it appears to be so. 118 but it appears to be so.
119 119
120 The extended FAT file system is almost identical to the FAT 120 The extended FAT file system is almost identical to the FAT
121 file system used in DOS versions up to and including 6.223410239847 121 file system used in DOS versions up to and including 6.223410239847
122 :-). The significant change has been the addition of long file names. 122 :-). The significant change has been the addition of long file names.
123 These names support up to 255 characters including spaces and lower 123 These names support up to 255 characters including spaces and lower
124 case characters as opposed to the traditional 8.3 short names. 124 case characters as opposed to the traditional 8.3 short names.
125 125
126 Here is the description of the traditional FAT entry in the current 126 Here is the description of the traditional FAT entry in the current
127 Windows 95 filesystem: 127 Windows 95 filesystem:
128 128
129 struct directory { // Short 8.3 names 129 struct directory { // Short 8.3 names
130 unsigned char name[8]; // file name 130 unsigned char name[8]; // file name
131 unsigned char ext[3]; // file extension 131 unsigned char ext[3]; // file extension
132 unsigned char attr; // attribute byte 132 unsigned char attr; // attribute byte
133 unsigned char lcase; // Case for base and extension 133 unsigned char lcase; // Case for base and extension
134 unsigned char ctime_ms; // Creation time, milliseconds 134 unsigned char ctime_ms; // Creation time, milliseconds
135 unsigned char ctime[2]; // Creation time 135 unsigned char ctime[2]; // Creation time
136 unsigned char cdate[2]; // Creation date 136 unsigned char cdate[2]; // Creation date
137 unsigned char adate[2]; // Last access date 137 unsigned char adate[2]; // Last access date
138 unsigned char reserved[2]; // reserved values (ignored) 138 unsigned char reserved[2]; // reserved values (ignored)
139 unsigned char time[2]; // time stamp 139 unsigned char time[2]; // time stamp
140 unsigned char date[2]; // date stamp 140 unsigned char date[2]; // date stamp
141 unsigned char start[2]; // starting cluster number 141 unsigned char start[2]; // starting cluster number
142 unsigned char size[4]; // size of the file 142 unsigned char size[4]; // size of the file
143 }; 143 };
144 144
145 The lcase field specifies if the base and/or the extension of an 8.3 145 The lcase field specifies if the base and/or the extension of an 8.3
146 name should be capitalized. This field does not seem to be used by 146 name should be capitalized. This field does not seem to be used by
147 Windows 95 but it is used by Windows NT. The case of filenames is not 147 Windows 95 but it is used by Windows NT. The case of filenames is not
148 completely compatible from Windows NT to Windows 95. It is not completely 148 completely compatible from Windows NT to Windows 95. It is not completely
149 compatible in the reverse direction, however. Filenames that fit in 149 compatible in the reverse direction, however. Filenames that fit in
150 the 8.3 namespace and are written on Windows NT to be lowercase will 150 the 8.3 namespace and are written on Windows NT to be lowercase will
151 show up as uppercase on Windows 95. 151 show up as uppercase on Windows 95.
152 152
153 Note that the "start" and "size" values are actually little 153 Note that the "start" and "size" values are actually little
154 endian integer values. The descriptions of the fields in this 154 endian integer values. The descriptions of the fields in this
155 structure are public knowledge and can be found elsewhere. 155 structure are public knowledge and can be found elsewhere.
156 156
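As a minimal sketch of what "little endian" means for these fields, the helpers below assemble the "start" and "size" byte arrays into integers. The names le16(), le32(), dir_start() and dir_size() are made up for this illustration and are not taken from any kernel source; the structure is the one shown above.

```c
#include <stdint.h>

struct directory {			/* Short 8.3 entry, as described above */
	unsigned char name[8];
	unsigned char ext[3];
	unsigned char attr;
	unsigned char lcase;
	unsigned char ctime_ms;
	unsigned char ctime[2];
	unsigned char cdate[2];
	unsigned char adate[2];
	unsigned char reserved[2];
	unsigned char time[2];
	unsigned char date[2];
	unsigned char start[2];
	unsigned char size[4];
};

/* Hypothetical helper: the least significant byte is stored first. */
static uint16_t le16(const unsigned char b[2])
{
	return (uint16_t)(b[0] | (b[1] << 8));
}

/* Hypothetical helper: same rule for a four-byte value. */
static uint32_t le32(const unsigned char b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

static uint16_t dir_start(const struct directory *d)
{
	return le16(d->start);
}

static uint32_t dir_size(const struct directory *d)
{
	return le32(d->size);
}
```

Reading the bytes one at a time like this works the same on big and little endian hosts, which is why it is preferred here over casting the array to an integer pointer.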
157 With the extended FAT system, Microsoft has inserted extra 157 With the extended FAT system, Microsoft has inserted extra
158 directory entries for any files with extended names. (Any name which 158 directory entries for any files with extended names. (Any name which
159 legally fits within the old 8.3 encoding scheme does not have extra 159 legally fits within the old 8.3 encoding scheme does not have extra
160 entries.) I call these extra entries slots. Basically, a slot is a 160 entries.) I call these extra entries slots. Basically, a slot is a
161 specially formatted directory entry which holds up to 13 characters of 161 specially formatted directory entry which holds up to 13 characters of
162 a file's extended name. Think of slots as additional labeling for the 162 a file's extended name. Think of slots as additional labeling for the
163 directory entry of the file to which they correspond. Microsoft 163 directory entry of the file to which they correspond. Microsoft
164 prefers to refer to the 8.3 entry for a file as its alias and the 164 prefers to refer to the 8.3 entry for a file as its alias and the
165 extended slot directory entries as the file name. 165 extended slot directory entries as the file name.
166 166
167 The C structure for a slot directory entry follows: 167 The C structure for a slot directory entry follows:
168 168
169 struct slot { // Up to 13 characters of a long name 169 struct slot { // Up to 13 characters of a long name
170 unsigned char id; // sequence number for slot 170 unsigned char id; // sequence number for slot
171 unsigned char name0_4[10]; // first 5 characters in name 171 unsigned char name0_4[10]; // first 5 characters in name
172 unsigned char attr; // attribute byte 172 unsigned char attr; // attribute byte
173 unsigned char reserved; // always 0 173 unsigned char reserved; // always 0
174 unsigned char alias_checksum; // checksum for 8.3 alias 174 unsigned char alias_checksum; // checksum for 8.3 alias
175 unsigned char name5_10[12]; // 6 more characters in name 175 unsigned char name5_10[12]; // 6 more characters in name
176 unsigned char start[2]; // starting cluster number 176 unsigned char start[2]; // starting cluster number
177 unsigned char name11_12[4]; // last 2 characters in name 177 unsigned char name11_12[4]; // last 2 characters in name
178 }; 178 };
179 179
180 If the layout of the slots looks a little odd, it's only 180 If the layout of the slots looks a little odd, it's only
181 because of Microsoft's efforts to maintain compatibility with old 181 because of Microsoft's efforts to maintain compatibility with old
182 software. The slots must be disguised to prevent old software from 182 software. The slots must be disguised to prevent old software from
183 panicking. To this end, a number of measures are taken: 183 panicking. To this end, a number of measures are taken:
184 184
185 1) The attribute byte for a slot directory entry is always set 185 1) The attribute byte for a slot directory entry is always set
186 to 0x0f. This corresponds to an old directory entry with 186 to 0x0f. This corresponds to an old directory entry with
187 attributes of "hidden", "system", "read-only", and "volume 187 attributes of "hidden", "system", "read-only", and "volume
188 label". Most old software will ignore any directory 188 label". Most old software will ignore any directory
189 entries with the "volume label" bit set. Real volume label 189 entries with the "volume label" bit set. Real volume label
190 entries don't have the other three bits set. 190 entries don't have the other three bits set.
191 191
192 2) The starting cluster is always set to 0, an impossible 192 2) The starting cluster is always set to 0, an impossible
193 value for a DOS file. 193 value for a DOS file.
194 194
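The two measures above suggest a cheap test for telling a slot from an ordinary 8.3 entry. The following is only a sketch: entry_is_slot() is a hypothetical name, and the byte offsets (11 for the attribute byte, 26 for the starting cluster) are derived from the two structures shown earlier, assuming each directory entry is a raw 32-byte record.

```c
/*
 * Hypothetical check based on the two rules above: a slot has its
 * attribute byte set to exactly 0x0f and its starting cluster set to 0.
 * In both structures the attribute byte is at offset 11 and the
 * two-byte starting cluster is at offset 26.
 */
static int entry_is_slot(const unsigned char entry[32])
{
	return entry[11] == 0x0f && entry[26] == 0 && entry[27] == 0;
}
```

A real volume label has only the "volume label" bit (0x08) set, so it fails the first comparison and is not mistaken for a slot.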
195 Because the extended FAT system is backward compatible, it is 195 Because the extended FAT system is backward compatible, it is
196 possible for old software to modify directory entries. Measures must 196 possible for old software to modify directory entries. Measures must
197 be taken to ensure the validity of slots. An extended FAT system can 197 be taken to ensure the validity of slots. An extended FAT system can
198 verify that a slot does in fact belong to an 8.3 directory entry by 198 verify that a slot does in fact belong to an 8.3 directory entry by
199 the following: 199 the following:
200 200
201 1) Positioning. Slots for a file always immediately precede 201 1) Positioning. Slots for a file always immediately precede
202 their corresponding 8.3 directory entry. In addition, each 202 their corresponding 8.3 directory entry. In addition, each
203 slot has an id which marks its order in the extended file 203 slot has an id which marks its order in the extended file
204 name. Here is a very abbreviated view of an 8.3 directory 204 name. Here is a very abbreviated view of an 8.3 directory
205 entry and its corresponding long name slots for the file 205 entry and its corresponding long name slots for the file
206 "My Big File.Extension which is long": 206 "My Big File.Extension which is long":
207 207
208 <preceding files...> 208 <preceding files...>
209 <slot #3, id = 0x43, characters = "h is long"> 209 <slot #3, id = 0x43, characters = "h is long">
210 <slot #2, id = 0x02, characters = "xtension whic"> 210 <slot #2, id = 0x02, characters = "xtension whic">
211 <slot #1, id = 0x01, characters = "My Big File.E"> 211 <slot #1, id = 0x01, characters = "My Big File.E">
212 <directory entry, name = "MYBIGFIL.EXT"> 212 <directory entry, name = "MYBIGFIL.EXT">
213 213
214 Note that the slots are stored from last to first. Slots 214 Note that the slots are stored from last to first. Slots
215 are numbered from 1 to N. The Nth slot is or'ed with 0x40 215 are numbered from 1 to N. The Nth slot is or'ed with 0x40
216 to mark it as the last one. 216 to mark it as the last one.
217 217
218 2) Checksum. Each slot has an "alias_checksum" value. The 218 2) Checksum. Each slot has an "alias_checksum" value. The
219 checksum is calculated from the 8.3 name using the 219 checksum is calculated from the 8.3 name using the
220 following algorithm: 220 following algorithm:
221 221
222 for (sum = i = 0; i < 11; i++) { 222 for (sum = i = 0; i < 11; i++) {
223 sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i]; 223 sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i];
224 } 224 }
225 225
226 3) If there is free space in the final slot, a Unicode NULL (0x0000) 226 3) If there is free space in the final slot, a Unicode NULL (0x0000)
227 is stored after the final character. After that, all unused 227 is stored after the final character. After that, all unused
228 characters in the final slot are set to Unicode 0xFFFF. 228 characters in the final slot are set to Unicode 0xFFFF.
229 229
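The checksum loop from point 2 can be written as a small self-contained C function. This is only a sketch: the name alias_checksum() is made up for illustration, and it assumes the input is the 11 raw bytes of the 8.3 entry (name[8] followed by ext[3], space padded, without the dot). The sum is held in an unsigned char so that it wraps to 8 bits on every step, which the one-line form leaves implicit.

```c
/*
 * Hypothetical helper: checksum of the 11-byte 8.3 alias.
 * Each step rotates the 8-bit sum right by one bit and then adds the
 * next name byte; the cast keeps the running value within 8 bits.
 */
static unsigned char alias_checksum(const unsigned char name[11])
{
	unsigned char sum = 0;
	int i;

	for (i = 0; i < 11; i++)
		sum = (unsigned char)((((sum & 1) << 7) |
				       ((sum & 0xfe) >> 1)) + name[i]);
	return sum;
}
```

A reader of the directory would compute this value once from the 8.3 entry and compare it with the alias_checksum byte of every slot that precedes it; a mismatch means the slots no longer belong to that entry.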
230 Finally, note that the extended name is stored in Unicode. Each Unicode 230 Finally, note that the extended name is stored in Unicode. Each Unicode
231 character takes two bytes. 231 character takes two bytes.
232 232
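To make the slot layout concrete, here is a minimal sketch of pulling the characters back out of a single slot. The helper names (slot_char_ptr, slot_chars, slot_is_last, slot_seq) are hypothetical, and the code assumes that each two-byte character is stored low byte first, which the text above does not state explicitly.

```c
#include <stddef.h>

struct slot {				/* as described above */
	unsigned char id;
	unsigned char name0_4[10];
	unsigned char attr;
	unsigned char reserved;
	unsigned char alias_checksum;
	unsigned char name5_10[12];
	unsigned char start[2];
	unsigned char name11_12[4];
};

/* Pointer to the two bytes holding character idx (0..12) of a slot. */
static unsigned char *slot_char_ptr(struct slot *s, size_t idx)
{
	if (idx < 5)
		return &s->name0_4[2 * idx];
	if (idx < 11)
		return &s->name5_10[2 * (idx - 5)];
	return &s->name11_12[2 * (idx - 11)];
}

/*
 * Copy the characters of one slot into out[], stopping at the 0x0000
 * terminator if there is one.  Returns the number of characters copied.
 * Assumption: low byte first within each two-byte character.
 */
static size_t slot_chars(struct slot *s, unsigned short out[13])
{
	size_t i, n = 0;

	for (i = 0; i < 13; i++) {
		const unsigned char *p = slot_char_ptr(s, i);
		unsigned short c = (unsigned short)(p[0] | (p[1] << 8));

		if (c == 0x0000)
			break;		/* the rest is 0xFFFF padding */
		out[n++] = c;
	}
	return n;
}

/* The Nth (last) slot has 0x40 or'ed into its id. */
static int slot_is_last(const struct slot *s)
{
	return (s->id & 0x40) != 0;
}

/* Sequence number 1..N with the "last" marker removed. */
static int slot_seq(const struct slot *s)
{
	return s->id & ~0x40;
}
```

Because the slots are stored from last to first, a reader would typically place the characters of each slot at offset (slot_seq() - 1) * 13 in the output name rather than appending them in the order they are encountered.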
Documentation/filesystems/vfs.txt
1 1
2 Overview of the Linux Virtual File System 2 Overview of the Linux Virtual File System
3 3
4 Original author: Richard Gooch <rgooch@atnf.csiro.au> 4 Original author: Richard Gooch <rgooch@atnf.csiro.au>
5 5
6 Last updated on October 28, 2005 6 Last updated on October 28, 2005
7 7
8 Copyright (C) 1999 Richard Gooch 8 Copyright (C) 1999 Richard Gooch
9 Copyright (C) 2005 Pekka Enberg 9 Copyright (C) 2005 Pekka Enberg
10 10
11 This file is released under the GPLv2. 11 This file is released under the GPLv2.
12 12
13 13
14 Introduction 14 Introduction
15 ============ 15 ============
16 16
17 The Virtual File System (also known as the Virtual Filesystem Switch) 17 The Virtual File System (also known as the Virtual Filesystem Switch)
18 is the software layer in the kernel that provides the filesystem 18 is the software layer in the kernel that provides the filesystem
19 interface to userspace programs. It also provides an abstraction 19 interface to userspace programs. It also provides an abstraction
20 within the kernel which allows different filesystem implementations to 20 within the kernel which allows different filesystem implementations to
21 coexist. 21 coexist.
22 22
23 VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so 23 VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so
24 on are called from a process context. Filesystem locking is described 24 on are called from a process context. Filesystem locking is described
25 in the document Documentation/filesystems/Locking. 25 in the document Documentation/filesystems/Locking.
26 26
27 27
28 Directory Entry Cache (dcache) 28 Directory Entry Cache (dcache)
29 ------------------------------ 29 ------------------------------
30 30
31 The VFS implements the open(2), stat(2), chmod(2), and similar system 31 The VFS implements the open(2), stat(2), chmod(2), and similar system
32 calls. The pathname argument that is passed to them is used by the VFS 32 calls. The pathname argument that is passed to them is used by the VFS
33 to search through the directory entry cache (also known as the dentry 33 to search through the directory entry cache (also known as the dentry
34 cache or dcache). This provides a very fast look-up mechanism to 34 cache or dcache). This provides a very fast look-up mechanism to
35 translate a pathname (filename) into a specific dentry. Dentries live 35 translate a pathname (filename) into a specific dentry. Dentries live
36 in RAM and are never saved to disc: they exist only for performance. 36 in RAM and are never saved to disc: they exist only for performance.
37 37
38 The dentry cache is meant to be a view into your entire filespace. As 38 The dentry cache is meant to be a view into your entire filespace. As
39 most computers cannot fit all dentries in the RAM at the same time, 39 most computers cannot fit all dentries in the RAM at the same time,
40 some bits of the cache are missing. In order to resolve your pathname 40 some bits of the cache are missing. In order to resolve your pathname
41 into a dentry, the VFS may have to resort to creating dentries along 41 into a dentry, the VFS may have to resort to creating dentries along
42 the way, and then loading the inode. This is done by looking up the 42 the way, and then loading the inode. This is done by looking up the
43 inode. 43 inode.
44 44
45 45
46 The Inode Object 46 The Inode Object
47 ---------------- 47 ----------------
48 48
49 An individual dentry usually has a pointer to an inode. Inodes are 49 An individual dentry usually has a pointer to an inode. Inodes are
50 filesystem objects such as regular files, directories, FIFOs and other 50 filesystem objects such as regular files, directories, FIFOs and other
51 beasts. They live either on the disc (for block device filesystems) 51 beasts. They live either on the disc (for block device filesystems)
52 or in the memory (for pseudo filesystems). Inodes that live on the 52 or in the memory (for pseudo filesystems). Inodes that live on the
53 disc are copied into the memory when required and changes to the inode 53 disc are copied into the memory when required and changes to the inode
54 are written back to disc. A single inode can be pointed to by multiple 54 are written back to disc. A single inode can be pointed to by multiple
55 dentries (hard links, for example, do this). 55 dentries (hard links, for example, do this).
56 56
57 To look up an inode requires that the VFS calls the lookup() method of 57 To look up an inode requires that the VFS calls the lookup() method of
58 the parent directory inode. This method is installed by the specific 58 the parent directory inode. This method is installed by the specific
59 filesystem implementation that the inode lives in. Once the VFS has 59 filesystem implementation that the inode lives in. Once the VFS has
60 the required dentry (and hence the inode), we can do all those boring 60 the required dentry (and hence the inode), we can do all those boring
61 things like open(2) the file, or stat(2) it to peek at the inode 61 things like open(2) the file, or stat(2) it to peek at the inode
62 data. The stat(2) operation is fairly simple: once the VFS has the 62 data. The stat(2) operation is fairly simple: once the VFS has the
63 dentry, it peeks at the inode data and passes some of it back to 63 dentry, it peeks at the inode data and passes some of it back to
64 userspace. 64 userspace.
65 65
66 66
67 The File Object 67 The File Object
68 --------------- 68 ---------------
69 69
70 Opening a file requires another operation: allocation of a file 70 Opening a file requires another operation: allocation of a file
71 structure (this is the kernel-side implementation of file 71 structure (this is the kernel-side implementation of file
72 descriptors). The freshly allocated file structure is initialized with 72 descriptors). The freshly allocated file structure is initialized with
73 a pointer to the dentry and a set of file operation member functions. 73 a pointer to the dentry and a set of file operation member functions.
74 These are taken from the inode data. The open() file method is then 74 These are taken from the inode data. The open() file method is then
75 called so the specific filesystem implementation can do its work. You 75 called so the specific filesystem implementation can do its work. You
76 can see that this is another switch performed by the VFS. The file 76 can see that this is another switch performed by the VFS. The file
77 structure is placed into the file descriptor table for the process. 77 structure is placed into the file descriptor table for the process.
78 78
79 Reading, writing and closing files (and other assorted VFS operations) 79 Reading, writing and closing files (and other assorted VFS operations)
80 are done by using the userspace file descriptor to grab the appropriate 80 are done by using the userspace file descriptor to grab the appropriate
81 file structure, and then calling the required file structure method to 81 file structure, and then calling the required file structure method to
82 do whatever is required. For as long as the file is open, it keeps the 82 do whatever is required. For as long as the file is open, it keeps the
83 dentry in use, which in turn means that the VFS inode is still in use. 83 dentry in use, which in turn means that the VFS inode is still in use.
84 84
85 85
86 Registering and Mounting a Filesystem 86 Registering and Mounting a Filesystem
87 ===================================== 87 =====================================
88 88
89 To register and unregister a filesystem, use the following API 89 To register and unregister a filesystem, use the following API
90 functions: 90 functions:
91 91
92 #include <linux/fs.h> 92 #include <linux/fs.h>
93 93
94 extern int register_filesystem(struct file_system_type *); 94 extern int register_filesystem(struct file_system_type *);
95 extern int unregister_filesystem(struct file_system_type *); 95 extern int unregister_filesystem(struct file_system_type *);
96 96
97 The passed struct file_system_type describes your filesystem. When a 97 The passed struct file_system_type describes your filesystem. When a
98 request is made to mount a device onto a directory in your filespace, 98 request is made to mount a device onto a directory in your filespace,
99 the VFS will call the appropriate get_sb() method for the specific 99 the VFS will call the appropriate get_sb() method for the specific
100 filesystem. The dentry for the mount point will then be updated to 100 filesystem. The dentry for the mount point will then be updated to
101 point to the root inode for the new filesystem. 101 point to the root inode for the new filesystem.
102 102
103 You can see all filesystems that are registered to the kernel in the 103 You can see all filesystems that are registered to the kernel in the
104 file /proc/filesystems. 104 file /proc/filesystems.
105 105
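The registration and dispatch idea can be illustrated with a userspace toy model. This is not kernel code and does not use the kernel API: every toy_* name is invented for the example, the single mount() callback merely stands in for get_sb(), and the return values (0 on success, -1 on failure) are an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
	const char *name;
	int (*mount)(const char *dev_name);	/* stands in for get_sb() */
	struct toy_fs_type *next;		/* NULL until registered */
};

static struct toy_fs_type *toy_fs_list;

/* Return the link that points at the named type, or the list tail. */
static struct toy_fs_type **toy_find(const char *name)
{
	struct toy_fs_type **p;

	for (p = &toy_fs_list; *p; p = &(*p)->next)
		if (strcmp((*p)->name, name) == 0)
			break;
	return p;
}

static int toy_register(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	if (fs->next)
		return -1;		/* already linked somewhere */
	p = toy_find(fs->name);
	if (*p)
		return -1;		/* name already taken */
	*p = fs;
	return 0;
}

static int toy_unregister(struct toy_fs_type *fs)
{
	struct toy_fs_type **p;

	for (p = &toy_fs_list; *p; p = &(*p)->next) {
		if (*p == fs) {
			*p = fs->next;
			fs->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* The "switch": pick the implementation by name and call into it. */
static int toy_mount(const char *type, const char *dev_name)
{
	struct toy_fs_type *fs = *toy_find(type);

	return fs ? fs->mount(dev_name) : -1;
}

/* A made-up filesystem type used to exercise the model. */
static int toy_ramfs_mount(const char *dev_name)
{
	(void)dev_name;
	return 42;
}

static struct toy_fs_type toy_ramfs = { "toyramfs", toy_ramfs_mount, NULL };
```

In this model, listing the names on toy_fs_list would play the role of reading /proc/filesystems.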
106 106
107 struct file_system_type 107 struct file_system_type
108 ----------------------- 108 -----------------------
109 109
110 This describes the filesystem. As of kernel 2.6.13, the following 110 This describes the filesystem. As of kernel 2.6.13, the following
111 members are defined: 111 members are defined:
112 112
113 struct file_system_type { 113 struct file_system_type {
114 const char *name; 114 const char *name;
115 int fs_flags; 115 int fs_flags;
116 int (*get_sb) (struct file_system_type *, int, 116 int (*get_sb) (struct file_system_type *, int,
117 const char *, void *, struct vfsmount *); 117 const char *, void *, struct vfsmount *);
118 void (*kill_sb) (struct super_block *); 118 void (*kill_sb) (struct super_block *);
119 struct module *owner; 119 struct module *owner;
120 struct file_system_type * next; 120 struct file_system_type * next;
121 struct list_head fs_supers; 121 struct list_head fs_supers;
122 }; 122 };
123 123
124 name: the name of the filesystem type, such as "ext2", "iso9660", 124 name: the name of the filesystem type, such as "ext2", "iso9660",
125 "msdos" and so on 125 "msdos" and so on
126 126
127 fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE) 127 fs_flags: various flags (e.g. FS_REQUIRES_DEV, FS_NO_DCACHE)
128 128
129 get_sb: the method to call when a new instance of this 129 get_sb: the method to call when a new instance of this
130 filesystem should be mounted 130 filesystem should be mounted
131 131
132 kill_sb: the method to call when an instance of this filesystem 132 kill_sb: the method to call when an instance of this filesystem
133 should be unmounted 133 should be unmounted
134 134
135 owner: for internal VFS use: you should initialize this to THIS_MODULE in 135 owner: for internal VFS use: you should initialize this to THIS_MODULE in
136 most cases. 136 most cases.
137 137
138 next: for internal VFS use: you should initialize this to NULL 138 next: for internal VFS use: you should initialize this to NULL
139 139
140 The get_sb() method has the following arguments: 140 The get_sb() method has the following arguments:
141 141
142 struct super_block *sb: the superblock structure. This is partially 142 struct super_block *sb: the superblock structure. This is partially
143 initialized by the VFS and the rest must be initialized by the 143 initialized by the VFS and the rest must be initialized by the
144 get_sb() method 144 get_sb() method
145 145
146 int flags: mount flags 146 int flags: mount flags
147 147
148 const char *dev_name: the device name we are mounting. 148 const char *dev_name: the device name we are mounting.
149 149
150 void *data: arbitrary mount options, usually comes as an ASCII 150 void *data: arbitrary mount options, usually comes as an ASCII
151 string 151 string
152 152
153 int silent: whether or not to be silent on error 153 int silent: whether or not to be silent on error
154 154
155 The get_sb() method must determine if the block device specified 155 The get_sb() method must determine if the block device specified
156 in the superblock contains a filesystem of the type the method 156 in the superblock contains a filesystem of the type the method
157 supports. On success the method returns the superblock pointer, on 157 supports. On success the method returns the superblock pointer, on
158 failure it returns NULL. 158 failure it returns NULL.
159 159
160 The most interesting member of the superblock structure that the 160 The most interesting member of the superblock structure that the
161 get_sb() method fills in is the "s_op" field. This is a pointer to 161 get_sb() method fills in is the "s_op" field. This is a pointer to
162 a "struct super_operations" which describes the next level of the 162 a "struct super_operations" which describes the next level of the
163 filesystem implementation. 163 filesystem implementation.
164 164
165 Usually, a filesystem uses one of the generic get_sb() implementations 165 Usually, a filesystem uses one of the generic get_sb() implementations
166 and provides a fill_super() method instead. The generic methods are: 166 and provides a fill_super() method instead. The generic methods are:
167 167
168 get_sb_bdev: mount a filesystem residing on a block device 168 get_sb_bdev: mount a filesystem residing on a block device
169 169
170 get_sb_nodev: mount a filesystem that is not backed by a device 170 get_sb_nodev: mount a filesystem that is not backed by a device
171 171
172 get_sb_single: mount a filesystem which shares the instance between 172 get_sb_single: mount a filesystem which shares the instance between
173 all mounts 173 all mounts
174 174
175 A fill_super() method implementation has the following arguments: 175 A fill_super() method implementation has the following arguments:
176 176
177 struct super_block *sb: the superblock structure. The method fill_super() 177 struct super_block *sb: the superblock structure. The method fill_super()
178 must initialize this properly. 178 must initialize this properly.
179 179
180 void *data: arbitrary mount options, usually comes as an ASCII 180 void *data: arbitrary mount options, usually comes as an ASCII
181 string 181 string
182 182
183 int silent: whether or not to be silent on error 183 int silent: whether or not to be silent on error
184 184
185 185
186 The Superblock Object 186 The Superblock Object
187 ===================== 187 =====================
188 188
189 A superblock object represents a mounted filesystem. 189 A superblock object represents a mounted filesystem.
190 190
191 191
192 struct super_operations 192 struct super_operations
193 ----------------------- 193 -----------------------
194 194
195 This describes how the VFS can manipulate the superblock of your 195 This describes how the VFS can manipulate the superblock of your
196 filesystem. As of kernel 2.6.13, the following members are defined: 196 filesystem. As of kernel 2.6.13, the following members are defined:
197 197
198 struct super_operations { 198 struct super_operations {
199 struct inode *(*alloc_inode)(struct super_block *sb); 199 struct inode *(*alloc_inode)(struct super_block *sb);
200 void (*destroy_inode)(struct inode *); 200 void (*destroy_inode)(struct inode *);
201 201
202 void (*read_inode) (struct inode *); 202 void (*read_inode) (struct inode *);
203 203
204 void (*dirty_inode) (struct inode *); 204 void (*dirty_inode) (struct inode *);
205 int (*write_inode) (struct inode *, int); 205 int (*write_inode) (struct inode *, int);
206 void (*put_inode) (struct inode *); 206 void (*put_inode) (struct inode *);
207 void (*drop_inode) (struct inode *); 207 void (*drop_inode) (struct inode *);
208 void (*delete_inode) (struct inode *); 208 void (*delete_inode) (struct inode *);
209 void (*put_super) (struct super_block *); 209 void (*put_super) (struct super_block *);
210 void (*write_super) (struct super_block *); 210 void (*write_super) (struct super_block *);
211 int (*sync_fs)(struct super_block *sb, int wait); 211 int (*sync_fs)(struct super_block *sb, int wait);
212 void (*write_super_lockfs) (struct super_block *); 212 void (*write_super_lockfs) (struct super_block *);
213 void (*unlockfs) (struct super_block *); 213 void (*unlockfs) (struct super_block *);
214 int (*statfs) (struct dentry *, struct kstatfs *); 214 int (*statfs) (struct dentry *, struct kstatfs *);
215 int (*remount_fs) (struct super_block *, int *, char *); 215 int (*remount_fs) (struct super_block *, int *, char *);
216 void (*clear_inode) (struct inode *); 216 void (*clear_inode) (struct inode *);
217 void (*umount_begin) (struct super_block *); 217 void (*umount_begin) (struct super_block *);
218 218
219 void (*sync_inodes) (struct super_block *sb, 219 void (*sync_inodes) (struct super_block *sb,
220 struct writeback_control *wbc); 220 struct writeback_control *wbc);
221 int (*show_options)(struct seq_file *, struct vfsmount *); 221 int (*show_options)(struct seq_file *, struct vfsmount *);
222 222
223 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); 223 ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
224 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); 224 ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
225 }; 225 };
226 226
227 All methods are called without any locks being held, unless otherwise 227 All methods are called without any locks being held, unless otherwise
228 noted. This means that most methods can block safely. All methods are 228 noted. This means that most methods can block safely. All methods are
229 only called from a process context (i.e. not from an interrupt handler 229 only called from a process context (i.e. not from an interrupt handler
230 or bottom half). 230 or bottom half).
231 231
232 alloc_inode: this method is called by inode_alloc() to allocate memory 232 alloc_inode: this method is called by inode_alloc() to allocate memory
233 for struct inode and initialize it. If this function is not 233 for struct inode and initialize it. If this function is not
234 defined, a simple 'struct inode' is allocated. Normally 234 defined, a simple 'struct inode' is allocated. Normally
235 alloc_inode will be used to allocate a larger structure which 235 alloc_inode will be used to allocate a larger structure which
236 contains a 'struct inode' embedded within it. 236 contains a 'struct inode' embedded within it.
237 237
238 destroy_inode: this method is called by destroy_inode() to release 238 destroy_inode: this method is called by destroy_inode() to release
239 resources allocated for struct inode. It is only required if 239 resources allocated for struct inode. It is only required if
240 ->alloc_inode was defined and simply undoes anything done by 240 ->alloc_inode was defined and simply undoes anything done by
241 ->alloc_inode. 241 ->alloc_inode.
242 242
243 read_inode: this method is called to read a specific inode from the 243 read_inode: this method is called to read a specific inode from the
244 mounted filesystem. The i_ino member in the struct inode is 244 mounted filesystem. The i_ino member in the struct inode is
245 initialized by the VFS to indicate which inode to read. Other 245 initialized by the VFS to indicate which inode to read. Other
246 members are filled in by this method. 246 members are filled in by this method.
247 247
248 You can set this to NULL and use iget5_locked() instead of iget() 248 You can set this to NULL and use iget5_locked() instead of iget()
249 to read inodes. This is necessary for filesystems for which the 249 to read inodes. This is necessary for filesystems for which the
250 inode number is not sufficient to identify an inode. 250 inode number is not sufficient to identify an inode.
251 251
252 dirty_inode: this method is called by the VFS to mark an inode dirty. 252 dirty_inode: this method is called by the VFS to mark an inode dirty.
253 253
254 write_inode: this method is called when the VFS needs to write an 254 write_inode: this method is called when the VFS needs to write an
255 inode to disc. The second parameter indicates whether the write 255 inode to disc. The second parameter indicates whether the write
256 should be synchronous or not; not all filesystems check this flag. 256 should be synchronous or not; not all filesystems check this flag.
257 257
258 put_inode: called when the VFS inode is removed from the inode 258 put_inode: called when the VFS inode is removed from the inode
259 cache. 259 cache.
260 260
261 drop_inode: called when the last access to the inode is dropped, 261 drop_inode: called when the last access to the inode is dropped,
262 with the inode_lock spinlock held. 262 with the inode_lock spinlock held.
263 263
264 This method should be either NULL (normal UNIX filesystem 264 This method should be either NULL (normal UNIX filesystem
265 semantics) or "generic_delete_inode" (for filesystems that do not 265 semantics) or "generic_delete_inode" (for filesystems that do not
266 want to cache inodes - causing "delete_inode" to always be 266 want to cache inodes - causing "delete_inode" to always be
267 called regardless of the value of i_nlink) 267 called regardless of the value of i_nlink)
268 268
269 The "generic_delete_inode()" behavior is equivalent to the 269 The "generic_delete_inode()" behavior is equivalent to the
270 old practice of using "force_delete" in the put_inode() case, 270 old practice of using "force_delete" in the put_inode() case,
271 but does not have the races that the "force_delete()" approach 271 but does not have the races that the "force_delete()" approach
272 had. 272 had.
273 273
274 delete_inode: called when the VFS wants to delete an inode 274 delete_inode: called when the VFS wants to delete an inode
275 275
276 put_super: called when the VFS wishes to free the superblock 276 put_super: called when the VFS wishes to free the superblock
277 (i.e. unmount). This is called with the superblock lock held 277 (i.e. unmount). This is called with the superblock lock held
278 278
279 write_super: called when the VFS superblock needs to be written to 279 write_super: called when the VFS superblock needs to be written to
280 disc. This method is optional 280 disc. This method is optional
281 281
282 sync_fs: called when VFS is writing out all dirty data associated with 282 sync_fs: called when VFS is writing out all dirty data associated with
283 a superblock. The second parameter indicates whether the method 283 a superblock. The second parameter indicates whether the method
284 should wait until the write out has been completed. Optional. 284 should wait until the write out has been completed. Optional.
285 285
286 write_super_lockfs: called when VFS is locking a filesystem and 286 write_super_lockfs: called when VFS is locking a filesystem and
287 forcing it into a consistent state. This method is currently 287 forcing it into a consistent state. This method is currently
288 used by the Logical Volume Manager (LVM). 288 used by the Logical Volume Manager (LVM).
289 289
290 unlockfs: called when VFS is unlocking a filesystem and making it writable 290 unlockfs: called when VFS is unlocking a filesystem and making it writable
291 again. 291 again.
292 292
293 statfs: called when the VFS needs to get filesystem statistics. This 293 statfs: called when the VFS needs to get filesystem statistics. This
294 is called with the kernel lock held 294 is called with the kernel lock held
295 295
296 remount_fs: called when the filesystem is remounted. This is called 296 remount_fs: called when the filesystem is remounted. This is called
297 with the kernel lock held 297 with the kernel lock held
298 298
299 clear_inode: called when the VFS clears the inode. Optional 299 clear_inode: called when the VFS clears the inode. Optional
300 300
301 umount_begin: called when the VFS is unmounting a filesystem. 301 umount_begin: called when the VFS is unmounting a filesystem.
302 302
303 sync_inodes: called when the VFS is writing out dirty data associated with 303 sync_inodes: called when the VFS is writing out dirty data associated with
304 a superblock. 304 a superblock.
305 305
306 show_options: called by the VFS to show mount options for /proc/<pid>/mounts. 306 show_options: called by the VFS to show mount options for /proc/<pid>/mounts.
307 307
308 quota_read: called by the VFS to read from filesystem quota file. 308 quota_read: called by the VFS to read from filesystem quota file.
309 309
310 quota_write: called by the VFS to write to filesystem quota file. 310 quota_write: called by the VFS to write to filesystem quota file.
311 311
312 The read_inode() method is responsible for filling in the "i_op" 312 The read_inode() method is responsible for filling in the "i_op"
313 field. This is a pointer to a "struct inode_operations" which 313 field. This is a pointer to a "struct inode_operations" which
314 describes the methods that can be performed on individual inodes. 314 describes the methods that can be performed on individual inodes.
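
The alloc_inode and destroy_inode descriptions above rely on embedding a
'struct inode' inside a larger, filesystem-private structure. The sketch
below illustrates that pattern in plain userspace C. It is not kernel code:
'struct inode' is a one-field stand-in, and 'myfs_inode_info', MYFS_I(),
myfs_alloc_inode() and myfs_destroy_inode() are hypothetical names chosen
for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode; only one field is modelled. */
struct inode {
	unsigned long i_ino;
};

/* Hypothetical per-filesystem inode with the VFS inode embedded in it. */
struct myfs_inode_info {
	unsigned int myfs_flags;	/* filesystem-private state */
	struct inode vfs_inode;		/* the part the VFS sees */
};

/* Recover the containing structure from a pointer to the embedded inode. */
static struct myfs_inode_info *MYFS_I(struct inode *inode)
{
	return (struct myfs_inode_info *)
		((char *)inode - offsetof(struct myfs_inode_info, vfs_inode));
}

/* Models ->alloc_inode: allocate the larger structure, hand back the inode. */
static struct inode *myfs_alloc_inode(void)
{
	struct myfs_inode_info *mi = calloc(1, sizeof(*mi));

	if (!mi)
		return NULL;
	return &mi->vfs_inode;
}

/* Models ->destroy_inode: undo exactly what myfs_alloc_inode() did. */
static void myfs_destroy_inode(struct inode *inode)
{
	free(MYFS_I(inode));
}
```

Because the VFS only ever holds the pointer to the embedded inode, the
filesystem needs a helper such as MYFS_I() to get back to its own data,
and destroy must free the outer structure rather than the inode pointer.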
315 315
316 316
317 The Inode Object 317 The Inode Object
318 ================ 318 ================
319 319
320 An inode object represents an object within the filesystem. 320 An inode object represents an object within the filesystem.
321 321
322 322
323 struct inode_operations 323 struct inode_operations
324 ----------------------- 324 -----------------------
325 325
326 This describes how the VFS can manipulate an inode in your 326 This describes how the VFS can manipulate an inode in your
327 filesystem. As of kernel 2.6.13, the following members are defined: 327 filesystem. As of kernel 2.6.13, the following members are defined:
328 328
329 struct inode_operations { 329 struct inode_operations {
330 int (*create) (struct inode *,struct dentry *,int, struct nameidata *); 330 int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
331 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); 331 struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
332 int (*link) (struct dentry *,struct inode *,struct dentry *); 332 int (*link) (struct dentry *,struct inode *,struct dentry *);
333 int (*unlink) (struct inode *,struct dentry *); 333 int (*unlink) (struct inode *,struct dentry *);
334 int (*symlink) (struct inode *,struct dentry *,const char *); 334 int (*symlink) (struct inode *,struct dentry *,const char *);
335 int (*mkdir) (struct inode *,struct dentry *,int); 335 int (*mkdir) (struct inode *,struct dentry *,int);
336 int (*rmdir) (struct inode *,struct dentry *); 336 int (*rmdir) (struct inode *,struct dentry *);
337 int (*mknod) (struct inode *,struct dentry *,int,dev_t); 337 int (*mknod) (struct inode *,struct dentry *,int,dev_t);
338 int (*rename) (struct inode *, struct dentry *, 338 int (*rename) (struct inode *, struct dentry *,
339 struct inode *, struct dentry *); 339 struct inode *, struct dentry *);
340 int (*readlink) (struct dentry *, char __user *,int); 340 int (*readlink) (struct dentry *, char __user *,int);
341 void * (*follow_link) (struct dentry *, struct nameidata *); 341 void * (*follow_link) (struct dentry *, struct nameidata *);
342 void (*put_link) (struct dentry *, struct nameidata *, void *); 342 void (*put_link) (struct dentry *, struct nameidata *, void *);
343 void (*truncate) (struct inode *); 343 void (*truncate) (struct inode *);
344 int (*permission) (struct inode *, int, struct nameidata *); 344 int (*permission) (struct inode *, int, struct nameidata *);
345 int (*setattr) (struct dentry *, struct iattr *); 345 int (*setattr) (struct dentry *, struct iattr *);
346 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); 346 int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
347 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); 347 int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
348 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); 348 ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
349 ssize_t (*listxattr) (struct dentry *, char *, size_t); 349 ssize_t (*listxattr) (struct dentry *, char *, size_t);
350 int (*removexattr) (struct dentry *, const char *); 350 int (*removexattr) (struct dentry *, const char *);
351 }; 351 };
352 352
353 Again, all methods are called without any locks being held, unless 353 Again, all methods are called without any locks being held, unless
354 otherwise noted. 354 otherwise noted.
355 355
356 create: called by the open(2) and creat(2) system calls. Only 356 create: called by the open(2) and creat(2) system calls. Only
357 required if you want to support regular files. The dentry you 357 required if you want to support regular files. The dentry you
358 get should not have an inode (i.e. it should be a negative 358 get should not have an inode (i.e. it should be a negative
359 dentry). Here you will probably call d_instantiate() with the 359 dentry). Here you will probably call d_instantiate() with the
360 dentry and the newly created inode 360 dentry and the newly created inode
361 361
362 lookup: called when the VFS needs to look up an inode in a parent 362 lookup: called when the VFS needs to look up an inode in a parent
363 directory. The name to look for is found in the dentry. This 363 directory. The name to look for is found in the dentry. This
364 method must call d_add() to insert the found inode into the 364 method must call d_add() to insert the found inode into the
365 dentry. The "i_count" field in the inode structure should be 365 dentry. The "i_count" field in the inode structure should be
366 incremented. If the named inode does not exist a NULL inode 366 incremented. If the named inode does not exist a NULL inode
367 should be inserted into the dentry (this is called a negative 367 should be inserted into the dentry (this is called a negative
368 dentry). Returning an error code from this routine must only 368 dentry). Returning an error code from this routine must only
369 be done on a real error, otherwise creating inodes with system 369 be done on a real error, otherwise creating inodes with system
370 calls like creat(2), mknod(2), mkdir(2) and so on will fail. 370 calls like creat(2), mknod(2), mkdir(2) and so on will fail.
371 If you wish to overload the dentry methods then you should 371 If you wish to overload the dentry methods then you should
372 initialise the "d_op" field in the dentry; this is a pointer 372 initialise the "d_op" field in the dentry; this is a pointer
373 to a struct "dentry_operations". 373 to a struct "dentry_operations".
374 This method is called with the directory inode semaphore held 374 This method is called with the directory inode semaphore held
375 375
376 link: called by the link(2) system call. Only required if you want 376 link: called by the link(2) system call. Only required if you want
377 to support hard links. You will probably need to call 377 to support hard links. You will probably need to call
378 d_instantiate() just as you would in the create() method 378 d_instantiate() just as you would in the create() method
379 379
380 unlink: called by the unlink(2) system call. Only required if you 380 unlink: called by the unlink(2) system call. Only required if you
381 want to support deleting inodes 381 want to support deleting inodes
382 382
383 symlink: called by the symlink(2) system call. Only required if you 383 symlink: called by the symlink(2) system call. Only required if you
384 want to support symlinks. You will probably need to call 384 want to support symlinks. You will probably need to call
385 d_instantiate() just as you would in the create() method 385 d_instantiate() just as you would in the create() method
386 386
387 mkdir: called by the mkdir(2) system call. Only required if you want 387 mkdir: called by the mkdir(2) system call. Only required if you want
388 to support creating subdirectories. You will probably need to 388 to support creating subdirectories. You will probably need to
389 call d_instantiate() just as you would in the create() method 389 call d_instantiate() just as you would in the create() method
390 390
391 rmdir: called by the rmdir(2) system call. Only required if you want 391 rmdir: called by the rmdir(2) system call. Only required if you want
392 to support deleting subdirectories 392 to support deleting subdirectories
393 393
394 mknod: called by the mknod(2) system call to create a device (char, 394 mknod: called by the mknod(2) system call to create a device (char,
395 block) inode or a named pipe (FIFO) or socket. Only required 395 block) inode or a named pipe (FIFO) or socket. Only required
396 if you want to support creating these types of inodes. You 396 if you want to support creating these types of inodes. You
397 will probably need to call d_instantiate() just as you would 397 will probably need to call d_instantiate() just as you would
398 in the create() method 398 in the create() method
399 399
400 rename: called by the rename(2) system call to rename the object to 400 rename: called by the rename(2) system call to rename the object to
401 have the parent and name given by the second inode and dentry. 401 have the parent and name given by the second inode and dentry.
402 402
403 readlink: called by the readlink(2) system call. Only required if 403 readlink: called by the readlink(2) system call. Only required if
404 you want to support reading symbolic links 404 you want to support reading symbolic links
405 405
406 follow_link: called by the VFS to follow a symbolic link to the 406 follow_link: called by the VFS to follow a symbolic link to the
407 inode it points to. Only required if you want to support 407 inode it points to. Only required if you want to support
408 symbolic links. This method returns a void pointer cookie 408 symbolic links. This method returns a void pointer cookie
409 that is passed to put_link(). 409 that is passed to put_link().
410 410
411 put_link: called by the VFS to release resources allocated by 411 put_link: called by the VFS to release resources allocated by
412 follow_link(). The cookie returned by follow_link() is passed 412 follow_link(). The cookie returned by follow_link() is passed
413 to to this method as the last parameter. It is used by 413 to this method as the last parameter. It is used by
414 filesystems such as NFS where page cache is not stable 414 filesystems such as NFS where page cache is not stable
415 (i.e. page that was installed when the symbolic link walk 415 (i.e. page that was installed when the symbolic link walk
416 started might not be in the page cache at the end of the 416 started might not be in the page cache at the end of the
417 walk). 417 walk).
418 418
419 truncate: called by the VFS to change the size of a file. The 419 truncate: called by the VFS to change the size of a file. The
420 i_size field of the inode is set to the desired size by the 420 i_size field of the inode is set to the desired size by the
421 VFS before this method is called. This method is called by 421 VFS before this method is called. This method is called by
422 the truncate(2) system call and related functionality. 422 the truncate(2) system call and related functionality.
423 423
424 permission: called by the VFS to check for access rights on a POSIX-like 424 permission: called by the VFS to check for access rights on a POSIX-like
425 filesystem. 425 filesystem.
426 426
427 setattr: called by the VFS to set attributes for a file. This method 427 setattr: called by the VFS to set attributes for a file. This method
428 is called by chmod(2) and related system calls. 428 is called by chmod(2) and related system calls.
429 429
430 getattr: called by the VFS to get attributes of a file. This method 430 getattr: called by the VFS to get attributes of a file. This method
431 is called by stat(2) and related system calls. 431 is called by stat(2) and related system calls.
432 432
433 setxattr: called by the VFS to set an extended attribute for a file. 433 setxattr: called by the VFS to set an extended attribute for a file.
434 An extended attribute is a name:value pair associated with an 434 An extended attribute is a name:value pair associated with an
435 inode. This method is called by the setxattr(2) system call. 435 inode. This method is called by the setxattr(2) system call.
436 436
437 getxattr: called by the VFS to retrieve the value of an extended 437 getxattr: called by the VFS to retrieve the value of an extended
438 attribute name. This method is called by the getxattr(2) system 438 attribute name. This method is called by the getxattr(2) system
439 call. 439 call.
440 440
441 listxattr: called by the VFS to list all extended attributes for a 441 listxattr: called by the VFS to list all extended attributes for a
442 given file. This method is called by listxattr(2) system call. 442 given file. This method is called by listxattr(2) system call.
443 443
444 removexattr: called by the VFS to remove an extended attribute from 444 removexattr: called by the VFS to remove an extended attribute from
445 a file. This method is called by removexattr(2) system call. 445 a file. This method is called by removexattr(2) system call.
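
The follow_link/put_link entries above describe a cookie hand-off: whatever
follow_link() returns is passed back to put_link() so that the resources
can be released. A minimal userspace sketch of that contract follows. The
types and names here ('struct nameidata' with a single field, 'link_target',
myfs_follow_link(), myfs_put_link(), 'live_cookies') are assumptions made
for illustration, and the signatures are simplified relative to the
prototypes listed in struct inode_operations.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct nameidata. */
struct nameidata {
	const char *link_target;	/* where the walk should continue */
};

static int live_cookies;	/* counts buffers not yet released */

/*
 * Models ->follow_link: copy the link target into a private buffer and
 * return that buffer as the opaque cookie.  This toy returns NULL on
 * allocation failure.
 */
static void *myfs_follow_link(const char *stored_target, struct nameidata *nd)
{
	size_t len = strlen(stored_target) + 1;
	char *buf = malloc(len);

	if (!buf)
		return NULL;
	memcpy(buf, stored_target, len);
	nd->link_target = buf;
	live_cookies++;
	return buf;
}

/* Models ->put_link: release whatever the cookie refers to. */
static void myfs_put_link(struct nameidata *nd, void *cookie)
{
	nd->link_target = NULL;
	free(cookie);
	live_cookies--;
}
```

The point of the cookie is that put_link() does not have to look the
buffer up again; it frees exactly the object that follow_link() handed
out, even if the filesystem's own caches have changed in the meantime.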
446 446
447 447
448 The Address Space Object 448 The Address Space Object
449 ======================== 449 ========================
450 450
451 The address space object is used to group and manage pages in the page 451 The address space object is used to group and manage pages in the page
452 cache. It can be used to keep track of the pages in a file (or 452 cache. It can be used to keep track of the pages in a file (or
453 anything else) and also track the mapping of sections of the file into 453 anything else) and also track the mapping of sections of the file into
454 process address spaces. 454 process address spaces.
455 455
456 There are a number of distinct yet related services that an 456 There are a number of distinct yet related services that an
457 address-space can provide. These include communicating memory 457 address-space can provide. These include communicating memory
458 pressure, page lookup by address, and keeping track of pages tagged as 458 pressure, page lookup by address, and keeping track of pages tagged as
459 Dirty or Writeback. 459 Dirty or Writeback.
460 460
461 The first can be used independently of the others. The VM can try to 461 The first can be used independently of the others. The VM can try to
462 either write dirty pages in order to clean them, or release clean 462 either write dirty pages in order to clean them, or release clean
463 pages in order to reuse them. To do this it can call the ->writepage 463 pages in order to reuse them. To do this it can call the ->writepage
464 method on dirty pages, and ->releasepage on clean pages with 464 method on dirty pages, and ->releasepage on clean pages with
465 PagePrivate set. Clean pages without PagePrivate and with no external 465 PagePrivate set. Clean pages without PagePrivate and with no external
466 references will be released without notice being given to the 466 references will be released without notice being given to the
467 address_space. 467 address_space.
468 468
469 To achieve this functionality, pages need to be placed on an LRU with 469 To achieve this functionality, pages need to be placed on an LRU with
470 lru_cache_add and mark_page_active needs to be called whenever the 470 lru_cache_add and mark_page_active needs to be called whenever the
471 page is used. 471 page is used.
472 472
473 Pages are normally kept in a radix tree indexed by ->index. This tree 473 Pages are normally kept in a radix tree indexed by ->index. This tree
474 maintains information about the PG_Dirty and PG_Writeback status of 474 maintains information about the PG_Dirty and PG_Writeback status of
475 each page, so that pages with either of these flags can be found 475 each page, so that pages with either of these flags can be found
476 quickly. 476 quickly.
477 477
478 The Dirty tag is primarily used by mpage_writepages - the default 478 The Dirty tag is primarily used by mpage_writepages - the default
479 ->writepages method. It uses the tag to find dirty pages to call 479 ->writepages method. It uses the tag to find dirty pages to call
480 ->writepage on. If mpage_writepages is not used (i.e. the address_space 480 ->writepage on. If mpage_writepages is not used (i.e. the address_space
481 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is 481 provides its own ->writepages), the PAGECACHE_TAG_DIRTY tag is
482 almost unused. write_inode_now and sync_inode do use it (through 482 almost unused. write_inode_now and sync_inode do use it (through
483 __sync_single_inode) to check if ->writepages has been successful in 483 __sync_single_inode) to check if ->writepages has been successful in
484 writing out the whole address_space. 484 writing out the whole address_space.
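
As a rough illustration of the paragraph above, the following userspace
sketch walks a small set of pages, picks only those tagged dirty, and hands
each one to ->writepage. It is a toy model, not mpage_writepages itself:
the dirty tag is a plain flag instead of a radix tree tag, and
'toy_writepage', 'toy_writepages' and the simplified 'struct page' and
'struct address_space' are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES 8

/* Toy page: the dirty tag is a plain flag rather than a radix tree tag. */
struct page {
	unsigned long index;
	int tagged_dirty;
	int written;
};

struct address_space {
	struct page pages[NR_PAGES];
	int (*writepage)(struct page *page);
};

/* Toy ->writepage: just record that the page was written. */
static int toy_writepage(struct page *page)
{
	page->written = 1;
	return 0;
}

/*
 * Visit only pages tagged dirty and pass each one to ->writepage.
 * Returns the number of pages handed over, or the first error
 * reported by ->writepage.
 */
static int toy_writepages(struct address_space *mapping)
{
	int done = 0;
	size_t i;

	for (i = 0; i < NR_PAGES; i++) {
		struct page *page = &mapping->pages[i];
		int err;

		if (!page->tagged_dirty)
			continue;
		page->tagged_dirty = 0;	/* cleared before ->writepage runs */
		err = mapping->writepage(page);
		if (err)
			return err;
		done++;
	}
	return done;
}
```

A second call to toy_writepages() on the same mapping finds nothing to do,
which mirrors how the tag lets callers check whether everything has been
written out.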
485 485
486 The Writeback tag is used by filemap*wait* and sync_page* functions, 486 The Writeback tag is used by filemap*wait* and sync_page* functions,
487 via wait_on_page_writeback_range, to wait for all writeback to 487 via wait_on_page_writeback_range, to wait for all writeback to
488 complete. While waiting ->sync_page (if defined) will be called on 488 complete. While waiting ->sync_page (if defined) will be called on
489 each page that is found to require writeback. 489 each page that is found to require writeback.
490 490
491 An address_space handler may attach extra information to a page, 491 An address_space handler may attach extra information to a page,
492 typically using the 'private' field in the 'struct page'. If such 492 typically using the 'private' field in the 'struct page'. If such
493 information is attached, the PG_Private flag should be set. This will 493 information is attached, the PG_Private flag should be set. This will
494 cause various VM routines to make extra calls into the address_space 494 cause various VM routines to make extra calls into the address_space
495 handler to deal with that data. 495 handler to deal with that data.
496 496
497 An address space acts as an intermediary between storage and 497 An address space acts as an intermediary between storage and
498 application. Data is read into the address space a whole page at a 498 application. Data is read into the address space a whole page at a
499 time, and provided to the application either by copying of the page, 499 time, and provided to the application either by copying of the page,
500 or by memory-mapping the page. 500 or by memory-mapping the page.
501 Data is written into the address space by the application, and then 501 Data is written into the address space by the application, and then
502 written back to storage, typically in whole pages; however, the 502 written back to storage, typically in whole pages; however, the
503 address_space has finer control of write sizes. 503 address_space has finer control of write sizes.
504 504
505 The read process essentially only requires 'readpage'. The write 505 The read process essentially only requires 'readpage'. The write
506 process is more complicated and uses prepare_write/commit_write or 506 process is more complicated and uses prepare_write/commit_write or
507 set_page_dirty to write data into the address_space, and writepage, 507 set_page_dirty to write data into the address_space, and writepage,
508 sync_page, and writepages to writeback data to storage. 508 sync_page, and writepages to writeback data to storage.
509 509
510 Adding and removing pages to/from an address_space is protected by the 510 Adding and removing pages to/from an address_space is protected by the
511 inode's i_mutex. 511 inode's i_mutex.
512 512
513 When data is written to a page, the PG_Dirty flag should be set. It 513 When data is written to a page, the PG_Dirty flag should be set. It
514 typically remains set until writepage asks for it to be written. This 514 typically remains set until writepage asks for it to be written. This
515 should clear PG_Dirty and set PG_Writeback. It can be actually 515 should clear PG_Dirty and set PG_Writeback. It can be actually
516 written at any point after PG_Dirty is clear. Once it is known to be 516 written at any point after PG_Dirty is clear. Once it is known to be
517 safe, PG_Writeback is cleared. 517 safe, PG_Writeback is cleared.
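
The flag transitions described above can be pictured as a small state
machine. The sketch below is a userspace model under that assumption; the
flag macros, the one-field 'struct page' and the toy_* helpers are
illustrative names and not the kernel's page-flag API.

```c
#include <assert.h>

#define PG_DIRTY	(1u << 0)	/* page holds data not yet on storage */
#define PG_WRITEBACK	(1u << 1)	/* write to storage is in flight */

struct page {
	unsigned int flags;
};

/* Data was written into the page. */
static void toy_mark_dirty(struct page *p)
{
	p->flags |= PG_DIRTY;
}

/* The writepage step: clear the dirty flag and set the writeback flag. */
static void toy_start_writeback(struct page *p)
{
	assert(p->flags & PG_DIRTY);
	p->flags &= ~PG_DIRTY;
	p->flags |= PG_WRITEBACK;
}

/* The data is known to be safe on storage. */
static void toy_end_writeback(struct page *p)
{
	assert(p->flags & PG_WRITEBACK);
	p->flags &= ~PG_WRITEBACK;
}
```

A page that is dirtied again while its writeback is still in flight would
carry both flags at once in this model, which is why the two are kept as
separate bits rather than a single state value.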
518 518
519 Writeback makes use of a writeback_control structure... 519 Writeback makes use of a writeback_control structure...
520 520
521 struct address_space_operations 521 struct address_space_operations
522 ------------------------------- 522 -------------------------------
523 523
524 This describes how the VFS can manipulate mapping of a file to page cache in 524 This describes how the VFS can manipulate mapping of a file to page cache in
525 your filesystem. As of kernel 2.6.16, the following members are defined: 525 your filesystem. As of kernel 2.6.16, the following members are defined:
526 526
527 struct address_space_operations { 527 struct address_space_operations {
528 int (*writepage)(struct page *page, struct writeback_control *wbc); 528 int (*writepage)(struct page *page, struct writeback_control *wbc);
529 int (*readpage)(struct file *, struct page *); 529 int (*readpage)(struct file *, struct page *);
530 int (*sync_page)(struct page *); 530 int (*sync_page)(struct page *);
531 int (*writepages)(struct address_space *, struct writeback_control *); 531 int (*writepages)(struct address_space *, struct writeback_control *);
532 int (*set_page_dirty)(struct page *page); 532 int (*set_page_dirty)(struct page *page);
533 int (*readpages)(struct file *filp, struct address_space *mapping, 533 int (*readpages)(struct file *filp, struct address_space *mapping,
534 struct list_head *pages, unsigned nr_pages); 534 struct list_head *pages, unsigned nr_pages);
535 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); 535 int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
536 int (*commit_write)(struct file *, struct page *, unsigned, unsigned); 536 int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
537 sector_t (*bmap)(struct address_space *, sector_t); 537 sector_t (*bmap)(struct address_space *, sector_t);
538 int (*invalidatepage) (struct page *, unsigned long); 538 int (*invalidatepage) (struct page *, unsigned long);
539 int (*releasepage) (struct page *, int); 539 int (*releasepage) (struct page *, int);
540 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, 540 ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
541 loff_t offset, unsigned long nr_segs); 541 loff_t offset, unsigned long nr_segs);
542 struct page* (*get_xip_page)(struct address_space *, sector_t, 542 struct page* (*get_xip_page)(struct address_space *, sector_t,
543 int); 543 int);
544 /* migrate the contents of a page to the specified target */ 544 /* migrate the contents of a page to the specified target */
545 int (*migratepage) (struct page *, struct page *); 545 int (*migratepage) (struct page *, struct page *);
546 }; 546 };
547 547
548 writepage: called by the VM to write a dirty page to backing store. 548 writepage: called by the VM to write a dirty page to backing store.
549 This may happen for data integrity reasons (i.e. 'sync'), or 549 This may happen for data integrity reasons (i.e. 'sync'), or
550 to free up memory (flush). The difference can be seen in 550 to free up memory (flush). The difference can be seen in
551 wbc->sync_mode. 551 wbc->sync_mode.
552 The PG_Dirty flag has been cleared and PageLocked is true. 552 The PG_Dirty flag has been cleared and PageLocked is true.
553 writepage should start writeout, should set PG_Writeback, 553 writepage should start writeout, should set PG_Writeback,
554 and should make sure the page is unlocked, either synchronously 554 and should make sure the page is unlocked, either synchronously
555 or asynchronously when the write operation completes. 555 or asynchronously when the write operation completes.
556 556
557 If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to 557 If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
558 try too hard if there are problems, and may choose to write out 558 try too hard if there are problems, and may choose to write out
559 other pages from the mapping if that is easier (e.g. due to 559 other pages from the mapping if that is easier (e.g. due to
560 internal dependencies). If it chooses not to start writeout, it 560 internal dependencies). If it chooses not to start writeout, it
561 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep 561 should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
562 calling ->writepage on that page. 562 calling ->writepage on that page.
563 563
564 See the file "Locking" for more details. 564 See the file "Locking" for more details.
565 565
566 readpage: called by the VM to read a page from backing store. 566 readpage: called by the VM to read a page from backing store.
567 The page will be Locked when readpage is called, and should be 567 The page will be Locked when readpage is called, and should be
568 unlocked and marked uptodate once the read completes. 568 unlocked and marked uptodate once the read completes.
569 If ->readpage discovers that it needs to unlock the page for 569 If ->readpage discovers that it needs to unlock the page for
570 some reason, it can do so, and then return AOP_TRUNCATED_PAGE. 570 some reason, it can do so, and then return AOP_TRUNCATED_PAGE.
571 In this case, the page will be relocated, relocked and if 571 In this case, the page will be relocated, relocked and if
572 that all succeeds, ->readpage will be called again. 572 that all succeeds, ->readpage will be called again.
573 573
574 sync_page: called by the VM to notify the backing store to perform all 574 sync_page: called by the VM to notify the backing store to perform all
575 queued I/O operations for a page. I/O operations for other pages 575 queued I/O operations for a page. I/O operations for other pages
576 associated with this address_space object may also be performed. 576 associated with this address_space object may also be performed.
577 577
578 This function is optional and is called only for pages with 578 This function is optional and is called only for pages with
579 PG_Writeback set while waiting for the writeback to complete. 579 PG_Writeback set while waiting for the writeback to complete.
580 580
581 writepages: called by the VM to write out pages associated with the 581 writepages: called by the VM to write out pages associated with the
582 address_space object. If wbc->sync_mode is WB_SYNC_ALL, then 582 address_space object. If wbc->sync_mode is WB_SYNC_ALL, then
583 the writeback_control will specify a range of pages that must be 583 the writeback_control will specify a range of pages that must be
584 written out. If it is WB_SYNC_NONE, then a nr_to_write is given 584 written out. If it is WB_SYNC_NONE, then a nr_to_write is given
585 and that many pages should be written if possible. 585 and that many pages should be written if possible.
586 If no ->writepages is given, then mpage_writepages is used 586 If no ->writepages is given, then mpage_writepages is used
587 instead. This will choose pages from the address space that are 587 instead. This will choose pages from the address space that are
588 tagged as DIRTY and will pass them to ->writepage. 588 tagged as DIRTY and will pass them to ->writepage.
589 589
590 set_page_dirty: called by the VM to set a page dirty. 590 set_page_dirty: called by the VM to set a page dirty.
591 This is particularly needed if an address space attaches 591 This is particularly needed if an address space attaches
592 private data to a page, and that data needs to be updated when 592 private data to a page, and that data needs to be updated when
593 a page is dirtied. This is called, for example, when a memory 593 a page is dirtied. This is called, for example, when a memory
594 mapped page gets modified. 594 mapped page gets modified.
595 If defined, it should set the PageDirty flag, and the 595 If defined, it should set the PageDirty flag, and the
596 PAGECACHE_TAG_DIRTY tag in the radix tree. 596 PAGECACHE_TAG_DIRTY tag in the radix tree.
597 597
598 readpages: called by the VM to read pages associated with the address_space 598 readpages: called by the VM to read pages associated with the address_space
599 object. This is essentially just a vector version of 599 object. This is essentially just a vector version of
600 readpage. Instead of just one page, several pages are 600 readpage. Instead of just one page, several pages are
601 requested. 601 requested.
602 readpages is only used for read-ahead, so read errors are 602 readpages is only used for read-ahead, so read errors are
603 ignored. If anything goes wrong, feel free to give up. 603 ignored. If anything goes wrong, feel free to give up.
604 604
605 prepare_write: called by the generic write path in VM to set up a write 605 prepare_write: called by the generic write path in VM to set up a write
606 request for a page. This indicates to the address space that 606 request for a page. This indicates to the address space that
607 the given range of bytes is about to be written. The 607 the given range of bytes is about to be written. The
608 address_space should check that the write will be able to 608 address_space should check that the write will be able to
609 complete, by allocating space if necessary and doing any other 609 complete, by allocating space if necessary and doing any other
610 internal housekeeping. If the write will update parts of 610 internal housekeeping. If the write will update parts of
611 any basic-blocks on storage, then those blocks should be 611 any basic-blocks on storage, then those blocks should be
612 pre-read (if they haven't been read already) so that the 612 pre-read (if they haven't been read already) so that the
613 updated blocks can be written out properly. 613 updated blocks can be written out properly.
614 The page will be locked. If prepare_write wants to unlock the 614 The page will be locked. If prepare_write wants to unlock the
615 page it, like readpage, may do so and return 615 page it, like readpage, may do so and return
616 AOP_TRUNCATED_PAGE. 616 AOP_TRUNCATED_PAGE.
617 In this case the prepare_write will be retried once the lock is 617 In this case the prepare_write will be retried once the lock is
618 regained. 618 regained.
619 619
620 commit_write: If prepare_write succeeds, new data will be copied 620 commit_write: If prepare_write succeeds, new data will be copied
621 into the page and then commit_write will be called. It will 621 into the page and then commit_write will be called. It will
622 typically update the size of the file (if appropriate) and 622 typically update the size of the file (if appropriate) and
623 mark the inode as dirty, and do any other related housekeeping 623 mark the inode as dirty, and do any other related housekeeping
624 operations. It should avoid returning an error if possible - 624 operations. It should avoid returning an error if possible -
625 errors should have been handled by prepare_write. 625 errors should have been handled by prepare_write.
626 626
627 bmap: called by the VFS to map a logical block offset within object to 627 bmap: called by the VFS to map a logical block offset within object to
628 physical block number. This method is used by the FIBMAP 628 physical block number. This method is used by the FIBMAP
629 ioctl and for working with swap-files. To be able to swap to 629 ioctl and for working with swap-files. To be able to swap to
630 a file, the file must have a stable mapping to a block 630 a file, the file must have a stable mapping to a block
631 device. The swap system does not go through the filesystem 631 device. The swap system does not go through the filesystem
632 but instead uses bmap to find out where the blocks in the file 632 but instead uses bmap to find out where the blocks in the file
633 are and uses those addresses directly. 633 are and uses those addresses directly.
634 634
635 635
636 invalidatepage: If a page has PagePrivate set, then invalidatepage 636 invalidatepage: If a page has PagePrivate set, then invalidatepage
637 will be called when part or all of the page is to be removed 637 will be called when part or all of the page is to be removed
638 from the address space. This generally corresponds to either a 638 from the address space. This generally corresponds to either a
639 truncation or a complete invalidation of the address space 639 truncation or a complete invalidation of the address space
640 (in the latter case 'offset' will always be 0). 640 (in the latter case 'offset' will always be 0).
641 Any private data associated with the page should be updated 641 Any private data associated with the page should be updated
642 to reflect this truncation. If offset is 0, then 642 to reflect this truncation. If offset is 0, then
643 the private data should be released, because the page 643 the private data should be released, because the page
644 must be able to be completely discarded. This may be done by 644 must be able to be completely discarded. This may be done by
645 calling the ->releasepage function, but in this case the 645 calling the ->releasepage function, but in this case the
646 release MUST succeed. 646 release MUST succeed.
647 647
648 releasepage: releasepage is called on PagePrivate pages to indicate 648 releasepage: releasepage is called on PagePrivate pages to indicate
649 that the page should be freed if possible. ->releasepage 649 that the page should be freed if possible. ->releasepage
650 should remove any private data from the page and clear the 650 should remove any private data from the page and clear the
651 PagePrivate flag. It may also remove the page from the 651 PagePrivate flag. It may also remove the page from the
652 address_space. If this fails for some reason, it may indicate 652 address_space. If this fails for some reason, it may indicate
653 failure with a 0 return value. 653 failure with a 0 return value.
654 This is used in two distinct though related cases. The first 654 This is used in two distinct though related cases. The first
655 is when the VM finds a clean page with no active users and 655 is when the VM finds a clean page with no active users and
656 wants to make it a free page. If ->releasepage succeeds, the 656 wants to make it a free page. If ->releasepage succeeds, the
657 page will be removed from the address_space and become free. 657 page will be removed from the address_space and become free.
658 658
659 The second case is when a request has been made to invalidate 659 The second case is when a request has been made to invalidate
660 some or all pages in an address_space. This can happen 660 some or all pages in an address_space. This can happen
661 through the fadvise(POSIX_FADV_DONTNEED) system call or by the 661 through the fadvise(POSIX_FADV_DONTNEED) system call or by the
662 filesystem explicitly requesting it as nfs and 9fs do (when 662 filesystem explicitly requesting it as nfs and 9fs do (when
663 they believe the cache may be out of date with storage) by 663 they believe the cache may be out of date with storage) by
664 calling invalidate_inode_pages2(). 664 calling invalidate_inode_pages2().
665 If the filesystem makes such a call, and needs to be certain 665 If the filesystem makes such a call, and needs to be certain
666 that all pages are invalidated, then its releasepage will 666 that all pages are invalidated, then its releasepage will
667 need to ensure this. Possibly it can clear the PageUptodate 667 need to ensure this. Possibly it can clear the PageUptodate
668 bit if it cannot free private data yet. 668 bit if it cannot free private data yet.
669 669
670 direct_IO: called by the generic read/write routines to perform 670 direct_IO: called by the generic read/write routines to perform
671 direct_IO - that is IO requests which bypass the page cache 671 direct_IO - that is IO requests which bypass the page cache
672 and transfer data directly between the storage and the 672 and transfer data directly between the storage and the
673 application's address space. 673 application's address space.
674 674
675 get_xip_page: called by the VM to translate a block number to a page. 675 get_xip_page: called by the VM to translate a block number to a page.
676 The page is valid until the corresponding filesystem is unmounted. 676 The page is valid until the corresponding filesystem is unmounted.
677 Filesystems that want to use execute-in-place (XIP) need to implement 677 Filesystems that want to use execute-in-place (XIP) need to implement
678 it. An example implementation can be found in fs/ext2/xip.c. 678 it. An example implementation can be found in fs/ext2/xip.c.
679 679
680 migrate_page: This is used to compact the physical memory usage. 680 migrate_page: This is used to compact the physical memory usage.
681 If the VM wants to relocate a page (maybe off a memory card 681 If the VM wants to relocate a page (maybe off a memory card
682 that is signalling imminent failure) it will pass a new page 682 that is signalling imminent failure) it will pass a new page
683 and an old page to this function. migrate_page should 683 and an old page to this function. migrate_page should
684 transfer any private data across and update any references 684 transfer any private data across and update any references
685 that it has to the page. 685 that it has to the page.
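
The AOP_TRUNCATED_PAGE retry rule described for ->readpage above can be
sketched in user space. This is a toy model, not kernel code: every toy_
name is made up for illustration, TOY_AOP_TRUNCATED_PAGE is an arbitrary
stand-in for the kernel constant, and the read is assumed to complete
synchronously (a real ->readpage usually finishes asynchronously).

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's AOP_TRUNCATED_PAGE; the value is arbitrary. */
#define TOY_AOP_TRUNCATED_PAGE 1

/* Toy page: only the lock and uptodate state are modelled. */
struct toy_page {
	bool locked;
	bool uptodate;
	int unlock_requests;	/* times readpage still wants to drop the lock */
};

/*
 * Toy ->readpage.  It is entered with the page locked.  If it had to
 * drop the lock it tells the caller so; otherwise it completes the
 * read, marks the page uptodate and unlocks it.
 */
int toy_readpage(struct toy_page *page)
{
	assert(page->locked);
	if (page->unlock_requests > 0) {
		page->unlock_requests--;
		page->locked = false;
		return TOY_AOP_TRUNCATED_PAGE;
	}
	page->uptodate = true;
	page->locked = false;
	return 0;
}

/*
 * Toy caller: relock the page and call ->readpage again for as long as
 * it asks for a retry.  Returns the number of ->readpage calls made.
 */
int toy_read_with_retry(struct toy_page *page)
{
	int calls = 0;
	int ret;

	do {
		page->locked = true;	/* "relocated, relocked" in the text */
		ret = toy_readpage(page);
		calls++;
	} while (ret == TOY_AOP_TRUNCATED_PAGE);

	return calls;
}
```

The same shape applies to prepare_write, which per the text may also
unlock the page and return AOP_TRUNCATED_PAGE to be retried.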
686 686
687 The File Object 687 The File Object
688 =============== 688 ===============
689 689
690 A file object represents a file opened by a process. 690 A file object represents a file opened by a process.
691 691
692 692
693 struct file_operations 693 struct file_operations
694 ---------------------- 694 ----------------------
695 695
696 This describes how the VFS can manipulate an open file. As of kernel 696 This describes how the VFS can manipulate an open file. As of kernel
697 2.6.17, the following members are defined: 697 2.6.17, the following members are defined:
698 698
699 struct file_operations { 699 struct file_operations {
700 loff_t (*llseek) (struct file *, loff_t, int); 700 loff_t (*llseek) (struct file *, loff_t, int);
701 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); 701 ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
702 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); 702 ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
703 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 703 ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
704 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); 704 ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
705 int (*readdir) (struct file *, void *, filldir_t); 705 int (*readdir) (struct file *, void *, filldir_t);
706 unsigned int (*poll) (struct file *, struct poll_table_struct *); 706 unsigned int (*poll) (struct file *, struct poll_table_struct *);
707 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); 707 int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
708 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); 708 long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
709 long (*compat_ioctl) (struct file *, unsigned int, unsigned long); 709 long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
710 int (*mmap) (struct file *, struct vm_area_struct *); 710 int (*mmap) (struct file *, struct vm_area_struct *);
711 int (*open) (struct inode *, struct file *); 711 int (*open) (struct inode *, struct file *);
712 int (*flush) (struct file *); 712 int (*flush) (struct file *);
713 int (*release) (struct inode *, struct file *); 713 int (*release) (struct inode *, struct file *);
714 int (*fsync) (struct file *, struct dentry *, int datasync); 714 int (*fsync) (struct file *, struct dentry *, int datasync);
715 int (*aio_fsync) (struct kiocb *, int datasync); 715 int (*aio_fsync) (struct kiocb *, int datasync);
716 int (*fasync) (int, struct file *, int); 716 int (*fasync) (int, struct file *, int);
717 int (*lock) (struct file *, int, struct file_lock *); 717 int (*lock) (struct file *, int, struct file_lock *);
718 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); 718 ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
719 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); 719 ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
720 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); 720 ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
721 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); 721 ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
722 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); 722 unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
723 int (*check_flags)(int); 723 int (*check_flags)(int);
724 int (*dir_notify)(struct file *filp, unsigned long arg); 724 int (*dir_notify)(struct file *filp, unsigned long arg);
725 int (*flock) (struct file *, int, struct file_lock *); 725 int (*flock) (struct file *, int, struct file_lock *);
726 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned 726 ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned
727 int); 727 int);
728 ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned 728 ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned
729 int); 729 int);
730 }; 730 };
731 731
732 Again, all methods are called without any locks being held, unless 732 Again, all methods are called without any locks being held, unless
733 otherwise noted. 733 otherwise noted.
734 734
735 llseek: called when the VFS needs to move the file position index 735 llseek: called when the VFS needs to move the file position index
736 736
737 read: called by read(2) and related system calls 737 read: called by read(2) and related system calls
738 738
739 aio_read: called by io_submit(2) and other asynchronous I/O operations 739 aio_read: called by io_submit(2) and other asynchronous I/O operations
740 740
741 write: called by write(2) and related system calls 741 write: called by write(2) and related system calls
742 742
743 aio_write: called by io_submit(2) and other asynchronous I/O operations 743 aio_write: called by io_submit(2) and other asynchronous I/O operations
744 744
745 readdir: called when the VFS needs to read the directory contents 745 readdir: called when the VFS needs to read the directory contents
746 746
747 poll: called by the VFS when a process wants to check if there is 747 poll: called by the VFS when a process wants to check if there is
748 activity on this file and (optionally) go to sleep until there 748 activity on this file and (optionally) go to sleep until there
749 is activity. Called by the select(2) and poll(2) system calls 749 is activity. Called by the select(2) and poll(2) system calls
750 750
751 ioctl: called by the ioctl(2) system call 751 ioctl: called by the ioctl(2) system call
752 752
753 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not 753 unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not
754 require the BKL should use this method instead of the ioctl() above. 754 require the BKL should use this method instead of the ioctl() above.
755 755
756 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls 756 compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
757 are used on 64 bit kernels. 757 are used on 64 bit kernels.
758 758
759 mmap: called by the mmap(2) system call 759 mmap: called by the mmap(2) system call
760 760
761 open: called by the VFS when an inode should be opened. When the VFS 761 open: called by the VFS when an inode should be opened. When the VFS
762 opens a file, it creates a new "struct file". It then calls the 762 opens a file, it creates a new "struct file". It then calls the
763 open method for the newly allocated file structure. You might 763 open method for the newly allocated file structure. You might
764 think that the open method really belongs in 764 think that the open method really belongs in
765 "struct inode_operations", and you may be right. I think it's 765 "struct inode_operations", and you may be right. I think it's
766 done the way it is because it makes filesystems simpler to 766 done the way it is because it makes filesystems simpler to
767 implement. The open() method is a good place to initialize the 767 implement. The open() method is a good place to initialize the
768 "private_data" member in the file structure if you want to point 768 "private_data" member in the file structure if you want to point
769 to a device structure 769 to a device structure
770 770
771 flush: called by the close(2) system call to flush a file 771 flush: called by the close(2) system call to flush a file
772 772
773 release: called when the last reference to an open file is closed 773 release: called when the last reference to an open file is closed
774 774
775 fsync: called by the fsync(2) system call 775 fsync: called by the fsync(2) system call
776 776
777 fasync: called by the fcntl(2) system call when asynchronous 777 fasync: called by the fcntl(2) system call when asynchronous
778 (non-blocking) mode is enabled for a file 778 (non-blocking) mode is enabled for a file
779 779
780 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW 780 lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW
781 commands 781 commands
782 782
783 readv: called by the readv(2) system call 783 readv: called by the readv(2) system call
784 784
785 writev: called by the writev(2) system call 785 writev: called by the writev(2) system call
786 786
787 sendfile: called by the sendfile(2) system call 787 sendfile: called by the sendfile(2) system call
788 788
789 get_unmapped_area: called by the mmap(2) system call 789 get_unmapped_area: called by the mmap(2) system call
790 790
791 check_flags: called by the fcntl(2) system call for F_SETFL command 791 check_flags: called by the fcntl(2) system call for F_SETFL command
792 792
793 dir_notify: called by the fcntl(2) system call for F_NOTIFY command 793 dir_notify: called by the fcntl(2) system call for F_NOTIFY command
794 794
795 flock: called by the flock(2) system call 795 flock: called by the flock(2) system call
796 796
797 splice_write: called by the VFS to splice data from a pipe to a file. This 797 splice_write: called by the VFS to splice data from a pipe to a file. This
798 method is used by the splice(2) system call 798 method is used by the splice(2) system call
799 799
800 splice_read: called by the VFS to splice data from file to a pipe. This 800 splice_read: called by the VFS to splice data from file to a pipe. This
801 method is used by the splice(2) system call 801 method is used by the splice(2) system call
802 802
803 Note that the file operations are implemented by the specific 803 Note that the file operations are implemented by the specific
804 filesystem in which the inode resides. When opening a device node 804 filesystem in which the inode resides. When opening a device node
805 (character or block special) most filesystems will call special 805 (character or block special) most filesystems will call special
806 support routines in the VFS which will locate the required device 806 support routines in the VFS which will locate the required device
807 driver information. These support routines replace the filesystem file 807 driver information. These support routines replace the filesystem file
808 operations with those for the device driver, and then proceed to call 808 operations with those for the device driver, and then proceed to call
809 the new open() method for the file. This is how opening a device file 809 the new open() method for the file. This is how opening a device file
810 in the filesystem eventually ends up calling the device driver open() 810 in the filesystem eventually ends up calling the device driver open()
811 method. 811 method.
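
The hand-over from the filesystem's file operations to the device
driver's can be illustrated with a small user-space model. It is only a
sketch under stated assumptions: the toy_ structures are cut-down,
hypothetical stand-ins for "struct file" and "struct file_operations",
keeping just the f_op and private_data members and the open() method.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;

/* Cut-down operations table: only open() is modelled. */
struct toy_file_operations {
	int (*open)(struct toy_file *file);
};

struct toy_file {
	const struct toy_file_operations *f_op;
	void *private_data;
};

/* Hypothetical per-device state a driver might hang off private_data. */
struct toy_device {
	int open_count;
};

struct toy_device toy_dev;

/* Driver open(): point private_data at the device structure. */
int toy_driver_open(struct toy_file *file)
{
	file->private_data = &toy_dev;
	toy_dev.open_count++;
	return 0;
}

const struct toy_file_operations toy_driver_fops = {
	.open = toy_driver_open,
};

/*
 * Support routine used for device nodes: replace the filesystem file
 * operations with the driver's, then call the new open() method.
 */
int toy_special_open(struct toy_file *file)
{
	file->f_op = &toy_driver_fops;
	if (file->f_op->open)
		return file->f_op->open(file);
	return 0;
}

const struct toy_file_operations toy_fs_special_fops = {
	.open = toy_special_open,
};

/* What the "VFS" does: call open() for the newly set up file structure. */
int toy_vfs_open(struct toy_file *file)
{
	return file->f_op->open ? file->f_op->open(file) : 0;
}
```

After toy_vfs_open() returns, the file's f_op points at the driver
table, so later operations on the file would go straight to the driver.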
812 812
813 813
814 Directory Entry Cache (dcache) 814 Directory Entry Cache (dcache)
815 ============================== 815 ==============================
816 816
817 817
818 struct dentry_operations 818 struct dentry_operations
819 ------------------------ 819 ------------------------
820 820
821 This describes how a filesystem can overload the standard dentry 821 This describes how a filesystem can overload the standard dentry
822 operations. Dentries and the dcache are the domain of the VFS and the 822 operations. Dentries and the dcache are the domain of the VFS and the
823 individual filesystem implementations. Device drivers have no business 823 individual filesystem implementations. Device drivers have no business
824 here. These methods may be set to NULL, as they are either optional or 824 here. These methods may be set to NULL, as they are either optional or
825 the VFS uses a default. As of kernel 2.6.13, the following members are 825 the VFS uses a default. As of kernel 2.6.13, the following members are
826 defined: 826 defined:
827 827
828 struct dentry_operations { 828 struct dentry_operations {
829 int (*d_revalidate)(struct dentry *, struct nameidata *); 829 int (*d_revalidate)(struct dentry *, struct nameidata *);
830 int (*d_hash) (struct dentry *, struct qstr *); 830 int (*d_hash) (struct dentry *, struct qstr *);
831 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); 831 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
832 int (*d_delete)(struct dentry *); 832 int (*d_delete)(struct dentry *);
833 void (*d_release)(struct dentry *); 833 void (*d_release)(struct dentry *);
834 void (*d_iput)(struct dentry *, struct inode *); 834 void (*d_iput)(struct dentry *, struct inode *);
835 }; 835 };
836 836
837 d_revalidate: called when the VFS needs to revalidate a dentry. This 837 d_revalidate: called when the VFS needs to revalidate a dentry. This
838 is called whenever a name look-up finds a dentry in the 838 is called whenever a name look-up finds a dentry in the
839 dcache. Most filesystems leave this as NULL, because all their 839 dcache. Most filesystems leave this as NULL, because all their
840 dentries in the dcache are valid 840 dentries in the dcache are valid
841 841
842 d_hash: called when the VFS adds a dentry to the hash table 842 d_hash: called when the VFS adds a dentry to the hash table
843 843
844 d_compare: called when a dentry should be compared with another 844 d_compare: called when a dentry should be compared with another
845 845
846 d_delete: called when the last reference to a dentry is 846 d_delete: called when the last reference to a dentry is
847 deleted. This means no-one is using the dentry, however it is 847 deleted. This means no-one is using the dentry, however it is
848 still valid and in the dcache 848 still valid and in the dcache
849 849
850 d_release: called when a dentry is really deallocated 850 d_release: called when a dentry is really deallocated
851 851
852 d_iput: called when a dentry loses its inode (just prior to its 852 d_iput: called when a dentry loses its inode (just prior to its
853 being deallocated). The default when this is NULL is that the 853 being deallocated). The default when this is NULL is that the
854 VFS calls iput(). If you define this method, you must call 854 VFS calls iput(). If you define this method, you must call
855 iput() yourself 855 iput() yourself
856 856
857 Each dentry has a pointer to its parent dentry, as well as a hash list 857 Each dentry has a pointer to its parent dentry, as well as a hash list
858 of child dentries. Child dentries are basically like files in a 858 of child dentries. Child dentries are basically like files in a
859 directory. 859 directory.
860 860
861 861
862 Directory Entry Cache API 862 Directory Entry Cache API
863 -------------------------- 863 --------------------------
864 864
865 There are a number of functions defined which permit a filesystem to 865 There are a number of functions defined which permit a filesystem to
866 manipulate dentries: 866 manipulate dentries:
867 867
868 dget: open a new handle for an existing dentry (this just increments 868 dget: open a new handle for an existing dentry (this just increments
869 the usage count) 869 the usage count)
870 870
871 dput: close a handle for a dentry (decrements the usage count). If 871 dput: close a handle for a dentry (decrements the usage count). If
872 the usage count drops to 0, the "d_delete" method is called 872 the usage count drops to 0, the "d_delete" method is called
873 and the dentry is placed on the unused list if the dentry is 873 and the dentry is placed on the unused list if the dentry is
874 still in its parent's hash list. Putting the dentry on the 874 still in its parent's hash list. Putting the dentry on the
875 unused list just means that if the system needs some RAM, it 875 unused list just means that if the system needs some RAM, it
876 goes through the unused list of dentries and deallocates them. 876 goes through the unused list of dentries and deallocates them.
877 If the dentry has already been unhashed and the usage count 877 If the dentry has already been unhashed and the usage count
878 drops to 0, in this case the dentry is deallocated after the 878 drops to 0, in this case the dentry is deallocated after the
879 "d_delete" method is called 879 "d_delete" method is called
880 880
881 d_drop: this unhashes a dentry from its parent's hash list. A 881 d_drop: this unhashes a dentry from its parent's hash list. A
882 subsequent call to dput() will deallocate the dentry if its 882 subsequent call to dput() will deallocate the dentry if its
883 usage count drops to 0 883 usage count drops to 0
884 884
885 d_delete: delete a dentry. If there are no other open references to 885 d_delete: delete a dentry. If there are no other open references to
886 the dentry then the dentry is turned into a negative dentry 886 the dentry then the dentry is turned into a negative dentry
887 (the d_iput() method is called). If there are other 887 (the d_iput() method is called). If there are other
888 references, then d_drop() is called instead 888 references, then d_drop() is called instead
889 889
890 d_add: add a dentry to its parent's hash list and then calls 890 d_add: add a dentry to its parent's hash list and then calls
891 d_instantiate() 891 d_instantiate()
892 892
893 d_instantiate: add a dentry to the alias hash list for the inode and 893 d_instantiate: add a dentry to the alias hash list for the inode and
894 updates the "d_inode" member. The "i_count" member in the 894 updates the "d_inode" member. The "i_count" member in the
895 inode structure should be set/incremented. If the inode 895 inode structure should be set/incremented. If the inode
896 pointer is NULL, the dentry is called a "negative 896 pointer is NULL, the dentry is called a "negative
897 dentry". This function is commonly called when an inode is 897 dentry". This function is commonly called when an inode is
898 created for an existing negative dentry 898 created for an existing negative dentry
899 899
900 d_lookup: look up a dentry given its parent and path name component 900 d_lookup: look up a dentry given its parent and path name component
901 It looks up the child of that given name from the dcache 901 It looks up the child of that given name from the dcache
902 hash table. If it is found, the reference count is incremented 902 hash table. If it is found, the reference count is incremented
903 and the dentry is returned. The caller must use dput() 903 and the dentry is returned. The caller must use dput()
904 to free the dentry when it finishes using it. 904 to free the dentry when it finishes using it.
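
The usage-count rules for dget, dput and d_drop listed above can be
modelled in a few lines of user-space C. This is a sketch only: the
toy_dentry structure and its fields are invented for illustration, and
locking, the inode and the real unused-list handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy dentry: just enough state to follow the rules in the text. */
struct toy_dentry {
	int d_count;		/* usage count */
	bool hashed;		/* still in the parent's hash list? */
	bool on_unused_list;	/* kept around until RAM is needed */
	bool deallocated;
	int d_delete_calls;	/* times the "d_delete" method ran */
};

/* dget: open a new handle, which just increments the usage count. */
struct toy_dentry *toy_dget(struct toy_dentry *d)
{
	d->d_count++;
	return d;
}

/* d_drop: unhash the dentry from its parent's hash list. */
void toy_d_drop(struct toy_dentry *d)
{
	d->hashed = false;
}

/*
 * dput: close a handle.  When the count reaches 0 the "d_delete"
 * method is called; a hashed dentry then goes on the unused list,
 * an unhashed one is deallocated.
 */
void toy_dput(struct toy_dentry *d)
{
	assert(d->d_count > 0);
	if (--d->d_count > 0)
		return;
	d->d_delete_calls++;
	if (d->hashed)
		d->on_unused_list = true;
	else
		d->deallocated = true;
}
```

A dentry returned by d_lookup already carries an extra reference, so in
this model its caller would finish with toy_dput().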
905 905
906 For further information on dentry locking, please refer to the document 906 For further information on dentry locking, please refer to the document
907 Documentation/filesystems/dentry-locking.txt. 907 Documentation/filesystems/dentry-locking.txt.
908 908
909 909
910 Resources 910 Resources
911 ========= 911 =========
912 912
913 (Note some of these resources are not up-to-date with the latest kernel 913 (Note some of these resources are not up-to-date with the latest kernel
914 version.) 914 version.)
915 915
916 Creating Linux virtual filesystems. 2002 916 Creating Linux virtual filesystems. 2002
917 <http://lwn.net/Articles/13325/> 917 <http://lwn.net/Articles/13325/>
918 918
919 The Linux Virtual File-system Layer by Neil Brown. 1999 919 The Linux Virtual File-system Layer by Neil Brown. 1999
920 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> 920 <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html>
921 921
922 A tour of the Linux VFS by Michael K. Johnson. 1996 922 A tour of the Linux VFS by Michael K. Johnson. 1996
923 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> 923 <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html>
924 924
925 A small trail through the Linux kernel by Andries Brouwer. 2001 925 A small trail through the Linux kernel by Andries Brouwer. 2001
926 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> 926 <http://www.win.tue.nl/~aeb/linux/vfs/trail.html>
927 927
Documentation/fujitsu/frv/mmu-layout.txt
1 ================================= 1 =================================
2 FR451 MMU LINUX MEMORY MANAGEMENT 2 FR451 MMU LINUX MEMORY MANAGEMENT
3 ================================= 3 =================================
4 4
5 ============ 5 ============
6 MMU HARDWARE 6 MMU HARDWARE
7 ============ 7 ============
8 8
9 FR451 MMU Linux puts the MMU into EDAT mode whilst running. This means that it uses both the SAT 9 FR451 MMU Linux puts the MMU into EDAT mode whilst running. This means that it uses both the SAT
10 registers and the DAT TLB to perform address translation. 10 registers and the DAT TLB to perform address translation.
11 11
12 There are 8 IAMLR/IAMPR register pairs and 16 DAMLR/DAMPR register pairs for SAT mode. 12 There are 8 IAMLR/IAMPR register pairs and 16 DAMLR/DAMPR register pairs for SAT mode.
13 13
14 In DAT mode, there is also a TLB organised in cache format as 64 lines x 2 ways. Each line spans a 14 In DAT mode, there is also a TLB organised in cache format as 64 lines x 2 ways. Each line spans a
15 16KB range of addresses, but can match a larger region. 15 16KB range of addresses, but can match a larger region.
16 16
17 17
18 =========================== 18 ===========================
19 MEMORY MANAGEMENT REGISTERS 19 MEMORY MANAGEMENT REGISTERS
20 =========================== 20 ===========================
21 21
22 Certain control registers are used by the kernel memory management routines: 22 Certain control registers are used by the kernel memory management routines:
23 23
24 REGISTERS USAGE 24 REGISTERS USAGE
25 ====================== ================================================== 25 ====================== ==================================================
26 IAMR0, DAMR0 Kernel image and data mappings 26 IAMR0, DAMR0 Kernel image and data mappings
27 IAMR1, DAMR1 First-chance TLB lookup mapping 27 IAMR1, DAMR1 First-chance TLB lookup mapping
28 DAMR2 Page attachment for cache flush by page 28 DAMR2 Page attachment for cache flush by page
29 DAMR3 Current PGD mapping 29 DAMR3 Current PGD mapping
30 SCR0, DAMR4 Instruction TLB PGE/PTD cache 30 SCR0, DAMR4 Instruction TLB PGE/PTD cache
31 SCR1, DAMR5 Data TLB PGE/PTD cache 31 SCR1, DAMR5 Data TLB PGE/PTD cache
32 DAMR6-10 kmap_atomic() mappings 32 DAMR6-10 kmap_atomic() mappings
33 DAMR11 I/O mapping 33 DAMR11 I/O mapping
34 CXNR mm_struct context ID 34 CXNR mm_struct context ID
35 TTBR Page directory (PGD) pointer (physical address) 35 TTBR Page directory (PGD) pointer (physical address)
36 36
37 37
38 ===================== 38 =====================
39 GENERAL MEMORY LAYOUT 39 GENERAL MEMORY LAYOUT
40 ===================== 40 =====================
41 41
42 The physical memory layout is as follows: 42 The physical memory layout is as follows:
43 43
44 PHYSICAL ADDRESS CONTROLLER DEVICE 44 PHYSICAL ADDRESS CONTROLLER DEVICE
45 =================== ============== ======================================= 45 =================== ============== =======================================
46 00000000 - BFFFFFFF SDRAM SDRAM area 46 00000000 - BFFFFFFF SDRAM SDRAM area
47 E0000000 - EFFFFFFF L-BUS CS2# VDK SLBUS/PCI window 47 E0000000 - EFFFFFFF L-BUS CS2# VDK SLBUS/PCI window
48 F0000000 - F0FFFFFF L-BUS CS5# MB93493 CSC area (DAV daughter board) 48 F0000000 - F0FFFFFF L-BUS CS5# MB93493 CSC area (DAV daughter board)
49 F1000000 - F1FFFFFF L-BUS CS7# (CB70 CPU-card PCMCIA port I/O space) 49 F1000000 - F1FFFFFF L-BUS CS7# (CB70 CPU-card PCMCIA port I/O space)
50 FC000000 - FC0FFFFF L-BUS CS1# VDK MB86943 config space 50 FC000000 - FC0FFFFF L-BUS CS1# VDK MB86943 config space
51 FC100000 - FC1FFFFF L-BUS CS6# DM9000 NIC I/O space 51 FC100000 - FC1FFFFF L-BUS CS6# DM9000 NIC I/O space
52 FC200000 - FC2FFFFF L-BUS CS3# MB93493 CSR area (DAV daughter board) 52 FC200000 - FC2FFFFF L-BUS CS3# MB93493 CSR area (DAV daughter board)
53 FD000000 - FDFFFFFF L-BUS CS4# (CB70 CPU-card extra flash space) 53 FD000000 - FDFFFFFF L-BUS CS4# (CB70 CPU-card extra flash space)
54 FE000000 - FEFFFFFF Internal CPU peripherals 54 FE000000 - FEFFFFFF Internal CPU peripherals
55 FF000000 - FF1FFFFF L-BUS CS0# Flash 1 55 FF000000 - FF1FFFFF L-BUS CS0# Flash 1
56 FF200000 - FF3FFFFF L-BUS CS0# Flash 2 56 FF200000 - FF3FFFFF L-BUS CS0# Flash 2
57 FFC00000 - FFC0001F L-BUS CS0# FPGA 57 FFC00000 - FFC0001F L-BUS CS0# FPGA
58 58
59 The virtual memory layout is: 59 The virtual memory layout is:
60 60
61 VIRTUAL ADDRESS PHYSICAL TRANSLATOR FLAGS SIZE OCCUPATION 61 VIRTUAL ADDRESS PHYSICAL TRANSLATOR FLAGS SIZE OCCUPATION
62 ================= ======== ============== ======= ======= =================================== 62 ================= ======== ============== ======= ======= ===================================
63 00004000-BFFFFFFF various TLB,xAMR1 D-N-??V 3GB Userspace 63 00004000-BFFFFFFF various TLB,xAMR1 D-N-??V 3GB Userspace
64 C0000000-CFFFFFFF 00000000 xAMPR0 -L-S--V 256MB Kernel image and data 64 C0000000-CFFFFFFF 00000000 xAMPR0 -L-S--V 256MB Kernel image and data
65 D0000000-D7FFFFFF various TLB,xAMR1 D-NS??V 128MB vmalloc area 65 D0000000-D7FFFFFF various TLB,xAMR1 D-NS??V 128MB vmalloc area
66 D8000000-DBFFFFFF various TLB,xAMR1 D-NS??V 64MB kmap() area 66 D8000000-DBFFFFFF various TLB,xAMR1 D-NS??V 64MB kmap() area
67 DC000000-DCFFFFFF various TLB 1MB Secondary kmap_atomic() frame 67 DC000000-DCFFFFFF various TLB 1MB Secondary kmap_atomic() frame
68 DD000000-DD27FFFF various DAMR 160KB Primary kmap_atomic() frame 68 DD000000-DD27FFFF various DAMR 160KB Primary kmap_atomic() frame
69 DD040000 DAMR2/IAMR2 -L-S--V page Page cache flush attachment point 69 DD040000 DAMR2/IAMR2 -L-S--V page Page cache flush attachment point
70 DD080000 DAMR3 -L-SC-V page Page Directory (PGD) 70 DD080000 DAMR3 -L-SC-V page Page Directory (PGD)
71 DD0C0000 DAMR4 -L-SC-V page Cached insn TLB Page Table lookup 71 DD0C0000 DAMR4 -L-SC-V page Cached insn TLB Page Table lookup
72 DD100000 DAMR5 -L-SC-V page Cached data TLB Page Table lookup 72 DD100000 DAMR5 -L-SC-V page Cached data TLB Page Table lookup
73 DD140000 DAMR6 -L-S--V page kmap_atomic(KM_BOUNCE_READ) 73 DD140000 DAMR6 -L-S--V page kmap_atomic(KM_BOUNCE_READ)
74 DD180000 DAMR7 -L-S--V page kmap_atomic(KM_SKB_SUNRPC_DATA) 74 DD180000 DAMR7 -L-S--V page kmap_atomic(KM_SKB_SUNRPC_DATA)
75 DD1C0000 DAMR8 -L-S--V page kmap_atomic(KM_SKB_DATA_SOFTIRQ) 75 DD1C0000 DAMR8 -L-S--V page kmap_atomic(KM_SKB_DATA_SOFTIRQ)
76 DD200000 DAMR9 -L-S--V page kmap_atomic(KM_USER0) 76 DD200000 DAMR9 -L-S--V page kmap_atomic(KM_USER0)
77 DD240000 DAMR10 -L-S--V page kmap_atomic(KM_USER1) 77 DD240000 DAMR10 -L-S--V page kmap_atomic(KM_USER1)
78 E0000000-FFFFFFFF E0000000 DAMR11 -L-SC-V 512MB I/O region 78 E0000000-FFFFFFFF E0000000 DAMR11 -L-SC-V 512MB I/O region
79 79
80 IAMPR1 and DAMPR1 are used as an extension to the TLB. 80 IAMPR1 and DAMPR1 are used as an extension to the TLB.
81 81
82 82
83 ==================== 83 ====================
84 KMAP AND KMAP_ATOMIC 84 KMAP AND KMAP_ATOMIC
85 ==================== 85 ====================
86 86
87 To access pages in the page cache (which may not be directly accessible if highmem is available), 87 To access pages in the page cache (which may not be directly accessible if highmem is available),
88 the kernel calls kmap(), does the access and then calls kunmap(); or it calls kmap_atomic(), does 88 the kernel calls kmap(), does the access and then calls kunmap(); or it calls kmap_atomic(), does
89 the access and then calls kunmap_atomic(). 89 the access and then calls kunmap_atomic().
90 90
91 kmap() creates an attachment between an arbitrary inaccessible page and a range of virtual 91 kmap() creates an attachment between an arbitrary inaccessible page and a range of virtual
92 addresses by installing a PTE in a special page table. The kernel can then access this page as it 92 addresses by installing a PTE in a special page table. The kernel can then access this page as it
93 wills. When it's finished, the kernel calls kunmap() to clear the PTE. 93 wills. When it's finished, the kernel calls kunmap() to clear the PTE.
94 94
95 kmap_atomic() does something slightly different. In the interests of speed, it chooses one of two 95 kmap_atomic() does something slightly different. In the interests of speed, it chooses one of two
96 strategies: 96 strategies:
97 97
98 (1) If possible, kmap_atomic() attaches the requested page to one of DAMPR6 through DAMPR10 98 (1) If possible, kmap_atomic() attaches the requested page to one of DAMPR6 through DAMPR10
99 register pairs; and the matching kunmap_atomic() clears the DAMPR. This makes high memory 99 register pairs; and the matching kunmap_atomic() clears the DAMPR. This makes high memory
100 support really fast as there's no need to flush the TLB or modify the page tables. The DAMLR 100 support really fast as there's no need to flush the TLB or modify the page tables. The DAMLR
101 registers being used for this are preset during boot and don't change over the lifetime of the 101 registers being used for this are preset during boot and don't change over the lifetime of the
102 process. There's a direct mapping between the first few kmap_atomic() types, DAMR number and 102 process. There's a direct mapping between the first few kmap_atomic() types, DAMR number and
103 virtual address slot. 103 virtual address slot.
104 104
105 However, there are more kmap_atomic() types defined than there are DAMR registers available, 105 However, there are more kmap_atomic() types defined than there are DAMR registers available,
106 so we fall back to: 106 so we fall back to:
107 107
108 (2) kmap_atomic() uses a slot in the secondary frame (determined by the type parameter), and then 108 (2) kmap_atomic() uses a slot in the secondary frame (determined by the type parameter), and then
109 locks an entry in the TLB to translate that slot to the specified page. The number of slots is 109 locks an entry in the TLB to translate that slot to the specified page. The number of slots is
110 obviously limited, and their positions are controlled such that each slot is matched by a 110 obviously limited, and their positions are controlled such that each slot is matched by a
111 different line in the TLB. kunmap_atomic() ejects the entry from the TLB. 111 different line in the TLB. kunmap_atomic() ejects the entry from the TLB.
112 112
113 Note that the first three kmap atomic types are really just declared as placeholders. The DAMPR 113 Note that the first three kmap atomic types are really just declared as placeholders. The DAMPR
114 registers involved are actually modified directly. 114 registers involved are actually modified directly.
115 115
116 Also note that kmap() itself may sleep, kmap_atomic() may never sleep and both always succeed; 116 Also note that kmap() itself may sleep, kmap_atomic() may never sleep and both always succeed;
117 furthermore, a driver using kmap() may sleep before calling kunmap(), but may not sleep before 117 furthermore, a driver using kmap() may sleep before calling kunmap(), but may not sleep before
118 calling kunmap_atomic() if it had previously called kmap_atomic(). 118 calling kunmap_atomic() if it had previously called kmap_atomic().
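
The direct mapping between DAMR number and virtual address slot can be read off the virtual memory layout table: DAMR2 sits at DD040000 and each following register is 0x40000 higher, up to DAMR10 at DD240000. The sketch below reproduces only that arithmetic. The names KMAP_ATOMIC_PRIMARY_BASE, DAMR_SLOT_STRIDE and damr_slot_va() are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define KMAP_ATOMIC_PRIMARY_BASE 0xDD000000u  /* start of the primary frame */
#define DAMR_SLOT_STRIDE         0x00040000u  /* spacing between table rows */

/* Virtual address of the page slot served by DAMRn, for n = 2..10. */
static uint32_t damr_slot_va(unsigned int damr)
{
    assert(damr >= 2 && damr <= 10);
    return KMAP_ATOMIC_PRIMARY_BASE + (damr - 1u) * DAMR_SLOT_STRIDE;
}
```

For example, damr_slot_va(9) gives 0xDD200000, the row the table lists for kmap_atomic(KM_USER0).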
119 119
120 120
121 =============================== 121 ===============================
122 USING MORE THAN 256MB OF MEMORY 122 USING MORE THAN 256MB OF MEMORY
123 =============================== 123 ===============================
124 124
125 The kernel cannot access more than 256MB of memory directly. The physical layout, however, permits 125 The kernel cannot access more than 256MB of memory directly. The physical layout, however, permits
126 up to 3GB of SDRAM (possibly 3.25GB) to be made available. By using CONFIG_HIGHMEM, the kernel can 126 up to 3GB of SDRAM (possibly 3.25GB) to be made available. By using CONFIG_HIGHMEM, the kernel can
127 allow userspace (by way of page tables) and itself (by way of kmap) to deal with the memory 127 allow userspace (by way of page tables) and itself (by way of kmap) to deal with the memory
128 allocation. 128 allocation.
129 129
130 External devices can, of course, still DMA to and from all of the SDRAM, even if the kernel can't 130 External devices can, of course, still DMA to and from all of the SDRAM, even if the kernel can't
131 see it directly. The kernel translates page references into real addresses for communicating to the 131 see it directly. The kernel translates page references into real addresses for communicating to the
132 devices. 132 devices.
133 133
134 134
135 =================== 135 ===================
136 PAGE TABLE TOPOLOGY 136 PAGE TABLE TOPOLOGY
137 =================== 137 ===================
138 138
139 The page tables are arranged in 2-layer format. There is a middle layer (PMD) that would be used in 139 The page tables are arranged in 2-layer format. There is a middle layer (PMD) that would be used in
140 3-layer format tables but that is folded into the top layer (PGD) and so consumes no extra memory 140 3-layer format tables but that is folded into the top layer (PGD) and so consumes no extra memory
141 or processing power. 141 or processing power.
142 142
143 +------+ PGD PMD 143 +------+ PGD PMD
144 | TTBR |--->+-------------------+ 144 | TTBR |--->+-------------------+
145 +------+ | | : STE | 145 +------+ | | : STE |
146 | PGE0 | PME0 : STE | 146 | PGE0 | PME0 : STE |
147 | | : STE | 147 | | : STE |
148 +-------------------+ Page Table 148 +-------------------+ Page Table
149 | | : STE -------------->+--------+ +0x0000 149 | | : STE -------------->+--------+ +0x0000
150 | PGE1 | PME0 : STE -----------+ | PTE0 | 150 | PGE1 | PME0 : STE -----------+ | PTE0 |
151 | | : STE -------+ | +--------+ 151 | | : STE -------+ | +--------+
152 +-------------------+ | | | PTE63 | 152 +-------------------+ | | | PTE63 |
153 | | : STE | | +-->+--------+ +0x0100 153 | | : STE | | +-->+--------+ +0x0100
154 | PGE2 | PME0 : STE | | | PTE64 | 154 | PGE2 | PME0 : STE | | | PTE64 |
155 | | : STE | | +--------+ 155 | | : STE | | +--------+
156 +-------------------+ | | PTE127 | 156 +-------------------+ | | PTE127 |
157 | | : STE | +------>+--------+ +0x0200 157 | | : STE | +------>+--------+ +0x0200
158 | PGE3 | PME0 : STE | | PTE128 | 158 | PGE3 | PME0 : STE | | PTE128 |
159 | | : STE | +--------+ 159 | | : STE | +--------+
160 +-------------------+ | PTE191 | 160 +-------------------+ | PTE191 |
161 +--------+ +0x0300 161 +--------+ +0x0300
162 162
163 Each Page Directory (PGD) is 16KB (page size) in size and is divided into 64 entries (PGEs). Each 163 Each Page Directory (PGD) is 16KB (page size) in size and is divided into 64 entries (PGEs). Each
164 PGE contains one Page Mid Directory (PMD). 164 PGE contains one Page Mid Directory (PMD).
165 165
166 Each PMD is 256 bytes in size and contains a single entry (PME). Each PME holds 64 FR451 MMU 166 Each PMD is 256 bytes in size and contains a single entry (PME). Each PME holds 64 FR451 MMU
167 segment table entries of 4 bytes apiece. Each PME "points to" a page table. In practice, each STE 167 segment table entries of 4 bytes apiece. Each PME "points to" a page table. In practice, each STE
168 points to a subset of the page table, the first to PT+0x0000, the second to PT+0x0100, the third to 168 points to a subset of the page table, the first to PT+0x0000, the second to PT+0x0100, the third to
169 PT+0x0200, and so on. 169 PT+0x0200, and so on.
170 170
171 Each PGE and PME covers 64MB of the total virtual address space. 171 Each PGE and PME covers 64MB of the total virtual address space.
172 172
173 Each Page Table (PTD) is 16KB (page size) in size, and is divided into 4096 entries (PTEs). Each 173 Each Page Table (PTD) is 16KB (page size) in size, and is divided into 4096 entries (PTEs). Each
174 entry can point to one 16KB page. In practice, each Linux page table is subdivided into 64 FR451 174 entry can point to one 16KB page. In practice, each Linux page table is subdivided into 64 FR451
175 MMU page tables. But they are all grouped together to make management easier, in particular rmap 175 MMU page tables. But they are all grouped together to make management easier, in particular rmap
176 support is then trivial. 176 support is then trivial.
177 177
178 Grouping page tables in this fashion makes PGE caching in SCR0/SCR1 more efficient because the 178 Grouping page tables in this fashion makes PGE caching in SCR0/SCR1 more efficient because the
179 coverage of the cached item is greater. 179 coverage of the cached item is greater.
180 180
181 Page tables for the vmalloc area are allocated at boot time and shared between all mm_structs. 181 Page tables for the vmalloc area are allocated at boot time and shared between all mm_structs.
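
The sizes above imply a fixed split of a 32-bit virtual address: 14 bits of page offset (16KB pages), 12 bits of PTE index (4096 PTEs per table) and 6 bits of PGE index (64 PGEs of 64MB each). The sketch below works that split out and also derives which STE covers the address, assuming each STE spans 64 PTEs (256 bytes of page table), as the PT+0x0100 steps suggest. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT     14u    /* 16KB pages */
#define PGE_SHIFT      26u    /* each PGE covers 64MB */
#define PTES_PER_TABLE 4096u  /* one 16KB page table, 4-byte PTEs */
#define PTES_PER_STE   64u    /* assumed: one STE per 256 bytes of PTEs */

struct va_split {
    uint32_t pge;            /* index into the PGD, 0..63 */
    uint32_t ste;            /* index of the STE within the PME, 0..63 */
    uint32_t pte;            /* index into the Linux page table, 0..4095 */
    uint32_t ste_pt_offset;  /* byte offset into the PT that the STE targets */
    uint32_t page_offset;    /* byte offset within the 16KB page */
};

static struct va_split split_va(uint32_t va)
{
    struct va_split s;

    s.pge = va >> PGE_SHIFT;
    s.pte = (va >> PAGE_SHIFT) & (PTES_PER_TABLE - 1u);
    s.ste = s.pte / PTES_PER_STE;
    s.ste_pt_offset = s.ste * PTES_PER_STE * 4u;
    s.page_offset = va & ((1u << PAGE_SHIFT) - 1u);
    assert(s.pge < 64u && s.ste < 64u);
    return s;
}
```

With this split, 0xD0000000 (the start of the vmalloc area) lands in PGE 52, which matches the position of the second populated PGE in the sample _pgd output later in this file.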
182 182
183 183
184 ================= 184 =================
185 USER SPACE LAYOUT 185 USER SPACE LAYOUT
186 ================= 186 =================
187 187
188 For MMU capable Linux, the regions userspace code is allowed to access are kept entirely separate 188 For MMU capable Linux, the regions userspace code is allowed to access are kept entirely separate
189 from those dedicated to the kernel: 189 from those dedicated to the kernel:
190 190
191 VIRTUAL ADDRESS SIZE PURPOSE 191 VIRTUAL ADDRESS SIZE PURPOSE
192 ================= ===== =================================== 192 ================= ===== ===================================
193 00000000-00003fff 16KB NULL pointer access trap 193 00000000-00003fff 16KB NULL pointer access trap
194 00004000-01ffffff ~32MB lower mmap space (grows up) 194 00004000-01ffffff ~32MB lower mmap space (grows up)
195 02000000-021fffff 2MB Stack space (grows down from top) 195 02000000-021fffff 2MB Stack space (grows down from top)
196 02200000-nnnnnnnn Executable mapping 196 02200000-nnnnnnnn Executable mapping
197 nnnnnnnn- brk space (grows up) 197 nnnnnnnn- brk space (grows up)
198 -bfffffff upper mmap space (grows down) 198 -bfffffff upper mmap space (grows down)
199 199
200 This is arranged so as to make the best use of the 16KB page tables and the way in which PGEs/PMEs 200 This is arranged so as to make the best use of the 16KB page tables and the way in which PGEs/PMEs
201 are cached by the TLB handler. The lower mmap space is filled first, and then the upper mmap space 201 are cached by the TLB handler. The lower mmap space is filled first, and then the upper mmap space
202 is filled. 202 is filled.
203 203
204 204
205 =============================== 205 ===============================
206 GDB-STUB MMU DEBUGGING SERVICES 206 GDB-STUB MMU DEBUGGING SERVICES
207 =============================== 207 ===============================
208 208
209 The gdb-stub included in this kernel provides a number of services to aid in the debugging of MMU 209 The gdb-stub included in this kernel provides a number of services to aid in the debugging of MMU
210 related kernel services: 210 related kernel services:
211 211
212 (*) Every time the kernel stops, certain state information is dumped into __debug_mmu. This 212 (*) Every time the kernel stops, certain state information is dumped into __debug_mmu. This
213 variable is defined in arch/frv/kernel/gdb-stub.c. Note that the gdbinit file in this 213 variable is defined in arch/frv/kernel/gdb-stub.c. Note that the gdbinit file in this
214 directory has some useful macros for dealing with this. 214 directory has some useful macros for dealing with this.
215 215
216 (*) __debug_mmu.tlb[] 216 (*) __debug_mmu.tlb[]
217 217
218 This receives the current TLB contents. This can be viewed with the _tlb GDB macro: 218 This receives the current TLB contents. This can be viewed with the _tlb GDB macro:
219 219
220 (gdb) _tlb 220 (gdb) _tlb
221 tlb[0x00]: 01000005 00718203 01000002 00718203 221 tlb[0x00]: 01000005 00718203 01000002 00718203
222 tlb[0x01]: 01004002 006d4201 01004005 006d4203 222 tlb[0x01]: 01004002 006d4201 01004005 006d4203
223 tlb[0x02]: 01008002 006d0201 01008006 00004200 223 tlb[0x02]: 01008002 006d0201 01008006 00004200
224 tlb[0x03]: 0100c006 007f4202 0100c002 0064c202 224 tlb[0x03]: 0100c006 007f4202 0100c002 0064c202
225 tlb[0x04]: 01110005 00774201 01110002 00774201 225 tlb[0x04]: 01110005 00774201 01110002 00774201
226 tlb[0x05]: 01114005 00770201 01114002 00770201 226 tlb[0x05]: 01114005 00770201 01114002 00770201
227 tlb[0x06]: 01118002 0076c201 01118005 0076c201 227 tlb[0x06]: 01118002 0076c201 01118005 0076c201
228 ... 228 ...
229 tlb[0x3d]: 010f4002 00790200 001f4002 0054ca02 229 tlb[0x3d]: 010f4002 00790200 001f4002 0054ca02
230 tlb[0x3e]: 010f8005 0078c201 010f8002 0078c201 230 tlb[0x3e]: 010f8005 0078c201 010f8002 0078c201
231 tlb[0x3f]: 001fc002 0056ca01 001fc005 00538a01 231 tlb[0x3f]: 001fc002 0056ca01 001fc005 00538a01
232 232
233 (*) __debug_mmu.iamr[] 233 (*) __debug_mmu.iamr[]
234 (*) __debug_mmu.damr[] 234 (*) __debug_mmu.damr[]
235 235
236 These receive the current IAMR and DAMR contents. These can be viewed with with the _amr 236 These receive the current IAMR and DAMR contents. These can be viewed with the _amr
237 GDB macro: 237 GDB macro:
238 238
239 (gdb) _amr 239 (gdb) _amr
240 AMRx DAMR IAMR 240 AMRx DAMR IAMR
241 ==== ===================== ===================== 241 ==== ===================== =====================
242 amr0 : L:c0000000 P:00000cb9 : L:c0000000 P:000004b9 242 amr0 : L:c0000000 P:00000cb9 : L:c0000000 P:000004b9
243 amr1 : L:01070005 P:006f9203 : L:0102c005 P:006a1201 243 amr1 : L:01070005 P:006f9203 : L:0102c005 P:006a1201
244 amr2 : L:d8d00000 P:00000000 : L:d8d00000 P:00000000 244 amr2 : L:d8d00000 P:00000000 : L:d8d00000 P:00000000
245 amr3 : L:d8d04000 P:00534c0d : L:00000000 P:00000000 245 amr3 : L:d8d04000 P:00534c0d : L:00000000 P:00000000
246 amr4 : L:d8d08000 P:00554c0d : L:00000000 P:00000000 246 amr4 : L:d8d08000 P:00554c0d : L:00000000 P:00000000
247 amr5 : L:d8d0c000 P:00554c0d : L:00000000 P:00000000 247 amr5 : L:d8d0c000 P:00554c0d : L:00000000 P:00000000
248 amr6 : L:d8d10000 P:00000000 : L:00000000 P:00000000 248 amr6 : L:d8d10000 P:00000000 : L:00000000 P:00000000
249 amr7 : L:d8d14000 P:00000000 : L:00000000 P:00000000 249 amr7 : L:d8d14000 P:00000000 : L:00000000 P:00000000
250 amr8 : L:d8d18000 P:00000000 250 amr8 : L:d8d18000 P:00000000
251 amr9 : L:d8d1c000 P:00000000 251 amr9 : L:d8d1c000 P:00000000
252 amr10: L:d8d20000 P:00000000 252 amr10: L:d8d20000 P:00000000
253 amr11: L:e0000000 P:e0000ccd 253 amr11: L:e0000000 P:e0000ccd
254 254
255 (*) The current task's page directory is bound to DAMR3. 255 (*) The current task's page directory is bound to DAMR3.
256 256
257 This can be viewed with the _pgd GDB macro: 257 This can be viewed with the _pgd GDB macro:
258 258
259 (gdb) _pgd 259 (gdb) _pgd
260 $3 = {{pge = {{ste = {0x554001, 0x554101, 0x554201, 0x554301, 0x554401, 260 $3 = {{pge = {{ste = {0x554001, 0x554101, 0x554201, 0x554301, 0x554401,
261 0x554501, 0x554601, 0x554701, 0x554801, 0x554901, 0x554a01, 261 0x554501, 0x554601, 0x554701, 0x554801, 0x554901, 0x554a01,
262 0x554b01, 0x554c01, 0x554d01, 0x554e01, 0x554f01, 0x555001, 262 0x554b01, 0x554c01, 0x554d01, 0x554e01, 0x554f01, 0x555001,
263 0x555101, 0x555201, 0x555301, 0x555401, 0x555501, 0x555601, 263 0x555101, 0x555201, 0x555301, 0x555401, 0x555501, 0x555601,
264 0x555701, 0x555801, 0x555901, 0x555a01, 0x555b01, 0x555c01, 264 0x555701, 0x555801, 0x555901, 0x555a01, 0x555b01, 0x555c01,
265 0x555d01, 0x555e01, 0x555f01, 0x556001, 0x556101, 0x556201, 265 0x555d01, 0x555e01, 0x555f01, 0x556001, 0x556101, 0x556201,
266 0x556301, 0x556401, 0x556501, 0x556601, 0x556701, 0x556801, 266 0x556301, 0x556401, 0x556501, 0x556601, 0x556701, 0x556801,
267 0x556901, 0x556a01, 0x556b01, 0x556c01, 0x556d01, 0x556e01, 267 0x556901, 0x556a01, 0x556b01, 0x556c01, 0x556d01, 0x556e01,
268 0x556f01, 0x557001, 0x557101, 0x557201, 0x557301, 0x557401, 268 0x556f01, 0x557001, 0x557101, 0x557201, 0x557301, 0x557401,
269 0x557501, 0x557601, 0x557701, 0x557801, 0x557901, 0x557a01, 269 0x557501, 0x557601, 0x557701, 0x557801, 0x557901, 0x557a01,
270 0x557b01, 0x557c01, 0x557d01, 0x557e01, 0x557f01}}}}, {pge = {{ 270 0x557b01, 0x557c01, 0x557d01, 0x557e01, 0x557f01}}}}, {pge = {{
271 ste = {0x0 <repeats 64 times>}}}} <repeats 51 times>, {pge = {{ste = { 271 ste = {0x0 <repeats 64 times>}}}} <repeats 51 times>, {pge = {{ste = {
272 0x248001, 0x248101, 0x248201, 0x248301, 0x248401, 0x248501, 272 0x248001, 0x248101, 0x248201, 0x248301, 0x248401, 0x248501,
273 0x248601, 0x248701, 0x248801, 0x248901, 0x248a01, 0x248b01, 273 0x248601, 0x248701, 0x248801, 0x248901, 0x248a01, 0x248b01,
274 0x248c01, 0x248d01, 0x248e01, 0x248f01, 0x249001, 0x249101, 274 0x248c01, 0x248d01, 0x248e01, 0x248f01, 0x249001, 0x249101,
275 0x249201, 0x249301, 0x249401, 0x249501, 0x249601, 0x249701, 275 0x249201, 0x249301, 0x249401, 0x249501, 0x249601, 0x249701,
276 0x249801, 0x249901, 0x249a01, 0x249b01, 0x249c01, 0x249d01, 276 0x249801, 0x249901, 0x249a01, 0x249b01, 0x249c01, 0x249d01,
277 0x249e01, 0x249f01, 0x24a001, 0x24a101, 0x24a201, 0x24a301, 277 0x249e01, 0x249f01, 0x24a001, 0x24a101, 0x24a201, 0x24a301,
278 0x24a401, 0x24a501, 0x24a601, 0x24a701, 0x24a801, 0x24a901, 278 0x24a401, 0x24a501, 0x24a601, 0x24a701, 0x24a801, 0x24a901,
279 0x24aa01, 0x24ab01, 0x24ac01, 0x24ad01, 0x24ae01, 0x24af01, 279 0x24aa01, 0x24ab01, 0x24ac01, 0x24ad01, 0x24ae01, 0x24af01,
280 0x24b001, 0x24b101, 0x24b201, 0x24b301, 0x24b401, 0x24b501, 280 0x24b001, 0x24b101, 0x24b201, 0x24b301, 0x24b401, 0x24b501,
281 0x24b601, 0x24b701, 0x24b801, 0x24b901, 0x24ba01, 0x24bb01, 281 0x24b601, 0x24b701, 0x24b801, 0x24b901, 0x24ba01, 0x24bb01,
282 0x24bc01, 0x24bd01, 0x24be01, 0x24bf01}}}}, {pge = {{ste = { 282 0x24bc01, 0x24bd01, 0x24be01, 0x24bf01}}}}, {pge = {{ste = {
283 0x0 <repeats 64 times>}}}} <repeats 11 times>} 283 0x0 <repeats 64 times>}}}} <repeats 11 times>}
284 284
285 (*) The PTD last used by the instruction TLB miss handler is attached to DAMR4. 285 (*) The PTD last used by the instruction TLB miss handler is attached to DAMR4.
286 (*) The PTD last used by the data TLB miss handler is attached to DAMR5. 286 (*) The PTD last used by the data TLB miss handler is attached to DAMR5.
287 287
288 These can be viewed with the _ptd_i and _ptd_d GDB macros: 288 These can be viewed with the _ptd_i and _ptd_d GDB macros:
289 289
290 (gdb) _ptd_d 290 (gdb) _ptd_d
291 $5 = {{pte = 0x0} <repeats 127 times>, {pte = 0x539b01}, { 291 $5 = {{pte = 0x0} <repeats 127 times>, {pte = 0x539b01}, {
292 pte = 0x0} <repeats 896 times>, {pte = 0x719303}, {pte = 0x6d5303}, { 292 pte = 0x0} <repeats 896 times>, {pte = 0x719303}, {pte = 0x6d5303}, {
293 pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, { 293 pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {
294 pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x6a1303}, { 294 pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x6a1303}, {
295 pte = 0x0} <repeats 12 times>, {pte = 0x709303}, {pte = 0x0}, {pte = 0x0}, 295 pte = 0x0} <repeats 12 times>, {pte = 0x709303}, {pte = 0x0}, {pte = 0x0},
296 {pte = 0x6fd303}, {pte = 0x6f9303}, {pte = 0x6f5303}, {pte = 0x0}, { 296 {pte = 0x6fd303}, {pte = 0x6f9303}, {pte = 0x6f5303}, {pte = 0x0}, {
297 pte = 0x6ed303}, {pte = 0x531b01}, {pte = 0x50db01}, { 297 pte = 0x6ed303}, {pte = 0x531b01}, {pte = 0x50db01}, {
298 pte = 0x0} <repeats 13 times>, {pte = 0x5303}, {pte = 0x7f5303}, { 298 pte = 0x0} <repeats 13 times>, {pte = 0x5303}, {pte = 0x7f5303}, {
299 pte = 0x509b01}, {pte = 0x505b01}, {pte = 0x7c9303}, {pte = 0x7b9303}, { 299 pte = 0x509b01}, {pte = 0x505b01}, {pte = 0x7c9303}, {pte = 0x7b9303}, {
300 pte = 0x7b5303}, {pte = 0x7b1303}, {pte = 0x7ad303}, {pte = 0x0}, { 300 pte = 0x7b5303}, {pte = 0x7b1303}, {pte = 0x7ad303}, {pte = 0x0}, {
301 pte = 0x0}, {pte = 0x7a1303}, {pte = 0x0}, {pte = 0x795303}, {pte = 0x0}, { 301 pte = 0x0}, {pte = 0x7a1303}, {pte = 0x0}, {pte = 0x795303}, {pte = 0x0}, {
302 pte = 0x78d303}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, { 302 pte = 0x78d303}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {pte = 0x0}, {
303 pte = 0x0}, {pte = 0x775303}, {pte = 0x771303}, {pte = 0x76d303}, { 303 pte = 0x0}, {pte = 0x775303}, {pte = 0x771303}, {pte = 0x76d303}, {
304 pte = 0x0}, {pte = 0x765303}, {pte = 0x7c5303}, {pte = 0x501b01}, { 304 pte = 0x0}, {pte = 0x765303}, {pte = 0x7c5303}, {pte = 0x501b01}, {
305 pte = 0x4f1b01}, {pte = 0x4edb01}, {pte = 0x0}, {pte = 0x4f9b01}, { 305 pte = 0x4f1b01}, {pte = 0x4edb01}, {pte = 0x0}, {pte = 0x4f9b01}, {
306 pte = 0x4fdb01}, {pte = 0x0} <repeats 2992 times>} 306 pte = 0x4fdb01}, {pte = 0x0} <repeats 2992 times>}
307 307
Documentation/ia64/efirtc.txt
1 EFI Real Time Clock driver 1 EFI Real Time Clock driver
2 ------------------------------- 2 -------------------------------
3 S. Eranian <eranian@hpl.hp.com> 3 S. Eranian <eranian@hpl.hp.com>
4 March 2000 4 March 2000
5 5
6 I/ Introduction 6 I/ Introduction
7 7
8 This document describes the efirtc.c driver provided for 8 This document describes the efirtc.c driver provided for
9 the IA-64 platform. 9 the IA-64 platform.
10 10
11 The purpose of this driver is to supply an API for kernel and user applications 11 The purpose of this driver is to supply an API for kernel and user applications
12 to get access to the Time Service offered by EFI version 0.92. 12 to get access to the Time Service offered by EFI version 0.92.
13 13
14 EFI provides 4 calls one can make once the OS is booted: GetTime(), 14 EFI provides 4 calls one can make once the OS is booted: GetTime(),
15 SetTime(), GetWakeupTime(), SetWakeupTime() which are all supported by this 15 SetTime(), GetWakeupTime(), SetWakeupTime() which are all supported by this
16 driver. We describe those calls as well as the design of the driver in the 16 driver. We describe those calls as well as the design of the driver in the
17 following sections. 17 following sections.
18 18
19 II/ Design Decisions 19 II/ Design Decisions
20 20
21 The original idea was to provide a very simple driver to get access to, 21 The original idea was to provide a very simple driver to get access to,
22 at first, the time of day service. This is required in order to access, in a 22 at first, the time of day service. This is required in order to access, in a
23 portable way, the CMOS clock. A program like /sbin/hwclock uses such a clock 23 portable way, the CMOS clock. A program like /sbin/hwclock uses such a clock
24 to initialize the system view of the time during boot. 24 to initialize the system view of the time during boot.
25 25
26 Because we wanted to minimize the impact on existing user-level apps using 26 Because we wanted to minimize the impact on existing user-level apps using
27 the CMOS clock, we decided to expose an API that was very similar to the one 27 the CMOS clock, we decided to expose an API that was very similar to the one
28 used today with the legacy RTC driver (drivers/char/rtc.c). However, because 28 used today with the legacy RTC driver (drivers/char/rtc.c). However, because
29 EFI provides simpler services, not all all ioctl()s are available. Also 29 EFI provides simpler services, not all ioctl()s are available. Also
30 new ioctl()s have been introduced for things that EFI provides but the 30 new ioctl()s have been introduced for things that EFI provides but the
31 legacy driver does not. 31 legacy driver does not.
32 32
33 EFI uses a slightly different way of representing the time; notably, 33 EFI uses a slightly different way of representing the time; notably,
34 the reference date is different. The year uses the full 4-digit format. 34 the reference date is different. The year uses the full 4-digit format.
35 The Epoch is January 1st 1998. For backward compatibility reasons we don't 35 The Epoch is January 1st 1998. For backward compatibility reasons we don't
36 expose this new way of representing time. Instead we use something very 36 expose this new way of representing time. Instead we use something very
37 similar to the struct tm, i.e. struct rtc_time, as used by hwclock. 37 similar to the struct tm, i.e. struct rtc_time, as used by hwclock.
38 One of the reasons for doing it this way is to allow for EFI to still evolve 38 One of the reasons for doing it this way is to allow for EFI to still evolve
39 without necessarily impacting any of the user applications. The decoupling 39 without necessarily impacting any of the user applications. The decoupling
40 enables flexibility and permits writing wrapper code in case things change. 40 enables flexibility and permits writing wrapper code in case things change.
41 41
42 The driver exposes two interfaces, one via the device file and a set of 42 The driver exposes two interfaces, one via the device file and a set of
43 ioctl()s. The other is read-only via the /proc filesystem. 43 ioctl()s. The other is read-only via the /proc filesystem.
44 44
45 As of today we don't offer a /proc/sys interface. 45 As of today we don't offer a /proc/sys interface.
46 46
47 To allow for a uniform interface between the legacy RTC and EFI time service, 47 To allow for a uniform interface between the legacy RTC and EFI time service,
48 we have created the include/linux/rtc.h header file to contain only the 48 we have created the include/linux/rtc.h header file to contain only the
49 "public" API of the two drivers. The specifics of the legacy RTC are still 49 "public" API of the two drivers. The specifics of the legacy RTC are still
50 in include/linux/mc146818rtc.h. 50 in include/linux/mc146818rtc.h.
51 51
52 52
53 III/ Time of day service 53 III/ Time of day service
54 54
55 This part of the driver gives access to the time of day service of EFI. 55 This part of the driver gives access to the time of day service of EFI.
56 Two ioctl()s, compatible with the legacy RTC calls: 56 Two ioctl()s, compatible with the legacy RTC calls:
57 57
58 Read the CMOS clock: ioctl(d, RTC_RD_TIME, &rtc); 58 Read the CMOS clock: ioctl(d, RTC_RD_TIME, &rtc);
59 59
60 Write the CMOS clock: ioctl(d, RTC_SET_TIME, &rtc); 60 Write the CMOS clock: ioctl(d, RTC_SET_TIME, &rtc);
61 61
62 The rtc is a pointer to a data structure defined in rtc.h which is close 62 The rtc is a pointer to a data structure defined in rtc.h which is close
63 to a struct tm: 63 to a struct tm:
64 64
65 struct rtc_time { 65 struct rtc_time {
66 int tm_sec; 66 int tm_sec;
67 int tm_min; 67 int tm_min;
68 int tm_hour; 68 int tm_hour;
69 int tm_mday; 69 int tm_mday;
70 int tm_mon; 70 int tm_mon;
71 int tm_year; 71 int tm_year;
72 int tm_wday; 72 int tm_wday;
73 int tm_yday; 73 int tm_yday;
74 int tm_isdst; 74 int tm_isdst;
75 }; 75 };
76 76
77 The driver takes care of converting back and forth between the EFI time and 77 The driver takes care of converting back and forth between the EFI time and
78 this format. 78 this format.
79 79
80 Those two ioctl()s can be exercised with the hwclock command: 80 Those two ioctl()s can be exercised with the hwclock command:
81 81
82 For reading: 82 For reading:
83 # /sbin/hwclock --show 83 # /sbin/hwclock --show
84 Mon Mar 6 15:32:32 2000 -0.910248 seconds 84 Mon Mar 6 15:32:32 2000 -0.910248 seconds
85 85
86 For setting: 86 For setting:
87 # /sbin/hwclock --systohc 87 # /sbin/hwclock --systohc
88 88
89 Root privileges are required to be able to set the time of day. 89 Root privileges are required to be able to set the time of day.
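A minimal C sketch of the RTC_RD_TIME call described above. The helper names are hypothetical, the device path is left to the caller because this section does not name it, and the formatter assumes struct rtc_time follows the struct tm conventions (tm_year counts from 1900, tm_mon from 0):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/rtc.h>

/* Hypothetical helper: read the clock through RTC_RD_TIME.
 * Returns 0 on success, -1 if the device cannot be opened or the
 * ioctl() fails. */
int rtc_read_time(const char *path, struct rtc_time *rtc)
{
	int d = open(path, O_RDONLY);
	int ret;

	if (d < 0)
		return -1;
	ret = ioctl(d, RTC_RD_TIME, rtc);
	close(d);
	return ret < 0 ? -1 : 0;
}

/* Hypothetical helper: format as "YYYY-MM-DD HH:MM:SS".
 * Assumes struct tm conventions for tm_year and tm_mon. */
int rtc_format_time(const struct rtc_time *rtc, char *buf, size_t len)
{
	return snprintf(buf, len, "%04d-%02d-%02d %02d:%02d:%02d",
			rtc->tm_year + 1900, rtc->tm_mon + 1, rtc->tm_mday,
			rtc->tm_hour, rtc->tm_min, rtc->tm_sec);
}
```

A caller would pass the RTC device node to rtc_read_time() and then hand the result to rtc_format_time(); writing the clock with RTC_SET_TIME follows the same pattern but needs root privileges.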
90 90
91 IV/ Wakeup Alarm service 91 IV/ Wakeup Alarm service
92 92
93 EFI provides an API by which one can program when a machine should wakeup, 93 EFI provides an API by which one can program when a machine should wakeup,
94 i.e. reboot. This is very different from the alarm provided by the legacy 94 i.e. reboot. This is very different from the alarm provided by the legacy
95 RTC which is some kind of interval timer alarm. For this reason we don't use 95 RTC which is some kind of interval timer alarm. For this reason we don't use
96 the same ioctl()s to get access to the service. Instead we have 96 the same ioctl()s to get access to the service. Instead we have
97 introduced 2 new ioctl()s to the interface of an RTC. 97 introduced 2 new ioctl()s to the interface of an RTC.
98 98
99 We have added 2 new ioctl()s that are specific to the EFI driver: 99 We have added 2 new ioctl()s that are specific to the EFI driver:
100 100
101 Read the current state of the alarm 101 Read the current state of the alarm
102 ioctl(d, RTC_WKALM_RD, &wkt) 102 ioctl(d, RTC_WKALM_RD, &wkt)
103 103
104 Set the alarm or change its status 104 Set the alarm or change its status
105 ioctl(d, RTC_WKALM_SET, &wkt) 105 ioctl(d, RTC_WKALM_SET, &wkt)
106 106
107 The wkt structure encapsulates a struct rtc_time + 2 extra fields to get 107 The wkt structure encapsulates a struct rtc_time + 2 extra fields to get
108 status information: 108 status information:
109 109
110 struct rtc_wkalrm { 110 struct rtc_wkalrm {
111 111
112 unsigned char enabled; /* =1 if alarm is enabled */ 112 unsigned char enabled; /* =1 if alarm is enabled */
113 unsigned char pending; /* =1 if alarm is pending */ 113 unsigned char pending; /* =1 if alarm is pending */
114 114
115 struct rtc_time time; 115 struct rtc_time time;
116 }; 116 };
117 117
118 As of today, none of the existing user-level apps supports this feature. 118 As of today, none of the existing user-level apps supports this feature.
119 However, writing such a program should not be hard by simply using those two 119 However, writing such a program should not be hard by simply using those two
120 ioctl()s. 120 ioctl()s.
121 121
122 Root privileges are required to be able to set the alarm. 122 Root privileges are required to be able to set the alarm.
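As a rough sketch of such a program, the fragment below wraps the two wakeup alarm requests, spelled RTC_WKALM_RD and RTC_WKALM_SET as in linux/rtc.h. The helper names are hypothetical, and the descriptor d is assumed to be an already-open RTC device:

```c
#include <sys/ioctl.h>
#include <linux/rtc.h>

/* Hypothetical helper: fetch the current alarm state. */
int wkalm_read(int d, struct rtc_wkalrm *wkt)
{
	return ioctl(d, RTC_WKALM_RD, wkt);
}

/* Hypothetical helper: enable the alarm for the given time.
 * Setting the alarm requires root privileges. */
int wkalm_arm(int d, const struct rtc_time *when)
{
	struct rtc_wkalrm wkt = { .enabled = 1, .pending = 0, .time = *when };

	return ioctl(d, RTC_WKALM_SET, &wkt);
}

/* Summarize the two status fields as a short string. */
const char *wkalm_state(const struct rtc_wkalrm *wkt)
{
	if (wkt->pending)
		return "pending";
	return wkt->enabled ? "enabled" : "disabled";
}
```

Both ioctl() wrappers return -1 on failure, so a caller can check the result before trusting the contents of the structure.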
123 123
124 V/ References. 124 V/ References.
125 125
126 Check out the following Web site for more information on EFI: 126 Check out the following Web site for more information on EFI:
127 127
128 http://developer.intel.com/technology/efi/ 128 http://developer.intel.com/technology/efi/
129 129
Documentation/ia64/mca.txt
1 An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel 1 An ad-hoc collection of notes on IA64 MCA and INIT processing. Feel
2 free to update it with notes about any area that is not clear. 2 free to update it with notes about any area that is not clear.
3 3
4 --- 4 ---
5 5
6 MCA/INIT are completely asynchronous. They can occur at any time, when 6 MCA/INIT are completely asynchronous. They can occur at any time, when
7 the OS is in any state. Including when one of the cpus is already 7 the OS is in any state. Including when one of the cpus is already
8 holding a spinlock. Trying to get any lock from MCA/INIT state is 8 holding a spinlock. Trying to get any lock from MCA/INIT state is
9 asking for deadlock. Also the state of structures that are protected 9 asking for deadlock. Also the state of structures that are protected
10 by locks is indeterminate, including linked lists. 10 by locks is indeterminate, including linked lists.
11 11
12 --- 12 ---
13 13
14 The complicated ia64 MCA process. All of this is mandated by Intel's 14 The complicated ia64 MCA process. All of this is mandated by Intel's
15 specification for ia64 SAL, error recovery and and unwind, it is not as 15 specification for ia64 SAL, error recovery and unwind, it is not as
16 if we have a choice here. 16 if we have a choice here.
17 17
18 * MCA occurs on one cpu, usually due to a double bit memory error. 18 * MCA occurs on one cpu, usually due to a double bit memory error.
19 This is the monarch cpu. 19 This is the monarch cpu.
20 20
21 * SAL sends an MCA rendezvous interrupt (which is a normal interrupt) 21 * SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
22 to all the other cpus, the slaves. 22 to all the other cpus, the slaves.
23 23
24 * Slave cpus that receive the MCA interrupt call down into SAL, they 24 * Slave cpus that receive the MCA interrupt call down into SAL, they
25 end up spinning disabled while the MCA is being serviced. 25 end up spinning disabled while the MCA is being serviced.
26 26
27 * If any slave cpu was already spinning disabled when the MCA occurred 27 * If any slave cpu was already spinning disabled when the MCA occurred
28 then it cannot service the MCA interrupt. SAL waits ~20 seconds then 28 then it cannot service the MCA interrupt. SAL waits ~20 seconds then
29 sends an unmaskable INIT event to the slave cpus that have not 29 sends an unmaskable INIT event to the slave cpus that have not
30 already rendezvoused. 30 already rendezvoused.
31 31
32 * Because MCA/INIT can be delivered at any time, including when the cpu 32 * Because MCA/INIT can be delivered at any time, including when the cpu
33 is down in PAL in physical mode, the registers at the time of the 33 is down in PAL in physical mode, the registers at the time of the
34 event are _completely_ undefined. In particular the MCA/INIT 34 event are _completely_ undefined. In particular the MCA/INIT
35 handlers cannot rely on the thread pointer, PAL physical mode can 35 handlers cannot rely on the thread pointer, PAL physical mode can
36 (and does) modify TP. It is allowed to do that as long as it resets 36 (and does) modify TP. It is allowed to do that as long as it resets
37 TP on return. However MCA/INIT events expose us to these PAL 37 TP on return. However MCA/INIT events expose us to these PAL
38 internal TP changes. Hence curr_task(). 38 internal TP changes. Hence curr_task().
39 39
40 * If an MCA/INIT event occurs while the kernel was running (not user 40 * If an MCA/INIT event occurs while the kernel was running (not user
41 space) and the kernel has called PAL then the MCA/INIT handler cannot 41 space) and the kernel has called PAL then the MCA/INIT handler cannot
42 assume that the kernel stack is in a fit state to be used. Mainly 42 assume that the kernel stack is in a fit state to be used. Mainly
43 because PAL may or may not maintain the stack pointer internally. 43 because PAL may or may not maintain the stack pointer internally.
44 Because the MCA/INIT handlers cannot trust the kernel stack, they 44 Because the MCA/INIT handlers cannot trust the kernel stack, they
45 have to use their own, per-cpu stacks. The MCA/INIT stacks are 45 have to use their own, per-cpu stacks. The MCA/INIT stacks are
46 preformatted with just enough task state to let the relevant handlers 46 preformatted with just enough task state to let the relevant handlers
47 do their job. 47 do their job.
48 48
49 * Unlike most other architectures, the ia64 struct task is embedded in 49 * Unlike most other architectures, the ia64 struct task is embedded in
50 the kernel stack[1]. So switching to a new kernel stack means that 50 the kernel stack[1]. So switching to a new kernel stack means that
51 we switch to a new task as well. Because various bits of the kernel 51 we switch to a new task as well. Because various bits of the kernel
52 assume that current points into the struct task, switching to a new 52 assume that current points into the struct task, switching to a new
53 stack also means a new value for current. 53 stack also means a new value for current.
54 54
55 * Once all slaves have rendezvoused and are spinning disabled, the 55 * Once all slaves have rendezvoused and are spinning disabled, the
56 monarch is entered. The monarch now tries to diagnose the problem 56 monarch is entered. The monarch now tries to diagnose the problem
57 and decide if it can recover or not. 57 and decide if it can recover or not.
58 58
59 * Part of the monarch's job is to look at the state of all the other 59 * Part of the monarch's job is to look at the state of all the other
60 tasks. The only way to do that on ia64 is to call the unwinder, 60 tasks. The only way to do that on ia64 is to call the unwinder,
61 as mandated by Intel. 61 as mandated by Intel.
62 62
63 * The starting point for the unwind depends on whether a task is 63 * The starting point for the unwind depends on whether a task is
64 running or not. That is, whether it is on a cpu or is blocked. The 64 running or not. That is, whether it is on a cpu or is blocked. The
65 monarch has to determine whether or not a task is on a cpu before it 65 monarch has to determine whether or not a task is on a cpu before it
66 knows how to start unwinding it. The tasks that received an MCA or 66 knows how to start unwinding it. The tasks that received an MCA or
67 INIT event are no longer running, they have been converted to blocked 67 INIT event are no longer running, they have been converted to blocked
68 tasks. But (and it's a big but), the cpus that received the MCA 68 tasks. But (and it's a big but), the cpus that received the MCA
69 rendezvous interrupt are still running on their normal kernel stacks! 69 rendezvous interrupt are still running on their normal kernel stacks!
70 70
71 * To distinguish between these two cases, the monarch must know which 71 * To distinguish between these two cases, the monarch must know which
72 tasks are on a cpu and which are not. Hence each slave cpu that 72 tasks are on a cpu and which are not. Hence each slave cpu that
73 switches to an MCA/INIT stack, registers its new stack using 73 switches to an MCA/INIT stack, registers its new stack using
74 set_curr_task(), so the monarch can tell that the _original_ task is 74 set_curr_task(), so the monarch can tell that the _original_ task is
75 no longer running on that cpu. That gives us a decent chance of 75 no longer running on that cpu. That gives us a decent chance of
76 getting a valid backtrace of the _original_ task. 76 getting a valid backtrace of the _original_ task.
77 77
78 * MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a 78 * MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
79 nested error, we want diagnostics on the MCA/INIT handler that 79 nested error, we want diagnostics on the MCA/INIT handler that
80 failed, not on the task that was originally running. Again this 80 failed, not on the task that was originally running. Again this
81 requires set_curr_task() so the MCA/INIT handlers can register their 81 requires set_curr_task() so the MCA/INIT handlers can register their
82 own stack as running on that cpu. Then a recursive error gets a 82 own stack as running on that cpu. Then a recursive error gets a
83 trace of the failing handler's "task". 83 trace of the failing handler's "task".
84 84
85 [1] My (Keith Owens) original design called for ia64 to separate its 85 [1] My (Keith Owens) original design called for ia64 to separate its
86 struct task and the kernel stacks. Then the MCA/INIT data would be 86 struct task and the kernel stacks. Then the MCA/INIT data would be
87 chained stacks like i386 interrupt stacks. But that required 87 chained stacks like i386 interrupt stacks. But that required
88 radical surgery on the rest of ia64, plus extra hard wired TLB 88 radical surgery on the rest of ia64, plus extra hard wired TLB
89 entries with its associated performance degradation. David 89 entries with its associated performance degradation. David
90 Mosberger vetoed that approach. Which meant that separate kernel 90 Mosberger vetoed that approach. Which meant that separate kernel
91 stacks meant separate "tasks" for the MCA/INIT handlers. 91 stacks meant separate "tasks" for the MCA/INIT handlers.
92 92
93 --- 93 ---
94 94
95 INIT is less complicated than MCA. Pressing the nmi button or using 95 INIT is less complicated than MCA. Pressing the nmi button or using
96 the equivalent command on the management console sends INIT to all 96 the equivalent command on the management console sends INIT to all
97 cpus. SAL picks one one of the cpus as the monarch and the rest are 97 cpus. SAL picks one of the cpus as the monarch and the rest are
98 slaves. All the OS INIT handlers are entered at approximately the same 98 slaves. All the OS INIT handlers are entered at approximately the same
99 time. The OS monarch prints the state of all tasks and returns, after 99 time. The OS monarch prints the state of all tasks and returns, after
100 which the slaves return and the system resumes. 100 which the slaves return and the system resumes.
101 101
102 At least that is what is supposed to happen. Alas there are broken 102 At least that is what is supposed to happen. Alas there are broken
103 versions of SAL out there. Some drive all the cpus as monarchs. Some 103 versions of SAL out there. Some drive all the cpus as monarchs. Some
104 drive them all as slaves. Some drive one cpu as monarch, wait for that 104 drive them all as slaves. Some drive one cpu as monarch, wait for that
105 cpu to return from the OS then drive the rest as slaves. Some versions 105 cpu to return from the OS then drive the rest as slaves. Some versions
106 of SAL cannot even cope with returning from the OS, they spin inside 106 of SAL cannot even cope with returning from the OS, they spin inside
107 SAL on resume. The OS INIT code has workarounds for some of these 107 SAL on resume. The OS INIT code has workarounds for some of these
108 broken SAL symptoms, but some simply cannot be fixed from the OS side. 108 broken SAL symptoms, but some simply cannot be fixed from the OS side.
109 109
110 --- 110 ---
111 111
112 The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer 112 The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
113 violations. Unfortunately MCA/INIT start off as massive layer 113 violations. Unfortunately MCA/INIT start off as massive layer
114 violations (can occur at _any_ time) and they build from there. 114 violations (can occur at _any_ time) and they build from there.
115 115
116 At least ia64 makes an attempt at recovering from hardware errors, but 116 At least ia64 makes an attempt at recovering from hardware errors, but
117 it is a difficult problem because of the asynchronous nature of these 117 it is a difficult problem because of the asynchronous nature of these
118 errors. When processing an unmaskable interrupt we sometimes need 118 errors. When processing an unmaskable interrupt we sometimes need
119 special code to cope with our inability to take any locks. 119 special code to cope with our inability to take any locks.
120 120
121 --- 121 ---
122 122
123 How is ia64 MCA/INIT different from x86 NMI? 123 How is ia64 MCA/INIT different from x86 NMI?
124 124
125 * x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to 125 * x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
126 all cpus. 126 all cpus.
127 127
128 * x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 128 * x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
129 per cpu. 129 per cpu.
130 130
131 * x86 has a separate struct task which points to one of multiple kernel 131 * x86 has a separate struct task which points to one of multiple kernel
132 stacks. ia64 has the struct task embedded in the single kernel 132 stacks. ia64 has the struct task embedded in the single kernel
133 stack, so switching stack means switching task. 133 stack, so switching stack means switching task.
134 134
135 * x86 does not call the BIOS so the NMI handler does not have to worry 135 * x86 does not call the BIOS so the NMI handler does not have to worry
136 about any registers having changed. MCA/INIT can occur while the cpu 136 about any registers having changed. MCA/INIT can occur while the cpu
137 is in PAL in physical mode, with undefined registers and an undefined 137 is in PAL in physical mode, with undefined registers and an undefined
138 kernel stack. 138 kernel stack.
139 139
140 * i386 backtrace is not very sensitive to whether a process is running 140 * i386 backtrace is not very sensitive to whether a process is running
141 or not. ia64 unwind is very, very sensitive to whether a process is 141 or not. ia64 unwind is very, very sensitive to whether a process is
142 running or not. 142 running or not.
143 143
144 --- 144 ---
145 145
146 What happens when MCA/INIT is delivered while a cpu is running user 146 What happens when MCA/INIT is delivered while a cpu is running user
147 space code? 147 space code?
148 148
149 The user mode registers are stored in the RSE area of the MCA/INIT stack on 149 The user mode registers are stored in the RSE area of the MCA/INIT stack on
150 entry to the OS and are restored from there on return to SAL, so user 150 entry to the OS and are restored from there on return to SAL, so user
151 mode registers are preserved across a recoverable MCA/INIT. Since the 151 mode registers are preserved across a recoverable MCA/INIT. Since the
152 OS has no idea what unwind data is available for the user space stack, 152 OS has no idea what unwind data is available for the user space stack,
153 MCA/INIT never tries to backtrace user space. Which means that the OS 153 MCA/INIT never tries to backtrace user space. Which means that the OS
154 does not bother making the user space process look like a blocked task, 154 does not bother making the user space process look like a blocked task,
155 i.e. the OS does not copy pt_regs and switch_stack to the user space 155 i.e. the OS does not copy pt_regs and switch_stack to the user space
156 stack. Also the OS has no idea how big the user space RSE and memory 156 stack. Also the OS has no idea how big the user space RSE and memory
157 stacks are, which makes it too risky to copy the saved state to a user 157 stacks are, which makes it too risky to copy the saved state to a user
158 mode stack. 158 mode stack.
159 159
160 --- 160 ---
161 161
162 How do we get a backtrace on the tasks that were running when MCA/INIT 162 How do we get a backtrace on the tasks that were running when MCA/INIT
163 was delivered? 163 was delivered?
164 164
165 mca.c::ia64_mca_modify_original_stack(). That identifies and 165 mca.c::ia64_mca_modify_original_stack(). That identifies and
166 verifies the original kernel stack, copies the dirty registers from 166 verifies the original kernel stack, copies the dirty registers from
167 the MCA/INIT stack's RSE to the original stack's RSE, copies the 167 the MCA/INIT stack's RSE to the original stack's RSE, copies the
168 skeleton struct pt_regs and switch_stack to the original stack, fills 168 skeleton struct pt_regs and switch_stack to the original stack, fills
169 in the skeleton structures from the PAL minstate area and updates the 169 in the skeleton structures from the PAL minstate area and updates the
170 original stack's thread.ksp. That makes the original stack look 170 original stack's thread.ksp. That makes the original stack look
171 exactly like any other blocked task, i.e. it now appears to be 171 exactly like any other blocked task, i.e. it now appears to be
172 sleeping. To get a backtrace, just start with thread.ksp for the 172 sleeping. To get a backtrace, just start with thread.ksp for the
173 original task and unwind like any other sleeping task. 173 original task and unwind like any other sleeping task.
174 174
175 --- 175 ---
176 176
177 How do we identify the tasks that were running when MCA/INIT was 177 How do we identify the tasks that were running when MCA/INIT was
178 delivered? 178 delivered?
179 179
180 If the previous task has been verified and converted to a blocked 180 If the previous task has been verified and converted to a blocked
181 state, then sos->prev_task on the MCA/INIT stack is updated to point to 181 state, then sos->prev_task on the MCA/INIT stack is updated to point to
182 the previous task. You can look at that field in dumps or debuggers. 182 the previous task. You can look at that field in dumps or debuggers.
183 To help distinguish between the handler and the original tasks, 183 To help distinguish between the handler and the original tasks,
184 handlers have _TIF_MCA_INIT set in thread_info.flags. 184 handlers have _TIF_MCA_INIT set in thread_info.flags.
185 185
186 The sos data is always in the MCA/INIT handler stack, at offset 186 The sos data is always in the MCA/INIT handler stack, at offset
187 MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it 187 MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
188 as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct 188 as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
189 ia64_sal_os_state), with 16 byte alignment for all structures. 189 ia64_sal_os_state), with 16 byte alignment for all structures.
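The arithmetic above can be sketched in C as follows. This follows the formula in the text literally and assumes that "16 byte alignment" means each intermediate offset is rounded down to a 16-byte boundary; the function name and the sizes used in the example are hypothetical, and the real macros in mca_asm.h may differ:

```c
#include <stddef.h>

/* Round an offset down to a 16-byte boundary. */
static size_t align16_down(size_t x)
{
	return x & ~(size_t)15;
}

/* Sketch only: stack size minus struct pt_regs minus the sos area,
 * realigning after each subtraction. Sizes are passed in because the
 * real structure sizes are only known on ia64. */
size_t mca_sos_offset_sketch(size_t stack_size, size_t pt_regs_size,
			     size_t sos_size)
{
	size_t pt_regs_off = align16_down(stack_size - pt_regs_size);

	return align16_down(pt_regs_off - sos_size);
}
```

With made-up sizes of 32768, 400 and 900 bytes, the sketch yields 31456, which is a multiple of 16 as required.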
190 190
191 Also the comm field of the MCA/INIT task is modified to include the pid 191 Also the comm field of the MCA/INIT task is modified to include the pid
192 of the original task, for humans to use. For example, a comm field of 192 of the original task, for humans to use. For example, a comm field of
193 'MCA 12159' means that pid 12159 was running when the MCA was 193 'MCA 12159' means that pid 12159 was running when the MCA was
194 delivered. 194 delivered.
195 195
Documentation/input/input.txt
1 Linux Input drivers v1.0 1 Linux Input drivers v1.0
2 (c) 1999-2001 Vojtech Pavlik <vojtech@ucw.cz> 2 (c) 1999-2001 Vojtech Pavlik <vojtech@ucw.cz>
3 Sponsored by SuSE 3 Sponsored by SuSE
4 $Id: input.txt,v 1.8 2002/05/29 03:15:01 bradleym Exp $ 4 $Id: input.txt,v 1.8 2002/05/29 03:15:01 bradleym Exp $
5 ---------------------------------------------------------------------------- 5 ----------------------------------------------------------------------------
6 6
7 0. Disclaimer 7 0. Disclaimer
8 ~~~~~~~~~~~~~ 8 ~~~~~~~~~~~~~
9 This program is free software; you can redistribute it and/or modify it 9 This program is free software; you can redistribute it and/or modify it
10 under the terms of the GNU General Public License as published by the Free 10 under the terms of the GNU General Public License as published by the Free
11 Software Foundation; either version 2 of the License, or (at your option) 11 Software Foundation; either version 2 of the License, or (at your option)
12 any later version. 12 any later version.
13 13
14 This program is distributed in the hope that it will be useful, but 14 This program is distributed in the hope that it will be useful, but
15 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 15 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
16 or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 16 or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
17 more details. 17 more details.
18 18
19 You should have received a copy of the GNU General Public License along 19 You should have received a copy of the GNU General Public License along
20 with this program; if not, write to the Free Software Foundation, Inc., 59 20 with this program; if not, write to the Free Software Foundation, Inc., 59
21 Temple Place, Suite 330, Boston, MA 02111-1307 USA 21 Temple Place, Suite 330, Boston, MA 02111-1307 USA
22 22
23 Should you need to contact me, the author, you can do so either by e-mail 23 Should you need to contact me, the author, you can do so either by e-mail
24 - mail your message to <vojtech@ucw.cz>, or by paper mail: Vojtech Pavlik, 24 - mail your message to <vojtech@ucw.cz>, or by paper mail: Vojtech Pavlik,
25 Simunkova 1594, Prague 8, 182 00 Czech Republic 25 Simunkova 1594, Prague 8, 182 00 Czech Republic
26 26
27 For your convenience, the GNU General Public License version 2 is included 27 For your convenience, the GNU General Public License version 2 is included
28 in the package: See the file COPYING. 28 in the package: See the file COPYING.
29 29
30 1. Introduction 30 1. Introduction
31 ~~~~~~~~~~~~~~~ 31 ~~~~~~~~~~~~~~~
32 This is a collection of drivers that is designed to support all input 32 This is a collection of drivers that is designed to support all input
33 devices under Linux. While it is currently used only for USB input 33 devices under Linux. While it is currently used only for USB input
34 devices, future use (say 2.5/2.6) is expected to expand to replace 34 devices, future use (say 2.5/2.6) is expected to expand to replace
35 most of the existing input system, which is why it lives in 35 most of the existing input system, which is why it lives in
36 drivers/input/ instead of drivers/usb/. 36 drivers/input/ instead of drivers/usb/.
37 37
38 The centre of the input drivers is the input module, which must be 38 The centre of the input drivers is the input module, which must be
39 loaded before any other of the input modules - it serves as a way of 39 loaded before any other of the input modules - it serves as a way of
40 communication between two groups of modules: 40 communication between two groups of modules:
41 41
42 1.1 Device drivers 42 1.1 Device drivers
43 ~~~~~~~~~~~~~~~~~~ 43 ~~~~~~~~~~~~~~~~~~
44 These modules talk to the hardware (for example via USB), and provide 44 These modules talk to the hardware (for example via USB), and provide
45 events (keystrokes, mouse movements) to the input module. 45 events (keystrokes, mouse movements) to the input module.
46 46
47 1.2 Event handlers 47 1.2 Event handlers
48 ~~~~~~~~~~~~~~~~~~ 48 ~~~~~~~~~~~~~~~~~~
49 These modules get events from input and pass them where needed via 49 These modules get events from input and pass them where needed via
50 various interfaces - keystrokes to the kernel, mouse movements via a 50 various interfaces - keystrokes to the kernel, mouse movements via a
51 simulated PS/2 interface to GPM and X and so on. 51 simulated PS/2 interface to GPM and X and so on.
52 52
53 2. Simple Usage 53 2. Simple Usage
54 ~~~~~~~~~~~~~~~ 54 ~~~~~~~~~~~~~~~
55 For the most usual configuration, with one USB mouse and one USB keyboard, 55 For the most usual configuration, with one USB mouse and one USB keyboard,
56 you'll have to load the following modules (or have them built in to the 56 you'll have to load the following modules (or have them built in to the
57 kernel): 57 kernel):
58 58
59 input 59 input
60 mousedev 60 mousedev
61 keybdev 61 keybdev
62 usbcore 62 usbcore
63 uhci_hcd or ohci_hcd or ehci_hcd 63 uhci_hcd or ohci_hcd or ehci_hcd
64 usbhid 64 usbhid
65 65
66 After this, the USB keyboard will work straight away, and the USB mouse 66 After this, the USB keyboard will work straight away, and the USB mouse
67 will be available as a character device on major 13, minor 63: 67 will be available as a character device on major 13, minor 63:
68 68
69 crw-r--r-- 1 root root 13, 63 Mar 28 22:45 mice 69 crw-r--r-- 1 root root 13, 63 Mar 28 22:45 mice
70 70
71 This device has to be created. 71 This device has to be created.
72 The commands to create it by hand are: 72 The commands to create it by hand are:
73 73
74 cd /dev 74 cd /dev
75 mkdir input 75 mkdir input
76 mknod input/mice c 13 63 76 mknod input/mice c 13 63
77 77
78 After that you have to point GPM (the textmode mouse cut&paste tool) and 78 After that you have to point GPM (the textmode mouse cut&paste tool) and
79 XFree to this device to use it - GPM should be called like: 79 XFree to this device to use it - GPM should be called like:
80 80
81 gpm -t ps2 -m /dev/input/mice 81 gpm -t ps2 -m /dev/input/mice
82 82
83 And in X: 83 And in X:
84 84
85 Section "Pointer" 85 Section "Pointer"
86 Protocol "ImPS/2" 86 Protocol "ImPS/2"
87 Device "/dev/input/mice" 87 Device "/dev/input/mice"
88 ZAxisMapping 4 5 88 ZAxisMapping 4 5
89 EndSection 89 EndSection
90 90
91 When you do all of the above, you can use your USB mouse and keyboard. 91 When you do all of the above, you can use your USB mouse and keyboard.
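For programs that read /dev/input/mice directly instead of going through GPM or X, the following C sketch decodes one packet of the simulated PS/2 stream. It assumes the plain 3-byte PS/2 packet layout (the one gpm -t ps2 expects); the struct and function names are hypothetical:

```c
/* Hypothetical container for one decoded mouse packet. */
struct mouse_event {
	int left, right, middle;	/* button states, 0 or 1 */
	int dx, dy;			/* relative movement */
};

/* Decode a standard 3-byte PS/2 packet:
 *   byte 0: bit 0 left, bit 1 right, bit 2 middle,
 *           bit 4 X sign, bit 5 Y sign
 *   byte 1: low 8 bits of X movement
 *   byte 2: low 8 bits of Y movement */
struct mouse_event decode_ps2_packet(const unsigned char p[3])
{
	struct mouse_event ev;

	ev.left = p[0] & 0x01;
	ev.right = (p[0] >> 1) & 0x01;
	ev.middle = (p[0] >> 2) & 0x01;
	ev.dx = p[1] - ((p[0] & 0x10) ? 256 : 0);
	ev.dy = p[2] - ((p[0] & 0x20) ? 256 : 0);
	return ev;
}
```

A reader would open /dev/input/mice, read() three bytes at a time and pass each packet to decode_ps2_packet().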
92 92
93 3. Detailed Description 93 3. Detailed Description
94 ~~~~~~~~~~~~~~~~~~~~~~~ 94 ~~~~~~~~~~~~~~~~~~~~~~~
95 3.1 Device drivers 95 3.1 Device drivers
96 ~~~~~~~~~~~~~~~~~~ 96 ~~~~~~~~~~~~~~~~~~
97 Device drivers are the modules that generate events. The events are 97 Device drivers are the modules that generate events. The events are
98 however not useful without being handled, so you also will need to use some 98 however not useful without being handled, so you also will need to use some
99 of the modules from section 3.2. 99 of the modules from section 3.2.
100 100
101 3.1.1 usbhid 101 3.1.1 usbhid
102 ~~~~~~~~~~~~ 102 ~~~~~~~~~~~~
103 usbhid is the largest and most complex driver of the whole suite. It 103 usbhid is the largest and most complex driver of the whole suite. It
104 handles all HID devices, and because there is a very wide variety of them, 104 handles all HID devices, and because there is a very wide variety of them,
105 and because the USB HID specification isn't simple, it needs to be this big. 105 and because the USB HID specification isn't simple, it needs to be this big.
106 106
107 Currently, it handles USB mice, joysticks, gamepads, steering wheels, 107 Currently, it handles USB mice, joysticks, gamepads, steering wheels,
108 keyboards, trackballs and digitizers. 108 keyboards, trackballs and digitizers.
109 109
110 However, USB uses HID also for monitor controls, speaker controls, UPSs, 110 However, USB uses HID also for monitor controls, speaker controls, UPSs,
111 LCDs and many other purposes. 111 LCDs and many other purposes.
112 112
113 The monitor and speaker controls should be easy to add to the hid/input 113 The monitor and speaker controls should be easy to add to the hid/input
114 interface, but for the UPSs and LCDs it doesn't make much sense. For this, 114 interface, but for the UPSs and LCDs it doesn't make much sense. For this,
115 the hiddev interface was designed. See Documentation/usb/hiddev.txt 115 the hiddev interface was designed. See Documentation/usb/hiddev.txt
116 for more information about it. 116 for more information about it.
117 117
118 The usage of the usbhid module is very simple, it takes no parameters, 118 The usage of the usbhid module is very simple, it takes no parameters,
119 detects everything automatically and when a HID device is inserted, it 119 detects everything automatically and when a HID device is inserted, it
120 detects it appropriately. 120 detects it appropriately.
121 121
122 However, because the devices vary wildly, you might happen to have a 122 However, because the devices vary wildly, you might happen to have a
123 device that doesn't work well. In that case #define DEBUG at the beginning 123 device that doesn't work well. In that case #define DEBUG at the beginning
124 of hid-core.c and send me the syslog traces. 124 of hid-core.c and send me the syslog traces.
125 125
126 3.1.2 usbmouse 126 3.1.2 usbmouse
127 ~~~~~~~~~~~~~~ 127 ~~~~~~~~~~~~~~
128 For embedded systems, for mice with broken HID descriptors and just any 128 For embedded systems, for mice with broken HID descriptors and just any
129 other use when the big usbhid wouldn't be a good choice, there is the 129 other use when the big usbhid wouldn't be a good choice, there is the
130 usbmouse driver. It handles USB mice only. It uses a simpler HIDBP 130 usbmouse driver. It handles USB mice only. It uses a simpler HIDBP
131 protocol. This also means the mice must support this simpler protocol. Not 131 protocol. This also means the mice must support this simpler protocol. Not
132 all do. If you don't have any strong reason to use this module, use usbhid 132 all do. If you don't have any strong reason to use this module, use usbhid
133 instead. 133 instead.
134 134
135 3.1.3 usbkbd 135 3.1.3 usbkbd
136 ~~~~~~~~~~~~ 136 ~~~~~~~~~~~~
137 Much like usbmouse, this module talks to keyboards with a simplified 137 Much like usbmouse, this module talks to keyboards with a simplified
138 HIDBP protocol. It's smaller, but doesn't support any extra special keys. 138 HIDBP protocol. It's smaller, but doesn't support any extra special keys.
139 Use usbhid instead if there isn't any special reason to use this. 139 Use usbhid instead if there isn't any special reason to use this.
140 140
141 3.1.4 wacom 141 3.1.4 wacom
142 ~~~~~~~~~~~ 142 ~~~~~~~~~~~
143 This is a driver for Wacom Graphire and Intuos tablets. Not for Wacom 143 This is a driver for Wacom Graphire and Intuos tablets. Not for Wacom
144 PenPartner, that one is handled by the HID driver. Although the Intuos and 144 PenPartner, that one is handled by the HID driver. Although the Intuos and
145 Graphire tablets claim that they are HID tablets as well, they are not and 145 Graphire tablets claim that they are HID tablets as well, they are not and
146 thus need this specific driver. 146 thus need this specific driver.
147 147
148 3.1.5 iforce 148 3.1.5 iforce
149 ~~~~~~~~~~~~ 149 ~~~~~~~~~~~~
150 A driver for I-Force joysticks and wheels, both over USB and RS232. 150 A driver for I-Force joysticks and wheels, both over USB and RS232.
151 It includes ForceFeedback support now, even though Immersion 151 It includes ForceFeedback support now, even though Immersion
152 Corp. considers the protocol a trade secret and won't disclose a word 152 Corp. considers the protocol a trade secret and won't disclose a word
153 about it. 153 about it.
154 154
155 3.2 Event handlers 155 3.2 Event handlers
156 ~~~~~~~~~~~~~~~~~~ 156 ~~~~~~~~~~~~~~~~~~
157 Event handlers distribute the events from the devices to userland and 157 Event handlers distribute the events from the devices to userland and
158 kernel, as needed. 158 kernel, as needed.
159 159
160 3.2.1 keybdev 160 3.2.1 keybdev
161 ~~~~~~~~~~~~~ 161 ~~~~~~~~~~~~~
162 keybdev is currently a rather ugly hack that translates the input 162 keybdev is currently a rather ugly hack that translates the input
163 events into architecture-specific keyboard raw mode (Xlated AT Set2 on 163 events into architecture-specific keyboard raw mode (Xlated AT Set2 on
164 x86), and passes them into the handle_scancode function of the 164 x86), and passes them into the handle_scancode function of the
165 keyboard.c module. This works well enough on all architectures that 165 keyboard.c module. This works well enough on all architectures that
166 keybdev can generate rawmode on, other architectures can be added to 166 keybdev can generate rawmode on, other architectures can be added to
167 it. 167 it.
168 168
169 The right way would be to pass the events to keyboard.c directly, 169 The right way would be to pass the events to keyboard.c directly,
170 best if keyboard.c would itself be an event handler. This is done in 170 best if keyboard.c would itself be an event handler. This is done in
171 the input patch, available on the webpage mentioned below. 171 the input patch, available on the webpage mentioned below.
172 172
173 3.2.2 mousedev 173 3.2.2 mousedev
174 ~~~~~~~~~~~~~~ 174 ~~~~~~~~~~~~~~
175 mousedev is also a hack to make programs that use mouse input 175 mousedev is also a hack to make programs that use mouse input
176 work. It takes events from either mice or digitizers/tablets and makes 176 work. It takes events from either mice or digitizers/tablets and makes
177 a PS/2-style (a la /dev/psaux) mouse device available to the 177 a PS/2-style (a la /dev/psaux) mouse device available to the
178 userland. Ideally, the programs could use a more reasonable interface, 178 userland. Ideally, the programs could use a more reasonable interface,
179 for example evdev. 179 for example evdev.
180 180
181 Mousedev devices in /dev/input (as shown above) are: 181 Mousedev devices in /dev/input (as shown above) are:
182 182
183 crw-r--r-- 1 root root 13, 32 Mar 28 22:45 mouse0 183 crw-r--r-- 1 root root 13, 32 Mar 28 22:45 mouse0
184 crw-r--r-- 1 root root 13, 33 Mar 29 00:41 mouse1 184 crw-r--r-- 1 root root 13, 33 Mar 29 00:41 mouse1
185 crw-r--r-- 1 root root 13, 34 Mar 29 00:41 mouse2 185 crw-r--r-- 1 root root 13, 34 Mar 29 00:41 mouse2
186 crw-r--r-- 1 root root 13, 35 Apr 1 10:50 mouse3 186 crw-r--r-- 1 root root 13, 35 Apr 1 10:50 mouse3
187 ... 187 ...
188 ... 188 ...
189 crw-r--r-- 1 root root 13, 62 Apr 1 10:50 mouse30 189 crw-r--r-- 1 root root 13, 62 Apr 1 10:50 mouse30
190 crw-r--r-- 1 root root 13, 63 Apr 1 10:50 mice 190 crw-r--r-- 1 root root 13, 63 Apr 1 10:50 mice
191 191
192 Each 'mouse' device is assigned to a single mouse or digitizer, except 192 Each 'mouse' device is assigned to a single mouse or digitizer, except
193 the last one - 'mice'. This single character device is shared by all 193 the last one - 'mice'. This single character device is shared by all
194 mice and digitizers, and even if none are connected, the device is 194 mice and digitizers, and even if none are connected, the device is
195 present. This is useful for hotplugging USB mice, so that programs 195 present. This is useful for hotplugging USB mice, so that programs
196 can open the device even when no mice are present. 196 can open the device even when no mice are present.
197 197
198 CONFIG_INPUT_MOUSEDEV_SCREEN_[XY] in the kernel configuration are 198 CONFIG_INPUT_MOUSEDEV_SCREEN_[XY] in the kernel configuration are
199 the size of your screen (in pixels) in XFree86. This is needed if you 199 the size of your screen (in pixels) in XFree86. This is needed if you
200 want to use your digitizer in X, because its movement is sent to X 200 want to use your digitizer in X, because its movement is sent to X
201 via a virtual PS/2 mouse and thus needs to be scaled 201 via a virtual PS/2 mouse and thus needs to be scaled
202 accordingly. These values won't be used if you use a mouse only. 202 accordingly. These values won't be used if you use a mouse only.
203 203
204 Mousedev will generate either PS/2, ImPS/2 (Microsoft IntelliMouse) or 204 Mousedev will generate either PS/2, ImPS/2 (Microsoft IntelliMouse) or
205 ExplorerPS/2 (IntelliMouse Explorer) protocols, depending on what the 205 ExplorerPS/2 (IntelliMouse Explorer) protocols, depending on what the
206 program reading the data wishes. You can set GPM and X to any of 206 program reading the data wishes. You can set GPM and X to any of
207 these. You'll need ImPS/2 if you want to make use of a wheel on a USB 207 these. You'll need ImPS/2 if you want to make use of a wheel on a USB
208 mouse and ExplorerPS/2 if you want to use extra (up to 5) buttons. 208 mouse and ExplorerPS/2 if you want to use extra (up to 5) buttons.
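
As an illustration of what a program reading a mousedev node receives, here is a minimal sketch that decodes one basic 3-byte PS/2 packet. It assumes the standard PS/2 layout (buttons in bits 0-2 of the first byte, sign bits for X and Y in bits 4 and 5). The struct and function names are made up for this example, and the ImPS/2 and ExplorerPS/2 variants carry an extra byte that is not handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded form of one basic 3-byte PS/2 mouse packet (hypothetical type). */
struct ps2_motion {
    int left, right, middle; /* button states, 1 = pressed */
    int dx, dy;              /* relative movement */
};

/*
 * Decode a basic PS/2 packet. Byte 0 carries the buttons in bits 0-2 and
 * the sign bits of dx and dy in bits 4 and 5; bytes 1 and 2 carry the low
 * eight bits of each movement value.
 */
static void ps2_decode(const uint8_t pkt[3], struct ps2_motion *out)
{
    assert(pkt != NULL && out != NULL);

    out->left   = pkt[0] & 0x01;
    out->right  = (pkt[0] >> 1) & 0x01;
    out->middle = (pkt[0] >> 2) & 0x01;

    /* Combine the sign bit with the low byte (9-bit two's complement). */
    out->dx = (int)pkt[1] - ((pkt[0] & 0x10) ? 256 : 0);
    out->dy = (int)pkt[2] - ((pkt[0] & 0x20) ? 256 : 0);
}
```

A caller would read three bytes at a time from /dev/input/mice (or a mouseN node) and pass each group to ps2_decode().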
209 209
210 3.2.3 joydev 210 3.2.3 joydev
211 ~~~~~~~~~~~~ 211 ~~~~~~~~~~~~
212 Joydev implements v0.x and v1.x Linux joystick api, much like 212 Joydev implements v0.x and v1.x Linux joystick api, much like
213 drivers/char/joystick/joystick.c used to in earlier versions. See 213 drivers/char/joystick/joystick.c used to in earlier versions. See
214 joystick-api.txt in the Documentation subdirectory for details. As 214 joystick-api.txt in the Documentation subdirectory for details. As
215 soon as any joystick is connected, it can be accessed in /dev/input 215 soon as any joystick is connected, it can be accessed in /dev/input
216 on: 216 on:
217 217
218 crw-r--r-- 1 root root 13, 0 Apr 1 10:50 js0 218 crw-r--r-- 1 root root 13, 0 Apr 1 10:50 js0
219 crw-r--r-- 1 root root 13, 1 Apr 1 10:50 js1 219 crw-r--r-- 1 root root 13, 1 Apr 1 10:50 js1
220 crw-r--r-- 1 root root 13, 2 Apr 1 10:50 js2 220 crw-r--r-- 1 root root 13, 2 Apr 1 10:50 js2
221 crw-r--r-- 1 root root 13, 3 Apr 1 10:50 js3 221 crw-r--r-- 1 root root 13, 3 Apr 1 10:50 js3
222 ... 222 ...
223 223
224 And so on up to js31. 224 And so on up to js31.
225 225
226 3.2.4 evdev 226 3.2.4 evdev
227 ~~~~~~~~~~~ 227 ~~~~~~~~~~~
228 evdev is the generic input event interface. It passes the events 228 evdev is the generic input event interface. It passes the events
229 generated in the kernel straight to the program, with timestamps. The 229 generated in the kernel straight to the program, with timestamps. The
230 API is still evolving, but should be useable now. It's described in 230 API is still evolving, but should be useable now. It's described in
231 section 5. 231 section 5.
232 232
233 This should be the way for GPM and X to get keyboard and mouse mouse 233 This should be the way for GPM and X to get keyboard and mouse
234 events. It allows for multihead in X without any specific multihead 234 events. It allows for multihead in X without any specific multihead
235 kernel support. The event codes are the same on all architectures and 235 kernel support. The event codes are the same on all architectures and
236 are hardware independent. 236 are hardware independent.
237 237
238 The devices are in /dev/input: 238 The devices are in /dev/input:
239 239
240 crw-r--r-- 1 root root 13, 64 Apr 1 10:49 event0 240 crw-r--r-- 1 root root 13, 64 Apr 1 10:49 event0
241 crw-r--r-- 1 root root 13, 65 Apr 1 10:50 event1 241 crw-r--r-- 1 root root 13, 65 Apr 1 10:50 event1
242 crw-r--r-- 1 root root 13, 66 Apr 1 10:50 event2 242 crw-r--r-- 1 root root 13, 66 Apr 1 10:50 event2
243 crw-r--r-- 1 root root 13, 67 Apr 1 10:50 event3 243 crw-r--r-- 1 root root 13, 67 Apr 1 10:50 event3
244 ... 244 ...
245 245
246 And so on up to event31. 246 And so on up to event31.
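
The listings in sections 3.2.2 to 3.2.4 imply a fixed split of the minor numbers under major 13: 0-31 for jsN, 32-62 for mouseN, 63 for mice and 64-95 for eventN. A small sketch of that mapping, using only the ranges shown in those listings (the enum, struct and function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

enum input_node_kind { NODE_JS, NODE_MOUSE, NODE_MICE, NODE_EVENT };

struct input_node {
    enum input_node_kind kind;
    unsigned int index; /* N in jsN, mouseN or eventN; 0 for mice */
};

/*
 * Classify a minor number of major 13 according to the listings above.
 * Returns 0 on success, -1 if the minor is outside the ranges shown.
 */
static int classify_minor(unsigned int minor, struct input_node *out)
{
    assert(out != NULL);

    if (minor <= 31) {
        out->kind = NODE_JS;
        out->index = minor;
    } else if (minor <= 62) {
        out->kind = NODE_MOUSE;
        out->index = minor - 32;
    } else if (minor == 63) {
        out->kind = NODE_MICE;
        out->index = 0;
    } else if (minor <= 95) {
        out->kind = NODE_EVENT;
        out->index = minor - 64;
    } else {
        return -1;
    }
    return 0;
}
```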
247 247
248 4. Verifying if it works 248 4. Verifying if it works
249 ~~~~~~~~~~~~~~~~~~~~~~~~ 249 ~~~~~~~~~~~~~~~~~~~~~~~~
250 Typing a couple keys on the keyboard should be enough to check that 250 Typing a couple keys on the keyboard should be enough to check that
251 a USB keyboard works and is correctly connected to the kernel keyboard 251 a USB keyboard works and is correctly connected to the kernel keyboard
252 driver. 252 driver.
253 253
254 Doing a cat /dev/input/mouse0 (c, 13, 32) will verify that a mouse 254 Doing a cat /dev/input/mouse0 (c, 13, 32) will verify that a mouse
255 is also emulated, characters should appear if you move it. 255 is also emulated, characters should appear if you move it.
256 256
257 You can test the joystick emulation with the 'jstest' utility, 257 You can test the joystick emulation with the 'jstest' utility,
258 available in the joystick package (see Documentation/input/joystick.txt). 258 available in the joystick package (see Documentation/input/joystick.txt).
259 259
260 You can test the event devices with the 'evtest' utility available 260 You can test the event devices with the 'evtest' utility available
261 in the LinuxConsole project CVS archive (see the URL below). 261 in the LinuxConsole project CVS archive (see the URL below).
262 262
263 5. Event interface 263 5. Event interface
264 ~~~~~~~~~~~~~~~~~~ 264 ~~~~~~~~~~~~~~~~~~
265 Should you want to add event device support into any application (X, gpm, 265 Should you want to add event device support into any application (X, gpm,
266 svgalib ...) I <vojtech@ucw.cz> will be happy to provide you any help I 266 svgalib ...) I <vojtech@ucw.cz> will be happy to provide you any help I
267 can. Here goes a description of the current state of things, which is going 267 can. Here goes a description of the current state of things, which is going
268 to be extended, but not changed incompatibly as time goes: 268 to be extended, but not changed incompatibly as time goes:
269 269
270 You can use blocking and nonblocking reads, also select() on the 270 You can use blocking and nonblocking reads, also select() on the
271 /dev/input/eventX devices, and you'll always get a whole number of input 271 /dev/input/eventX devices, and you'll always get a whole number of input
272 events on a read. Their layout is: 272 events on a read. Their layout is:
273 273
274 struct input_event { 274 struct input_event {
275 struct timeval time; 275 struct timeval time;
276 unsigned short type; 276 unsigned short type;
277 unsigned short code; 277 unsigned short code;
278 unsigned int value; 278 unsigned int value;
279 }; 279 };
280 280
281 'time' is the timestamp, it returns the time at which the event happened. 281 'time' is the timestamp, it returns the time at which the event happened.
282 Type is for example EV_REL for relative movement, EV_KEY for a keypress or 282 Type is for example EV_REL for relative movement, EV_KEY for a keypress or
283 release. More types are defined in include/linux/input.h. 283 release. More types are defined in include/linux/input.h.
284 284
285 'code' is event code, for example REL_X or KEY_BACKSPACE, again a complete 285 'code' is event code, for example REL_X or KEY_BACKSPACE, again a complete
286 list is in include/linux/input.h. 286 list is in include/linux/input.h.
287 287
288 'value' is the value the event carries. Either a relative change for 288 'value' is the value the event carries. Either a relative change for
289 EV_REL, absolute new value for EV_ABS (joysticks ...), or 0 for EV_KEY for 289 EV_REL, absolute new value for EV_ABS (joysticks ...), or 0 for EV_KEY for
290 release, 1 for keypress and 2 for autorepeat. 290 release, 1 for keypress and 2 for autorepeat.
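
To make the field descriptions concrete, here is a minimal sketch of how a program might pick events out of a buffer filled by read() and interpret them. It declares a local copy of the layout documented in this section so that it is self-contained; a real program would include linux/input.h and use struct input_event and the constants from there. The struct name input_event_doc and the two helper functions are assumptions of this example. Because a relative change can be negative, the sketch converts 'value' to a signed int before using it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Values as defined in include/linux/input.h; redefined here only so the
 * sketch builds on its own. */
#define EV_KEY        0x01
#define EV_REL        0x02
#define EV_ABS        0x03
#define REL_X         0x00
#define KEY_BACKSPACE 14

/* Local mirror of the layout documented above (hypothetical name). */
struct input_event_doc {
    struct timeval time;
    unsigned short type;
    unsigned short code;
    unsigned int value;
};

/*
 * Copy event number idx out of a buffer of nbytes bytes as returned by
 * read(). Returns 0 on success, -1 if the buffer is not a whole number
 * of events or does not contain that many.
 */
static int event_at(const void *buf, size_t nbytes, size_t idx,
                    struct input_event_doc *out)
{
    assert(buf != NULL && out != NULL);

    if (nbytes % sizeof(*out) != 0 || idx >= nbytes / sizeof(*out))
        return -1;
    memcpy(out, (const char *)buf + idx * sizeof(*out), sizeof(*out));
    return 0;
}

/* Short human-readable classification of one event. */
static const char *describe(const struct input_event_doc *ev)
{
    switch (ev->type) {
    case EV_KEY:
        if (ev->value == 0)
            return "key release";
        if (ev->value == 1)
            return "key press";
        return "key autorepeat";
    case EV_REL:
        return "relative movement";
    case EV_ABS:
        return "absolute position";
    default:
        return "other";
    }
}
```

For an EV_REL event the movement is then (int)ev.value, and for EV_KEY the key is identified by ev.code.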
291 291
292 292
Documentation/isdn/INTERFACE.fax
1 $Id: INTERFACE.fax,v 1.2 2000/08/06 09:22:50 armin Exp $ 1 $Id: INTERFACE.fax,v 1.2 2000/08/06 09:22:50 armin Exp $
2 2
3 3
4 Description of the fax-subinterface between linklevel and hardwarelevel of 4 Description of the fax-subinterface between linklevel and hardwarelevel of
5 isdn4linux. 5 isdn4linux.
6 6
7 The communication between linklevel (LL) and hardwarelevel (HL) for fax 7 The communication between linklevel (LL) and hardwarelevel (HL) for fax
8 is based on the struct T30_s (defined in isdnif.h). 8 is based on the struct T30_s (defined in isdnif.h).
9 This struct is allocated in the LL. 9 This struct is allocated in the LL.
10 In order to use fax, the LL provides the pointer to this struct with the 10 In order to use fax, the LL provides the pointer to this struct with the
11 command ISDN_CMD_SETL3 (parm.fax). This pointer expires in case of hangup 11 command ISDN_CMD_SETL3 (parm.fax). This pointer expires in case of hangup
12 and when a new channel to a new connection is assigned. 12 and when a new channel to a new connection is assigned.
13 13
14 14
15 Data handling: 15 Data handling:
16 In send-mode the HL-driver has to handle the <DLE> codes and the bit-order 16 In send-mode the HL-driver has to handle the <DLE> codes and the bit-order
17 conversion by itself. 17 conversion by itself.
18 In receive-mode the LL-driver takes care of the bit-order conversion 18 In receive-mode the LL-driver takes care of the bit-order conversion
19 (specified by +FBOR) 19 (specified by +FBOR)
20 20
21 Structure T30_s description: 21 Structure T30_s description:
22 22
23 This structure stores the values (set by AT-commands), the remote- 23 This structure stores the values (set by AT-commands), the remote-
24 capability-values and the command-codes between LL and HL. 24 capability-values and the command-codes between LL and HL.
25 25
26 If the HL-driver receives ISDN_CMD_FAXCMD, all needed information 26 If the HL-driver receives ISDN_CMD_FAXCMD, all needed information
27 is in this struct set by the LL. 27 is in this struct set by the LL.
28 To signal information to the LL, the HL-driver has to set the 28 To signal information to the LL, the HL-driver has to set the
29 the parameters and use ISDN_STAT_FAXIND. 29 parameters and use ISDN_STAT_FAXIND.
30 (Please refer to INTERFACE) 30 (Please refer to INTERFACE)
31 31
32 Structure T30_s: 32 Structure T30_s:
33 33
34 All members are 8-bit unsigned (__u8) 34 All members are 8-bit unsigned (__u8)
35 35
36 - resolution 36 - resolution
37 - rate 37 - rate
38 - width 38 - width
39 - length 39 - length
40 - compression 40 - compression
41 - ecm 41 - ecm
42 - binary 42 - binary
43 - scantime 43 - scantime
44 - id[] 44 - id[]
45 Local faxmachine's parameters, set by +FDIS, +FDCS, +FLID, ... 45 Local faxmachine's parameters, set by +FDIS, +FDCS, +FLID, ...
46 46
47 - r_resolution 47 - r_resolution
48 - r_rate 48 - r_rate
49 - r_width 49 - r_width
50 - r_length 50 - r_length
51 - r_compression 51 - r_compression
52 - r_ecm 52 - r_ecm
53 - r_binary 53 - r_binary
54 - r_scantime 54 - r_scantime
55 - r_id[] 55 - r_id[]
56 Remote faxmachine's parameters. To be set by HL-driver. 56 Remote faxmachine's parameters. To be set by HL-driver.
57 57
58 - phase 58 - phase
59 Defines the actual state of fax connection. Set by HL or LL 59 Defines the actual state of fax connection. Set by HL or LL
60 depending on progress and type of connection. 60 depending on progress and type of connection.
61 If the phase changes because of an AT command, the LL driver 61 If the phase changes because of an AT command, the LL driver
62 changes this value. Otherwise the HL-driver takes care of it, but 62 changes this value. Otherwise the HL-driver takes care of it, but
63 only necessary on call establishment (from IDLE to PHASE_A). 63 only necessary on call establishment (from IDLE to PHASE_A).
64 (one of the constants ISDN_FAX_PHASE_[IDLE,A,B,C,D,E]) 64 (one of the constants ISDN_FAX_PHASE_[IDLE,A,B,C,D,E])
65 65
66 - direction 66 - direction
67 Defines outgoing/send or incoming/receive connection. 67 Defines outgoing/send or incoming/receive connection.
68 (ISDN_TTY_FAX_CONN_[IN,OUT]) 68 (ISDN_TTY_FAX_CONN_[IN,OUT])
69 69
70 - code 70 - code
71 Commands from LL to HL; possible constants : 71 Commands from LL to HL; possible constants :
72 ISDN_TTY_FAX_DR signals +FDR command to HL 72 ISDN_TTY_FAX_DR signals +FDR command to HL
73 73
74 ISDN_TTY_FAX_DT signals +FDT command to HL 74 ISDN_TTY_FAX_DT signals +FDT command to HL
75 75
76 ISDN_TTY_FAX_ET signals +FET command to HL 76 ISDN_TTY_FAX_ET signals +FET command to HL
77 77
78 78
79 Other than that the "code" is set with the hangup-code value at 79 Other than that the "code" is set with the hangup-code value at
80 the end of connection for the +FHNG message. 80 the end of connection for the +FHNG message.
81 81
82 - r_code 82 - r_code
83 Commands from HL to LL; possible constants : 83 Commands from HL to LL; possible constants :
84 ISDN_TTY_FAX_CFR output of +FCFR message. 84 ISDN_TTY_FAX_CFR output of +FCFR message.
85 85
86 ISDN_TTY_FAX_RID output of remote ID set in r_id[] 86 ISDN_TTY_FAX_RID output of remote ID set in r_id[]
87 (+FCSI/+FTSI on send/receive) 87 (+FCSI/+FTSI on send/receive)
88 88
89 ISDN_TTY_FAX_DCS output of +FDCS and CONNECT message, 89 ISDN_TTY_FAX_DCS output of +FDCS and CONNECT message,
90 switching to phase C. 90 switching to phase C.
91 91
92 ISDN_TTY_FAX_ET signals end of data, 92 ISDN_TTY_FAX_ET signals end of data,
93 switching to phase D. 93 switching to phase D.
94 94
95 ISDN_TTY_FAX_FCON signals the established, outgoing connection, 95 ISDN_TTY_FAX_FCON signals the established, outgoing connection,
96 switching to phase B. 96 switching to phase B.
97 97
98 ISDN_TTY_FAX_FCON_I signals the established, incoming connection, 98 ISDN_TTY_FAX_FCON_I signals the established, incoming connection,
99 switching to phase B. 99 switching to phase B.
100 100
101 ISDN_TTY_FAX_DIS output of +FDIS message and values. 101 ISDN_TTY_FAX_DIS output of +FDIS message and values.
102 102
103 ISDN_TTY_FAX_SENT signals that all data has been sent 103 ISDN_TTY_FAX_SENT signals that all data has been sent
104 and <DLE><ETX> is acknowledged, 104 and <DLE><ETX> is acknowledged,
105 OK message will be sent. 105 OK message will be sent.
106 106
107 ISDN_TTY_FAX_PTS signals a msg-confirmation (page sent successful), 107 ISDN_TTY_FAX_PTS signals a msg-confirmation (page sent successful),
108 depending on fet value: 108 depending on fet value:
109 0: output OK message (more pages follow) 109 0: output OK message (more pages follow)
110 1: switching to phase B (next document) 110 1: switching to phase B (next document)
111 111
112 ISDN_TTY_FAX_TRAIN_OK output of +FDCS and OK message (for receive mode). 112 ISDN_TTY_FAX_TRAIN_OK output of +FDCS and OK message (for receive mode).
113 113
114 ISDN_TTY_FAX_EOP signals end of data in receive mode, 114 ISDN_TTY_FAX_EOP signals end of data in receive mode,
115 switching to phase D. 115 switching to phase D.
116 116
117 ISDN_TTY_FAX_HNG output of the +FHNG and value set by code and 117 ISDN_TTY_FAX_HNG output of the +FHNG and value set by code and
118 OK message, switching to phase E. 118 OK message, switching to phase E.
119 119
120 120
121 - badlin 121 - badlin
122 Value of +FBADLIN 122 Value of +FBADLIN
123 123
124 - badmul 124 - badmul
125 Value of +FBADMUL 125 Value of +FBADMUL
126 126
127 - bor 127 - bor
128 Value of +FBOR 128 Value of +FBOR
129 129
130 - fet 130 - fet
131 Value of +FET command in send-mode. 131 Value of +FET command in send-mode.
132 Set by HL in receive-mode for +FET message. 132 Set by HL in receive-mode for +FET message.
133 133
134 - pollid[] 134 - pollid[]
135 ID-string, set by +FCIG 135 ID-string, set by +FCIG
136 136
137 - cq 137 - cq
138 Value of +FCQ 138 Value of +FCQ
139 139
140 - cr 140 - cr
141 Value of +FCR 141 Value of +FCR
142 142
143 - ctcrty 143 - ctcrty
144 Value of +FCTCRTY 144 Value of +FCTCRTY
145 145
146 - minsp 146 - minsp
147 Value of +FMINSP 147 Value of +FMINSP
148 148
149 - phcto 149 - phcto
150 Value of +FPHCTO 150 Value of +FPHCTO
151 151
152 - rel 152 - rel
153 Value of +FREL 153 Value of +FREL
154 154
155 - nbc 155 - nbc
156 Value of +FNBC (0,1) 156 Value of +FNBC (0,1)
157 (+FNBC is not a known class 2 fax command, I added this to change the 157 (+FNBC is not a known class 2 fax command, I added this to change the
158 automatic "best capabilities" connection in the eicon HL-driver) 158 automatic "best capabilities" connection in the eicon HL-driver)
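
The phase changes attached to the r_code values in the member list can be summarised as a small transition function. This is only a sketch of what the list states: the enum names and values below are placeholders for this example, not the ISDN_FAX_PHASE_* and ISDN_TTY_FAX_* definitions from isdnif.h, and r_code values for which no phase change is documented leave the phase untouched.

```c
#include <assert.h>

/* Placeholder names and values, for this sketch only. */
enum sk_phase {
    SK_PHASE_IDLE, SK_PHASE_A, SK_PHASE_B,
    SK_PHASE_C, SK_PHASE_D, SK_PHASE_E
};

enum sk_rcode {
    SK_CFR, SK_RID, SK_DCS, SK_ET, SK_FCON, SK_FCON_I,
    SK_DIS, SK_SENT, SK_PTS, SK_TRAIN_OK, SK_EOP, SK_HNG
};

/*
 * Return the phase that follows an r_code, based only on the transitions
 * written down in the r_code description. 'fet' matters for SK_PTS only;
 * fet values other than 1 keep the current phase.
 */
static enum sk_phase phase_after(enum sk_phase cur, enum sk_rcode rc,
                                 unsigned char fet)
{
    assert(cur <= SK_PHASE_E);

    switch (rc) {
    case SK_FCON:
    case SK_FCON_I:
        return SK_PHASE_B;      /* connection established */
    case SK_DCS:
        return SK_PHASE_C;      /* +FDCS and CONNECT */
    case SK_ET:
    case SK_EOP:
        return SK_PHASE_D;      /* end of data */
    case SK_HNG:
        return SK_PHASE_E;      /* hangup */
    case SK_PTS:
        return fet == 1 ? SK_PHASE_B : cur; /* next document or more pages */
    default:
        return cur;             /* no phase change documented */
    }
}
```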
159 159
160 160
161 Armin 161 Armin
162 mac@melware.de 162 mac@melware.de
163 163
164 164
Documentation/isdn/README.hysdn
1 $Id: README.hysdn,v 1.3.6.1 2001/02/10 14:41:19 kai Exp $ 1 $Id: README.hysdn,v 1.3.6.1 2001/02/10 14:41:19 kai Exp $
2 The hysdn driver has been written by 2 The hysdn driver has been written by
3 by Werner Cornelius (werner@isdn4linux.de or werner@titro.de) 3 Werner Cornelius (werner@isdn4linux.de or werner@titro.de)
4 for Hypercope GmbH Aachen Germany. Hypercope agreed to publish this driver 4 for Hypercope GmbH Aachen Germany. Hypercope agreed to publish this driver
5 under the GNU General Public License. 5 under the GNU General Public License.
6 6
7 The CAPI 2.0-support was added by Ulrich Albrecht (ualbrecht@hypercope.de) 7 The CAPI 2.0-support was added by Ulrich Albrecht (ualbrecht@hypercope.de)
8 for Hypercope GmbH Aachen, Germany. 8 for Hypercope GmbH Aachen, Germany.
9 9
10 10
11 This program is free software; you can redistribute it and/or modify 11 This program is free software; you can redistribute it and/or modify
12 it under the terms of the GNU General Public License as published by 12 it under the terms of the GNU General Public License as published by
13 the Free Software Foundation; either version 2 of the License, or 13 the Free Software Foundation; either version 2 of the License, or
14 (at your option) any later version. 14 (at your option) any later version.
15 15
16 This program is distributed in the hope that it will be useful, 16 This program is distributed in the hope that it will be useful,
17 but WITHOUT ANY WARRANTY; without even the implied warranty of 17 but WITHOUT ANY WARRANTY; without even the implied warranty of
18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 18 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19 GNU General Public License for more details. 19 GNU General Public License for more details.
20 20
21 You should have received a copy of the GNU General Public License 21 You should have received a copy of the GNU General Public License
22 along with this program; if not, write to the Free Software 22 along with this program; if not, write to the Free Software
23 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 23 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
24 24
25 Table of contents 25 Table of contents
26 ================= 26 =================
27 27
28 1. About the driver 28 1. About the driver
29 29
30 2. Loading/Unloading the driver 30 2. Loading/Unloading the driver
31 31
32 3. Entries in the /proc filesystem 32 3. Entries in the /proc filesystem
33 33
34 4. The /proc/net/hysdn/cardconfX file 34 4. The /proc/net/hysdn/cardconfX file
35 35
36 5. The /proc/net/hysdn/cardlogX file 36 5. The /proc/net/hysdn/cardlogX file
37 37
38 6. Where to get additional info and help 38 6. Where to get additional info and help
39 39
40 40
41 1. About the driver 41 1. About the driver
42 42
43 The drivers/isdn/hysdn subdir contains a driver for HYPERCOPEs active 43 The drivers/isdn/hysdn subdir contains a driver for HYPERCOPEs active
44 PCI isdn cards Champ, Ergo and Metro. To enable support for these cards 44 PCI isdn cards Champ, Ergo and Metro. To enable support for these cards
45 enable ISDN support in the kernel config and support for HYSDN cards in 45 enable ISDN support in the kernel config and support for HYSDN cards in
46 the active cards submenu. The driver may only be compiled and used if 46 the active cards submenu. The driver may only be compiled and used if
47 support for loadable modules and the process filesystem have been enabled. 47 support for loadable modules and the process filesystem have been enabled.
48 48
49 These cards provide two different interfaces to the kernel. Without the 49 These cards provide two different interfaces to the kernel. Without the
50 optional CAPI 2.0 support, they register as an ethernet card. IP-routing 50 optional CAPI 2.0 support, they register as an ethernet card. IP-routing
51 to an ISDN-destination is performed on the card itself. All necessary 51 to an ISDN-destination is performed on the card itself. All necessary
52 handlers for various protocols like ppp and others as well as config info 52 handlers for various protocols like ppp and others as well as config info
53 and firmware may be fetched from Hypercopes WWW-Site www.hypercope.de. 53 and firmware may be fetched from Hypercopes WWW-Site www.hypercope.de.
54 54
55 With CAPI 2.0 support enabled, the card can also be used as a CAPI 2.0 55 With CAPI 2.0 support enabled, the card can also be used as a CAPI 2.0
56 compliant device with either CAPI 2.0 applications 56 compliant device with either CAPI 2.0 applications
57 (check isdn4k-utils) or -using the capidrv module- as a regular 57 (check isdn4k-utils) or -using the capidrv module- as a regular
58 isdn4linux device. This is done via the same mechanism as with the 58 isdn4linux device. This is done via the same mechanism as with the
59 active AVM cards and in fact uses the same module. 59 active AVM cards and in fact uses the same module.
60 60
61 61
62 2. Loading/Unloading the driver 62 2. Loading/Unloading the driver
63 63
64 The module has no command line parameters and auto detects up to 10 cards 64 The module has no command line parameters and auto detects up to 10 cards
65 in the id-range 0-9. 65 in the id-range 0-9.
66 If a loaded driver shall be unloaded all open files in the /proc/net/hysdn 66 If a loaded driver shall be unloaded all open files in the /proc/net/hysdn
67 subdir need to be closed and all ethernet interfaces allocated by this 67 subdir need to be closed and all ethernet interfaces allocated by this
68 driver must be shut down. Otherwise the module counter will avoid a module 68 driver must be shut down. Otherwise the module counter will avoid a module
69 unload. 69 unload.
70 70
71 If you are using the CAPI 2.0-interface, make sure to load/modprobe the 71 If you are using the CAPI 2.0-interface, make sure to load/modprobe the
72 kernelcapi-module first. 72 kernelcapi-module first.
73 73
74 If you plan to use the capidrv-link to isdn4linux, make sure to load 74 If you plan to use the capidrv-link to isdn4linux, make sure to load
75 capidrv.o after all modules using this driver (i.e. after hysdn and 75 capidrv.o after all modules using this driver (i.e. after hysdn and
76 any avm-specific modules). 76 any avm-specific modules).
77 77
78 3. Entries in the /proc filesystem 78 3. Entries in the /proc filesystem
79 79
80 When the module has been loaded it adds the directory hysdn in the 80 When the module has been loaded it adds the directory hysdn in the
81 /proc/net tree. This directory contains exactly 2 file entries for each 81 /proc/net tree. This directory contains exactly 2 file entries for each
82 card. One is called cardconfX and the other cardlogX, where X is the 82 card. One is called cardconfX and the other cardlogX, where X is the
83 card id number from 0 to 9. 83 card id number from 0 to 9.
84 The cards are numbered in the order found in the PCI config data. 84 The cards are numbered in the order found in the PCI config data.
85 85
86 4. The /proc/net/hysdn/cardconfX file 86 4. The /proc/net/hysdn/cardconfX file
87 87
88 This file may be read by everyone to get info about the card's type, 88 This file may be read by everyone to get info about the card's type,
89 actual state, available features and used resources. 89 actual state, available features and used resources.
90 The first 3 entries (id, bus and slot) are PCI info fields, the following 90 The first 3 entries (id, bus and slot) are PCI info fields, the following
91 type field gives the information about the card's type: 91 type field gives the information about the card's type:
92 92
93 4 -> Ergo card (server card with 2 b-chans) 93 4 -> Ergo card (server card with 2 b-chans)
94 5 -> Metro card (server card with 4 or 8 b-chans) 94 5 -> Metro card (server card with 4 or 8 b-chans)
95 6 -> Champ card (client card with 2 b-chans) 95 6 -> Champ card (client card with 2 b-chans)
96 96
97 The following 3 fields show the hardware assignments for irq, iobase and the 97 The following 3 fields show the hardware assignments for irq, iobase and the
98 dual ported memory (dp-mem). 98 dual ported memory (dp-mem).
99 The fields b-chans and fax-chans announce the available card resources of 99 The fields b-chans and fax-chans announce the available card resources of
100 these types for the user. 100 these types for the user.
101 The state variable indicates the actual drivers state for this card with the 101 The state variable indicates the actual drivers state for this card with the
102 following assignments. 102 following assignments.
103 103
104 0 -> card has not been booted since driver load 104 0 -> card has not been booted since driver load
105 1 -> card booting is actually in progress 105 1 -> card booting is actually in progress
106 2 -> card is in an error state due to a previous boot failure 106 2 -> card is in an error state due to a previous boot failure
107 3 -> card is booted and active 107 3 -> card is booted and active
108 108
109 And the last field (device) shows the name of the ethernet device assigned 109 And the last field (device) shows the name of the ethernet device assigned
110 to this card. Up to the first successful boot this field only shows a - 110 to this card. Up to the first successful boot this field only shows a -
111 to tell that no net device has been allocated up to now. Once a net device 111 to tell that no net device has been allocated up to now. Once a net device
112 has been allocated it remains assigned to this card, even if a card is 112 has been allocated it remains assigned to this card, even if a card is
113 rebooted and a boot error occurs. 113 rebooted and a boot error occurs.
114 114
115 Writing to the cardconfX file boots the card or transfers config lines to 115 Writing to the cardconfX file boots the card or transfers config lines to
116 the cards firmware. The type of data is automatically detected when the 116 the cards firmware. The type of data is automatically detected when the
117 first data is written. Only root has write access to this file. 117 first data is written. Only root has write access to this file.
118 The firmware boot files are normally called hyclient.pof for client cards 118 The firmware boot files are normally called hyclient.pof for client cards
119 and hyserver.pof for server cards. 119 and hyserver.pof for server cards.
120 After successfully writing the boot file, complete config files or single 120 After successfully writing the boot file, complete config files or single
121 config lines may be copied to this file. 121 config lines may be copied to this file.
122 If an error occurs the return value given to the writing process has the 122 If an error occurs the return value given to the writing process has the
123 following additional codes (decimal): 123 following additional codes (decimal):
124 124
125 1000 Another process is currently booting the card 125 1000 Another process is currently booting the card
126 1001 Invalid firmware header 126 1001 Invalid firmware header
127 1002 Boards dual-port RAM test failed 127 1002 Boards dual-port RAM test failed
128 1003 Internal firmware handler error 128 1003 Internal firmware handler error
129 1004 Boot image size invalid 129 1004 Boot image size invalid
130 1005 First boot stage (bootstrap loader) failed 130 1005 First boot stage (bootstrap loader) failed
131 1006 Second boot stage failure 131 1006 Second boot stage failure
132 1007 Timeout waiting for card ready during boot 132 1007 Timeout waiting for card ready during boot
133 1008 Operation only allowed in booted state 133 1008 Operation only allowed in booted state
134 1009 Config line too long 134 1009 Config line too long
135 1010 Invalid channel number 135 1010 Invalid channel number
136 1011 Timeout sending config data 136 1011 Timeout sending config data
137 137
138 Additional info about error reasons may be fetched from the log output. 138 Additional info about error reasons may be fetched from the log output.
139 139
140 5. The /proc/net/hysdn/cardlogX file 140 5. The /proc/net/hysdn/cardlogX file
141 141
142 The cardlogX file entry may be opened multiple times for reading by everyone to 142 The cardlogX file entry may be opened multiple times for reading by everyone to
143 get the cards and drivers log data. Card messages always start with the 143 get the cards and drivers log data. Card messages always start with the
144 keyword LOG. All other lines are output from the driver. 144 keyword LOG. All other lines are output from the driver.
145 The driver log data may be redirected to the syslog by selecting the 145 The driver log data may be redirected to the syslog by selecting the
146 appropriate bitmask. The cards log messages will always be sent to this 146 appropriate bitmask. The cards log messages will always be sent to this
147 interface but never to the syslog. 147 interface but never to the syslog.
148 148
149 A root user may write a decimal or hex (with 0x) value to this file to select 149 A root user may write a decimal or hex (with 0x) value to this file to select
150 desired output options. As mentioned above the cards log data is always 150 desired output options. As mentioned above the cards log data is always
151 written to the cardlog file independent of the following options only used 151 written to the cardlog file independent of the following options only used
152 to check and debug the driver itself: 152 to check and debug the driver itself:
153 153
154 For example: 154 For example:
155 echo "0x34560078" > /proc/net/hysdn/cardlog0 155 echo "0x34560078" > /proc/net/hysdn/cardlog0
156 to output the hex log mask 34560078 for card 0. 156 to output the hex log mask 34560078 for card 0.
157 157
158 The written value is regarded as an unsigned 32-Bit value, bit ored for 158 The written value is regarded as an unsigned 32-Bit value, bit ored for
159 desired output. The following bits are already assigned: 159 desired output. The following bits are already assigned:
160 160
161 0x80000000 All driver log data is alternatively via syslog 161 0x80000000 All driver log data is alternatively via syslog
162 0x00000001 Log memory allocation errors 162 0x00000001 Log memory allocation errors
163 0x00000010 Firmware load start and close are logged 163 0x00000010 Firmware load start and close are logged
164 0x00000020 Log firmware record parser 164 0x00000020 Log firmware record parser
165 0x00000040 Log every firmware write action 165 0x00000040 Log every firmware write action
166 0x00000080 Log all card related boot messages 166 0x00000080 Log all card related boot messages
167 0x00000100 Output all config data sent for debugging purposes 167 0x00000100 Output all config data sent for debugging purposes
168 0x00000200 Only non comment config lines are shown with channel 168 0x00000200 Only non comment config lines are shown with channel
169 0x00000400 Additional conf log output 169 0x00000400 Additional conf log output
170 0x00001000 Log the asynchronous scheduler actions (config and log) 170 0x00001000 Log the asynchronous scheduler actions (config and log)
171 0x00100000 Log all open and close actions to /proc/net/hysdn/card files 171 0x00100000 Log all open and close actions to /proc/net/hysdn/card files
172 0x00200000 Log all actions from /proc file entries 172 0x00200000 Log all actions from /proc file entries
173 0x00010000 Log network interface init and deinit 173 0x00010000 Log network interface init and deinit
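
The bit assignments above can be combined or decoded with a few lines of
Python. This is only an illustrative sketch: the bit values are copied from
the table, but the dictionary, the function names and the shortened
descriptions are hypothetical and are not taken from the driver source.

```python
# Bit values copied from the table above. The names LOG_BITS,
# build_log_mask and decode_log_mask are hypothetical helpers.
LOG_BITS = {
    0x80000000: "driver log data also via syslog",
    0x00000001: "memory allocation errors",
    0x00000010: "firmware load start and close",
    0x00000020: "firmware record parser",
    0x00000040: "firmware write actions",
    0x00000080: "card related boot messages",
    0x00000100: "all config data sent",
    0x00000200: "non comment config lines with channel",
    0x00000400: "additional conf log output",
    0x00001000: "asynchronous scheduler actions",
    0x00010000: "network interface init and deinit",
    0x00100000: "open and close actions on /proc/net/hysdn/card files",
    0x00200000: "actions from /proc file entries",
}


def build_log_mask(*bits):
    """OR the given documented bits into one unsigned 32-bit mask."""
    mask = 0
    for bit in bits:
        if bit not in LOG_BITS:
            raise ValueError("bit 0x%08x is not in the documented table" % bit)
        mask |= bit
    return mask


def decode_log_mask(value):
    """Return (descriptions of documented bits set, remaining unassigned bits)."""
    value &= 0xFFFFFFFF
    assigned = 0
    for bit in LOG_BITS:
        assigned |= bit
    known = [LOG_BITS[bit] for bit in sorted(LOG_BITS) if value & bit]
    return known, value & ~assigned


if __name__ == "__main__":
    mask = build_log_mask(0x80000000, 0x00000080)
    # The resulting string is what would be echoed into the cardlogX file.
    print("0x%08x" % mask)
```

Decoding the example value 0x34560078 with decode_log_mask() shows that only
some of its bits appear in the table; the remaining bits are returned
separately so that unassigned bits are easy to spot.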
174 174
175 6. Where to get additional info and help 175 6. Where to get additional info and help
176 176
177 If you have any problems concerning the driver or configuration contact 177 If you have any problems concerning the driver or configuration contact
178 the Hypercope support team (support@hypercope.de) and/or the authors 178 the Hypercope support team (support@hypercope.de) and/or the authors
179 Werner Cornelius (werner@isdn4linux or cornelius@titro.de) or 179 Werner Cornelius (werner@isdn4linux or cornelius@titro.de) or
180 Ulrich Albrecht (ualbrecht@hypercope.de). 180 Ulrich Albrecht (ualbrecht@hypercope.de).
181 181
182 182
183 183
184 184
185 185
186 186
187 187
188 188
189 189
190 190
191 191
192 192
193 193
194 194
195 195
196 196
Documentation/kdump/kdump.txt
1 ================================================================ 1 ================================================================
2 Documentation for Kdump - The kexec-based Crash Dumping Solution 2 Documentation for Kdump - The kexec-based Crash Dumping Solution
3 ================================================================ 3 ================================================================
4 4
5 This document includes overview, setup and installation, and analysis 5 This document includes overview, setup and installation, and analysis
6 information. 6 information.
7 7
8 Overview 8 Overview
9 ======== 9 ========
10 10
11 Kdump uses kexec to quickly boot to a dump-capture kernel whenever a 11 Kdump uses kexec to quickly boot to a dump-capture kernel whenever a
12 dump of the system kernel's memory needs to be taken (for example, when 12 dump of the system kernel's memory needs to be taken (for example, when
13 the system panics). The system kernel's memory image is preserved across 13 the system panics). The system kernel's memory image is preserved across
14 the reboot and is accessible to the dump-capture kernel. 14 the reboot and is accessible to the dump-capture kernel.
15 15
16 You can use common Linux commands, such as cp and scp, to copy the 16 You can use common Linux commands, such as cp and scp, to copy the
17 memory image to a dump file on the local disk, or across the network to 17 memory image to a dump file on the local disk, or across the network to
18 a remote system. 18 a remote system.
19 19
20 Kdump and kexec are currently supported on the x86, x86_64, and ppc64 20 Kdump and kexec are currently supported on the x86, x86_64, and ppc64
21 architectures. 21 architectures.
22 22
23 When the system kernel boots, it reserves a small section of memory for 23 When the system kernel boots, it reserves a small section of memory for
24 the dump-capture kernel. This ensures that ongoing Direct Memory Access 24 the dump-capture kernel. This ensures that ongoing Direct Memory Access
25 (DMA) from the system kernel does not corrupt the dump-capture kernel. 25 (DMA) from the system kernel does not corrupt the dump-capture kernel.
26 The kexec -p command loads the dump-capture kernel into this reserved 26 The kexec -p command loads the dump-capture kernel into this reserved
27 memory. 27 memory.
28 28
29 On x86 machines, the first 640 KB of physical memory is needed to boot, 29 On x86 machines, the first 640 KB of physical memory is needed to boot,
30 regardless of where the kernel loads. Therefore, kexec backs up this 30 regardless of where the kernel loads. Therefore, kexec backs up this
31 region just before rebooting into the dump-capture kernel. 31 region just before rebooting into the dump-capture kernel.
32 32
33 All of the necessary information about the system kernel's core image is 33 All of the necessary information about the system kernel's core image is
34 encoded in the ELF format, and stored in a reserved area of memory 34 encoded in the ELF format, and stored in a reserved area of memory
35 before a crash. The physical address of the start of the ELF header is 35 before a crash. The physical address of the start of the ELF header is
36 passed to the dump-capture kernel through the elfcorehdr= boot 36 passed to the dump-capture kernel through the elfcorehdr= boot
37 parameter. 37 parameter.
38 38
39 With the dump-capture kernel, you can access the memory image, or "old 39 With the dump-capture kernel, you can access the memory image, or "old
40 memory," in two ways: 40 memory," in two ways:
41 41
42 - Through a /dev/oldmem device interface. A capture utility can read the 42 - Through a /dev/oldmem device interface. A capture utility can read the
43 device file and write out the memory in raw format. This is a raw dump 43 device file and write out the memory in raw format. This is a raw dump
44 of memory. Analysis and capture tools must be intelligent enough to 44 of memory. Analysis and capture tools must be intelligent enough to
45 determine where to look for the right information. 45 determine where to look for the right information.
46 46
47 - Through /proc/vmcore. This exports the dump as an ELF-format file that 47 - Through /proc/vmcore. This exports the dump as an ELF-format file that
48 you can write out using file copy commands such as cp or scp. Further, 48 you can write out using file copy commands such as cp or scp. Further,
49 you can use analysis tools such as the GNU Debugger (GDB) and the Crash 49 you can use analysis tools such as the GNU Debugger (GDB) and the Crash
50 tool to debug the dump file. This method ensures that the dump pages are 50 tool to debug the dump file. This method ensures that the dump pages are
51 correctly ordered. 51 correctly ordered.
52 52
53 53
54 Setup and Installation 54 Setup and Installation
55 ====================== 55 ======================
56 56
57 Install kexec-tools and the Kdump patch 57 Install kexec-tools and the Kdump patch
58 --------------------------------------- 58 ---------------------------------------
59 59
60 1) Log in as the root user. 60 1) Log in as the root user.
61 61
62 2) Download the kexec-tools user-space package from the following URL: 62 2) Download the kexec-tools user-space package from the following URL:
63 63
64 http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz 64 http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.101.tar.gz
65 65
66 3) Unpack the tarball with the tar command, as follows: 66 3) Unpack the tarball with the tar command, as follows:
67 67
68 tar xvpzf kexec-tools-1.101.tar.gz 68 tar xvpzf kexec-tools-1.101.tar.gz
69 69
70 4) Download the latest consolidated Kdump patch from the following URL: 70 4) Download the latest consolidated Kdump patch from the following URL:
71 71
72 http://lse.sourceforge.net/kdump/ 72 http://lse.sourceforge.net/kdump/
73 73
74 (This location is being used until all the user-space Kdump patches 74 (This location is being used until all the user-space Kdump patches
75 are integrated with the kexec-tools package.) 75 are integrated with the kexec-tools package.)
76 76
77 5) Change to the kexec-tools-1.101 directory, as follows: 77 5) Change to the kexec-tools-1.101 directory, as follows:
78 78
79 cd kexec-tools-1.101 79 cd kexec-tools-1.101
80 80
81 6) Apply the consolidated patch to the kexec-tools-1.101 source tree 81 6) Apply the consolidated patch to the kexec-tools-1.101 source tree
82 with the patch command, as follows. (Modify the path to the downloaded 82 with the patch command, as follows. (Modify the path to the downloaded
83 patch as necessary.) 83 patch as necessary.)
84 84
85 patch -p1 < /path-to-kdump-patch/kexec-tools-1.101-kdump.patch 85 patch -p1 < /path-to-kdump-patch/kexec-tools-1.101-kdump.patch
86 86
87 7) Configure the package, as follows: 87 7) Configure the package, as follows:
88 88
89 ./configure 89 ./configure
90 90
91 8) Compile the package, as follows: 91 8) Compile the package, as follows:
92 92
93 make 93 make
94 94
95 9) Install the package, as follows: 95 9) Install the package, as follows:
96 96
97 make install 97 make install
98 98
99 99
100 Download and build the system and dump-capture kernels 100 Download and build the system and dump-capture kernels
101 ------------------------------------------------------ 101 ------------------------------------------------------
102 102
103 Download the mainline (vanilla) kernel source code (2.6.13-rc1 or newer) 103 Download the mainline (vanilla) kernel source code (2.6.13-rc1 or newer)
104 from http://www.kernel.org. Two kernels must be built: a system kernel 104 from http://www.kernel.org. Two kernels must be built: a system kernel
105 and a dump-capture kernel. Use the following steps to configure these 105 and a dump-capture kernel. Use the following steps to configure these
106 kernels with the necessary kexec and Kdump features: 106 kernels with the necessary kexec and Kdump features:
107 107
108 System kernel 108 System kernel
109 ------------- 109 -------------
110 110
111 1) Enable "kexec system call" in "Processor type and features." 111 1) Enable "kexec system call" in "Processor type and features."
112 112
113 CONFIG_KEXEC=y 113 CONFIG_KEXEC=y
114 114
115 2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo 115 2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
116 filesystems." This is usually enabled by default. 116 filesystems." This is usually enabled by default.
117 117
118 CONFIG_SYSFS=y 118 CONFIG_SYSFS=y
119 119
120 Note that "sysfs file system support" might not appear in the "Pseudo 120 Note that "sysfs file system support" might not appear in the "Pseudo
121 filesystems" menu if "Configure standard kernel features (for small 121 filesystems" menu if "Configure standard kernel features (for small
122 systems)" is not enabled in "General Setup." In this case, check the 122 systems)" is not enabled in "General Setup." In this case, check the
123 .config file itself to ensure that sysfs is turned on, as follows: 123 .config file itself to ensure that sysfs is turned on, as follows:
124 124
125 grep 'CONFIG_SYSFS' .config 125 grep 'CONFIG_SYSFS' .config
126 126
127 3) Enable "Compile the kernel with debug info" in "Kernel hacking." 127 3) Enable "Compile the kernel with debug info" in "Kernel hacking."
128 128
129 CONFIG_DEBUG_INFO=y 129 CONFIG_DEBUG_INFO=y
130 130
131 This causes the kernel to be built with debug symbols. The dump 131 This causes the kernel to be built with debug symbols. The dump
132 analysis tools require a vmlinux with debug symbols in order to read 132 analysis tools require a vmlinux with debug symbols in order to read
133 and analyze a dump file. 133 and analyze a dump file.
134 134
135 4) Make and install the kernel and its modules. Update the boot loader 135 4) Make and install the kernel and its modules. Update the boot loader
136 (such as grub, yaboot, or lilo) configuration files as necessary. 136 (such as grub, yaboot, or lilo) configuration files as necessary.
137 137
138 5) Boot the system kernel with the boot parameter "crashkernel=Y@X", 138 5) Boot the system kernel with the boot parameter "crashkernel=Y@X",
139 where Y specifies how much memory to reserve for the dump-capture kernel 139 where Y specifies how much memory to reserve for the dump-capture kernel
140 and X specifies the beginning of this reserved memory. For example, 140 and X specifies the beginning of this reserved memory. For example,
141 "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory 141 "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
142 starting at physical address 0x01000000 for the dump-capture kernel. 142 starting at physical address 0x01000000 for the dump-capture kernel.
143 143
144 On x86 and x86_64, use "crashkernel=64M@16M". 144 On x86 and x86_64, use "crashkernel=64M@16M".
145 145
146 On ppc64, use "crashkernel=128M@32M". 146 On ppc64, use "crashkernel=128M@32M".
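
To make the arithmetic behind "crashkernel=Y@X" concrete, the following
Python sketch converts the parameter into a size and a start address in
bytes. It is an illustration only, not the kernel's own parser: the function
names are hypothetical, and accepting only plain decimal numbers with an
optional K, M or G suffix is an assumption made for this example.

```python
import re

# Multipliers for the size suffixes assumed by this sketch.
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a string such as '64M' into a number of bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", text.strip().upper())
    if match is None:
        raise ValueError("unsupported size: %r" % text)
    return int(match.group(1)) * _UNITS[match.group(2)]


def parse_crashkernel(param):
    """Split 'crashkernel=Y@X' into (reserved size, start address) in bytes."""
    name, _, value = param.partition("=")
    if name != "crashkernel" or "@" not in value:
        raise ValueError("expected crashkernel=Y@X, got %r" % param)
    size_text, start_text = value.split("@", 1)
    return parse_size(size_text), parse_size(start_text)


if __name__ == "__main__":
    size, start = parse_crashkernel("crashkernel=64M@16M")
    print("reserve %d MB starting at 0x%08x" % (size >> 20, start))
```

For "crashkernel=64M@16M" the start address comes out as 0x01000000, which
matches the physical address quoted in step 5.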
147 147
148 148
149 The dump-capture kernel 149 The dump-capture kernel
150 ----------------------- 150 -----------------------
151 151
152 1) Under "General setup," append "-kdump" to the current string in 152 1) Under "General setup," append "-kdump" to the current string in
153 "Local version." 153 "Local version."
154 154
155 2) On x86, enable high memory support under "Processor type and 155 2) On x86, enable high memory support under "Processor type and
156 features": 156 features":
157 157
158 CONFIG_HIGHMEM64G=y 158 CONFIG_HIGHMEM64G=y
159 or 159 or
160 CONFIG_HIGHMEM4G=y 160 CONFIG_HIGHMEM4G=y
161 161
162 3) On x86 and x86_64, disable symmetric multi-processing support 162 3) On x86 and x86_64, disable symmetric multi-processing support
163 under "Processor type and features": 163 under "Processor type and features":
164 164
165 CONFIG_SMP=n 165 CONFIG_SMP=n
166 (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line 166 (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line
167 when loading the dump-capture kernel, see section "Load the Dump-capture 167 when loading the dump-capture kernel, see section "Load the Dump-capture
168 Kernel".) 168 Kernel".)
169 169
170 4) On ppc64, disable NUMA support and enable EMBEDDED support: 170 4) On ppc64, disable NUMA support and enable EMBEDDED support:
171 171
172 CONFIG_NUMA=n 172 CONFIG_NUMA=n
173 CONFIG_EMBEDDED=y 173 CONFIG_EMBEDDED=y
174 CONFIG_EEH=n for the dump-capture kernel 174 CONFIG_EEH=n for the dump-capture kernel
175 175
176 5) Enable "kernel crash dumps" support under "Processor type and 176 5) Enable "kernel crash dumps" support under "Processor type and
177 features": 177 features":
178 178
179 CONFIG_CRASH_DUMP=y 179 CONFIG_CRASH_DUMP=y
180 180
181 6) Use a suitable value for "Physical address where the kernel is 181 6) Use a suitable value for "Physical address where the kernel is
182 loaded" (under "Processor type and features"). This only appears when 182 loaded" (under "Processor type and features"). This only appears when
183 "kernel crash dumps" is enabled. By default this value is 0x1000000 183 "kernel crash dumps" is enabled. By default this value is 0x1000000
184 (16MB). It should be the same as X in the "crashkernel=Y@X" boot 184 (16MB). It should be the same as X in the "crashkernel=Y@X" boot
185 parameter discussed above. 185 parameter discussed above.
186 186
187 On x86 and x86_64, use "CONFIG_PHYSICAL_START=0x1000000". 187 On x86 and x86_64, use "CONFIG_PHYSICAL_START=0x1000000".
188 188
189 On ppc64 the value is automatically set at 32MB when 189 On ppc64 the value is automatically set at 32MB when
190 CONFIG_CRASH_DUMP is set. 190 CONFIG_CRASH_DUMP is set.
191 191
192 7) Optionally enable "/proc/vmcore support" under "Filesystems" -> 192 7) Optionally enable "/proc/vmcore support" under "Filesystems" ->
193 "Pseudo filesystems". 193 "Pseudo filesystems".
194 194
195 CONFIG_PROC_VMCORE=y 195 CONFIG_PROC_VMCORE=y
196 (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.) 196 (CONFIG_PROC_VMCORE is set by default when CONFIG_CRASH_DUMP is selected.)
197 197
198 8) Make and install the kernel and its modules. DO NOT add this kernel 198 8) Make and install the kernel and its modules. DO NOT add this kernel
199 to the boot loader configuration files. 199 to the boot loader configuration files.
200 200
201 201
202 Load the Dump-capture Kernel 202 Load the Dump-capture Kernel
203 ============================ 203 ============================
204 204
205 After booting to the system kernel, load the dump-capture kernel using 205 After booting to the system kernel, load the dump-capture kernel using
206 the following command: 206 the following command:
207 207
208 kexec -p <dump-capture-kernel> \ 208 kexec -p <dump-capture-kernel> \
209 --initrd=<initrd-for-dump-capture-kernel> --args-linux \ 209 --initrd=<initrd-for-dump-capture-kernel> --args-linux \
210 --append="root=<root-dev> init 1 irqpoll" 210 --append="root=<root-dev> init 1 irqpoll"
211 211
212 212
213 Notes on loading the dump-capture kernel: 213 Notes on loading the dump-capture kernel:
214 214
215 * <dump-capture-kernel> must be a vmlinux image (that is, an 215 * <dump-capture-kernel> must be a vmlinux image (that is, an
216 uncompressed ELF image). bzImage does not work at this time. 216 uncompressed ELF image). bzImage does not work at this time.
217 217
218 * By default, the ELF headers are stored in ELF64 format to support 218 * By default, the ELF headers are stored in ELF64 format to support
219 systems with more than 4GB memory. The --elf32-core-headers option can 219 systems with more than 4GB memory. The --elf32-core-headers option can
220 be used to force the generation of ELF32 headers. This is necessary 220 be used to force the generation of ELF32 headers. This is necessary
221 because GDB currently cannot open vmcore files with ELF64 headers on 221 because GDB currently cannot open vmcore files with ELF64 headers on
222 32-bit systems. ELF32 headers can be used on non-PAE systems (that is, 222 32-bit systems. ELF32 headers can be used on non-PAE systems (that is,
223 less than 4GB of memory). 223 less than 4GB of memory).
224 224
225 * The "irqpoll" boot parameter reduces driver initialization failures 225 * The "irqpoll" boot parameter reduces driver initialization failures
226 due to shared interrupts in the dump-capture kernel. 226 due to shared interrupts in the dump-capture kernel.
227 227
228 * You must specify <root-dev> in the format corresponding to the root 228 * You must specify <root-dev> in the format corresponding to the root
229 device name in the output of the mount command. 229 device name in the output of the mount command.
230 230
231 * "init 1" boots the dump-capture kernel into single-user mode without 231 * "init 1" boots the dump-capture kernel into single-user mode without
232 networking. If you want networking, use "init 3." 232 networking. If you want networking, use "init 3."
233 233
234 234
235 Kernel Panic 235 Kernel Panic
236 ============ 236 ============
237 237
238 After successfully loading the dump-capture kernel as previously 238 After successfully loading the dump-capture kernel as previously
239 described, the system will reboot into the dump-capture kernel if a 239 described, the system will reboot into the dump-capture kernel if a
240 system crash is triggered. Trigger points are located in panic(), 240 system crash is triggered. Trigger points are located in panic(),
241 die(), die_nmi() and in the sysrq handler (ALT-SysRq-c). 241 die(), die_nmi() and in the sysrq handler (ALT-SysRq-c).
242 242
243 The following conditions will execute a crash trigger point: 243 The following conditions will execute a crash trigger point:
244 244
245 If a hard lockup is detected and "NMI watchdog" is configured, the system 245 If a hard lockup is detected and "NMI watchdog" is configured, the system
246 will boot into the dump-capture kernel ( die_nmi() ). 246 will boot into the dump-capture kernel ( die_nmi() ).
247 247
248 If die() is called, and it happens to be a thread with pid 0 or 1, or die() 248 If die() is called, and it happens to be a thread with pid 0 or 1, or die()
249 is called inside interrupt context or die() is called and panic_on_oops is set, 249 is called inside interrupt context or die() is called and panic_on_oops is set,
250 the system will boot into the dump-capture kernel. 250 the system will boot into the dump-capture kernel.
251 251
252 On powerpc systems when a soft-reset is generated, die() is called by all cpus and the system system will boot into the dump-capture kernel. 252 On powerpc systems when a soft-reset is generated, die() is called by all cpus and the system will boot into the dump-capture kernel.
253 253
254 For testing purposes, you can trigger a crash by using "ALT-SysRq-c", 254 For testing purposes, you can trigger a crash by using "ALT-SysRq-c",
255 "echo c > /proc/sysrq-trigger", or write a module to force the panic. 255 "echo c > /proc/sysrq-trigger", or write a module to force the panic.
256 256
257 Write Out the Dump File 257 Write Out the Dump File
258 ======================= 258 =======================
259 259
260 After the dump-capture kernel is booted, write out the dump file with 260 After the dump-capture kernel is booted, write out the dump file with
261 the following command: 261 the following command:
262 262
263 cp /proc/vmcore <dump-file> 263 cp /proc/vmcore <dump-file>
264 264
265 You can also access dumped memory as a /dev/oldmem device for a linear 265 You can also access dumped memory as a /dev/oldmem device for a linear
266 and raw view. To create the device, use the following command: 266 and raw view. To create the device, use the following command:
267 267
268 mknod /dev/oldmem c 1 12 268 mknod /dev/oldmem c 1 12
269 269
270 Use the dd command with suitable options for count, bs, and skip to 270 Use the dd command with suitable options for count, bs, and skip to
271 access specific portions of the dump. 271 access specific portions of the dump.
272 272
273 To see the entire memory, use the following command: 273 To see the entire memory, use the following command:
274 274
275 dd if=/dev/oldmem of=oldmem.001 275 dd if=/dev/oldmem of=oldmem.001
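
When only part of the old memory is of interest, the skip and count values
for dd follow directly from the start address and the length of the range.
The Python sketch below only builds the dd argument list and does not run
anything. The function name, the default block size of 4096 bytes and the
output file name are assumptions chosen for illustration.

```python
def dd_args_for_range(start, length, block_size=4096, out_file="oldmem.part"):
    """Build dd arguments that copy [start, start + length) from /dev/oldmem.

    start and length are in bytes and must be multiples of block_size,
    because dd's skip= and count= operands are counted in blocks of bs=.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if start % block_size or length % block_size:
        raise ValueError("start and length must be multiples of block_size")
    return [
        "dd",
        "if=/dev/oldmem",
        "of=%s" % out_file,
        "bs=%d" % block_size,
        "skip=%d" % (start // block_size),
        "count=%d" % (length // block_size),
    ]


if __name__ == "__main__":
    # Example: 1 MB of old memory starting at physical address 16 MB.
    print(" ".join(dd_args_for_range(16 << 20, 1 << 20)))
```

A larger block size reduces the number of read calls, at the cost of
requiring coarser alignment of the requested range.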
276 276
277 277
278 Analysis 278 Analysis
279 ======== 279 ========
280 280
281 Before analyzing the dump image, you should reboot into a stable kernel. 281 Before analyzing the dump image, you should reboot into a stable kernel.
282 282
283 You can do limited analysis using GDB on the dump file copied out of 283 You can do limited analysis using GDB on the dump file copied out of
284 /proc/vmcore. Use the debug vmlinux built with -g and run the following 284 /proc/vmcore. Use the debug vmlinux built with -g and run the following
285 command: 285 command:
286 286
287 gdb vmlinux <dump-file> 287 gdb vmlinux <dump-file>
288 288
289 Stack trace for the task on processor 0, register display, and memory 289 Stack trace for the task on processor 0, register display, and memory
290 display work fine. 290 display work fine.
291 291
292 Note: GDB cannot analyze core files generated in ELF64 format for x86. 292 Note: GDB cannot analyze core files generated in ELF64 format for x86.
293 On systems with a maximum of 4GB of memory, you can generate 293 On systems with a maximum of 4GB of memory, you can generate
294 ELF32-format headers using the --elf32-core-headers option of kexec when 294 ELF32-format headers using the --elf32-core-headers option of kexec when
295 loading the dump-capture kernel. 295 loading the dump-capture kernel.
296 296
297 You can also use the Crash utility to analyze dump files in Kdump 297 You can also use the Crash utility to analyze dump files in Kdump
298 format. Crash is available on Dave Anderson's site at the following URL: 298 format. Crash is available on Dave Anderson's site at the following URL:
299 299
300 http://people.redhat.com/~anderson/ 300 http://people.redhat.com/~anderson/
301 301
302 302
303 To Do 303 To Do
304 ===== 304 =====
305 305
306 1) Provide a kernel pages filtering mechanism, so core file size is not 306 1) Provide a kernel pages filtering mechanism, so core file size is not
307 extreme on systems with huge memory banks. 307 extreme on systems with huge memory banks.
308 308
309 2) Relocatable kernel can help in maintaining multiple kernels for 309 2) Relocatable kernel can help in maintaining multiple kernels for
310 crash_dump, and the same kernel as the system kernel can be used to 310 crash_dump, and the same kernel as the system kernel can be used to
311 capture the dump. 311 capture the dump.
312 312
313 313
314 Contact 314 Contact
315 ======= 315 =======
316 316
317 Vivek Goyal (vgoyal@in.ibm.com) 317 Vivek Goyal (vgoyal@in.ibm.com)
318 Maneesh Soni (maneesh@in.ibm.com) 318 Maneesh Soni (maneesh@in.ibm.com)
319 319
320 320
321 Trademark 321 Trademark
322 ========= 322 =========
323 323
324 Linux is a trademark of Linus Torvalds in the United States, other 324 Linux is a trademark of Linus Torvalds in the United States, other
325 countries, or both. 325 countries, or both.
326 326
Documentation/keys.txt
1 ============================ 1 ============================
2 KERNEL KEY RETENTION SERVICE 2 KERNEL KEY RETENTION SERVICE
3 ============================ 3 ============================
4 4
5 This service allows cryptographic keys, authentication tokens, cross-domain 5 This service allows cryptographic keys, authentication tokens, cross-domain
6 user mappings, and similar to be cached in the kernel for the use of 6 user mappings, and similar to be cached in the kernel for the use of
7 filesystems and other kernel services. 7 filesystems and other kernel services.
8 8
9 Keyrings are permitted; these are a special type of key that can hold links to 9 Keyrings are permitted; these are a special type of key that can hold links to
10 other keys. Processes each have three standard keyring subscriptions that a 10 other keys. Processes each have three standard keyring subscriptions that a
11 kernel service can search for relevant keys. 11 kernel service can search for relevant keys.
12 12
13 The key service can be configured on by enabling: 13 The key service can be configured on by enabling:
14 14
15 "Security options"/"Enable access key retention support" (CONFIG_KEYS) 15 "Security options"/"Enable access key retention support" (CONFIG_KEYS)
16 16
17 This document has the following sections: 17 This document has the following sections:
18 18
19 - Key overview 19 - Key overview
20 - Key service overview 20 - Key service overview
21 - Key access permissions 21 - Key access permissions
22 - SELinux support 22 - SELinux support
23 - New procfs files 23 - New procfs files
24 - Userspace system call interface 24 - Userspace system call interface
25 - Kernel services 25 - Kernel services
26 - Notes on accessing payload contents 26 - Notes on accessing payload contents
27 - Defining a key type 27 - Defining a key type
28 - Request-key callback service 28 - Request-key callback service
29 - Key access filesystem 29 - Key access filesystem
30 30
31 31
32 ============ 32 ============
33 KEY OVERVIEW 33 KEY OVERVIEW
34 ============ 34 ============
35 35
36 In this context, keys represent units of cryptographic data, authentication 36 In this context, keys represent units of cryptographic data, authentication
37 tokens, keyrings, etc. These are represented in the kernel by struct key. 37 tokens, keyrings, etc. These are represented in the kernel by struct key.
38 38
39 Each key has a number of attributes: 39 Each key has a number of attributes:
40 40
41 - A serial number. 41 - A serial number.
42 - A type. 42 - A type.
43 - A description (for matching a key in a search). 43 - A description (for matching a key in a search).
44 - Access control information. 44 - Access control information.
45 - An expiry time. 45 - An expiry time.
46 - A payload. 46 - A payload.
47 - State. 47 - State.
48 48
49 49
50 (*) Each key is issued a serial number of type key_serial_t that is unique for 50 (*) Each key is issued a serial number of type key_serial_t that is unique for
51 the lifetime of that key. All serial numbers are positive non-zero 32-bit 51 the lifetime of that key. All serial numbers are positive non-zero 32-bit
52 integers. 52 integers.
53 53
54 Userspace programs can use a key's serial number as a way to gain access 54 Userspace programs can use a key's serial number as a way to gain access
55 to it, subject to permission checking. 55 to it, subject to permission checking.
56 56
57 (*) Each key is of a defined "type". Types must be registered inside the 57 (*) Each key is of a defined "type". Types must be registered inside the
58 kernel by a kernel service (such as a filesystem) before keys of that type 58 kernel by a kernel service (such as a filesystem) before keys of that type
59 can be added or used. Userspace programs cannot define new types directly. 59 can be added or used. Userspace programs cannot define new types directly.
60 60
61 Key types are represented in the kernel by struct key_type. This defines a 61 Key types are represented in the kernel by struct key_type. This defines a
62 number of operations that can be performed on a key of that type. 62 number of operations that can be performed on a key of that type.
63 63
64 Should a type be removed from the system, all the keys of that type will 64 Should a type be removed from the system, all the keys of that type will
65 be invalidated. 65 be invalidated.
66 66
67 (*) Each key has a description. This should be a printable string. The key 67 (*) Each key has a description. This should be a printable string. The key
68 type provides an operation to perform a match between the description on a 68 type provides an operation to perform a match between the description on a
69 key and a criterion string. 69 key and a criterion string.
70 70
71 (*) Each key has an owner user ID, a group ID and a permissions mask. These 71 (*) Each key has an owner user ID, a group ID and a permissions mask. These
72 are used to control what a process may do to a key from userspace, and 72 are used to control what a process may do to a key from userspace, and
73 whether a kernel service will be able to find the key. 73 whether a kernel service will be able to find the key.
74 74
75 (*) Each key can be set to expire at a specific time by the key type's 75 (*) Each key can be set to expire at a specific time by the key type's
76 instantiation function. Keys can also be immortal. 76 instantiation function. Keys can also be immortal.
77 77
78 (*) Each key can have a payload. This is a quantity of data that represents the 78 (*) Each key can have a payload. This is a quantity of data that represents the
79 actual "key". In the case of a keyring, this is a list of keys to which 79 actual "key". In the case of a keyring, this is a list of keys to which
80 the keyring links; in the case of a user-defined key, it's an arbitrary 80 the keyring links; in the case of a user-defined key, it's an arbitrary
81 blob of data. 81 blob of data.
82 82
83 Having a payload is not required; and the payload can, in fact, just be a 83 Having a payload is not required; and the payload can, in fact, just be a
84 value stored in the struct key itself. 84 value stored in the struct key itself.
85 85
86 When a key is instantiated, the key type's instantiation function is 86 When a key is instantiated, the key type's instantiation function is
87 called with a blob of data, and that then creates the key's payload in 87 called with a blob of data, and that then creates the key's payload in
88 some way. 88 some way.
89 89
90 Similarly, when userspace wants to read back the contents of the key, if 90 Similarly, when userspace wants to read back the contents of the key, if
91 permitted, another key type operation will be called to convert the key's 91 permitted, another key type operation will be called to convert the key's
92 attached payload back into a blob of data. 92 attached payload back into a blob of data.
93 93
94 (*) Each key can be in one of a number of basic states: 94 (*) Each key can be in one of a number of basic states:
95 95
96 (*) Uninstantiated. The key exists, but does not have any data attached. 96 (*) Uninstantiated. The key exists, but does not have any data attached.
97 Keys being requested from userspace will be in this state. 97 Keys being requested from userspace will be in this state.
98 98
99 (*) Instantiated. This is the normal state. The key is fully formed, and 99 (*) Instantiated. This is the normal state. The key is fully formed, and
100 has data attached. 100 has data attached.
101 101
102 (*) Negative. This is a relatively short-lived state. The key acts as a 102 (*) Negative. This is a relatively short-lived state. The key acts as a
103 note saying that a previous call out to userspace failed, and acts as 103 note saying that a previous call out to userspace failed, and acts as
104 a throttle on key lookups. A negative key can be updated to a normal 104 a throttle on key lookups. A negative key can be updated to a normal
105 state. 105 state.
106 106
107 (*) Expired. Keys can have lifetimes set. If their lifetime is exceeded, 107 (*) Expired. Keys can have lifetimes set. If their lifetime is exceeded,
108 they traverse to this state. An expired key can be updated back to a 108 they traverse to this state. An expired key can be updated back to a
109 normal state. 109 normal state.
110 110
111 (*) Revoked. A key is put in this state by userspace action. It can't be 111 (*) Revoked. A key is put in this state by userspace action. It can't be
112 found or operated upon (apart from by unlinking it). 112 found or operated upon (apart from by unlinking it).
113 113
114 (*) Dead. The key's type was unregistered, and so the key is now useless. 114 (*) Dead. The key's type was unregistered, and so the key is now useless.
115 115
116 116
117 ==================== 117 ====================
118 KEY SERVICE OVERVIEW 118 KEY SERVICE OVERVIEW
119 ==================== 119 ====================
120 120
121 The key service provides a number of features besides keys: 121 The key service provides a number of features besides keys:
122 122
123 (*) The key service defines two special key types: 123 (*) The key service defines two special key types:
124 124
125 (+) "keyring" 125 (+) "keyring"
126 126
127 Keyrings are special keys that contain a list of other keys. Keyring 127 Keyrings are special keys that contain a list of other keys. Keyring
128 lists can be modified using various system calls. Keyrings should not 128 lists can be modified using various system calls. Keyrings should not
129 be given a payload when created. 129 be given a payload when created.
130 130
131 (+) "user" 131 (+) "user"
132 132
133 A key of this type has a description and a payload that are arbitrary 133 A key of this type has a description and a payload that are arbitrary
134 blobs of data. These can be created, updated and read by userspace, 134 blobs of data. These can be created, updated and read by userspace,
135 and aren't intended for use by kernel services. 135 and aren't intended for use by kernel services.
136 136
137 (*) Each process subscribes to three keyrings: a thread-specific keyring, a 137 (*) Each process subscribes to three keyrings: a thread-specific keyring, a
138 process-specific keyring, and a session-specific keyring. 138 process-specific keyring, and a session-specific keyring.
139 139
140 The thread-specific keyring is discarded from the child when any sort of 140 The thread-specific keyring is discarded from the child when any sort of
141 clone, fork, vfork or execve occurs. A new keyring is created only when 141 clone, fork, vfork or execve occurs. A new keyring is created only when
142 required. 142 required.
143 143
144 The process-specific keyring is replaced with an empty one in the child on 144 The process-specific keyring is replaced with an empty one in the child on
145 clone, fork or vfork unless CLONE_THREAD is supplied, in which case it is 145 clone, fork or vfork unless CLONE_THREAD is supplied, in which case it is
146 shared. execve also discards the process's process keyring and creates a 146 shared. execve also discards the process's process keyring and creates a
147 new one. 147 new one.
148 148
149 The session-specific keyring is persistent across clone, fork, vfork and 149 The session-specific keyring is persistent across clone, fork, vfork and
150 execve, even when the latter executes a set-UID or set-GID binary. A 150 execve, even when the latter executes a set-UID or set-GID binary. A
151 process can, however, replace its current session keyring with a new one 151 process can, however, replace its current session keyring with a new one
152 by using KEYCTL_JOIN_SESSION_KEYRING. It is permitted to request an anonymous 152 by using KEYCTL_JOIN_SESSION_KEYRING. It is permitted to request an anonymous
153 new one, or to attempt to create or join one of a specific name. 153 new one, or to attempt to create or join one of a specific name.
154 154
155 The ownership of the thread keyring changes when the real UID and GID of 155 The ownership of the thread keyring changes when the real UID and GID of
156 the thread change. 156 the thread change.
157 157
158 (*) Each user ID resident in the system holds two special keyrings: a user 158 (*) Each user ID resident in the system holds two special keyrings: a user
159 specific keyring and a default user session keyring. The default session 159 specific keyring and a default user session keyring. The default session
160 keyring is initialised with a link to the user-specific keyring. 160 keyring is initialised with a link to the user-specific keyring.
161 161
162 When a process changes its real UID, if it used to have no session key, it 162 When a process changes its real UID, if it used to have no session key, it
163 will be subscribed to the default session key for the new UID. 163 will be subscribed to the default session key for the new UID.
164 164
165 If a process attempts to access its session key when it doesn't have one, 165 If a process attempts to access its session key when it doesn't have one,
166 it will be subscribed to the default for its current UID. 166 it will be subscribed to the default for its current UID.
167 167
168 (*) Each user has two quotas against which the keys they own are tracked. One 168 (*) Each user has two quotas against which the keys they own are tracked. One
169 limits the total number of keys and keyrings, the other limits the total 169 limits the total number of keys and keyrings, the other limits the total
170 amount of description and payload space that can be consumed. 170 amount of description and payload space that can be consumed.
171 171
172 The user can view information on this and other statistics through procfs 172 The user can view information on this and other statistics through procfs
173 files. 173 files.
174 174
175 Process-specific and thread-specific keyrings are not counted towards a 175 Process-specific and thread-specific keyrings are not counted towards a
176 user's quota. 176 user's quota.
177 177
178 If a system call that modifies a key or keyring in some way would put the 178 If a system call that modifies a key or keyring in some way would put the
179 user over quota, the operation is refused and error EDQUOT is returned. 179 user over quota, the operation is refused and error EDQUOT is returned.
180 180
181 (*) There's a system call interface by which userspace programs can create and 181 (*) There's a system call interface by which userspace programs can create and
182 manipulate keys and keyrings. 182 manipulate keys and keyrings.
183 183
184 (*) There's a kernel interface by which services can register types and search 184 (*) There's a kernel interface by which services can register types and search
185 for keys. 185 for keys.
186 186
187 (*) There's a way for a search done from the kernel to call back to 187 (*) There's a way for a search done from the kernel to call back to
188 userspace to request a key that can't be found in a process's keyrings. 188 userspace to request a key that can't be found in a process's keyrings.
189 189
190 (*) An optional filesystem is available through which the key database can be 190 (*) An optional filesystem is available through which the key database can be
191 viewed and manipulated. 191 viewed and manipulated.
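
The two per-user quotas described above (a key count and a total of
description plus payload bytes) can be modelled in a few lines. This is only a
sketch of the rule as stated in the text: struct key_quota and
quota_charge_key() are hypothetical names, not kernel symbols, and the exact
comparison the kernel performs against each limit may differ.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user accounting record; not the kernel's own structure. */
struct key_quota {
	unsigned int nkeys;     /* keys and keyrings currently charged */
	unsigned int max_keys;  /* limit on the number of keys */
	size_t nbytes;          /* description + payload bytes charged */
	size_t max_bytes;       /* limit on description + payload bytes */
};

/*
 * Charge one more key of the given size to the user.
 * Returns 0 and updates the record on success, or -EDQUOT (leaving the
 * record untouched) if either limit would be exceeded.
 */
int quota_charge_key(struct key_quota *q, size_t desc_len, size_t payload_len)
{
	size_t need = desc_len + payload_len;

	if (q->nkeys + 1 > q->max_keys)
		return -EDQUOT;
	/* Written this way round so the byte check cannot wrap. */
	if (need > q->max_bytes || q->nbytes > q->max_bytes - need)
		return -EDQUOT;

	q->nkeys += 1;
	q->nbytes += need;
	return 0;
}
```

With limits of 100 keys and 10000 bytes, as in the /proc/key-users sample
later in this file, a user already charged 9990 bytes would be refused a key
whose description and payload together need 11 bytes.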
192 192
193 193
194 ====================== 194 ======================
195 KEY ACCESS PERMISSIONS 195 KEY ACCESS PERMISSIONS
196 ====================== 196 ======================
197 197
198 Keys have an owner user ID, a group access ID, and a permissions mask. The mask 198 Keys have an owner user ID, a group access ID, and a permissions mask. The mask
199 has up to eight bits each for possessor, user, group and other access. Only 199 has up to eight bits each for possessor, user, group and other access. Only
200 six of each set of eight bits are defined. The permissions that can be granted are: 200 six of each set of eight bits are defined. The permissions that can be granted are:
201 201
202 (*) View 202 (*) View
203 203
204 This permits a key or keyring's attributes to be viewed - including key 204 This permits a key or keyring's attributes to be viewed - including key
205 type and description. 205 type and description.
206 206
207 (*) Read 207 (*) Read
208 208
209 This permits a key's payload, or a keyring's list of linked keys, to be 209 This permits a key's payload, or a keyring's list of linked keys, to be
210 viewed. 210 viewed.
211 211
212 (*) Write 212 (*) Write
213 213
214 This permits a key's payload to be instantiated or updated, or it allows a 214 This permits a key's payload to be instantiated or updated, or it allows a
215 link to be added to or removed from a keyring. 215 link to be added to or removed from a keyring.
216 216
217 (*) Search 217 (*) Search
218 218
219 This permits keyrings to be searched and keys to be found. Searches can 219 This permits keyrings to be searched and keys to be found. Searches can
220 only recurse into nested keyrings that have search permission set. 220 only recurse into nested keyrings that have search permission set.
221 221
222 (*) Link 222 (*) Link
223 223
224 This permits a key or keyring to be linked to. To create a link from a 224 This permits a key or keyring to be linked to. To create a link from a
225 keyring to a key, a process must have Write permission on the keyring and 225 keyring to a key, a process must have Write permission on the keyring and
226 Link permission on the key. 226 Link permission on the key.
227 227
228 (*) Set Attribute 228 (*) Set Attribute
229 229
230 This permits a key's UID, GID and permissions mask to be changed. 230 This permits a key's UID, GID and permissions mask to be changed.
231 231
232 For changing the ownership, group ID or permissions mask, being the owner of 232 For changing the ownership, group ID or permissions mask, being the owner of
233 the key or having the sysadmin capability is sufficient. 233 the key or having the sysadmin capability is sufficient.
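
As an illustration of the mask layout described above, the following sketch
splits a 32-bit permissions mask into its four 8-bit fields and renders one
field as text. The KEY_PERM_* and KEY_WHO_* names and both helper functions
are made up for this example. The bit values and the field order (possessor in
the top byte, then user, group and other) are assumed to follow the kernel's
include/linux/key.h.

```c
#include <stdint.h>

/* Bit values within one 8-bit field (names are local to this example). */
enum {
	KEY_PERM_VIEW    = 0x01,
	KEY_PERM_READ    = 0x02,
	KEY_PERM_WRITE   = 0x04,
	KEY_PERM_SEARCH  = 0x08,
	KEY_PERM_LINK    = 0x10,
	KEY_PERM_SETATTR = 0x20
};

/* Which 8-bit field of the mask to extract, highest byte first. */
enum key_perm_who {
	KEY_WHO_POSSESSOR,
	KEY_WHO_USER,
	KEY_WHO_GROUP,
	KEY_WHO_OTHER
};

/* Return the 8 permission bits that apply to the given class of accessor. */
unsigned int key_perm_field(uint32_t mask, enum key_perm_who who)
{
	unsigned int shift = 24u - 8u * (unsigned int)who;

	return (mask >> shift) & 0xffu;
}

/* Render one field as six characters such as "vrwsl-"; out needs 7 bytes. */
void key_perm_string(unsigned int field, char out[7])
{
	static const char letters[6] = { 'v', 'r', 'w', 's', 'l', 'a' };
	int i;

	for (i = 0; i < 6; i++)
		out[i] = (field & (1u << i)) ? letters[i] : '-';
	out[6] = '\0';
}
```

Applied to the mask 1f3f0000, which appears in the /proc/keys sample in the
next section, the possessor field is 0x1f (everything except Set Attribute),
the user field is 0x3f (all six permissions), and the group and other fields
are empty.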
234 234
235 235
236 =============== 236 ===============
237 SELINUX SUPPORT 237 SELINUX SUPPORT
238 =============== 238 ===============
239 239
240 The security class "key" has been added to SELinux so that mandatory access 240 The security class "key" has been added to SELinux so that mandatory access
241 controls can be applied to keys created within various contexts. This support 241 controls can be applied to keys created within various contexts. This support
242 is preliminary, and is likely to change quite significantly in the near future. 242 is preliminary, and is likely to change quite significantly in the near future.
243 Currently, all of the basic permissions explained above are provided in SELinux 243 Currently, all of the basic permissions explained above are provided in SELinux
244 as well; SELinux is simply invoked after all basic permission checks have been 244 as well; SELinux is simply invoked after all basic permission checks have been
245 performed. 245 performed.
246 246
247 The value of the file /proc/self/attr/keycreate influences the labeling of 247 The value of the file /proc/self/attr/keycreate influences the labeling of
248 newly-created keys. If the contents of that file correspond to an SELinux 248 newly-created keys. If the contents of that file correspond to an SELinux
249 security context, then the key will be assigned that context. Otherwise, the 249 security context, then the key will be assigned that context. Otherwise, the
250 key will be assigned the current context of the task that invoked the key 250 key will be assigned the current context of the task that invoked the key
251 creation request. Tasks must be granted explicit permission to assign a 251 creation request. Tasks must be granted explicit permission to assign a
252 particular context to newly-created keys, using the "create" permission in the 252 particular context to newly-created keys, using the "create" permission in the
253 key security class. 253 key security class.
254 254
255 The default keyrings associated with users will be labeled with the default 255 The default keyrings associated with users will be labeled with the default
256 context of the user if and only if the login programs have been instrumented to 256 context of the user if and only if the login programs have been instrumented to
257 properly initialize keycreate during the login process. Otherwise, they will 257 properly initialize keycreate during the login process. Otherwise, they will
258 be labeled with the context of the login program itself. 258 be labeled with the context of the login program itself.
259 259
260 Note, however, that the default keyrings associated with the root user are 260 Note, however, that the default keyrings associated with the root user are
261 labeled with the default kernel context, since they are created early in the 261 labeled with the default kernel context, since they are created early in the
262 boot process, before root has a chance to log in. 262 boot process, before root has a chance to log in.
263 263
264 The keyrings associated with new threads are each labeled with the context of 264 The keyrings associated with new threads are each labeled with the context of
265 their associated thread, and both session and process keyrings are handled 265 their associated thread, and both session and process keyrings are handled
266 similarly. 266 similarly.
267 267
268 268
269 ================ 269 ================
270 NEW PROCFS FILES 270 NEW PROCFS FILES
271 ================ 271 ================
272 272
273 Two files have been added to procfs by which an administrator can find out 273 Two files have been added to procfs by which an administrator can find out
274 about the status of the key service: 274 about the status of the key service:
275 275
276 (*) /proc/keys 276 (*) /proc/keys
277 277
278 This lists the keys that are currently viewable by the task reading the 278 This lists the keys that are currently viewable by the task reading the
279 file, giving information about their type, description and permissions. 279 file, giving information about their type, description and permissions.
280 It is not possible to view the payload of the key this way, though some 280 It is not possible to view the payload of the key this way, though some
281 information about it may be given. 281 information about it may be given.
282 282
283 The only keys included in the list are those that grant View permission to 283 The only keys included in the list are those that grant View permission to
284 the reading process whether or not it possesses them. Note that LSM 284 the reading process whether or not it possesses them. Note that LSM
285 security checks are still performed, and may further filter out keys that 285 security checks are still performed, and may further filter out keys that
286 the current process is not authorised to view. 286 the current process is not authorised to view.
287 287
288 The contents of the file look like this: 288 The contents of the file look like this:
289 289
290 SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY 290 SERIAL FLAGS USAGE EXPY PERM UID GID TYPE DESCRIPTION: SUMMARY
291 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4 291 00000001 I----- 39 perm 1f3f0000 0 0 keyring _uid_ses.0: 1/4
292 00000002 I----- 2 perm 1f3f0000 0 0 keyring _uid.0: empty 292 00000002 I----- 2 perm 1f3f0000 0 0 keyring _uid.0: empty
293 00000007 I----- 1 perm 1f3f0000 0 0 keyring _pid.1: empty 293 00000007 I----- 1 perm 1f3f0000 0 0 keyring _pid.1: empty
294 0000018d I----- 1 perm 1f3f0000 0 0 keyring _pid.412: empty 294 0000018d I----- 1 perm 1f3f0000 0 0 keyring _pid.412: empty
295 000004d2 I--Q-- 1 perm 1f3f0000 32 -1 keyring _uid.32: 1/4 295 000004d2 I--Q-- 1 perm 1f3f0000 32 -1 keyring _uid.32: 1/4
296 000004d3 I--Q-- 3 perm 1f3f0000 32 -1 keyring _uid_ses.32: empty 296 000004d3 I--Q-- 3 perm 1f3f0000 32 -1 keyring _uid_ses.32: empty
297 00000892 I--QU- 1 perm 1f000000 0 0 user metal:copper: 0 297 00000892 I--QU- 1 perm 1f000000 0 0 user metal:copper: 0
298 00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0 298 00000893 I--Q-N 1 35s 1f3f0000 0 0 user metal:silver: 0
299 00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0 299 00000894 I--Q-- 1 10h 003f0000 0 0 user metal:gold: 0
300 300
301 The flags are: 301 The flags are:
302 302
303 I Instantiated 303 I Instantiated
304 R Revoked 304 R Revoked
305 D Dead 305 D Dead
306 Q Contributes to user's quota 306 Q Contributes to user's quota
307 U Under construction by callback to userspace 307 U Under construction by callback to userspace
308 N Negative key 308 N Negative key
309 309
310 This file must be enabled at kernel configuration time as it allows anyone 310 This file must be enabled at kernel configuration time as it allows anyone
311 to list the keys database. 311 to list the keys database.
312 312
313 (*) /proc/key-users 313 (*) /proc/key-users
314 314
315 This file lists the tracking data for each user that has at least one key 315 This file lists the tracking data for each user that has at least one key
316 on the system. Such data includes quota information and statistics: 316 on the system. Such data includes quota information and statistics:
317 317
318 [root@andromeda root]# cat /proc/key-users 318 [root@andromeda root]# cat /proc/key-users
319 0: 46 45/45 1/100 13/10000 319 0: 46 45/45 1/100 13/10000
320 29: 2 2/2 2/100 40/10000 320 29: 2 2/2 2/100 40/10000
321 32: 2 2/2 2/100 40/10000 321 32: 2 2/2 2/100 40/10000
322 38: 2 2/2 2/100 40/10000 322 38: 2 2/2 2/100 40/10000
323 323
324 The format of each line is 324 The format of each line is
325 <UID>: User ID to which this applies 325 <UID>: User ID to which this applies
326 <usage> Structure refcount 326 <usage> Structure refcount
327 <keys>/<inst> Total number of keys and number instantiated 327 <keys>/<inst> Total number of keys and number instantiated
328 <keys>/<max> Key count quota 328 <keys>/<max> Key count quota
329 <bytes>/<max> Key size quota 329 <bytes>/<max> Key size quota
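
A monitoring tool could read both files with sscanf(). The sketch below
assumes the whitespace-separated layouts shown in the samples above. The
struct and function names are invented for the example, and treating the first
slash-separated pair in /proc/key-users as total keys followed by instantiated
keys is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* One line of /proc/keys, as laid out in the sample above. */
struct proc_key_entry {
	unsigned int serial;  /* SERIAL, printed in hex */
	char flags[8];        /* FLAGS, e.g. "I--Q-N" */
	int usage;            /* USAGE */
	char expiry[8];       /* EXPY, e.g. "perm", "35s", "10h" */
	unsigned int perm;    /* PERM, printed in hex */
	int uid, gid;         /* UID and GID (the GID may be -1) */
	char type[32];        /* TYPE */
	char desc[128];       /* DESCRIPTION: SUMMARY, left unsplit */
};

/* One line of /proc/key-users. */
struct proc_key_user {
	unsigned int uid;
	int usage;            /* structure refcount */
	int nkeys, ninst;     /* first pair (field order is an assumption) */
	int qnkeys, maxkeys;  /* key count quota: used/limit */
	int qnbytes, maxbytes;/* key size quota: used/limit */
};

/* Returns 0 if every field was parsed, -1 otherwise. */
int parse_proc_keys_line(const char *line, struct proc_key_entry *e)
{
	int n = sscanf(line, "%x %7s %d %7s %x %d %d %31s %127[^\n]",
		       &e->serial, e->flags, &e->usage, e->expiry,
		       &e->perm, &e->uid, &e->gid, e->type, e->desc);

	return n == 9 ? 0 : -1;
}

/* Non-zero if the flag letter (I, R, D, Q, U or N) is set on the entry. */
int key_has_flag(const struct proc_key_entry *e, char flag)
{
	if (flag == '\0' || flag == '-')
		return 0;
	return strchr(e->flags, flag) != NULL;
}

/* Returns 0 if every field was parsed, -1 otherwise. */
int parse_proc_key_users_line(const char *line, struct proc_key_user *u)
{
	int n = sscanf(line, "%u: %d %d/%d %d/%d %d/%d",
		       &u->uid, &u->usage, &u->nkeys, &u->ninst,
		       &u->qnkeys, &u->maxkeys, &u->qnbytes, &u->maxbytes);

	return n == 8 ? 0 : -1;
}
```

The description and summary are kept together in one string because a
description may itself contain colons, as "metal:silver" does in the sample.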
330 330
331 331
332 =============================== 332 ===============================
333 USERSPACE SYSTEM CALL INTERFACE 333 USERSPACE SYSTEM CALL INTERFACE
334 =============================== 334 ===============================
335 335
336 Userspace can manipulate keys directly through three new syscalls: add_key, 336 Userspace can manipulate keys directly through three new syscalls: add_key,
337 request_key and keyctl. The last of these provides a number of functions for 337 request_key and keyctl. The last of these provides a number of functions for
338 manipulating keys. 338 manipulating keys.
339 339
340 When referring to a key directly, userspace programs should use the key's 340 When referring to a key directly, userspace programs should use the key's
341 serial number (a positive 32-bit integer). However, there are some special 341 serial number (a positive 32-bit integer). However, there are some special
342 values available for referring to special keys and keyrings that relate to the 342 values available for referring to special keys and keyrings that relate to the
343 process making the call: 343 process making the call:
344 344
345 CONSTANT VALUE KEY REFERENCED 345 CONSTANT VALUE KEY REFERENCED
346 ============================== ====== =========================== 346 ============================== ====== ===========================
347 KEY_SPEC_THREAD_KEYRING -1 thread-specific keyring 347 KEY_SPEC_THREAD_KEYRING -1 thread-specific keyring
348 KEY_SPEC_PROCESS_KEYRING -2 process-specific keyring 348 KEY_SPEC_PROCESS_KEYRING -2 process-specific keyring
349 KEY_SPEC_SESSION_KEYRING -3 session-specific keyring 349 KEY_SPEC_SESSION_KEYRING -3 session-specific keyring
350 KEY_SPEC_USER_KEYRING -4 UID-specific keyring 350 KEY_SPEC_USER_KEYRING -4 UID-specific keyring
351 KEY_SPEC_USER_SESSION_KEYRING -5 UID-session keyring 351 KEY_SPEC_USER_SESSION_KEYRING -5 UID-session keyring
352 KEY_SPEC_GROUP_KEYRING -6 GID-specific keyring 352 KEY_SPEC_GROUP_KEYRING -6 GID-specific keyring
353 KEY_SPEC_REQKEY_AUTH_KEY -7 assumed request_key() 353 KEY_SPEC_REQKEY_AUTH_KEY -7 assumed request_key()
354 authorisation key 354 authorisation key
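
Because real serial numbers are always positive, the special values in the
table above can be told apart from them by sign alone. The following sketch
restates the table as C definitions. The macro names and values come from the
table, while key_spec_describe() and key_id_is_serial() are helper names
invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t key_serial_t;

/* Special key IDs, copied from the table above. */
#define KEY_SPEC_THREAD_KEYRING        (-1)
#define KEY_SPEC_PROCESS_KEYRING       (-2)
#define KEY_SPEC_SESSION_KEYRING       (-3)
#define KEY_SPEC_USER_KEYRING          (-4)
#define KEY_SPEC_USER_SESSION_KEYRING  (-5)
#define KEY_SPEC_GROUP_KEYRING         (-6)
#define KEY_SPEC_REQKEY_AUTH_KEY       (-7)

/* Return the table's description of a special ID, or NULL if it is not one. */
const char *key_spec_describe(key_serial_t id)
{
	switch (id) {
	case KEY_SPEC_THREAD_KEYRING:       return "thread-specific keyring";
	case KEY_SPEC_PROCESS_KEYRING:      return "process-specific keyring";
	case KEY_SPEC_SESSION_KEYRING:      return "session-specific keyring";
	case KEY_SPEC_USER_KEYRING:         return "UID-specific keyring";
	case KEY_SPEC_USER_SESSION_KEYRING: return "UID-session keyring";
	case KEY_SPEC_GROUP_KEYRING:        return "GID-specific keyring";
	case KEY_SPEC_REQKEY_AUTH_KEY:
		return "assumed request_key() authorisation key";
	default:
		return NULL;
	}
}

/* Real serial numbers are positive and non-zero. */
int key_id_is_serial(key_serial_t id)
{
	return id > 0;
}
```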
355 355
356 356
357 The main syscalls are: 357 The main syscalls are:
358 358
359 (*) Create a new key of given type, description and payload and add it to the 359 (*) Create a new key of given type, description and payload and add it to the
360 nominated keyring: 360 nominated keyring:
361 361
362 key_serial_t add_key(const char *type, const char *desc, 362 key_serial_t add_key(const char *type, const char *desc,
363 const void *payload, size_t plen, 363 const void *payload, size_t plen,
364 key_serial_t keyring); 364 key_serial_t keyring);
365 365
366 If a key of the same type and description as that proposed already exists 366 If a key of the same type and description as that proposed already exists
367 in the keyring, this will try to update it with the given payload, or it 367 in the keyring, this will try to update it with the given payload, or it
368 will return error EEXIST if that function is not supported by the key 368 will return error EEXIST if that function is not supported by the key
369 type. The process must also have permission to write to the key to be able 369 type. The process must also have permission to write to the key to be able
370 to update it. The new key will have all user permissions granted and no 370 to update it. The new key will have all user permissions granted and no
371 group or third party permissions. 371 group or third party permissions.
372 372
373 Otherwise, this will attempt to create a new key of the specified type and 373 Otherwise, this will attempt to create a new key of the specified type and
374 description, and to instantiate it with the supplied payload and attach it 374 description, and to instantiate it with the supplied payload and attach it
375 to the keyring. In this case, an error will be generated if the process 375 to the keyring. In this case, an error will be generated if the process
376 does not have permission to write to the keyring. 376 does not have permission to write to the keyring.
377 377
378 The payload is optional, and the pointer can be NULL if not required by 378 The payload is optional, and the pointer can be NULL if not required by
379 the type. The payload is plen in size, and plen can be zero for an empty 379 the type. The payload is plen in size, and plen can be zero for an empty
380 payload. 380 payload.
381 381
382 A new keyring can be generated by setting type "keyring", the keyring name 382 A new keyring can be generated by setting type "keyring", the keyring name
383 as the description (or NULL) and setting the payload to NULL. 383 as the description (or NULL) and setting the payload to NULL.
384 384
385 User defined keys can be created by specifying type "user". It is 385 User defined keys can be created by specifying type "user". It is
386 recommended that a user defined key's description be prefixed with a type 386 recommended that a user defined key's description be prefixed with a type
387 ID and a colon, such as "krb5tgt:" for a Kerberos 5 ticket granting 387 ID and a colon, such as "krb5tgt:" for a Kerberos 5 ticket granting
388 ticket. 388 ticket.
389 389
390 Any other type must have been registered with the kernel in advance by a 390 Any other type must have been registered with the kernel in advance by a
391 kernel service such as a filesystem. 391 kernel service such as a filesystem.
392 392
393 The ID of the new or updated key is returned if successful. 393 The ID of the new or updated key is returned if successful.
394 394
395 395
396 (*) Search the process's keyrings for a key, potentially calling out to 396 (*) Search the process's keyrings for a key, potentially calling out to
397 userspace to create it. 397 userspace to create it.
398 398
399 key_serial_t request_key(const char *type, const char *description, 399 key_serial_t request_key(const char *type, const char *description,
400 const char *callout_info, 400 const char *callout_info,
401 key_serial_t dest_keyring); 401 key_serial_t dest_keyring);
402 402
403 This function searches all the process's keyrings in the order thread, 403 This function searches all the process's keyrings in the order thread,
404 process, session for a matching key. This works very much like 404 process, session for a matching key. This works very much like
405 KEYCTL_SEARCH, including the optional attachment of the discovered key to 405 KEYCTL_SEARCH, including the optional attachment of the discovered key to
406 a keyring. 406 a keyring.
407 407
408 If a key cannot be found, and if callout_info is not NULL, then 408 If a key cannot be found, and if callout_info is not NULL, then
409 /sbin/request-key will be invoked in an attempt to obtain a key. The 409 /sbin/request-key will be invoked in an attempt to obtain a key. The
410 callout_info string will be passed as an argument to the program. 410 callout_info string will be passed as an argument to the program.
411 411
412 See also Documentation/keys-request-key.txt. 412 See also Documentation/keys-request-key.txt.
413 413
414 414
415 The keyctl syscall functions are: 415 The keyctl syscall functions are:
416 416
417 (*) Map a special key ID to a real key ID for this process: 417 (*) Map a special key ID to a real key ID for this process:
418 418
419 key_serial_t keyctl(KEYCTL_GET_KEYRING_ID, key_serial_t id, 419 key_serial_t keyctl(KEYCTL_GET_KEYRING_ID, key_serial_t id,
420 int create); 420 int create);
421 421
422 The special key specified by "id" is looked up (with the key being created 422 The special key specified by "id" is looked up (with the key being created
423 if necessary) and the ID of the key or keyring thus found is returned if 423 if necessary) and the ID of the key or keyring thus found is returned if
424 it exists. 424 it exists.
425 425
426 If the key does not yet exist, the key will be created if "create" is 426 If the key does not yet exist, the key will be created if "create" is
427 non-zero; and the error ENOKEY will be returned if "create" is zero. 427 non-zero; and the error ENOKEY will be returned if "create" is zero.
428 428
429 429
430 (*) Replace the session keyring this process subscribes to with a new one: 430 (*) Replace the session keyring this process subscribes to with a new one:
431 431
432 key_serial_t keyctl(KEYCTL_JOIN_SESSION_KEYRING, const char *name); 432 key_serial_t keyctl(KEYCTL_JOIN_SESSION_KEYRING, const char *name);
433 433
434 If name is NULL, an anonymous keyring is created attached to the process 434 If name is NULL, an anonymous keyring is created attached to the process
435 as its session keyring, displacing the old session keyring. 435 as its session keyring, displacing the old session keyring.
436 436
437 If name is not NULL, if a keyring of that name exists, the process 437 If name is not NULL, if a keyring of that name exists, the process
438 attempts to attach it as the session keyring, returning an error if that 438 attempts to attach it as the session keyring, returning an error if that
439 is not permitted; otherwise a new keyring of that name is created and 439 is not permitted; otherwise a new keyring of that name is created and
440 attached as the session keyring. 440 attached as the session keyring.
441 441
442 To attach to a named keyring, the keyring must have search permission for 442 To attach to a named keyring, the keyring must have search permission for
443 the process's ownership. 443 the process's ownership.
444 444
445 The ID of the new session keyring is returned if successful. 445 The ID of the new session keyring is returned if successful.
446 446
447 447
448 (*) Update the specified key: 448 (*) Update the specified key:
449 449
450 long keyctl(KEYCTL_UPDATE, key_serial_t key, const void *payload, 450 long keyctl(KEYCTL_UPDATE, key_serial_t key, const void *payload,
451 size_t plen); 451 size_t plen);
452 452
453 This will try to update the specified key with the given payload, or it 453 This will try to update the specified key with the given payload, or it
454 will return error EOPNOTSUPP if that function is not supported by the key 454 will return error EOPNOTSUPP if that function is not supported by the key
455 type. The process must also have permission to write to the key to be able 455 type. The process must also have permission to write to the key to be able
456 to update it. 456 to update it.
457 457
458 The payload is of length plen, and may be absent or empty as for 458 The payload is of length plen, and may be absent or empty as for
459 add_key(). 459 add_key().
460 460
461 461
462 (*) Revoke a key: 462 (*) Revoke a key:
463 463
464 long keyctl(KEYCTL_REVOKE, key_serial_t key); 464 long keyctl(KEYCTL_REVOKE, key_serial_t key);
465 465
466 This makes a key unavailable for further operations. Further attempts to 466 This makes a key unavailable for further operations. Further attempts to
467 use the key will be met with error EKEYREVOKED, and the key will no longer 467 use the key will be met with error EKEYREVOKED, and the key will no longer
468 be findable. 468 be findable.
469 469
470 470
471 (*) Change the ownership of a key: 471 (*) Change the ownership of a key:
472 472
473 long keyctl(KEYCTL_CHOWN, key_serial_t key, uid_t uid, gid_t gid); 473 long keyctl(KEYCTL_CHOWN, key_serial_t key, uid_t uid, gid_t gid);
474 474
475 This function permits a key's owner and group ID to be changed. Either one 475 This function permits a key's owner and group ID to be changed. Either one
476 of uid or gid can be set to -1 to suppress that change. 476 of uid or gid can be set to -1 to suppress that change.
477 477
478 Only the superuser can change a key's owner to something other than the 478 Only the superuser can change a key's owner to something other than the
479 key's current owner. Similarly, only the superuser can change a key's 479 key's current owner. Similarly, only the superuser can change a key's
480 group ID to something other than the calling process's group ID or one of 480 group ID to something other than the calling process's group ID or one of
481 its group list members. 481 its group list members.
482 482
483 483
484 (*) Change the permissions mask on a key: 484 (*) Change the permissions mask on a key:
485 485
486 long keyctl(KEYCTL_SETPERM, key_serial_t key, key_perm_t perm); 486 long keyctl(KEYCTL_SETPERM, key_serial_t key, key_perm_t perm);
487 487
488 This function permits the owner of a key or the superuser to change the 488 This function permits the owner of a key or the superuser to change the
489 permissions mask on a key. 489 permissions mask on a key.
490 490
491 Only the available bits are permitted; if any other bits are set, 491 Only the available bits are permitted; if any other bits are set,
492 error EINVAL will be returned. 492 error EINVAL will be returned.
493 493
494 494
495 (*) Describe a key: 495 (*) Describe a key:
496 496
497 long keyctl(KEYCTL_DESCRIBE, key_serial_t key, char *buffer, 497 long keyctl(KEYCTL_DESCRIBE, key_serial_t key, char *buffer,
498 size_t buflen); 498 size_t buflen);
499 499
500 This function returns a summary of the key's attributes (but not its 500 This function returns a summary of the key's attributes (but not its
501 payload data) as a string in the buffer provided. 501 payload data) as a string in the buffer provided.
502 502
503 Unless there's an error, it always returns the amount of data it could 503 Unless there's an error, it always returns the amount of data it could
504 produce, even if that's too big for the buffer, but it won't copy more 504 produce, even if that's too big for the buffer, but it won't copy more
505 than requested to userspace. If the buffer pointer is NULL then no copy 505 than requested to userspace. If the buffer pointer is NULL then no copy
506 will take place. 506 will take place.
507 507
508 A process must have view permission on the key for this function to be 508 A process must have view permission on the key for this function to be
509 successful. 509 successful.
510 510
511 If successful, a string is placed in the buffer in the following format: 511 If successful, a string is placed in the buffer in the following format:
512 512
513 <type>;<uid>;<gid>;<perm>;<description> 513 <type>;<uid>;<gid>;<perm>;<description>
514 514
515 Where type and description are strings, uid and gid are decimal, and perm 515 Where type and description are strings, uid and gid are decimal, and perm
516 is hexadecimal. A NUL character is included at the end of the string if 516 is hexadecimal. A NUL character is included at the end of the string if
517 the buffer is sufficiently big. 517 the buffer is sufficiently big.
518 518
519 This can be parsed with 519 This can be parsed with
520 520
521 sscanf(buffer, "%[^;];%d;%d;%o;%s", type, &uid, &gid, &mode, desc); 521 sscanf(buffer, "%[^;];%d;%d;%o;%s", type, &uid, &gid, &mode, desc);
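The text above says perm is hexadecimal, while the sscanf() format shown uses %o (octal), and %s stops at the first whitespace in the description. A minimal sketch of a parser that follows the prose rather than the format string is given below. The struct, the function name and the buffer sizes are assumptions made for illustration and are not part of the key management API.

```c
#include <stdio.h>

/* Hypothetical container for the fields of a KEYCTL_DESCRIBE string. */
struct key_description {
	char type[32];
	int uid;
	int gid;
	unsigned int perm;
	char desc[256];
};

/*
 * Parse "<type>;<uid>;<gid>;<perm>;<description>".
 * perm is read as hexadecimal (%x), as the prose describes, and the
 * description is read up to the end of the string so that embedded
 * spaces are kept.  Field widths guard the fixed-size buffers.
 * Returns 0 on success, -1 if any of the five fields is missing or empty.
 */
int parse_key_description(const char *buffer, struct key_description *out)
{
	int n = sscanf(buffer, "%31[^;];%d;%d;%x;%255[^\n]",
		       out->type, &out->uid, &out->gid, &out->perm, out->desc);

	return n == 5 ? 0 : -1;
}
```

A caller would first fetch the string with KEYCTL_DESCRIBE into a NUL-terminated buffer and then hand that buffer to parse_key_description().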
522 522
523 523
524 (*) Clear out a keyring: 524 (*) Clear out a keyring:
525 525
526 long keyctl(KEYCTL_CLEAR, key_serial_t keyring); 526 long keyctl(KEYCTL_CLEAR, key_serial_t keyring);
527 527
528 This function clears the list of keys attached to a keyring. The calling 528 This function clears the list of keys attached to a keyring. The calling
529 process must have write permission on the keyring, and it must be a 529 process must have write permission on the keyring, and it must be a
530 keyring (or else error ENOTDIR will result). 530 keyring (or else error ENOTDIR will result).
531 531
532 532
533 (*) Link a key into a keyring: 533 (*) Link a key into a keyring:
534 534
535 long keyctl(KEYCTL_LINK, key_serial_t keyring, key_serial_t key); 535 long keyctl(KEYCTL_LINK, key_serial_t keyring, key_serial_t key);
536 536
537 This function creates a link from the keyring to the key. The process must 537 This function creates a link from the keyring to the key. The process must
538 have write permission on the keyring and must have link permission on the 538 have write permission on the keyring and must have link permission on the
539 key. 539 key.
540 540
541 Should the keyring not be a keyring, error ENOTDIR will result; and if the 541 Should the keyring not be a keyring, error ENOTDIR will result; and if the
542 keyring is full, error ENFILE will result. 542 keyring is full, error ENFILE will result.
543 543
544 The link procedure checks the nesting of the keyrings, returning ELOOP if 544 The link procedure checks the nesting of the keyrings, returning ELOOP if
545 it appears too deep or EDEADLK if the link would introduce a cycle. 545 it appears too deep or EDEADLK if the link would introduce a cycle.
546 546
547 Any links within the keyring to keys that match the new key in terms of 547 Any links within the keyring to keys that match the new key in terms of
548 type and description will be discarded from the keyring as the new one is 548 type and description will be discarded from the keyring as the new one is
549 added. 549 added.
550 550
551 551
552 (*) Unlink a key or keyring from another keyring: 552 (*) Unlink a key or keyring from another keyring:
553 553
554 long keyctl(KEYCTL_UNLINK, key_serial_t keyring, key_serial_t key); 554 long keyctl(KEYCTL_UNLINK, key_serial_t keyring, key_serial_t key);
555 555
556 This function looks through the keyring for the first link to the 556 This function looks through the keyring for the first link to the
557 specified key, and removes it if found. Subsequent links to that key are 557 specified key, and removes it if found. Subsequent links to that key are
558 ignored. The process must have write permission on the keyring. 558 ignored. The process must have write permission on the keyring.
559 559
560 If the keyring is not a keyring, error ENOTDIR will result; and if the key 560 If the keyring is not a keyring, error ENOTDIR will result; and if the key
561 is not present, error ENOENT will be the result. 561 is not present, error ENOENT will be the result.
562 562
563 563
564 (*) Search a keyring tree for a key: 564 (*) Search a keyring tree for a key:
565 565
566 key_serial_t keyctl(KEYCTL_SEARCH, key_serial_t keyring, 566 key_serial_t keyctl(KEYCTL_SEARCH, key_serial_t keyring,
567 const char *type, const char *description, 567 const char *type, const char *description,
568 key_serial_t dest_keyring); 568 key_serial_t dest_keyring);
569 569
570 This searches the keyring tree headed by the specified keyring until a key 570 This searches the keyring tree headed by the specified keyring until a key
571 is found that matches the type and description criteria. Each keyring is 571 is found that matches the type and description criteria. Each keyring is
572 checked for keys before recursion into its children occurs. 572 checked for keys before recursion into its children occurs.
573 573
574 The process must have search permission on the top level keyring, or else 574 The process must have search permission on the top level keyring, or else
575 error EACCES will result. Only keyrings that the process has search 575 error EACCES will result. Only keyrings that the process has search
576 permission on will be recursed into, and only keys and keyrings for which 576 permission on will be recursed into, and only keys and keyrings for which
577 a process has search permission can be matched. If the specified keyring 577 a process has search permission can be matched. If the specified keyring
578 is not a keyring, ENOTDIR will result. 578 is not a keyring, ENOTDIR will result.
579 579
580 If the search succeeds, the function will attempt to link the found key 580 If the search succeeds, the function will attempt to link the found key
581 into the destination keyring if one is supplied (non-zero ID). All the 581 into the destination keyring if one is supplied (non-zero ID). All the
582 constraints applicable to KEYCTL_LINK apply in this case too. 582 constraints applicable to KEYCTL_LINK apply in this case too.
583 583
584 Error ENOKEY, EKEYREVOKED or EKEYEXPIRED will be returned if the search 584 Error ENOKEY, EKEYREVOKED or EKEYEXPIRED will be returned if the search
585 fails. On success, the resulting key ID will be returned. 585 fails. On success, the resulting key ID will be returned.
586 586
587 587
588 (*) Read the payload data from a key: 588 (*) Read the payload data from a key:
589 589
590 long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer, 590 long keyctl(KEYCTL_READ, key_serial_t keyring, char *buffer,
591 size_t buflen); 591 size_t buflen);
592 592
593 This function attempts to read the payload data from the specified key 593 This function attempts to read the payload data from the specified key
594 into the buffer. The process must have read permission on the key to 594 into the buffer. The process must have read permission on the key to
595 succeed. 595 succeed.
596 596
597 The returned data will be processed for presentation by the key type. For 597 The returned data will be processed for presentation by the key type. For
598 instance, a keyring will return an array of key_serial_t entries 598 instance, a keyring will return an array of key_serial_t entries
599 representing the IDs of all the keys to which it is subscribed. The user 599 representing the IDs of all the keys to which it is subscribed. The user
600 defined key type will return its data as is. If a key type does not 600 defined key type will return its data as is. If a key type does not
601 implement this function, error EOPNOTSUPP will result. 601 implement this function, error EOPNOTSUPP will result.
602 602
603 As much of the data as can be fitted into the buffer will be copied to 603 As much of the data as can be fitted into the buffer will be copied to
604 userspace if the buffer pointer is not NULL. 604 userspace if the buffer pointer is not NULL.
605 605
606 On a successful return, the function will always return the amount of data 606 On a successful return, the function will always return the amount of data
607 available rather than the amount copied. 607 available rather than the amount copied.
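Because the return value is the amount of data available rather than the amount copied, a caller can size its buffer with one call and fetch the data with a second. The sketch below illustrates that pattern against a mock: mock_keyctl_read() and struct mock_key are hypothetical stand-ins that only imitate the return-value behaviour described above, so that the example is self-contained; they are not the real syscall.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a key with a payload. */
struct mock_key {
	const char *payload;
	size_t len;
};

/*
 * Imitates the documented behaviour of KEYCTL_READ: copy as much as
 * fits (nothing if buffer is NULL) and return the full payload length.
 */
long mock_keyctl_read(const struct mock_key *key, char *buffer, size_t buflen)
{
	if (buffer) {
		size_t n = key->len < buflen ? key->len : buflen;

		memcpy(buffer, key->payload, n);
	}
	return (long)key->len;
}

/*
 * Read the whole payload into a malloc'd buffer.  The loop repeats if
 * the payload grew between the sizing call and the reading call.
 * Returns NULL on error; the caller frees the result.
 */
char *read_whole_payload(const struct mock_key *key, size_t *len)
{
	char *buf = NULL;
	size_t cap = 0;

	for (;;) {
		long need = mock_keyctl_read(key, buf, cap);
		char *tmp;

		if (need < 0) {
			free(buf);
			return NULL;
		}
		if (buf && (size_t)need <= cap) {
			*len = (size_t)need;
			return buf;
		}
		/* +1 so that an empty payload still gets a valid buffer */
		cap = (size_t)need + 1;
		tmp = realloc(buf, cap);
		if (!tmp) {
			free(buf);
			return NULL;
		}
		buf = tmp;
	}
}
```

The same two-call approach applies to KEYCTL_DESCRIBE, which reports the size it could produce in the same way.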
608 608
609 609
610 (*) Instantiate a partially constructed key. 610 (*) Instantiate a partially constructed key.
611 611
612 long keyctl(KEYCTL_INSTANTIATE, key_serial_t key, 612 long keyctl(KEYCTL_INSTANTIATE, key_serial_t key,
613 const void *payload, size_t plen, 613 const void *payload, size_t plen,
614 key_serial_t keyring); 614 key_serial_t keyring);
615 615
616 If the kernel calls back to userspace to complete the instantiation of a 616 If the kernel calls back to userspace to complete the instantiation of a
617 key, userspace should use this call to supply data for the key before the 617 key, userspace should use this call to supply data for the key before the
618 invoked process returns, or else the key will be marked negative 618 invoked process returns, or else the key will be marked negative
619 automatically. 619 automatically.
620 620
621 The process must have write access on the key to be able to instantiate 621 The process must have write access on the key to be able to instantiate
622 it, and the key must be uninstantiated. 622 it, and the key must be uninstantiated.
623 623
624 If a keyring is specified (non-zero), the key will also be linked into 624 If a keyring is specified (non-zero), the key will also be linked into
625 that keyring, however all the constraints applying in KEYCTL_LINK apply in 625 that keyring, however all the constraints applying in KEYCTL_LINK apply in
626 this case too. 626 this case too.
627 627
628 The payload and plen arguments describe the payload data as for add_key(). 628 The payload and plen arguments describe the payload data as for add_key().
629 629
630 630
631 (*) Negatively instantiate a partially constructed key. 631 (*) Negatively instantiate a partially constructed key.
632 632
633 long keyctl(KEYCTL_NEGATE, key_serial_t key, 633 long keyctl(KEYCTL_NEGATE, key_serial_t key,
634 unsigned timeout, key_serial_t keyring); 634 unsigned timeout, key_serial_t keyring);
635 635
636 If the kernel calls back to userspace to complete the instantiation of a 636 If the kernel calls back to userspace to complete the instantiation of a
637 key, userspace should use this call to mark the key as negative before the 637 key, userspace should use this call to mark the key as negative before the
638 invoked process returns if it is unable to fulfil the request. 638 invoked process returns if it is unable to fulfil the request.
639 639
640 The process must have write access on the key to be able to instantiate 640 The process must have write access on the key to be able to instantiate
641 it, and the key must be uninstantiated. 641 it, and the key must be uninstantiated.
642 642
643 If a keyring is specified (non-zero), the key will also be linked into 643 If a keyring is specified (non-zero), the key will also be linked into
644 that keyring, however all the constraints applying in KEYCTL_LINK apply in 644 that keyring, however all the constraints applying in KEYCTL_LINK apply in
645 this case too. 645 this case too.
646 646
647 647
648 (*) Set the default request-key destination keyring. 648 (*) Set the default request-key destination keyring.
649 649
650 long keyctl(KEYCTL_SET_REQKEY_KEYRING, int reqkey_defl); 650 long keyctl(KEYCTL_SET_REQKEY_KEYRING, int reqkey_defl);
651 651
652 This sets the default keyring to which implicitly requested keys will be 652 This sets the default keyring to which implicitly requested keys will be
653 attached for this thread. reqkey_defl should be one of these constants: 653 attached for this thread. reqkey_defl should be one of these constants:
654 654
655 CONSTANT VALUE NEW DEFAULT KEYRING 655 CONSTANT VALUE NEW DEFAULT KEYRING
656 ====================================== ====== ======================= 656 ====================================== ====== =======================
657 KEY_REQKEY_DEFL_NO_CHANGE -1 No change 657 KEY_REQKEY_DEFL_NO_CHANGE -1 No change
658 KEY_REQKEY_DEFL_DEFAULT 0 Default[1] 658 KEY_REQKEY_DEFL_DEFAULT 0 Default[1]
659 KEY_REQKEY_DEFL_THREAD_KEYRING 1 Thread keyring 659 KEY_REQKEY_DEFL_THREAD_KEYRING 1 Thread keyring
660 KEY_REQKEY_DEFL_PROCESS_KEYRING 2 Process keyring 660 KEY_REQKEY_DEFL_PROCESS_KEYRING 2 Process keyring
661 KEY_REQKEY_DEFL_SESSION_KEYRING 3 Session keyring 661 KEY_REQKEY_DEFL_SESSION_KEYRING 3 Session keyring
662 KEY_REQKEY_DEFL_USER_KEYRING 4 User keyring 662 KEY_REQKEY_DEFL_USER_KEYRING 4 User keyring
663 KEY_REQKEY_DEFL_USER_SESSION_KEYRING 5 User session keyring 663 KEY_REQKEY_DEFL_USER_SESSION_KEYRING 5 User session keyring
664 KEY_REQKEY_DEFL_GROUP_KEYRING 6 Group keyring 664 KEY_REQKEY_DEFL_GROUP_KEYRING 6 Group keyring
665 665
666 The old default will be returned if successful and error EINVAL will be 666 The old default will be returned if successful and error EINVAL will be
667 returned if reqkey_defl is not one of the above values. 667 returned if reqkey_defl is not one of the above values.
668 668
669 The default keyring can be overridden by the keyring indicated to the 669 The default keyring can be overridden by the keyring indicated to the
670 request_key() system call. 670 request_key() system call.
671 671
672 Note that this setting is inherited across fork/exec. 672 Note that this setting is inherited across fork/exec.
673 673
674 [1] The default default is: the thread keyring if there is one, otherwise 674 [1] The default is: the thread keyring if there is one, otherwise
675 the process keyring if there is one, otherwise the session keyring if 675 the process keyring if there is one, otherwise the session keyring if
676 there is one, otherwise the user default session keyring. 676 there is one, otherwise the user default session keyring.
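The fallback order in note [1] can be sketched as a simple chain of checks. This is only an illustration of the ordering: the struct, the typedef and the convention that an ID of 0 means "no such keyring" are assumptions made for the example, not the kernel's data structures.

```c
#include <stdint.h>

/* Hypothetical serial number type for this model. */
typedef int32_t model_serial_t;

/* Hypothetical view of a task's keyrings; 0 means "not present". */
struct model_keyrings {
	model_serial_t thread;
	model_serial_t process;
	model_serial_t session;
	model_serial_t user_session;
};

/*
 * Pick the destination for KEY_REQKEY_DEFL_DEFAULT: thread keyring,
 * else process keyring, else session keyring, else the user default
 * session keyring.
 */
model_serial_t default_reqkey_keyring(const struct model_keyrings *k)
{
	if (k->thread)
		return k->thread;
	if (k->process)
		return k->process;
	if (k->session)
		return k->session;
	return k->user_session;
}
```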
677 677
678 678
679 (*) Set the timeout on a key. 679 (*) Set the timeout on a key.
680 680
681 long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout); 681 long keyctl(KEYCTL_SET_TIMEOUT, key_serial_t key, unsigned timeout);
682 682
683 This sets or clears the timeout on a key. The timeout can be 0 to clear 683 This sets or clears the timeout on a key. The timeout can be 0 to clear
684 the timeout or a number of seconds to set the expiry time that far into 684 the timeout or a number of seconds to set the expiry time that far into
685 the future. 685 the future.
686 686
687 The process must have attribute modification access on a key to set its 687 The process must have attribute modification access on a key to set its
688 timeout. Timeouts may not be set with this function on negative, revoked 688 timeout. Timeouts may not be set with this function on negative, revoked
689 or expired keys. 689 or expired keys.
690 690
691 691
692 (*) Assume the authority granted to instantiate a key 692 (*) Assume the authority granted to instantiate a key
693 693
694 long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key); 694 long keyctl(KEYCTL_ASSUME_AUTHORITY, key_serial_t key);
695 695
696 This assumes or divests the authority required to instantiate the 696 This assumes or divests the authority required to instantiate the
697 specified key. Authority can only be assumed if the thread has the 697 specified key. Authority can only be assumed if the thread has the
698 authorisation key associated with the specified key in its keyrings 698 authorisation key associated with the specified key in its keyrings
699 somewhere. 699 somewhere.
700 700
701 Once authority is assumed, searches for keys will also search the 701 Once authority is assumed, searches for keys will also search the
702 requester's keyrings using the requester's security label, UID, GID and 702 requester's keyrings using the requester's security label, UID, GID and
703 groups. 703 groups.
704 704
705 If the requested authority is unavailable, error EPERM will be returned, 705 If the requested authority is unavailable, error EPERM will be returned,
706 likewise if the authority has been revoked because the target key is 706 likewise if the authority has been revoked because the target key is
707 already instantiated. 707 already instantiated.
708 708
709 If the specified key is 0, then any assumed authority will be divested. 709 If the specified key is 0, then any assumed authority will be divested.
710 710
711 The assumed authoritative key is inherited across fork and exec. 711 The assumed authoritative key is inherited across fork and exec.
712 712
713 713
714 =============== 714 ===============
715 KERNEL SERVICES 715 KERNEL SERVICES
716 =============== 716 ===============
717 717
718 The kernel services for key management are fairly simple to deal with. They can 718 The kernel services for key management are fairly simple to deal with. They can
719 be broken down into two areas: keys and key types. 719 be broken down into two areas: keys and key types.
720 720
721 Dealing with keys is fairly straightforward. Firstly, the kernel service 721 Dealing with keys is fairly straightforward. Firstly, the kernel service
722 registers its type, then it searches for a key of that type. It should retain 722 registers its type, then it searches for a key of that type. It should retain
723 the key as long as it has need of it, and then it should release it. For a 723 the key as long as it has need of it, and then it should release it. For a
724 filesystem or device file, a search would probably be performed during the open 724 filesystem or device file, a search would probably be performed during the open
725 call, and the key released upon close. How to deal with conflicting keys due to 725 call, and the key released upon close. How to deal with conflicting keys due to
726 two different users opening the same file is left to the filesystem author to 726 two different users opening the same file is left to the filesystem author to
727 solve. 727 solve.
728 728
729 Note that there are two different types of pointers to keys that may be 729 Note that there are two different types of pointers to keys that may be
730 encountered: 730 encountered:
731 731
732 (*) struct key * 732 (*) struct key *
733 733
734 This simply points to the key structure itself. Key structures will be at 734 This simply points to the key structure itself. Key structures will be at
735 least four-byte aligned. 735 least four-byte aligned.
736 736
737 (*) key_ref_t 737 (*) key_ref_t
738 738
739 This is equivalent to a struct key *, but the least significant bit is set 739 This is equivalent to a struct key *, but the least significant bit is set
740 if the caller "possesses" the key. By "possession" it is meant that the 740 if the caller "possesses" the key. By "possession" it is meant that the
741 calling process has a searchable link to the key from one of its 741 calling process has a searchable link to the key from one of its
742 keyrings. There are three functions for dealing with these: 742 keyrings. There are three functions for dealing with these:
743 743
744 key_ref_t make_key_ref(const struct key *key, 744 key_ref_t make_key_ref(const struct key *key,
745 unsigned long possession); 745 unsigned long possession);
746 746
747 struct key *key_ref_to_ptr(const key_ref_t key_ref); 747 struct key *key_ref_to_ptr(const key_ref_t key_ref);
748 748
749 unsigned long is_key_possessed(const key_ref_t key_ref); 749 unsigned long is_key_possessed(const key_ref_t key_ref);
750 750
751 The first function constructs a key reference from a key pointer and 751 The first function constructs a key reference from a key pointer and
752 possession information (which must be 0 or 1 and not any other value). 752 possession information (which must be 0 or 1 and not any other value).
753 753
754 The second function retrieves the key pointer from a reference and the 754 The second function retrieves the key pointer from a reference and the
755 third retrieves the possession flag. 755 third retrieves the possession flag.
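The alignment guarantee is what makes the low bit of the pointer free to carry the possession flag. The following userspace model shows the idea. All names carry a model_ prefix to make clear that they are hypothetical and that this is a sketch of the technique, not the kernel's implementation.

```c
#include <stdint.h>

/* Hypothetical key structure; any type aligned to 2 bytes or more works. */
struct model_key {
	int serial;
};

/* The reference is the pointer value with the possession flag in bit 0. */
typedef uintptr_t model_key_ref_t;

/* possession must be 0 or 1, as the text above requires. */
model_key_ref_t model_make_key_ref(const struct model_key *key,
				   unsigned long possession)
{
	return (uintptr_t)key | possession;
}

/* Mask off bit 0 to recover the original pointer. */
struct model_key *model_key_ref_to_ptr(model_key_ref_t key_ref)
{
	return (struct model_key *)(key_ref & ~(uintptr_t)1);
}

/* Bit 0 alone is the possession flag. */
unsigned long model_is_key_possessed(model_key_ref_t key_ref)
{
	return (unsigned long)(key_ref & 1);
}
```

Because the flag lives inside the pointer value, a reference must always be converted with the accessor before it is dereferenced.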
756 756
757 When accessing a key's payload contents, certain precautions must be taken to 757 When accessing a key's payload contents, certain precautions must be taken to
758 prevent access vs modification races. See the section "Notes on accessing 758 prevent access vs modification races. See the section "Notes on accessing
759 payload contents" for more information. 759 payload contents" for more information.
760 760
761 (*) To search for a key, call: 761 (*) To search for a key, call:
762 762
763 struct key *request_key(const struct key_type *type, 763 struct key *request_key(const struct key_type *type,
764 const char *description, 764 const char *description,
765 const char *callout_string); 765 const char *callout_string);
766 766
767 This is used to request a key or keyring with a description that matches 767 This is used to request a key or keyring with a description that matches
768 the description specified according to the key type's match function. This 768 the description specified according to the key type's match function. This
769 permits approximate matching to occur. If callout_string is not NULL, then 769 permits approximate matching to occur. If callout_string is not NULL, then
770 /sbin/request-key will be invoked in an attempt to obtain the key from 770 /sbin/request-key will be invoked in an attempt to obtain the key from
771 userspace. In that case, callout_string will be passed as an argument to 771 userspace. In that case, callout_string will be passed as an argument to
772 the program. 772 the program.
773 773
774 Should the function fail, error ENOKEY, EKEYEXPIRED or EKEYREVOKED will be 774 Should the function fail, error ENOKEY, EKEYEXPIRED or EKEYREVOKED will be
775 returned. 775 returned.
776 776
777 If successful, the key will have been attached to the default keyring for 777 If successful, the key will have been attached to the default keyring for
778 implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING. 778 implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING.
779 779
780 See also Documentation/keys-request-key.txt. 780 See also Documentation/keys-request-key.txt.
781 781
782 782
783 (*) To search for a key, passing auxiliary data to the upcaller, call: 783 (*) To search for a key, passing auxiliary data to the upcaller, call:
784 784
785 struct key *request_key_with_auxdata(const struct key_type *type, 785 struct key *request_key_with_auxdata(const struct key_type *type,
786 const char *description, 786 const char *description,
787 const char *callout_string, 787 const char *callout_string,
788 void *aux); 788 void *aux);
789 789
790 This is identical to request_key(), except that the auxiliary data is 790 This is identical to request_key(), except that the auxiliary data is
791 passed to the key_type->request_key() op if it exists. 791 passed to the key_type->request_key() op if it exists.
792 792
793 793
794 (*) When it is no longer required, the key should be released using: 794 (*) When it is no longer required, the key should be released using:
795 795
796 void key_put(struct key *key); 796 void key_put(struct key *key);
797 797
798 Or: 798 Or:
799 799
800 void key_ref_put(key_ref_t key_ref); 800 void key_ref_put(key_ref_t key_ref);
801 801
802 These can be called from interrupt context. If CONFIG_KEYS is not set then 802 These can be called from interrupt context. If CONFIG_KEYS is not set then
803 the argument will not be parsed. 803 the argument will not be parsed.
804 804
805 805
806 (*) Extra references can be made to a key by calling the following function: 806 (*) Extra references can be made to a key by calling the following function:
807 807
808 struct key *key_get(struct key *key); 808 struct key *key_get(struct key *key);
809 809
810 These need to be disposed of by calling key_put() when they've been 810 These need to be disposed of by calling key_put() when they've been
811 finished with. The key pointer passed in will be returned. If the pointer 811 finished with. The key pointer passed in will be returned. If the pointer
812 is NULL or CONFIG_KEYS is not set then the key will not be dereferenced and 812 is NULL or CONFIG_KEYS is not set then the key will not be dereferenced and
813 no increment will take place. 813 no increment will take place.
814 814
815 815
816 (*) A key's serial number can be obtained by calling: 816 (*) A key's serial number can be obtained by calling:
817 817
818 key_serial_t key_serial(struct key *key); 818 key_serial_t key_serial(struct key *key);
819 819
820 If key is NULL or if CONFIG_KEYS is not set then 0 will be returned (in the 820 If key is NULL or if CONFIG_KEYS is not set then 0 will be returned (in the
821 latter case without parsing the argument). 821 latter case without parsing the argument).
822 822
823 823
824 (*) If a keyring was found in the search, this can be further searched by: 824 (*) If a keyring was found in the search, this can be further searched by:
825 825
826 key_ref_t keyring_search(key_ref_t keyring_ref, 826 key_ref_t keyring_search(key_ref_t keyring_ref,
827 const struct key_type *type, 827 const struct key_type *type,
828 const char *description) 828 const char *description)
829 829
830 This searches the keyring tree specified for a matching key. Error ENOKEY 830 This searches the keyring tree specified for a matching key. Error ENOKEY
831 is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful, 831 is returned upon failure (use IS_ERR/PTR_ERR to determine). If successful,
832 the returned key will need to be released. 832 the returned key will need to be released.
833 833
834 The possession attribute from the keyring reference is used to control 834 The possession attribute from the keyring reference is used to control
835 access through the permissions mask and is propagated to the returned key 835 access through the permissions mask and is propagated to the returned key
836 reference pointer if successful. 836 reference pointer if successful.
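
For readers unfamiliar with the IS_ERR/PTR_ERR convention mentioned above, the following userspace sketch shows the general idea of carrying a negative errno value in a pointer. The mock_* helpers and MOCK_MAX_ERRNO are hypothetical names for illustration; kernel code should use the real macros from the kernel headers.

```c
#include <errno.h>
#include <stdint.h>

#ifndef ENOKEY
#define ENOKEY 126	/* fallback for systems whose errno.h lacks it */
#endif

/* Assumed upper bound on errno values, used to reserve the top of the
 * address range for encoded errors. */
#define MOCK_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *mock_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the negative errno value from an error pointer. */
static inline long mock_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer falls in the range reserved for encoded errors. */
static inline int mock_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MOCK_MAX_ERRNO;
}
```

A caller following this pattern would test the returned reference with the IS_ERR equivalent first and only then either extract the error code or go on to use (and later release) the key.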
837 837
838 838
839 (*) To check the validity of a key, this function can be called: 839 (*) To check the validity of a key, this function can be called:
840 840
841 int validate_key(struct key *key); 841 int validate_key(struct key *key);
842 842
843 This checks that the key in question hasn't expired and hasn't been 843 This checks that the key in question hasn't expired and hasn't been
844 revoked. Should the key be invalid, error EKEYEXPIRED or EKEYREVOKED will 844 revoked. Should the key be invalid, error EKEYEXPIRED or EKEYREVOKED will
845 be returned. If the key is NULL or if CONFIG_KEYS is not set then 0 will be 845 be returned. If the key is NULL or if CONFIG_KEYS is not set then 0 will be
846 returned (in the latter case without parsing the argument). 846 returned (in the latter case without parsing the argument).
847 847
848 848
849 (*) To register a key type, the following function should be called: 849 (*) To register a key type, the following function should be called:
850 850
851 int register_key_type(struct key_type *type); 851 int register_key_type(struct key_type *type);
852 852
853 This will return error EEXIST if a type of the same name is already 853 This will return error EEXIST if a type of the same name is already
854 present. 854 present.
855 855
856 856
857 (*) To unregister a key type, call: 857 (*) To unregister a key type, call:
858 858
859 void unregister_key_type(struct key_type *type); 859 void unregister_key_type(struct key_type *type);
860 860
861 861
862 =================================== 862 ===================================
863 NOTES ON ACCESSING PAYLOAD CONTENTS 863 NOTES ON ACCESSING PAYLOAD CONTENTS
864 =================================== 864 ===================================
865 865
866 The simplest payload is just a number in key->payload.value. In this case, 866 The simplest payload is just a number in key->payload.value. In this case,
867 there's no need to indulge in RCU or locking when accessing the payload. 867 there's no need to indulge in RCU or locking when accessing the payload.
868 868
869 More complex payload contents must be allocated and a pointer to them set in 869 More complex payload contents must be allocated and a pointer to them set in
870 key->payload.data. One of the following ways must be selected to access the 870 key->payload.data. One of the following ways must be selected to access the
871 data: 871 data:
872 872
873 (1) Unmodifiable key type. 873 (1) Unmodifiable key type.
874 874
875 If the key type does not have a modify method, then the key's payload can 875 If the key type does not have a modify method, then the key's payload can
876 be accessed without any form of locking, provided that it's known to be 876 be accessed without any form of locking, provided that it's known to be
877 instantiated (uninstantiated keys cannot be "found"). 877 instantiated (uninstantiated keys cannot be "found").
878 878
879 (2) The key's semaphore. 879 (2) The key's semaphore.
880 880
881 The semaphore could be used to govern access to the payload and to control 881 The semaphore could be used to govern access to the payload and to control
882 the payload pointer. It must be write-locked for modifications and would 882 the payload pointer. It must be write-locked for modifications and would
883 have to be read-locked for general access. The disadvantage of doing this 883 have to be read-locked for general access. The disadvantage of doing this
884 is that the accessor may be required to sleep. 884 is that the accessor may be required to sleep.
885 885
886 (3) RCU. 886 (3) RCU.
887 887
888 RCU must be used when the semaphore isn't already held; if the semaphore 888 RCU must be used when the semaphore isn't already held; if the semaphore
889 is held then the contents can't change under you unexpectedly as the 889 is held then the contents can't change under you unexpectedly as the
890 semaphore must still be used to serialise modifications to the key. The 890 semaphore must still be used to serialise modifications to the key. The
891 key management code takes care of this for the key type. 891 key management code takes care of this for the key type.
892 892
893 However, this means using: 893 However, this means using:
894 894
895 rcu_read_lock() ... rcu_dereference() ... rcu_read_unlock() 895 rcu_read_lock() ... rcu_dereference() ... rcu_read_unlock()
896 896
897 to read the pointer, and: 897 to read the pointer, and:
898 898
899 rcu_dereference() ... rcu_assign_pointer() ... call_rcu() 899 rcu_dereference() ... rcu_assign_pointer() ... call_rcu()
900 900
901 to set the pointer and dispose of the old contents after a grace period. 901 to set the pointer and dispose of the old contents after a grace period.
902 Note that only the key type should ever modify a key's payload. 902 Note that only the key type should ever modify a key's payload.
903 903
904 Furthermore, an RCU controlled payload must hold a struct rcu_head for the 904 Furthermore, an RCU controlled payload must hold a struct rcu_head for the
905 use of call_rcu() and, if the payload is of variable size, the length of 905 use of call_rcu() and, if the payload is of variable size, the length of
906 the payload. key->datalen cannot be relied upon to be consistent with the 906 the payload. key->datalen cannot be relied upon to be consistent with the
907 payload just dereferenced if the key's semaphore is not held. 907 payload just dereferenced if the key's semaphore is not held.
908 908
909 909
910 =================== 910 ===================
911 DEFINING A KEY TYPE 911 DEFINING A KEY TYPE
912 =================== 912 ===================
913 913
914 A kernel service may want to define its own key type. For instance, an AFS 914 A kernel service may want to define its own key type. For instance, an AFS
915 filesystem might want to define a Kerberos 5 ticket key type. To do this, its 915 filesystem might want to define a Kerberos 5 ticket key type. To do this, its
916 author fills in a struct key_type and registers it with the system. 916 author fills in a struct key_type and registers it with the system.
917 917
918 The structure has a number of fields, some of which are mandatory: 918 The structure has a number of fields, some of which are mandatory:
919 919
920 (*) const char *name 920 (*) const char *name
921 921
922 The name of the key type. This is used to translate a key type name 922 The name of the key type. This is used to translate a key type name
923 supplied by userspace into a pointer to the structure. 923 supplied by userspace into a pointer to the structure.
924 924
925 925
926 (*) size_t def_datalen 926 (*) size_t def_datalen
927 927
928 This is optional - it supplies the default payload data length as 928 This is optional - it supplies the default payload data length as
929 contributed to the quota. If the key type's payload is always or almost 929 contributed to the quota. If the key type's payload is always or almost
930 always the same size, then this is a more efficient way to do things. 930 always the same size, then this is a more efficient way to do things.
931 931
932 The data length (and quota) on a particular key can always be changed 932 The data length (and quota) on a particular key can always be changed
933 during instantiation or update by calling: 933 during instantiation or update by calling:
934 934
935 int key_payload_reserve(struct key *key, size_t datalen); 935 int key_payload_reserve(struct key *key, size_t datalen);
936 936
937 With the revised data length. Error EDQUOT will be returned if this is not 937 With the revised data length. Error EDQUOT will be returned if this is not
938 viable. 938 viable.
939 939
940 940
941 (*) int (*instantiate)(struct key *key, const void *data, size_t datalen); 941 (*) int (*instantiate)(struct key *key, const void *data, size_t datalen);
942 942
943 This method is called to attach a payload to a key during construction. 943 This method is called to attach a payload to a key during construction.
944 The payload attached need not bear any relation to the data passed to this 944 The payload attached need not bear any relation to the data passed to this
945 function. 945 function.
946 946
947 If the amount of data attached to the key differs from the size in 947 If the amount of data attached to the key differs from the size in
948 keytype->def_datalen, then key_payload_reserve() should be called. 948 keytype->def_datalen, then key_payload_reserve() should be called.
949 949
950 This method does not have to lock the key in order to attach a payload. 950 This method does not have to lock the key in order to attach a payload.
951 The fact that KEY_FLAG_INSTANTIATED is not set in key->flags prevents 951 The fact that KEY_FLAG_INSTANTIATED is not set in key->flags prevents
952 anything else from gaining access to the key. 952 anything else from gaining access to the key.
953 953
954 It is safe to sleep in this method. 954 It is safe to sleep in this method.
955 955
956 956
957 (*) int (*update)(struct key *key, const void *data, size_t datalen); 957 (*) int (*update)(struct key *key, const void *data, size_t datalen);
958 958
959 If this type of key can be updated, then this method should be provided. 959 If this type of key can be updated, then this method should be provided.
960 It is called to update a key's payload from the blob of data provided. 960 It is called to update a key's payload from the blob of data provided.
961 961
962 key_payload_reserve() should be called if the data length might change 962 key_payload_reserve() should be called if the data length might change
963 before any changes are actually made. Note that if this succeeds, the type 963 before any changes are actually made. Note that if this succeeds, the type
964 is committed to changing the key because it's already been altered, so all 964 is committed to changing the key because it's already been altered, so all
965 memory allocation must be done first. 965 memory allocation must be done first.
966 966
967 The key will have its semaphore write-locked before this method is called, 967 The key will have its semaphore write-locked before this method is called,
968 but this only deters other writers; any changes to the key's payload must 968 but this only deters other writers; any changes to the key's payload must
969 be made under RCU conditions, and call_rcu() must be used to dispose of 969 be made under RCU conditions, and call_rcu() must be used to dispose of
970 the old payload. 970 the old payload.
971 971
972 key_payload_reserve() should be called before the changes are made, but 972 key_payload_reserve() should be called before the changes are made, but
973 after all allocations and other potentially failing function calls are 973 after all allocations and other potentially failing function calls are
974 made. 974 made.
975 975
976 It is safe to sleep in this method. 976 It is safe to sleep in this method.
977 977
978 978
979 (*) int (*match)(const struct key *key, const void *desc); 979 (*) int (*match)(const struct key *key, const void *desc);
980 980
981 This method is called to match a key against a description. It should 981 This method is called to match a key against a description. It should
982 return non-zero if the two match, zero if they don't. 982 return non-zero if the two match, zero if they don't.
983 983
984 This method should not need to lock the key in any way. The type and 984 This method should not need to lock the key in any way. The type and
985 description can be considered invariant, and the payload should not be 985 description can be considered invariant, and the payload should not be
986 accessed (the key may not yet be instantiated). 986 accessed (the key may not yet be instantiated).
987 987
988 It is not safe to sleep in this method; the caller may hold spinlocks. 988 It is not safe to sleep in this method; the caller may hold spinlocks.
989 989
990 990
991 (*) void (*revoke)(struct key *key); 991 (*) void (*revoke)(struct key *key);
992 992
993 This method is optional. It is called to discard part of the payload 993 This method is optional. It is called to discard part of the payload
994 data upon a key being revoked. The caller will have the key semaphore 994 data upon a key being revoked. The caller will have the key semaphore
995 write-locked. 995 write-locked.
996 996
997 It is safe to sleep in this method, though care should be taken to avoid 997 It is safe to sleep in this method, though care should be taken to avoid
998 a deadlock against the key semaphore. 998 a deadlock against the key semaphore.
999 999
1000 1000
1001 (*) void (*destroy)(struct key *key); 1001 (*) void (*destroy)(struct key *key);
1002 1002
1003 This method is optional. It is called to discard the payload data on a key 1003 This method is optional. It is called to discard the payload data on a key
1004 when it is being destroyed. 1004 when it is being destroyed.
1005 1005
1006 This method does not need to lock the key to access the payload; it can 1006 This method does not need to lock the key to access the payload; it can
1007 consider the key as being inaccessible at this time. Note that the key's 1007 consider the key as being inaccessible at this time. Note that the key's
1008 type may have been changed before this function is called. 1008 type may have been changed before this function is called.
1009 1009
1010 It is not safe to sleep in this method; the caller may hold spinlocks. 1010 It is not safe to sleep in this method; the caller may hold spinlocks.
1011 1011
1012 1012
1013 (*) void (*describe)(const struct key *key, struct seq_file *p); 1013 (*) void (*describe)(const struct key *key, struct seq_file *p);
1014 1014
1015 This method is optional. It is called during /proc/keys reading to 1015 This method is optional. It is called during /proc/keys reading to
1016 summarise a key's description and payload in text form. 1016 summarise a key's description and payload in text form.
1017 1017
1018 This method will be called with the RCU read lock held. rcu_dereference() 1018 This method will be called with the RCU read lock held. rcu_dereference()
1019 should be used to read the payload pointer if the payload is to be 1019 should be used to read the payload pointer if the payload is to be
1020 accessed. key->datalen cannot be trusted to stay consistent with the 1020 accessed. key->datalen cannot be trusted to stay consistent with the
1021 contents of the payload. 1021 contents of the payload.
1022 1022
1023 The description will not change, though the key's state may. 1023 The description will not change, though the key's state may.
1024 1024
1025 It is not safe to sleep in this method; the RCU read lock is held by the 1025 It is not safe to sleep in this method; the RCU read lock is held by the
1026 caller. 1026 caller.
1027 1027
1028 1028
1029 (*) long (*read)(const struct key *key, char __user *buffer, size_t buflen); 1029 (*) long (*read)(const struct key *key, char __user *buffer, size_t buflen);
1030 1030
1031 This method is optional. It is called by KEYCTL_READ to translate the 1031 This method is optional. It is called by KEYCTL_READ to translate the
1032 key's payload into a blob of data for userspace to deal with. 1032 key's payload into a blob of data for userspace to deal with.
1033 Ideally, the blob should be in the same format as that passed in to the 1033 Ideally, the blob should be in the same format as that passed in to the
1034 instantiate and update methods. 1034 instantiate and update methods.
1035 1035
1036 If successful, the blob size that could be produced should be returned 1036 If successful, the blob size that could be produced should be returned
1037 rather than the size copied. 1037 rather than the size copied.
1038 1038
1039 This method will be called with the key's semaphore read-locked. This will 1039 This method will be called with the key's semaphore read-locked. This will
1040 prevent the key's payload changing. It is not necessary to use RCU locking 1040 prevent the key's payload changing. It is not necessary to use RCU locking
1041 when accessing the key's payload. It is safe to sleep in this method, such 1041 when accessing the key's payload. It is safe to sleep in this method, such
1042 as might happen when the userspace buffer is accessed. 1042 as might happen when the userspace buffer is accessed.
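
The return-value rule for the read method (report the size that could be produced, not the size copied) can be illustrated with a userspace model. This is a sketch under stated assumptions: struct mock_payload and mock_read() are hypothetical, and memcpy() stands in for the copy to the userspace buffer that real kernel code would have to perform differently.

```c
#include <string.h>

/* Hypothetical payload layout: a length followed by the data bytes. */
struct mock_payload {
	size_t datalen;
	char data[32];
};

/* Copies as much of the payload as fits in the buffer, but always
 * returns the full payload size so the caller can detect truncation. */
long mock_read(const struct mock_payload *payload, char *buffer, size_t buflen)
{
	if (buffer && buflen > 0) {
		size_t n = buflen < payload->datalen ? buflen : payload->datalen;

		/* A kernel implementation would copy to userspace here. */
		memcpy(buffer, payload->data, n);
	}
	return (long)payload->datalen;
}
```

With this convention a caller can pass an empty buffer to learn the required size and then retry with a buffer that is large enough.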
1043 1043
1044 1044
1045 (*) int (*request_key)(struct key *key, struct key *authkey, const char *op, 1045 (*) int (*request_key)(struct key *key, struct key *authkey, const char *op,
1046 void *aux); 1046 void *aux);
1047 1047
1048 This method is optional. If provided, request_key() and 1048 This method is optional. If provided, request_key() and
1049 request_key_with_auxdata() will invoke this function rather than 1049 request_key_with_auxdata() will invoke this function rather than
1050 upcalling to /sbin/request-key to operate upon a key of this type. 1050 upcalling to /sbin/request-key to operate upon a key of this type.
1051 1051
1052 The aux parameter is as passed to request_key_with_auxdata() or is NULL 1052 The aux parameter is as passed to request_key_with_auxdata() or is NULL
1053 otherwise. Also passed are the key to be operated upon, the 1053 otherwise. Also passed are the key to be operated upon, the
1054 authorisation key for this operation and the operation type (currently 1054 authorisation key for this operation and the operation type (currently
1055 only "create"). 1055 only "create").
1056 1056
1057 This function should return only when the upcall is complete. Upon return 1057 This function should return only when the upcall is complete. Upon return
1058 the authorisation key will be revoked, and the target key will be 1058 the authorisation key will be revoked, and the target key will be
1059 negatively instantiated if it is still uninstantiated. The error will be 1059 negatively instantiated if it is still uninstantiated. The error will be
1060 returned to the caller of request_key*(). 1060 returned to the caller of request_key*().
1061 1061
1062 1062
1063 ============================ 1063 ============================
1064 REQUEST-KEY CALLBACK SERVICE 1064 REQUEST-KEY CALLBACK SERVICE
1065 ============================ 1065 ============================
1066 1066
1067 To create a new key, the kernel will attempt to execute the following command 1067 To create a new key, the kernel will attempt to execute the following command
1068 line: 1068 line:
1069 1069
1070 /sbin/request-key create <key> <uid> <gid> \ 1070 /sbin/request-key create <key> <uid> <gid> \
1071 <threadring> <processring> <sessionring> <callout_info> 1071 <threadring> <processring> <sessionring> <callout_info>
1072 1072
1073 <key> is the key being constructed, and the three keyrings are the process 1073 <key> is the key being constructed, and the three keyrings are the process
1074 keyrings from the process that caused the search to be issued. These are 1074 keyrings from the process that caused the search to be issued. These are
1075 included for two reasons: 1075 included for two reasons:
1076 1076
1077 (1) There may be an authentication token in one of the keyrings that is 1077 (1) There may be an authentication token in one of the keyrings that is
1078 required to obtain the key, eg: a Kerberos Ticket-Granting Ticket. 1078 required to obtain the key, eg: a Kerberos Ticket-Granting Ticket.
1079 1079
1080 (2) The new key should probably be cached in one of these rings. 1080 (2) The new key should probably be cached in one of these rings.
1081 1081
1082 This program should set its UID and GID to those specified before attempting to 1082 This program should set its UID and GID to those specified before attempting to
1083 access any more keys. It may then look around for a user specific process to 1083 access any more keys. It may then look around for a user specific process to
1084 hand the request off to (perhaps a path placed in another key by, for 1084 hand the request off to (perhaps a path placed in another key by, for
1085 example, the KDE desktop manager). 1085 example, the KDE desktop manager).
1086 1086
1087 The program (or whatever it calls) should finish construction of the key by 1087 The program (or whatever it calls) should finish construction of the key by
1088 calling KEYCTL_INSTANTIATE, which also permits it to cache the key in one of 1088 calling KEYCTL_INSTANTIATE, which also permits it to cache the key in one of
1089 the keyrings (probably the session ring) before returning. Alternatively, the 1089 the keyrings (probably the session ring) before returning. Alternatively, the
1090 key can be marked as negative with KEYCTL_NEGATE; this also permits the key to 1090 key can be marked as negative with KEYCTL_NEGATE; this also permits the key to
1091 be cached in one of the keyrings. 1091 be cached in one of the keyrings.
1092 1092
1093 If it returns with the key remaining in the unconstructed state, the key will 1093 If it returns with the key remaining in the unconstructed state, the key will
1094 be marked as being negative, it will be added to the session keyring, and an 1094 be marked as being negative, it will be added to the session keyring, and an
1095 error will be returned to the key requestor. 1095 error will be returned to the key requestor.
1096 1096
1097 Supplementary information may be provided from whoever or whatever invoked this 1097 Supplementary information may be provided from whoever or whatever invoked this
1098 service. This will be passed as the <callout_info> parameter. If no such 1098 service. This will be passed as the <callout_info> parameter. If no such
1099 information was made available, then "-" will be passed as this parameter 1099 information was made available, then "-" will be passed as this parameter
1100 instead. 1100 instead.
1101 1101
1102 1102
1103 Similarly, the kernel may attempt to update an expired or a soon to expire key 1103 Similarly, the kernel may attempt to update an expired or a soon to expire key
1104 by executing: 1104 by executing:
1105 1105
1106 /sbin/request-key update <key> <uid> <gid> \ 1106 /sbin/request-key update <key> <uid> <gid> \
1107 <threadring> <processring> <sessionring> 1107 <threadring> <processring> <sessionring>
1108 1108
1109 In this case, the program isn't required to actually attach the key to a ring; 1109 In this case, the program isn't required to actually attach the key to a ring;
1110 the rings are provided for reference. 1110 the rings are provided for reference.
1111 1111
Documentation/m68k/kernel-options.txt
1 1
2 2
3 Command Line Options for Linux/m68k 3 Command Line Options for Linux/m68k
4 =================================== 4 ===================================
5 5
6 Last Update: 2 May 1999 6 Last Update: 2 May 1999
7 Linux/m68k version: 2.2.6 7 Linux/m68k version: 2.2.6
8 Author: Roman.Hodek@informatik.uni-erlangen.de (Roman Hodek) 8 Author: Roman.Hodek@informatik.uni-erlangen.de (Roman Hodek)
9 Update: jds@kom.auc.dk (Jes Sorensen) and faq@linux-m68k.org (Chris Lawrence) 9 Update: jds@kom.auc.dk (Jes Sorensen) and faq@linux-m68k.org (Chris Lawrence)
10 10
11 0) Introduction 11 0) Introduction
12 =============== 12 ===============
13 13
14 Often I've been asked which command line options the Linux/m68k 14 Often I've been asked which command line options the Linux/m68k
15 kernel understands, or what the exact syntax for the ... option is, or 15 kernel understands, or what the exact syntax for the ... option is, or
16 ... about the option ... . I hope this document supplies all the 16 ... about the option ... . I hope this document supplies all the
17 answers... 17 answers...
18 18
19 Note that some options might be outdated, their descriptions being 19 Note that some options might be outdated, their descriptions being
20 incomplete or missing. Please update the information and send in the 20 incomplete or missing. Please update the information and send in the
21 patches. 21 patches.
22 22
23 23
24 1) Overview of the Kernel's Option Processing 24 1) Overview of the Kernel's Option Processing
25 ============================================= 25 =============================================
26 26
27 The kernel knows three kinds of options on its command line: 27 The kernel knows three kinds of options on its command line:
28 28
29 1) kernel options 29 1) kernel options
30 2) environment settings 30 2) environment settings
31 3) arguments for init 31 3) arguments for init
32 32
33 To which of these classes an argument belongs is determined as 33 To which of these classes an argument belongs is determined as
34 follows: If the option is known to the kernel itself, i.e. if the name 34 follows: If the option is known to the kernel itself, i.e. if the name
35 (the part before the '=') or, in some cases, the whole argument string 35 (the part before the '=') or, in some cases, the whole argument string
36 is known to the kernel, it belongs to class 1. Otherwise, if the 36 is known to the kernel, it belongs to class 1. Otherwise, if the
37 argument contains an '=', it is of class 2, and the definition is put 37 argument contains an '=', it is of class 2, and the definition is put
38 into init's environment. All other arguments are passed to init as 38 into init's environment. All other arguments are passed to init as
39 command line options. 39 command line options.
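
The classification rule above can be sketched in C as follows. The names arg_class, known_options and classify_arg() are hypothetical, and the option table is a small illustrative subset rather than the kernel's real list.

```c
#include <string.h>

/* The three classes listed above, numbered as in the text. */
enum arg_class {
	ARG_KERNEL_OPTION = 1,
	ARG_ENVIRONMENT = 2,
	ARG_INIT = 3
};

/* Illustrative subset only; the kernel keeps its own table. */
static const char *const known_options[] = { "root", "ro", "rw", "debug" };

/* Returns 1 if the first len bytes of s exactly match a known option. */
static int is_known(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < sizeof(known_options) / sizeof(known_options[0]); i++)
		if (strlen(known_options[i]) == len &&
		    strncmp(known_options[i], s, len) == 0)
			return 1;
	return 0;
}

enum arg_class classify_arg(const char *arg)
{
	const char *eq = strchr(arg, '=');
	size_t name_len = eq ? (size_t)(eq - arg) : strlen(arg);

	/* Class 1: the name, or the whole argument string, is known. */
	if (is_known(arg, name_len) || is_known(arg, strlen(arg)))
		return ARG_KERNEL_OPTION;
	/* Class 2: an unknown NAME=value goes into init's environment. */
	if (eq)
		return ARG_ENVIRONMENT;
	/* Class 3: everything else is passed to init as an argument. */
	return ARG_INIT;
}
```

Under these assumptions, "root=/dev/sda1" is a kernel option, "TERM=vt100" becomes an environment setting, and "single" is handed to init.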
40 40
41 This document describes the valid kernel options for Linux/m68k in 41 This document describes the valid kernel options for Linux/m68k in
42 the version mentioned at the start of this file. Later revisions may 42 the version mentioned at the start of this file. Later revisions may
43 add new such options, and some may be missing in older versions. 43 add new such options, and some may be missing in older versions.
44 44
45 In general, the value (the part after the '=') of an option is a 45 In general, the value (the part after the '=') of an option is a
46 list of values separated by commas. The interpretation of these values 46 list of values separated by commas. The interpretation of these values
47 is up to the driver that "owns" the option. This association of 47 is up to the driver that "owns" the option. This association of
48 options with drivers is also the reason that some are further 48 options with drivers is also the reason that some are further
49 subdivided. 49 subdivided.
50 50
51 51
52 2) General Kernel Options 52 2) General Kernel Options
53 ========================= 53 =========================
54 54
55 2.1) root= 55 2.1) root=
56 ---------- 56 ----------
57 57
58 Syntax: root=/dev/<device> 58 Syntax: root=/dev/<device>
59 or: root=<hex_number> 59 or: root=<hex_number>
60 60
61 This tells the kernel which device it should mount as the root 61 This tells the kernel which device it should mount as the root
62 filesystem. The device must be a block device with a valid filesystem 62 filesystem. The device must be a block device with a valid filesystem
63 on it. 63 on it.
64 64
65 The first syntax gives the device by name. These names are converted 65 The first syntax gives the device by name. These names are converted
66 into a major/minor number internally in the kernel in an unusual way. 66 into a major/minor number internally in the kernel in an unusual way.
67 Normally, this "conversion" is done by the device files in /dev, but 67 Normally, this "conversion" is done by the device files in /dev, but
68 this isn't possible here, because the root filesystem (with /dev) 68 this isn't possible here, because the root filesystem (with /dev)
69 isn't mounted yet... So the kernel parses the name itself, with some 69 isn't mounted yet... So the kernel parses the name itself, with some
70 hardcoded name to number mappings. The name must always be a 70 hardcoded name to number mappings. The name must always be a
71 combination of two or three letters, followed by a decimal number. 71 combination of two or three letters, followed by a decimal number.
72 Valid names are: 72 Valid names are:
73 73
74 /dev/ram: -> 0x0100 (initial ramdisk) 74 /dev/ram: -> 0x0100 (initial ramdisk)
75 /dev/hda: -> 0x0300 (first IDE disk) 75 /dev/hda: -> 0x0300 (first IDE disk)
76 /dev/hdb: -> 0x0340 (second IDE disk) 76 /dev/hdb: -> 0x0340 (second IDE disk)
77 /dev/sda: -> 0x0800 (first SCSI disk) 77 /dev/sda: -> 0x0800 (first SCSI disk)
78 /dev/sdb: -> 0x0810 (second SCSI disk) 78 /dev/sdb: -> 0x0810 (second SCSI disk)
79 /dev/sdc: -> 0x0820 (third SCSI disk) 79 /dev/sdc: -> 0x0820 (third SCSI disk)
80 /dev/sdd: -> 0x0830 (fourth SCSI disk) 80 /dev/sdd: -> 0x0830 (fourth SCSI disk)
81 /dev/sde: -> 0x0840 (fifth SCSI disk) 81 /dev/sde: -> 0x0840 (fifth SCSI disk)
82 /dev/fd : -> 0x0200 (floppy disk) 82 /dev/fd : -> 0x0200 (floppy disk)
83 /dev/xda: -> 0x0c00 (first XT disk, unused in Linux/m68k) 83 /dev/xda: -> 0x0c00 (first XT disk, unused in Linux/m68k)
84 /dev/xdb: -> 0x0c40 (second XT disk, unused in Linux/m68k) 84 /dev/xdb: -> 0x0c40 (second XT disk, unused in Linux/m68k)
85 /dev/ada: -> 0x1c00 (first ACSI device) 85 /dev/ada: -> 0x1c00 (first ACSI device)
86 /dev/adb: -> 0x1c10 (second ACSI device) 86 /dev/adb: -> 0x1c10 (second ACSI device)
87 /dev/adc: -> 0x1c20 (third ACSI device) 87 /dev/adc: -> 0x1c20 (third ACSI device)
88 /dev/add: -> 0x1c30 (fourth ACSI device) 88 /dev/add: -> 0x1c30 (fourth ACSI device)
89 89
90 The last four names are available only if the kernel has been compiled 90 The last four names are available only if the kernel has been compiled
91 with Atari and ACSI support. 91 with Atari and ACSI support.
92 92
93 The name must be followed by a decimal number, that stands for the 93 The name must be followed by a decimal number, that stands for the
94 partition number. Internally, the value of the number is just 94 partition number. Internally, the value of the number is just
95 added to the device number mentioned in the table above. The 95 added to the device number mentioned in the table above. The
96 exceptions are /dev/ram and /dev/fd, where /dev/ram refers to an 96 exceptions are /dev/ram and /dev/fd, where /dev/ram refers to an
97 initial ramdisk loaded by your bootstrap program (please consult the 97 initial ramdisk loaded by your bootstrap program (please consult the
98 instructions for your bootstrap program to find out how to load an 98 instructions for your bootstrap program to find out how to load an
99 initial ramdisk). As of kernel version 2.0.18 you must specify 99 initial ramdisk). As of kernel version 2.0.18 you must specify
100 /dev/ram as the root device if you want to boot from an initial 100 /dev/ram as the root device if you want to boot from an initial
101 ramdisk. For the floppy devices, /dev/fd, the number stands for the 101 ramdisk. For the floppy devices, /dev/fd, the number stands for the
102 floppy drive number (there are no partitions on floppy disks). I.e., 102 floppy drive number (there are no partitions on floppy disks). I.e.,
103 /dev/fd0 stands for the first drive, /dev/fd1 for the second, and so 103 /dev/fd0 stands for the first drive, /dev/fd1 for the second, and so
104 on. Since the number is just added, you can also force the disk format 104 on. Since the number is just added, you can also force the disk format
105 by adding a number greater than 3. If you look into your /dev 105 by adding a number greater than 3. If you look into your /dev
106 directory, you can see that /dev/fd0D720 has major 2 and minor 16. You 106 directory, you can see that /dev/fd0D720 has major 2 and minor 16. You
107 can specify this device for the root FS by writing "root=/dev/fd16" on 107 can specify this device for the root FS by writing "root=/dev/fd16" on
108 the kernel command line. 108 the kernel command line.
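
The name-plus-number translation described above can be sketched as follows. This is only an illustration, not the kernel's code: the table holds just a few of the entries listed above, and the function name and the handling of malformed suffixes are assumptions.

```python
# Illustrative subset of the name table above; each value is (major << 8) | minor.
ROOT_DEV_TABLE = {
    "/dev/sde": 0x0840,
    "/dev/fd": 0x0200,
    "/dev/ada": 0x1C00,
    "/dev/adb": 0x1C10,
}


def root_name_to_devnum(arg):
    """Translate e.g. '/dev/sde17' to a device number; None if the name is unknown."""
    # Try longer names first so a short name never shadows a longer one.
    for name in sorted(ROOT_DEV_TABLE, key=len, reverse=True):
        if arg.startswith(name):
            suffix = arg[len(name):]
            if not suffix.isdigit():
                return None  # assumption: a decimal number must follow the name
            # The number is simply added; no range check, as the text notes.
            return ROOT_DEV_TABLE[name] + int(suffix)
    return None
```

With this sketch, "/dev/fd16" yields 0x0210 (major 2, minor 16), and "/dev/sde17" yields 0x0851, the number of /dev/sdf1.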
109 109
110 [Strange and maybe uninteresting stuff ON] 110 [Strange and maybe uninteresting stuff ON]
111 111
112 This unusual translation of device names has some strange 112 This unusual translation of device names has some strange
113 consequences: If, for example, you have a symbolic link from /dev/fd 113 consequences: If, for example, you have a symbolic link from /dev/fd
114 to /dev/fd0D720 as an abbreviation for floppy drive #0 in DD format, 114 to /dev/fd0D720 as an abbreviation for floppy drive #0 in DD format,
115 you cannot use this name for specifying the root device, because the 115 you cannot use this name for specifying the root device, because the
116 kernel cannot see this symlink before mounting the root FS and it 116 kernel cannot see this symlink before mounting the root FS and it
117 isn't in the table above. If you use it, the root device will not be 117 isn't in the table above. If you use it, the root device will not be
118 set at all, without an error message. Another example: You cannot use a 118 set at all, without an error message. Another example: You cannot use a
119 partition on e.g. the sixth SCSI disk as the root filesystem, if you 119 partition on e.g. the sixth SCSI disk as the root filesystem, if you
120 want to specify it by name. This is because only the devices up to 120 want to specify it by name. This is because only the devices up to
121 /dev/sde are in the table above, but not /dev/sdf. You can still 121 /dev/sde are in the table above, but not /dev/sdf. You can still
122 use the sixth SCSI disk for the root FS, but you have to specify the 122 use the sixth SCSI disk for the root FS, but you have to specify the
123 device by number... (see below). Or, even more strange, you can use the 123 device by number... (see below). Or, even more strange, you can use the
124 fact that there is no range checking of the partition number, and your 124 fact that there is no range checking of the partition number, and your
125 knowledge that each disk uses 16 minors, and write "root=/dev/sde17" 125 knowledge that each disk uses 16 minors, and write "root=/dev/sde17"
126 (for /dev/sdf1). 126 (for /dev/sdf1).
127 127
128 [Strange and maybe uninteresting stuff OFF] 128 [Strange and maybe uninteresting stuff OFF]
129 129
130 If the device containing your root partition isn't in the table 130 If the device containing your root partition isn't in the table
131 above, you can also specify it by major and minor numbers. These are 131 above, you can also specify it by major and minor numbers. These are
132 written in hex, with no prefix and no separator between. E.g., if you 132 written in hex, with no prefix and no separator between. E.g., if you
133 have a CD with contents appropriate as a root filesystem in the first 133 have a CD with contents appropriate as a root filesystem in the first
134 SCSI CD-ROM drive, you boot from it by "root=0b00". Here, hex "0b" = 134 SCSI CD-ROM drive, you boot from it by "root=0b00". Here, hex "0b" =
135 decimal 11 is the major of SCSI CD-ROMs, and the minor 0 stands for 135 decimal 11 is the major of SCSI CD-ROMs, and the minor 0 stands for
136 the first of these. You can find out all valid major numbers by 136 the first of these. You can find out all valid major numbers by
137 looking into include/linux/major.h. 137 looking into include/linux/major.h.
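
As a rough illustration of the numeric form described above, the following sketch splits a four-digit hex value into major and minor. The function name is hypothetical, and the split into an 8-bit major and 8-bit minor is inferred from the examples in this text.

```python
def parse_hex_root(arg):
    """Split a value such as '0b00' into (major, minor)."""
    value = int(arg, 16)  # hex digits with no prefix and no separator
    return value >> 8, value & 0xFF
```

For "root=0b00" this gives major 11 and minor 0, the first SCSI CD-ROM.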
138 138
139 139
140 2.2) ro, rw 140 2.2) ro, rw
141 ----------- 141 -----------
142 142
143 Syntax: ro 143 Syntax: ro
144 or: rw 144 or: rw
145 145
146 These two options tell the kernel whether it should mount the root 146 These two options tell the kernel whether it should mount the root
147 filesystem read-only or read-write. The default is read-only, except 147 filesystem read-only or read-write. The default is read-only, except
148 for ramdisks, which default to read-write. 148 for ramdisks, which default to read-write.
149 149
150 150
151 2.3) debug 151 2.3) debug
152 ---------- 152 ----------
153 153
154 Syntax: debug 154 Syntax: debug
155 155
156 This raises the kernel log level to 10 (the default is 7). This is the 156 This raises the kernel log level to 10 (the default is 7). This is the
157 same level as set by the "dmesg" command, just that the maximum level 157 same level as set by the "dmesg" command, just that the maximum level
158 selectable by dmesg is 8. 158 selectable by dmesg is 8.
159 159
160 160
161 2.4) debug= 161 2.4) debug=
162 ----------- 162 -----------
163 163
164 Syntax: debug=<device> 164 Syntax: debug=<device>
165 165
166 This option causes certain kernel messages to be printed to the selected 166 This option causes certain kernel messages to be printed to the selected
167 debugging device. This can aid debugging the kernel, since the 167 debugging device. This can aid debugging the kernel, since the
168 messages can be captured and analyzed on some other machine. Which 168 messages can be captured and analyzed on some other machine. Which
169 devices are possible depends on the machine type. There are no checks 169 devices are possible depends on the machine type. There are no checks
170 for the validity of the device name. If the device isn't implemented, 170 for the validity of the device name. If the device isn't implemented,
171 nothing happens. 171 nothing happens.
172 172
173 Messages logged this way are in general stack dumps after kernel 173 Messages logged this way are in general stack dumps after kernel
174 memory faults or bad kernel traps, and kernel panics. To be exact: all 174 memory faults or bad kernel traps, and kernel panics. To be exact: all
175 messages of level 0 (panic messages) and all messages printed while 175 messages of level 0 (panic messages) and all messages printed while
176 the log level is 8 or more (their level doesn't matter). Before stack 176 the log level is 8 or more (their level doesn't matter). Before stack
177 dumps, the kernel sets the log level to 10 automatically. A level of 177 dumps, the kernel sets the log level to 10 automatically. A level of
178 at least 8 can also be set by the "debug" command line option (see 178 at least 8 can also be set by the "debug" command line option (see
179 2.3) and at run time with "dmesg -n 8". 179 2.3) and at run time with "dmesg -n 8".
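
The rule in the paragraph above can be written as a small predicate. This is a sketch of the stated rule only; the function and parameter names are made up for illustration.

```python
def sent_to_debug_device(msg_level, console_loglevel):
    """True if a message would also go to the debug= device."""
    # Level 0 (panic) messages always go out; once the log level is 8 or
    # more, every message goes out regardless of its own level.
    return msg_level == 0 or console_loglevel >= 8
```

At the default log level of 7 only panic messages are forwarded; after "debug" or "dmesg -n 8" everything is.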
180 180
181 Devices possible for Amiga: 181 Devices possible for Amiga:
182 182
183 - "ser": built-in serial port; parameters: 9600bps, 8N1 183 - "ser": built-in serial port; parameters: 9600bps, 8N1
184 - "mem": Save the messages to a reserved area in chip mem. After 184 - "mem": Save the messages to a reserved area in chip mem. After
185 rebooting, they can be read under AmigaOS with the tool 185 rebooting, they can be read under AmigaOS with the tool
186 'dmesg'. 186 'dmesg'.
187 187
188 Devices possible for Atari: 188 Devices possible for Atari:
189 189
190 - "ser1": ST-MFP serial port ("Modem1"); parameters: 9600bps, 8N1 190 - "ser1": ST-MFP serial port ("Modem1"); parameters: 9600bps, 8N1
191 - "ser2": SCC channel B serial port ("Modem2"); parameters: 9600bps, 8N1 191 - "ser2": SCC channel B serial port ("Modem2"); parameters: 9600bps, 8N1
192 - "ser" : default serial port 192 - "ser" : default serial port
193 This is "ser2" for a Falcon, and "ser1" for any other machine 193 This is "ser2" for a Falcon, and "ser1" for any other machine
194 - "midi": The MIDI port; parameters: 31250bps, 8N1 194 - "midi": The MIDI port; parameters: 31250bps, 8N1
195 - "par" : parallel port 195 - "par" : parallel port
196 The printing routine for this implements a timeout for the 196 The printing routine for this implements a timeout for the
197 case there's no printer connected (else the kernel would 197 case there's no printer connected (else the kernel would
198 lock up). The timeout is not exact, but usually a few 198 lock up). The timeout is not exact, but usually a few
199 seconds. 199 seconds.
200 200
201 201
202 2.6) ramdisk= 202 2.6) ramdisk=
203 ------------- 203 -------------
204 204
205 Syntax: ramdisk=<size> 205 Syntax: ramdisk=<size>
206 206
207 This option instructs the kernel to set up a ramdisk of the given 207 This option instructs the kernel to set up a ramdisk of the given
208 size in KBytes. Do not use this option if the ramdisk contents are 208 size in KBytes. Do not use this option if the ramdisk contents are
209 passed by bootstrap! In this case, the size is selected automatically 209 passed by bootstrap! In this case, the size is selected automatically
210 and should not be overwritten. 210 and should not be overwritten.
211 211
212 The only application is for root filesystems on floppy disks, that 212 The only application is for root filesystems on floppy disks, that
213 should be loaded into memory. To do that, select the corresponding 213 should be loaded into memory. To do that, select the corresponding
214 size of the disk as ramdisk size, and set the root device to the disk 214 size of the disk as ramdisk size, and set the root device to the disk
215 drive (with "root="). 215 drive (with "root=").
216 216
217 217
218 2.7) swap= 218 2.7) swap=
219 2.8) buff= 219 2.8) buff=
220 ----------- 220 -----------
221 221
222 I can't find any sign of these options in 2.2.6. 222 I can't find any sign of these options in 2.2.6.
223 223
224 224
225 3) General Device Options (Amiga and Atari) 225 3) General Device Options (Amiga and Atari)
226 =========================================== 226 ===========================================
227 227
228 3.1) ether= 228 3.1) ether=
229 ----------- 229 -----------
230 230
231 Syntax: ether=[<irq>[,<base_addr>[,<mem_start>[,<mem_end>]]]],<dev-name> 231 Syntax: ether=[<irq>[,<base_addr>[,<mem_start>[,<mem_end>]]]],<dev-name>
232 232
233 <dev-name> is the name of a net driver, as specified in 233 <dev-name> is the name of a net driver, as specified in
234 drivers/net/Space.c in the Linux source. Most prominent are eth0, ... 234 drivers/net/Space.c in the Linux source. Most prominent are eth0, ...
235 eth3, sl0, ... sl3, ppp0, ..., ppp3, dummy, and lo. 235 eth3, sl0, ... sl3, ppp0, ..., ppp3, dummy, and lo.
236 236
237 The non-ethernet drivers (sl, ppp, dummy, lo) obviously ignore the 237 The non-ethernet drivers (sl, ppp, dummy, lo) obviously ignore the
238 settings of this option. Also, the existing ethernet drivers for 238 settings of this option. Also, the existing ethernet drivers for
239 Linux/m68k (ariadne, a2065, hydra) don't use them because Zorro boards 239 Linux/m68k (ariadne, a2065, hydra) don't use them because Zorro boards
240 are really Plug-'n-Play, so the "ether=" option is useless altogether 240 are really Plug-'n-Play, so the "ether=" option is useless altogether
241 for Linux/m68k. 241 for Linux/m68k.
242 242
243 243
244 3.2) hd= 244 3.2) hd=
245 -------- 245 --------
246 246
247 Syntax: hd=<cylinders>,<heads>,<sectors> 247 Syntax: hd=<cylinders>,<heads>,<sectors>
248 248
249 This option sets the disk geometry of an IDE disk. The first hd= 249 This option sets the disk geometry of an IDE disk. The first hd=
250 option is for the first IDE disk, the second for the second one. 250 option is for the first IDE disk, the second for the second one.
251 (I.e., you can give this option twice.) In most cases, you won't have 251 (I.e., you can give this option twice.) In most cases, you won't have
252 to use this option, since the kernel can obtain the geometry data 252 to use this option, since the kernel can obtain the geometry data
253 itself. It exists just for the case that this fails for one of your 253 itself. It exists just for the case that this fails for one of your
254 disks. 254 disks.
255 255
256 256
257 3.3) max_scsi_luns= 257 3.3) max_scsi_luns=
258 ------------------- 258 -------------------
259 259
260 Syntax: max_scsi_luns=<n> 260 Syntax: max_scsi_luns=<n>
261 261
262 Sets the maximum number of LUNs (logical units) of SCSI devices to 262 Sets the maximum number of LUNs (logical units) of SCSI devices to
263 be scanned. Valid values for <n> are between 1 and 8. Default is 8 if 263 be scanned. Valid values for <n> are between 1 and 8. Default is 8 if
264 "Probe all LUNs on each SCSI device" was selected during the kernel 264 "Probe all LUNs on each SCSI device" was selected during the kernel
265 configuration, else 1. 265 configuration, else 1.
266 266
267 267
268 3.4) st= 268 3.4) st=
269 -------- 269 --------
270 270
271 Syntax: st=<buffer_size>,[<write_thres>,[<max_buffers>]] 271 Syntax: st=<buffer_size>,[<write_thres>,[<max_buffers>]]
272 272
273 Sets several parameters of the SCSI tape driver. <buffer_size> is 273 Sets several parameters of the SCSI tape driver. <buffer_size> is
274 the number of 512-byte buffers reserved for tape operations for each 274 the number of 512-byte buffers reserved for tape operations for each
275 device. <write_thres> sets the number of blocks which must be filled 275 device. <write_thres> sets the number of blocks which must be filled
276 to start an actual write operation to the tape. Maximum value is the 276 to start an actual write operation to the tape. Maximum value is the
277 total number of buffers. <max_buffers> limits the total number of 277 total number of buffers. <max_buffers> limits the total number of
278 buffers allocated for all tape devices. 278 buffers allocated for all tape devices.
279 279
280 280
281 3.5) dmasound= 281 3.5) dmasound=
282 -------------- 282 --------------
283 283
284 Syntax: dmasound=[<buffers>,<buffer-size>[,<catch-radius>]] 284 Syntax: dmasound=[<buffers>,<buffer-size>[,<catch-radius>]]
285 285
286 This option controls some configurations of the Linux/m68k DMA sound 286 This option controls some configurations of the Linux/m68k DMA sound
287 driver (Amiga and Atari): <buffers> is the number of buffers you want 287 driver (Amiga and Atari): <buffers> is the number of buffers you want
288 to use (minimum 4, default 4), <buffer-size> is the size of each 288 to use (minimum 4, default 4), <buffer-size> is the size of each
289 buffer in kilobytes (minimum 4, default 32) and <catch-radius> says 289 buffer in kilobytes (minimum 4, default 32) and <catch-radius> says
290 what percentage of error will be tolerated when setting a frequency 290 what percentage of error will be tolerated when setting a frequency
291 (maximum 10, default 0). For example with 3% you can play 8000Hz 291 (maximum 10, default 0). For example with 3% you can play 8000Hz
292 AU-Files on the Falcon with its hardware frequency of 8195Hz and thus 292 AU-Files on the Falcon with its hardware frequency of 8195Hz and thus
293 don't need to expand the sound. 293 don't need to expand the sound.
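
The Falcon example above can be checked with a little arithmetic. The sketch below assumes the error is measured relative to the requested frequency; the driver may compute it differently, and the function name is hypothetical.

```python
def within_catch_radius(requested_hz, hardware_hz, catch_radius_percent):
    """True if the hardware frequency is close enough to the requested one."""
    # Assumption: error is taken relative to the requested frequency.
    error_percent = abs(hardware_hz - requested_hz) * 100.0 / requested_hz
    return error_percent <= catch_radius_percent
```

8195 Hz is about 2.4% above 8000 Hz, so a catch radius of 3 accepts it, while the default of 0 does not.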
294 294
295 295
296 296
297 4) Options for Atari Only 297 4) Options for Atari Only
298 ========================= 298 =========================
299 299
300 4.1) video= 300 4.1) video=
301 ----------- 301 -----------
302 302
303 Syntax: video=<fbname>:<sub-options...> 303 Syntax: video=<fbname>:<sub-options...>
304 304
305 The <fbname> parameter specifies the name of the frame buffer, 305 The <fbname> parameter specifies the name of the frame buffer,
306 e.g. most Atari users will want to specify `atafb' here. The 306 e.g. most Atari users will want to specify `atafb' here. The
307 <sub-options> is a comma-separated list of the sub-options listed 307 <sub-options> is a comma-separated list of the sub-options listed
308 below. 308 below.
309 309
310 NB: Please notice that this option was renamed from `atavideo' to 310 NB: Please notice that this option was renamed from `atavideo' to
311 `video' during the development of the 1.3.x kernels, thus you 311 `video' during the development of the 1.3.x kernels, thus you
312 might need to update your boot-scripts if upgrading to 2.x from 312 might need to update your boot-scripts if upgrading to 2.x from
313 a 1.2.x kernel. 313 a 1.2.x kernel.
314 314
315 NBB: The behavior of video= was changed in 2.1.57 so the recommended 315 NBB: The behavior of video= was changed in 2.1.57 so the recommended
316 option is to specify the name of the frame buffer. 316 option is to specify the name of the frame buffer.
317 317
318 4.1.1) Video Mode 318 4.1.1) Video Mode
319 ----------------- 319 -----------------
320 320
321 This sub-option may be any of the predefined video modes, as listed 321 This sub-option may be any of the predefined video modes, as listed
322 in atari/atafb.c in the Linux/m68k source tree. The kernel will 322 in atari/atafb.c in the Linux/m68k source tree. The kernel will
323 activate the given video mode at boot time and make it the default 323 activate the given video mode at boot time and make it the default
324 mode, if the hardware allows. Currently defined names are: 324 mode, if the hardware allows. Currently defined names are:
325 325
326 - stlow : 320x200x4 326 - stlow : 320x200x4
327 - stmid, default5 : 640x200x2 327 - stmid, default5 : 640x200x2
328 - sthigh, default4: 640x400x1 328 - sthigh, default4: 640x400x1
329 - ttlow : 320x480x8, TT only 329 - ttlow : 320x480x8, TT only
330 - ttmid, default1 : 640x480x4, TT only 330 - ttmid, default1 : 640x480x4, TT only
331 - tthigh, default2: 1280x960x1, TT only 331 - tthigh, default2: 1280x960x1, TT only
332 - vga2 : 640x480x1, Falcon only 332 - vga2 : 640x480x1, Falcon only
333 - vga4 : 640x480x2, Falcon only 333 - vga4 : 640x480x2, Falcon only
334 - vga16, default3 : 640x480x4, Falcon only 334 - vga16, default3 : 640x480x4, Falcon only
335 - vga256 : 640x480x8, Falcon only 335 - vga256 : 640x480x8, Falcon only
336 - falh2 : 896x608x1, Falcon only 336 - falh2 : 896x608x1, Falcon only
337 - falh16 : 896x608x4, Falcon only 337 - falh16 : 896x608x4, Falcon only
338 338
339 If no video mode is given on the command line, the kernel tries the 339 If no video mode is given on the command line, the kernel tries the
340 modes named "default<n>" in turn, until one is possible with the 340 modes named "default<n>" in turn, until one is possible with the
341 hardware in use. 341 hardware in use.
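
The fallback through the "default<n>" names can be sketched like this. It models only the "TT only" and "Falcon only" notes from the list above; the real driver also has to consider the attached monitor, which this sketch ignores. The names DEFAULT_MODES and pick_default_mode are made up for illustration.

```python
# default<n> -> (mode name, machine it is restricted to, or None), from the list above
DEFAULT_MODES = {
    1: ("ttmid", "TT"),
    2: ("tthigh", "TT"),
    3: ("vga16", "Falcon"),
    4: ("sthigh", None),
    5: ("stmid", None),
}


def pick_default_mode(machine):
    """Return the first default<n> mode that the given machine type allows."""
    for n in sorted(DEFAULT_MODES):
        name, only_on = DEFAULT_MODES[n]
        if only_on is None or only_on == machine:
            return name
    return None
```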
342 342
343 A video mode setting doesn't make sense, if the external driver is 343 A video mode setting doesn't make sense, if the external driver is
344 activated by an "external:" sub-option. 344 activated by an "external:" sub-option.
345 345
346 4.1.2) inverse 346 4.1.2) inverse
347 -------------- 347 --------------
348 348
349 Invert the display. This affects both text (consoles) and graphics 349 Invert the display. This affects both text (consoles) and graphics
350 (X) display. Usually, the background is chosen to be black. With this 350 (X) display. Usually, the background is chosen to be black. With this
351 option, you can make the background white. 351 option, you can make the background white.
352 352
353 4.1.3) font 353 4.1.3) font
354 ----------- 354 -----------
355 355
356 Syntax: font:<fontname> 356 Syntax: font:<fontname>
357 357
358 Specify the font to use in text modes. Currently you can choose only 358 Specify the font to use in text modes. Currently you can choose only
359 between `VGA8x8', `VGA8x16' and `PEARL8x8'. `VGA8x8' is default, if the 359 between `VGA8x8', `VGA8x16' and `PEARL8x8'. `VGA8x8' is default, if the
360 vertical size of the display is less than 400 pixel rows. Otherwise, the 360 vertical size of the display is less than 400 pixel rows. Otherwise, the
361 `VGA8x16' font is the default. 361 `VGA8x16' font is the default.
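
The default font rule above amounts to a single comparison, sketched here with a hypothetical function name:

```python
def default_font(yres):
    """Default console font for a display with yres pixel rows."""
    return "VGA8x8" if yres < 400 else "VGA8x16"
```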
362 362
363 4.1.4) hwscroll_ 363 4.1.4) hwscroll_
364 ---------------- 364 ----------------
365 365
366 Syntax: hwscroll_<n> 366 Syntax: hwscroll_<n>
367 367
368 The number of additional lines of video memory to reserve for 368 The number of additional lines of video memory to reserve for
369 speeding up the scrolling ("hardware scrolling"). Hardware scrolling 369 speeding up the scrolling ("hardware scrolling"). Hardware scrolling
370 is possible only if the kernel can set the video base address in steps 370 is possible only if the kernel can set the video base address in steps
371 fine enough. This is true for STE, MegaSTE, TT, and Falcon. It is not 371 fine enough. This is true for STE, MegaSTE, TT, and Falcon. It is not
372 possible with plain STs and graphics cards (The former because the 372 possible with plain STs and graphics cards (The former because the
373 base address must be on a 256 byte boundary there, the latter because 373 base address must be on a 256 byte boundary there, the latter because
374 the kernel doesn't know how to set the base address at all.) 374 the kernel doesn't know how to set the base address at all.)
375 375
376 By default, <n> is set to the number of visible text lines on the 376 By default, <n> is set to the number of visible text lines on the
377 display. Thus, the amount of video memory is doubled, compared to no 377 display. Thus, the amount of video memory is doubled, compared to no
378 hardware scrolling. You can turn off the hardware scrolling altogether 378 hardware scrolling. You can turn off the hardware scrolling altogether
379 by setting <n> to 0. 379 by setting <n> to 0.
380 380
381 4.1.5) internal: 381 4.1.5) internal:
382 ---------------- 382 ----------------
383 383
384 Syntax: internal:<xres>;<yres>[;<xres_max>;<yres_max>;<offset>] 384 Syntax: internal:<xres>;<yres>[;<xres_max>;<yres_max>;<offset>]
385 385
386 This option specifies the capabilities of some extended internal video 386 This option specifies the capabilities of some extended internal video
387 hardware, like e.g. OverScan. <xres> and <yres> give the (extended) 387 hardware, like e.g. OverScan. <xres> and <yres> give the (extended)
388 dimensions of the screen. 388 dimensions of the screen.
389 389
390 If your OverScan needs a black border, you have to write the last 390 If your OverScan needs a black border, you have to write the last
391 three arguments of the "internal:". <xres_max> is the maximum line 391 three arguments of the "internal:". <xres_max> is the maximum line
392 length the hardware allows, <yres_max> the maximum number of lines. 392 length the hardware allows, <yres_max> the maximum number of lines.
393 <offset> is the offset of the visible part of the screen memory to its 393 <offset> is the offset of the visible part of the screen memory to its
394 physical start, in bytes. 394 physical start, in bytes.
395 395
396 Often, extended internal video hardware has to be activated somehow. 396 Often, extended internal video hardware has to be activated somehow.
397 For this, see the "sw_*" options below. 397 For this, see the "sw_*" options below.
398 398
399 4.1.6) external: 399 4.1.6) external:
400 ---------------- 400 ----------------
401 401
402 Syntax: 402 Syntax:
403 external:<xres>;<yres>;<depth>;<org>;<scrmem>[;<scrlen>[;<vgabase>\ 403 external:<xres>;<yres>;<depth>;<org>;<scrmem>[;<scrlen>[;<vgabase>\
404 [;<colw>[;<coltype>[;<xres_virtual>]]]]] 404 [;<colw>[;<coltype>[;<xres_virtual>]]]]]
405 405
406 [I had to break this line...] 406 [I had to break this line...]
407 407
408 This is probably the most complicated parameter... It specifies that 408 This is probably the most complicated parameter... It specifies that
409 you have some external video hardware (a graphics board), and how to 409 you have some external video hardware (a graphics board), and how to
410 use it under Linux/m68k. The kernel cannot know more about the hardware 410 use it under Linux/m68k. The kernel cannot know more about the hardware
411 than you tell it here! The kernel also is unable to set or change any 411 than you tell it here! The kernel also is unable to set or change any
412 video modes, since it doesn't know about any board internal. So, you 412 video modes, since it doesn't know about any board internal. So, you
413 have to switch to that video mode before you start Linux, and cannot 413 have to switch to that video mode before you start Linux, and cannot
414 switch to another mode once Linux has started. 414 switch to another mode once Linux has started.
415 415
416 The first 3 parameters of this sub-option should be obvious: <xres>, 416 The first 3 parameters of this sub-option should be obvious: <xres>,
417 <yres> and <depth> give the dimensions of the screen and the number of 417 <yres> and <depth> give the dimensions of the screen and the number of
418 planes (depth). The depth is is the logarithm to base 2 of the number 418 planes (depth). The depth is the logarithm to base 2 of the number
419 of colors possible. (Or, the other way round: The number of colors is 419 of colors possible. (Or, the other way round: The number of colors is
420 2^depth). 420 2^depth).
421 421
422 You have to tell the kernel furthermore how the video memory is 422 You have to tell the kernel furthermore how the video memory is
423 organized. This is done by a letter as <org> parameter: 423 organized. This is done by a letter as <org> parameter:
424 424
425 'n': "normal planes", i.e. one whole plane after another 425 'n': "normal planes", i.e. one whole plane after another
426 'i': "interleaved planes", i.e. 16 bit of the first plane, then 16 bit 426 'i': "interleaved planes", i.e. 16 bit of the first plane, then 16 bit
427 of the next, and so on... This mode is used only with the 427 of the next, and so on... This mode is used only with the
428 built-in Atari video modes, I think there is no card that 428 built-in Atari video modes, I think there is no card that
429 supports this mode. 429 supports this mode.
430 'p': "packed pixels", i.e. <depth> consecutive bits stand for all 430 'p': "packed pixels", i.e. <depth> consecutive bits stand for all
431 planes of one pixel; this is the most common mode for 8 planes 431 planes of one pixel; this is the most common mode for 8 planes
432 (256 colors) on graphic cards 432 (256 colors) on graphic cards
433 't': "true color" (more or less packed pixels, but without a color 433 't': "true color" (more or less packed pixels, but without a color
434 lookup table); usually depth is 24 434 lookup table); usually depth is 24
435 435
436 For monochrome modes (i.e., <depth> is 1), the <org> letter has a 436 For monochrome modes (i.e., <depth> is 1), the <org> letter has a
437 different meaning: 437 different meaning:
438 438
439 'n': normal colors, i.e. 0=white, 1=black 439 'n': normal colors, i.e. 0=white, 1=black
440 'i': inverted colors, i.e. 0=black, 1=white 440 'i': inverted colors, i.e. 0=black, 1=white
441 441
442 The next important information about the video hardware is the base 442 The next important information about the video hardware is the base
443 address of the video memory. That is given in the <scrmem> parameter, 443 address of the video memory. That is given in the <scrmem> parameter,
444 as a hexadecimal number with a "0x" prefix. You have to find out this 444 as a hexadecimal number with a "0x" prefix. You have to find out this
445 address in the documentation of your hardware. 445 address in the documentation of your hardware.
446 446
447 The next parameter, <scrlen>, tells the kernel about the size of the 447 The next parameter, <scrlen>, tells the kernel about the size of the
448 video memory. If it's missing, the size is calculated from <xres>, 448 video memory. If it's missing, the size is calculated from <xres>,
449 <yres>, and <depth>. For now, it is not useful to write a value here. 449 <yres>, and <depth>. For now, it is not useful to write a value here.
450 It would be used only for hardware scrolling (which isn't possible 450 It would be used only for hardware scrolling (which isn't possible
451 with the external driver, because the kernel cannot set the video base 451 with the external driver, because the kernel cannot set the video base
452 address), or for virtual resolutions under X (which the X server 452 address), or for virtual resolutions under X (which the X server
453 doesn't support yet). So, it's currently best to leave this field 453 doesn't support yet). So, it's currently best to leave this field
454 empty, either by ending the "external:" after the video address or by 454 empty, either by ending the "external:" after the video address or by
455 writing two consecutive semicolons, if you want to give a <vgabase> 455 writing two consecutive semicolons, if you want to give a <vgabase>
456 (it is allowed to leave this parameter empty). 456 (it is allowed to leave this parameter empty).
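
To make the relation between <depth>, the number of colors and the video memory size concrete, here is a small sketch. The text does not give the exact formula the kernel uses when <scrlen> is missing; xres * yres * depth / 8 bytes is an assumption, and both function names are hypothetical.

```python
def colors_for_depth(depth):
    """Number of colors possible with the given number of planes."""
    return 2 ** depth


def assumed_scrlen(xres, yres, depth):
    """Video memory size in bytes, assuming depth bits per pixel and no padding."""
    return xres * yres * depth // 8
```

For a 640x480 mode with 8 planes this gives 256 colors and 307200 bytes.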
457 457
458 The <vgabase> parameter is optional. If it is not given, the kernel 458 The <vgabase> parameter is optional. If it is not given, the kernel
459 cannot read or write any color registers of the video hardware, and 459 cannot read or write any color registers of the video hardware, and
460 thus you have to set appropriate colors before you start Linux. But if 460 thus you have to set appropriate colors before you start Linux. But if
461 your card is somehow VGA compatible, you can tell the kernel the base 461 your card is somehow VGA compatible, you can tell the kernel the base
462 address of the VGA register set, so it can change the color lookup 462 address of the VGA register set, so it can change the color lookup
463 table. You have to look up this address in your board's documentation. 463 table. You have to look up this address in your board's documentation.
464 To avoid misunderstandings: <vgabase> is the _base_ address, i.e. a 4k 464 To avoid misunderstandings: <vgabase> is the _base_ address, i.e. a 4k
465 aligned address. For reading/writing the color registers, the kernel 465 aligned address. For reading/writing the color registers, the kernel
466 uses the addresses vgabase+0x3c7...vgabase+0x3c9. The <vgabase> 466 uses the addresses vgabase+0x3c7...vgabase+0x3c9. The <vgabase>
467 parameter is written in hexadecimal with a "0x" prefix, just as 467 parameter is written in hexadecimal with a "0x" prefix, just as
468 <scrmem>. 468 <scrmem>.
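
The alignment requirement and the register range for <vgabase> can be illustrated as follows. The function name is hypothetical, and any address passed to it in an example is a placeholder, not a value for a real board.

```python
def vga_color_registers(vgabase):
    """Return the color register addresses the text mentions for a given vgabase."""
    if vgabase & 0xFFF:
        raise ValueError("vgabase must be a 4k aligned base address")
    # vgabase+0x3c7 ... vgabase+0x3c9, as stated above
    return [vgabase + offset for offset in (0x3C7, 0x3C8, 0x3C9)]
```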
469 469
470 <colw> is meaningful only if <vgabase> is specified. It tells the 470 <colw> is meaningful only if <vgabase> is specified. It tells the
471 kernel how wide each of the color registers is, i.e. the number of bits 471 kernel how wide each of the color registers is, i.e. the number of bits
472 per single color (red/green/blue). Default is 6, another quite usual 472 per single color (red/green/blue). Default is 6, another quite usual
473 value is 8. 473 value is 8.
474 474
475 Also <coltype> is used together with <vgabase>. It tells the kernel 475 Also <coltype> is used together with <vgabase>. It tells the kernel
476 about the color register model of your gfx board. Currently, the types 476 about the color register model of your gfx board. Currently, the types
477 "vga" (which is also the default) and "mv300" (SANG MV300) are 477 "vga" (which is also the default) and "mv300" (SANG MV300) are
478 implemented. 478 implemented.
479 479
480 Parameter <xres_virtual> is required for ProMST or ET4000 cards where 480 Parameter <xres_virtual> is required for ProMST or ET4000 cards where
481 the physical linelength differs from the visible length. With ProMST, 481 the physical linelength differs from the visible length. With ProMST,
482 xres_virtual must be set to 2048. For ET4000, xres_virtual depends on the 482 xres_virtual must be set to 2048. For ET4000, xres_virtual depends on the
483 initialisation of the video-card. 483 initialisation of the video-card.
484 If you're missing a corresponding yres_virtual: the external part is legacy, 484 If you're missing a corresponding yres_virtual: the external part is legacy,
485 therefore we don't support hardware-dependent functions like hardware-scroll, 485 therefore we don't support hardware-dependent functions like hardware-scroll,
486 panning or blanking. 486 panning or blanking.
487 487
488 4.1.7) eclock: 488 4.1.7) eclock:
489 -------------- 489 --------------
490 490
491 The external pixel clock attached to the Falcon VIDEL shifter. This 491 The external pixel clock attached to the Falcon VIDEL shifter. This
492 currently works only with the ScreenWonder! 492 currently works only with the ScreenWonder!
493 493
494 4.1.8) monitorcap: 494 4.1.8) monitorcap:
495 ------------------- 495 -------------------
496 496
497 Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax> 497 Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax>
498 498
499 This describes the capabilities of a multisync monitor. Don't use it 499 This describes the capabilities of a multisync monitor. Don't use it
500 with a fixed-frequency monitor! For now, only the Falcon frame buffer 500 with a fixed-frequency monitor! For now, only the Falcon frame buffer
501 uses the settings of "monitorcap:". 501 uses the settings of "monitorcap:".
502 502
503 <vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies 503 <vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies
504 your monitor can work with, in Hz. <hmin> and <hmax> are the same for 504 your monitor can work with, in Hz. <hmin> and <hmax> are the same for
505 the horizontal frequency, in kHz. 505 the horizontal frequency, in kHz.
506 506
507 The defaults are 58;62;31;32 (VGA compatible). 507 The defaults are 58;62;31;32 (VGA compatible).
508 508
509 The defaults for TV/SC1224/SC1435 cover both PAL and NTSC standards. 509 The defaults for TV/SC1224/SC1435 cover both PAL and NTSC standards.
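The "monitorcap:" syntax above can be illustrated with a small parser. This is only a sketch of the format as documented here; the names MonitorCap and parse_monitorcap are made up for the example, and the range check (min <= max) is an assumption rather than documented kernel behavior.

```python
from typing import NamedTuple


class MonitorCap(NamedTuple):
    vmin: int  # minimum vertical frequency, Hz
    vmax: int  # maximum vertical frequency, Hz
    hmin: int  # minimum horizontal frequency, kHz
    hmax: int  # maximum horizontal frequency, kHz


def parse_monitorcap(option):
    """Parse a string of the form monitorcap:<vmin>;<vmax>;<hmin>;<hmax>."""
    prefix = "monitorcap:"
    if not option.startswith(prefix):
        raise ValueError("not a monitorcap sub-option")
    fields = option[len(prefix):].split(";")
    if len(fields) != 4:
        raise ValueError("expected four ';'-separated values")
    vmin, vmax, hmin, hmax = (int(field) for field in fields)
    # Assumed sanity check, not taken from the kernel sources.
    if vmin > vmax or hmin > hmax:
        raise ValueError("minimum exceeds maximum")
    return MonitorCap(vmin, vmax, hmin, hmax)


# The VGA compatible defaults quoted in the text.
print(parse_monitorcap("monitorcap:58;62;31;32"))
```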
510 510
511 4.1.9) keep 511 4.1.9) keep
512 ------------ 512 ------------
513 513
514 If this option is given, the framebuffer device doesn't do any video 514 If this option is given, the framebuffer device doesn't do any video
515 mode calculations and settings on its own. The only Atari fb device 515 mode calculations and settings on its own. The only Atari fb device
516 that does this currently is the Falcon. 516 that does this currently is the Falcon.
517 517
518 What you achieve with this: Settings for unknown video extensions 518 What you achieve with this: Settings for unknown video extensions
519 aren't overridden by the driver, so you can still use the mode found 519 aren't overridden by the driver, so you can still use the mode found
520 when booting, when the driver doesn't know how to set this mode itself. 520 when booting, when the driver doesn't know how to set this mode itself.
521 But this also means that you can't switch video modes anymore... 521 But this also means that you can't switch video modes anymore...
522 522
523 An example where you may want to use "keep" is the ScreenBlaster for 523 An example where you may want to use "keep" is the ScreenBlaster for
524 the Falcon. 524 the Falcon.
525 525
526 526
527 4.2) atamouse= 527 4.2) atamouse=
528 -------------- 528 --------------
529 529
530 Syntax: atamouse=<x-threshold>[,<y-threshold>] 530 Syntax: atamouse=<x-threshold>[,<y-threshold>]
531 531
532 With this option, you can set the mouse movement reporting threshold. 532 With this option, you can set the mouse movement reporting threshold.
533 This is the number of pixels of mouse movement that have to accumulate 533 This is the number of pixels of mouse movement that have to accumulate
534 before the IKBD sends a new mouse packet to the kernel. Higher values 534 before the IKBD sends a new mouse packet to the kernel. Higher values
535 reduce the mouse interrupt load and thus reduce the chance of keyboard 535 reduce the mouse interrupt load and thus reduce the chance of keyboard
536 overruns. Lower values give slightly faster mouse response and 536 overruns. Lower values give slightly faster mouse response and
537 slightly better mouse tracking. 537 slightly better mouse tracking.
538 538
539 You can set the threshold in x and y separately, but usually this is 539 You can set the threshold in x and y separately, but usually this is
540 of little practical use. If there's just one number in the option, it 540 of little practical use. If there's just one number in the option, it
541 is used for both dimensions. The default value is 2 for both 541 is used for both dimensions. The default value is 2 for both
542 thresholds. 542 thresholds.
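The threshold rules above (one number applies to both dimensions, default 2 for both) can be sketched as follows. The function parse_atamouse is a hypothetical helper for illustration; it takes only the part after "atamouse=" and is not the kernel's own parser.

```python
def parse_atamouse(value=""):
    """Return (x_threshold, y_threshold) for the text after 'atamouse='.

    An empty string stands for the option not being given at all.
    """
    if not value:
        return (2, 2)  # documented default for both thresholds
    parts = value.split(",")
    x_threshold = int(parts[0])
    # A single number is used for both dimensions.
    if len(parts) > 1 and parts[1]:
        y_threshold = int(parts[1])
    else:
        y_threshold = x_threshold
    return (x_threshold, y_threshold)


print(parse_atamouse("3"))
print(parse_atamouse("3,5"))
```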
543 543
544 544
545 4.3) ataflop= 545 4.3) ataflop=
546 ------------- 546 -------------
547 547
548 Syntax: ataflop=<drive type>[,<trackbuffering>[,<steprateA>[,<steprateB>]]] 548 Syntax: ataflop=<drive type>[,<trackbuffering>[,<steprateA>[,<steprateB>]]]
549 549
550 The drive type may be 0, 1, or 2, for DD, HD, and ED, resp. This 550 The drive type may be 0, 1, or 2, for DD, HD, and ED, resp. This
551 setting affects how many buffers are reserved and which formats are 551 setting affects how many buffers are reserved and which formats are
552 probed (see also below). The default is 1 (HD). Only one drive type 552 probed (see also below). The default is 1 (HD). Only one drive type
553 can be selected. If you have two disk drives, select the "better" 553 can be selected. If you have two disk drives, select the "better"
554 type. 554 type.
555 555
556 The second parameter <trackbuffering> tells the kernel whether to use 556 The second parameter <trackbuffering> tells the kernel whether to use
557 track buffering (1) or not (0). The default is machine-dependent: 557 track buffering (1) or not (0). The default is machine-dependent:
558 no for the Medusa and yes for all others. 558 no for the Medusa and yes for all others.
559 559
560 With the two following parameters, you can change the default 560 With the two following parameters, you can change the default
561 steprate used for drive A and B, resp. 561 steprate used for drive A and B, resp.
562 562
563 563
564 4.4) atascsi= 564 4.4) atascsi=
565 ------------- 565 -------------
566 566
567 Syntax: atascsi=<can_queue>[,<cmd_per_lun>[,<scat-gat>[,<host-id>[,<tagged>]]]] 567 Syntax: atascsi=<can_queue>[,<cmd_per_lun>[,<scat-gat>[,<host-id>[,<tagged>]]]]
568 568
569 This option sets some parameters for the Atari native SCSI driver. 569 This option sets some parameters for the Atari native SCSI driver.
570 Generally, any number of arguments can be omitted from the end. And 570 Generally, any number of arguments can be omitted from the end. And
571 for each of the numbers, a negative value means "use default". The 571 for each of the numbers, a negative value means "use default". The
572 defaults depend on whether TT-style or Falcon-style SCSI is used. 572 defaults depend on whether TT-style or Falcon-style SCSI is used.
573 Below, defaults are noted as n/m, where the first value refers to 573 Below, defaults are noted as n/m, where the first value refers to
574 TT-SCSI and the latter to Falcon-SCSI. If an illegal value is given 574 TT-SCSI and the latter to Falcon-SCSI. If an illegal value is given
575 for one parameter, an error message is printed and that one setting is 575 for one parameter, an error message is printed and that one setting is
576 ignored (others aren't affected). 576 ignored (others aren't affected).
577 577
578 <can_queue>: 578 <can_queue>:
579 This is the maximum number of SCSI commands queued internally to the 579 This is the maximum number of SCSI commands queued internally to the
580 Atari SCSI driver. A value of 1 effectively turns off the driver 580 Atari SCSI driver. A value of 1 effectively turns off the driver
581 internal multitasking (if it causes problems). Legal values are >= 581 internal multitasking (if it causes problems). Legal values are >=
582 1. <can_queue> can be as high as you like, but values greater than 582 1. <can_queue> can be as high as you like, but values greater than
583 <cmd_per_lun> times the number of SCSI targets (LUNs) you have 583 <cmd_per_lun> times the number of SCSI targets (LUNs) you have
584 don't make sense. Default: 16/8. 584 don't make sense. Default: 16/8.
585 585
586 <cmd_per_lun>: 586 <cmd_per_lun>:
587 Maximum number of SCSI commands issued to the driver for one 587 Maximum number of SCSI commands issued to the driver for one
588 logical unit (LUN, usually one SCSI target). Legal values start 588 logical unit (LUN, usually one SCSI target). Legal values start
589 from 1. If tagged queuing (see below) is not used, values greater 589 from 1. If tagged queuing (see below) is not used, values greater
590 than 2 don't make sense, but waste memory. Otherwise, the maximum 590 than 2 don't make sense, but waste memory. Otherwise, the maximum
591 is the number of command tags available to the driver (currently 591 is the number of command tags available to the driver (currently
592 32). Default: 8/1. (Note: Values > 1 seem to cause problems on a 592 32). Default: 8/1. (Note: Values > 1 seem to cause problems on a
593 Falcon, cause not yet known.) 593 Falcon, cause not yet known.)
594 594
595 The <cmd_per_lun> value to a large extent determines the amount of 595 The <cmd_per_lun> value to a large extent determines the amount of
596 memory SCSI reserves for itself. The formula is rather 596 memory SCSI reserves for itself. The formula is rather
597 complicated, but I can give you some hints: 597 complicated, but I can give you some hints:
598 no scatter-gather : cmd_per_lun * 232 bytes 598 no scatter-gather : cmd_per_lun * 232 bytes
599 full scatter-gather: cmd_per_lun * approx. 17 Kbytes 599 full scatter-gather: cmd_per_lun * approx. 17 Kbytes
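The two hints above can be turned into a quick estimate. This sketch covers only the two extreme cases named in the text, assumes 1 Kbyte = 1024 bytes, and uses a made-up function name; the real formula is, as stated, more complicated.

```python
def atascsi_memory_estimate(cmd_per_lun, full_scatter_gather):
    """Rough number of bytes reserved, based on the two hints in the text."""
    if cmd_per_lun < 1:
        raise ValueError("legal values for cmd_per_lun start from 1")
    # approx. 17 Kbytes per command with full scatter-gather (assuming
    # 1 Kbyte = 1024 bytes), 232 bytes per command without scatter-gather.
    bytes_per_command = 17 * 1024 if full_scatter_gather else 232
    return cmd_per_lun * bytes_per_command


# TT-SCSI default cmd_per_lun of 8, without and with full scatter-gather.
print(atascsi_memory_estimate(8, False))
print(atascsi_memory_estimate(8, True))
```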
600 600
601 <scat-gat>: 601 <scat-gat>:
602 Size of the scatter-gather table, i.e. the number of requests 602 Size of the scatter-gather table, i.e. the number of requests
603 consecutive on the disk that can be merged into one SCSI command. 603 consecutive on the disk that can be merged into one SCSI command.
604 Legal values are between 0 and 255. Default: 255/0. Note: This 604 Legal values are between 0 and 255. Default: 255/0. Note: This
605 value is forced to 0 on a Falcon, since scatter-gather isn't 605 value is forced to 0 on a Falcon, since scatter-gather isn't
606 possible with the ST-DMA. Not using scatter-gather hurts 606 possible with the ST-DMA. Not using scatter-gather hurts
607 performance significantly. 607 performance significantly.
608 608
609 <host-id>: 609 <host-id>:
610 The SCSI ID to be used by the initiator (your Atari). This is 610 The SCSI ID to be used by the initiator (your Atari). This is
611 usually 7, the highest possible ID. Every ID on the SCSI bus must 611 usually 7, the highest possible ID. Every ID on the SCSI bus must
612 be unique. Default: determined at run time: If the NV-RAM checksum 612 be unique. Default: determined at run time: If the NV-RAM checksum
613 is valid, and bit 7 in byte 30 of the NV-RAM is set, the lower 3 613 is valid, and bit 7 in byte 30 of the NV-RAM is set, the lower 3
614 bits of this byte are used as the host ID. (This method is defined 614 bits of this byte are used as the host ID. (This method is defined
615 by Atari and also used by some TOS HD drivers.) If these conditions 615 by Atari and also used by some TOS HD drivers.) If these conditions
616 aren't met, the default ID is 7 (for both TT and Falcon). 616 aren't met, the default ID is 7 (for both TT and Falcon).
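The run-time default for <host-id> described above amounts to a small bit test. The sketch below assumes the NV-RAM contents are available as a byte sequence indexed from 0 and that the checksum has already been verified by the caller; both the function name and this interface are hypothetical.

```python
def default_host_id(nvram, checksum_valid):
    """Pick the default SCSI host ID following the rule in the text."""
    config_byte = nvram[30]  # assumption: "byte 30" counts from 0
    # Bit 7 marks the setting as valid; the lower 3 bits hold the ID.
    if checksum_valid and (config_byte & 0x80):
        return config_byte & 0x07
    return 7  # fallback default


nvram = bytearray(64)  # example buffer, size chosen arbitrarily
nvram[30] = 0x85       # bit 7 set, lower 3 bits = 5
print(default_host_id(nvram, checksum_valid=True))
```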
617 617
618 <tagged>: 618 <tagged>:
619 0 means turn off tagged queuing support, all other values > 0 mean 619 0 means turn off tagged queuing support, all other values > 0 mean
620 use tagged queuing for targets that support it. Default: currently 620 use tagged queuing for targets that support it. Default: currently
621 off, but this may change when tagged queuing handling has been 621 off, but this may change when tagged queuing handling has been
622 proved to be reliable. 622 proved to be reliable.
623 623
624 Tagged queuing means that more than one command can be issued to 624 Tagged queuing means that more than one command can be issued to
625 one LUN, and the SCSI device itself orders the requests so they 625 one LUN, and the SCSI device itself orders the requests so they
626 can be performed in optimal order. Not all SCSI devices support 626 can be performed in optimal order. Not all SCSI devices support
627 tagged queuing (:-(). 627 tagged queuing (:-().
628 628
629 4.5) switches= 629 4.5) switches=
630 ------------- 630 -------------
631 631
632 Syntax: switches=<list of switches> 632 Syntax: switches=<list of switches>
633 633
634 With this option you can switch some hardware lines that are often 634 With this option you can switch some hardware lines that are often
635 used to enable/disable certain hardware extensions. Examples are 635 used to enable/disable certain hardware extensions. Examples are
636 OverScan, overclocking, ... 636 OverScan, overclocking, ...
637 637
638 The <list of switches> is a comma-separated list of the following 638 The <list of switches> is a comma-separated list of the following
639 items: 639 items:
640 640
641 ikbd: set RTS of the keyboard ACIA high 641 ikbd: set RTS of the keyboard ACIA high
642 midi: set RTS of the MIDI ACIA high 642 midi: set RTS of the MIDI ACIA high
643 snd6: set bit 6 of the PSG port A 643 snd6: set bit 6 of the PSG port A
644 snd7: set bit 7 of the PSG port A 644 snd7: set bit 7 of the PSG port A
645 645
646 It doesn't make sense to mention a switch more than once (no 646 It doesn't make sense to mention a switch more than once (no
647 difference to only once), but you can give as many switches as you 647 difference to only once), but you can give as many switches as you
648 want to enable different features. The switch lines are set as early 648 want to enable different features. The switch lines are set as early
649 as possible during kernel initialization (even before determining the 649 as possible during kernel initialization (even before determining the
650 present hardware.) 650 present hardware.)
651 651
652 All of the items can also be prefixed with "ov_", i.e. "ov_ikbd", 652 All of the items can also be prefixed with "ov_", i.e. "ov_ikbd",
653 "ov_midi", ... These options are meant for switching on an OverScan 653 "ov_midi", ... These options are meant for switching on an OverScan
654 video extension. The difference to the bare option is that the 654 video extension. The difference to the bare option is that the
655 switch-on is done after video initialization, and somehow synchronized 655 switch-on is done after video initialization, and somehow synchronized
656 to the HBLANK. A speciality is that ov_ikbd and ov_midi are switched 656 to the HBLANK. A speciality is that ov_ikbd and ov_midi are switched
657 off before rebooting, so that OverScan is disabled and TOS boots 657 off before rebooting, so that OverScan is disabled and TOS boots
658 correctly. 658 correctly.
659 659
660 If you give an option both with and without the "ov_" prefix, the 660 If you give an option both with and without the "ov_" prefix, the
661 earlier initialization ("ov_"-less) takes precedence. But the 661 earlier initialization ("ov_"-less) takes precedence. But the
662 switching-off on reset still happens in this case. 662 switching-off on reset still happens in this case.
663 663
664 5) Options for Amiga Only: 664 5) Options for Amiga Only:
665 ========================== 665 ==========================
666 666
667 5.1) video= 667 5.1) video=
668 ----------- 668 -----------
669 669
670 Syntax: video=<fbname>:<sub-options...> 670 Syntax: video=<fbname>:<sub-options...>
671 671
672 The <fbname> parameter specifies the name of the frame buffer, valid 672 The <fbname> parameter specifies the name of the frame buffer, valid
673 options are `amifb', `cyber', `virge', `retz3' and `clgen', provided 673 options are `amifb', `cyber', `virge', `retz3' and `clgen', provided
674 that the respective frame buffer devices have been compiled into the 674 that the respective frame buffer devices have been compiled into the
675 kernel (or compiled as loadable modules). The behavior of the <fbname> 675 kernel (or compiled as loadable modules). The behavior of the <fbname>
676 option was changed in 2.1.57 so it is now recommended to specify this 676 option was changed in 2.1.57 so it is now recommended to specify this
677 option. 677 option.
678 678
679 The <sub-options> is a comma-separated list of the sub-options listed 679 The <sub-options> is a comma-separated list of the sub-options listed
680 below. This option is organized similar to the Atari version of the 680 below. This option is organized similar to the Atari version of the
681 "video"-option (4.1), but knows fewer sub-options. 681 "video"-option (4.1), but knows fewer sub-options.
682 682
683 5.1.1) video mode 683 5.1.1) video mode
684 ----------------- 684 -----------------
685 685
686 Again, similar to the video mode for the Atari (see 4.1.1). Predefined 686 Again, similar to the video mode for the Atari (see 4.1.1). Predefined
687 modes depend on the used frame buffer device. 687 modes depend on the used frame buffer device.
688 688
689 OCS, ECS and AGA machines all use the color frame buffer. The following 689 OCS, ECS and AGA machines all use the color frame buffer. The following
690 predefined video modes are available: 690 predefined video modes are available:
691 691
692 NTSC modes: 692 NTSC modes:
693 - ntsc : 640x200, 15 kHz, 60 Hz 693 - ntsc : 640x200, 15 kHz, 60 Hz
694 - ntsc-lace : 640x400, 15 kHz, 60 Hz interlaced 694 - ntsc-lace : 640x400, 15 kHz, 60 Hz interlaced
695 PAL modes: 695 PAL modes:
696 - pal : 640x256, 15 kHz, 50 Hz 696 - pal : 640x256, 15 kHz, 50 Hz
697 - pal-lace : 640x512, 15 kHz, 50 Hz interlaced 697 - pal-lace : 640x512, 15 kHz, 50 Hz interlaced
698 ECS modes: 698 ECS modes:
699 - multiscan : 640x480, 29 kHz, 57 Hz 699 - multiscan : 640x480, 29 kHz, 57 Hz
700 - multiscan-lace : 640x960, 29 kHz, 57 Hz interlaced 700 - multiscan-lace : 640x960, 29 kHz, 57 Hz interlaced
701 - euro36 : 640x200, 15 kHz, 72 Hz 701 - euro36 : 640x200, 15 kHz, 72 Hz
702 - euro36-lace : 640x400, 15 kHz, 72 Hz interlaced 702 - euro36-lace : 640x400, 15 kHz, 72 Hz interlaced
703 - euro72 : 640x400, 29 kHz, 68 Hz 703 - euro72 : 640x400, 29 kHz, 68 Hz
704 - euro72-lace : 640x800, 29 kHz, 68 Hz interlaced 704 - euro72-lace : 640x800, 29 kHz, 68 Hz interlaced
705 - super72 : 800x300, 23 kHz, 70 Hz 705 - super72 : 800x300, 23 kHz, 70 Hz
706 - super72-lace : 800x600, 23 kHz, 70 Hz interlaced 706 - super72-lace : 800x600, 23 kHz, 70 Hz interlaced
707 - dblntsc-ff : 640x400, 27 kHz, 57 Hz 707 - dblntsc-ff : 640x400, 27 kHz, 57 Hz
708 - dblntsc-lace : 640x800, 27 kHz, 57 Hz interlaced 708 - dblntsc-lace : 640x800, 27 kHz, 57 Hz interlaced
709 - dblpal-ff : 640x512, 27 kHz, 47 Hz 709 - dblpal-ff : 640x512, 27 kHz, 47 Hz
710 - dblpal-lace : 640x1024, 27 kHz, 47 Hz interlaced 710 - dblpal-lace : 640x1024, 27 kHz, 47 Hz interlaced
711 - dblntsc : 640x200, 27 kHz, 57 Hz doublescan 711 - dblntsc : 640x200, 27 kHz, 57 Hz doublescan
712 - dblpal : 640x256, 27 kHz, 47 Hz doublescan 712 - dblpal : 640x256, 27 kHz, 47 Hz doublescan
713 VGA modes: 713 VGA modes:
714 - vga : 640x480, 31 kHz, 60 Hz 714 - vga : 640x480, 31 kHz, 60 Hz
715 - vga70 : 640x400, 31 kHz, 70 Hz 715 - vga70 : 640x400, 31 kHz, 70 Hz
716 716
717 Please notice that the ECS and VGA modes require either an ECS or AGA 717 Please notice that the ECS and VGA modes require either an ECS or AGA
718 chipset, and that these modes are limited to 2-bit color for the ECS 718 chipset, and that these modes are limited to 2-bit color for the ECS
719 chipset and 8-bit color for the AGA chipset. 719 chipset and 8-bit color for the AGA chipset.
720 720
721 5.1.2) depth 721 5.1.2) depth
722 ------------ 722 ------------
723 723
724 Syntax: depth:<nr. of bit-planes> 724 Syntax: depth:<nr. of bit-planes>
725 725
726 Specify the number of bit-planes for the selected video-mode. 726 Specify the number of bit-planes for the selected video-mode.
727 727
728 5.1.3) inverse 728 5.1.3) inverse
729 -------------- 729 --------------
730 730
731 Use inverted display (black on white). Functionally the same as the 731 Use inverted display (black on white). Functionally the same as the
732 "inverse" sub-option for the Atari. 732 "inverse" sub-option for the Atari.
733 733
734 5.1.4) font 734 5.1.4) font
735 ----------- 735 -----------
736 736
737 Syntax: font:<fontname> 737 Syntax: font:<fontname>
738 738
739 Specify the font to use in text modes. Functionally the same as the 739 Specify the font to use in text modes. Functionally the same as the
740 "font" sub-option for the Atari, except that `PEARL8x8' is used instead 740 "font" sub-option for the Atari, except that `PEARL8x8' is used instead
741 of `VGA8x8' if the vertical size of the display is less than 400 pixel 741 of `VGA8x8' if the vertical size of the display is less than 400 pixel
742 rows. 742 rows.
743 743
744 5.1.5) monitorcap: 744 5.1.5) monitorcap:
745 ------------------- 745 -------------------
746 746
747 Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax> 747 Syntax: monitorcap:<vmin>;<vmax>;<hmin>;<hmax>
748 748
749 This describes the capabilities of a multisync monitor. For now, only 749 This describes the capabilities of a multisync monitor. For now, only
750 the color frame buffer uses the settings of "monitorcap:". 750 the color frame buffer uses the settings of "monitorcap:".
751 751
752 <vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies 752 <vmin> and <vmax> are the minimum and maximum, resp., vertical frequencies
753 your monitor can work with, in Hz. <hmin> and <hmax> are the same for 753 your monitor can work with, in Hz. <hmin> and <hmax> are the same for
754 the horizontal frequency, in kHz. 754 the horizontal frequency, in kHz.
755 755
756 The defaults are 50;90;15;38 (Generic Amiga multisync monitor). 756 The defaults are 50;90;15;38 (Generic Amiga multisync monitor).
757 757
758 758
759 5.2) fd_def_df0= 759 5.2) fd_def_df0=
760 ---------------- 760 ----------------
761 761
762 Syntax: fd_def_df0=<value> 762 Syntax: fd_def_df0=<value>
763 763
764 Sets the df0 value for "silent" floppy drives. The value should be in 764 Sets the df0 value for "silent" floppy drives. The value should be in
765 hexadecimal with "0x" prefix. 765 hexadecimal with "0x" prefix.
766 766
767 767
768 5.3) wd33c93= 768 5.3) wd33c93=
769 ------------- 769 -------------
770 770
771 Syntax: wd33c93=<sub-options...> 771 Syntax: wd33c93=<sub-options...>
772 772
773 These options affect the A590/A2091, A3000 and GVP Series II SCSI 773 These options affect the A590/A2091, A3000 and GVP Series II SCSI
774 controllers. 774 controllers.
775 775
776 The <sub-options> is a comma-separated list of the sub-options listed 776 The <sub-options> is a comma-separated list of the sub-options listed
777 below. 777 below.
778 778
779 5.3.1) nosync 779 5.3.1) nosync
780 ------------- 780 -------------
781 781
782 Syntax: nosync:bitmask 782 Syntax: nosync:bitmask
783 783
784 bitmask is a byte where the first 7 bits correspond to the 7 784 bitmask is a byte where the first 7 bits correspond to the 7
785 possible SCSI devices. Set a bit to prevent sync negotiation on that 785 possible SCSI devices. Set a bit to prevent sync negotiation on that
786 device. To maintain backwards compatibility, a command-line such as 786 device. To maintain backwards compatibility, a command-line such as
787 "wd33c93=255" will be automatically translated to 787 "wd33c93=255" will be automatically translated to
788 "wd33c93=nosync:0xff". The default is to disable sync negotiation for 788 "wd33c93=nosync:0xff". The default is to disable sync negotiation for
789 all devices, i.e. nosync:0xff. 789 all devices, i.e. nosync:0xff.
790 790
791 5.3.2) period 791 5.3.2) period
792 ------------- 792 -------------
793 793
794 Syntax: period:ns 794 Syntax: period:ns
795 795
796 `ns' is the minimum # of nanoseconds in a SCSI data transfer 796 `ns' is the minimum # of nanoseconds in a SCSI data transfer
797 period. Default is 500; acceptable values are 250 - 1000. 797 period. Default is 500; acceptable values are 250 - 1000.
798 798
799 5.3.3) disconnect 799 5.3.3) disconnect
800 ----------------- 800 -----------------
801 801
802 Syntax: disconnect:x 802 Syntax: disconnect:x
803 803
804 Specify x = 0 to never allow disconnects, 2 to always allow them. 804 Specify x = 0 to never allow disconnects, 2 to always allow them.
805 x = 1 does 'adaptive' disconnects, which is the default and generally 805 x = 1 does 'adaptive' disconnects, which is the default and generally
806 the best choice. 806 the best choice.
807 807
808 5.3.4) debug 808 5.3.4) debug
809 ------------ 809 ------------
810 810
811 Syntax: debug:x 811 Syntax: debug:x
812 812
813 If `DEBUGGING_ON' is defined, x is a bit mask that causes various 813 If `DEBUGGING_ON' is defined, x is a bit mask that causes various
814 types of debug output to be printed - see the DB_xxx defines in 814 types of debug output to be printed - see the DB_xxx defines in
815 wd33c93.h. 815 wd33c93.h.
816 816
817 5.3.5) clock 817 5.3.5) clock
818 ------------ 818 ------------
819 819
820 Syntax: clock:x 820 Syntax: clock:x
821 821
822 x = clock input in MHz for WD33c93 chip. Normal values would be from 822 x = clock input in MHz for WD33c93 chip. Normal values would be from
823 8 through 20. The default value depends on your hostadapter(s), 823 8 through 20. The default value depends on your hostadapter(s),
824 default for the A3000 internal controller is 14, for the A2091 it's 8 824 default for the A3000 internal controller is 14, for the A2091 it's 8
825 and for the GVP hostadapters it's either 8 or 14, depending on the 825 and for the GVP hostadapters it's either 8 or 14, depending on the
826 hostadapter and the SCSI-clock jumper present on some GVP 826 hostadapter and the SCSI-clock jumper present on some GVP
827 hostadapters. 827 hostadapters.
828 828
829 5.3.6) next 829 5.3.6) next
830 ----------- 830 -----------
831 831
832 No argument. Used to separate blocks of keywords when there's more 832 No argument. Used to separate blocks of keywords when there's more
833 than one wd33c93-based host adapter in the system. 833 than one wd33c93-based host adapter in the system.
834 834
835 5.3.7) nodma 835 5.3.7) nodma
836 ------------ 836 ------------
837 837
838 Syntax: nodma:x 838 Syntax: nodma:x
839 839
840 If x is 1 (or if the option is just written as "nodma"), the WD33c93 840 If x is 1 (or if the option is just written as "nodma"), the WD33c93
841 controller will not use DMA (= direct memory access) to access the 841 controller will not use DMA (= direct memory access) to access the
842 Amiga's memory. This is useful for some systems (like A3000's and 842 Amiga's memory. This is useful for some systems (like A3000's and
843 A4000's with the A3640 accelerator, revision 3.0) that have problems 843 A4000's with the A3640 accelerator, revision 3.0) that have problems
844 using DMA to chip memory. The default is 0, i.e. to use DMA if 844 using DMA to chip memory. The default is 0, i.e. to use DMA if
845 possible. 845 possible.
846 846
847 847
848 5.4) gvp11= 848 5.4) gvp11=
849 ----------- 849 -----------
850 850
851 Syntax: gvp11=<addr-mask> 851 Syntax: gvp11=<addr-mask>
852 852
853 The earlier versions of the GVP driver did not handle DMA 853 The earlier versions of the GVP driver did not handle DMA
854 address-mask settings correctly which made it necessary for some 854 address-mask settings correctly which made it necessary for some
855 people to use this option, in order to get their GVP controller 855 people to use this option, in order to get their GVP controller
856 running under Linux. These problems have hopefully been solved and the 856 running under Linux. These problems have hopefully been solved and the
857 use of this option is now strongly discouraged! 857 use of this option is now strongly discouraged!
858 858
859 Incorrect use can lead to unpredictable behavior, so please only use 859 Incorrect use can lead to unpredictable behavior, so please only use
860 this option if you *know* what you are doing and have a reason to do 860 this option if you *know* what you are doing and have a reason to do
861 so. In any case if you experience problems and need to use this 861 so. In any case if you experience problems and need to use this
862 option, please inform us about it by mailing to the Linux/68k kernel 862 option, please inform us about it by mailing to the Linux/68k kernel
863 mailing list. 863 mailing list.
864 864
865 The address mask set by this option specifies which addresses are 865 The address mask set by this option specifies which addresses are
866 valid for DMA with the GVP Series II SCSI controller. An address is 866 valid for DMA with the GVP Series II SCSI controller. An address is
867 valid if it has no bits set other than bits that are also set in the 867 valid if it has no bits set other than bits that are also set in the
868 mask. 868 mask.
869 869
870 Some versions of the GVP can only DMA into a 24 bit address range, 870 Some versions of the GVP can only DMA into a 24 bit address range,
871 some can address a 25 bit address range while others can use the whole 871 some can address a 25 bit address range while others can use the whole
872 32 bit address range for DMA. The correct setting depends on your 872 32 bit address range for DMA. The correct setting depends on your
873 controller and should be autodetected by the driver. An example is the 873 controller and should be autodetected by the driver. An example is the
874 24 bit region which is specified by a mask of 0x00fffffe. 874 24 bit region which is specified by a mask of 0x00fffffe.
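The validity rule for the address mask can be written as a single bit operation. This is a sketch of the rule as worded above, restricted to 32-bit addresses; the function name is hypothetical and the code is not taken from the GVP driver.

```python
def dma_address_valid(addr, mask):
    """True if addr has no bits set outside of mask (32-bit addresses)."""
    return (addr & ~mask & 0xffffffff) == 0


MASK_24BIT = 0x00fffffe  # the 24 bit example mask from the text

print(dma_address_valid(0x00123456, MASK_24BIT))  # inside the 24 bit region
print(dma_address_valid(0x01000000, MASK_24BIT))  # bit 24 is outside the mask
```

Note that with this mask bit 0 is also excluded, so odd addresses are rejected as well.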
875 875
876 876
877 5.5) 53c7xx= 877 5.5) 53c7xx=
878 ------------ 878 ------------
879 879
880 Syntax: 53c7xx=<sub-options...> 880 Syntax: 53c7xx=<sub-options...>
881 881
882 These options affect the A4000T, A4091, WarpEngine, Blizzard 603e+, 882 These options affect the A4000T, A4091, WarpEngine, Blizzard 603e+,
883 and GForce 040/060 SCSI controllers on the Amiga, as well as the 883 and GForce 040/060 SCSI controllers on the Amiga, as well as the
884 builtin MVME 16x SCSI controller. 884 builtin MVME 16x SCSI controller.
885 885
886 The <sub-options> is a comma-separated list of the sub-options listed 886 The <sub-options> is a comma-separated list of the sub-options listed
887 below. 887 below.
888 888
889 5.5.1) nosync 889 5.5.1) nosync
890 ------------- 890 -------------
891 891
892 Syntax: nosync:0 892 Syntax: nosync:0
893 893
894 Disables sync negotiation for all devices. Any value after the 894 Disables sync negotiation for all devices. Any value after the
895 colon is acceptable (and has the same effect). 895 colon is acceptable (and has the same effect).
896 896
897 5.5.2) noasync 897 5.5.2) noasync
898 -------------- 898 --------------
899 899
900 Syntax: noasync:0 900 Syntax: noasync:0
901 901
902 Disables async and sync negotiation for all devices. Any value 902 Disables async and sync negotiation for all devices. Any value
903 after the colon is acceptable (and has the same effect). 903 after the colon is acceptable (and has the same effect).
904 904
905 5.5.3) nodisconnect 905 5.5.3) nodisconnect
906 ------------------- 906 -------------------
907 907
908 Syntax: nodisconnect:0 908 Syntax: nodisconnect:0
909 909
910 Disables SCSI disconnects. Any value after the colon is acceptable 910 Disables SCSI disconnects. Any value after the colon is acceptable
911 (and has the same effect). 911 (and has the same effect).
912 912
913 5.5.4) validids 913 5.5.4) validids
914 --------------- 914 ---------------
915 915
916 Syntax: validids:0xNN 916 Syntax: validids:0xNN
917 917
918 Specify which SCSI ids the driver should pay attention to. This is 918 Specify which SCSI ids the driver should pay attention to. This is
919 a bitmask (e.g. to only pay attention to ID#4, you'd use 0x10). 919 a bitmask (e.g. to only pay attention to ID#4, you'd use 0x10).
920 Default is 0x7f (devices 0-6). 920 Default is 0x7f (devices 0-6).
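As a rough illustration of the bitmask described above, the following sketch builds a validids value from a list of SCSI IDs. The helper name validids_mask() and the 0-7 ID range are assumptions made for this example; they are not taken from the driver.

```c
#include <stddef.h>

/*
 * Hypothetical helper: build a validids-style bitmask in which bit N is
 * set when SCSI ID N appears in the list.
 */
unsigned int validids_mask(const unsigned int *ids, size_t count)
{
	unsigned int mask = 0;

	for (size_t i = 0; i < count; i++) {
		if (ids[i] < 8)		/* assumption: IDs 0-7 only */
			mask |= 1u << ids[i];
	}
	return mask;
}
```

With this helper, a list holding only ID 4 gives 0x10, and the list 0 through 6 gives the documented default of 0x7f.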
921 921
922 5.5.5) opthi 922 5.5.5) opthi
923 5.5.6) optlo 923 5.5.6) optlo
924 ------------ 924 ------------
925 925
926 Syntax: opthi:M,optlo:N 926 Syntax: opthi:M,optlo:N
927 927
928 Specify options for "hostdata->options". The acceptable definitions 928 Specify options for "hostdata->options". The acceptable definitions
929 are listed in drivers/scsi/53c7xx.h; the 32 high bits should be in 929 are listed in drivers/scsi/53c7xx.h; the 32 high bits should be in
930 opthi and the 32 low bits in optlo. They must be specified in the 930 opthi and the 32 low bits in optlo. They must be specified in the
931 order opthi:M,optlo:N. 931 order opthi:M,optlo:N.
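The split into high and low halves can be sketched as follows, assuming the options value is a 64-bit quantity as the text implies. The struct and function names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Hypothetical container for the two values passed on the command line. */
struct opt_halves {
	uint32_t hi;	/* value for opthi: the 32 high bits */
	uint32_t lo;	/* value for optlo: the 32 low bits */
};

/* Split a 64-bit options value into its opthi and optlo parts. */
struct opt_halves split_options(uint64_t options)
{
	struct opt_halves h;

	h.hi = (uint32_t)(options >> 32);
	h.lo = (uint32_t)(options & 0xffffffffu);
	return h;
}

/* Recombine the halves, as the driver would presumably have to do. */
uint64_t join_options(struct opt_halves h)
{
	return ((uint64_t)h.hi << 32) | h.lo;
}
```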
932 932
933 5.5.7) next 933 5.5.7) next
934 ----------- 934 -----------
935 935
936 No argument. Used to separate blocks of keywords when there's more 936 No argument. Used to separate blocks of keywords when there's more
937 than one 53c7xx host adapter in the system. 937 than one 53c7xx host adapter in the system.
938 938
939 939
940 /* Local Variables: */ 940 /* Local Variables: */
941 /* mode: text */ 941 /* mode: text */
942 /* End: */ 942 /* End: */
943 943
Documentation/memory-barriers.txt
1 ============================ 1 ============================
2 LINUX KERNEL MEMORY BARRIERS 2 LINUX KERNEL MEMORY BARRIERS
3 ============================ 3 ============================
4 4
5 By: David Howells <dhowells@redhat.com> 5 By: David Howells <dhowells@redhat.com>
6 6
7 Contents: 7 Contents:
8 8
9 (*) Abstract memory access model. 9 (*) Abstract memory access model.
10 10
11 - Device operations. 11 - Device operations.
12 - Guarantees. 12 - Guarantees.
13 13
14 (*) What are memory barriers? 14 (*) What are memory barriers?
15 15
16 - Varieties of memory barrier. 16 - Varieties of memory barrier.
17 - What may not be assumed about memory barriers? 17 - What may not be assumed about memory barriers?
18 - Data dependency barriers. 18 - Data dependency barriers.
19 - Control dependencies. 19 - Control dependencies.
20 - SMP barrier pairing. 20 - SMP barrier pairing.
21 - Examples of memory barrier sequences. 21 - Examples of memory barrier sequences.
22 - Read memory barriers vs load speculation. 22 - Read memory barriers vs load speculation.
23 23
24 (*) Explicit kernel barriers. 24 (*) Explicit kernel barriers.
25 25
26 - Compiler barrier. 26 - Compiler barrier.
27 - The CPU memory barriers. 27 - The CPU memory barriers.
28 - MMIO write barrier. 28 - MMIO write barrier.
29 29
30 (*) Implicit kernel memory barriers. 30 (*) Implicit kernel memory barriers.
31 31
32 - Locking functions. 32 - Locking functions.
33 - Interrupt disabling functions. 33 - Interrupt disabling functions.
34 - Miscellaneous functions. 34 - Miscellaneous functions.
35 35
36 (*) Inter-CPU locking barrier effects. 36 (*) Inter-CPU locking barrier effects.
37 37
38 - Locks vs memory accesses. 38 - Locks vs memory accesses.
39 - Locks vs I/O accesses. 39 - Locks vs I/O accesses.
40 40
41 (*) Where are memory barriers needed? 41 (*) Where are memory barriers needed?
42 42
43 - Interprocessor interaction. 43 - Interprocessor interaction.
44 - Atomic operations. 44 - Atomic operations.
45 - Accessing devices. 45 - Accessing devices.
46 - Interrupts. 46 - Interrupts.
47 47
48 (*) Kernel I/O barrier effects. 48 (*) Kernel I/O barrier effects.
49 49
50 (*) Assumed minimum execution ordering model. 50 (*) Assumed minimum execution ordering model.
51 51
52 (*) The effects of the cpu cache. 52 (*) The effects of the cpu cache.
53 53
54 - Cache coherency. 54 - Cache coherency.
55 - Cache coherency vs DMA. 55 - Cache coherency vs DMA.
56 - Cache coherency vs MMIO. 56 - Cache coherency vs MMIO.
57 57
58 (*) The things CPUs get up to. 58 (*) The things CPUs get up to.
59 59
60 - And then there's the Alpha. 60 - And then there's the Alpha.
61 61
62 (*) References. 62 (*) References.
63 63
64 64
65 ============================ 65 ============================
66 ABSTRACT MEMORY ACCESS MODEL 66 ABSTRACT MEMORY ACCESS MODEL
67 ============================ 67 ============================
68 68
69 Consider the following abstract model of the system: 69 Consider the following abstract model of the system:
70 70
71 : : 71 : :
72 : : 72 : :
73 : : 73 : :
74 +-------+ : +--------+ : +-------+ 74 +-------+ : +--------+ : +-------+
75 | | : | | : | | 75 | | : | | : | |
76 | | : | | : | | 76 | | : | | : | |
77 | CPU 1 |<----->| Memory |<----->| CPU 2 | 77 | CPU 1 |<----->| Memory |<----->| CPU 2 |
78 | | : | | : | | 78 | | : | | : | |
79 | | : | | : | | 79 | | : | | : | |
80 +-------+ : +--------+ : +-------+ 80 +-------+ : +--------+ : +-------+
81 ^ : ^ : ^ 81 ^ : ^ : ^
82 | : | : | 82 | : | : |
83 | : | : | 83 | : | : |
84 | : v : | 84 | : v : |
85 | : +--------+ : | 85 | : +--------+ : |
86 | : | | : | 86 | : | | : |
87 | : | | : | 87 | : | | : |
88 +---------->| Device |<----------+ 88 +---------->| Device |<----------+
89 : | | : 89 : | | :
90 : | | : 90 : | | :
91 : +--------+ : 91 : +--------+ :
92 : : 92 : :
93 93
94 Each CPU executes a program that generates memory access operations. In the 94 Each CPU executes a program that generates memory access operations. In the
95 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually 95 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
96 perform the memory operations in any order it likes, provided program causality 96 perform the memory operations in any order it likes, provided program causality
97 appears to be maintained. Similarly, the compiler may also arrange the 97 appears to be maintained. Similarly, the compiler may also arrange the
98 instructions it emits in any order it likes, provided it doesn't affect the 98 instructions it emits in any order it likes, provided it doesn't affect the
99 apparent operation of the program. 99 apparent operation of the program.
100 100
101 So in the above diagram, the effects of the memory operations performed by a 101 So in the above diagram, the effects of the memory operations performed by a
102 CPU are perceived by the rest of the system as the operations cross the 102 CPU are perceived by the rest of the system as the operations cross the
103 interface between the CPU and rest of the system (the dotted lines). 103 interface between the CPU and rest of the system (the dotted lines).
104 104
105 105
106 For example, consider the following sequence of events: 106 For example, consider the following sequence of events:
107 107
108 CPU 1 CPU 2 108 CPU 1 CPU 2
109 =============== =============== 109 =============== ===============
110 { A == 1; B == 2 } 110 { A == 1; B == 2 }
111 A = 3; x = A; 111 A = 3; x = A;
112 B = 4; y = B; 112 B = 4; y = B;
113 113
114 The set of accesses as seen by the memory system in the middle can be arranged 114 The set of accesses as seen by the memory system in the middle can be arranged
115 in 24 different combinations: 115 in 24 different combinations:
116 116
117 STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4 117 STORE A=3, STORE B=4, x=LOAD A->3, y=LOAD B->4
118 STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3 118 STORE A=3, STORE B=4, y=LOAD B->4, x=LOAD A->3
119 STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4 119 STORE A=3, x=LOAD A->3, STORE B=4, y=LOAD B->4
120 STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4 120 STORE A=3, x=LOAD A->3, y=LOAD B->2, STORE B=4
121 STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3 121 STORE A=3, y=LOAD B->2, STORE B=4, x=LOAD A->3
122 STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4 122 STORE A=3, y=LOAD B->2, x=LOAD A->3, STORE B=4
123 STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4 123 STORE B=4, STORE A=3, x=LOAD A->3, y=LOAD B->4
124 STORE B=4, ... 124 STORE B=4, ...
125 ... 125 ...
126 126
127 and can thus result in four different combinations of values: 127 and can thus result in four different combinations of values:
128 128
129 x == 1, y == 2 129 x == 1, y == 2
130 x == 1, y == 4 130 x == 1, y == 4
131 x == 3, y == 2 131 x == 3, y == 2
132 x == 3, y == 4 132 x == 3, y == 4
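The counts above can be checked by brute force. The sketch below is a userspace toy, not kernel code: it replays every ordering of the four operations against memory that starts at A == 1, B == 2 and records which (x, y) pairs occur. All names in it are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* The four memory operations from the example. */
enum op { STORE_A, STORE_B, LOAD_A, LOAD_B };

struct outcome_set {
	int orderings;		/* number of orderings replayed */
	bool seen[2][2];	/* indexed by [x == 3][y == 4] */
};

/* Replay one ordering against memory starting at A == 1, B == 2. */
static void replay(const enum op order[4], struct outcome_set *out)
{
	int A = 1, B = 2, x = 0, y = 0;

	for (size_t i = 0; i < 4; i++) {
		switch (order[i]) {
		case STORE_A: A = 3; break;
		case STORE_B: B = 4; break;
		case LOAD_A:  x = A; break;
		case LOAD_B:  y = B; break;
		}
	}
	out->orderings++;
	out->seen[x == 3][y == 4] = true;
}

/* Visit every permutation of order[k..3] by swapping in place. */
static void permute(enum op order[4], size_t k, struct outcome_set *out)
{
	if (k == 4) {
		replay(order, out);
		return;
	}
	for (size_t i = k; i < 4; i++) {
		enum op tmp = order[k];

		order[k] = order[i];
		order[i] = tmp;
		permute(order, k + 1, out);
		tmp = order[k];		/* undo the swap */
		order[k] = order[i];
		order[i] = tmp;
	}
}

struct outcome_set enumerate_orderings(void)
{
	enum op order[4] = { STORE_A, STORE_B, LOAD_A, LOAD_B };
	struct outcome_set out = { 0 };

	permute(order, 0, &out);
	return out;
}

/* Count how many distinct (x, y) pairs were observed. */
int distinct_outcomes(const struct outcome_set *s)
{
	int n = 0;

	for (int i = 0; i < 2; i++)
		for (int j = 0; j < 2; j++)
			n += s->seen[i][j] ? 1 : 0;
	return n;
}
```

The enumeration treats all four operations as freely reorderable, which matches the abstract model here rather than any particular CPU.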
133 133
134 134
135 Furthermore, the stores committed by a CPU to the memory system may not be 135 Furthermore, the stores committed by a CPU to the memory system may not be
136 perceived by the loads made by another CPU in the same order as the stores were 136 perceived by the loads made by another CPU in the same order as the stores were
137 committed. 137 committed.
138 138
139 139
140 As a further example, consider this sequence of events: 140 As a further example, consider this sequence of events:
141 141
142 CPU 1 CPU 2 142 CPU 1 CPU 2
143 =============== =============== 143 =============== ===============
144 { A == 1, B == 2, C == 3, P == &A, Q == &C } 144 { A == 1, B == 2, C == 3, P == &A, Q == &C }
145 B = 4; Q = P; 145 B = 4; Q = P;
146 P = &B; D = *Q; 146 P = &B; D = *Q;
147 147
148 There is an obvious data dependency here, as the value loaded into D depends on 148 There is an obvious data dependency here, as the value loaded into D depends on
149 the address retrieved from P by CPU 2. At the end of the sequence, any of the 149 the address retrieved from P by CPU 2. At the end of the sequence, any of the
150 following results are possible: 150 following results are possible:
151 151
152 (Q == &A) and (D == 1) 152 (Q == &A) and (D == 1)
153 (Q == &B) and (D == 2) 153 (Q == &B) and (D == 2)
154 (Q == &B) and (D == 4) 154 (Q == &B) and (D == 4)
155 155
156 Note that CPU 2 will never try to load C into D because the CPU will load P 156 Note that CPU 2 will never try to load C into D because the CPU will load P
157 into Q before issuing the load of *Q. 157 into Q before issuing the load of *Q.
158 158
159 159
160 DEVICE OPERATIONS 160 DEVICE OPERATIONS
161 ----------------- 161 -----------------
162 162
163 Some devices present their control interfaces as collections of memory 163 Some devices present their control interfaces as collections of memory
164 locations, but the order in which the control registers are accessed is very 164 locations, but the order in which the control registers are accessed is very
165 important. For instance, imagine an ethernet card with a set of internal 165 important. For instance, imagine an ethernet card with a set of internal
166 registers that are accessed through an address port register (A) and a data 166 registers that are accessed through an address port register (A) and a data
167 port register (D). To read internal register 5, the following code might then 167 port register (D). To read internal register 5, the following code might then
168 be used: 168 be used:
169 169
170 *A = 5; 170 *A = 5;
171 x = *D; 171 x = *D;
172 172
173 but this might show up as either of the following two sequences: 173 but this might show up as either of the following two sequences:
174 174
175 STORE *A = 5, x = LOAD *D 175 STORE *A = 5, x = LOAD *D
176 x = LOAD *D, STORE *A = 5 176 x = LOAD *D, STORE *A = 5
177 177
178 the second of which will almost certainly result in a malfunction, since it sets 178 the second of which will almost certainly result in a malfunction, since it sets
179 the address _after_ attempting to read the register. 179 the address _after_ attempting to read the register.
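The failure mode can be made concrete with a toy model of such a card. The sketch below is plain userspace C with invented names; it does not touch real hardware and only shows what each of the two sequences would return.

```c
#include <stdint.h>

/*
 * Toy model of the card: the address port selects which internal
 * register the data port exposes. Eight registers is an arbitrary
 * choice for the example.
 */
struct toy_card {
	uint8_t selected;	/* last value written to address port A */
	uint32_t regs[8];	/* internal registers */
};

static void store_addr_port(struct toy_card *c, uint8_t index)
{
	c->selected = index & 7;
}

static uint32_t load_data_port(const struct toy_card *c)
{
	return c->regs[c->selected];
}

/* STORE *A = index, x = LOAD *D: the order the code intended. */
uint32_t read_reg_in_order(struct toy_card *c, uint8_t index)
{
	store_addr_port(c, index);
	return load_data_port(c);
}

/* x = LOAD *D, STORE *A = index: the order that may actually occur. */
uint32_t read_reg_reordered(struct toy_card *c, uint8_t index)
{
	uint32_t x = load_data_port(c);	/* reads the old selection */

	store_addr_port(c, index);
	return x;
}
```

If register 0 was selected beforehand, the reordered variant returns the contents of register 0 rather than register 5, even though the address port ends up holding 5 in both cases.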
180 180
181 181
182 GUARANTEES 182 GUARANTEES
183 ---------- 183 ----------
184 184
185 There are some minimal guarantees that may be expected of a CPU: 185 There are some minimal guarantees that may be expected of a CPU:
186 186
187 (*) On any given CPU, dependent memory accesses will be issued in order, with 187 (*) On any given CPU, dependent memory accesses will be issued in order, with
188 respect to itself. This means that for: 188 respect to itself. This means that for:
189 189
190 Q = P; D = *Q; 190 Q = P; D = *Q;
191 191
192 the CPU will issue the following memory operations: 192 the CPU will issue the following memory operations:
193 193
194 Q = LOAD P, D = LOAD *Q 194 Q = LOAD P, D = LOAD *Q
195 195
196 and always in that order. 196 and always in that order.
197 197
198 (*) Overlapping loads and stores within a particular CPU will appear to be 198 (*) Overlapping loads and stores within a particular CPU will appear to be
199 ordered within that CPU. This means that for: 199 ordered within that CPU. This means that for:
200 200
201 a = *X; *X = b; 201 a = *X; *X = b;
202 202
203 the CPU will only issue the following sequence of memory operations: 203 the CPU will only issue the following sequence of memory operations:
204 204
205 a = LOAD *X, STORE *X = b 205 a = LOAD *X, STORE *X = b
206 206
207 And for: 207 And for:
208 208
209 *X = c; d = *X; 209 *X = c; d = *X;
210 210
211 the CPU will only issue: 211 the CPU will only issue:
212 212
213 STORE *X = c, d = LOAD *X 213 STORE *X = c, d = LOAD *X
214 214
215 (Loads and stores overlap if they are targeted at overlapping pieces of 215 (Loads and stores overlap if they are targeted at overlapping pieces of
216 memory). 216 memory).
217 217
218 And there are a number of things that _must_ or _must_not_ be assumed: 218 And there are a number of things that _must_ or _must_not_ be assumed:
219 219
220 (*) It _must_not_ be assumed that independent loads and stores will be issued 220 (*) It _must_not_ be assumed that independent loads and stores will be issued
221 in the order given. This means that for: 221 in the order given. This means that for:
222 222
223 X = *A; Y = *B; *D = Z; 223 X = *A; Y = *B; *D = Z;
224 224
225 we may get any of the following sequences: 225 we may get any of the following sequences:
226 226
227 X = LOAD *A, Y = LOAD *B, STORE *D = Z 227 X = LOAD *A, Y = LOAD *B, STORE *D = Z
228 X = LOAD *A, STORE *D = Z, Y = LOAD *B 228 X = LOAD *A, STORE *D = Z, Y = LOAD *B
229 Y = LOAD *B, X = LOAD *A, STORE *D = Z 229 Y = LOAD *B, X = LOAD *A, STORE *D = Z
230 Y = LOAD *B, STORE *D = Z, X = LOAD *A 230 Y = LOAD *B, STORE *D = Z, X = LOAD *A
231 STORE *D = Z, X = LOAD *A, Y = LOAD *B 231 STORE *D = Z, X = LOAD *A, Y = LOAD *B
232 STORE *D = Z, Y = LOAD *B, X = LOAD *A 232 STORE *D = Z, Y = LOAD *B, X = LOAD *A
233 233
234 (*) It _must_ be assumed that overlapping memory accesses may be merged or 234 (*) It _must_ be assumed that overlapping memory accesses may be merged or
235 discarded. This means that for: 235 discarded. This means that for:
236 236
237 X = *A; Y = *(A + 4); 237 X = *A; Y = *(A + 4);
238 238
239 we may get any one of the following sequences: 239 we may get any one of the following sequences:
240 240
241 X = LOAD *A; Y = LOAD *(A + 4); 241 X = LOAD *A; Y = LOAD *(A + 4);
242 Y = LOAD *(A + 4); X = LOAD *A; 242 Y = LOAD *(A + 4); X = LOAD *A;
243 {X, Y} = LOAD {*A, *(A + 4) }; 243 {X, Y} = LOAD {*A, *(A + 4) };
244 244
245 And for: 245 And for:
246 246
247 *A = X; Y = *A; 247 *A = X; Y = *A;
248 248
249 we may get either of: 249 we may get either of:
250 250
251 STORE *A = X; Y = LOAD *A; 251 STORE *A = X; Y = LOAD *A;
252 STORE *A = Y = X; 252 STORE *A = Y = X;
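To see why merging is permitted for ordinary memory, consider the following userspace sketch, which assumes that A + 4 is a byte offset and that X and Y are 32-bit values. The function names are hypothetical. Both variants yield the same X and Y when nothing else modifies the memory; the difference only matters to an observer, such as a device, that can tell one 8-byte access from two 4-byte ones.

```c
#include <stdint.h>
#include <string.h>

struct pair {
	uint32_t x;
	uint32_t y;
};

/* X = LOAD *A; Y = LOAD *(A + 4): two separate 4-byte loads. */
struct pair load_separately(const unsigned char *a)
{
	struct pair p;

	memcpy(&p.x, a, 4);
	memcpy(&p.y, a + 4, 4);
	return p;
}

/* {X, Y} = LOAD {*A, *(A + 4)}: one 8-byte access covering both. */
struct pair load_merged(const unsigned char *a)
{
	struct pair p;

	memcpy(&p, a, sizeof(p));
	return p;
}
```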
253 253
254 254
255 ========================= 255 =========================
256 WHAT ARE MEMORY BARRIERS? 256 WHAT ARE MEMORY BARRIERS?
257 ========================= 257 =========================
258 258
259 As can be seen above, independent memory operations are effectively performed 259 As can be seen above, independent memory operations are effectively performed
260 in random order, but this can be a problem for CPU-CPU interaction and for I/O. 260 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
261 What is required is some way of intervening to instruct the compiler and the 261 What is required is some way of intervening to instruct the compiler and the
262 CPU to restrict the order. 262 CPU to restrict the order.
263 263
264 Memory barriers are such interventions. They impose a perceived partial 264 Memory barriers are such interventions. They impose a perceived partial
265 ordering over the memory operations on either side of the barrier. 265 ordering over the memory operations on either side of the barrier.
266 266
267 Such enforcement is important because the CPUs and other devices in a system 267 Such enforcement is important because the CPUs and other devices in a system
268 can use a variety of tricks to improve performance - including reordering, 268 can use a variety of tricks to improve performance - including reordering,
269 deferral and combination of memory operations; speculative loads; speculative 269 deferral and combination of memory operations; speculative loads; speculative
270 branch prediction and various types of caching. Memory barriers are used to 270 branch prediction and various types of caching. Memory barriers are used to
271 override or suppress these tricks, allowing the code to sanely control the 271 override or suppress these tricks, allowing the code to sanely control the
272 interaction of multiple CPUs and/or devices. 272 interaction of multiple CPUs and/or devices.
273 273
274 274
275 VARIETIES OF MEMORY BARRIER 275 VARIETIES OF MEMORY BARRIER
276 --------------------------- 276 ---------------------------
277 277
278 Memory barriers come in four basic varieties: 278 Memory barriers come in four basic varieties:
279 279
280 (1) Write (or store) memory barriers. 280 (1) Write (or store) memory barriers.
281 281
282 A write memory barrier gives a guarantee that all the STORE operations 282 A write memory barrier gives a guarantee that all the STORE operations
283 specified before the barrier will appear to happen before all the STORE 283 specified before the barrier will appear to happen before all the STORE
284 operations specified after the barrier with respect to the other 284 operations specified after the barrier with respect to the other
285 components of the system. 285 components of the system.
286 286
287 A write barrier is a partial ordering on stores only; it is not required 287 A write barrier is a partial ordering on stores only; it is not required
288 to have any effect on loads. 288 to have any effect on loads.
289 289
290 A CPU can be viewed as committing a sequence of store operations to the 290 A CPU can be viewed as committing a sequence of store operations to the
291 memory system as time progresses. All stores before a write barrier will 291 memory system as time progresses. All stores before a write barrier will
292 occur in the sequence _before_ all the stores after the write barrier. 292 occur in the sequence _before_ all the stores after the write barrier.
293 293
294 [!] Note that write barriers should normally be paired with read or data 294 [!] Note that write barriers should normally be paired with read or data
295 dependency barriers; see the "SMP barrier pairing" subsection. 295 dependency barriers; see the "SMP barrier pairing" subsection.
296 296
297 297
298 (2) Data dependency barriers. 298 (2) Data dependency barriers.
299 299
300 A data dependency barrier is a weaker form of read barrier. In the case 300 A data dependency barrier is a weaker form of read barrier. In the case
301 where two loads are performed such that the second depends on the result 301 where two loads are performed such that the second depends on the result
302 of the first (eg: the first load retrieves the address to which the second 302 of the first (eg: the first load retrieves the address to which the second
303 load will be directed), a data dependency barrier would be required to 303 load will be directed), a data dependency barrier would be required to
304 make sure that the target of the second load is updated before the address 304 make sure that the target of the second load is updated before the address
305 obtained by the first load is accessed. 305 obtained by the first load is accessed.
306 306
307 A data dependency barrier is a partial ordering on interdependent loads 307 A data dependency barrier is a partial ordering on interdependent loads
308 only; it is not required to have any effect on stores, independent loads 308 only; it is not required to have any effect on stores, independent loads
309 or overlapping loads. 309 or overlapping loads.
310 310
311 As mentioned in (1), the other CPUs in the system can be viewed as 311 As mentioned in (1), the other CPUs in the system can be viewed as
312 committing sequences of stores to the memory system that the CPU being 312 committing sequences of stores to the memory system that the CPU being
313 considered can then perceive. A data dependency barrier issued by the CPU 313 considered can then perceive. A data dependency barrier issued by the CPU
314 under consideration guarantees that for any load preceding it, if that 314 under consideration guarantees that for any load preceding it, if that
315 load touches one of a sequence of stores from another CPU, then by the 315 load touches one of a sequence of stores from another CPU, then by the
316 time the barrier completes, the effects of all the stores prior to that 316 time the barrier completes, the effects of all the stores prior to that
317 touched by the load will be perceptible to any loads issued after the data 317 touched by the load will be perceptible to any loads issued after the data
318 dependency barrier. 318 dependency barrier.
319 319
320 See the "Examples of memory barrier sequences" subsection for diagrams 320 See the "Examples of memory barrier sequences" subsection for diagrams
321 showing the ordering constraints. 321 showing the ordering constraints.
322 322
323 [!] Note that the first load really has to have a _data_ dependency and 323 [!] Note that the first load really has to have a _data_ dependency and
324 not a control dependency. If the address for the second load is dependent 324 not a control dependency. If the address for the second load is dependent
325 on the first load, but the dependency is through a conditional rather than 325 on the first load, but the dependency is through a conditional rather than
326 actually loading the address itself, then it's a _control_ dependency and 326 actually loading the address itself, then it's a _control_ dependency and
327 a full read barrier or better is required. See the "Control dependencies" 327 a full read barrier or better is required. See the "Control dependencies"
328 subsection for more information. 328 subsection for more information.
329 329
330 [!] Note that data dependency barriers should normally be paired with 330 [!] Note that data dependency barriers should normally be paired with
331 write barriers; see the "SMP barrier pairing" subsection. 331 write barriers; see the "SMP barrier pairing" subsection.
332 332
333 333
334 (3) Read (or load) memory barriers. 334 (3) Read (or load) memory barriers.
335 335
336 A read barrier is a data dependency barrier plus a guarantee that all the 336 A read barrier is a data dependency barrier plus a guarantee that all the
337 LOAD operations specified before the barrier will appear to happen before 337 LOAD operations specified before the barrier will appear to happen before
338 all the LOAD operations specified after the barrier with respect to the 338 all the LOAD operations specified after the barrier with respect to the
339 other components of the system. 339 other components of the system.
340 340
341 A read barrier is a partial ordering on loads only; it is not required to 341 A read barrier is a partial ordering on loads only; it is not required to
342 have any effect on stores. 342 have any effect on stores.
343 343
344 Read memory barriers imply data dependency barriers, and so can substitute 344 Read memory barriers imply data dependency barriers, and so can substitute
345 for them. 345 for them.
346 346
347 [!] Note that read barriers should normally be paired with write barriers; 347 [!] Note that read barriers should normally be paired with write barriers;
348 see the "SMP barrier pairing" subsection. 348 see the "SMP barrier pairing" subsection.
349 349
350 350
351 (4) General memory barriers. 351 (4) General memory barriers.
352 352
353 A general memory barrier gives a guarantee that all the LOAD and STORE 353 A general memory barrier gives a guarantee that all the LOAD and STORE
354 operations specified before the barrier will appear to happen before all 354 operations specified before the barrier will appear to happen before all
355 the LOAD and STORE operations specified after the barrier with respect to 355 the LOAD and STORE operations specified after the barrier with respect to
356 the other components of the system. 356 the other components of the system.
357 357
358 A general memory barrier is a partial ordering over both loads and stores. 358 A general memory barrier is a partial ordering over both loads and stores.
359 359
360 General memory barriers imply both read and write memory barriers, and so 360 General memory barriers imply both read and write memory barriers, and so
361 can substitute for either. 361 can substitute for either.
362 362
363 363
364 And a couple of implicit varieties: 364 And a couple of implicit varieties:
365 365
366 (5) LOCK operations. 366 (5) LOCK operations.
367 367
368 This acts as a one-way permeable barrier. It guarantees that all memory 368 This acts as a one-way permeable barrier. It guarantees that all memory
369 operations after the LOCK operation will appear to happen after the LOCK 369 operations after the LOCK operation will appear to happen after the LOCK
370 operation with respect to the other components of the system. 370 operation with respect to the other components of the system.
371 371
372 Memory operations that occur before a LOCK operation may appear to happen 372 Memory operations that occur before a LOCK operation may appear to happen
373 after it completes. 373 after it completes.
374 374
375 A LOCK operation should almost always be paired with an UNLOCK operation. 375 A LOCK operation should almost always be paired with an UNLOCK operation.
376 376
377 377
378 (6) UNLOCK operations. 378 (6) UNLOCK operations.
379 379
380 This also acts as a one-way permeable barrier. It guarantees that all 380 This also acts as a one-way permeable barrier. It guarantees that all
381 memory operations before the UNLOCK operation will appear to happen before 381 memory operations before the UNLOCK operation will appear to happen before
382 the UNLOCK operation with respect to the other components of the system. 382 the UNLOCK operation with respect to the other components of the system.
383 383
384 Memory operations that occur after an UNLOCK operation may appear to 384 Memory operations that occur after an UNLOCK operation may appear to
385 happen before it completes. 385 happen before it completes.
386 386
387 LOCK and UNLOCK operations are guaranteed to appear with respect to each 387 LOCK and UNLOCK operations are guaranteed to appear with respect to each
388 other strictly in the order specified. 388 other strictly in the order specified.
389 389
390 The use of LOCK and UNLOCK operations generally precludes the need for 390 The use of LOCK and UNLOCK operations generally precludes the need for
391 other sorts of memory barrier (but note the exceptions mentioned in the 391 other sorts of memory barrier (but note the exceptions mentioned in the
392 subsection "MMIO write barrier"). 392 subsection "MMIO write barrier").
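The one-way nature of LOCK and UNLOCK has a close analogue in C11 atomics, where acquire and release orderings play the same roles. The sketch below is a userspace toy spinlock, not the kernel's spinlock implementation, and every name in it is invented for the example.

```c
#include <stdatomic.h>

/* Toy lock for illustration; initialise with { ATOMIC_FLAG_INIT }. */
struct toy_lock {
	atomic_flag held;
};

void toy_lock_acquire(struct toy_lock *l)
{
	/*
	 * LOCK analogue: acquire ordering. Accesses after this point
	 * cannot appear to happen before it; earlier ones still may
	 * appear to happen after it.
	 */
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

void toy_lock_release(struct toy_lock *l)
{
	/*
	 * UNLOCK analogue: release ordering. Accesses before this
	 * point cannot appear to happen after it.
	 */
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Increment a counter inside the critical section. */
int toy_counter_increment(struct toy_lock *l, int *counter)
{
	int value;

	toy_lock_acquire(l);
	value = ++*counter;
	toy_lock_release(l);
	return value;
}
```

Because the critical section is bracketed by an acquire and a release, the plain increment needs no further barrier in this model.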
393 393
394 394
395 Memory barriers are only required where there's a possibility of interaction 395 Memory barriers are only required where there's a possibility of interaction
396 between two CPUs or between a CPU and a device. If it can be guaranteed that 396 between two CPUs or between a CPU and a device. If it can be guaranteed that
397 there won't be any such interaction in any particular piece of code, then 397 there won't be any such interaction in any particular piece of code, then
398 memory barriers are unnecessary in that piece of code. 398 memory barriers are unnecessary in that piece of code.
399 399
400 400
401 Note that these are the _minimum_ guarantees. Different architectures may give 401 Note that these are the _minimum_ guarantees. Different architectures may give
402 more substantial guarantees, but they may _not_ be relied upon outside of arch 402 more substantial guarantees, but they may _not_ be relied upon outside of arch
403 specific code. 403 specific code.
404 404
405 405
406 WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS? 406 WHAT MAY NOT BE ASSUMED ABOUT MEMORY BARRIERS?
407 ---------------------------------------------- 407 ----------------------------------------------
408 408
409 There are certain things that the Linux kernel memory barriers do not guarantee: 409 There are certain things that the Linux kernel memory barriers do not guarantee:
410 410
411 (*) There is no guarantee that any of the memory accesses specified before a 411 (*) There is no guarantee that any of the memory accesses specified before a
412 memory barrier will be _complete_ by the completion of a memory barrier 412 memory barrier will be _complete_ by the completion of a memory barrier
413 instruction; the barrier can be considered to draw a line in that CPU's 413 instruction; the barrier can be considered to draw a line in that CPU's
414 access queue that accesses of the appropriate type may not cross. 414 access queue that accesses of the appropriate type may not cross.
415 415
416 (*) There is no guarantee that issuing a memory barrier on one CPU will have 416 (*) There is no guarantee that issuing a memory barrier on one CPU will have
417 any direct effect on another CPU or any other hardware in the system. The 417 any direct effect on another CPU or any other hardware in the system. The
418 indirect effect will be the order in which the second CPU sees the effects 418 indirect effect will be the order in which the second CPU sees the effects
419 of the first CPU's accesses, but see the next point: 419 of the first CPU's accesses, but see the next point:
420 420
421 (*) There is no guarantee that a CPU will see the correct order of effects 421 (*) There is no guarantee that a CPU will see the correct order of effects
422 from a second CPU's accesses, even _if_ the second CPU uses a memory 422 from a second CPU's accesses, even _if_ the second CPU uses a memory
423 barrier, unless the first CPU _also_ uses a matching memory barrier (see 423 barrier, unless the first CPU _also_ uses a matching memory barrier (see
424 the subsection on "SMP Barrier Pairing"). 424 the subsection on "SMP Barrier Pairing").
425 425
426 (*) There is no guarantee that some intervening piece of off-the-CPU 426 (*) There is no guarantee that some intervening piece of off-the-CPU
427 hardware[*] will not reorder the memory accesses. CPU cache coherency 427 hardware[*] will not reorder the memory accesses. CPU cache coherency
428 mechanisms should propagate the indirect effects of a memory barrier 428 mechanisms should propagate the indirect effects of a memory barrier
429 between CPUs, but might not do so in order. 429 between CPUs, but might not do so in order.
430 430
431 [*] For information on bus mastering DMA and coherency please read: 431 [*] For information on bus mastering DMA and coherency please read:
432 432
433 Documentation/pci.txt 433 Documentation/pci.txt
434 Documentation/DMA-mapping.txt 434 Documentation/DMA-mapping.txt
435 Documentation/DMA-API.txt 435 Documentation/DMA-API.txt
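The pairing requirement in the third point above can be sketched with C11 fences standing in for the kernel's barriers. This is an analogy under stated assumptions: a C11 release fence is somewhat stronger than a write barrier, an acquire fence is somewhat stronger than a read barrier, and the names publish() and try_consume() are hypothetical.

```c
#include <stdatomic.h>

static int payload;		/* plain data being handed over */
static atomic_int ready;	/* flag announcing the payload */

/* Writer: data store, write-barrier analogue, then flag store. */
void publish(int value)
{
	payload = value;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/*
 * Reader: flag load, read-barrier analogue, then data load.
 * Returns 1 and fills *out if the payload was published, else 0.
 */
int try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire);
	*out = payload;
	return 1;
}
```

Dropping either fence breaks the guarantee: the writer's fence alone does not stop the reader from observing the flag and the payload out of order.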
436 436
437 437
438 DATA DEPENDENCY BARRIERS 438 DATA DEPENDENCY BARRIERS
439 ------------------------ 439 ------------------------
440 440
441 The usage requirements of data dependency barriers are a little subtle, and 441 The usage requirements of data dependency barriers are a little subtle, and
442 it's not always obvious that they're needed. To illustrate, consider the 442 it's not always obvious that they're needed. To illustrate, consider the
443 following sequence of events: 443 following sequence of events:
444 444
445 CPU 1 CPU 2 445 CPU 1 CPU 2
446 =============== =============== 446 =============== ===============
447 { A == 1, B == 2, C == 3, P == &A, Q == &C } 447 { A == 1, B == 2, C == 3, P == &A, Q == &C }
448 B = 4; 448 B = 4;
449 <write barrier> 449 <write barrier>
450 P = &B 450 P = &B
451 Q = P; 451 Q = P;
452 D = *Q; 452 D = *Q;
453 453
454 There's a clear data dependency here, and it would seem that by the end of the 454 There's a clear data dependency here, and it would seem that by the end of the
455 sequence, Q must be either &A or &B, and that: 455 sequence, Q must be either &A or &B, and that:
456 456
457 (Q == &A) implies (D == 1) 457 (Q == &A) implies (D == 1)
458 (Q == &B) implies (D == 4) 458 (Q == &B) implies (D == 4)
459 459
460 But! CPU 2's perception of P may be updated _before_ its perception of B, thus 460 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
461 leading to the following situation: 461 leading to the following situation:
462 462
463 (Q == &B) and (D == 2) ???? 463 (Q == &B) and (D == 2) ????
464 464
465 Whilst this may seem like a failure of coherency or causality maintenance, it 465 Whilst this may seem like a failure of coherency or causality maintenance, it
466 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC 466 isn't, and this behaviour can be observed on certain real CPUs (such as the DEC
467 Alpha). 467 Alpha).
468 468
469 To deal with this, a data dependency barrier or better must be inserted 469 To deal with this, a data dependency barrier or better must be inserted
470 between the address load and the data load: 470 between the address load and the data load:
471 471
472 CPU 1 CPU 2 472 CPU 1 CPU 2
473 =============== =============== 473 =============== ===============
474 { A == 1, B == 2, C == 3, P == &A, Q == &C } 474 { A == 1, B == 2, C == 3, P == &A, Q == &C }
475 B = 4; 475 B = 4;
476 <write barrier> 476 <write barrier>
477 P = &B 477 P = &B
478 Q = P; 478 Q = P;
479 <data dependency barrier> 479 <data dependency barrier>
480 D = *Q; 480 D = *Q;
481 481
482 This enforces the occurrence of one of the two implications, and prevents the 482 This enforces the occurrence of one of the two implications, and prevents the
483 third possibility from arising. 483 third possibility from arising.
484 484
485 [!] Note that this extremely counterintuitive situation arises most easily on 485 [!] Note that this extremely counterintuitive situation arises most easily on
486 machines with split caches, so that, for example, one cache bank processes 486 machines with split caches, so that, for example, one cache bank processes
487 even-numbered cache lines and the other bank processes odd-numbered cache 487 even-numbered cache lines and the other bank processes odd-numbered cache
488 lines. The pointer P might be stored in an odd-numbered cache line, and the 488 lines. The pointer P might be stored in an odd-numbered cache line, and the
489 variable B might be stored in an even-numbered cache line. Then, if the 489 variable B might be stored in an even-numbered cache line. Then, if the
490 even-numbered bank of the reading CPU's cache is extremely busy while the 490 even-numbered bank of the reading CPU's cache is extremely busy while the
491 odd-numbered bank is idle, one can see the new value of the pointer P (&B), 491 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
492 but the old value of the variable B (2). 492 but the old value of the variable B (2).
493 493
494 494
495 Another example of where data dependency barriers might be required is where a 495 Another example of where data dependency barriers might be required is where a
496 number is read from memory and then used to calculate the index for an array 496 number is read from memory and then used to calculate the index for an array
497 access: 497 access:
498 498
499 CPU 1 CPU 2 499 CPU 1 CPU 2
500 =============== =============== 500 =============== ===============
501 { M[0] == 1, M[1] == 2, M[3] == 3, P == 0, Q == 3 } 501 { M[0] == 1, M[1] == 2, M[3] == 3, P == 0, Q == 3 }
502 M[1] = 4; 502 M[1] = 4;
503 <write barrier> 503 <write barrier>
504 P = 1 504 P = 1
505 Q = P; 505 Q = P;
506 <data dependency barrier> 506 <data dependency barrier>
507 D = M[Q]; 507 D = M[Q];
508 508
509 509
510 The data dependency barrier is very important to the RCU system, for example. 510 The data dependency barrier is very important to the RCU system, for example.
511 See rcu_dereference() in include/linux/rcupdate.h. This permits the current 511 See rcu_dereference() in include/linux/rcupdate.h. This permits the current
512 target of an RCU'd pointer to be replaced with a new modified target, without 512 target of an RCU'd pointer to be replaced with a new modified target, without
513 the replacement target appearing to be incompletely initialised. 513 the replacement target appearing to be incompletely initialised.
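
The pointer publication pattern above can be sketched in userspace C11.  This
is only an analogy and not kernel code: the release store stands in for the
write barrier followed by P = &B, and the consume load stands in for Q = P
followed by the data dependency barrier (compilers commonly treat consume as
acquire).  The function names cpu1_publish() and cpu2_read() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static int A = 1;
static int B = 2;
static _Atomic(int *) P = &A;	/* the published pointer */

/* CPU 1: update the target first, then publish its address. */
void cpu1_publish(void)
{
	B = 4;
	/* Stands in for: <write barrier>; P = &B */
	atomic_store_explicit(&P, &B, memory_order_release);
}

/* CPU 2: load the pointer, then load the data it points to. */
int cpu2_read(void)
{
	/* Stands in for: Q = P; <data dependency barrier> */
	int *q = atomic_load_explicit(&P, memory_order_consume);

	assert(q != NULL);
	return *q;		/* D = *Q */
}
```

Called before cpu1_publish(), cpu2_read() sees A; called after it, the
reader is expected to see the updated B rather than the stale value.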
514 514
515 See also the subsection on "Cache Coherency" for a more thorough example. 515 See also the subsection on "Cache Coherency" for a more thorough example.
516 516
517 517
518 CONTROL DEPENDENCIES 518 CONTROL DEPENDENCIES
519 -------------------- 519 --------------------
520 520
521 A control dependency requires a full read memory barrier, not simply a data 521 A control dependency requires a full read memory barrier, not simply a data
522 dependency barrier to make it work correctly. Consider the following bit of 522 dependency barrier to make it work correctly. Consider the following bit of
523 code: 523 code:
524 524
525 q = &a; 525 q = &a;
526 if (p) 526 if (p)
527 q = &b; 527 q = &b;
528 <data dependency barrier> 528 <data dependency barrier>
529 x = *q; 529 x = *q;
530 530
531 This will not have the desired effect because there is no actual data 531 This will not have the desired effect because there is no actual data
532 dependency, but rather a control dependency that the CPU may short-circuit by 532 dependency, but rather a control dependency that the CPU may short-circuit by
533 attempting to predict the outcome in advance. In such a case what's actually 533 attempting to predict the outcome in advance. In such a case what's actually
534 required is: 534 required is:
535 535
536 q = &a; 536 q = &a;
537 if (p) 537 if (p)
538 q = &b; 538 q = &b;
539 <read barrier> 539 <read barrier>
540 x = *q; 540 x = *q;
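
As a rough userspace sketch of the corrected sequence, assuming C11 atomics
as a stand-in for the kernel primitives: the acquire fence plays the role of
the read barrier between the test of p and the load through q.  The names
set_flag() and read_selected() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int p;		/* flag, possibly written by another CPU */
static atomic_int a = 1;
static atomic_int b = 2;

/* Hypothetical helper: lets another thread (or a test) set the flag. */
void set_flag(int value)
{
	atomic_store_explicit(&p, value, memory_order_relaxed);
}

int read_selected(void)
{
	atomic_int *q = &a;

	if (atomic_load_explicit(&p, memory_order_relaxed))
		q = &b;
	/* Stands in for <read barrier>: orders the load of p before *q. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(q, memory_order_relaxed);	/* x = *q */
}
```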
541 541
542 542
543 SMP BARRIER PAIRING 543 SMP BARRIER PAIRING
544 ------------------- 544 -------------------
545 545
546 When dealing with CPU-CPU interactions, certain types of memory barrier should 546 When dealing with CPU-CPU interactions, certain types of memory barrier should
547 always be paired. A lack of appropriate pairing is almost certainly an error. 547 always be paired. A lack of appropriate pairing is almost certainly an error.
548 548
549 A write barrier should always be paired with a data dependency barrier or read 549 A write barrier should always be paired with a data dependency barrier or read
550 barrier, though a general barrier would also be viable. Similarly a read 550 barrier, though a general barrier would also be viable. Similarly a read
551 barrier or a data dependency barrier should always be paired with at least a 551 barrier or a data dependency barrier should always be paired with at least a
552 write barrier, though, again, a general barrier is viable: 552 write barrier, though, again, a general barrier is viable:
553 553
554 CPU 1 CPU 2 554 CPU 1 CPU 2
555 =============== =============== 555 =============== ===============
556 a = 1; 556 a = 1;
557 <write barrier> 557 <write barrier>
558 b = 2; x = b; 558 b = 2; x = b;
559 <read barrier> 559 <read barrier>
560 y = a; 560 y = a;
561 561
562 Or: 562 Or:
563 563
564 CPU 1 CPU 2 564 CPU 1 CPU 2
565 =============== =============================== 565 =============== ===============================
566 a = 1; 566 a = 1;
567 <write barrier> 567 <write barrier>
568 b = &a; x = b; 568 b = &a; x = b;
569 <data dependency barrier> 569 <data dependency barrier>
570 y = *x; 570 y = *x;
571 571
572 Basically, the read barrier always has to be there, even though it can be of 572 Basically, the read barrier always has to be there, even though it can be of
573 the "weaker" type. 573 the "weaker" type.
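
The first pairing can be sketched in userspace C11, on the assumption that a
release fence may stand in for the write barrier and an acquire fence for the
read barrier (both are somewhat stronger than the kernel barriers they
imitate).  The names cpu1_store() and cpu2_load() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int a, b;		/* both start at 0 */

void cpu1_store(void)
{
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* <write barrier> */
	atomic_store_explicit(&b, 2, memory_order_relaxed);
}

/* Returns false only if x == 2 was seen without y == 1. */
bool cpu2_load(int *x, int *y)
{
	*x = atomic_load_explicit(&b, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);	/* <read barrier> */
	*y = atomic_load_explicit(&a, memory_order_relaxed);
	return *x != 2 || *y == 1;
}
```

With both fences present, cpu2_load() should never return false; dropping
either fence removes that guarantee on weakly ordered machines.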
574 574
575 [!] Note that the stores before the write barrier would normally be expected to 575 [!] Note that the stores before the write barrier would normally be expected to
576 match the loads after the read barrier or data dependency barrier, and vice 576 match the loads after the read barrier or data dependency barrier, and vice
577 versa: 577 versa:
578 578
579 CPU 1 CPU 2 579 CPU 1 CPU 2
580 =============== =============== 580 =============== ===============
581 a = 1; }---- --->{ v = c 581 a = 1; }---- --->{ v = c
582 b = 2; } \ / { w = d 582 b = 2; } \ / { w = d
583 <write barrier> \ <read barrier> 583 <write barrier> \ <read barrier>
584 c = 3; } / \ { x = a; 584 c = 3; } / \ { x = a;
585 d = 4; }---- --->{ y = b; 585 d = 4; }---- --->{ y = b;
586 586
587 587
588 EXAMPLES OF MEMORY BARRIER SEQUENCES 588 EXAMPLES OF MEMORY BARRIER SEQUENCES
589 ------------------------------------ 589 ------------------------------------
590 590
591 Firstly, write barriers act as partial orderings on store operations. 591 Firstly, write barriers act as partial orderings on store operations.
592 Consider the following sequence of events: 592 Consider the following sequence of events:
593 593
594 CPU 1 594 CPU 1
595 ======================= 595 =======================
596 STORE A = 1 596 STORE A = 1
597 STORE B = 2 597 STORE B = 2
598 STORE C = 3 598 STORE C = 3
599 <write barrier> 599 <write barrier>
600 STORE D = 4 600 STORE D = 4
601 STORE E = 5 601 STORE E = 5
602 602
603 This sequence of events is committed to the memory coherence system in an order 603 This sequence of events is committed to the memory coherence system in an order
604 that the rest of the system might perceive as the unordered set of { STORE A, 604 that the rest of the system might perceive as the unordered set of { STORE A,
605 STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E 605 STORE B, STORE C } all occurring before the unordered set of { STORE D, STORE E
606 }: 606 }:
607 607
608 +-------+ : : 608 +-------+ : :
609 | | +------+ 609 | | +------+
610 | |------>| C=3 | } /\ 610 | |------>| C=3 | } /\
611 | | : +------+ }----- \ -----> Events perceptible 611 | | : +------+ }----- \ -----> Events perceptible
612 | | : | A=1 | } \/ to rest of system 612 | | : | A=1 | } \/ to rest of system
613 | | : +------+ } 613 | | : +------+ }
614 | CPU 1 | : | B=2 | } 614 | CPU 1 | : | B=2 | }
615 | | +------+ } 615 | | +------+ }
616 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier 616 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
617 | | +------+ } requires all stores prior to the 617 | | +------+ } requires all stores prior to the
618 | | : | E=5 | } barrier to be committed before 618 | | : | E=5 | } barrier to be committed before
619 | | : +------+ } further stores may take place. 619 | | : +------+ } further stores may take place.
620 | |------>| D=4 | } 620 | |------>| D=4 | }
621 | | +------+ 621 | | +------+
622 +-------+ : : 622 +-------+ : :
623 | 623 |
624 | Sequence in which stores are committed to the 624 | Sequence in which stores are committed to the
625 | memory system by CPU 1 625 | memory system by CPU 1
626 V 626 V
627 627
628 628
629 Secondly, data dependency barriers act as partial orderings on data-dependent 629 Secondly, data dependency barriers act as partial orderings on data-dependent
630 loads. Consider the following sequence of events: 630 loads. Consider the following sequence of events:
631 631
632 CPU 1 CPU 2 632 CPU 1 CPU 2
633 ======================= ======================= 633 ======================= =======================
634 { B = 7; X = 9; Y = 8; C = &Y } 634 { B = 7; X = 9; Y = 8; C = &Y }
635 STORE A = 1 635 STORE A = 1
636 STORE B = 2 636 STORE B = 2
637 <write barrier> 637 <write barrier>
638 STORE C = &B LOAD X 638 STORE C = &B LOAD X
639 STORE D = 4 LOAD C (gets &B) 639 STORE D = 4 LOAD C (gets &B)
640 LOAD *C (reads B) 640 LOAD *C (reads B)
641 641
642 Without intervention, CPU 2 may perceive the events on CPU 1 in some 642 Without intervention, CPU 2 may perceive the events on CPU 1 in some
643 effectively random order, despite the write barrier issued by CPU 1: 643 effectively random order, despite the write barrier issued by CPU 1:
644 644
645 +-------+ : : : : 645 +-------+ : : : :
646 | | +------+ +-------+ | Sequence of update 646 | | +------+ +-------+ | Sequence of update
647 | |------>| B=2 |----- --->| Y->8 | | of perception on 647 | |------>| B=2 |----- --->| Y->8 | | of perception on
648 | | : +------+ \ +-------+ | CPU 2 648 | | : +------+ \ +-------+ | CPU 2
649 | CPU 1 | : | A=1 | \ --->| C->&Y | V 649 | CPU 1 | : | A=1 | \ --->| C->&Y | V
650 | | +------+ | +-------+ 650 | | +------+ | +-------+
651 | | wwwwwwwwwwwwwwww | : : 651 | | wwwwwwwwwwwwwwww | : :
652 | | +------+ | : : 652 | | +------+ | : :
653 | | : | C=&B |--- | : : +-------+ 653 | | : | C=&B |--- | : : +-------+
654 | | : +------+ \ | +-------+ | | 654 | | : +------+ \ | +-------+ | |
655 | |------>| D=4 | ----------->| C->&B |------>| | 655 | |------>| D=4 | ----------->| C->&B |------>| |
656 | | +------+ | +-------+ | | 656 | | +------+ | +-------+ | |
657 +-------+ : : | : : | | 657 +-------+ : : | : : | |
658 | : : | | 658 | : : | |
659 | : : | CPU 2 | 659 | : : | CPU 2 |
660 | +-------+ | | 660 | +-------+ | |
661 Apparently incorrect ---> | | B->7 |------>| | 661 Apparently incorrect ---> | | B->7 |------>| |
662 perception of B (!) | +-------+ | | 662 perception of B (!) | +-------+ | |
663 | : : | | 663 | : : | |
664 | +-------+ | | 664 | +-------+ | |
665 The load of X holds ---> \ | X->9 |------>| | 665 The load of X holds ---> \ | X->9 |------>| |
666 up the maintenance \ +-------+ | | 666 up the maintenance \ +-------+ | |
667 of coherence of B ----->| B->2 | +-------+ 667 of coherence of B ----->| B->2 | +-------+
668 +-------+ 668 +-------+
669 : : 669 : :
670 670
671 671
672 In the above example, CPU 2 perceives that B is 7, despite the load of *C 672 In the above example, CPU 2 perceives that B is 7, despite the load of *C
673 (which would be B) coming after the the LOAD of C. 673 (which would be B) coming after the LOAD of C.
674 674
675 If, however, a data dependency barrier were to be placed between the load of C 675 If, however, a data dependency barrier were to be placed between the load of C
676 and the load of *C (ie: B) on CPU 2: 676 and the load of *C (ie: B) on CPU 2:
677 677
678 CPU 1 CPU 2 678 CPU 1 CPU 2
679 ======================= ======================= 679 ======================= =======================
680 { B = 7; X = 9; Y = 8; C = &Y } 680 { B = 7; X = 9; Y = 8; C = &Y }
681 STORE A = 1 681 STORE A = 1
682 STORE B = 2 682 STORE B = 2
683 <write barrier> 683 <write barrier>
684 STORE C = &B LOAD X 684 STORE C = &B LOAD X
685 STORE D = 4 LOAD C (gets &B) 685 STORE D = 4 LOAD C (gets &B)
686 <data dependency barrier> 686 <data dependency barrier>
687 LOAD *C (reads B) 687 LOAD *C (reads B)
688 688
689 then the following will occur: 689 then the following will occur:
690 690
691 +-------+ : : : : 691 +-------+ : : : :
692 | | +------+ +-------+ 692 | | +------+ +-------+
693 | |------>| B=2 |----- --->| Y->8 | 693 | |------>| B=2 |----- --->| Y->8 |
694 | | : +------+ \ +-------+ 694 | | : +------+ \ +-------+
695 | CPU 1 | : | A=1 | \ --->| C->&Y | 695 | CPU 1 | : | A=1 | \ --->| C->&Y |
696 | | +------+ | +-------+ 696 | | +------+ | +-------+
697 | | wwwwwwwwwwwwwwww | : : 697 | | wwwwwwwwwwwwwwww | : :
698 | | +------+ | : : 698 | | +------+ | : :
699 | | : | C=&B |--- | : : +-------+ 699 | | : | C=&B |--- | : : +-------+
700 | | : +------+ \ | +-------+ | | 700 | | : +------+ \ | +-------+ | |
701 | |------>| D=4 | ----------->| C->&B |------>| | 701 | |------>| D=4 | ----------->| C->&B |------>| |
702 | | +------+ | +-------+ | | 702 | | +------+ | +-------+ | |
703 +-------+ : : | : : | | 703 +-------+ : : | : : | |
704 | : : | | 704 | : : | |
705 | : : | CPU 2 | 705 | : : | CPU 2 |
706 | +-------+ | | 706 | +-------+ | |
707 | | X->9 |------>| | 707 | | X->9 |------>| |
708 | +-------+ | | 708 | +-------+ | |
709 Makes sure all effects ---> \ ddddddddddddddddd | | 709 Makes sure all effects ---> \ ddddddddddddddddd | |
710 prior to the store of C \ +-------+ | | 710 prior to the store of C \ +-------+ | |
711 are perceptible to ----->| B->2 |------>| | 711 are perceptible to ----->| B->2 |------>| |
712 subsequent loads +-------+ | | 712 subsequent loads +-------+ | |
713 : : +-------+ 713 : : +-------+
714 714
715 715
716 And thirdly, a read barrier acts as a partial order on loads. Consider the 716 And thirdly, a read barrier acts as a partial order on loads. Consider the
717 following sequence of events: 717 following sequence of events:
718 718
719 CPU 1 CPU 2 719 CPU 1 CPU 2
720 ======================= ======================= 720 ======================= =======================
721 { A = 0, B = 9 } 721 { A = 0, B = 9 }
722 STORE A=1 722 STORE A=1
723 <write barrier> 723 <write barrier>
724 STORE B=2 724 STORE B=2
725 LOAD B 725 LOAD B
726 LOAD A 726 LOAD A
727 727
728 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in 728 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
729 some effectively random order, despite the write barrier issued by CPU 1: 729 some effectively random order, despite the write barrier issued by CPU 1:
730 730
731 +-------+ : : : : 731 +-------+ : : : :
732 | | +------+ +-------+ 732 | | +------+ +-------+
733 | |------>| A=1 |------ --->| A->0 | 733 | |------>| A=1 |------ --->| A->0 |
734 | | +------+ \ +-------+ 734 | | +------+ \ +-------+
735 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 735 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
736 | | +------+ | +-------+ 736 | | +------+ | +-------+
737 | |------>| B=2 |--- | : : 737 | |------>| B=2 |--- | : :
738 | | +------+ \ | : : +-------+ 738 | | +------+ \ | : : +-------+
739 +-------+ : : \ | +-------+ | | 739 +-------+ : : \ | +-------+ | |
740 ---------->| B->2 |------>| | 740 ---------->| B->2 |------>| |
741 | +-------+ | CPU 2 | 741 | +-------+ | CPU 2 |
742 | | A->0 |------>| | 742 | | A->0 |------>| |
743 | +-------+ | | 743 | +-------+ | |
744 | : : +-------+ 744 | : : +-------+
745 \ : : 745 \ : :
746 \ +-------+ 746 \ +-------+
747 ---->| A->1 | 747 ---->| A->1 |
748 +-------+ 748 +-------+
749 : : 749 : :
750 750
751 751
752 If, however, a read barrier were to be placed between the load of B and the 752 If, however, a read barrier were to be placed between the load of B and the
753 load of A on CPU 2: 753 load of A on CPU 2:
754 754
755 CPU 1 CPU 2 755 CPU 1 CPU 2
756 ======================= ======================= 756 ======================= =======================
757 { A = 0, B = 9 } 757 { A = 0, B = 9 }
758 STORE A=1 758 STORE A=1
759 <write barrier> 759 <write barrier>
760 STORE B=2 760 STORE B=2
761 LOAD B 761 LOAD B
762 <read barrier> 762 <read barrier>
763 LOAD A 763 LOAD A
764 764
765 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU 765 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
766 2: 766 2:
767 767
768 +-------+ : : : : 768 +-------+ : : : :
769 | | +------+ +-------+ 769 | | +------+ +-------+
770 | |------>| A=1 |------ --->| A->0 | 770 | |------>| A=1 |------ --->| A->0 |
771 | | +------+ \ +-------+ 771 | | +------+ \ +-------+
772 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 772 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
773 | | +------+ | +-------+ 773 | | +------+ | +-------+
774 | |------>| B=2 |--- | : : 774 | |------>| B=2 |--- | : :
775 | | +------+ \ | : : +-------+ 775 | | +------+ \ | : : +-------+
776 +-------+ : : \ | +-------+ | | 776 +-------+ : : \ | +-------+ | |
777 ---------->| B->2 |------>| | 777 ---------->| B->2 |------>| |
778 | +-------+ | CPU 2 | 778 | +-------+ | CPU 2 |
779 | : : | | 779 | : : | |
780 | : : | | 780 | : : | |
781 At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 781 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
782 barrier causes all effects \ +-------+ | | 782 barrier causes all effects \ +-------+ | |
783 prior to the storage of B ---->| A->1 |------>| | 783 prior to the storage of B ---->| A->1 |------>| |
784 to be perceptible to CPU 2 +-------+ | | 784 to be perceptible to CPU 2 +-------+ | |
785 : : +-------+ 785 : : +-------+
786 786
787 787
788 To illustrate this more completely, consider what could happen if the code 788 To illustrate this more completely, consider what could happen if the code
789 contained a load of A either side of the read barrier: 789 contained a load of A either side of the read barrier:
790 790
791 CPU 1 CPU 2 791 CPU 1 CPU 2
792 ======================= ======================= 792 ======================= =======================
793 { A = 0, B = 9 } 793 { A = 0, B = 9 }
794 STORE A=1 794 STORE A=1
795 <write barrier> 795 <write barrier>
796 STORE B=2 796 STORE B=2
797 LOAD B 797 LOAD B
798 LOAD A [first load of A] 798 LOAD A [first load of A]
799 <read barrier> 799 <read barrier>
800 LOAD A [second load of A] 800 LOAD A [second load of A]
801 801
802 Even though the two loads of A both occur after the load of B, they may 802 Even though the two loads of A both occur after the load of B, they may
803 come up with different values: 803 come up with different values:
804 804
805 +-------+ : : : : 805 +-------+ : : : :
806 | | +------+ +-------+ 806 | | +------+ +-------+
807 | |------>| A=1 |------ --->| A->0 | 807 | |------>| A=1 |------ --->| A->0 |
808 | | +------+ \ +-------+ 808 | | +------+ \ +-------+
809 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 809 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
810 | | +------+ | +-------+ 810 | | +------+ | +-------+
811 | |------>| B=2 |--- | : : 811 | |------>| B=2 |--- | : :
812 | | +------+ \ | : : +-------+ 812 | | +------+ \ | : : +-------+
813 +-------+ : : \ | +-------+ | | 813 +-------+ : : \ | +-------+ | |
814 ---------->| B->2 |------>| | 814 ---------->| B->2 |------>| |
815 | +-------+ | CPU 2 | 815 | +-------+ | CPU 2 |
816 | : : | | 816 | : : | |
817 | : : | | 817 | : : | |
818 | +-------+ | | 818 | +-------+ | |
819 | | A->0 |------>| 1st | 819 | | A->0 |------>| 1st |
820 | +-------+ | | 820 | +-------+ | |
821 At this point the read ----> \ rrrrrrrrrrrrrrrrr | | 821 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
822 barrier causes all effects \ +-------+ | | 822 barrier causes all effects \ +-------+ | |
823 prior to the storage of B ---->| A->1 |------>| 2nd | 823 prior to the storage of B ---->| A->1 |------>| 2nd |
824 to be perceptible to CPU 2 +-------+ | | 824 to be perceptible to CPU 2 +-------+ | |
825 : : +-------+ 825 : : +-------+
826 826
827 827
828 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2 828 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
829 before the read barrier completes anyway: 829 before the read barrier completes anyway:
830 830
831 +-------+ : : : : 831 +-------+ : : : :
832 | | +------+ +-------+ 832 | | +------+ +-------+
833 | |------>| A=1 |------ --->| A->0 | 833 | |------>| A=1 |------ --->| A->0 |
834 | | +------+ \ +-------+ 834 | | +------+ \ +-------+
835 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | 835 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
836 | | +------+ | +-------+ 836 | | +------+ | +-------+
837 | |------>| B=2 |--- | : : 837 | |------>| B=2 |--- | : :
838 | | +------+ \ | : : +-------+ 838 | | +------+ \ | : : +-------+
839 +-------+ : : \ | +-------+ | | 839 +-------+ : : \ | +-------+ | |
840 ---------->| B->2 |------>| | 840 ---------->| B->2 |------>| |
841 | +-------+ | CPU 2 | 841 | +-------+ | CPU 2 |
842 | : : | | 842 | : : | |
843 \ : : | | 843 \ : : | |
844 \ +-------+ | | 844 \ +-------+ | |
845 ---->| A->1 |------>| 1st | 845 ---->| A->1 |------>| 1st |
846 +-------+ | | 846 +-------+ | |
847 rrrrrrrrrrrrrrrrr | | 847 rrrrrrrrrrrrrrrrr | |
848 +-------+ | | 848 +-------+ | |
849 | A->1 |------>| 2nd | 849 | A->1 |------>| 2nd |
850 +-------+ | | 850 +-------+ | |
851 : : +-------+ 851 : : +-------+
852 852
853 853
854 The guarantee is that the second load will always come up with A == 1 if the 854 The guarantee is that the second load will always come up with A == 1 if the
855 load of B came up with B == 2. No such guarantee exists for the first load of 855 load of B came up with B == 2. No such guarantee exists for the first load of
856 A; that may come up with either A == 0 or A == 1. 856 A; that may come up with either A == 0 or A == 1.
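
The guarantee can be written down as a small predicate over the values CPU 2
observed.  This is only an illustration of the rule stated above, not a model
of any particular CPU; the function name outcome_allowed() is hypothetical
and it encodes nothing beyond that one rule.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * b:  value returned by LOAD B
 * a1: value returned by the first LOAD A (before the read barrier)
 * a2: value returned by the second LOAD A (after the read barrier)
 */
bool outcome_allowed(int b, int a1, int a2)
{
	(void)a1;	/* no guarantee: the first load may see 0 or 1 */

	if (b == 2)
		return a2 == 1;	/* the second load must see the new A */
	return true;		/* B == 9: nothing is guaranteed */
}
```

So (B == 2, first A == 0, second A == 1) is a permitted outcome, whereas
(B == 2, second A == 0) is not.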
857 857
858 858
859 READ MEMORY BARRIERS VS LOAD SPECULATION 859 READ MEMORY BARRIERS VS LOAD SPECULATION
860 ---------------------------------------- 860 ----------------------------------------
861 861
862 Many CPUs speculate with loads: that is, they see that they will need to load an 862 Many CPUs speculate with loads: that is, they see that they will need to load an
863 item from memory, and they find a time where they're not using the bus for any 863 item from memory, and they find a time where they're not using the bus for any
864 other loads, and so do the load in advance - even though they haven't actually 864 other loads, and so do the load in advance - even though they haven't actually
865 got to that point in the instruction execution flow yet. This permits the 865 got to that point in the instruction execution flow yet. This permits the
866 actual load instruction to potentially complete immediately because the CPU 866 actual load instruction to potentially complete immediately because the CPU
867 already has the value to hand. 867 already has the value to hand.
868 868
869 It may turn out that the CPU didn't actually need the value - perhaps because a 869 It may turn out that the CPU didn't actually need the value - perhaps because a
870 branch circumvented the load - in which case it can discard the value or just 870 branch circumvented the load - in which case it can discard the value or just
871 cache it for later use. 871 cache it for later use.
872 872
873 Consider: 873 Consider:
874 874
875 CPU 1 CPU 2 875 CPU 1 CPU 2
876 ======================= ======================= 876 ======================= =======================
877 LOAD B 877 LOAD B
878 DIVIDE } Divide instructions generally 878 DIVIDE } Divide instructions generally
879 DIVIDE } take a long time to perform 879 DIVIDE } take a long time to perform
880 LOAD A 880 LOAD A
881 881
882 Which might appear as this: 882 Which might appear as this:
883 883
884 : : +-------+ 884 : : +-------+
885 +-------+ | | 885 +-------+ | |
886 --->| B->2 |------>| | 886 --->| B->2 |------>| |
887 +-------+ | CPU 2 | 887 +-------+ | CPU 2 |
888 : :DIVIDE | | 888 : :DIVIDE | |
889 +-------+ | | 889 +-------+ | |
890 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 890 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
891 division speculates on the +-------+ ~ | | 891 division speculates on the +-------+ ~ | |
892 LOAD of A : : ~ | | 892 LOAD of A : : ~ | |
893 : :DIVIDE | | 893 : :DIVIDE | |
894 : : ~ | | 894 : : ~ | |
895 Once the divisions are complete --> : : ~-->| | 895 Once the divisions are complete --> : : ~-->| |
896 the CPU can then perform the : : | | 896 the CPU can then perform the : : | |
897 LOAD with immediate effect : : +-------+ 897 LOAD with immediate effect : : +-------+
898 898
899 899
900 Placing a read barrier or a data dependency barrier just before the second 900 Placing a read barrier or a data dependency barrier just before the second
901 load: 901 load:
902 902
903 CPU 1 CPU 2 903 CPU 1 CPU 2
904 ======================= ======================= 904 ======================= =======================
905 LOAD B 905 LOAD B
906 DIVIDE 906 DIVIDE
907 DIVIDE 907 DIVIDE
908 <read barrier> 908 <read barrier>
909 LOAD A 909 LOAD A
910 910
911 will force any value speculatively obtained to be reconsidered to an extent 911 will force any value speculatively obtained to be reconsidered to an extent
912 dependent on the type of barrier used. If there was no change made to the 912 dependent on the type of barrier used. If there was no change made to the
913 speculated memory location, then the speculated value will just be used: 913 speculated memory location, then the speculated value will just be used:
914 914
915 : : +-------+ 915 : : +-------+
916 +-------+ | | 916 +-------+ | |
917 --->| B->2 |------>| | 917 --->| B->2 |------>| |
918 +-------+ | CPU 2 | 918 +-------+ | CPU 2 |
919 : :DIVIDE | | 919 : :DIVIDE | |
920 +-------+ | | 920 +-------+ | |
921 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 921 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
922 division speculates on the +-------+ ~ | | 922 division speculates on the +-------+ ~ | |
923 LOAD of A : : ~ | | 923 LOAD of A : : ~ | |
924 : :DIVIDE | | 924 : :DIVIDE | |
925 : : ~ | | 925 : : ~ | |
926 : : ~ | | 926 : : ~ | |
927 rrrrrrrrrrrrrrrr~ | | 927 rrrrrrrrrrrrrrrr~ | |
928 : : ~ | | 928 : : ~ | |
929 : : ~-->| | 929 : : ~-->| |
930 : : | | 930 : : | |
931 : : +-------+ 931 : : +-------+
932 932
933 933
934 but if there was an update or an invalidation from another CPU pending, then 934 but if there was an update or an invalidation from another CPU pending, then
935 the speculation will be cancelled and the value reloaded: 935 the speculation will be cancelled and the value reloaded:
936 936
937 : : +-------+ 937 : : +-------+
938 +-------+ | | 938 +-------+ | |
939 --->| B->2 |------>| | 939 --->| B->2 |------>| |
940 +-------+ | CPU 2 | 940 +-------+ | CPU 2 |
941 : :DIVIDE | | 941 : :DIVIDE | |
942 +-------+ | | 942 +-------+ | |
943 The CPU being busy doing a ---> --->| A->0 |~~~~ | | 943 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
944 division speculates on the +-------+ ~ | | 944 division speculates on the +-------+ ~ | |
945 LOAD of A : : ~ | | 945 LOAD of A : : ~ | |
946 : :DIVIDE | | 946 : :DIVIDE | |
947 : : ~ | | 947 : : ~ | |
948 : : ~ | | 948 : : ~ | |
949 rrrrrrrrrrrrrrrrr | | 949 rrrrrrrrrrrrrrrrr | |
950 +-------+ | | 950 +-------+ | |
951 The speculation is discarded ---> --->| A->1 |------>| | 951 The speculation is discarded ---> --->| A->1 |------>| |
952 and an updated value is +-------+ | | 952 and an updated value is +-------+ | |
953 retrieved : : +-------+ 953 retrieved : : +-------+
954 954
955 955
956 ======================== 956 ========================
957 EXPLICIT KERNEL BARRIERS 957 EXPLICIT KERNEL BARRIERS
958 ======================== 958 ========================
959 959
960 The Linux kernel has a variety of different barriers that act at different 960 The Linux kernel has a variety of different barriers that act at different
961 levels: 961 levels:
962 962
963 (*) Compiler barrier. 963 (*) Compiler barrier.
964 964
965 (*) CPU memory barriers. 965 (*) CPU memory barriers.
966 966
967 (*) MMIO write barrier. 967 (*) MMIO write barrier.
968 968
969 969
970 COMPILER BARRIER 970 COMPILER BARRIER
971 ---------------- 971 ----------------
972 972
973 The Linux kernel has an explicit compiler barrier function that prevents the 973 The Linux kernel has an explicit compiler barrier function that prevents the
974 compiler from moving the memory accesses either side of it to the other side: 974 compiler from moving the memory accesses either side of it to the other side:
975 975
976 barrier(); 976 barrier();
977 977
978 This is a general barrier - lesser varieties of compiler barrier do not exist. 978 This is a general barrier - lesser varieties of compiler barrier do not exist.
979 979
980 The compiler barrier has no direct effect on the CPU, which may then reorder 980 The compiler barrier has no direct effect on the CPU, which may then reorder
981 things however it wishes. 981 things however it wishes.
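
As a rough illustration, a compiler-only barrier can be written in GNU C as an
empty asm statement with a "memory" clobber. The sketch below assumes GCC or
Clang; the names compiler_barrier_sketch() and store_pair() are hypothetical
and are not taken from the kernel sources.

```c
/* Hypothetical stand-in for barrier(): an empty asm statement whose "memory"
 * clobber tells the compiler that memory may have changed, so it must not
 * move or cache memory accesses across this point.  No CPU instruction is
 * emitted. */
#define compiler_barrier_sketch() __asm__ __volatile__("" ::: "memory")

int flag_a;
int flag_b;

/* Store to flag_a, then to flag_b.  The compiler may not swap the two
 * stores, though the CPU is still free to reorder them. */
int store_pair(int a, int b)
{
	flag_a = a;
	compiler_barrier_sketch();
	flag_b = b;
	return flag_a + flag_b;
}
```

The barrier only constrains the code the compiler generates, so it needs to be
paired with a CPU memory barrier whenever another CPU or a device observes the
stores.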
982 982
983 983
984 CPU MEMORY BARRIERS 984 CPU MEMORY BARRIERS
985 ------------------- 985 -------------------
986 986
987 The Linux kernel has eight basic CPU memory barriers: 987 The Linux kernel has eight basic CPU memory barriers:
988 988
989 TYPE MANDATORY SMP CONDITIONAL 989 TYPE MANDATORY SMP CONDITIONAL
990 =============== ======================= =========================== 990 =============== ======================= ===========================
991 GENERAL mb() smp_mb() 991 GENERAL mb() smp_mb()
992 WRITE wmb() smp_wmb() 992 WRITE wmb() smp_wmb()
993 READ rmb() smp_rmb() 993 READ rmb() smp_rmb()
994 DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends() 994 DATA DEPENDENCY read_barrier_depends() smp_read_barrier_depends()
995 995
996 996
997 All CPU memory barriers unconditionally imply compiler barriers. 997 All CPU memory barriers unconditionally imply compiler barriers.
998 998
999 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled 999 SMP memory barriers are reduced to compiler barriers on uniprocessor compiled
1000 systems because it is assumed that a CPU will appear to be self-consistent, 1000 systems because it is assumed that a CPU will appear to be self-consistent,
1001 and will order overlapping accesses correctly with respect to itself. 1001 and will order overlapping accesses correctly with respect to itself.
1002 1002
1003 [!] Note that SMP memory barriers _must_ be used to control the ordering of 1003 [!] Note that SMP memory barriers _must_ be used to control the ordering of
1004 references to shared memory on SMP systems, though the use of locking instead 1004 references to shared memory on SMP systems, though the use of locking instead
1005 is sufficient. 1005 is sufficient.
1006 1006
1007 Mandatory barriers should not be used to control SMP effects, since mandatory 1007 Mandatory barriers should not be used to control SMP effects, since mandatory
1008 barriers unnecessarily impose overhead on UP systems. They may, however, be 1008 barriers unnecessarily impose overhead on UP systems. They may, however, be
1009 used to control MMIO effects on accesses through relaxed memory I/O windows. 1009 used to control MMIO effects on accesses through relaxed memory I/O windows.
1010 These are required even on non-SMP systems as they affect the order in which 1010 These are required even on non-SMP systems as they affect the order in which
1011 memory operations appear to a device by prohibiting both the compiler and the 1011 memory operations appear to a device by prohibiting both the compiler and the
1012 CPU from reordering them. 1012 CPU from reordering them.
1013 1013
1014 1014
1015 There are some more advanced barrier functions: 1015 There are some more advanced barrier functions:
1016 1016
1017 (*) set_mb(var, value) 1017 (*) set_mb(var, value)
1018 1018
1019 This assigns the value to the variable and then inserts at least a write 1019 This assigns the value to the variable and then inserts at least a write
1020 barrier after it, depending on the function. It isn't guaranteed to 1020 barrier after it, depending on the function. It isn't guaranteed to
1021 insert anything more than a compiler barrier in a UP compilation. 1021 insert anything more than a compiler barrier in a UP compilation.
1022 1022
1023 1023
1024 (*) smp_mb__before_atomic_dec(); 1024 (*) smp_mb__before_atomic_dec();
1025 (*) smp_mb__after_atomic_dec(); 1025 (*) smp_mb__after_atomic_dec();
1026 (*) smp_mb__before_atomic_inc(); 1026 (*) smp_mb__before_atomic_inc();
1027 (*) smp_mb__after_atomic_inc(); 1027 (*) smp_mb__after_atomic_inc();
1028 1028
1029 These are for use with atomic add, subtract, increment and decrement 1029 These are for use with atomic add, subtract, increment and decrement
1030 functions that don't return a value, especially when used for reference 1030 functions that don't return a value, especially when used for reference
1031 counting. These functions do not imply memory barriers. 1031 counting. These functions do not imply memory barriers.
1032 1032
1033 As an example, consider a piece of code that marks an object as being dead 1033 As an example, consider a piece of code that marks an object as being dead
1034 and then decrements the object's reference count: 1034 and then decrements the object's reference count:
1035 1035
1036 obj->dead = 1; 1036 obj->dead = 1;
1037 smp_mb__before_atomic_dec(); 1037 smp_mb__before_atomic_dec();
1038 atomic_dec(&obj->ref_count); 1038 atomic_dec(&obj->ref_count);
1039 1039
1040 This makes sure that the death mark on the object is perceived to be set 1040 This makes sure that the death mark on the object is perceived to be set
1041 *before* the reference counter is decremented. 1041 *before* the reference counter is decremented.
1042 1042
1043 See Documentation/atomic_ops.txt for more information. See the "Atomic 1043 See Documentation/atomic_ops.txt for more information. See the "Atomic
1044 operations" subsection for information on where to use these. 1044 operations" subsection for information on where to use these.
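
The same pattern can be approximated in userspace with C11 atomics. This is
only an analogy under stated assumptions: struct object_sketch and
mark_dead_and_put() are hypothetical names, the seq_cst fence stands in for
smp_mb__before_atomic_dec(), and the function returns the new count purely so
that it can be checked (the kernel's atomic_dec() returns nothing).

```c
#include <stdatomic.h>

struct object_sketch {
	atomic_int dead;       /* death mark, set before the count drops */
	atomic_int ref_count;  /* reference count */
};

/* Mark the object dead, then drop one reference.  The relaxed decrement
 * orders nothing by itself; the fence in front of it is what keeps the
 * death mark from being observed after the decrement. */
int mark_dead_and_put(struct object_sketch *obj)
{
	atomic_store_explicit(&obj->dead, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_fetch_sub_explicit(&obj->ref_count, 1,
					 memory_order_relaxed) - 1;
}
```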
1045 1045
1046 1046
1047 (*) smp_mb__before_clear_bit(void); 1047 (*) smp_mb__before_clear_bit(void);
1048 (*) smp_mb__after_clear_bit(void); 1048 (*) smp_mb__after_clear_bit(void);
1049 1049
1050 These are used in a similar way to the atomic inc/dec barriers. These are 1050 These are used in a similar way to the atomic inc/dec barriers. These are
1051 typically used for bitwise unlocking operations, so care must be taken as 1051 typically used for bitwise unlocking operations, so care must be taken as
1052 there are no implicit memory barriers here either. 1052 there are no implicit memory barriers here either.
1053 1053
1054 Consider implementing an unlock operation of some nature by clearing a 1054 Consider implementing an unlock operation of some nature by clearing a
1055 locking bit. The clear_bit() would then need to be barriered like this: 1055 locking bit. The clear_bit() would then need to be barriered like this:
1056 1056
1057 smp_mb__before_clear_bit(); 1057 smp_mb__before_clear_bit();
1058 clear_bit( ... ); 1058 clear_bit( ... );
1059 1059
1060 This prevents memory operations before the clear leaking to after it. See 1060 This prevents memory operations before the clear leaking to after it. See
1061 the subsection on "Locking Functions" with reference to UNLOCK operation 1061 the subsection on "Locking Functions" with reference to UNLOCK operation
1062 implications. 1062 implications.
1063 1063
1064 See Documentation/atomic_ops.txt for more information. See the "Atomic 1064 See Documentation/atomic_ops.txt for more information. See the "Atomic
1065 operations" subsection for information on where to use these. 1065 operations" subsection for information on where to use these.
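
A userspace approximation of such a bit unlock, using C11 atomics, might look
like the sketch below. LOCK_BIT_SKETCH and bit_unlock_sketch() are
hypothetical names, and the seq_cst fence is an assumed stand-in for
smp_mb__before_clear_bit(); it is a full fence, which is at least as strong as
an unlock requires.

```c
#include <stdatomic.h>

#define LOCK_BIT_SKETCH 0x1u

/* Release a bit lock held in *word.  The fence keeps accesses made inside
 * the critical section from being reordered after the relaxed clear that
 * follows it.  Returns the value of the word before the clear. */
unsigned int bit_unlock_sketch(atomic_uint *word)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_fetch_and_explicit(word, ~LOCK_BIT_SKETCH,
					 memory_order_relaxed);
}
```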
1066 1066
1067 1067
1068 MMIO WRITE BARRIER 1068 MMIO WRITE BARRIER
1069 ------------------ 1069 ------------------
1070 1070
1071 The Linux kernel also has a special barrier for use with memory-mapped I/O 1071 The Linux kernel also has a special barrier for use with memory-mapped I/O
1072 writes: 1072 writes:
1073 1073
1074 mmiowb(); 1074 mmiowb();
1075 1075
1076 This is a variation on the mandatory write barrier that causes writes to weakly 1076 This is a variation on the mandatory write barrier that causes writes to weakly
1077 ordered I/O regions to be partially ordered. Its effects may go beyond the 1077 ordered I/O regions to be partially ordered. Its effects may go beyond the
1078 CPU->Hardware interface and actually affect the hardware at some level. 1078 CPU->Hardware interface and actually affect the hardware at some level.
1079 1079
1080 See the subsection "Locks vs I/O accesses" for more information. 1080 See the subsection "Locks vs I/O accesses" for more information.
1081 1081
1082 1082
1083 =============================== 1083 ===============================
1084 IMPLICIT KERNEL MEMORY BARRIERS 1084 IMPLICIT KERNEL MEMORY BARRIERS
1085 =============================== 1085 ===============================
1086 1086
1087 Some of the other functions in the Linux kernel imply memory barriers, amongst 1087 Some of the other functions in the Linux kernel imply memory barriers, amongst
1088 which are locking and scheduling functions. 1088 which are locking and scheduling functions.
1089 1089
1090 This specification is a _minimum_ guarantee; any particular architecture may 1090 This specification is a _minimum_ guarantee; any particular architecture may
1091 provide more substantial guarantees, but these may not be relied upon outside 1091 provide more substantial guarantees, but these may not be relied upon outside
1092 of arch specific code. 1092 of arch specific code.
1093 1093
1094 1094
1095 LOCKING FUNCTIONS 1095 LOCKING FUNCTIONS
1096 ----------------- 1096 -----------------
1097 1097
1098 The Linux kernel has a number of locking constructs: 1098 The Linux kernel has a number of locking constructs:
1099 1099
1100 (*) spin locks 1100 (*) spin locks
1101 (*) R/W spin locks 1101 (*) R/W spin locks
1102 (*) mutexes 1102 (*) mutexes
1103 (*) semaphores 1103 (*) semaphores
1104 (*) R/W semaphores 1104 (*) R/W semaphores
1105 (*) RCU 1105 (*) RCU
1106 1106
1107 In all cases there are variants on "LOCK" operations and "UNLOCK" operations 1107 In all cases there are variants on "LOCK" operations and "UNLOCK" operations
1108 for each construct. These operations all imply certain barriers: 1108 for each construct. These operations all imply certain barriers:
1109 1109
1110 (1) LOCK operation implication: 1110 (1) LOCK operation implication:
1111 1111
1112 Memory operations issued after the LOCK will be completed after the LOCK 1112 Memory operations issued after the LOCK will be completed after the LOCK
1113 operation has completed. 1113 operation has completed.
1114 1114
1115 Memory operations issued before the LOCK may be completed after the LOCK 1115 Memory operations issued before the LOCK may be completed after the LOCK
1116 operation has completed. 1116 operation has completed.
1117 1117
1118 (2) UNLOCK operation implication: 1118 (2) UNLOCK operation implication:
1119 1119
1120 Memory operations issued before the UNLOCK will be completed before the 1120 Memory operations issued before the UNLOCK will be completed before the
1121 UNLOCK operation has completed. 1121 UNLOCK operation has completed.
1122 1122
1123 Memory operations issued after the UNLOCK may be completed before the 1123 Memory operations issued after the UNLOCK may be completed before the
1124 UNLOCK operation has completed. 1124 UNLOCK operation has completed.
1125 1125
1126 (3) LOCK vs LOCK implication: 1126 (3) LOCK vs LOCK implication:
1127 1127
1128 All LOCK operations issued before another LOCK operation will be completed 1128 All LOCK operations issued before another LOCK operation will be completed
1129 before that LOCK operation. 1129 before that LOCK operation.
1130 1130
1131 (4) LOCK vs UNLOCK implication: 1131 (4) LOCK vs UNLOCK implication:
1132 1132
1133 All LOCK operations issued before an UNLOCK operation will be completed 1133 All LOCK operations issued before an UNLOCK operation will be completed
1134 before the UNLOCK operation. 1134 before the UNLOCK operation.
1135 1135
1136 All UNLOCK operations issued before a LOCK operation will be completed 1136 All UNLOCK operations issued before a LOCK operation will be completed
1137 before the LOCK operation. 1137 before the LOCK operation.
1138 1138
1139 (5) Failed conditional LOCK implication: 1139 (5) Failed conditional LOCK implication:
1140 1140
1141 Certain variants of the LOCK operation may fail, either due to being 1141 Certain variants of the LOCK operation may fail, either due to being
1142 unable to get the lock immediately, or due to receiving an unblocked 1142 unable to get the lock immediately, or due to receiving an unblocked
1143 signal whilst asleep waiting for the lock to become available. Failed 1143 signal whilst asleep waiting for the lock to become available. Failed
1144 locks do not imply any sort of barrier. 1144 locks do not imply any sort of barrier.
1145 1145
1146 Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is 1146 Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
1147 equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. 1147 equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
1148 1148
1149 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way 1149 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
1150 barriers is that the effects of instructions outside of a critical section may 1150 barriers is that the effects of instructions outside of a critical section may
1151 seep into the inside of the critical section. 1151 seep into the inside of the critical section.
1152 1152
1153 A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier 1153 A LOCK followed by an UNLOCK may not be assumed to be a full memory barrier
1154 because it is possible for an access preceding the LOCK to happen after the 1154 because it is possible for an access preceding the LOCK to happen after the
1155 LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the 1155 LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
1156 two accesses can themselves then cross: 1156 two accesses can themselves then cross:
1157 1157
1158 *A = a; 1158 *A = a;
1159 LOCK 1159 LOCK
1160 UNLOCK 1160 UNLOCK
1161 *B = b; 1161 *B = b;
1162 1162
1163 may occur as: 1163 may occur as:
1164 1164
1165 LOCK, STORE *B, STORE *A, UNLOCK 1165 LOCK, STORE *B, STORE *A, UNLOCK
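
The one-way nature of LOCK and UNLOCK corresponds roughly to acquire and
release ordering in C11. The sketch below is a userspace analogy, not kernel
code: lock_sketch, lock_acquire_sketch(), lock_release_sketch() and
store_around_empty_section() are all hypothetical names.

```c
#include <stdatomic.h>

atomic_flag lock_sketch = ATOMIC_FLAG_INIT;
int shared_a;
int shared_b;

/* LOCK: acquire ordering stops later accesses from moving before the lock,
 * but accesses issued earlier may still drift past it. */
void lock_acquire_sketch(void)
{
	while (atomic_flag_test_and_set_explicit(&lock_sketch,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

/* UNLOCK: release ordering stops earlier accesses from moving after the
 * unlock, but accesses issued later may still drift ahead of it. */
void lock_release_sketch(void)
{
	atomic_flag_clear_explicit(&lock_sketch, memory_order_release);
}

/* Mirrors the "*A = a; LOCK; UNLOCK; *B = b;" sequence: neither store is
 * pinned by both operations, so another CPU may observe them swapped. */
int store_around_empty_section(int a, int b)
{
	shared_a = a;
	lock_acquire_sketch();
	lock_release_sketch();
	shared_b = b;
	return shared_a + shared_b;
}
```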
1166 1166
1167 Locks and semaphores may not provide any guarantee of ordering on UP compiled 1167 Locks and semaphores may not provide any guarantee of ordering on UP compiled
1168 systems, and so cannot be counted on in such a situation to actually achieve 1168 systems, and so cannot be counted on in such a situation to actually achieve
1169 anything at all - especially with respect to I/O accesses - unless combined 1169 anything at all - especially with respect to I/O accesses - unless combined
1170 with interrupt disabling operations. 1170 with interrupt disabling operations.
1171 1171
1172 See also the section on "Inter-CPU locking barrier effects". 1172 See also the section on "Inter-CPU locking barrier effects".
1173 1173
1174 1174
1175 As an example, consider the following: 1175 As an example, consider the following:
1176 1176
1177 *A = a; 1177 *A = a;
1178 *B = b; 1178 *B = b;
1179 LOCK 1179 LOCK
1180 *C = c; 1180 *C = c;
1181 *D = d; 1181 *D = d;
1182 UNLOCK 1182 UNLOCK
1183 *E = e; 1183 *E = e;
1184 *F = f; 1184 *F = f;
1185 1185
1186 The following sequence of events is acceptable: 1186 The following sequence of events is acceptable:
1187 1187
1188 LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK 1188 LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
1189 1189
1190 [+] Note that {*F,*A} indicates a combined access. 1190 [+] Note that {*F,*A} indicates a combined access.
1191 1191
1192 But none of the following are: 1192 But none of the following are:
1193 1193
1194 {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E 1194 {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
1195 *A, *B, *C, LOCK, *D, UNLOCK, *E, *F 1195 *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
1196 *A, *B, LOCK, *C, UNLOCK, *D, *E, *F 1196 *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
1197 *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E 1197 *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
1198 1198
1199 1199
1200 1200
1201 INTERRUPT DISABLING FUNCTIONS 1201 INTERRUPT DISABLING FUNCTIONS
1202 ----------------------------- 1202 -----------------------------
1203 1203
1204 Functions that disable interrupts (LOCK equivalent) and enable interrupts 1204 Functions that disable interrupts (LOCK equivalent) and enable interrupts
1205 (UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O 1205 (UNLOCK equivalent) will act as compiler barriers only. So if memory or I/O
1206 barriers are required in such a situation, they must be provided from some 1206 barriers are required in such a situation, they must be provided from some
1207 other means. 1207 other means.
1208 1208
1209 1209
1210 MISCELLANEOUS FUNCTIONS 1210 MISCELLANEOUS FUNCTIONS
1211 ----------------------- 1211 -----------------------
1212 1212
1213 Other functions that imply barriers: 1213 Other functions that imply barriers:
1214 1214
1215 (*) schedule() and similar imply full memory barriers. 1215 (*) schedule() and similar imply full memory barriers.
1216 1216
1217 1217
1218 ================================= 1218 =================================
1219 INTER-CPU LOCKING BARRIER EFFECTS 1219 INTER-CPU LOCKING BARRIER EFFECTS
1220 ================================= 1220 =================================
1221 1221
1222 On SMP systems locking primitives give a more substantial form of barrier: one 1222 On SMP systems locking primitives give a more substantial form of barrier: one
1223 that does affect memory access ordering on other CPUs, within the context of 1223 that does affect memory access ordering on other CPUs, within the context of
1224 conflict on any particular lock. 1224 conflict on any particular lock.
1225 1225
1226 1226
1227 LOCKS VS MEMORY ACCESSES 1227 LOCKS VS MEMORY ACCESSES
1228 ------------------------ 1228 ------------------------
1229 1229
1230 Consider the following: the system has a pair of spinlocks (M) and (Q), and 1230 Consider the following: the system has a pair of spinlocks (M) and (Q), and
1231 three CPUs; then should the following sequence of events occur: 1231 three CPUs; then should the following sequence of events occur:
1232 1232
1233 CPU 1 CPU 2 1233 CPU 1 CPU 2
1234 =============================== =============================== 1234 =============================== ===============================
1235 *A = a; *E = e; 1235 *A = a; *E = e;
1236 LOCK M LOCK Q 1236 LOCK M LOCK Q
1237 *B = b; *F = f; 1237 *B = b; *F = f;
1238 *C = c; *G = g; 1238 *C = c; *G = g;
1239 UNLOCK M UNLOCK Q 1239 UNLOCK M UNLOCK Q
1240 *D = d; *H = h; 1240 *D = d; *H = h;
1241 1241
1242 Then there is no guarantee as to what order CPU #3 will see the accesses to *A 1242 Then there is no guarantee as to what order CPU #3 will see the accesses to *A
1243 through *H occur in, other than the constraints imposed by the separate locks 1243 through *H occur in, other than the constraints imposed by the separate locks
1244 on the separate CPUs. It might, for example, see: 1244 on the separate CPUs. It might, for example, see:
1245 1245
1246 *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M 1246 *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
1247 1247
1248 But it won't see any of: 1248 But it won't see any of:
1249 1249
1250 *B, *C or *D preceding LOCK M 1250 *B, *C or *D preceding LOCK M
1251 *A, *B or *C following UNLOCK M 1251 *A, *B or *C following UNLOCK M
1252 *F, *G or *H preceding LOCK Q 1252 *F, *G or *H preceding LOCK Q
1253 *E, *F or *G following UNLOCK Q 1253 *E, *F or *G following UNLOCK Q
1254 1254
1255 1255
1256 However, if the following occurs: 1256 However, if the following occurs:
1257 1257
1258 CPU 1 CPU 2 1258 CPU 1 CPU 2
1259 =============================== =============================== 1259 =============================== ===============================
1260 *A = a; 1260 *A = a;
1261 LOCK M [1] 1261 LOCK M [1]
1262 *B = b; 1262 *B = b;
1263 *C = c; 1263 *C = c;
1264 UNLOCK M [1] 1264 UNLOCK M [1]
1265 *D = d; *E = e; 1265 *D = d; *E = e;
1266 LOCK M [2] 1266 LOCK M [2]
1267 *F = f; 1267 *F = f;
1268 *G = g; 1268 *G = g;
1269 UNLOCK M [2] 1269 UNLOCK M [2]
1270 *H = h; 1270 *H = h;
1271 1271
1272 CPU #3 might see: 1272 CPU #3 might see:
1273 1273
1274 *E, LOCK M [1], *C, *B, *A, UNLOCK M [1], 1274 *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
1275 LOCK M [2], *H, *F, *G, UNLOCK M [2], *D 1275 LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
1276 1276
1277 But assuming CPU #1 gets the lock first, it won't see any of: 1277 But assuming CPU #1 gets the lock first, it won't see any of:
1278 1278
1279 *B, *C, *D, *F, *G or *H preceding LOCK M [1] 1279 *B, *C, *D, *F, *G or *H preceding LOCK M [1]
1280 *A, *B or *C following UNLOCK M [1] 1280 *A, *B or *C following UNLOCK M [1]
1281 *F, *G or *H preceding LOCK M [2] 1281 *F, *G or *H preceding LOCK M [2]
1282 *A, *B, *C, *E, *F or *G following UNLOCK M [2] 1282 *A, *B, *C, *E, *F or *G following UNLOCK M [2]
1283 1283
1284 1284
1285 LOCKS VS I/O ACCESSES 1285 LOCKS VS I/O ACCESSES
1286 --------------------- 1286 ---------------------
1287 1287
1288 Under certain circumstances (especially involving NUMA), I/O accesses within 1288 Under certain circumstances (especially involving NUMA), I/O accesses within
1289 two spinlocked sections on two different CPUs may be seen as interleaved by the 1289 two spinlocked sections on two different CPUs may be seen as interleaved by the
1290 PCI bridge, because the PCI bridge does not necessarily participate in the 1290 PCI bridge, because the PCI bridge does not necessarily participate in the
1291 cache-coherence protocol, and is therefore incapable of issuing the required 1291 cache-coherence protocol, and is therefore incapable of issuing the required
1292 read memory barriers. 1292 read memory barriers.
1293 1293
1294 For example: 1294 For example:
1295 1295
1296 CPU 1 CPU 2 1296 CPU 1 CPU 2
1297 =============================== =============================== 1297 =============================== ===============================
1298 spin_lock(Q); 1298 spin_lock(Q);
1299 writel(0, ADDR); 1299 writel(0, ADDR);
1300 writel(1, DATA); 1300 writel(1, DATA);
1301 spin_unlock(Q); 1301 spin_unlock(Q);
1302 spin_lock(Q); 1302 spin_lock(Q);
1303 writel(4, ADDR); 1303 writel(4, ADDR);
1304 writel(5, DATA); 1304 writel(5, DATA);
1305 spin_unlock(Q); 1305 spin_unlock(Q);
1306 1306
1307 may be seen by the PCI bridge as follows: 1307 may be seen by the PCI bridge as follows:
1308 1308
1309 STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5 1309 STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
1310 1310
1311 which would probably cause the hardware to malfunction. 1311 which would probably cause the hardware to malfunction.
1312 1312
1313 1313
1314 What is necessary here is to intervene with an mmiowb() before dropping the 1314 What is necessary here is to intervene with an mmiowb() before dropping the
1315 spinlock, for example: 1315 spinlock, for example:
1316 1316
1317 CPU 1 CPU 2 1317 CPU 1 CPU 2
1318 =============================== =============================== 1318 =============================== ===============================
1319 spin_lock(Q); 1319 spin_lock(Q);
1320 writel(0, ADDR); 1320 writel(0, ADDR);
1321 writel(1, DATA); 1321 writel(1, DATA);
1322 mmiowb(); 1322 mmiowb();
1323 spin_unlock(Q); 1323 spin_unlock(Q);
1324 spin_lock(Q); 1324 spin_lock(Q);
1325 writel(4, ADDR); 1325 writel(4, ADDR);
1326 writel(5, DATA); 1326 writel(5, DATA);
1327 mmiowb(); 1327 mmiowb();
1328 spin_unlock(Q); 1328 spin_unlock(Q);
1329 1329
1330 this will ensure that the two stores issued on CPU #1 appear at the PCI bridge 1330 this will ensure that the two stores issued on CPU #1 appear at the PCI bridge
1331 before either of the stores issued on CPU #2. 1331 before either of the stores issued on CPU #2.
1332 1332
1333 1333
1334 Furthermore, following a store by a load to the same device obviates the need 1334 Furthermore, following a store by a load to the same device obviates the need
1335 for an mmiowb(), because the load forces the store to complete before the load 1335 for an mmiowb(), because the load forces the store to complete before the load
1336 is performed: 1336 is performed:
1337 1337
1338 CPU 1 CPU 2 1338 CPU 1 CPU 2
1339 =============================== =============================== 1339 =============================== ===============================
1340 spin_lock(Q); 1340 spin_lock(Q);
1341 writel(0, ADDR); 1341 writel(0, ADDR);
1342 a = readl(DATA); 1342 a = readl(DATA);
1343 spin_unlock(Q); 1343 spin_unlock(Q);
1344 spin_lock(Q); 1344 spin_lock(Q);
1345 writel(4, ADDR); 1345 writel(4, ADDR);
1346 b = readl(DATA); 1346 b = readl(DATA);
1347 spin_unlock(Q); 1347 spin_unlock(Q);
1348 1348
1349 1349
1350 See Documentation/DocBook/deviceiobook.tmpl for more information. 1350 See Documentation/DocBook/deviceiobook.tmpl for more information.
1351 1351
1352 1352
1353 ================================= 1353 =================================
1354 WHERE ARE MEMORY BARRIERS NEEDED? 1354 WHERE ARE MEMORY BARRIERS NEEDED?
1355 ================================= 1355 =================================
1356 1356
1357 Under normal operation, memory operation reordering is generally not going to 1357 Under normal operation, memory operation reordering is generally not going to
1358 be a problem as a single-threaded linear piece of code will still appear to 1358 be a problem as a single-threaded linear piece of code will still appear to
1359 work correctly, even if it's in an SMP kernel. There are, however, four 1359 work correctly, even if it's in an SMP kernel. There are, however, four
1360 circumstances in which reordering definitely _could_ be a problem: 1360 circumstances in which reordering definitely _could_ be a problem:
1361 1361
1362 (*) Interprocessor interaction. 1362 (*) Interprocessor interaction.
1363 1363
1364 (*) Atomic operations. 1364 (*) Atomic operations.
1365 1365
1366 (*) Accessing devices (I/O). 1366 (*) Accessing devices (I/O).
1367 1367
1368 (*) Interrupts. 1368 (*) Interrupts.
1369 1369
1370 1370
1371 INTERPROCESSOR INTERACTION 1371 INTERPROCESSOR INTERACTION
1372 -------------------------- 1372 --------------------------
1373 1373
1374 When there's a system with more than one processor, more than one CPU in the 1374 When there's a system with more than one processor, more than one CPU in the
1375 system may be working on the same data set at the same time. This can cause 1375 system may be working on the same data set at the same time. This can cause
1376 synchronisation problems, and the usual way of dealing with them is to use 1376 synchronisation problems, and the usual way of dealing with them is to use
1377 locks. Locks, however, are quite expensive, and so it may be preferable to 1377 locks. Locks, however, are quite expensive, and so it may be preferable to
1378 operate without the use of a lock if at all possible. In such a case 1378 operate without the use of a lock if at all possible. In such a case
1379 operations that affect both CPUs may have to be carefully ordered to prevent 1379 operations that affect both CPUs may have to be carefully ordered to prevent
1380 a malfunction. 1380 a malfunction.
1381 1381
1382 Consider, for example, the R/W semaphore slow path. Here a waiting process is 1382 Consider, for example, the R/W semaphore slow path. Here a waiting process is
1383 queued on the semaphore, by virtue of it having a piece of its stack linked to 1383 queued on the semaphore, by virtue of it having a piece of its stack linked to
1384 the semaphore's list of waiting processes: 1384 the semaphore's list of waiting processes:
1385 1385
1386 struct rw_semaphore { 1386 struct rw_semaphore {
1387 ... 1387 ...
1388 spinlock_t lock; 1388 spinlock_t lock;
1389 struct list_head waiters; 1389 struct list_head waiters;
1390 }; 1390 };
1391 1391
1392 struct rwsem_waiter { 1392 struct rwsem_waiter {
1393 struct list_head list; 1393 struct list_head list;
1394 struct task_struct *task; 1394 struct task_struct *task;
1395 }; 1395 };
1396 1396
1397 To wake up a particular waiter, the up_read() or up_write() functions have to: 1397 To wake up a particular waiter, the up_read() or up_write() functions have to:
1398 1398
1399 (1) read the next pointer from this waiter's record to find out where the 1399 (1) read the next pointer from this waiter's record to find out where the
1400 next waiter record is; 1400 next waiter record is;
1401 1401
1402 (2) read the pointer to the waiter's task structure; 1402 (2) read the pointer to the waiter's task structure;
1403 1403
1404 (3) clear the task pointer to tell the waiter it has been given the semaphore; 1404 (3) clear the task pointer to tell the waiter it has been given the semaphore;
1405 1405
1406 (4) call wake_up_process() on the task; and 1406 (4) call wake_up_process() on the task; and
1407 1407
1408 (5) release the reference held on the waiter's task struct. 1408 (5) release the reference held on the waiter's task struct.
1409 1409
1410 In other words, it has to perform this sequence of events: 1410 In other words, it has to perform this sequence of events:
1411 1411
1412 LOAD waiter->list.next; 1412 LOAD waiter->list.next;
1413 LOAD waiter->task; 1413 LOAD waiter->task;
1414 STORE waiter->task; 1414 STORE waiter->task;
1415 CALL wakeup 1415 CALL wakeup
1416 RELEASE task 1416 RELEASE task
1417 1417
1418 and if any of these steps occur out of order, then the whole thing may 1418 and if any of these steps occur out of order, then the whole thing may
1419 malfunction. 1419 malfunction.
1420 1420
1421 Once it has queued itself and dropped the semaphore lock, the waiter does not 1421 Once it has queued itself and dropped the semaphore lock, the waiter does not
1422 get the lock again; it instead just waits for its task pointer to be cleared 1422 get the lock again; it instead just waits for its task pointer to be cleared
1423 before proceeding. Since the record is on the waiter's stack, this means that 1423 before proceeding. Since the record is on the waiter's stack, this means that
1424 if the task pointer is cleared _before_ the next pointer in the list is read, 1424 if the task pointer is cleared _before_ the next pointer in the list is read,
1425 another CPU might start processing the waiter and might clobber the waiter's 1425 another CPU might start processing the waiter and might clobber the waiter's
1426 stack before the up*() function has a chance to read the next pointer. 1426 stack before the up*() function has a chance to read the next pointer.
1427 1427
1428 Consider then what might happen to the above sequence of events: 1428 Consider then what might happen to the above sequence of events:
1429 1429
1430 CPU 1 CPU 2 1430 CPU 1 CPU 2
1431 =============================== =============================== 1431 =============================== ===============================
1432 down_xxx() 1432 down_xxx()
1433 Queue waiter 1433 Queue waiter
1434 Sleep 1434 Sleep
1435 up_yyy() 1435 up_yyy()
1436 LOAD waiter->task; 1436 LOAD waiter->task;
1437 STORE waiter->task; 1437 STORE waiter->task;
1438 Woken up by other event 1438 Woken up by other event
1439 <preempt> 1439 <preempt>
1440 Resume processing 1440 Resume processing
1441 down_xxx() returns 1441 down_xxx() returns
1442 call foo() 1442 call foo()
1443 foo() clobbers *waiter 1443 foo() clobbers *waiter
1444 </preempt> 1444 </preempt>
1445 LOAD waiter->list.next; 1445 LOAD waiter->list.next;
1446 --- OOPS --- 1446 --- OOPS ---
1447 1447
1448 This could be dealt with using the semaphore lock, but then the down_xxx() 1448 This could be dealt with using the semaphore lock, but then the down_xxx()
1449 function has to needlessly get the spinlock again after being woken up. 1449 function has to needlessly get the spinlock again after being woken up.
1450 1450
1451 The way to deal with this is to insert a general SMP memory barrier: 1451 The way to deal with this is to insert a general SMP memory barrier:
1452 1452
1453 LOAD waiter->list.next; 1453 LOAD waiter->list.next;
1454 LOAD waiter->task; 1454 LOAD waiter->task;
1455 smp_mb(); 1455 smp_mb();
1456 STORE waiter->task; 1456 STORE waiter->task;
1457 CALL wakeup 1457 CALL wakeup
1458 RELEASE task 1458 RELEASE task
1459 1459
1460 In this case, the barrier makes a guarantee that all memory accesses before the 1460 In this case, the barrier makes a guarantee that all memory accesses before the
1461 barrier will appear to happen before all the memory accesses after the barrier 1461 barrier will appear to happen before all the memory accesses after the barrier
1462 with respect to the other CPUs on the system. It does _not_ guarantee that all 1462 with respect to the other CPUs on the system. It does _not_ guarantee that all
1463 the memory accesses before the barrier will be complete by the time the barrier 1463 the memory accesses before the barrier will be complete by the time the barrier
1464 instruction itself is complete. 1464 instruction itself is complete.
1465 1465
1466 On a UP system - where this wouldn't be a problem - the smp_mb() is just a 1466 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
1467 compiler barrier, thus making sure the compiler emits the instructions in the 1467 compiler barrier, thus making sure the compiler emits the instructions in the
1468 right order without actually intervening in the CPU. Since there's only one 1468 right order without actually intervening in the CPU. Since there's only one
1469 CPU, that CPU's dependency ordering logic will take care of everything else. 1469 CPU, that CPU's dependency ordering logic will take care of everything else.
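
The corrected ordering above can be sketched in userspace C11. This is only an analogy, not the kernel's semaphore code: `struct waiter`, `release_waiter()` and the use of `atomic_thread_fence(memory_order_seq_cst)` in place of smp_mb() are assumptions made for illustration. The point it shows is that every field of the waiter record is read before the store that lets the sleeper go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the semaphore's waiter record. */
struct waiter {
	struct waiter *next;     /* plays the role of waiter->list.next */
	_Atomic(void *) task;    /* plays the role of waiter->task */
};

/*
 * Read everything that is needed from *w, then publish "task = NULL".
 * Once that store is visible the sleeper may return and reuse its stack,
 * so *w must not be touched afterwards.
 */
static struct waiter *release_waiter(struct waiter *w, void **task_out)
{
	/* LOAD waiter->list.next */
	struct waiter *next = w->next;
	/* LOAD waiter->task */
	void *task = atomic_load_explicit(&w->task, memory_order_relaxed);

	assert(task_out != NULL);

	/* Userspace analogue of smp_mb(): the loads above are ordered
	 * before the store below as seen by other threads. */
	atomic_thread_fence(memory_order_seq_cst);

	/* STORE waiter->task */
	atomic_store_explicit(&w->task, NULL, memory_order_relaxed);

	*task_out = task;        /* the caller would now wake and release it */
	return next;
}
```

The caller continues walking the list using only the returned pointer, never `w->next`, because `w` may already be gone.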
1470 1470
1471 1471
1472 ATOMIC OPERATIONS 1472 ATOMIC OPERATIONS
1473 ----------------- 1473 -----------------
1474 1474
1475 Whilst they are technically interprocessor interaction considerations, atomic 1475 Whilst they are technically interprocessor interaction considerations, atomic
1476 operations are noted specially as some of them imply full memory barriers and 1476 operations are noted specially as some of them imply full memory barriers and
1477 some don't, but they're very heavily relied on as a group throughout the 1477 some don't, but they're very heavily relied on as a group throughout the
1478 kernel. 1478 kernel.
1479 1479
1480 Any atomic operation that modifies some state in memory and returns information 1480 Any atomic operation that modifies some state in memory and returns information
1481 about the state (old or new) implies an SMP-conditional general memory barrier 1481 about the state (old or new) implies an SMP-conditional general memory barrier
1482 (smp_mb()) on each side of the actual operation. These include: 1482 (smp_mb()) on each side of the actual operation. These include:
1483 1483
1484 xchg(); 1484 xchg();
1485 cmpxchg(); 1485 cmpxchg();
1486 atomic_cmpxchg(); 1486 atomic_cmpxchg();
1487 atomic_inc_return(); 1487 atomic_inc_return();
1488 atomic_dec_return(); 1488 atomic_dec_return();
1489 atomic_add_return(); 1489 atomic_add_return();
1490 atomic_sub_return(); 1490 atomic_sub_return();
1491 atomic_inc_and_test(); 1491 atomic_inc_and_test();
1492 atomic_dec_and_test(); 1492 atomic_dec_and_test();
1493 atomic_sub_and_test(); 1493 atomic_sub_and_test();
1494 atomic_add_negative(); 1494 atomic_add_negative();
1495 atomic_add_unless(); 1495 atomic_add_unless();
1496 test_and_set_bit(); 1496 test_and_set_bit();
1497 test_and_clear_bit(); 1497 test_and_clear_bit();
1498 test_and_change_bit(); 1498 test_and_change_bit();
1499 1499
1500 These are used for such things as implementing LOCK-class and UNLOCK-class 1500 These are used for such things as implementing LOCK-class and UNLOCK-class
1501 operations and adjusting reference counters towards object destruction, and as 1501 operations and adjusting reference counters towards object destruction, and as
1502 such the implicit memory barrier effects are necessary. 1502 such the implicit memory barrier effects are necessary.
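
As a rough userspace illustration of the reference-counting use mentioned above, the following C11 sketch drops a reference with a sequentially consistent read-modify-write. `struct object`, `object_get()` and `object_put()` are hypothetical names, and a seq_cst `atomic_fetch_sub()` only approximates the "smp_mb() on each side" behaviour of atomic_dec_and_test(); it is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct object {
	atomic_int refcount;
	int payload;
};

static void object_get(struct object *obj)
{
	/* The caller already holds a reference, so no ordering is needed. */
	atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
}

/* Returns true if the caller dropped the last reference and must free obj. */
static bool object_put(struct object *obj)
{
	/* Value-returning RMW, fully ordered: all earlier accesses to *obj
	 * are ordered before the decrement that may trigger destruction. */
	int old = atomic_fetch_sub_explicit(&obj->refcount, 1,
					    memory_order_seq_cst);

	assert(old > 0);
	return old == 1;
}
```

Only the thread for which `object_put()` returns true may destroy the object; the ordering on the decrement is what makes that safe.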
1503 1503
1504 1504
1505 The following operations are potential problems as they do _not_ imply memory 1505 The following operations are potential problems as they do _not_ imply memory
1506 barriers, but might be used for implementing such things as UNLOCK-class 1506 barriers, but might be used for implementing such things as UNLOCK-class
1507 operations: 1507 operations:
1508 1508
1509 atomic_set(); 1509 atomic_set();
1510 set_bit(); 1510 set_bit();
1511 clear_bit(); 1511 clear_bit();
1512 change_bit(); 1512 change_bit();
1513 1513
1514 With these the appropriate explicit memory barrier should be used if necessary 1514 With these the appropriate explicit memory barrier should be used if necessary
1515 (smp_mb__before_clear_bit() for instance). 1515 (smp_mb__before_clear_bit() for instance).
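
A minimal userspace sketch of that pattern, assuming C11 atomics stand in for the kernel bit operations: the fence plays the role of smp_mb__before_clear_bit() and the relaxed `atomic_fetch_and()` plays the role of clear_bit(). `LOCK_BIT`, `flags`, `try_lock_bit()` and `unlock_bit()` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdatomic.h>

#define LOCK_BIT 0x1u            /* hypothetical "busy" bit in a flags word */

static atomic_uint flags;        /* zero-initialised: bit starts clear */

/* Returns nonzero if the bit was acquired (it was previously clear). */
static int try_lock_bit(void)
{
	/* Value-returning RMW, fully ordered, like test_and_set_bit(). */
	unsigned int old = atomic_fetch_or_explicit(&flags, LOCK_BIT,
						    memory_order_seq_cst);
	return !(old & LOCK_BIT);
}

static void unlock_bit(void)
{
	/* Explicit barrier: accesses made while holding the bit are ordered
	 * before the clearing store below. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Atomic, but implies no ordering of its own, like clear_bit(). */
	atomic_fetch_and_explicit(&flags, ~LOCK_BIT, memory_order_relaxed);
}
```

Without the fence in `unlock_bit()`, stores made inside the critical section could become visible to another thread after it has already observed the bit as clear.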
1516 1516
1517 1517
1518 The following also do _not_ imply memory barriers, and so may require explicit 1518 The following also do _not_ imply memory barriers, and so may require explicit
1519 memory barriers under some circumstances (smp_mb__before_atomic_dec() for 1519 memory barriers under some circumstances (smp_mb__before_atomic_dec() for
1520 instance): 1520 instance):
1521 1521
1522 atomic_add(); 1522 atomic_add();
1523 atomic_sub(); 1523 atomic_sub();
1524 atomic_inc(); 1524 atomic_inc();
1525 atomic_dec(); 1525 atomic_dec();
1526 1526
1527 If they're used for statistics generation, then they probably don't need memory 1527 If they're used for statistics generation, then they probably don't need memory
1528 barriers, unless there's a coupling between statistical data. 1528 barriers, unless there's a coupling between statistical data.
1529 1529
1530 If they're used for reference counting on an object to control its lifetime, 1530 If they're used for reference counting on an object to control its lifetime,
1531 they probably don't need memory barriers because either the reference count 1531 they probably don't need memory barriers because either the reference count
1532 will be adjusted inside a locked section, or the caller will already hold 1532 will be adjusted inside a locked section, or the caller will already hold
1533 sufficient references to make the lock, and thus a memory barrier, unnecessary. 1533 sufficient references to make the lock, and thus a memory barrier, unnecessary.
1534 1534
1535 If they're used for constructing a lock of some description, then they probably 1535 If they're used for constructing a lock of some description, then they probably
1536 do need memory barriers as a lock primitive generally has to do things in a 1536 do need memory barriers as a lock primitive generally has to do things in a
1537 specific order. 1537 specific order.
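
For contrast, the statistics case described above might look like the following userspace C11 sketch, where relaxed operations stand in for atomic_inc() and no barrier is used at all. The counter name and helper functions are hypothetical, and the sketch assumes the count is not coupled to any other data.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong rx_packets;  /* hypothetical statistics counter */

static void count_rx_packet(void)
{
	/* Atomic increment with no ordering implied or required. */
	atomic_fetch_add_explicit(&rx_packets, 1, memory_order_relaxed);
}

static unsigned long read_rx_packets(void)
{
	/* A reader only needs some recent value, not an ordered one. */
	return atomic_load_explicit(&rx_packets, memory_order_relaxed);
}
```

If a second counter had to stay consistent with this one, the relaxed ordering would no longer be enough and an explicit barrier or a lock would be needed.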
1538 1538
1539 1539
1540 Basically, each usage case has to be carefully considered as to whether memory 1540 Basically, each usage case has to be carefully considered as to whether memory
1541 barriers are needed or not. 1541 barriers are needed or not.
1542 1542
1543 [!] Note that special memory barrier primitives are available for these 1543 [!] Note that special memory barrier primitives are available for these
1544 situations because on some CPUs the atomic instructions used imply full memory 1544 situations because on some CPUs the atomic instructions used imply full memory
1545 barriers, and so barrier instructions are superfluous in conjunction with them, 1545 barriers, and so barrier instructions are superfluous in conjunction with them,
1546 and in such cases the special barrier primitives will be no-ops. 1546 and in such cases the special barrier primitives will be no-ops.
1547 1547
1548 See Documentation/atomic_ops.txt for more information. 1548 See Documentation/atomic_ops.txt for more information.
1549 1549
1550 1550
1551 ACCESSING DEVICES 1551 ACCESSING DEVICES
1552 ----------------- 1552 -----------------
1553 1553
1554 Many devices can be memory mapped, and so appear to the CPU as if they're just 1554 Many devices can be memory mapped, and so appear to the CPU as if they're just
1555 a set of memory locations. To control such a device, the driver usually has to 1555 a set of memory locations. To control such a device, the driver usually has to
1556 make the right memory accesses in exactly the right order. 1556 make the right memory accesses in exactly the right order.
1557 1557
1558 However, having a clever CPU or a clever compiler creates a potential problem 1558 However, having a clever CPU or a clever compiler creates a potential problem
1559 in that the carefully sequenced accesses in the driver code won't reach the 1559 in that the carefully sequenced accesses in the driver code won't reach the
1560 device in the requisite order if the CPU or the compiler thinks it is more 1560 device in the requisite order if the CPU or the compiler thinks it is more
1561 efficient to reorder, combine or merge accesses - something that would cause 1561 efficient to reorder, combine or merge accesses - something that would cause
1562 the device to malfunction. 1562 the device to malfunction.
1563 1563
1564 Inside of the Linux kernel, I/O should be done through the appropriate accessor 1564 Inside of the Linux kernel, I/O should be done through the appropriate accessor
1565 routines - such as inb() or writel() - which know how to make such accesses 1565 routines - such as inb() or writel() - which know how to make such accesses
1566 appropriately sequential. Whilst this, for the most part, renders the explicit 1566 appropriately sequential. Whilst this, for the most part, renders the explicit
1567 use of memory barriers unnecessary, there are a couple of situations where they 1567 use of memory barriers unnecessary, there are a couple of situations where they
1568 might be needed: 1568 might be needed:
1569 1569
1570 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and 1570 (1) On some systems, I/O stores are not strongly ordered across all CPUs, and
1571 so for _all_ general drivers locks should be used and mmiowb() must be 1571 so for _all_ general drivers locks should be used and mmiowb() must be
1572 issued prior to unlocking the critical section. 1572 issued prior to unlocking the critical section.
1573 1573
1574 (2) If the accessor functions are used to refer to an I/O memory window with 1574 (2) If the accessor functions are used to refer to an I/O memory window with
1575 relaxed memory access properties, then _mandatory_ memory barriers are 1575 relaxed memory access properties, then _mandatory_ memory barriers are
1576 required to enforce ordering. 1576 required to enforce ordering.
1577 1577
1578 See Documentation/DocBook/deviceiobook.tmpl for more information. 1578 See Documentation/DocBook/deviceiobook.tmpl for more information.
1579 1579
1580 1580
1581 INTERRUPTS 1581 INTERRUPTS
1582 ---------- 1582 ----------
1583 1583
1584 A driver may be interrupted by its own interrupt service routine, and thus the 1584 A driver may be interrupted by its own interrupt service routine, and thus the
1585 two parts of the driver may interfere with each other's attempts to control or 1585 two parts of the driver may interfere with each other's attempts to control or
1586 access the device. 1586 access the device.
1587 1587
1588 This may be alleviated - at least in part - by disabling local interrupts (a 1588 This may be alleviated - at least in part - by disabling local interrupts (a
1589 form of locking), such that the critical operations are all contained within 1589 form of locking), such that the critical operations are all contained within
1590 the interrupt-disabled section in the driver. Whilst the driver's interrupt 1590 the interrupt-disabled section in the driver. Whilst the driver's interrupt
1591 routine is executing, the driver's core may not run on the same CPU, and its 1591 routine is executing, the driver's core may not run on the same CPU, and its
1592 interrupt is not permitted to happen again until the current interrupt has been 1592 interrupt is not permitted to happen again until the current interrupt has been
1593 handled, thus the interrupt handler does not need to lock against that. 1593 handled, thus the interrupt handler does not need to lock against that.
1594 1594
1595 However, consider a driver that was talking to an ethernet card that sports an 1595 However, consider a driver that was talking to an ethernet card that sports an
1596 address register and a data register. If that driver's core talks to the card 1596 address register and a data register. If that driver's core talks to the card
1597 under interrupt-disablement and then the driver's interrupt handler is invoked: 1597 under interrupt-disablement and then the driver's interrupt handler is invoked:
1598 1598
1599 LOCAL IRQ DISABLE 1599 LOCAL IRQ DISABLE
1600 writew(ADDR, 3); 1600 writew(ADDR, 3);
1601 writew(DATA, y); 1601 writew(DATA, y);
1602 LOCAL IRQ ENABLE 1602 LOCAL IRQ ENABLE
1603 <interrupt> 1603 <interrupt>
1604 writew(ADDR, 4); 1604 writew(ADDR, 4);
1605 q = readw(DATA); 1605 q = readw(DATA);
1606 </interrupt> 1606 </interrupt>
1607 1607
1608 The store to the data register might happen after the second store to the 1608 The store to the data register might happen after the second store to the
1609 address register if ordering rules are sufficiently relaxed: 1609 address register if ordering rules are sufficiently relaxed:
1610 1610
1611 STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA 1611 STORE *ADDR = 3, STORE *ADDR = 4, STORE *DATA = y, q = LOAD *DATA
1612 1612
1613 1613
1614 If ordering rules are relaxed, it must be assumed that accesses done inside an 1614 If ordering rules are relaxed, it must be assumed that accesses done inside an
1615 interrupt disabled section may leak outside of it and may interleave with 1615 interrupt disabled section may leak outside of it and may interleave with
1616 accesses performed in an interrupt - and vice versa - unless implicit or 1616 accesses performed in an interrupt - and vice versa - unless implicit or
1617 explicit barriers are used. 1617 explicit barriers are used.
1618 1618
1619 Normally this won't be a problem because the I/O accesses done inside such 1619 Normally this won't be a problem because the I/O accesses done inside such
1620 sections will include synchronous load operations on strictly ordered I/O 1620 sections will include synchronous load operations on strictly ordered I/O
1621 registers that form implicit I/O barriers. If this isn't sufficient then an 1621 registers that form implicit I/O barriers. If this isn't sufficient then an
1622 mmiowb() may need to be used explicitly. 1622 mmiowb() may need to be used explicitly.
1623 1623
1624 1624
1625 A similar situation may occur between an interrupt routine and two routines 1625 A similar situation may occur between an interrupt routine and two routines
1626 running on separate CPUs that communicate with each other. If such a case is 1626 running on separate CPUs that communicate with each other. If such a case is
1627 likely, then interrupt-disabling locks should be used to guarantee ordering. 1627 likely, then interrupt-disabling locks should be used to guarantee ordering.
1628 1628
1629 1629
1630 ========================== 1630 ==========================
1631 KERNEL I/O BARRIER EFFECTS 1631 KERNEL I/O BARRIER EFFECTS
1632 ========================== 1632 ==========================
1633 1633
1634 When accessing I/O memory, drivers should use the appropriate accessor 1634 When accessing I/O memory, drivers should use the appropriate accessor
1635 functions: 1635 functions:
1636 1636
1637 (*) inX(), outX(): 1637 (*) inX(), outX():
1638 1638
1639 These are intended to talk to I/O space rather than memory space, but 1639 These are intended to talk to I/O space rather than memory space, but
1640 that's primarily a CPU-specific concept. The i386 and x86_64 processors do 1640 that's primarily a CPU-specific concept. The i386 and x86_64 processors do
1641 indeed have special I/O space access cycles and instructions, but many 1641 indeed have special I/O space access cycles and instructions, but many
1642 CPUs don't have such a concept. 1642 CPUs don't have such a concept.
1643 1643
1644 The PCI bus, amongst others, defines an I/O space concept - which on such 1644 The PCI bus, amongst others, defines an I/O space concept - which on such
1645 CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O 1645 CPUs as i386 and x86_64 readily maps to the CPU's concept of I/O
1646 space. However, it may also be mapped as a virtual I/O space in the CPU's 1646 space. However, it may also be mapped as a virtual I/O space in the CPU's
1647 memory map, particularly on those CPUs that don't support alternate I/O 1647 memory map, particularly on those CPUs that don't support alternate I/O
1648 spaces. 1648 spaces.
1649 1649
1650 Accesses to this space may be fully synchronous (as on i386), but 1650 Accesses to this space may be fully synchronous (as on i386), but
1651 intermediary bridges (such as the PCI host bridge) may not fully honour 1651 intermediary bridges (such as the PCI host bridge) may not fully honour
1652 that. 1652 that.
1653 1653
1654 They are guaranteed to be fully ordered with respect to each other. 1654 They are guaranteed to be fully ordered with respect to each other.
1655 1655
1656 They are not guaranteed to be fully ordered with respect to other types of 1656 They are not guaranteed to be fully ordered with respect to other types of
1657 memory and I/O operation. 1657 memory and I/O operation.
1658 1658
1659 (*) readX(), writeX(): 1659 (*) readX(), writeX():
1660 1660
1661 Whether these are guaranteed to be fully ordered and uncombined with 1661 Whether these are guaranteed to be fully ordered and uncombined with
1662 respect to each other on the issuing CPU depends on the characteristics 1662 respect to each other on the issuing CPU depends on the characteristics
1663 defined for the memory window through which they're accessing. On later 1663 defined for the memory window through which they're accessing. On later
1664 i386 architecture machines, for example, this is controlled by way of the 1664 i386 architecture machines, for example, this is controlled by way of the
1665 MTRR registers. 1665 MTRR registers.
1666 1666
1667 Ordinarily, these will be guaranteed to be fully ordered and uncombined, 1667 Ordinarily, these will be guaranteed to be fully ordered and uncombined,
1668 provided they're not accessing a prefetchable device. 1668 provided they're not accessing a prefetchable device.
1669 1669
1670 However, intermediary hardware (such as a PCI bridge) may indulge in 1670 However, intermediary hardware (such as a PCI bridge) may indulge in
1671 deferral if it so wishes; to flush a store, a load from the same location 1671 deferral if it so wishes; to flush a store, a load from the same location
1672 is preferred[*], but a load from the same device or from configuration 1672 is preferred[*], but a load from the same device or from configuration
1673 space should suffice for PCI. 1673 space should suffice for PCI.
1674 1674
1675 [*] NOTE! attempting to load from the same location as was written to may 1675 [*] NOTE! attempting to load from the same location as was written to may
1676 cause a malfunction - consider the 16550 Rx/Tx serial registers for 1676 cause a malfunction - consider the 16550 Rx/Tx serial registers for
1677 example. 1677 example.
1678 1678
1679 Used with prefetchable I/O memory, an mmiowb() barrier may be required to 1679 Used with prefetchable I/O memory, an mmiowb() barrier may be required to
1680 force stores to be ordered. 1680 force stores to be ordered.
1681 1681
1682 Please refer to the PCI specification for more information on interactions 1682 Please refer to the PCI specification for more information on interactions
1683 between PCI transactions. 1683 between PCI transactions.
1684 1684
1685 (*) readX_relaxed() 1685 (*) readX_relaxed()
1686 1686
1687 These are similar to readX(), but are not guaranteed to be ordered in any 1687 These are similar to readX(), but are not guaranteed to be ordered in any
1688 way. Be aware that there is no I/O read barrier available. 1688 way. Be aware that there is no I/O read barrier available.
1689 1689
1690 (*) ioreadX(), iowriteX() 1690 (*) ioreadX(), iowriteX()
1691 1691
1692 These will perform as appropriate for the type of access they're actually 1692 These will perform as appropriate for the type of access they're actually
1693 doing, be it inX()/outX() or readX()/writeX(). 1693 doing, be it inX()/outX() or readX()/writeX().
1694 1694
1695 1695
1696 ======================================== 1696 ========================================
1697 ASSUMED MINIMUM EXECUTION ORDERING MODEL 1697 ASSUMED MINIMUM EXECUTION ORDERING MODEL
1698 ======================================== 1698 ========================================
1699 1699
1700 It has to be assumed that the conceptual CPU is weakly-ordered but that it will 1700 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
1701 maintain the appearance of program causality with respect to itself. Some CPUs 1701 maintain the appearance of program causality with respect to itself. Some CPUs
1702 (such as i386 or x86_64) are more constrained than others (such as powerpc or 1702 (such as i386 or x86_64) are more constrained than others (such as powerpc or
1703 frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside 1703 frv), and so the most relaxed case (namely DEC Alpha) must be assumed outside
1704 of arch-specific code. 1704 of arch-specific code.
1705 1705
1706 This means that it must be considered that the CPU will execute its instruction 1706 This means that it must be considered that the CPU will execute its instruction
1707 stream in any order it feels like - or even in parallel - provided that if an 1707 stream in any order it feels like - or even in parallel - provided that if an
1708 instruction in the stream depends on an earlier instruction, then that 1708 instruction in the stream depends on an earlier instruction, then that
1709 earlier instruction must be sufficiently complete[*] before the later 1709 earlier instruction must be sufficiently complete[*] before the later
1710 instruction may proceed; in other words: provided that the appearance of 1710 instruction may proceed; in other words: provided that the appearance of
1711 causality is maintained. 1711 causality is maintained.
1712 1712
1713 [*] Some instructions have more than one effect - such as changing the 1713 [*] Some instructions have more than one effect - such as changing the
1714 condition codes, changing registers or changing memory - and different 1714 condition codes, changing registers or changing memory - and different
1715 instructions may depend on different effects. 1715 instructions may depend on different effects.
1716 1716
1717 A CPU may also discard any instruction sequence that winds up having no 1717 A CPU may also discard any instruction sequence that winds up having no
1718 ultimate effect. For example, if two adjacent instructions both load an 1718 ultimate effect. For example, if two adjacent instructions both load an
1719 immediate value into the same register, the first may be discarded. 1719 immediate value into the same register, the first may be discarded.
1720 1720
1721 1721
1722 Similarly, it has to be assumed that the compiler might reorder the instruction 1722 Similarly, it has to be assumed that the compiler might reorder the instruction
1723 stream in any way it sees fit, again provided the appearance of causality is 1723 stream in any way it sees fit, again provided the appearance of causality is
1724 maintained. 1724 maintained.
1725 1725
1726 1726
1727 ============================ 1727 ============================
1728 THE EFFECTS OF THE CPU CACHE 1728 THE EFFECTS OF THE CPU CACHE
1729 ============================ 1729 ============================
1730 1730
1731 The way cached memory operations are perceived across the system is affected to 1731 The way cached memory operations are perceived across the system is affected to
1732 a certain extent by the caches that lie between CPUs and memory, and by the 1732 a certain extent by the caches that lie between CPUs and memory, and by the
1733 memory coherence system that maintains the consistency of state in the system. 1733 memory coherence system that maintains the consistency of state in the system.
1734 1734
1735 As far as the way a CPU interacts with another part of the system through the 1735 As far as the way a CPU interacts with another part of the system through the
1736 caches goes, the memory system has to include the CPU's caches, and memory 1736 caches goes, the memory system has to include the CPU's caches, and memory
1737 barriers for the most part act at the interface between the CPU and its cache 1737 barriers for the most part act at the interface between the CPU and its cache
1738 (memory barriers logically act on the dotted line in the following diagram): 1738 (memory barriers logically act on the dotted line in the following diagram):
1739 1739
1740 <--- CPU ---> : <----------- Memory -----------> 1740 <--- CPU ---> : <----------- Memory ----------->
1741 : 1741 :
1742 +--------+ +--------+ : +--------+ +-----------+ 1742 +--------+ +--------+ : +--------+ +-----------+
1743 | | | | : | | | | +--------+ 1743 | | | | : | | | | +--------+
1744 | CPU | | Memory | : | CPU | | | | | 1744 | CPU | | Memory | : | CPU | | | | |
1745 | Core |--->| Access |----->| Cache |<-->| | | | 1745 | Core |--->| Access |----->| Cache |<-->| | | |
1746 | | | Queue | : | | | |--->| Memory | 1746 | | | Queue | : | | | |--->| Memory |
1747 | | | | : | | | | | | 1747 | | | | : | | | | | |
1748 +--------+ +--------+ : +--------+ | | | | 1748 +--------+ +--------+ : +--------+ | | | |
1749 : | Cache | +--------+ 1749 : | Cache | +--------+
1750 : | Coherency | 1750 : | Coherency |
1751 : | Mechanism | +--------+ 1751 : | Mechanism | +--------+
1752 +--------+ +--------+ : +--------+ | | | | 1752 +--------+ +--------+ : +--------+ | | | |
1753 | | | | : | | | | | | 1753 | | | | : | | | | | |
1754 | CPU | | Memory | : | CPU | | |--->| Device | 1754 | CPU | | Memory | : | CPU | | |--->| Device |
1755 | Core |--->| Access |----->| Cache |<-->| | | | 1755 | Core |--->| Access |----->| Cache |<-->| | | |
1756 | | | Queue | : | | | | | | 1756 | | | Queue | : | | | | | |
1757 | | | | : | | | | +--------+ 1757 | | | | : | | | | +--------+
1758 +--------+ +--------+ : +--------+ +-----------+ 1758 +--------+ +--------+ : +--------+ +-----------+
1759 : 1759 :
1760 : 1760 :
1761 1761
1762 Although any particular load or store may not actually appear outside of the 1762 Although any particular load or store may not actually appear outside of the
1763 CPU that issued it since it may have been satisfied within the CPU's own cache, 1763 CPU that issued it since it may have been satisfied within the CPU's own cache,
1764 it will still appear as if the full memory access had taken place as far as the 1764 it will still appear as if the full memory access had taken place as far as the
1765 other CPUs are concerned since the cache coherency mechanisms will migrate the 1765 other CPUs are concerned since the cache coherency mechanisms will migrate the
1766 cacheline over to the accessing CPU and propagate the effects upon conflict. 1766 cacheline over to the accessing CPU and propagate the effects upon conflict.
1767 1767
1768 The CPU core may execute instructions in any order it deems fit, provided the 1768 The CPU core may execute instructions in any order it deems fit, provided the
1769 expected program causality appears to be maintained. Some of the instructions 1769 expected program causality appears to be maintained. Some of the instructions
1770 generate load and store operations which then go into the queue of memory 1770 generate load and store operations which then go into the queue of memory
1771 accesses to be performed. The core may place these in the queue in any order 1771 accesses to be performed. The core may place these in the queue in any order
1772 it wishes, and continue execution until it is forced to wait for an instruction 1772 it wishes, and continue execution until it is forced to wait for an instruction
1773 to complete. 1773 to complete.
1774 1774
1775 What memory barriers are concerned with is controlling the order in which 1775 What memory barriers are concerned with is controlling the order in which
1776 accesses cross from the CPU side of things to the memory side of things, and 1776 accesses cross from the CPU side of things to the memory side of things, and
1777 the order in which the effects are perceived to happen by the other observers 1777 the order in which the effects are perceived to happen by the other observers
1778 in the system. 1778 in the system.
1779 1779
1780 [!] Memory barriers are _not_ needed within a given CPU, as CPUs always see 1780 [!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
1781 their own loads and stores as if they had happened in program order. 1781 their own loads and stores as if they had happened in program order.
1782 1782
1783 [!] MMIO or other device accesses may bypass the cache system. This depends on 1783 [!] MMIO or other device accesses may bypass the cache system. This depends on
1784 the properties of the memory window through which devices are accessed and/or 1784 the properties of the memory window through which devices are accessed and/or
1785 the use of any special device communication instructions the CPU may have. 1785 the use of any special device communication instructions the CPU may have.
1786 1786
1787 1787
1788 CACHE COHERENCY 1788 CACHE COHERENCY
1789 --------------- 1789 ---------------
1790 1790
1791 Life isn't quite as simple as it may appear above, however: for while the 1791 Life isn't quite as simple as it may appear above, however: for while the
1792 caches are expected to be coherent, there's no guarantee that that coherency 1792 caches are expected to be coherent, there's no guarantee that that coherency
1793 will be ordered. This means that whilst changes made on one CPU will 1793 will be ordered. This means that whilst changes made on one CPU will
1794 eventually become visible on all CPUs, there's no guarantee that they will 1794 eventually become visible on all CPUs, there's no guarantee that they will
1795 become apparent in the same order on those other CPUs. 1795 become apparent in the same order on those other CPUs.
1796 1796
1797 1797
1798 Consider dealing with a system that has a pair of CPUs (1 & 2), each of which has 1798 Consider dealing with a system that has a pair of CPUs (1 & 2), each of which has
1799 a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D): 1799 a pair of parallel data caches (CPU 1 has A/B, and CPU 2 has C/D):
1800 1800
1801 : 1801 :
1802 : +--------+ 1802 : +--------+
1803 : +---------+ | | 1803 : +---------+ | |
1804 +--------+ : +--->| Cache A |<------->| | 1804 +--------+ : +--->| Cache A |<------->| |
1805 | | : | +---------+ | | 1805 | | : | +---------+ | |
1806 | CPU 1 |<---+ | | 1806 | CPU 1 |<---+ | |
1807 | | : | +---------+ | | 1807 | | : | +---------+ | |
1808 +--------+ : +--->| Cache B |<------->| | 1808 +--------+ : +--->| Cache B |<------->| |
1809 : +---------+ | | 1809 : +---------+ | |
1810 : | Memory | 1810 : | Memory |
1811 : +---------+ | System | 1811 : +---------+ | System |
1812 +--------+ : +--->| Cache C |<------->| | 1812 +--------+ : +--->| Cache C |<------->| |
1813 | | : | +---------+ | | 1813 | | : | +---------+ | |
1814 | CPU 2 |<---+ | | 1814 | CPU 2 |<---+ | |
1815 | | : | +---------+ | | 1815 | | : | +---------+ | |
1816 +--------+ : +--->| Cache D |<------->| | 1816 +--------+ : +--->| Cache D |<------->| |
1817 : +---------+ | | 1817 : +---------+ | |
1818 : +--------+ 1818 : +--------+
1819 : 1819 :
1820 1820
1821 Imagine the system has the following properties: 1821 Imagine the system has the following properties:
1822 1822
1823 (*) an odd-numbered cache line may be in cache A, cache C or it may still be 1823 (*) an odd-numbered cache line may be in cache A, cache C or it may still be
1824 resident in memory; 1824 resident in memory;
1825 1825
1826 (*) an even-numbered cache line may be in cache B, cache D or it may still be 1826 (*) an even-numbered cache line may be in cache B, cache D or it may still be
1827 resident in memory; 1827 resident in memory;
1828 1828
1829 (*) whilst the CPU core is interrogating one cache, the other cache may be 1829 (*) whilst the CPU core is interrogating one cache, the other cache may be
1830 making use of the bus to access the rest of the system - perhaps to 1830 making use of the bus to access the rest of the system - perhaps to
1831 displace a dirty cacheline or to do a speculative load; 1831 displace a dirty cacheline or to do a speculative load;
1832 1832
1833 (*) each cache has a queue of operations that need to be applied to that cache 1833 (*) each cache has a queue of operations that need to be applied to that cache
1834 to maintain coherency with the rest of the system; 1834 to maintain coherency with the rest of the system;
1835 1835
1836 (*) the coherency queue is not flushed by normal loads to lines already 1836 (*) the coherency queue is not flushed by normal loads to lines already
1837 present in the cache, even though the contents of the queue may 1837 present in the cache, even though the contents of the queue may
1838 potentially affect those loads. 1838 potentially affect those loads.
1839 1839
1840 Imagine, then, that two writes are made on the first CPU, with a write barrier 1840 Imagine, then, that two writes are made on the first CPU, with a write barrier
1841 between them to guarantee that they will appear to reach that CPU's caches in 1841 between them to guarantee that they will appear to reach that CPU's caches in
1842 the requisite order: 1842 the requisite order:
1843 1843
1844 CPU 1 CPU 2 COMMENT 1844 CPU 1 CPU 2 COMMENT
1845 =============== =============== ======================================= 1845 =============== =============== =======================================
1846 u == 0, v == 1 and p == &u, q == &u 1846 u == 0, v == 1 and p == &u, q == &u
1847 v = 2; 1847 v = 2;
1848 smp_wmb(); Make sure change to v visible before 1848 smp_wmb(); Make sure change to v visible before
1849 change to p 1849 change to p
1850 <A:modify v=2> v is now in cache A exclusively 1850 <A:modify v=2> v is now in cache A exclusively
1851 p = &v; 1851 p = &v;
1852 <B:modify p=&v> p is now in cache B exclusively 1852 <B:modify p=&v> p is now in cache B exclusively
1853 1853
1854 The write memory barrier forces the other CPUs in the system to perceive that 1854 The write memory barrier forces the other CPUs in the system to perceive that
1855 the local CPU's caches have apparently been updated in the correct order. But 1855 the local CPU's caches have apparently been updated in the correct order. But
1856 now imagine that the second CPU wants to read those values: 1856 now imagine that the second CPU wants to read those values:
1857 1857
1858 CPU 1 CPU 2 COMMENT 1858 CPU 1 CPU 2 COMMENT
1859 =============== =============== ======================================= 1859 =============== =============== =======================================
1860 ... 1860 ...
1861 q = p; 1861 q = p;
1862 x = *q; 1862 x = *q;
1863 1863
1864 The above pair of reads may then fail to happen in the expected order, as the 1864 The above pair of reads may then fail to happen in the expected order, as the
1865 cacheline holding p may get updated in one of the second CPU's caches whilst 1865 cacheline holding p may get updated in one of the second CPU's caches whilst
1866 the update to the cacheline holding v is delayed in the other of the second 1866 the update to the cacheline holding v is delayed in the other of the second
1867 CPU's caches by some other cache event: 1867 CPU's caches by some other cache event:
1868 1868
1869 CPU 1 CPU 2 COMMENT 1869 CPU 1 CPU 2 COMMENT
1870 =============== =============== ======================================= 1870 =============== =============== =======================================
1871 u == 0, v == 1 and p == &u, q == &u 1871 u == 0, v == 1 and p == &u, q == &u
1872 v = 2; 1872 v = 2;
1873 smp_wmb(); 1873 smp_wmb();
1874 <A:modify v=2> <C:busy> 1874 <A:modify v=2> <C:busy>
1875 <C:queue v=2> 1875 <C:queue v=2>
1876 p = &v; q = p; 1876 p = &v; q = p;
1877 <D:request p> 1877 <D:request p>
1878 <B:modify p=&v> <D:commit p=&v> 1878 <B:modify p=&v> <D:commit p=&v>
1879 <D:read p> 1879 <D:read p>
1880 x = *q; 1880 x = *q;
1881 <C:read *q> Reads from v before v updated in cache 1881 <C:read *q> Reads from v before v updated in cache
1882 <C:unbusy> 1882 <C:unbusy>
1883 <C:commit v=2> 1883 <C:commit v=2>
1884 1884
1885 Basically, whilst both cachelines will be updated on CPU 2 eventually, there's 1885 Basically, whilst both cachelines will be updated on CPU 2 eventually, there's
1886 no guarantee that, without intervention, the order of update will be the same 1886 no guarantee that, without intervention, the order of update will be the same
1887 as that committed on CPU 1. 1887 as that committed on CPU 1.
1888 1888
1889 1889
1890 To intervene, we need to interpolate a data dependency barrier or a read 1890 To intervene, we need to interpolate a data dependency barrier or a read
1891 barrier between the loads. This will force the cache to commit its coherency 1891 barrier between the loads. This will force the cache to commit its coherency
1892 queue before processing any further requests: 1892 queue before processing any further requests:
1893 1893
1894 CPU 1 CPU 2 COMMENT 1894 CPU 1 CPU 2 COMMENT
1895 =============== =============== ======================================= 1895 =============== =============== =======================================
1896 u == 0, v == 1 and p == &u, q == &u 1896 u == 0, v == 1 and p == &u, q == &u
1897 v = 2; 1897 v = 2;
1898 smp_wmb(); 1898 smp_wmb();
1899 <A:modify v=2> <C:busy> 1899 <A:modify v=2> <C:busy>
1900 <C:queue v=2> 1900 <C:queue v=2>
1901 p = &v; q = p; 1901 p = &v; q = p;
1902 <D:request p> 1902 <D:request p>
1903 <B:modify p=&v> <D:commit p=&v> 1903 <B:modify p=&v> <D:commit p=&v>
1904 <D:read p> 1904 <D:read p>
1905 smp_read_barrier_depends() 1905 smp_read_barrier_depends()
1906 <C:unbusy> 1906 <C:unbusy>
1907 <C:commit v=2> 1907 <C:commit v=2>
1908 x = *q; 1908 x = *q;
1909 <C:read *q> Reads from v after v updated in cache 1909 <C:read *q> Reads from v after v updated in cache
1910 1910
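The effect of the coherency queue in the tables above can be imitated with a
toy, single-threaded model. This is only a sketch under stated assumptions:
the names toy_cache, toy_queue_update(), toy_load(), toy_read_barrier() and
toy_replay() are hypothetical and exist only for illustration, and the model
captures nothing of real hardware beyond "queued updates are not seen by a
normal load until the queue is committed".

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one of CPU 2's caches (cache C); all names are hypothetical. */
#define QUEUE_MAX 8

struct toy_cache {
	int value;              /* value currently held in the cache line    */
	int pending[QUEUE_MAX]; /* coherency queue: updates not yet applied  */
	size_t npending;
};

/* An update from another CPU arrives: it is queued, not applied. */
static void toy_queue_update(struct toy_cache *c, int new_value)
{
	assert(c->npending < QUEUE_MAX);
	c->pending[c->npending++] = new_value;
}

/* A normal load hits the line already present and does NOT drain the queue. */
static int toy_load(const struct toy_cache *c)
{
	return c->value;
}

/* Models the effect of the read-side barrier: commit the queue in order. */
static void toy_read_barrier(struct toy_cache *c)
{
	for (size_t i = 0; i < c->npending; i++)
		c->value = c->pending[i];
	c->npending = 0;
}

/* Replays the CPU 2 side of the tables above and returns what x receives. */
static int toy_replay(int use_barrier)
{
	struct toy_cache cache_c = { .value = 1, .npending = 0 }; /* v == 1 */

	toy_queue_update(&cache_c, 2);       /* <C:queue v=2>            */
	/* q = p is served by cache D and already yields &v here.          */
	if (use_barrier)
		toy_read_barrier(&cache_c);  /* <C:commit v=2>           */
	return toy_load(&cache_c);           /* x = *q, i.e. <C:read *q> */
}
```

Calling toy_replay(0) models the sequence without the barrier, where the load
is served from the stale line; toy_replay(1) models the sequence with the
barrier interpolated between the two loads.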
1911 1911
1912 This sort of problem can be encountered on DEC Alpha processors as they have a 1912 This sort of problem can be encountered on DEC Alpha processors as they have a
1913 split cache that improves performance by making better use of the data bus. 1913 split cache that improves performance by making better use of the data bus.
1914 Whilst most CPUs do imply a data dependency barrier on the read when a memory 1914 Whilst most CPUs do imply a data dependency barrier on the read when a memory
1915 access depends on a read, not all do, so it may not be relied on. 1915 access depends on a read, not all do, so it may not be relied on.
1916 1916
1917 Other CPUs may also have split caches, but must coordinate between the various 1917 Other CPUs may also have split caches, but must coordinate between the various
1918 cachelets for normal memory accesses. The semantics of the Alpha removes the 1918 cachelets for normal memory accesses. The semantics of the Alpha removes the
1919 need for coordination in the absence of memory barriers. 1919 need for coordination in the absence of memory barriers.
1920 1920
1921 1921
1922 CACHE COHERENCY VS DMA 1922 CACHE COHERENCY VS DMA
1923 ---------------------- 1923 ----------------------
1924 1924
1925 Not all systems maintain cache coherency with respect to devices doing DMA. In 1925 Not all systems maintain cache coherency with respect to devices doing DMA. In
1926 such cases, a device attempting DMA may obtain stale data from RAM because 1926 such cases, a device attempting DMA may obtain stale data from RAM because
1927 dirty cache lines may be resident in the caches of various CPUs, and may not 1927 dirty cache lines may be resident in the caches of various CPUs, and may not
1928 have been written back to RAM yet. To deal with this, the appropriate part of 1928 have been written back to RAM yet. To deal with this, the appropriate part of
1929 the kernel must flush the overlapping bits of cache on each CPU (and maybe 1929 the kernel must flush the overlapping bits of cache on each CPU (and maybe
1930 invalidate them as well). 1930 invalidate them as well).
1931 1931
1932 In addition, the data DMA'd to RAM by a device may be overwritten by dirty 1932 In addition, the data DMA'd to RAM by a device may be overwritten by dirty
1933 cache lines being written back to RAM from a CPU's cache after the device has 1933 cache lines being written back to RAM from a CPU's cache after the device has
1934 installed its own data, or cache lines simply present in a CPU's cache may 1934 installed its own data, or cache lines simply present in a CPU's cache may
1935 simply obscure the fact that RAM has been updated, until such time as the 1935 simply obscure the fact that RAM has been updated, until such time as the
1936 cacheline is discarded from the CPU's cache and reloaded. To deal with this, 1936 cacheline is discarded from the CPU's cache and reloaded. To deal with this,
1937 the appropriate part of the kernel must invalidate the overlapping bits of the 1937 the appropriate part of the kernel must invalidate the overlapping bits of the
1938 cache on each CPU. 1938 cache on each CPU.
1939 1939
1940 See Documentation/cachetlb.txt for more information on cache management. 1940 See Documentation/cachetlb.txt for more information on cache management.
1941 1941
1942 1942
1943 CACHE COHERENCY VS MMIO 1943 CACHE COHERENCY VS MMIO
1944 ----------------------- 1944 -----------------------
1945 1945
1946 Memory mapped I/O usually takes place through memory locations that are part of 1946 Memory mapped I/O usually takes place through memory locations that are part of
1947 a window in the CPU's memory space that have different properties assigned than 1947 a window in the CPU's memory space that have different properties assigned than
1948 the usual RAM directed window. 1948 the usual RAM directed window.
1949 1949
1950 Amongst these properties is usually the fact that such accesses bypass the 1950 Amongst these properties is usually the fact that such accesses bypass the
1951 caching entirely and go directly to the device buses. This means MMIO accesses 1951 caching entirely and go directly to the device buses. This means MMIO accesses
1952 may, in effect, overtake accesses to cached memory that were emitted earlier. 1952 may, in effect, overtake accesses to cached memory that were emitted earlier.
1953 A memory barrier isn't sufficient in such a case, but rather the cache must be 1953 A memory barrier isn't sufficient in such a case, but rather the cache must be
1954 flushed between the cached memory write and the MMIO access if the two are in 1954 flushed between the cached memory write and the MMIO access if the two are in
1955 any way dependent. 1955 any way dependent.
1956 1956
1957 1957
1958 ========================= 1958 =========================
1959 THE THINGS CPUS GET UP TO 1959 THE THINGS CPUS GET UP TO
1960 ========================= 1960 =========================
1961 1961
1962 A programmer might take it for granted that the CPU will perform memory 1962 A programmer might take it for granted that the CPU will perform memory
1963 operations in exactly the order specified, so that if a CPU is, for example, 1963 operations in exactly the order specified, so that if a CPU is, for example,
1964 given the following piece of code to execute: 1964 given the following piece of code to execute:
1965 1965
1966 a = *A; 1966 a = *A;
1967 *B = b; 1967 *B = b;
1968 c = *C; 1968 c = *C;
1969 d = *D; 1969 d = *D;
1970 *E = e; 1970 *E = e;
1971 1971
1972 They would then expect that the CPU will complete the memory operation for each 1972 They would then expect that the CPU will complete the memory operation for each
1973 instruction before moving on to the next one, leading to a definite sequence of 1973 instruction before moving on to the next one, leading to a definite sequence of
1974 operations as seen by external observers in the system: 1974 operations as seen by external observers in the system:
1975 1975
1976 LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E. 1976 LOAD *A, STORE *B, LOAD *C, LOAD *D, STORE *E.
1977 1977
1978 1978
1979 Reality is, of course, much messier. With many CPUs and compilers, the above 1979 Reality is, of course, much messier. With many CPUs and compilers, the above
1980 assumption doesn't hold because: 1980 assumption doesn't hold because:
1981 1981
1982 (*) loads are more likely to need to be completed immediately to permit 1982 (*) loads are more likely to need to be completed immediately to permit
1983 execution progress, whereas stores can often be deferred without a 1983 execution progress, whereas stores can often be deferred without a
1984 problem; 1984 problem;
1985 1985
1986 (*) loads may be done speculatively, and the result discarded should it prove 1986 (*) loads may be done speculatively, and the result discarded should it prove
1987 to have been unnecessary; 1987 to have been unnecessary;
1988 1988
1989 (*) loads may be done speculatively, leading to the result having been 1989 (*) loads may be done speculatively, leading to the result having been
1990 fetched at the wrong time in the expected sequence of events; 1990 fetched at the wrong time in the expected sequence of events;
1991 1991
1992 (*) the order of the memory accesses may be rearranged to promote better use 1992 (*) the order of the memory accesses may be rearranged to promote better use
1993 of the CPU buses and caches; 1993 of the CPU buses and caches;
1994 1994
1995 (*) loads and stores may be combined to improve performance when talking to 1995 (*) loads and stores may be combined to improve performance when talking to
1996 memory or I/O hardware that can do batched accesses of adjacent locations, 1996 memory or I/O hardware that can do batched accesses of adjacent locations,
1997 thus cutting down on transaction setup costs (memory and PCI devices may 1997 thus cutting down on transaction setup costs (memory and PCI devices may
1998 both be able to do this); and 1998 both be able to do this); and
1999 1999
2000 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency 2000 (*) the CPU's data cache may affect the ordering, and whilst cache-coherency
2001 mechanisms may alleviate this - once the store has actually hit the cache 2001 mechanisms may alleviate this - once the store has actually hit the cache
2002 - there's no guarantee that the coherency management will be propagated in 2002 - there's no guarantee that the coherency management will be propagated in
2003 order to other CPUs. 2003 order to other CPUs.
2004 2004
2005 So what another CPU, say, might actually observe from the above piece of code 2005 So what another CPU, say, might actually observe from the above piece of code
2006 is: 2006 is:
2007 2007
2008 LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B 2008 LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B
2009 2009
2010 (Where "LOAD {*C,*D}" is a combined load) 2010 (Where "LOAD {*C,*D}" is a combined load)
2011 2011
2012 2012
2013 However, it is guaranteed that a CPU will be self-consistent: it will see its 2013 However, it is guaranteed that a CPU will be self-consistent: it will see its
2014 _own_ accesses appear to be correctly ordered, without the need for a memory 2014 _own_ accesses appear to be correctly ordered, without the need for a memory
2015 barrier. For instance with the following code: 2015 barrier. For instance with the following code:
2016 2016
2017 U = *A; 2017 U = *A;
2018 *A = V; 2018 *A = V;
2019 *A = W; 2019 *A = W;
2020 X = *A; 2020 X = *A;
2021 *A = Y; 2021 *A = Y;
2022 Z = *A; 2022 Z = *A;
2023 2023
2024 and assuming no intervention by an external influence, it can be assumed that 2024 and assuming no intervention by an external influence, it can be assumed that
2025 the final result will appear to be: 2025 the final result will appear to be:
2026 2026
2027 U == the original value of *A 2027 U == the original value of *A
2028 X == W 2028 X == W
2029 Z == Y 2029 Z == Y
2030 *A == Y 2030 *A == Y
2031 2031
2032 The code above may cause the CPU to generate the full sequence of memory 2032 The code above may cause the CPU to generate the full sequence of memory
2033 accesses: 2033 accesses:
2034 2034
2035 U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A 2035 U=LOAD *A, STORE *A=V, STORE *A=W, X=LOAD *A, STORE *A=Y, Z=LOAD *A
2036 2036
2037 in that order, but, without intervention, the sequence may have almost any 2037 in that order, but, without intervention, the sequence may have almost any
2038 combination of elements combined or discarded, provided the program's view of 2038 combination of elements combined or discarded, provided the program's view of
2039 the world remains consistent. 2039 the world remains consistent.
2040 2040
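The self-consistency guarantee described above can be written down as a small
C function. This is a minimal sketch; the struct self_view and the function
self_consistent() are hypothetical names introduced for illustration. It only
shows what the executing CPU itself observes, and says nothing about what
other CPUs or devices might see.

```c
#include <assert.h>
#include <stddef.h>

/* What the executing CPU observes of its own accesses (hypothetical type). */
struct self_view {
	int U;
	int X;
	int Z;
	int final; /* value left in *A */
};

/* Performs the access sequence from the text on a single location. */
static struct self_view self_consistent(int *A, int V, int W, int Y)
{
	struct self_view r;

	assert(A != NULL);
	r.U = *A;      /* U == the original value of *A */
	*A = V;
	*A = W;
	r.X = *A;      /* X == W */
	*A = Y;
	r.Z = *A;      /* Z == Y */
	r.final = *A;  /* *A == Y */
	return r;
}
```

Whatever the compiler or CPU combines or discards internally, the values
returned here are fixed by the program order of the accesses.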
2041 The compiler may also combine, discard or defer elements of the sequence before 2041 The compiler may also combine, discard or defer elements of the sequence before
2042 the CPU even sees them. 2042 the CPU even sees them.
2043 2043
2044 For instance: 2044 For instance:
2045 2045
2046 *A = V; 2046 *A = V;
2047 *A = W; 2047 *A = W;
2048 2048
2049 may be reduced to: 2049 may be reduced to:
2050 2050
2051 *A = W; 2051 *A = W;
2052 2052
2053 since, without a write barrier, it can be assumed that the effect of the 2053 since, without a write barrier, it can be assumed that the effect of the
2054 storage of V to *A is lost. Similarly: 2054 storage of V to *A is lost. Similarly:
2055 2055
2056 *A = Y; 2056 *A = Y;
2057 Z = *A; 2057 Z = *A;
2058 2058
2059 may, without a memory barrier, be reduced to: 2059 may, without a memory barrier, be reduced to:
2060 2060
2061 *A = Y; 2061 *A = Y;
2062 Z = Y; 2062 Z = Y;
2063 2063
2064 and the LOAD operation never appears outside of the CPU. 2064 and the LOAD operation never appears outside of the CPU.
2065 2065
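The compiler-side reductions above can be held off with a compiler barrier.
The following is a sketch that assumes a GCC- or Clang-compatible toolchain;
the macro compiler_barrier() and the function store_twice() are hypothetical
names used only here. The barrier constrains the compiler alone: the CPU may
still combine or reorder the resulting accesses as described earlier.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compiler-only barrier (assumes GCC/Clang inline asm syntax): the "memory"
 * clobber tells the compiler that memory may have changed, so it may not
 * merge or cache memory accesses across this point.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Both stores must be emitted, and the final load must be a real reload. */
static int store_twice(int *A, int V, int W)
{
	assert(A != NULL);
	*A = V;
	compiler_barrier(); /* the store of V may not be discarded */
	*A = W;
	compiler_barrier(); /* the load below may not be replaced by W */
	return *A;
}
```

The function still returns W; the barrier changes which accesses the compiler
must emit, not the result the program itself observes.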
2066 2066
2067 AND THEN THERE'S THE ALPHA 2067 AND THEN THERE'S THE ALPHA
2068 -------------------------- 2068 --------------------------
2069 2069
2070 The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that, 2070 The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
2071 some versions of the Alpha CPU have a split data cache, permitting them to have 2071 some versions of the Alpha CPU have a split data cache, permitting them to have
2072 two semantically related cache lines updating at separate times. This is where 2072 two semantically related cache lines updating at separate times. This is where
2073 the data dependency barrier really becomes necessary as this synchronises both 2073 the data dependency barrier really becomes necessary as this synchronises both
2074 caches with the memory coherence system, thus making it seem like pointer 2074 caches with the memory coherence system, thus making it seem like pointer
2075 changes vs new data occur in the right order. 2075 changes vs new data occur in the right order.
2076 2076
2077 The Alpha defines the Linux kernel's memory barrier model. 2077 The Alpha defines the Linux kernel's memory barrier model.
2078 2078
2079 See the subsection on "Cache Coherency" above. 2079 See the subsection on "Cache Coherency" above.
2080 2080
2081 2081
2082 ========== 2082 ==========
2083 REFERENCES 2083 REFERENCES
2084 ========== 2084 ==========
2085 2085
2086 Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek, 2086 Alpha AXP Architecture Reference Manual, Second Edition (Sites & Witek,
2087 Digital Press) 2087 Digital Press)
2088 Chapter 5.2: Physical Address Space Characteristics 2088 Chapter 5.2: Physical Address Space Characteristics
2089 Chapter 5.4: Caches and Write Buffers 2089 Chapter 5.4: Caches and Write Buffers
2090 Chapter 5.5: Data Sharing 2090 Chapter 5.5: Data Sharing
2091 Chapter 5.6: Read/Write Ordering 2091 Chapter 5.6: Read/Write Ordering
2092 2092
2093 AMD64 Architecture Programmer's Manual Volume 2: System Programming 2093 AMD64 Architecture Programmer's Manual Volume 2: System Programming
2094 Chapter 7.1: Memory-Access Ordering 2094 Chapter 7.1: Memory-Access Ordering
2095 Chapter 7.4: Buffering and Combining Memory Writes 2095 Chapter 7.4: Buffering and Combining Memory Writes
2096 2096
2097 IA-32 Intel Architecture Software Developer's Manual, Volume 3: 2097 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2098 System Programming Guide 2098 System Programming Guide
2099 Chapter 7.1: Locked Atomic Operations 2099 Chapter 7.1: Locked Atomic Operations
2100 Chapter 7.2: Memory Ordering 2100 Chapter 7.2: Memory Ordering
2101 Chapter 7.4: Serializing Instructions 2101 Chapter 7.4: Serializing Instructions
2102 2102
2103 The SPARC Architecture Manual, Version 9 2103 The SPARC Architecture Manual, Version 9
2104 Chapter 8: Memory Models 2104 Chapter 8: Memory Models
2105 Appendix D: Formal Specification of the Memory Models 2105 Appendix D: Formal Specification of the Memory Models
2106 Appendix J: Programming with the Memory Models 2106 Appendix J: Programming with the Memory Models
2107 2107
2108 UltraSPARC Programmer Reference Manual 2108 UltraSPARC Programmer Reference Manual
2109 Chapter 5: Memory Accesses and Cacheability 2109 Chapter 5: Memory Accesses and Cacheability
2110 Chapter 15: Sparc-V9 Memory Models 2110 Chapter 15: Sparc-V9 Memory Models
2111 2111
2112 UltraSPARC III Cu User's Manual 2112 UltraSPARC III Cu User's Manual
2113 Chapter 9: Memory Models 2113 Chapter 9: Memory Models
2114 2114
2115 UltraSPARC IIIi Processor User's Manual 2115 UltraSPARC IIIi Processor User's Manual
2116 Chapter 8: Memory Models 2116 Chapter 8: Memory Models
2117 2117
2118 UltraSPARC Architecture 2005 2118 UltraSPARC Architecture 2005
2119 Chapter 9: Memory 2119 Chapter 9: Memory
2120 Appendix D: Formal Specifications of the Memory Models 2120 Appendix D: Formal Specifications of the Memory Models
2121 2121
2122 UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005 2122 UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
2123 Chapter 8: Memory Models 2123 Chapter 8: Memory Models
2124 Appendix F: Caches and Cache Coherency 2124 Appendix F: Caches and Cache Coherency
2125 2125
2126 Solaris Internals, Core Kernel Architecture, p63-68: 2126 Solaris Internals, Core Kernel Architecture, p63-68:
2127 Chapter 3.3: Hardware Considerations for Locks and 2127 Chapter 3.3: Hardware Considerations for Locks and
2128 Synchronization 2128 Synchronization
2129 2129
2130 Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching 2130 Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
2131 for Kernel Programmers: 2131 for Kernel Programmers:
2132 Chapter 13: Other Memory Models 2132 Chapter 13: Other Memory Models
2133 2133
2134 Intel Itanium Architecture Software Developer's Manual: Volume 1: 2134 Intel Itanium Architecture Software Developer's Manual: Volume 1:
2135 Section 2.6: Speculation 2135 Section 2.6: Speculation
2136 Section 4.4: Memory Access 2136 Section 4.4: Memory Access
2137 2137
Documentation/networking/bonding.txt
1 1
2 Linux Ethernet Bonding Driver HOWTO 2 Linux Ethernet Bonding Driver HOWTO
3 3
4 Latest update: 24 April 2006 4 Latest update: 24 April 2006
5 5
6 Initial release : Thomas Davis <tadavis at lbl.gov> 6 Initial release : Thomas Davis <tadavis at lbl.gov>
7 Corrections, HA extensions : 2000/10/03-15 : 7 Corrections, HA extensions : 2000/10/03-15 :
8 - Willy Tarreau <willy at meta-x.org> 8 - Willy Tarreau <willy at meta-x.org>
9 - Constantine Gavrilov <const-g at xpert.com> 9 - Constantine Gavrilov <const-g at xpert.com>
10 - Chad N. Tindel <ctindel at ieee dot org> 10 - Chad N. Tindel <ctindel at ieee dot org>
11 - Janice Girouard <girouard at us dot ibm dot com> 11 - Janice Girouard <girouard at us dot ibm dot com>
12 - Jay Vosburgh <fubar at us dot ibm dot com> 12 - Jay Vosburgh <fubar at us dot ibm dot com>
13 13
14 Reorganized and updated Feb 2005 by Jay Vosburgh 14 Reorganized and updated Feb 2005 by Jay Vosburgh
15 Added Sysfs information: 2006/04/24 15 Added Sysfs information: 2006/04/24
16 - Mitch Williams <mitch.a.williams at intel.com> 16 - Mitch Williams <mitch.a.williams at intel.com>
17 17
18 Introduction 18 Introduction
19 ============ 19 ============
20 20
21 The Linux bonding driver provides a method for aggregating 21 The Linux bonding driver provides a method for aggregating
22 multiple network interfaces into a single logical "bonded" interface. 22 multiple network interfaces into a single logical "bonded" interface.
23 The behavior of the bonded interfaces depends upon the mode; generally 23 The behavior of the bonded interfaces depends upon the mode; generally
24 speaking, modes provide either hot standby or load balancing services. 24 speaking, modes provide either hot standby or load balancing services.
25 Additionally, link integrity monitoring may be performed. 25 Additionally, link integrity monitoring may be performed.
26 26
27 The bonding driver originally came from Donald Becker's 27 The bonding driver originally came from Donald Becker's
28 beowulf patches for kernel 2.0. It has changed quite a bit since, and 28 beowulf patches for kernel 2.0. It has changed quite a bit since, and
29 the original tools from extreme-linux and beowulf sites will not work 29 the original tools from extreme-linux and beowulf sites will not work
30 with this version of the driver. 30 with this version of the driver.
31 31
32 For new versions of the driver, updated userspace tools, and 32 For new versions of the driver, updated userspace tools, and
33 who to ask for help, please follow the links at the end of this file. 33 who to ask for help, please follow the links at the end of this file.
34 34
35 Table of Contents 35 Table of Contents
36 ================= 36 =================
37 37
38 1. Bonding Driver Installation 38 1. Bonding Driver Installation
39 39
40 2. Bonding Driver Options 40 2. Bonding Driver Options
41 41
42 3. Configuring Bonding Devices 42 3. Configuring Bonding Devices
43 3.1 Configuration with Sysconfig Support 43 3.1 Configuration with Sysconfig Support
44 3.1.1 Using DHCP with Sysconfig 44 3.1.1 Using DHCP with Sysconfig
45 3.1.2 Configuring Multiple Bonds with Sysconfig 45 3.1.2 Configuring Multiple Bonds with Sysconfig
46 3.2 Configuration with Initscripts Support 46 3.2 Configuration with Initscripts Support
47 3.2.1 Using DHCP with Initscripts 47 3.2.1 Using DHCP with Initscripts
48 3.2.2 Configuring Multiple Bonds with Initscripts 48 3.2.2 Configuring Multiple Bonds with Initscripts
49 3.3 Configuring Bonding Manually with Ifenslave 49 3.3 Configuring Bonding Manually with Ifenslave
50 3.3.1 Configuring Multiple Bonds Manually 50 3.3.1 Configuring Multiple Bonds Manually
51 3.4 Configuring Bonding Manually via Sysfs 51 3.4 Configuring Bonding Manually via Sysfs
52 52
53 4. Querying Bonding Configuration 53 4. Querying Bonding Configuration
54 4.1 Bonding Configuration 54 4.1 Bonding Configuration
55 4.2 Network Configuration 55 4.2 Network Configuration
56 56
57 5. Switch Configuration 57 5. Switch Configuration
58 58
59 6. 802.1q VLAN Support 59 6. 802.1q VLAN Support
60 60
61 7. Link Monitoring 61 7. Link Monitoring
62 7.1 ARP Monitor Operation 62 7.1 ARP Monitor Operation
63 7.2 Configuring Multiple ARP Targets 63 7.2 Configuring Multiple ARP Targets
64 7.3 MII Monitor Operation 64 7.3 MII Monitor Operation
65 65
66 8. Potential Trouble Sources 66 8. Potential Trouble Sources
67 8.1 Adventures in Routing 67 8.1 Adventures in Routing
68 8.2 Ethernet Device Renaming 68 8.2 Ethernet Device Renaming
69 8.3 Painfully Slow Or No Failed Link Detection By Miimon 69 8.3 Painfully Slow Or No Failed Link Detection By Miimon
70 70
71 9. SNMP agents 71 9. SNMP agents
72 72
73 10. Promiscuous mode 73 10. Promiscuous mode
74 74
75 11. Configuring Bonding for High Availability 75 11. Configuring Bonding for High Availability
76 11.1 High Availability in a Single Switch Topology 76 11.1 High Availability in a Single Switch Topology
77 11.2 High Availability in a Multiple Switch Topology 77 11.2 High Availability in a Multiple Switch Topology
78 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology 78 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
79 11.2.2 HA Link Monitoring for Multiple Switch Topology 79 11.2.2 HA Link Monitoring for Multiple Switch Topology
80 80
81 12. Configuring Bonding for Maximum Throughput 81 12. Configuring Bonding for Maximum Throughput
82 12.1 Maximum Throughput in a Single Switch Topology 82 12.1 Maximum Throughput in a Single Switch Topology
83 12.1.1 MT Bonding Mode Selection for Single Switch Topology 83 12.1.1 MT Bonding Mode Selection for Single Switch Topology
84 12.1.2 MT Link Monitoring for Single Switch Topology 84 12.1.2 MT Link Monitoring for Single Switch Topology
85 12.2 Maximum Throughput in a Multiple Switch Topology 85 12.2 Maximum Throughput in a Multiple Switch Topology
86 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology 86 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
87 12.2.2 MT Link Monitoring for Multiple Switch Topology 87 12.2.2 MT Link Monitoring for Multiple Switch Topology
88 88
89 13. Switch Behavior Issues 89 13. Switch Behavior Issues
90 13.1 Link Establishment and Failover Delays 90 13.1 Link Establishment and Failover Delays
91 13.2 Duplicated Incoming Packets 91 13.2 Duplicated Incoming Packets
92 92
93 14. Hardware Specific Considerations 93 14. Hardware Specific Considerations
94 14.1 IBM BladeCenter 94 14.1 IBM BladeCenter
95 95
96 15. Frequently Asked Questions 96 15. Frequently Asked Questions
97 97
98 16. Resources and Links 98 16. Resources and Links
99 99
100 100
101 1. Bonding Driver Installation 101 1. Bonding Driver Installation
102 ============================== 102 ==============================
103 103
104 Most popular distro kernels ship with the bonding driver 104 Most popular distro kernels ship with the bonding driver
105 already available as a module and the ifenslave user level control 105 already available as a module and the ifenslave user level control
106 program installed and ready for use. If your distro does not, or you 106 program installed and ready for use. If your distro does not, or you
107 have need to compile bonding from source (e.g., configuring and 107 have need to compile bonding from source (e.g., configuring and
108 installing a mainline kernel from kernel.org), you'll need to perform 108 installing a mainline kernel from kernel.org), you'll need to perform
109 the following steps: 109 the following steps:
110 110
111 1.1 Configure and build the kernel with bonding 111 1.1 Configure and build the kernel with bonding
112 ----------------------------------------------- 112 -----------------------------------------------
113 113
114 The current version of the bonding driver is available in the 114 The current version of the bonding driver is available in the
115 drivers/net/bonding subdirectory of the most recent kernel source 115 drivers/net/bonding subdirectory of the most recent kernel source
116 (which is available on http://kernel.org). Most users "rolling their 116 (which is available on http://kernel.org). Most users "rolling their
117 own" will want to use the most recent kernel from kernel.org. 117 own" will want to use the most recent kernel from kernel.org.
118 118
119 Configure kernel with "make menuconfig" (or "make xconfig" or 119 Configure kernel with "make menuconfig" (or "make xconfig" or
120 "make config"), then select "Bonding driver support" in the "Network 120 "make config"), then select "Bonding driver support" in the "Network
121 device support" section. It is recommended that you configure the 121 device support" section. It is recommended that you configure the
122 driver as a module since it is currently the only way to pass parameters 122 driver as a module since it is currently the only way to pass parameters
123 to the driver or configure more than one bonding device. 123 to the driver or configure more than one bonding device.
124 124
125 Build and install the new kernel and modules, then continue 125 Build and install the new kernel and modules, then continue
126 below to install ifenslave. 126 below to install ifenslave.
127 127
128 1.2 Install ifenslave Control Utility 128 1.2 Install ifenslave Control Utility
129 ------------------------------------- 129 -------------------------------------
130 130
131 The ifenslave user level control program is included in the 131 The ifenslave user level control program is included in the
132 kernel source tree, in the file Documentation/networking/ifenslave.c. 132 kernel source tree, in the file Documentation/networking/ifenslave.c.
133 It is generally recommended that you use the ifenslave that 133 It is generally recommended that you use the ifenslave that
134 corresponds to the kernel that you are using (either from the same 134 corresponds to the kernel that you are using (either from the same
135 source tree or supplied with the distro); however, ifenslave 135 source tree or supplied with the distro); however, ifenslave
136 executables from older kernels should function (but features newer 136 executables from older kernels should function (but features newer
137 than the ifenslave release are not supported). Running an ifenslave 137 than the ifenslave release are not supported). Running an ifenslave
138 that is newer than the kernel is not supported, and may or may not 138 that is newer than the kernel is not supported, and may or may not
139 work. 139 work.
140 140
141 To install ifenslave, do the following: 141 To install ifenslave, do the following:
142 142
143 # gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave 143 # gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave
144 # cp ifenslave /sbin/ifenslave 144 # cp ifenslave /sbin/ifenslave
145 145
146 If your kernel source is not in "/usr/src/linux," then replace 146 If your kernel source is not in "/usr/src/linux," then replace
147 "/usr/src/linux/include" in the above with the location of your kernel 147 "/usr/src/linux/include" in the above with the location of your kernel
148 source include directory. 148 source include directory.
149 149
150 You may wish to back up any existing /sbin/ifenslave, or, for 150 You may wish to back up any existing /sbin/ifenslave, or, for
151 testing or informal use, tag the ifenslave to the kernel version 151 testing or informal use, tag the ifenslave to the kernel version
152 (e.g., name the ifenslave executable /sbin/ifenslave-2.6.10). 152 (e.g., name the ifenslave executable /sbin/ifenslave-2.6.10).
153 153
154 IMPORTANT NOTE: 154 IMPORTANT NOTE:
155 155
156 If you omit the "-I" or specify an incorrect directory, you 156 If you omit the "-I" or specify an incorrect directory, you
157 may end up with an ifenslave that is incompatible with the kernel 157 may end up with an ifenslave that is incompatible with the kernel
158 you're trying to build it for. Some distros (e.g., Red Hat from 7.1 158 you're trying to build it for. Some distros (e.g., Red Hat from 7.1
159 onwards) do not have /usr/include/linux symbolically linked to the 159 onwards) do not have /usr/include/linux symbolically linked to the
160 default kernel source include directory. 160 default kernel source include directory.
161 161
162 SECOND IMPORTANT NOTE: 162 SECOND IMPORTANT NOTE:
163 If you plan to configure bonding using sysfs, you do not need 163 If you plan to configure bonding using sysfs, you do not need
164 to use ifenslave. 164 to use ifenslave.
165 165
166 2. Bonding Driver Options 166 2. Bonding Driver Options
167 ========================= 167 =========================
168 168
169 Options for the bonding driver are supplied as parameters to 169 Options for the bonding driver are supplied as parameters to
170 the bonding module at load time. They may be given as command line 170 the bonding module at load time. They may be given as command line
171 arguments to the insmod or modprobe command, but are usually specified 171 arguments to the insmod or modprobe command, but are usually specified
172 in either the /etc/modules.conf or /etc/modprobe.conf configuration 172 in either the /etc/modules.conf or /etc/modprobe.conf configuration
173 file, or in a distro-specific configuration file (some of which are 173 file, or in a distro-specific configuration file (some of which are
174 detailed in the next section). 174 detailed in the next section).
175 175
176 The available bonding driver parameters are listed below. If a 176 The available bonding driver parameters are listed below. If a
177 parameter is not specified the default value is used. When initially 177 parameter is not specified the default value is used. When initially
178 configuring a bond, it is recommended "tail -f /var/log/messages" be 178 configuring a bond, it is recommended "tail -f /var/log/messages" be
179 run in a separate window to watch for bonding driver error messages. 179 run in a separate window to watch for bonding driver error messages.
180 180
181 It is critical that either the miimon or arp_interval and 181 It is critical that either the miimon or arp_interval and
182 arp_ip_target parameters be specified, otherwise serious network 182 arp_ip_target parameters be specified, otherwise serious network
183 degradation will occur during link failures. Very few devices do not 183 degradation will occur during link failures. Very few devices do not
184 support at least miimon, so there is really no reason not to use it. 184 support at least miimon, so there is really no reason not to use it.
185 185
186 Options with textual values will accept either the text name 186 Options with textual values will accept either the text name
187 or, for backwards compatibility, the option value. E.g., 187 or, for backwards compatibility, the option value. E.g.,
188 "mode=802.3ad" and "mode=4" set the same mode. 188 "mode=802.3ad" and "mode=4" set the same mode.
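
The equivalence between text names and numeric values can be sketched as a small lookup. This is only an illustration built from the mode list given under the "mode" option in this document; the names BONDING_MODES and normalize_mode are hypothetical and are not part of the bonding driver.

```python
# Mode names and numeric values as listed under the "mode" option
# of this document. The table and helper are illustrative only.
BONDING_MODES = {
    "balance-rr": 0,
    "active-backup": 1,
    "balance-xor": 2,
    "broadcast": 3,
    "802.3ad": 4,
    "balance-tlb": 5,
    "balance-alb": 6,
}


def normalize_mode(value):
    """Return the numeric mode for a text name or a numeric string."""
    if value in BONDING_MODES:
        return BONDING_MODES[value]
    number = int(value)  # raises ValueError for unknown text names
    if number not in BONDING_MODES.values():
        raise ValueError(f"unknown bonding mode: {value!r}")
    return number
```

With this table, normalize_mode("802.3ad") and normalize_mode("4") return the same value, mirroring the "mode=802.3ad" and "mode=4" example above.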
189 189
190 The parameters are as follows: 190 The parameters are as follows:
191 191
192 arp_interval 192 arp_interval
193 193
194 Specifies the ARP link monitoring frequency in milliseconds. 194 Specifies the ARP link monitoring frequency in milliseconds.
195 195
196 The ARP monitor works by periodically checking the slave 196 The ARP monitor works by periodically checking the slave
197 devices to determine whether they have sent or received 197 devices to determine whether they have sent or received
198 traffic recently (the precise criteria depend upon the 198 traffic recently (the precise criteria depend upon the
199 bonding mode, and the state of the slave). Regular traffic is 199 bonding mode, and the state of the slave). Regular traffic is
200 generated via ARP probes issued for the addresses specified by 200 generated via ARP probes issued for the addresses specified by
201 the arp_ip_target option. 201 the arp_ip_target option.
202 202
203 This behavior can be modified by the arp_validate option, 203 This behavior can be modified by the arp_validate option,
204 below. 204 below.
205 205
206 If ARP monitoring is used in an etherchannel compatible mode 206 If ARP monitoring is used in an etherchannel compatible mode
207 (modes 0 and 2), the switch should be configured in a mode 207 (modes 0 and 2), the switch should be configured in a mode
208 that evenly distributes packets across all links. If the 208 that evenly distributes packets across all links. If the
209 switch is configured to distribute the packets in an XOR 209 switch is configured to distribute the packets in an XOR
210 fashion, all replies from the ARP targets will be received on 210 fashion, all replies from the ARP targets will be received on
211 the same link which could cause the other team members to 211 the same link which could cause the other team members to
212 fail. ARP monitoring should not be used in conjunction with 212 fail. ARP monitoring should not be used in conjunction with
213 miimon. A value of 0 disables ARP monitoring. The default 213 miimon. A value of 0 disables ARP monitoring. The default
214 value is 0. 214 value is 0.
215 215
216 arp_ip_target 216 arp_ip_target
217 217
218 Specifies the IP addresses to use as ARP monitoring peers when 218 Specifies the IP addresses to use as ARP monitoring peers when
219 arp_interval is > 0. These are the targets of the ARP request 219 arp_interval is > 0. These are the targets of the ARP request
220 sent to determine the health of the link to the targets. 220 sent to determine the health of the link to the targets.
221 Specify these values in ddd.ddd.ddd.ddd format. Multiple IP 221 Specify these values in ddd.ddd.ddd.ddd format. Multiple IP
222 addresses must be separated by a comma. At least one IP 222 addresses must be separated by a comma. At least one IP
223 address must be given for ARP monitoring to function. The 223 address must be given for ARP monitoring to function. The
224 maximum number of targets that can be specified is 16. The 224 maximum number of targets that can be specified is 16. The
225 default value is no IP addresses. 225 default value is no IP addresses.
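
The constraints described above (comma-separated dotted-quad addresses, at most 16 targets) can be checked before a value is handed to the driver. The following is a minimal sketch of such a check; parse_arp_ip_targets is a hypothetical helper, and it does not claim to reproduce the driver's own parsing.

```python
import ipaddress

MAX_ARP_TARGETS = 16  # limit stated in the arp_ip_target description


def parse_arp_ip_targets(value):
    """Split a comma-separated arp_ip_target string and validate it.

    Hypothetical pre-flight check; raises ValueError on bad input.
    """
    targets = value.split(",") if value else []
    if len(targets) > MAX_ARP_TARGETS:
        raise ValueError(
            f"{len(targets)} targets given, at most {MAX_ARP_TARGETS} allowed"
        )
    # IPv4Address raises a ValueError subclass for malformed addresses.
    return [str(ipaddress.IPv4Address(target)) for target in targets]
```

An empty string yields an empty list, which corresponds to the default of no IP addresses; ARP monitoring needs at least one target to function.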
226 226
227 arp_validate 227 arp_validate
228 228
229 Specifies whether or not ARP probes and replies should be 229 Specifies whether or not ARP probes and replies should be
230 validated in the active-backup mode. This causes the ARP 230 validated in the active-backup mode. This causes the ARP
231 monitor to examine the incoming ARP requests and replies, and 231 monitor to examine the incoming ARP requests and replies, and
232 only consider a slave to be up if it is receiving the 232 only consider a slave to be up if it is receiving the
233 appropriate ARP traffic. 233 appropriate ARP traffic.
234 234
235 Possible values are: 235 Possible values are:
236 236
237 none or 0 237 none or 0
238 238
239 No validation is performed. This is the default. 239 No validation is performed. This is the default.
240 240
241 active or 1 241 active or 1
242 242
243 Validation is performed only for the active slave. 243 Validation is performed only for the active slave.
244 244
245 backup or 2 245 backup or 2
246 246
247 Validation is performed only for backup slaves. 247 Validation is performed only for backup slaves.
248 248
249 all or 3 249 all or 3
250 250
251 Validation is performed for all slaves. 251 Validation is performed for all slaves.
252 252
253 For the active slave, the validation checks ARP replies to 253 For the active slave, the validation checks ARP replies to
254 confirm that they were generated by an arp_ip_target. Since 254 confirm that they were generated by an arp_ip_target. Since
255 backup slaves do not typically receive these replies, the 255 backup slaves do not typically receive these replies, the
256 validation performed for backup slaves is on the ARP request 256 validation performed for backup slaves is on the ARP request
257 sent out via the active slave. It is possible that some 257 sent out via the active slave. It is possible that some
258 switch or network configurations may result in situations 258 switch or network configurations may result in situations
259 wherein the backup slaves do not receive the ARP requests; in 259 wherein the backup slaves do not receive the ARP requests; in
260 such a situation, validation of backup slaves must be 260 such a situation, validation of backup slaves must be
261 disabled. 261 disabled.
262 262
263 This option is useful in network configurations in which 263 This option is useful in network configurations in which
264 multiple bonding hosts are concurrently issuing ARPs to one or 264 multiple bonding hosts are concurrently issuing ARPs to one or
265 more targets beyond a common switch. Should the link between 265 more targets beyond a common switch. Should the link between
266 the switch and target fail (but not the switch itself), the 266 the switch and target fail (but not the switch itself), the
267 probe traffic generated by the multiple bonding instances will 267 probe traffic generated by the multiple bonding instances will
268 fool the standard ARP monitor into considering the links as 268 fool the standard ARP monitor into considering the links as
269 still up. Use of the arp_validate option can resolve this, as 269 still up. Use of the arp_validate option can resolve this, as
270 the ARP monitor will only consider ARP requests and replies 270 the ARP monitor will only consider ARP requests and replies
271 associated with its own instance of bonding. 271 associated with its own instance of bonding.
272 272
273 This option was added in bonding version 3.1.0. 273 This option was added in bonding version 3.1.0.
274 274
275 downdelay 275 downdelay
276 276
277 Specifies the time, in milliseconds, to wait before disabling 277 Specifies the time, in milliseconds, to wait before disabling
278 a slave after a link failure has been detected. This option 278 a slave after a link failure has been detected. This option
279 is only valid for the miimon link monitor. The downdelay 279 is only valid for the miimon link monitor. The downdelay
280 value should be a multiple of the miimon value; if not, it 280 value should be a multiple of the miimon value; if not, it
281 will be rounded down to the nearest multiple. The default 281 will be rounded down to the nearest multiple. The default
282 value is 0. 282 value is 0.
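
The rounding rule for downdelay (and likewise for updelay, described later) amounts to integer division by the miimon value. A minimal sketch, assuming both values are non-negative integers in milliseconds; effective_delay is a hypothetical name used only for illustration.

```python
def effective_delay(delay_ms, miimon_ms):
    """Round a delay down to the nearest multiple of miimon.

    Illustrative arithmetic only; miimon must be positive, since the
    delays are only meaningful when the MII monitor is enabled.
    """
    if miimon_ms <= 0:
        raise ValueError("miimon must be greater than 0")
    return (delay_ms // miimon_ms) * miimon_ms
```

For example, with miimon=100 a downdelay of 250 would behave as 200, and a downdelay of 50 would behave as 0.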
283 283
284 lacp_rate 284 lacp_rate
285 285
286 Option specifying the rate at which we'll ask our link partner 286 Option specifying the rate at which we'll ask our link partner
287 to transmit LACPDU packets in 802.3ad mode. Possible values 287 to transmit LACPDU packets in 802.3ad mode. Possible values
288 are: 288 are:
289 289
290 slow or 0 290 slow or 0
291 Request partner to transmit LACPDUs every 30 seconds 291 Request partner to transmit LACPDUs every 30 seconds
292 292
293 fast or 1 293 fast or 1
294 Request partner to transmit LACPDUs every 1 second 294 Request partner to transmit LACPDUs every 1 second
295 295
296 The default is slow. 296 The default is slow.
297 297
298 max_bonds 298 max_bonds
299 299
300 Specifies the number of bonding devices to create for this 300 Specifies the number of bonding devices to create for this
301 instance of the bonding driver. E.g., if max_bonds is 3, and 301 instance of the bonding driver. E.g., if max_bonds is 3, and
302 the bonding driver is not already loaded, then bond0, bond1 302 the bonding driver is not already loaded, then bond0, bond1
303 and bond2 will be created. The default value is 1. 303 and bond2 will be created. The default value is 1.
304 304
305 miimon 305 miimon
306 306
307 Specifies the MII link monitoring frequency in milliseconds. 307 Specifies the MII link monitoring frequency in milliseconds.
308 This determines how often the link state of each slave is 308 This determines how often the link state of each slave is
309 inspected for link failures. A value of zero disables MII 309 inspected for link failures. A value of zero disables MII
310 link monitoring. A value of 100 is a good starting point. 310 link monitoring. A value of 100 is a good starting point.
311 The use_carrier option, below, affects how the link state is 311 The use_carrier option, below, affects how the link state is
312 determined. See the High Availability section for additional 312 determined. See the High Availability section for additional
313 information. The default value is 0. 313 information. The default value is 0.
314 314
315 mode 315 mode
316 316
317 Specifies one of the bonding policies. The default is 317 Specifies one of the bonding policies. The default is
318 balance-rr (round robin). Possible values are: 318 balance-rr (round robin). Possible values are:
319 319
320 balance-rr or 0 320 balance-rr or 0
321 321
322 Round-robin policy: Transmit packets in sequential 322 Round-robin policy: Transmit packets in sequential
323 order from the first available slave through the 323 order from the first available slave through the
324 last. This mode provides load balancing and fault 324 last. This mode provides load balancing and fault
325 tolerance. 325 tolerance.
326 326
327 active-backup or 1 327 active-backup or 1
328 328
329 Active-backup policy: Only one slave in the bond is 329 Active-backup policy: Only one slave in the bond is
330 active. A different slave becomes active if, and only 330 active. A different slave becomes active if, and only
331 if, the active slave fails. The bond's MAC address is 331 if, the active slave fails. The bond's MAC address is
332 externally visible on only one port (network adapter) 332 externally visible on only one port (network adapter)
333 to avoid confusing the switch. 333 to avoid confusing the switch.
334 334
335 In bonding version 2.6.2 or later, when a failover 335 In bonding version 2.6.2 or later, when a failover
336 occurs in active-backup mode, bonding will issue one 336 occurs in active-backup mode, bonding will issue one
337 or more gratuitous ARPs on the newly active slave. 337 or more gratuitous ARPs on the newly active slave.
338 One gratuitous ARP is issued for the bonding master 338 One gratuitous ARP is issued for the bonding master
339 interface and each VLAN interface configured above 339 interface and each VLAN interface configured above
340 it, provided that the interface has at least one IP 340 it, provided that the interface has at least one IP
341 address configured. Gratuitous ARPs issued for VLAN 341 address configured. Gratuitous ARPs issued for VLAN
342 interfaces are tagged with the appropriate VLAN id. 342 interfaces are tagged with the appropriate VLAN id.
343 343
344 This mode provides fault tolerance. The primary 344 This mode provides fault tolerance. The primary
345 option, documented below, affects the behavior of this 345 option, documented below, affects the behavior of this
346 mode. 346 mode.
347 347
348 balance-xor or 2 348 balance-xor or 2
349 349
350 XOR policy: Transmit based on the selected transmit 350 XOR policy: Transmit based on the selected transmit
351 hash policy. The default policy is a simple [(source 351 hash policy. The default policy is a simple [(source
352 MAC address XOR'd with destination MAC address) modulo 352 MAC address XOR'd with destination MAC address) modulo
353 slave count]. Alternate transmit policies may be 353 slave count]. Alternate transmit policies may be
354 selected via the xmit_hash_policy option, described 354 selected via the xmit_hash_policy option, described
355 below. 355 below.
356 356
357 This mode provides load balancing and fault tolerance. 357 This mode provides load balancing and fault tolerance.
358 358
359 broadcast or 3 359 broadcast or 3
360 360
361 Broadcast policy: transmits everything on all slave 361 Broadcast policy: transmits everything on all slave
362 interfaces. This mode provides fault tolerance. 362 interfaces. This mode provides fault tolerance.
363 363
364 802.3ad or 4 364 802.3ad or 4
365 365
366 IEEE 802.3ad Dynamic link aggregation. Creates 366 IEEE 802.3ad Dynamic link aggregation. Creates
367 aggregation groups that share the same speed and 367 aggregation groups that share the same speed and
368 duplex settings. Utilizes all slaves in the active 368 duplex settings. Utilizes all slaves in the active
369 aggregator according to the 802.3ad specification. 369 aggregator according to the 802.3ad specification.
370 370
371 Slave selection for outgoing traffic is done according 371 Slave selection for outgoing traffic is done according
372 to the transmit hash policy, which may be changed from 372 to the transmit hash policy, which may be changed from
373 the default simple XOR policy via the xmit_hash_policy 373 the default simple XOR policy via the xmit_hash_policy
374 option, documented below. Note that not all transmit 374 option, documented below. Note that not all transmit
375 policies may be 802.3ad compliant, particularly in 375 policies may be 802.3ad compliant, particularly in
376 regards to the packet mis-ordering requirements of 376 regards to the packet mis-ordering requirements of
377 section 43.2.4 of the 802.3ad standard. Differing 377 section 43.2.4 of the 802.3ad standard. Differing
378 peer implementations will have varying tolerances for 378 peer implementations will have varying tolerances for
379 noncompliance. 379 noncompliance.
380 380
381 Prerequisites: 381 Prerequisites:
382 382
383 1. Ethtool support in the base drivers for retrieving 383 1. Ethtool support in the base drivers for retrieving
384 the speed and duplex of each slave. 384 the speed and duplex of each slave.
385 385
386 2. A switch that supports IEEE 802.3ad Dynamic link 386 2. A switch that supports IEEE 802.3ad Dynamic link
387 aggregation. 387 aggregation.
388 388
389 Most switches will require some type of configuration 389 Most switches will require some type of configuration
390 to enable 802.3ad mode. 390 to enable 802.3ad mode.
391 391
392 balance-tlb or 5 392 balance-tlb or 5
393 393
394 Adaptive transmit load balancing: channel bonding that 394 Adaptive transmit load balancing: channel bonding that
395 does not require any special switch support. The 395 does not require any special switch support. The
396 outgoing traffic is distributed according to the 396 outgoing traffic is distributed according to the
397 current load (computed relative to the speed) on each 397 current load (computed relative to the speed) on each
398 slave. Incoming traffic is received by the current 398 slave. Incoming traffic is received by the current
399 slave. If the receiving slave fails, another slave 399 slave. If the receiving slave fails, another slave
400 takes over the MAC address of the failed receiving 400 takes over the MAC address of the failed receiving
401 slave. 401 slave.
402 402
403 Prerequisite: 403 Prerequisite:
404 404
405 Ethtool support in the base drivers for retrieving the 405 Ethtool support in the base drivers for retrieving the
406 speed of each slave. 406 speed of each slave.
407 407
408 balance-alb or 6 408 balance-alb or 6
409 409
410 Adaptive load balancing: includes balance-tlb plus 410 Adaptive load balancing: includes balance-tlb plus
411 receive load balancing (rlb) for IPV4 traffic, and 411 receive load balancing (rlb) for IPV4 traffic, and
412 does not require any special switch support. The 412 does not require any special switch support. The
413 receive load balancing is achieved by ARP negotiation. 413 receive load balancing is achieved by ARP negotiation.
414 The bonding driver intercepts the ARP Replies sent by 414 The bonding driver intercepts the ARP Replies sent by
415 the local system on their way out and overwrites the 415 the local system on their way out and overwrites the
416 source hardware address with the unique hardware 416 source hardware address with the unique hardware
417 address of one of the slaves in the bond such that 417 address of one of the slaves in the bond such that
418 different peers use different hardware addresses for 418 different peers use different hardware addresses for
419 the server. 419 the server.
420 420
421 Receive traffic from connections created by the server 421 Receive traffic from connections created by the server
422 is also balanced. When the local system sends an ARP 422 is also balanced. When the local system sends an ARP
423 Request the bonding driver copies and saves the peer's 423 Request the bonding driver copies and saves the peer's
424 IP information from the ARP packet. When the ARP 424 IP information from the ARP packet. When the ARP
425 Reply arrives from the peer, its hardware address is 425 Reply arrives from the peer, its hardware address is
426 retrieved and the bonding driver initiates an ARP 426 retrieved and the bonding driver initiates an ARP
427 reply to this peer assigning it to one of the slaves 427 reply to this peer assigning it to one of the slaves
428 in the bond. A problematic outcome of using ARP 428 in the bond. A problematic outcome of using ARP
429 negotiation for balancing is that each time that an 429 negotiation for balancing is that each time that an
430 ARP request is broadcast it uses the hardware address 430 ARP request is broadcast it uses the hardware address
431 of the bond. Hence, peers learn the hardware address 431 of the bond. Hence, peers learn the hardware address
432 of the bond and the balancing of receive traffic 432 of the bond and the balancing of receive traffic
433 collapses to the current slave. This is handled by 433 collapses to the current slave. This is handled by
434 sending updates (ARP Replies) to all the peers with 434 sending updates (ARP Replies) to all the peers with
435 their individually assigned hardware address such that 435 their individually assigned hardware address such that
436 the traffic is redistributed. Receive traffic is also 436 the traffic is redistributed. Receive traffic is also
437 redistributed when a new slave is added to the bond 437 redistributed when a new slave is added to the bond
438 and when an inactive slave is re-activated. The 438 and when an inactive slave is re-activated. The
439 receive load is distributed sequentially (round robin) 439 receive load is distributed sequentially (round robin)
440 among the group of highest speed slaves in the bond. 440 among the group of highest speed slaves in the bond.
441 441
442 When a link is reconnected or a new slave joins the 442 When a link is reconnected or a new slave joins the
443 bond the receive traffic is redistributed among all 443 bond the receive traffic is redistributed among all
444 active slaves in the bond by initiating ARP Replies 444 active slaves in the bond by initiating ARP Replies
445 with the selected MAC address to each of the 445 with the selected MAC address to each of the
446 clients. The updelay parameter (detailed below) must 446 clients. The updelay parameter (detailed below) must
447 be set to a value equal to or greater than the switch's 447 be set to a value equal to or greater than the switch's
448 forwarding delay so that the ARP Replies sent to the 448 forwarding delay so that the ARP Replies sent to the
449 peers will not be blocked by the switch. 449 peers will not be blocked by the switch.
450 450
451 Prerequisites: 451 Prerequisites:
452 452
453 1. Ethtool support in the base drivers for retrieving 453 1. Ethtool support in the base drivers for retrieving
454 the speed of each slave. 454 the speed of each slave.
455 455
456 2. Base driver support for setting the hardware 456 2. Base driver support for setting the hardware
457 address of a device while it is open. This is 457 address of a device while it is open. This is
458 required so that there will always be one slave in the 458 required so that there will always be one slave in the
459 team using the bond hardware address (the 459 team using the bond hardware address (the
460 curr_active_slave) while having a unique hardware 460 curr_active_slave) while having a unique hardware
461 address for each slave in the bond. If the 461 address for each slave in the bond. If the
462 curr_active_slave fails its hardware address is 462 curr_active_slave fails its hardware address is
463 swapped with the new curr_active_slave that was 463 swapped with the new curr_active_slave that was
464 chosen. 464 chosen.
465 465
466 primary 466 primary
467 467
468 A string (eth0, eth2, etc) specifying which slave is the 468 A string (eth0, eth2, etc) specifying which slave is the
469 primary device. The specified device will always be the 469 primary device. The specified device will always be the
470 active slave while it is available. Only when the primary is 470 active slave while it is available. Only when the primary is
471 off-line will alternate devices be used. This is useful when 471 off-line will alternate devices be used. This is useful when
472 one slave is preferred over another, e.g., when one slave has 472 one slave is preferred over another, e.g., when one slave has
473 higher throughput than another. 473 higher throughput than another.
474 474
475 The primary option is only valid for active-backup mode. 475 The primary option is only valid for active-backup mode.
476 476
477 updelay 477 updelay
478 478
479 Specifies the time, in milliseconds, to wait before enabling a 479 Specifies the time, in milliseconds, to wait before enabling a
480 slave after a link recovery has been detected. This option is 480 slave after a link recovery has been detected. This option is
481 only valid for the miimon link monitor. The updelay value 481 only valid for the miimon link monitor. The updelay value
482 should be a multiple of the miimon value; if not, it will be 482 should be a multiple of the miimon value; if not, it will be
483 rounded down to the nearest multiple. The default value is 0. 483 rounded down to the nearest multiple. The default value is 0.
484 484
485 use_carrier 485 use_carrier
486 486
487 Specifies whether or not miimon should use MII or ETHTOOL 487 Specifies whether or not miimon should use MII or ETHTOOL
488 ioctls vs. netif_carrier_ok() to determine the link 488 ioctls vs. netif_carrier_ok() to determine the link
489 status. The MII or ETHTOOL ioctls are less efficient and 489 status. The MII or ETHTOOL ioctls are less efficient and
490 utilize a deprecated calling sequence within the kernel. The 490 utilize a deprecated calling sequence within the kernel. The
491 netif_carrier_ok() relies on the device driver to maintain its 491 netif_carrier_ok() relies on the device driver to maintain its
492 state with netif_carrier_on/off; at this writing, most, but 492 state with netif_carrier_on/off; at this writing, most, but
493 not all, device drivers support this facility. 493 not all, device drivers support this facility.
494 494
495 If bonding insists that the link is up when it should not be, 495 If bonding insists that the link is up when it should not be,
496 it may be that your network device driver does not support 496 it may be that your network device driver does not support
497 netif_carrier_on/off. The default state for netif_carrier is 497 netif_carrier_on/off. The default state for netif_carrier is
498 "carrier on," so if a driver does not support netif_carrier, 498 "carrier on," so if a driver does not support netif_carrier,
499 it will appear as if the link is always up. In this case, 499 it will appear as if the link is always up. In this case,
500 setting use_carrier to 0 will cause bonding to revert to the 500 setting use_carrier to 0 will cause bonding to revert to the
501 MII / ETHTOOL ioctl method to determine the link state. 501 MII / ETHTOOL ioctl method to determine the link state.
502 502
503 A value of 1 enables the use of netif_carrier_ok(), a value of 503 A value of 1 enables the use of netif_carrier_ok(), a value of
504 0 will use the deprecated MII / ETHTOOL ioctls. The default 504 0 will use the deprecated MII / ETHTOOL ioctls. The default
505 value is 1. 505 value is 1.
506 506
507 xmit_hash_policy 507 xmit_hash_policy
508 508
509 Selects the transmit hash policy to use for slave selection in 509 Selects the transmit hash policy to use for slave selection in
510 balance-xor and 802.3ad modes. Possible values are: 510 balance-xor and 802.3ad modes. Possible values are:
511 511
512 layer2 512 layer2
513 513
514 Uses XOR of hardware MAC addresses to generate the 514 Uses XOR of hardware MAC addresses to generate the
515 hash. The formula is 515 hash. The formula is
516 516
517 (source MAC XOR destination MAC) modulo slave count 517 (source MAC XOR destination MAC) modulo slave count
518 518
519 This algorithm will place all traffic to a particular 519 This algorithm will place all traffic to a particular
520 network peer on the same slave. 520 network peer on the same slave.
521 521
522 This algorithm is 802.3ad compliant. 522 This algorithm is 802.3ad compliant.
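The layer2 formula above can be sketched in Python as follows. This is an illustration of the formula as written, not the driver's code: the helper names are hypothetical, and treating the whole MAC address as one 48-bit integer is an assumption (a driver may well hash only part of the address).

```python
def mac_to_int(mac: str) -> int:
    """Convert a MAC address such as '00:11:22:33:44:55' to an integer."""
    return int(mac.replace(":", ""), 16)


def layer2_hash(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """(source MAC XOR destination MAC) modulo slave count."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count
```

Because the result depends only on the two MAC addresses, every packet sent to a given peer maps to the same slave index.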
523 523
524 layer3+4 524 layer3+4
525 525
526 This policy uses upper layer protocol information, 526 This policy uses upper layer protocol information,
527 when available, to generate the hash. This allows for 527 when available, to generate the hash. This allows for
528 traffic to a particular network peer to span multiple 528 traffic to a particular network peer to span multiple
529 slaves, although a single connection will not span 529 slaves, although a single connection will not span
530 multiple slaves. 530 multiple slaves.
531 531
532 The formula for unfragmented TCP and UDP packets is 532 The formula for unfragmented TCP and UDP packets is
533 533
534 ((source port XOR dest port) XOR 534 ((source port XOR dest port) XOR
535 ((source IP XOR dest IP) AND 0xffff)) 535 ((source IP XOR dest IP) AND 0xffff))
536 modulo slave count 536 modulo slave count
537 537
538 For fragmented TCP or UDP packets and all other IP 538 For fragmented TCP or UDP packets and all other IP
539 protocol traffic, the source and destination port 539 protocol traffic, the source and destination port
540 information is omitted. For non-IP traffic, the 540 information is omitted. For non-IP traffic, the
541 formula is the same as for the layer2 transmit hash 541 formula is the same as for the layer2 transmit hash
542 policy. 542 policy.
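A minimal Python sketch of the layer3+4 formula, assuming IPv4 addresses given as dotted-quad strings. The function name and the "fragmented" flag are hypothetical. Non-IP traffic, which falls back to the layer2 formula, is not modelled here.

```python
import ipaddress


def layer34_hash(src_ip, dst_ip, src_port, dst_port, slave_count,
                 fragmented=False):
    """Hash one IPv4 packet to a slave index per the layer3+4 formula."""
    ip_bits = (int(ipaddress.IPv4Address(src_ip))
               ^ int(ipaddress.IPv4Address(dst_ip))) & 0xFFFF
    # Fragments and non-TCP/UDP IP traffic omit the port information.
    if fragmented or src_port is None or dst_port is None:
        return ip_bits % slave_count
    return ((src_port ^ dst_port) ^ ip_bits) % slave_count
```

With four slaves, a conversation between 192.168.1.1 port 1025 and 192.168.1.2 port 80 hashes to index 2 when unfragmented and to index 3 when fragmented, which illustrates how one conversation can end up striped across two interfaces.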
543 543
544 This policy is intended to mimic the behavior of 544 This policy is intended to mimic the behavior of
545 certain switches, notably Cisco switches with PFC2 as 545 certain switches, notably Cisco switches with PFC2 as
546 well as some Foundry and IBM products. 546 well as some Foundry and IBM products.
547 547
548 This algorithm is not fully 802.3ad compliant. A 548 This algorithm is not fully 802.3ad compliant. A
549 single TCP or UDP conversation containing both 549 single TCP or UDP conversation containing both
550 fragmented and unfragmented packets will see packets 550 fragmented and unfragmented packets will see packets
551 striped across two interfaces. This may result in out 551 striped across two interfaces. This may result in out
552 of order delivery. Most traffic types will not meet 552 of order delivery. Most traffic types will not meet
553 these criteria, as TCP rarely fragments traffic, and 553 these criteria, as TCP rarely fragments traffic, and
554 most UDP traffic is not involved in extended 554 most UDP traffic is not involved in extended
555 conversations. Other implementations of 802.3ad may 555 conversations. Other implementations of 802.3ad may
556 or may not tolerate this noncompliance. 556 or may not tolerate this noncompliance.
557 557
558 The default value is layer2. This option was added in bonding 558 The default value is layer2. This option was added in bonding
559 version 2.6.3. In earlier versions of bonding, this parameter does 559 version 2.6.3. In earlier versions of bonding, this parameter does
560 not exist, and the layer2 policy is the only policy. 560 not exist, and the layer2 policy is the only policy.
561 561
562 562
563 3. Configuring Bonding Devices 563 3. Configuring Bonding Devices
564 ============================== 564 ==============================
565 565
566 You can configure bonding using either your distro's network 566 You can configure bonding using either your distro's network
567 initialization scripts, or manually using either ifenslave or the 567 initialization scripts, or manually using either ifenslave or the
568 sysfs interface. Distros generally use one of two packages for the 568 sysfs interface. Distros generally use one of two packages for the
569 network initialization scripts: initscripts or sysconfig. Recent 569 network initialization scripts: initscripts or sysconfig. Recent
570 versions of these packages have support for bonding, while older 570 versions of these packages have support for bonding, while older
571 versions do not. 571 versions do not.
572 572
573 We will first describe the options for configuring bonding for 573 We will first describe the options for configuring bonding for
574 distros using versions of initscripts and sysconfig with full or 574 distros using versions of initscripts and sysconfig with full or
575 partial support for bonding, then provide information on enabling 575 partial support for bonding, then provide information on enabling
576 bonding without support from the network initialization scripts (i.e., 576 bonding without support from the network initialization scripts (i.e.,
577 older versions of initscripts or sysconfig). 577 older versions of initscripts or sysconfig).
578 578
579 If you're unsure whether your distro uses sysconfig or 579 If you're unsure whether your distro uses sysconfig or
580 initscripts, or don't know if it's new enough, have no fear. 580 initscripts, or don't know if it's new enough, have no fear.
581 Determining this is fairly straightforward. 581 Determining this is fairly straightforward.
582 582
583 First, issue the command: 583 First, issue the command:
584 584
585 $ rpm -qf /sbin/ifup 585 $ rpm -qf /sbin/ifup
586 586
587 It will respond with a line of text starting with either 587 It will respond with a line of text starting with either
588 "initscripts" or "sysconfig," followed by some numbers. This is the 588 "initscripts" or "sysconfig," followed by some numbers. This is the
589 package that provides your network initialization scripts. 589 package that provides your network initialization scripts.
590 590
591 Next, to determine if your installation supports bonding, 591 Next, to determine if your installation supports bonding,
592 issue the command: 592 issue the command:
593 593
594 $ grep ifenslave /sbin/ifup 594 $ grep ifenslave /sbin/ifup
595 595
596 If this returns any matches, then your initscripts or 596 If this returns any matches, then your initscripts or
597 sysconfig has support for bonding. 597 sysconfig has support for bonding.
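The same check as the grep command above can be sketched in Python, for use from a larger setup script. The function name is hypothetical; a missing or unreadable file is treated as "no bonding support", which is an assumption of this sketch.

```python
from pathlib import Path


def script_mentions(path, needle="ifenslave"):
    """Return True if the file at path contains needle."""
    try:
        text = Path(path).read_text(errors="replace")
    except OSError:
        # Missing or unreadable script: report no match.
        return False
    return needle in text
```

Calling script_mentions("/sbin/ifup") corresponds to the grep command shown above.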
598 598
599 3.1 Configuration with Sysconfig Support 599 3.1 Configuration with Sysconfig Support
600 ---------------------------------------- 600 ----------------------------------------
601 601
602 This section applies to distros using a version of sysconfig 602 This section applies to distros using a version of sysconfig
603 with bonding support, for example, SuSE Linux Enterprise Server 9. 603 with bonding support, for example, SuSE Linux Enterprise Server 9.
604 604
605 SuSE SLES 9's networking configuration system does support 605 SuSE SLES 9's networking configuration system does support
606 bonding; however, at this writing, the YaST system configuration 606 bonding; however, at this writing, the YaST system configuration
607 front end does not provide any means to work with bonding devices. 607 front end does not provide any means to work with bonding devices.
608 Bonding devices can be managed by hand, however, as follows. 608 Bonding devices can be managed by hand, however, as follows.
609 609
610 First, if they have not already been configured, configure the 610 First, if they have not already been configured, configure the
611 slave devices. On SLES 9, this is most easily done by running the 611 slave devices. On SLES 9, this is most easily done by running the
612 yast2 sysconfig configuration utility. The goal is to create an 612 yast2 sysconfig configuration utility. The goal is to create an
613 ifcfg-id file for each slave device. The simplest way to accomplish 613 ifcfg-id file for each slave device. The simplest way to accomplish
614 this is to configure the devices for DHCP (this is only to get the 614 this is to configure the devices for DHCP (this is only to get the
615 ifcfg-id file created; see below for some issues with DHCP). The 615 ifcfg-id file created; see below for some issues with DHCP). The
616 name of the configuration file for each device will be of the form: 616 name of the configuration file for each device will be of the form:
617 617
618 ifcfg-id-xx:xx:xx:xx:xx:xx 618 ifcfg-id-xx:xx:xx:xx:xx:xx
619 619
620 Where the "xx" portion will be replaced with the digits from 620 Where the "xx" portion will be replaced with the digits from
621 the device's permanent MAC address. 621 the device's permanent MAC address.
622 622
623 Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been 623 Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
624 created, it is necessary to edit the configuration files for the slave 624 created, it is necessary to edit the configuration files for the slave
625 devices (the MAC addresses correspond to those of the slave devices). 625 devices (the MAC addresses correspond to those of the slave devices).
626 Before editing, the file will contain multiple lines, and will look 626 Before editing, the file will contain multiple lines, and will look
627 something like this: 627 something like this:
628 628
629 BOOTPROTO='dhcp' 629 BOOTPROTO='dhcp'
630 STARTMODE='on' 630 STARTMODE='on'
631 USERCTL='no' 631 USERCTL='no'
632 UNIQUE='XNzu.WeZGOGF+4wE' 632 UNIQUE='XNzu.WeZGOGF+4wE'
633 _nm_name='bus-pci-0001:61:01.0' 633 _nm_name='bus-pci-0001:61:01.0'
634 634
635 Change the BOOTPROTO and STARTMODE lines to the following: 635 Change the BOOTPROTO and STARTMODE lines to the following:
636 636
637 BOOTPROTO='none' 637 BOOTPROTO='none'
638 STARTMODE='off' 638 STARTMODE='off'
639 639
640 Do not alter the UNIQUE or _nm_name lines. Remove any other 640 Do not alter the UNIQUE or _nm_name lines. Remove any other
641 lines (USERCTL, etc). 641 lines (USERCTL, etc).
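The edit to each slave's ifcfg-id file can be sketched as a small Python filter. The function name is hypothetical, and the sketch assumes one KEY='value' setting per line, as in the sample file above.

```python
KEEP = ("UNIQUE", "_nm_name")


def rewrite_slave_ifcfg(text):
    """Return slave ifcfg-id contents rewritten for use under a bond."""
    out = ["BOOTPROTO='none'", "STARTMODE='off'"]
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        # Preserve UNIQUE and _nm_name untouched; drop everything else.
        if key in KEEP:
            out.append(line.strip())
    return "\n".join(out) + "\n"
```

Applied to the sample file above, this yields the two changed lines followed by the original UNIQUE and _nm_name lines, with USERCTL removed.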
642 642
643 Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified, 643 Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
644 it's time to create the configuration file for the bonding device 644 it's time to create the configuration file for the bonding device
645 itself. This file is named ifcfg-bondX, where X is the number of the 645 itself. This file is named ifcfg-bondX, where X is the number of the
646 bonding device to create, starting at 0. The first such file is 646 bonding device to create, starting at 0. The first such file is
647 ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig 647 ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig
648 network configuration system will correctly start multiple instances 648 network configuration system will correctly start multiple instances
649 of bonding. 649 of bonding.
650 650
651 The contents of the ifcfg-bondX file are as follows: 651 The contents of the ifcfg-bondX file are as follows:
652 652
653 BOOTPROTO="static" 653 BOOTPROTO="static"
654 BROADCAST="10.0.2.255" 654 BROADCAST="10.0.2.255"
655 IPADDR="10.0.2.10" 655 IPADDR="10.0.2.10"
656 NETMASK="255.255.0.0" 656 NETMASK="255.255.0.0"
657 NETWORK="10.0.2.0" 657 NETWORK="10.0.2.0"
658 REMOTE_IPADDR="" 658 REMOTE_IPADDR=""
659 STARTMODE="onboot" 659 STARTMODE="onboot"
660 BONDING_MASTER="yes" 660 BONDING_MASTER="yes"
661 BONDING_MODULE_OPTS="mode=active-backup miimon=100" 661 BONDING_MODULE_OPTS="mode=active-backup miimon=100"
662 BONDING_SLAVE0="eth0" 662 BONDING_SLAVE0="eth0"
663 BONDING_SLAVE1="bus-pci-0000:06:08.1" 663 BONDING_SLAVE1="bus-pci-0000:06:08.1"
664 664
665 Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK 665 Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
666 values with the appropriate values for your network. 666 values with the appropriate values for your network.
667 667
668 The STARTMODE specifies when the device is brought online. 668 The STARTMODE specifies when the device is brought online.
669 The possible values are: 669 The possible values are:
670 670
671 onboot: The device is started at boot time. If you're not 671 onboot: The device is started at boot time. If you're not
672 sure, this is probably what you want. 672 sure, this is probably what you want.
673 673
674 manual: The device is started only when ifup is called 674 manual: The device is started only when ifup is called
675 manually. Bonding devices may be configured this 675 manually. Bonding devices may be configured this
676 way if you do not wish them to start automatically 676 way if you do not wish them to start automatically
677 at boot for some reason. 677 at boot for some reason.
678 678
679 hotplug: The device is started by a hotplug event. This is not 679 hotplug: The device is started by a hotplug event. This is not
680 a valid choice for a bonding device. 680 a valid choice for a bonding device.
681 681
682 off or ignore: The device configuration is ignored. 682 off or ignore: The device configuration is ignored.
683 683
684 The line BONDING_MASTER='yes' indicates that the device is a 684 The line BONDING_MASTER='yes' indicates that the device is a
685 bonding master device. The only useful value is "yes." 685 bonding master device. The only useful value is "yes."
686 686
687 The contents of BONDING_MODULE_OPTS are supplied to the 687 The contents of BONDING_MODULE_OPTS are supplied to the
688 instance of the bonding module for this device. Specify the options 688 instance of the bonding module for this device. Specify the options
689 for the bonding mode, link monitoring, and so on here. Do not include 689 for the bonding mode, link monitoring, and so on here. Do not include
690 the max_bonds bonding parameter; this will confuse the configuration 690 the max_bonds bonding parameter; this will confuse the configuration
691 system if you have multiple bonding devices. 691 system if you have multiple bonding devices.
692 692
693 Finally, supply one BONDING_SLAVEn="slave device" for each 693 Finally, supply one BONDING_SLAVEn="slave device" for each
694 slave, where "n" is an increasing value, one for each slave. The 694 slave, where "n" is an increasing value, one for each slave. The
695 "slave device" is either an interface name, e.g., "eth0", or a device 695 "slave device" is either an interface name, e.g., "eth0", or a device
696 specifier for the network device. The interface name is easier to 696 specifier for the network device. The interface name is easier to
697 find, but the ethN names are subject to change at boot time if, e.g., 697 find, but the ethN names are subject to change at boot time if, e.g.,
698 a device early in the sequence has failed. The device specifiers 698 a device early in the sequence has failed. The device specifiers
699 (bus-pci-0000:06:08.1 in the example above) specify the physical 699 (bus-pci-0000:06:08.1 in the example above) specify the physical
700 network device, and will not change unless the device's bus location 700 network device, and will not change unless the device's bus location
701 changes (for example, it is moved from one PCI slot to another). The 701 changes (for example, it is moved from one PCI slot to another). The
702 example above uses one of each type for demonstration purposes; most 702 example above uses one of each type for demonstration purposes; most
703 configurations will choose one or the other for all slave devices. 703 configurations will choose one or the other for all slave devices.
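As a rough sketch, the ifcfg-bondX contents described above could be generated by a helper like the following. The function and its parameter names are hypothetical; it only assembles text and rejects the two settings the text above warns against (STARTMODE "hotplug" and the max_bonds parameter).

```python
def bond_ifcfg(ipaddr, netmask, network, broadcast, module_opts, slaves,
               startmode="onboot"):
    """Return the text of a sysconfig ifcfg-bondX file."""
    if startmode not in ("onboot", "manual", "off", "ignore"):
        raise ValueError("unsupported STARTMODE for a bond: " + startmode)
    if "max_bonds" in module_opts:
        raise ValueError("do not pass max_bonds in BONDING_MODULE_OPTS")
    lines = [
        'BOOTPROTO="static"',
        f'BROADCAST="{broadcast}"',
        f'IPADDR="{ipaddr}"',
        f'NETMASK="{netmask}"',
        f'NETWORK="{network}"',
        'REMOTE_IPADDR=""',
        f'STARTMODE="{startmode}"',
        'BONDING_MASTER="yes"',
        f'BONDING_MODULE_OPTS="{module_opts}"',
    ]
    # One BONDING_SLAVEn line per slave, numbered from 0.
    for n, slave in enumerate(slaves):
        lines.append(f'BONDING_SLAVE{n}="{slave}"')
    return "\n".join(lines) + "\n"
```

Each slave may be given either as an interface name or as a device specifier, exactly as in the hand-written example.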
704 704
705 When all configuration files have been modified or created, 705 When all configuration files have been modified or created,
706 networking must be restarted for the configuration changes to take 706 networking must be restarted for the configuration changes to take
707 effect. This can be accomplished via the following: 707 effect. This can be accomplished via the following:
708 708
709 # /etc/init.d/network restart 709 # /etc/init.d/network restart
710 710
711 Note that the network control script (/sbin/ifdown) will 711 Note that the network control script (/sbin/ifdown) will
712 remove the bonding module as part of the network shutdown processing, 712 remove the bonding module as part of the network shutdown processing,
713 so it is not necessary to remove the module by hand if, e.g., the 713 so it is not necessary to remove the module by hand if, e.g., the
714 module parameters have changed. 714 module parameters have changed.
715 715
716 Also, at this writing, YaST/YaST2 will not manage bonding 716 Also, at this writing, YaST/YaST2 will not manage bonding
717 devices (they do not show bonding interfaces on their list of network 717 devices (they do not show bonding interfaces on their list of network
718 devices). It is necessary to edit the configuration file by hand to 718 devices). It is necessary to edit the configuration file by hand to
719 change the bonding configuration. 719 change the bonding configuration.
720 720
721 Additional general options and details of the ifcfg file 721 Additional general options and details of the ifcfg file
722 format can be found in an example ifcfg template file: 722 format can be found in an example ifcfg template file:
723 723
724 /etc/sysconfig/network/ifcfg.template 724 /etc/sysconfig/network/ifcfg.template
725 725
726 Note that the template does not document the various BONDING_ 726 Note that the template does not document the various BONDING_
727 settings described above, but does describe many of the other options. 727 settings described above, but does describe many of the other options.
728 728
729 3.1.1 Using DHCP with Sysconfig 729 3.1.1 Using DHCP with Sysconfig
730 ------------------------------- 730 -------------------------------
731 731
732 Under sysconfig, configuring a device with BOOTPROTO='dhcp' 732 Under sysconfig, configuring a device with BOOTPROTO='dhcp'
733 will cause it to query DHCP for its IP address information. At this 733 will cause it to query DHCP for its IP address information. At this
734 writing, this does not function for bonding devices; the scripts 734 writing, this does not function for bonding devices; the scripts
735 attempt to obtain the device address from DHCP prior to adding any of 735 attempt to obtain the device address from DHCP prior to adding any of
736 the slave devices. Without active slaves, the DHCP requests are not 736 the slave devices. Without active slaves, the DHCP requests are not
737 sent to the network. 737 sent to the network.
738 738
739 3.1.2 Configuring Multiple Bonds with Sysconfig 739 3.1.2 Configuring Multiple Bonds with Sysconfig
740 ----------------------------------------------- 740 -----------------------------------------------
741 741
742 The sysconfig network initialization system is capable of 742 The sysconfig network initialization system is capable of
743 handling multiple bonding devices. All that is necessary is for each 743 handling multiple bonding devices. All that is necessary is for each
744 bonding instance to have an appropriately configured ifcfg-bondX file 744 bonding instance to have an appropriately configured ifcfg-bondX file
745 (as described above). Do not specify the "max_bonds" parameter to any 745 (as described above). Do not specify the "max_bonds" parameter to any
746 instance of bonding, as this will confuse sysconfig. If you require 746 instance of bonding, as this will confuse sysconfig. If you require
747 multiple bonding devices with identical parameters, create multiple 747 multiple bonding devices with identical parameters, create multiple
748 ifcfg-bondX files. 748 ifcfg-bondX files.
749 749
750 Because the sysconfig scripts supply the bonding module 750 Because the sysconfig scripts supply the bonding module
751 options in the ifcfg-bondX file, it is not necessary to add them to 751 options in the ifcfg-bondX file, it is not necessary to add them to
752 the system /etc/modules.conf or /etc/modprobe.conf configuration file. 752 the system /etc/modules.conf or /etc/modprobe.conf configuration file.
753 753
754 3.2 Configuration with Initscripts Support 754 3.2 Configuration with Initscripts Support
755 ------------------------------------------ 755 ------------------------------------------
756 756
757 This section applies to distros using a version of initscripts 757 This section applies to distros using a version of initscripts
758 with bonding support, for example, Red Hat Linux 9 or Red Hat 758 with bonding support, for example, Red Hat Linux 9 or Red Hat
759 Enterprise Linux version 3 or 4. On these systems, the network 759 Enterprise Linux version 3 or 4. On these systems, the network
760 initialization scripts have some knowledge of bonding, and can be 760 initialization scripts have some knowledge of bonding, and can be
761 configured to control bonding devices. 761 configured to control bonding devices.
762 762
763 These distros will not automatically load the network adapter 763 These distros will not automatically load the network adapter
764 driver unless the ethX device is configured with an IP address. 764 driver unless the ethX device is configured with an IP address.
765 Because of this constraint, users must manually configure a 765 Because of this constraint, users must manually configure a
766 network-script file for all physical adapters that will be members of 766 network-script file for all physical adapters that will be members of
767 a bondX link. Network script files are located in the directory: 767 a bondX link. Network script files are located in the directory:
768 768
769 /etc/sysconfig/network-scripts 769 /etc/sysconfig/network-scripts
770 770
771 The file name must be prefixed with "ifcfg-eth" and suffixed 771 The file name must be prefixed with "ifcfg-eth" and suffixed
772 with the adapter's physical adapter number. For example, the script 772 with the adapter's physical adapter number. For example, the script
773 for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0. 773 for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0.
774 Place the following text in the file: 774 Place the following text in the file:
775 775
776 DEVICE=eth0 776 DEVICE=eth0
777 USERCTL=no 777 USERCTL=no
778 ONBOOT=yes 778 ONBOOT=yes
779 MASTER=bond0 779 MASTER=bond0
780 SLAVE=yes 780 SLAVE=yes
781 BOOTPROTO=none 781 BOOTPROTO=none
782 782
783 The DEVICE= line will be different for every ethX device and 783 The DEVICE= line will be different for every ethX device and
784 must correspond with the name of the file, i.e., ifcfg-eth1 must have 784 must correspond with the name of the file, i.e., ifcfg-eth1 must have
785 a device line of DEVICE=eth1. The setting of the MASTER= line will 785 a device line of DEVICE=eth1. The setting of the MASTER= line will
786 also depend on the final bonding interface name chosen for your bond. 786 also depend on the final bonding interface name chosen for your bond.
787 As with other network devices, these typically start at 0, and go up 787 As with other network devices, these typically start at 0, and go up
788 one for each device, i.e., the first bonding instance is bond0, the 788 one for each device, i.e., the first bonding instance is bond0, the
789 second is bond1, and so on. 789 second is bond1, and so on.
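When several adapters join the same bond, the per-adapter files differ only in the DEVICE= and MASTER= lines, so they can be produced by a small helper. The sketch below is illustrative only; both function names are hypothetical.

```python
def slave_ifcfg(device, master="bond0"):
    """Return the ifcfg text for one slave adapter, e.g. eth0."""
    lines = [
        f"DEVICE={device}",
        "USERCTL=no",
        "ONBOOT=yes",
        f"MASTER={master}",
        "SLAVE=yes",
        "BOOTPROTO=none",
    ]
    return "\n".join(lines) + "\n"


def slave_ifcfg_path(device, directory="/etc/sysconfig/network-scripts"):
    """Return the file name, which must match the DEVICE= line."""
    return f"{directory}/ifcfg-{device}"
```

For eth1 this gives the path /etc/sysconfig/network-scripts/ifcfg-eth1 and a file whose device line reads DEVICE=eth1.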
790 790
791 Next, create a bond network script. The file name for this 791 Next, create a bond network script. The file name for this
792 script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is 792 script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is
793 the number of the bond. For bond0 the file is named "ifcfg-bond0", 793 the number of the bond. For bond0 the file is named "ifcfg-bond0",
794 for bond1 it is named "ifcfg-bond1", and so on. Within that file, 794 for bond1 it is named "ifcfg-bond1", and so on. Within that file,
795 place the following text: 795 place the following text:
796 796
797 DEVICE=bond0 797 DEVICE=bond0
798 IPADDR=192.168.1.1 798 IPADDR=192.168.1.1
799 NETMASK=255.255.255.0 799 NETMASK=255.255.255.0
800 NETWORK=192.168.1.0 800 NETWORK=192.168.1.0
801 BROADCAST=192.168.1.255 801 BROADCAST=192.168.1.255
802 ONBOOT=yes 802 ONBOOT=yes
803 BOOTPROTO=none 803 BOOTPROTO=none
804 USERCTL=no 804 USERCTL=no
805 805
806 Be sure to change the networking specific lines (IPADDR, 806 Be sure to change the networking specific lines (IPADDR,
807 NETMASK, NETWORK and BROADCAST) to match your network configuration. 807 NETMASK, NETWORK and BROADCAST) to match your network configuration.
808 808
809 Finally, it is necessary to edit /etc/modules.conf (or 809 Finally, it is necessary to edit /etc/modules.conf (or
810 /etc/modprobe.conf, depending upon your distro) to load the bonding 810 /etc/modprobe.conf, depending upon your distro) to load the bonding
811 module with your desired options when the bond0 interface is brought 811 module with your desired options when the bond0 interface is brought
812 up. The following lines in /etc/modules.conf (or modprobe.conf) will 812 up. The following lines in /etc/modules.conf (or modprobe.conf) will
813 load the bonding module, and select its options: 813 load the bonding module, and select its options:
814 814
815 alias bond0 bonding 815 alias bond0 bonding
816 options bond0 mode=balance-alb miimon=100 816 options bond0 mode=balance-alb miimon=100
817 817
818 Replace the sample parameters with the appropriate set of 818 Replace the sample parameters with the appropriate set of
819 options for your configuration. 819 options for your configuration.
820 820
821 Finally, run "/etc/rc.d/init.d/network restart" as root. This 821 Finally, run "/etc/rc.d/init.d/network restart" as root. This
822 will restart the networking subsystem and your bond link should now be 822 will restart the networking subsystem and your bond link should now be
823 up and running. 823 up and running.
824 824
825 3.2.1 Using DHCP with Initscripts 825 3.2.1 Using DHCP with Initscripts
826 --------------------------------- 826 ---------------------------------
827 827
828 Recent versions of initscripts (the version supplied with 828 Recent versions of initscripts (the version supplied with
829 Fedora Core 3 and Red Hat Enterprise Linux 4 is reported to work) do 829 Fedora Core 3 and Red Hat Enterprise Linux 4 is reported to work) do
830 have support for assigning IP information to bonding devices via DHCP. 830 have support for assigning IP information to bonding devices via DHCP.
831 831
832 To configure bonding for DHCP, configure it as described 832 To configure bonding for DHCP, configure it as described
833 above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" 833 above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp"
834 and add a line consisting of "TYPE=Bonding". Note that the TYPE value 834 and add a line consisting of "TYPE=Bonding". Note that the TYPE value
835 is case sensitive. 835 is case sensitive.
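The change from a static ifcfg-bondX file to a DHCP one amounts to one substitution and one added line. A minimal Python sketch, with a hypothetical function name, applying exactly the two changes described above and nothing else:

```python
def bond_ifcfg_to_dhcp(text):
    """Return ifcfg-bondX text switched from static to DHCP."""
    out = []
    for line in text.splitlines():
        if line.strip() == "BOOTPROTO=none":
            out.append("BOOTPROTO=dhcp")
        else:
            out.append(line)
    # TYPE is case sensitive, so match and emit it exactly.
    if "TYPE=Bonding" not in out:
        out.append("TYPE=Bonding")
    return "\n".join(out) + "\n"
```

Running it a second time on its own output leaves the text unchanged.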
836 836
837 3.2.2 Configuring Multiple Bonds with Initscripts 837 3.2.2 Configuring Multiple Bonds with Initscripts
838 ------------------------------------------------- 838 -------------------------------------------------
839 839
840 At this writing, the initscripts package does not directly 840 At this writing, the initscripts package does not directly
841 support loading the bonding driver multiple times, so the process for 841 support loading the bonding driver multiple times, so the process for
842 doing so is the same as described in the "Configuring Multiple Bonds 842 doing so is the same as described in the "Configuring Multiple Bonds
843 Manually" section, below. 843 Manually" section, below.
844 844
845 NOTE: It has been observed that some Red Hat supplied kernels 845 NOTE: It has been observed that some Red Hat supplied kernels
846 are apparently unable to rename modules at load time (the "-o bond1" 846 are apparently unable to rename modules at load time (the "-o bond1"
847 part). Attempts to pass that option to modprobe will produce an 847 part). Attempts to pass that option to modprobe will produce an
848 "Operation not permitted" error. This has been reported on some 848 "Operation not permitted" error. This has been reported on some
849 Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels 849 Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels
850 exhibiting this problem, it will be impossible to configure multiple 850 exhibiting this problem, it will be impossible to configure multiple
851 bonds with differing parameters. 851 bonds with differing parameters.
852 852
853 3.3 Configuring Bonding Manually with Ifenslave 853 3.3 Configuring Bonding Manually with Ifenslave
854 ----------------------------------------------- 854 -----------------------------------------------
855 855
856 This section applies to distros whose network initialization 856 This section applies to distros whose network initialization
857 scripts (the sysconfig or initscripts package) do not have specific 857 scripts (the sysconfig or initscripts package) do not have specific
858 knowledge of bonding. One such distro is SuSE Linux Enterprise Server 858 knowledge of bonding. One such distro is SuSE Linux Enterprise Server
859 version 8. 859 version 8.
860 860
861 The general method for these systems is to place the bonding 861 The general method for these systems is to place the bonding
862 module parameters into /etc/modules.conf or /etc/modprobe.conf (as 862 module parameters into /etc/modules.conf or /etc/modprobe.conf (as
863 appropriate for the installed distro), then add modprobe and/or 863 appropriate for the installed distro), then add modprobe and/or
864 ifenslave commands to the system's global init script. The name of 864 ifenslave commands to the system's global init script. The name of
865 the global init script differs; for sysconfig, it is 865 the global init script differs; for sysconfig, it is
866 /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. 866 /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
867 867
868 For example, if you wanted to make a simple bond of two e100 868 For example, if you wanted to make a simple bond of two e100
869 devices (presumed to be eth0 and eth1), and have it persist across 869 devices (presumed to be eth0 and eth1), and have it persist across
870 reboots, edit the appropriate file (/etc/init.d/boot.local or 870 reboots, edit the appropriate file (/etc/init.d/boot.local or
871 /etc/rc.d/rc.local), and add the following: 871 /etc/rc.d/rc.local), and add the following:
872 872
873 modprobe bonding mode=balance-alb miimon=100 873 modprobe bonding mode=balance-alb miimon=100
874 modprobe e100 874 modprobe e100
875 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up 875 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
876 ifenslave bond0 eth0 876 ifenslave bond0 eth0
877 ifenslave bond0 eth1 877 ifenslave bond0 eth1
878 878
879 Replace the example bonding module parameters and bond0 879 Replace the example bonding module parameters and bond0
880 network configuration (IP address, netmask, etc) with the appropriate 880 network configuration (IP address, netmask, etc) with the appropriate
881 values for your configuration. 881 values for your configuration.
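If the same init fragment is needed on several machines, the command list can be built from a few parameters. The following Python sketch only assembles the command strings shown above and does not run them; the function and parameter names are hypothetical.

```python
def manual_bond_commands(bond, ipaddr, netmask, slaves, driver, module_opts):
    """Return the init-script commands that bring up one bond."""
    cmds = [
        f"modprobe bonding {module_opts}",
        f"modprobe {driver}",
        f"ifconfig {bond} {ipaddr} netmask {netmask} up",
    ]
    # The bond is brought up first, then each slave is attached in order.
    cmds += [f"ifenslave {bond} {slave}" for slave in slaves]
    return cmds
```

Joining the returned list with newlines reproduces the example fragment for the two e100 devices.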
882 882
883 Unfortunately, this method will not provide support for the 883 Unfortunately, this method will not provide support for the
884 ifup and ifdown scripts on the bond devices. To reload the bonding 884 ifup and ifdown scripts on the bond devices. To reload the bonding
885 configuration, it is necessary to run the initialization script, e.g., 885 configuration, it is necessary to run the initialization script, e.g.,
886 886
887 # /etc/init.d/boot.local 887 # /etc/init.d/boot.local
888 888
889 or 889 or
890 890
891 # /etc/rc.d/rc.local 891 # /etc/rc.d/rc.local
892 892
893 It may be desirable in such a case to create a separate script 893 It may be desirable in such a case to create a separate script
894 which only initializes the bonding configuration, then call that 894 which only initializes the bonding configuration, then call that
895 separate script from within boot.local. This allows for bonding to be 895 separate script from within boot.local. This allows for bonding to be
896 enabled without re-running the entire global init script. 896 enabled without re-running the entire global init script.
897 897
898 To shut down the bonding devices, it is necessary to first 898 To shut down the bonding devices, it is necessary to first
899 mark the bonding device itself as being down, then remove the 899 mark the bonding device itself as being down, then remove the
900 appropriate device driver modules. For our example above, you can do 900 appropriate device driver modules. For our example above, you can do
901 the following: 901 the following:
902 902
903 # ifconfig bond0 down 903 # ifconfig bond0 down
904 # rmmod bonding 904 # rmmod bonding
905 # rmmod e100 905 # rmmod e100
906 906
907 Again, for convenience, it may be desirable to create a script 907 Again, for convenience, it may be desirable to create a script
908 with these commands. 908 with these commands.
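As a minimal sketch of such a separate script, the start and stop commands from this section can be wrapped in two functions. The script name (bond-init.sh), the function names and the DRY_RUN variable are assumptions made for illustration; the commands themselves are the ones shown above, and the module parameters and addresses should be replaced with your own.

```shell
#!/bin/sh
# bond-init.sh (hypothetical name): start or stop only the bonding setup.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Same sequence as the boot.local / rc.local example above.
bond0_start() {
    run modprobe bonding mode=balance-alb miimon=100
    run modprobe e100
    run ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
    run ifenslave bond0 eth0
    run ifenslave bond0 eth1
}

# Mark the bond down first, then remove the driver modules.
bond0_stop() {
    run ifconfig bond0 down
    run rmmod bonding
    run rmmod e100
}

# Dispatch on the first argument; do nothing when it is absent.
case "${1:-}" in
    start) bond0_start ;;
    stop)  bond0_stop ;;
esac
```

boot.local could then call "bond-init.sh start", and the same script can be run by hand with "stop" followed by "start" to reload the bonding configuration.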
909 909
910 910
911 3.3.1 Configuring Multiple Bonds Manually 911 3.3.1 Configuring Multiple Bonds Manually
912 ----------------------------------------- 912 -----------------------------------------
913 913
914 This section contains information on configuring multiple 914 This section contains information on configuring multiple
915 bonding devices with differing options for those systems whose network 915 bonding devices with differing options for those systems whose network
916 initialization scripts lack support for configuring multiple bonds. 916 initialization scripts lack support for configuring multiple bonds.
917 917
918 If you require multiple bonding devices, but all with the same 918 If you require multiple bonding devices, but all with the same
919 options, you may wish to use the "max_bonds" module parameter, 919 options, you may wish to use the "max_bonds" module parameter,
920 documented above. 920 documented above.
921 921
922 To create multiple bonding devices with differing options, it 922 To create multiple bonding devices with differing options, it
923 is necessary to load the bonding driver multiple times. Note that 923 is necessary to load the bonding driver multiple times. Note that
924 current versions of the sysconfig network initialization scripts 924 current versions of the sysconfig network initialization scripts
925 handle this automatically; if your distro uses these scripts, no 925 handle this automatically; if your distro uses these scripts, no
926 special action is needed. See the section Configuring Bonding 926 special action is needed. See the section Configuring Bonding
927 Devices, above, if you're not sure about your network initialization 927 Devices, above, if you're not sure about your network initialization
928 scripts. 928 scripts.
929 929
930 To load multiple instances of the module, it is necessary to 930 To load multiple instances of the module, it is necessary to
931 specify a different name for each instance (the module loading system 931 specify a different name for each instance (the module loading system
932 requires that every loaded module, even multiple instances of the same 932 requires that every loaded module, even multiple instances of the same
933 module, have a unique name). This is accomplished by supplying 933 module, have a unique name). This is accomplished by supplying
934 multiple sets of bonding options in /etc/modprobe.conf, for example: 934 multiple sets of bonding options in /etc/modprobe.conf, for example:
935 935
936 alias bond0 bonding 936 alias bond0 bonding
937 options bond0 -o bond0 mode=balance-rr miimon=100 937 options bond0 -o bond0 mode=balance-rr miimon=100
938 938
939 alias bond1 bonding 939 alias bond1 bonding
940 options bond1 -o bond1 mode=balance-alb miimon=50 940 options bond1 -o bond1 mode=balance-alb miimon=50
941 941
942 will load the bonding module two times. The first instance is 942 will load the bonding module two times. The first instance is
943 named "bond0" and creates the bond0 device in balance-rr mode with an 943 named "bond0" and creates the bond0 device in balance-rr mode with an
944 miimon of 100. The second instance is named "bond1" and creates the 944 miimon of 100. The second instance is named "bond1" and creates the
945 bond1 device in balance-alb mode with an miimon of 50. 945 bond1 device in balance-alb mode with an miimon of 50.
946 946
947 In some circumstances (typically with older distributions), 947 In some circumstances (typically with older distributions),
948 the above does not work, and the second bonding instance never sees 948 the above does not work, and the second bonding instance never sees
949 its options. In that case, the second options line can be substituted 949 its options. In that case, the second options line can be substituted
950 as follows: 950 as follows:
951 951
952 install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ 952 install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
953 mode=balance-alb miimon=50 953 mode=balance-alb miimon=50
954 954
955 This may be repeated any number of times, specifying a new and 955 This may be repeated any number of times, specifying a new and
956 unique name in place of bond1 for each subsequent instance. 956 unique name in place of bond1 for each subsequent instance.
957 957
958 3.4 Configuring Bonding Manually via Sysfs 958 3.4 Configuring Bonding Manually via Sysfs
959 ------------------------------------------ 959 ------------------------------------------
960 960
961 Starting with version 3.0, Channel Bonding may be configured 961 Starting with version 3.0, Channel Bonding may be configured
962 via the sysfs interface. This interface allows dynamic configuration 962 via the sysfs interface. This interface allows dynamic configuration
963 of all bonds in the system without unloading the module. It also 963 of all bonds in the system without unloading the module. It also
964 allows for adding and removing bonds at runtime. Ifenslave is no 964 allows for adding and removing bonds at runtime. Ifenslave is no
965 longer required, though it is still supported. 965 longer required, though it is still supported.
966 966
967 Use of the sysfs interface allows you to use multiple bonds 967 Use of the sysfs interface allows you to use multiple bonds
968 with different configurations without having to reload the module. 968 with different configurations without having to reload the module.
969 It also allows you to use multiple, differently configured bonds when 969 It also allows you to use multiple, differently configured bonds when
970 bonding is compiled into the kernel. 970 bonding is compiled into the kernel.
971 971
972 You must have the sysfs filesystem mounted to configure 972 You must have the sysfs filesystem mounted to configure
973 bonding this way. The examples in this document assume that you 973 bonding this way. The examples in this document assume that you
974 are using the standard mount point for sysfs, i.e. /sys. If your 974 are using the standard mount point for sysfs, i.e. /sys. If your
975 sysfs filesystem is mounted elsewhere, you will need to adjust the 975 sysfs filesystem is mounted elsewhere, you will need to adjust the
976 example paths accordingly. 976 example paths accordingly.
977 977
978 Creating and Destroying Bonds 978 Creating and Destroying Bonds
979 ----------------------------- 979 -----------------------------
980 To add a new bond foo: 980 To add a new bond foo:
981 # echo +foo > /sys/class/net/bonding_masters 981 # echo +foo > /sys/class/net/bonding_masters
982 982
983 To remove an existing bond bar: 983 To remove an existing bond bar:
984 # echo -bar > /sys/class/net/bonding_masters 984 # echo -bar > /sys/class/net/bonding_masters
985 985
986 To show all existing bonds: 986 To show all existing bonds:
987 # cat /sys/class/net/bonding_masters 987 # cat /sys/class/net/bonding_masters
988 988
989 NOTE: due to 4K size limitation of sysfs files, this list may be 989 NOTE: due to 4K size limitation of sysfs files, this list may be
990 truncated if you have more than a few hundred bonds. This is unlikely 990 truncated if you have more than a few hundred bonds. This is unlikely
991 to occur under normal operating conditions. 991 to occur under normal operating conditions.
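For scripts that need the list of bonds one name per line, a small helper along these lines may be convenient. The function name list_bonds and its optional sysfs-root argument are assumptions for illustration; only the bonding_masters path comes from the text above.

```shell
# list_bonds [SYSFS_ROOT]
# Print one bond name per line, as read from the bonding_masters file.
# SYSFS_ROOT defaults to /sys; pass another path if sysfs is mounted elsewhere.
list_bonds() {
    masters="${1:-/sys}/class/net/bonding_masters"
    # The file does not exist when the bonding driver is not loaded.
    [ -r "$masters" ] || return 1
    # The file holds a whitespace-separated list of bond names.
    awk '{ for (i = 1; i <= NF; i++) print $i }' "$masters"
}
```

The function returns a non-zero status when bonding_masters cannot be read, which a caller can use to detect that the driver is not available.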
992 992
993 Adding and Removing Slaves 993 Adding and Removing Slaves
994 -------------------------- 994 --------------------------
995 Interfaces may be enslaved to a bond using the file 995 Interfaces may be enslaved to a bond using the file
996 /sys/class/net/<bond>/bonding/slaves. The semantics for this file 996 /sys/class/net/<bond>/bonding/slaves. The semantics for this file
997 are the same as for the bonding_masters file. 997 are the same as for the bonding_masters file.
998 998
999 To enslave interface eth0 to bond bond0: 999 To enslave interface eth0 to bond bond0:
1000 # ifconfig bond0 up 1000 # ifconfig bond0 up
1001 # echo +eth0 > /sys/class/net/bond0/bonding/slaves 1001 # echo +eth0 > /sys/class/net/bond0/bonding/slaves
1002 1002
1003 To free slave eth0 from bond bond0: 1003 To free slave eth0 from bond bond0:
1004 # echo -eth0 > /sys/class/net/bond0/bonding/slaves 1004 # echo -eth0 > /sys/class/net/bond0/bonding/slaves
1005 1005
1006 NOTE: The bond must be up before slaves can be added. All 1006 NOTE: The bond must be up before slaves can be added. All
1007 slaves are freed when the interface is brought down. 1007 slaves are freed when the interface is brought down.
1008 1008
1009 When an interface is enslaved to a bond, symlinks between the 1009 When an interface is enslaved to a bond, symlinks between the
1010 two are created in the sysfs filesystem. In this case, you would get 1010 two are created in the sysfs filesystem. In this case, you would get
1011 /sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and 1011 /sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and
1012 /sys/class/net/eth0/master pointing to /sys/class/net/bond0. 1012 /sys/class/net/eth0/master pointing to /sys/class/net/bond0.
1013 1013
1014 This means that you can tell quickly whether or not an 1014 This means that you can tell quickly whether or not an
1015 interface is enslaved by looking for the master symlink. Thus: 1015 interface is enslaved by looking for the master symlink. Thus:
1016 # echo -eth0 > /sys/class/net/eth0/master/bonding/slaves 1016 # echo -eth0 > /sys/class/net/eth0/master/bonding/slaves
1017 will free eth0 from whatever bond it is enslaved to, regardless of 1017 will free eth0 from whatever bond it is enslaved to, regardless of
1018 the name of the bond interface. 1018 the name of the bond interface.
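The master symlink can also be used to find out which bond an interface belongs to. The following is a sketch only: the function name bond_master_of and the optional sysfs-root argument are assumptions, and it relies on the readlink utility being installed.

```shell
# bond_master_of IFACE [SYSFS_ROOT]
# Print the name of the bond that IFACE is enslaved to.
# Returns 1 if IFACE has no master symlink, i.e. it is not enslaved.
bond_master_of() {
    link="${2:-/sys}/class/net/$1/master"
    [ -L "$link" ] || return 1
    # The last component of the symlink target is the master's name.
    basename "$(readlink "$link")"
}
```

For example, "bond_master_of eth0" would be expected to print bond0 in the configuration described above, and to fail once eth0 has been freed.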
1019 1019
1020 Changing a Bond's Configuration 1020 Changing a Bond's Configuration
1021 ------------------------------- 1021 -------------------------------
1022 Each bond may be configured individually by manipulating the 1022 Each bond may be configured individually by manipulating the
1023 files located in /sys/class/net/<bond name>/bonding 1023 files located in /sys/class/net/<bond name>/bonding
1024 1024
1025 The names of these files correspond directly with the command- 1025 The names of these files correspond directly with the command-
1026 line parameters described elsewhere in in this file, and, with the 1026 line parameters described elsewhere in this file, and, with the
1027 exception of arp_ip_target, they accept the same values. To see the 1027 exception of arp_ip_target, they accept the same values. To see the
1028 current setting, simply cat the appropriate file. 1028 current setting, simply cat the appropriate file.
1029 1029
1030 A few examples will be given here; for specific usage 1030 A few examples will be given here; for specific usage
1031 guidelines for each parameter, see the appropriate section in this 1031 guidelines for each parameter, see the appropriate section in this
1032 document. 1032 document.
1033 1033
1034 To configure bond0 for balance-alb mode: 1034 To configure bond0 for balance-alb mode:
1035 # ifconfig bond0 down 1035 # ifconfig bond0 down
1036 # echo 6 > /sys/class/net/bond0/bonding/mode 1036 # echo 6 > /sys/class/net/bond0/bonding/mode
1037 - or - 1037 - or -
1038 # echo balance-alb > /sys/class/net/bond0/bonding/mode 1038 # echo balance-alb > /sys/class/net/bond0/bonding/mode
1039 NOTE: The bond interface must be down before the mode can be 1039 NOTE: The bond interface must be down before the mode can be
1040 changed. 1040 changed.
1041 1041
1042 To enable MII monitoring on bond0 with a 1 second interval: 1042 To enable MII monitoring on bond0 with a 1 second interval:
1043 # echo 1000 > /sys/class/net/bond0/bonding/miimon 1043 # echo 1000 > /sys/class/net/bond0/bonding/miimon
1044 NOTE: If ARP monitoring is enabled, it will be disabled when MII 1044 NOTE: If ARP monitoring is enabled, it will be disabled when MII
1045 monitoring is enabled, and vice-versa. 1045 monitoring is enabled, and vice-versa.
1046 1046
1047 To add ARP targets: 1047 To add ARP targets:
1048 # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target 1048 # echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
1049 # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target 1049 # echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
1050 NOTE: up to 10 target addresses may be specified. 1050 NOTE: up to 10 target addresses may be specified.
1051 1051
1052 To remove an ARP target: 1052 To remove an ARP target:
1053 # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target 1053 # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
1054 1054
1055 Example Configuration 1055 Example Configuration
1056 --------------------- 1056 ---------------------
1057 We begin with the same example that is shown in section 3.3, 1057 We begin with the same example that is shown in section 3.3,
1058 executed with sysfs, and without using ifenslave. 1058 executed with sysfs, and without using ifenslave.
1059 1059
1060 To make a simple bond of two e100 devices (presumed to be eth0 1060 To make a simple bond of two e100 devices (presumed to be eth0
1061 and eth1), and have it persist across reboots, edit the appropriate 1061 and eth1), and have it persist across reboots, edit the appropriate
1062 file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the 1062 file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the
1063 following: 1063 following:
1064 1064
1065 modprobe bonding 1065 modprobe bonding
1066 modprobe e100 1066 modprobe e100
1067 echo balance-alb > /sys/class/net/bond0/bonding/mode 1067 echo balance-alb > /sys/class/net/bond0/bonding/mode
1068 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up 1068 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
1069 echo 100 > /sys/class/net/bond0/bonding/miimon 1069 echo 100 > /sys/class/net/bond0/bonding/miimon
1070 echo +eth0 > /sys/class/net/bond0/bonding/slaves 1070 echo +eth0 > /sys/class/net/bond0/bonding/slaves
1071 echo +eth1 > /sys/class/net/bond0/bonding/slaves 1071 echo +eth1 > /sys/class/net/bond0/bonding/slaves
1072 1072
1073 To add a second bond, with two e1000 interfaces in 1073 To add a second bond, with two e1000 interfaces in
1074 active-backup mode, using ARP monitoring, add the following lines to 1074 active-backup mode, using ARP monitoring, add the following lines to
1075 your init script: 1075 your init script:
1076 1076
1077 modprobe e1000 1077 modprobe e1000
1078 echo +bond1 > /sys/class/net/bonding_masters 1078 echo +bond1 > /sys/class/net/bonding_masters
1079 echo active-backup > /sys/class/net/bond1/bonding/mode 1079 echo active-backup > /sys/class/net/bond1/bonding/mode
1080 ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up 1080 ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up
1081 echo +192.168.2.100 > /sys/class/net/bond1/bonding/arp_ip_target 1081 echo +192.168.2.100 > /sys/class/net/bond1/bonding/arp_ip_target
1082 echo 2000 > /sys/class/net/bond1/bonding/arp_interval 1082 echo 2000 > /sys/class/net/bond1/bonding/arp_interval
1083 echo +eth2 > /sys/class/net/bond1/bonding/slaves 1083 echo +eth2 > /sys/class/net/bond1/bonding/slaves
1084 echo +eth3 > /sys/class/net/bond1/bonding/slaves 1084 echo +eth3 > /sys/class/net/bond1/bonding/slaves
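Each configuration step in these examples is a redirected write to a file under /sys/class/net/<bond>/bonding. In an init script it can help to route those writes through one wrapper, so that a missing file or a forgotten redirection is reported instead of passing silently. The sketch below is illustrative; the name bond_set and its argument order are assumptions, not part of the bonding driver or of ifenslave.

```shell
# bond_set BOND ATTR VALUE [SYSFS_ROOT]
# Write VALUE to /sys/class/net/BOND/bonding/ATTR.
# Returns non-zero if the file is missing, not writable, or the write fails.
bond_set() {
    attr="${4:-/sys}/class/net/$1/bonding/$2"
    if [ ! -w "$attr" ]; then
        echo "bond_set: cannot write $attr" >&2
        return 1
    fi
    printf '%s\n' "$3" > "$attr"
}
```

With this helper, the second bond above could be configured with calls such as "bond_set bond1 mode active-backup" and "bond_set bond1 arp_ip_target +192.168.2.100".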
1085 1085
1086 1086
1087 4. Querying Bonding Configuration 1087 4. Querying Bonding Configuration
1088 ================================= 1088 =================================
1089 1089
1090 4.1 Bonding Configuration 1090 4.1 Bonding Configuration
1091 ------------------------- 1091 -------------------------
1092 1092
1093 Each bonding device has a read-only file residing in the 1093 Each bonding device has a read-only file residing in the
1094 /proc/net/bonding directory. The file contents include information 1094 /proc/net/bonding directory. The file contents include information
1095 about the bonding configuration, options and state of each slave. 1095 about the bonding configuration, options and state of each slave.
1096 1096
1097 For example, the contents of /proc/net/bonding/bond0 after the 1097 For example, the contents of /proc/net/bonding/bond0 after the
1098 driver is loaded with parameters of mode=0 and miimon=1000 are 1098 driver is loaded with parameters of mode=0 and miimon=1000 are
1099 generally as follows: 1099 generally as follows:
1100 1100
1101 Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004) 1101 Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
1102 Bonding Mode: load balancing (round-robin) 1102 Bonding Mode: load balancing (round-robin)
1103 Currently Active Slave: eth0 1103 Currently Active Slave: eth0
1104 MII Status: up 1104 MII Status: up
1105 MII Polling Interval (ms): 1000 1105 MII Polling Interval (ms): 1000
1106 Up Delay (ms): 0 1106 Up Delay (ms): 0
1107 Down Delay (ms): 0 1107 Down Delay (ms): 0
1108 1108
1109 Slave Interface: eth1 1109 Slave Interface: eth1
1110 MII Status: up 1110 MII Status: up
1111 Link Failure Count: 1 1111 Link Failure Count: 1
1112 1112
1113 Slave Interface: eth0 1113 Slave Interface: eth0
1114 MII Status: up 1114 MII Status: up
1115 Link Failure Count: 1 1115 Link Failure Count: 1
1116 1116
1117 The precise format and contents will change depending upon the 1117 The precise format and contents will change depending upon the
1118 bonding configuration, state, and version of the bonding driver. 1118 bonding configuration, state, and version of the bonding driver.
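Because the file is plain text, the per-slave link state can be pulled out with standard tools. The following sketch assumes the layout shown in the sample above, where each "Slave Interface:" line is followed by that slave's "MII Status:" line; since the format varies between driver versions, treat it as a starting point. The function name bond_slave_status is an assumption.

```shell
# bond_slave_status [FILE]
# Print "<slave> <MII status>" for each slave in a /proc/net/bonding file.
# FILE defaults to /proc/net/bonding/bond0.
bond_slave_status() {
    awk -F': ' '
        # Remember the slave name until its status line is seen.
        /^Slave Interface:/ { slave = $2; next }
        # The bond-level MII Status line comes before any slave; skip it.
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}
```

The bond-wide "MII Status" line near the top of the file is ignored because no slave name has been seen at that point.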
1119 1119
1120 4.2 Network configuration 1120 4.2 Network configuration
1121 ------------------------- 1121 -------------------------
1122 1122
1123 The network configuration can be inspected using the ifconfig 1123 The network configuration can be inspected using the ifconfig
1124 command. Bonding devices will have the MASTER flag set; bonding slave 1124 command. Bonding devices will have the MASTER flag set; bonding slave
1125 devices will have the SLAVE flag set. The ifconfig output does not 1125 devices will have the SLAVE flag set. The ifconfig output does not
1126 contain information on which slaves are associated with which masters. 1126 contain information on which slaves are associated with which masters.
1127 1127
1128 In the example below, the bond0 interface is the master 1128 In the example below, the bond0 interface is the master
1129 (MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of 1129 (MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of
1130 bond0 have the same MAC address (HWaddr) as bond0 for all modes except 1130 bond0 have the same MAC address (HWaddr) as bond0 for all modes except
1131 TLB and ALB that require a unique MAC address for each slave. 1131 TLB and ALB that require a unique MAC address for each slave.
1132 1132
1133 # /sbin/ifconfig 1133 # /sbin/ifconfig
1134 bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 1134 bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
1135 inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 1135 inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
1136 UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 1136 UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
1137 RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0 1137 RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
1138 TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0 1138 TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
1139 collisions:0 txqueuelen:0 1139 collisions:0 txqueuelen:0
1140 1140
1141 eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 1141 eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
1142 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 1142 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
1143 RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 1143 RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
1144 TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 1144 TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
1145 collisions:0 txqueuelen:100 1145 collisions:0 txqueuelen:100
1146 Interrupt:10 Base address:0x1080 1146 Interrupt:10 Base address:0x1080
1147 1147
1148 eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 1148 eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
1149 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 1149 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
1150 RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 1150 RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
1151 TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 1151 TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
1152 collisions:0 txqueuelen:100 1152 collisions:0 txqueuelen:100
1153 Interrupt:9 Base address:0x1400 1153 Interrupt:9 Base address:0x1400
1154 1154
1155 5. Switch Configuration 1155 5. Switch Configuration
1156 ======================= 1156 =======================
1157 1157
1158 For this section, "switch" refers to whatever system the 1158 For this section, "switch" refers to whatever system the
1159 bonded devices are directly connected to (i.e., where the other end of 1159 bonded devices are directly connected to (i.e., where the other end of
1160 the cable plugs into). This may be an actual dedicated switch device, 1160 the cable plugs into). This may be an actual dedicated switch device,
1161 or it may be another regular system (e.g., another computer running 1161 or it may be another regular system (e.g., another computer running
1162 Linux). 1162 Linux).
1163 1163
1164 The active-backup, balance-tlb and balance-alb modes do not 1164 The active-backup, balance-tlb and balance-alb modes do not
1165 require any specific configuration of the switch. 1165 require any specific configuration of the switch.
1166 1166
1167 The 802.3ad mode requires that the switch have the appropriate 1167 The 802.3ad mode requires that the switch have the appropriate
1168 ports configured as an 802.3ad aggregation. The precise method used 1168 ports configured as an 802.3ad aggregation. The precise method used
1169 to configure this varies from switch to switch, but, for example, a 1169 to configure this varies from switch to switch, but, for example, a
1170 Cisco 3550 series switch requires that the appropriate ports first be 1170 Cisco 3550 series switch requires that the appropriate ports first be
1171 grouped together in a single etherchannel instance, then that 1171 grouped together in a single etherchannel instance, then that
1172 etherchannel is set to mode "lacp" to enable 802.3ad (instead of 1172 etherchannel is set to mode "lacp" to enable 802.3ad (instead of
1173 standard EtherChannel). 1173 standard EtherChannel).
1174 1174
1175 The balance-rr, balance-xor and broadcast modes generally 1175 The balance-rr, balance-xor and broadcast modes generally
1176 require that the switch have the appropriate ports grouped together. 1176 require that the switch have the appropriate ports grouped together.
1177 The nomenclature for such a group differs between switches; it may be 1177 The nomenclature for such a group differs between switches; it may be
1178 called an "etherchannel" (as in the Cisco example, above), a "trunk 1178 called an "etherchannel" (as in the Cisco example, above), a "trunk
1179 group" or some other similar variation. For these modes, each switch 1179 group" or some other similar variation. For these modes, each switch
1180 will also have its own configuration options for the switch's transmit 1180 will also have its own configuration options for the switch's transmit
1181 policy to the bond. Typical choices include XOR of either the MAC or 1181 policy to the bond. Typical choices include XOR of either the MAC or
1182 IP addresses. The transmit policy of the two peers does not need to 1182 IP addresses. The transmit policy of the two peers does not need to
1183 match. For these three modes, the bonding mode really selects a 1183 match. For these three modes, the bonding mode really selects a
1184 transmit policy for an EtherChannel group; all three will interoperate 1184 transmit policy for an EtherChannel group; all three will interoperate
1185 with another EtherChannel group. 1185 with another EtherChannel group.
1186 1186
1187 1187
1188 6. 802.1q VLAN Support 1188 6. 802.1q VLAN Support
1189 ====================== 1189 ======================
1190 1190
1191 It is possible to configure VLAN devices over a bond interface 1191 It is possible to configure VLAN devices over a bond interface
1192 using the 8021q driver. However, only packets coming from the 8021q 1192 using the 8021q driver. However, only packets coming from the 8021q
1193 driver and passing through bonding will be tagged by default. Self 1193 driver and passing through bonding will be tagged by default. Self
1194 generated packets, for example, bonding's learning packets or ARP 1194 generated packets, for example, bonding's learning packets or ARP
1195 packets generated by either ALB mode or the ARP monitor mechanism, are 1195 packets generated by either ALB mode or the ARP monitor mechanism, are
1196 tagged internally by bonding itself. As a result, bonding must 1196 tagged internally by bonding itself. As a result, bonding must
1197 "learn" the VLAN IDs configured above it, and use those IDs to tag 1197 "learn" the VLAN IDs configured above it, and use those IDs to tag
1198 self generated packets. 1198 self generated packets.
1199 1199
1200 For reasons of simplicity, and to support the use of adapters 1200 For reasons of simplicity, and to support the use of adapters
1201 that can do VLAN hardware acceleration offloading, the bonding 1201 that can do VLAN hardware acceleration offloading, the bonding
1202 interface declares itself as fully hardware offloading capable; it gets 1202 interface declares itself as fully hardware offloading capable; it gets
1203 the add_vid/kill_vid notifications to gather the necessary 1203 the add_vid/kill_vid notifications to gather the necessary
1204 information, and it propagates those actions to the slaves. In case 1204 information, and it propagates those actions to the slaves. In case
1205 of mixed adapter types, hardware accelerated tagged packets that 1205 of mixed adapter types, hardware accelerated tagged packets that
1206 should go through an adapter that is not offloading capable are 1206 should go through an adapter that is not offloading capable are
1207 "un-accelerated" by the bonding driver so the VLAN tag sits in the 1207 "un-accelerated" by the bonding driver so the VLAN tag sits in the
1208 regular location. 1208 regular location.
1209 1209
1210 VLAN interfaces *must* be added on top of a bonding interface 1210 VLAN interfaces *must* be added on top of a bonding interface
1211 only after enslaving at least one slave. The bonding interface has a 1211 only after enslaving at least one slave. The bonding interface has a
1212 hardware address of 00:00:00:00:00:00 until the first slave is added. 1212 hardware address of 00:00:00:00:00:00 until the first slave is added.
1213 If the VLAN interface is created prior to the first enslavement, it 1213 If the VLAN interface is created prior to the first enslavement, it
1214 would pick up the all-zeroes hardware address. Once the first slave 1214 would pick up the all-zeroes hardware address. Once the first slave
1215 is attached to the bond, the bond device itself will pick up the 1215 is attached to the bond, the bond device itself will pick up the
1216 slave's hardware address, which is then available for the VLAN device. 1216 slave's hardware address, which is then available for the VLAN device.
1217 1217
1218 Also, be aware that a similar problem can occur if all slaves 1218 Also, be aware that a similar problem can occur if all slaves
1219 are released from a bond that still has one or more VLAN interfaces on 1219 are released from a bond that still has one or more VLAN interfaces on
1220 top of it. When a new slave is added, the bonding interface will 1220 top of it. When a new slave is added, the bonding interface will
1221 obtain its hardware address from the first slave, which might not 1221 obtain its hardware address from the first slave, which might not
1222 match the hardware address of the VLAN interfaces (which was 1222 match the hardware address of the VLAN interfaces (which was
1223 ultimately copied from an earlier slave). 1223 ultimately copied from an earlier slave).
1224 1224
1225 There are two methods to ensure that the VLAN device operates 1225 There are two methods to ensure that the VLAN device operates
1226 with the correct hardware address if all slaves are removed from a 1226 with the correct hardware address if all slaves are removed from a
1227 bond interface: 1227 bond interface:
1228 1228
1229 1. Remove all VLAN interfaces then recreate them 1229 1. Remove all VLAN interfaces then recreate them
1230 1230
1231 2. Set the bonding interface's hardware address so that it 1231 2. Set the bonding interface's hardware address so that it
1232 matches the hardware address of the VLAN interfaces. 1232 matches the hardware address of the VLAN interfaces.
1233 1233
1234 Note that changing a VLAN interface's HW address would set the 1234 Note that changing a VLAN interface's HW address would set the
1235 underlying device -- i.e. the bonding interface -- to promiscuous 1235 underlying device -- i.e. the bonding interface -- to promiscuous
1236 mode, which might not be what you want. 1236 mode, which might not be what you want.
1237 1237
1238 1238
1239 7. Link Monitoring 1239 7. Link Monitoring
1240 ================== 1240 ==================
1241 1241
1242 The bonding driver at present supports two schemes for 1242 The bonding driver at present supports two schemes for
1243 monitoring a slave device's link state: the ARP monitor and the MII 1243 monitoring a slave device's link state: the ARP monitor and the MII
1244 monitor. 1244 monitor.
1245 1245
1246 At the present time, due to implementation restrictions in the 1246 At the present time, due to implementation restrictions in the
1247 bonding driver itself, it is not possible to enable both ARP and MII 1247 bonding driver itself, it is not possible to enable both ARP and MII
1248 monitoring simultaneously. 1248 monitoring simultaneously.
1249 1249
1250 7.1 ARP Monitor Operation 1250 7.1 ARP Monitor Operation
1251 ------------------------- 1251 -------------------------
1252 1252
1253 The ARP monitor operates as its name suggests: it sends ARP 1253 The ARP monitor operates as its name suggests: it sends ARP
1254 queries to one or more designated peer systems on the network, and 1254 queries to one or more designated peer systems on the network, and
1255 uses the response as an indication that the link is operating. This 1255 uses the response as an indication that the link is operating. This
1256 gives some assurance that traffic is actually flowing to and from one 1256 gives some assurance that traffic is actually flowing to and from one
1257 or more peers on the local network. 1257 or more peers on the local network.
1258 1258
1259 The ARP monitor relies on the device driver itself to verify 1259 The ARP monitor relies on the device driver itself to verify
1260 that traffic is flowing. In particular, the driver must keep up to 1260 that traffic is flowing. In particular, the driver must keep up to
1261 date the last receive time, dev->last_rx, and transmit start time, 1261 date the last receive time, dev->last_rx, and transmit start time,
1262 dev->trans_start. If these are not updated by the driver, then the 1262 dev->trans_start. If these are not updated by the driver, then the
1263 ARP monitor will immediately fail any slaves using that driver, and 1263 ARP monitor will immediately fail any slaves using that driver, and
1264 those slaves will stay down. If network monitoring (tcpdump, etc.) 1264 those slaves will stay down. If network monitoring (tcpdump, etc.)
1265 shows the ARP requests and replies on the network, then it may be that 1265 shows the ARP requests and replies on the network, then it may be that
1266 your device driver is not updating last_rx and trans_start. 1266 your device driver is not updating last_rx and trans_start.
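The dependency on the driver's timestamps can be illustrated with a minimal sketch. This is not the bonding driver's code: the function name, the millisecond units, and the tolerance of two ARP intervals are assumptions chosen only to show why a driver that never updates last_rx or trans_start leaves its slave permanently failed.

```python
def slave_passes_arp_check(now_ms, last_rx_ms, trans_start_ms, arp_interval_ms):
    """Return True if both driver timestamps are recent enough.

    Hypothetical simplification: a slave counts as up only if it has
    both received and started a transmit within two ARP intervals.
    """
    window_ms = 2 * arp_interval_ms  # assumed tolerance, not the kernel's value
    rx_fresh = (now_ms - last_rx_ms) <= window_ms
    tx_fresh = (now_ms - trans_start_ms) <= window_ms
    return rx_fresh and tx_fresh
```

With a driver that leaves last_rx at its initial value, the receive timestamp never becomes fresh, so the check fails on every pass even though ARP replies are visible on the wire.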
1267 1267
1268 7.2 Configuring Multiple ARP Targets 1268 7.2 Configuring Multiple ARP Targets
1269 ------------------------------------ 1269 ------------------------------------
1270 1270
1271 While ARP monitoring can be done with just one target, it can 1271 While ARP monitoring can be done with just one target, it can
1272 be useful in a High Availability setup to have several targets to 1272 be useful in a High Availability setup to have several targets to
1273 monitor. In the case of just one target, the target itself may go 1273 monitor. In the case of just one target, the target itself may go
1274 down or have a problem making it unresponsive to ARP requests. Having 1274 down or have a problem making it unresponsive to ARP requests. Having
1275 an additional target (or several) increases the reliability of the ARP 1275 an additional target (or several) increases the reliability of the ARP
1276 monitoring. 1276 monitoring.
1277 1277
1278 Multiple ARP targets must be separated by commas as follows: 1278 Multiple ARP targets must be separated by commas as follows:
1279 1279
1280 # example options for ARP monitoring with three targets 1280 # example options for ARP monitoring with three targets
1281 alias bond0 bonding 1281 alias bond0 bonding
1282 options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9 1282 options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
1283 1283
1284 For just a single target the options would resemble: 1284 For just a single target the options would resemble:
1285 1285
1286 # example options for ARP monitoring with one target 1286 # example options for ARP monitoring with one target
1287 alias bond0 bonding 1287 alias bond0 bonding
1288 options bond0 arp_interval=60 arp_ip_target=192.168.0.100 1288 options bond0 arp_interval=60 arp_ip_target=192.168.0.100
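As a small illustration of the comma-separated format, the following sketch builds the options line from a list of targets. The helper name and its arguments are hypothetical; only the arp_interval and arp_ip_target option names come from the text above.

```python
def arp_options_line(bond, arp_interval, targets):
    """Build an 'options' line for ARP monitoring (illustrative helper)."""
    if not targets:
        raise ValueError("at least one ARP target is required")
    # Multiple targets are joined with commas, with no spaces between them.
    return "options {} arp_interval={} arp_ip_target={}".format(
        bond, arp_interval, ",".join(targets)
    )
```

Calling it with one target yields the single-target form, and with several targets yields the comma-separated form shown above.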
1289 1289
1290 1290
1291 7.3 MII Monitor Operation 1291 7.3 MII Monitor Operation
1292 ------------------------- 1292 -------------------------
1293 1293
1294 The MII monitor monitors only the carrier state of the local 1294 The MII monitor monitors only the carrier state of the local
1295 network interface. It accomplishes this in one of three ways: by 1295 network interface. It accomplishes this in one of three ways: by
1296 depending upon the device driver to maintain its carrier state, by 1296 depending upon the device driver to maintain its carrier state, by
1297 querying the device's MII registers, or by making an ethtool query to 1297 querying the device's MII registers, or by making an ethtool query to
1298 the device. 1298 the device.
1299 1299
1300 If the use_carrier module parameter is 1 (the default value), 1300 If the use_carrier module parameter is 1 (the default value),
1301 then the MII monitor will rely on the driver for carrier state 1301 then the MII monitor will rely on the driver for carrier state
1302 information (via the netif_carrier subsystem). As explained in the 1302 information (via the netif_carrier subsystem). As explained in the
1303 use_carrier parameter information, above, if the MII monitor fails to 1303 use_carrier parameter information, above, if the MII monitor fails to
1304 detect carrier loss on the device (e.g., when the cable is physically 1304 detect carrier loss on the device (e.g., when the cable is physically
1305 disconnected), it may be that the driver does not support 1305 disconnected), it may be that the driver does not support
1306 netif_carrier. 1306 netif_carrier.
1307 1307
1308 If use_carrier is 0, then the MII monitor will first query the 1308 If use_carrier is 0, then the MII monitor will first query the
1309 device's MII registers (via ioctl) and check the link state. If that 1309 device's MII registers (via ioctl) and check the link state. If that
1310 request fails (not just that it returns carrier down), then the MII 1310 request fails (not just that it returns carrier down), then the MII
1311 monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain 1311 monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain
1312 the same information. If both methods fail (i.e., the driver either 1312 the same information. If both methods fail (i.e., the driver either
1313 does not support or had some error in processing both the MII register 1313 does not support or had some error in processing both the MII register
1314 and ethtool requests), then the MII monitor will assume the link is 1314 and ethtool requests), then the MII monitor will assume the link is
1315 up. 1315 up.
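The order of checks described in this section can be summarized in a short sketch. It models only the decision flow; the function name and the use of callables that raise OSError on a failed request are assumptions made for illustration, not the driver's interfaces.

```python
def mii_link_is_up(use_carrier, driver_carrier, mii_query, ethtool_query):
    """Model the MII monitor's decision flow (illustrative only).

    mii_query and ethtool_query are hypothetical callables that return
    True or False for the link state, or raise OSError when the request
    itself fails (as opposed to reporting carrier down).
    """
    if use_carrier:
        # Trust the carrier state maintained by the driver.
        return driver_carrier
    for query in (mii_query, ethtool_query):
        try:
            return query()
        except OSError:
            continue  # request failed; fall through to the next method
    # Both requests failed: the monitor assumes the link is up.
    return True
```

Note that a query that succeeds and reports carrier down ends the search; only a failed request causes the next method to be tried.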
1316 1316
1317 8. Potential Sources of Trouble 1317 8. Potential Sources of Trouble
1318 =============================== 1318 ===============================
1319 1319
1320 8.1 Adventures in Routing 1320 8.1 Adventures in Routing
1321 ------------------------- 1321 -------------------------
1322 1322
1323 When bonding is configured, it is important that the slave 1323 When bonding is configured, it is important that the slave
1324 devices not have routes that supersede routes of the master (or, 1324 devices not have routes that supersede routes of the master (or,
1325 generally, not have routes at all). For example, suppose the bonding 1325 generally, not have routes at all). For example, suppose the bonding
1326 device bond0 has two slaves, eth0 and eth1, and the routing table is 1326 device bond0 has two slaves, eth0 and eth1, and the routing table is
1327 as follows: 1327 as follows:
1328 1328
1329 Kernel IP routing table 1329 Kernel IP routing table
1330 Destination Gateway Genmask Flags MSS Window irtt Iface 1330 Destination Gateway Genmask Flags MSS Window irtt Iface
1331 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0 1331 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0
1332 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1 1332 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1
1333 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0 1333 10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0
1334 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo 1334 127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
1335 1335
1336 This routing configuration will likely still update the 1336 This routing configuration will likely still update the
1337 receive/transmit times in the driver (needed by the ARP monitor), but 1337 receive/transmit times in the driver (needed by the ARP monitor), but
1338 may bypass the bonding driver (because outgoing traffic to, in this 1338 may bypass the bonding driver (because outgoing traffic to, in this
1339 case, another host on network 10 would use eth0 or eth1 before bond0). 1339 case, another host on network 10 would use eth0 or eth1 before bond0).
1340 1340
1341 The ARP monitor (and ARP itself) may become confused by this 1341 The ARP monitor (and ARP itself) may become confused by this
1342 configuration, because ARP requests (generated by the ARP monitor) 1342 configuration, because ARP requests (generated by the ARP monitor)
1343 will be sent on one interface (bond0), but the corresponding reply 1343 will be sent on one interface (bond0), but the corresponding reply
1344 will arrive on a different interface (eth0). This reply looks to ARP 1344 will arrive on a different interface (eth0). This reply looks to ARP
1345 as an unsolicited ARP reply (because ARP matches replies on an 1345 as an unsolicited ARP reply (because ARP matches replies on an
1346 interface basis), and is discarded. The MII monitor is not affected 1346 interface basis), and is discarded. The MII monitor is not affected
1347 by the state of the routing table. 1347 by the state of the routing table.
1348 1348
1349 The solution here is simply to ensure that slaves do not have 1349 The solution here is simply to ensure that slaves do not have
1350 routes of their own, and if for some reason they must, those routes do 1350 routes of their own, and if for some reason they must, those routes do
1351 not supersede routes of their master. This should generally be the 1351 not supersede routes of their master. This should generally be the
1352 case, but unusual configurations or errant manual or automatic static 1352 case, but unusual configurations or errant manual or automatic static
1353 route additions may cause trouble. 1353 route additions may cause trouble.
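One way to spot the problem is to scan the routing table for entries whose interface is a slave. The sketch below assumes the eight-column layout shown in the table above, passed in as text; the function name is hypothetical.

```python
def slave_routes(route_table, slaves):
    """Return (destination, iface) pairs for routes bound to slave devices.

    route_table is text in the eight-column format shown above;
    slaves is a collection of slave interface names.
    """
    hits = []
    for line in route_table.splitlines():
        fields = line.split()
        # Route rows have eight columns; the last one is the interface.
        if len(fields) == 8 and fields[-1] in slaves:
            hits.append((fields[0], fields[-1]))
    return hits
```

For the example table, passing eth0 and eth1 as the slaves reports the two 10.0.0.0 routes that should be removed. The title and column header rows are skipped because they never name a slave in the last column.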
1354 1354
1355 8.2 Ethernet Device Renaming 1355 8.2 Ethernet Device Renaming
1356 ---------------------------- 1356 ----------------------------
1357 1357
1358 On systems with network configuration scripts that do not 1358 On systems with network configuration scripts that do not
1359 associate physical devices directly with network interface names (so 1359 associate physical devices directly with network interface names (so
1360 that the same physical device always has the same "ethX" name), it may 1360 that the same physical device always has the same "ethX" name), it may
1361 be necessary to add some special logic to either /etc/modules.conf or 1361 be necessary to add some special logic to either /etc/modules.conf or
1362 /etc/modprobe.conf (depending upon which is installed on the system). 1362 /etc/modprobe.conf (depending upon which is installed on the system).
1363 1363
1364 For example, given a modules.conf containing the following: 1364 For example, given a modules.conf containing the following:
1365 1365
1366 alias bond0 bonding 1366 alias bond0 bonding
1367 options bond0 mode=some-mode miimon=50 1367 options bond0 mode=some-mode miimon=50
1368 alias eth0 tg3 1368 alias eth0 tg3
1369 alias eth1 tg3 1369 alias eth1 tg3
1370 alias eth2 e1000 1370 alias eth2 e1000
1371 alias eth3 e1000 1371 alias eth3 e1000
1372 1372
1373 If neither eth0 nor eth1 is a slave to bond0, then when the 1373 If neither eth0 nor eth1 is a slave to bond0, then when the
1374 bond0 interface comes up, the devices may end up reordered. This 1374 bond0 interface comes up, the devices may end up reordered. This
1375 happens because bonding is loaded first, then its slave devices' 1375 happens because bonding is loaded first, then its slave devices'
1376 drivers are loaded next. Since no other drivers have been loaded, 1376 drivers are loaded next. Since no other drivers have been loaded,
1377 when the e1000 driver loads, it will receive eth0 and eth1 for its 1377 when the e1000 driver loads, it will receive eth0 and eth1 for its
1378 devices, but the bonding configuration tries to enslave eth2 and eth3 1378 devices, but the bonding configuration tries to enslave eth2 and eth3
1379 (which may later be assigned to the tg3 devices). 1379 (which may later be assigned to the tg3 devices).
1380 1380
1381 Adding the following: 1381 Adding the following:
1382 1382
1383 add above bonding e1000 tg3 1383 add above bonding e1000 tg3
1384 1384
1385 causes modprobe to load e1000 then tg3, in that order, when 1385 causes modprobe to load e1000 then tg3, in that order, when
1386 bonding is loaded. This command is fully documented in the 1386 bonding is loaded. This command is fully documented in the
1387 modules.conf manual page. 1387 modules.conf manual page.
1388 1388
1389 On systems utilizing modprobe.conf (or modprobe.conf.local), 1389 On systems utilizing modprobe.conf (or modprobe.conf.local),
1390 an equivalent problem can occur. In this case, the following can be 1390 an equivalent problem can occur. In this case, the following can be
1391 added to modprobe.conf (or modprobe.conf.local, as appropriate) 1391 added to modprobe.conf (or modprobe.conf.local, as appropriate)
1392 (all on one line; it has been split here for clarity): 1392 (all on one line; it has been split here for clarity):
1393 1393
1394 install bonding /sbin/modprobe tg3; /sbin/modprobe e1000; 1394 install bonding /sbin/modprobe tg3; /sbin/modprobe e1000;
1395 /sbin/modprobe --ignore-install bonding 1395 /sbin/modprobe --ignore-install bonding
1396 1396
1397 This will, when loading the bonding module, rather than 1397 This will, when loading the bonding module, rather than
1398 performing the normal action, instead execute the provided command. 1398 performing the normal action, instead execute the provided command.
1399 This command loads the device drivers in the order needed, then calls 1399 This command loads the device drivers in the order needed, then calls
1400 modprobe with --ignore-install to cause the normal action to then take 1400 modprobe with --ignore-install to cause the normal action to then take
1401 place. Full documentation on this can be found in the modprobe.conf 1401 place. Full documentation on this can be found in the modprobe.conf
1402 and modprobe manual pages. 1402 and modprobe manual pages.
1403 1403
1404 8.3 Painfully Slow Or No Failed Link Detection By Miimon 1404 8.3 Painfully Slow Or No Failed Link Detection By Miimon
1405 -------------------------------------------------------- 1405 --------------------------------------------------------
1406 1406
1407 By default, bonding enables the use_carrier option, which 1407 By default, bonding enables the use_carrier option, which
1408 instructs bonding to trust the driver to maintain carrier state. 1408 instructs bonding to trust the driver to maintain carrier state.
1409 1409
1410 As discussed in the options section, above, some drivers do 1410 As discussed in the options section, above, some drivers do
1411 not support the netif_carrier_on/_off link state tracking system. 1411 not support the netif_carrier_on/_off link state tracking system.
1412 With use_carrier enabled, bonding will always see these links as up, 1412 With use_carrier enabled, bonding will always see these links as up,
1413 regardless of their actual state. 1413 regardless of their actual state.
1414 1414
1415 Additionally, other drivers do support netif_carrier, but do 1415 Additionally, other drivers do support netif_carrier, but do
1416 not maintain it in real time, e.g., only polling the link state at 1416 not maintain it in real time, e.g., only polling the link state at
1417 some fixed interval. In this case, miimon will detect failures, but 1417 some fixed interval. In this case, miimon will detect failures, but
1418 only after some long period of time has expired. If it appears that 1418 only after some long period of time has expired. If it appears that
1419 miimon is very slow in detecting link failures, try specifying 1419 miimon is very slow in detecting link failures, try specifying
1420 use_carrier=0 to see if that improves the failure detection time. If 1420 use_carrier=0 to see if that improves the failure detection time. If
1421 it does, then it may be that the driver checks the carrier state at a 1421 it does, then it may be that the driver checks the carrier state at a
1422 fixed interval, but does not cache the MII register values (so the 1422 fixed interval, but does not cache the MII register values (so the
1423 use_carrier=0 method of querying the registers directly works). If 1423 use_carrier=0 method of querying the registers directly works). If
1424 use_carrier=0 does not improve the failover, then the driver may cache 1424 use_carrier=0 does not improve the failover, then the driver may cache
1425 the registers, or the problem may be elsewhere. 1425 the registers, or the problem may be elsewhere.
1426 1426
1427 Also, remember that miimon only checks for the device's 1427 Also, remember that miimon only checks for the device's
1428 carrier state. It has no way to determine the state of devices on or 1428 carrier state. It has no way to determine the state of devices on or
1429 beyond other ports of a switch, or if a switch is refusing to pass 1429 beyond other ports of a switch, or if a switch is refusing to pass
1430 traffic while still maintaining carrier on. 1430 traffic while still maintaining carrier on.
1431 1431
1432 9. SNMP agents 1432 9. SNMP agents
1433 ============== 1433 ==============
1434 1434
1435 If running SNMP agents, the bonding driver should be loaded 1435 If running SNMP agents, the bonding driver should be loaded
1436 before any network drivers participating in a bond. This requirement 1436 before any network drivers participating in a bond. This requirement
1437 is due to the interface index (ipAdEntIfIndex) being associated with 1437 is due to the interface index (ipAdEntIfIndex) being associated with
1438 the first interface found with a given IP address. That is, there is 1438 the first interface found with a given IP address. That is, there is
1439 only one ipAdEntIfIndex for each IP address. For example, if eth0 and 1439 only one ipAdEntIfIndex for each IP address. For example, if eth0 and
1440 eth1 are slaves of bond0 and the driver for eth0 is loaded before the 1440 eth1 are slaves of bond0 and the driver for eth0 is loaded before the
1441 bonding driver, the interface for the IP address will be associated 1441 bonding driver, the interface for the IP address will be associated
1442 with the eth0 interface. This configuration is shown below; the IP 1442 with the eth0 interface. This configuration is shown below; the IP
1443 address 192.168.1.1 has an interface index of 2 which indexes to eth0 1443 address 192.168.1.1 has an interface index of 2 which indexes to eth0
1444 in the ifDescr table (ifDescr.2). 1444 in the ifDescr table (ifDescr.2).
1445 1445
1446 interfaces.ifTable.ifEntry.ifDescr.1 = lo 1446 interfaces.ifTable.ifEntry.ifDescr.1 = lo
1447 interfaces.ifTable.ifEntry.ifDescr.2 = eth0 1447 interfaces.ifTable.ifEntry.ifDescr.2 = eth0
1448 interfaces.ifTable.ifEntry.ifDescr.3 = eth1 1448 interfaces.ifTable.ifEntry.ifDescr.3 = eth1
1449 interfaces.ifTable.ifEntry.ifDescr.4 = eth2 1449 interfaces.ifTable.ifEntry.ifDescr.4 = eth2
1450 interfaces.ifTable.ifEntry.ifDescr.5 = eth3 1450 interfaces.ifTable.ifEntry.ifDescr.5 = eth3
1451 interfaces.ifTable.ifEntry.ifDescr.6 = bond0 1451 interfaces.ifTable.ifEntry.ifDescr.6 = bond0
1452 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5 1452 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
1453 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 1453 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
1454 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4 1454 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
1455 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 1455 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
1456 1456
1457 This problem is avoided by loading the bonding driver before 1457 This problem is avoided by loading the bonding driver before
1458 any network drivers participating in a bond. Below is an example of 1458 any network drivers participating in a bond. Below is an example of
1459 loading the bonding driver first; the IP address 192.168.1.1 is 1459 loading the bonding driver first; the IP address 192.168.1.1 is
1460 correctly associated with ifDescr.2. 1460 correctly associated with ifDescr.2.
1461 1461
1462 interfaces.ifTable.ifEntry.ifDescr.1 = lo 1462 interfaces.ifTable.ifEntry.ifDescr.1 = lo
1463 interfaces.ifTable.ifEntry.ifDescr.2 = bond0 1463 interfaces.ifTable.ifEntry.ifDescr.2 = bond0
1464 interfaces.ifTable.ifEntry.ifDescr.3 = eth0 1464 interfaces.ifTable.ifEntry.ifDescr.3 = eth0
1465 interfaces.ifTable.ifEntry.ifDescr.4 = eth1 1465 interfaces.ifTable.ifEntry.ifDescr.4 = eth1
1466 interfaces.ifTable.ifEntry.ifDescr.5 = eth2 1466 interfaces.ifTable.ifEntry.ifDescr.5 = eth2
1467 interfaces.ifTable.ifEntry.ifDescr.6 = eth3 1467 interfaces.ifTable.ifEntry.ifDescr.6 = eth3
1468 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6 1468 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
1469 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2 1469 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
1470 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5 1470 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
1471 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1 1471 ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
1472 1472
1473 While some distributions may not report the interface name in 1473 While some distributions may not report the interface name in
1474 ifDescr, the association between the IP address and IfIndex remains 1474 ifDescr, the association between the IP address and IfIndex remains
1475 and SNMP functions such as Interface_Scan_Next will report that 1475 and SNMP functions such as Interface_Scan_Next will report that
1476 association. 1476 association.
1477 1477
1478 10. Promiscuous mode 1478 10. Promiscuous mode
1479 ==================== 1479 ====================
1480 1480
1481 When running network monitoring tools, e.g., tcpdump, it is 1481 When running network monitoring tools, e.g., tcpdump, it is
1482 common to enable promiscuous mode on the device, so that all traffic 1482 common to enable promiscuous mode on the device, so that all traffic
1483 is seen (instead of seeing only traffic destined for the local host). 1483 is seen (instead of seeing only traffic destined for the local host).
1484 The bonding driver handles promiscuous mode changes to the bonding 1484 The bonding driver handles promiscuous mode changes to the bonding
1485 master device (e.g., bond0), and propagates the setting to the slave 1485 master device (e.g., bond0), and propagates the setting to the slave
1486 devices. 1486 devices.
1487 1487
1488 For the balance-rr, balance-xor, broadcast, and 802.3ad modes, 1488 For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
1489 the promiscuous mode setting is propagated to all slaves. 1489 the promiscuous mode setting is propagated to all slaves.
1490 1490
1491 For the active-backup, balance-tlb and balance-alb modes, the 1491 For the active-backup, balance-tlb and balance-alb modes, the
1492 promiscuous mode setting is propagated only to the active slave. 1492 promiscuous mode setting is propagated only to the active slave.
1493 1493
1494 For balance-tlb mode, the active slave is the slave currently 1494 For balance-tlb mode, the active slave is the slave currently
1495 receiving inbound traffic. 1495 receiving inbound traffic.
1496 1496
1497 For balance-alb mode, the active slave is the slave used as a 1497 For balance-alb mode, the active slave is the slave used as a
1498 "primary." This slave is used for mode-specific control traffic, for 1498 "primary." This slave is used for mode-specific control traffic, for
1499 sending to peers that are unassigned or if the load is unbalanced. 1499 sending to peers that are unassigned or if the load is unbalanced.
1500 1500
1501 For the active-backup, balance-tlb and balance-alb modes, when 1501 For the active-backup, balance-tlb and balance-alb modes, when
1502 the active slave changes (e.g., due to a link failure), the 1502 the active slave changes (e.g., due to a link failure), the
1503 promiscuous setting will be propagated to the new active slave. 1503 promiscuous setting will be propagated to the new active slave.
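The propagation rules in this section reduce to a lookup on the bonding mode. The sketch below restates them; the function and constant names are hypothetical, and the mode names are the ones listed above.

```python
ALL_SLAVES_MODES = {"balance-rr", "balance-xor", "broadcast", "802.3ad"}
ACTIVE_ONLY_MODES = {"active-backup", "balance-tlb", "balance-alb"}


def promisc_targets(mode, slaves, active_slave):
    """Return the slaves that receive the promiscuous setting (illustrative)."""
    if mode in ALL_SLAVES_MODES:
        return list(slaves)
    if mode in ACTIVE_ONLY_MODES:
        # Only the active slave; re-evaluate after a failover so the
        # setting follows the new active slave.
        return [active_slave]
    raise ValueError("unknown bonding mode: {}".format(mode))
```

After a link failure in one of the active-only modes, calling the function again with the new active slave models the setting being moved to that slave.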
1504 1504
1505 11. Configuring Bonding for High Availability 1505 11. Configuring Bonding for High Availability
1506 ============================================= 1506 =============================================
1507 1507
1508 High Availability refers to configurations that provide 1508 High Availability refers to configurations that provide
1509 maximum network availability by having redundant or backup devices, 1509 maximum network availability by having redundant or backup devices,
1510 links or switches between the host and the rest of the world. The 1510 links or switches between the host and the rest of the world. The
1511 goal is to provide the maximum availability of network connectivity 1511 goal is to provide the maximum availability of network connectivity
1512 (i.e., the network always works), even though other configurations 1512 (i.e., the network always works), even though other configurations
1513 could provide higher throughput. 1513 could provide higher throughput.
1514 1514
1515 11.1 High Availability in a Single Switch Topology 1515 11.1 High Availability in a Single Switch Topology
1516 -------------------------------------------------- 1516 --------------------------------------------------
1517 1517
1518 If two hosts (or a host and a single switch) are directly 1518 If two hosts (or a host and a single switch) are directly
1519 connected via multiple physical links, then there is no availability 1519 connected via multiple physical links, then there is no availability
1520 penalty to optimizing for maximum bandwidth. In this case, there is 1520 penalty to optimizing for maximum bandwidth. In this case, there is
1521 only one switch (or peer), so if it fails, there is no alternative 1521 only one switch (or peer), so if it fails, there is no alternative
1522 access to fail over to. Additionally, the bonding load balance modes 1522 access to fail over to. Additionally, the bonding load balance modes
1523 support link monitoring of their members, so if individual links fail, 1523 support link monitoring of their members, so if individual links fail,
1524 the load will be rebalanced across the remaining devices. 1524 the load will be rebalanced across the remaining devices.
1525 1525
1526 See Section 12, "Configuring Bonding for Maximum Throughput" 1526 See Section 12, "Configuring Bonding for Maximum Throughput"
1527 for information on configuring bonding with one peer device. 1527 for information on configuring bonding with one peer device.
1528 1528
1529 11.2 High Availability in a Multiple Switch Topology 1529 11.2 High Availability in a Multiple Switch Topology
1530 ---------------------------------------------------- 1530 ----------------------------------------------------
1531 1531
1532 With multiple switches, the configuration of bonding and the 1532 With multiple switches, the configuration of bonding and the
1533 network changes dramatically. In multiple switch topologies, there is 1533 network changes dramatically. In multiple switch topologies, there is
1534 a tradeoff between network availability and usable bandwidth. 1534 a tradeoff between network availability and usable bandwidth.
1535 1535
1536 Below is a sample network, configured to maximize the 1536 Below is a sample network, configured to maximize the
1537 availability of the network: 1537 availability of the network:
1538 1538
1539 | | 1539 | |
1540 |port3 port3| 1540 |port3 port3|
1541 +-----+----+ +-----+----+ 1541 +-----+----+ +-----+----+
1542 | |port2 ISL port2| | 1542 | |port2 ISL port2| |
1543 | switch A +--------------------------+ switch B | 1543 | switch A +--------------------------+ switch B |
1544 | | | | 1544 | | | |
1545 +-----+----+ +-----++---+ 1545 +-----+----+ +-----++---+
1546 |port1 port1| 1546 |port1 port1|
1547 | +-------+ | 1547 | +-------+ |
1548 +-------------+ host1 +---------------+ 1548 +-------------+ host1 +---------------+
1549 eth0 +-------+ eth1 1549 eth0 +-------+ eth1
1550 1550
1551 In this configuration, there is a link between the two 1551 In this configuration, there is a link between the two
1552 switches (ISL, or inter switch link), and multiple ports connecting to 1552 switches (ISL, or inter switch link), and multiple ports connecting to
1553 the outside world ("port3" on each switch). There is no technical 1553 the outside world ("port3" on each switch). There is no technical
1554 reason that this could not be extended to a third switch. 1554 reason that this could not be extended to a third switch.
1555 1555
1556 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology 1556 11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
1557 ------------------------------------------------------------- 1557 -------------------------------------------------------------
1558 1558
1559 In a topology such as the example above, the active-backup and 1559 In a topology such as the example above, the active-backup and
1560 broadcast modes are the only useful bonding modes when optimizing for 1560 broadcast modes are the only useful bonding modes when optimizing for
1561 availability; the other modes require all links to terminate on the 1561 availability; the other modes require all links to terminate on the
1562 same peer for them to behave rationally. 1562 same peer for them to behave rationally.
1563 1563
1564 active-backup: This is generally the preferred mode, particularly if 1564 active-backup: This is generally the preferred mode, particularly if
1565 the switches have an ISL and play together well. If the 1565 the switches have an ISL and play together well. If the
1566 network configuration is such that one switch is specifically 1566 network configuration is such that one switch is specifically
1567 a backup switch (e.g., has lower capacity, higher cost, etc), 1567 a backup switch (e.g., has lower capacity, higher cost, etc),
1568 then the primary option can be used to ensure that the 1568 then the primary option can be used to ensure that the
1569 preferred link is always used when it is available. 1569 preferred link is always used when it is available.
1570 1570
1571 broadcast: This mode is really a special purpose mode, and is suitable 1571 broadcast: This mode is really a special purpose mode, and is suitable
1572 only for very specific needs. For example, suppose the two 1572 only for very specific needs. For example, suppose the two
1573 switches are not connected (no ISL), and the networks beyond 1573 switches are not connected (no ISL), and the networks beyond
1574 them are totally independent. In this case, if it is 1574 them are totally independent. In this case, if it is
1575 necessary for some specific one-way traffic to reach both 1575 necessary for some specific one-way traffic to reach both
1576 independent networks, then the broadcast mode may be suitable. 1576 independent networks, then the broadcast mode may be suitable.
1577 1577
1578 11.2.2 HA Link Monitoring Selection for Multiple Switch Topology 1578 11.2.2 HA Link Monitoring Selection for Multiple Switch Topology
1579 ---------------------------------------------------------------- 1579 ----------------------------------------------------------------
1580 1580
1581 The choice of link monitoring ultimately depends upon your 1581 The choice of link monitoring ultimately depends upon your
1582 switch. If the switch can reliably fail ports in response to other 1582 switch. If the switch can reliably fail ports in response to other
1583 failures, then either the MII or the ARP monitor should work. For 1583 failures, then either the MII or the ARP monitor should work. For
1584 instance, in the example above, if the "port3" link fails at the remote 1584 instance, in the example above, if the "port3" link fails at the remote
1585 end, the MII monitor has no direct means to detect this. The ARP 1585 end, the MII monitor has no direct means to detect this. The ARP
1586 monitor could be configured with a target at the remote end of port3, 1586 monitor could be configured with a target at the remote end of port3,
1587 thus detecting that failure without switch support. 1587 thus detecting that failure without switch support.
1588 1588
1589 In general, however, in a multiple switch topology, the ARP 1589 In general, however, in a multiple switch topology, the ARP
1590 monitor can provide a higher level of reliability in detecting end to 1590 monitor can provide a higher level of reliability in detecting end to
1591 end connectivity failures (which may be caused by the failure of any 1591 end connectivity failures (which may be caused by the failure of any
1592 individual component to pass traffic for any reason). Additionally, 1592 individual component to pass traffic for any reason). Additionally,
1593 the ARP monitor should be configured with multiple targets (at least 1593 the ARP monitor should be configured with multiple targets (at least
1594 one for each switch in the network). This will ensure that, 1594 one for each switch in the network). This will ensure that,
1595 regardless of which switch is active, the ARP monitor has a suitable 1595 regardless of which switch is active, the ARP monitor has a suitable
1596 target to query. 1596 target to query.
1597 1597
1598 1598
1599 12. Configuring Bonding for Maximum Throughput 1599 12. Configuring Bonding for Maximum Throughput
1600 ============================================== 1600 ==============================================
1601 1601
1602 12.1 Maximizing Throughput in a Single Switch Topology 1602 12.1 Maximizing Throughput in a Single Switch Topology
1603 ------------------------------------------------------ 1603 ------------------------------------------------------
1604 1604
1605 In a single switch configuration, the best method to maximize 1605 In a single switch configuration, the best method to maximize
1606 throughput depends upon the application and network environment. The 1606 throughput depends upon the application and network environment. The
1607 various load balancing modes each have strengths and weaknesses in 1607 various load balancing modes each have strengths and weaknesses in
1608 different environments, as detailed below. 1608 different environments, as detailed below.
1609 1609
1610 For this discussion, we will break down the topologies into 1610 For this discussion, we will break down the topologies into
1611 two categories. Depending upon the destination of most traffic, we 1611 two categories. Depending upon the destination of most traffic, we
1612 categorize them into either "gatewayed" or "local" configurations. 1612 categorize them into either "gatewayed" or "local" configurations.
1613 1613
1614 In a gatewayed configuration, the "switch" is acting primarily 1614 In a gatewayed configuration, the "switch" is acting primarily
1615 as a router, and the majority of traffic passes through this router to 1615 as a router, and the majority of traffic passes through this router to
1616 other networks. An example would be the following: 1616 other networks. An example would be the following:
1617 1617
1618 1618
1619 +----------+ +----------+ 1619 +----------+ +----------+
1620 | |eth0 port1| | to other networks 1620 | |eth0 port1| | to other networks
1621 | Host A +---------------------+ router +-------------------> 1621 | Host A +---------------------+ router +------------------->
1622 | +---------------------+ | Hosts B and C are out 1622 | +---------------------+ | Hosts B and C are out
1623 | |eth1 port2| | here somewhere 1623 | |eth1 port2| | here somewhere
1624 +----------+ +----------+ 1624 +----------+ +----------+
1625 1625
1626 The router may be a dedicated router device, or another host 1626 The router may be a dedicated router device, or another host
1627 acting as a gateway. For our discussion, the important point is that 1627 acting as a gateway. For our discussion, the important point is that
1628 the majority of traffic from Host A will pass through the router to 1628 the majority of traffic from Host A will pass through the router to
1629 some other network before reaching its final destination. 1629 some other network before reaching its final destination.
1630 1630
1631 In a gatewayed network configuration, although Host A may 1631 In a gatewayed network configuration, although Host A may
1632 communicate with many other systems, all of its traffic will be sent 1632 communicate with many other systems, all of its traffic will be sent
1633 and received via one other peer on the local network, the router. 1633 and received via one other peer on the local network, the router.
1634 1634
1635 Note that the case of two systems connected directly via 1635 Note that the case of two systems connected directly via
1636 multiple physical links is, for purposes of configuring bonding, the 1636 multiple physical links is, for purposes of configuring bonding, the
1637 same as a gatewayed configuration. In that case, it happens that all 1637 same as a gatewayed configuration. In that case, it happens that all
1638 traffic is destined for the "gateway" itself, not some other network 1638 traffic is destined for the "gateway" itself, not some other network
1639 beyond the gateway. 1639 beyond the gateway.
1640 1640
1641 In a local configuration, the "switch" is acting primarily as 1641 In a local configuration, the "switch" is acting primarily as
1642 a switch, and the majority of traffic passes through this switch to 1642 a switch, and the majority of traffic passes through this switch to
1643 reach other stations on the same network. An example would be the 1643 reach other stations on the same network. An example would be the
1644 following: 1644 following:
1645 1645
1646 +----------+ +----------+ +--------+ 1646 +----------+ +----------+ +--------+
1647 | |eth0 port1| +-------+ Host B | 1647 | |eth0 port1| +-------+ Host B |
1648 | Host A +------------+ switch |port3 +--------+ 1648 | Host A +------------+ switch |port3 +--------+
1649 | +------------+ | +--------+ 1649 | +------------+ | +--------+
1650 | |eth1 port2| +------------------+ Host C | 1650 | |eth1 port2| +------------------+ Host C |
1651 +----------+ +----------+port4 +--------+ 1651 +----------+ +----------+port4 +--------+
1652 1652
1653 1653
1654 Again, the switch may be a dedicated switch device, or another 1654 Again, the switch may be a dedicated switch device, or another
1655 host acting as a gateway. For our discussion, the important point is 1655 host acting as a gateway. For our discussion, the important point is
1656 that the majority of traffic from Host A is destined for other hosts 1656 that the majority of traffic from Host A is destined for other hosts
1657 on the same local network (Hosts B and C in the above example). 1657 on the same local network (Hosts B and C in the above example).
1658 1658
1659 In summary, in a gatewayed configuration, traffic to and from 1659 In summary, in a gatewayed configuration, traffic to and from
1660 the bonded device will be to the same MAC level peer on the network 1660 the bonded device will be to the same MAC level peer on the network
1661 (the gateway itself, i.e., the router), regardless of its final 1661 (the gateway itself, i.e., the router), regardless of its final
1662 destination. In a local configuration, traffic flows directly to and 1662 destination. In a local configuration, traffic flows directly to and
1663 from the final destinations; thus, each destination (Host B, Host C) 1663 from the final destinations; thus, each destination (Host B, Host C)
1664 will be addressed directly by its individual MAC address. 1664 will be addressed directly by its individual MAC address.
1665 1665
1666 This distinction between a gatewayed and a local network 1666 This distinction between a gatewayed and a local network
1667 configuration is important because many of the load balancing modes 1667 configuration is important because many of the load balancing modes
1668 available use the MAC addresses of the local network source and 1668 available use the MAC addresses of the local network source and
1669 destination to make load balancing decisions. The behavior of each 1669 destination to make load balancing decisions. The behavior of each
1670 mode is described below. 1670 mode is described below.
1671 1671
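As a rough illustration of why this distinction matters, the sketch
below models a MAC-based balancing decision as an XOR of the last byte
of the source and destination MAC addresses, modulo the number of
slaves. This is a simplification assumed for illustration, not the
driver's actual code; the MAC addresses and the function names are
hypothetical.

```python
def last_octet(mac: str) -> int:
    """Return the final byte of a colon-separated MAC address."""
    return int(mac.split(":")[-1], 16)


def pick_slave(src_mac: str, dst_mac: str, slave_count: int) -> int:
    """Simplified MAC-level hash (an assumption, not the driver's code)."""
    return (last_octet(src_mac) ^ last_octet(dst_mac)) % slave_count


# Hypothetical addresses for the hosts in the diagrams above.
HOST_A = "00:11:22:33:44:0a"
ROUTER = "00:11:22:33:44:01"
HOST_B = "00:11:22:33:44:0b"
HOST_C = "00:11:22:33:44:0c"
SLAVES = 2

# Gatewayed: whatever the final destination is, the MAC-level peer is
# always the router, so every frame hashes to the same slave.
gatewayed = {pick_slave(HOST_A, ROUTER, SLAVES) for _final_dest in ("B", "C")}

# Local: each destination is addressed by its own MAC address, so the
# hash can spread the peers across the slaves.
local = {pick_slave(HOST_A, peer, SLAVES) for peer in (HOST_B, HOST_C)}

print("gatewayed uses slaves:", sorted(gatewayed))
print("local uses slaves:", sorted(local))
```

With these particular addresses the gatewayed case lands on a single
slave, while the local case uses both. Other addresses may of course
hash differently.
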
1672 1672
1673 12.1.1 MT Bonding Mode Selection for Single Switch Topology 1673 12.1.1 MT Bonding Mode Selection for Single Switch Topology
1674 ----------------------------------------------------------- 1674 -----------------------------------------------------------
1675 1675
1676 This configuration is the easiest to set up and to understand, 1676 This configuration is the easiest to set up and to understand,
1677 although you will have to decide which bonding mode best suits your 1677 although you will have to decide which bonding mode best suits your
1678 needs. The trade-offs for each mode are detailed below: 1678 needs. The trade-offs for each mode are detailed below:
1679 1679
1680 balance-rr: This mode is the only mode that will permit a single 1680 balance-rr: This mode is the only mode that will permit a single
1681 TCP/IP connection to stripe traffic across multiple 1681 TCP/IP connection to stripe traffic across multiple
1682 interfaces. It is therefore the only mode that will allow a 1682 interfaces. It is therefore the only mode that will allow a
1683 single TCP/IP stream to utilize more than one interface's 1683 single TCP/IP stream to utilize more than one interface's
1684 worth of throughput. This comes at a cost, however: the 1684 worth of throughput. This comes at a cost, however: the
1685 striping often results in peer systems receiving packets out 1685 striping often results in peer systems receiving packets out
1686 of order, causing TCP/IP's congestion control system to kick 1686 of order, causing TCP/IP's congestion control system to kick
1687 in, often by retransmitting segments. 1687 in, often by retransmitting segments.
1688 1688
1689 It is possible to adjust TCP/IP's congestion limits by 1689 It is possible to adjust TCP/IP's congestion limits by
1690 altering the net.ipv4.tcp_reordering sysctl parameter. The 1690 altering the net.ipv4.tcp_reordering sysctl parameter. The
1691 usual default value is 3, and the maximum useful value is 127. 1691 usual default value is 3, and the maximum useful value is 127.
1692 For a four interface balance-rr bond, expect that a single 1692 For a four interface balance-rr bond, expect that a single
1693 TCP/IP stream will utilize no more than approximately 2.3 1693 TCP/IP stream will utilize no more than approximately 2.3
1694 interface's worth of throughput, even after adjusting 1694 interface's worth of throughput, even after adjusting
1695 tcp_reordering. 1695 tcp_reordering.
1696 1696
1697 Note that this out of order delivery occurs when both the 1697 Note that this out of order delivery occurs when both the
1698 sending and receiving systems are utilizing a multiple 1698 sending and receiving systems are utilizing a multiple
1699 interface bond. Consider a configuration in which a 1699 interface bond. Consider a configuration in which a
1700 balance-rr bond feeds into a single higher capacity network 1700 balance-rr bond feeds into a single higher capacity network
1701 channel (e.g., multiple 100Mb/sec ethernets feeding a single 1701 channel (e.g., multiple 100Mb/sec ethernets feeding a single
1702 gigabit ethernet via an etherchannel capable switch). In this 1702 gigabit ethernet via an etherchannel capable switch). In this
1703 configuration, traffic sent from the multiple 100Mb devices to 1703 configuration, traffic sent from the multiple 100Mb devices to
1704 a destination connected to the gigabit device will not see 1704 a destination connected to the gigabit device will not see
1705 packets out of order. However, traffic sent from the gigabit 1705 packets out of order. However, traffic sent from the gigabit
1706 device to the multiple 100Mb devices may or may not see 1706 device to the multiple 100Mb devices may or may not see
1707 traffic out of order, depending upon the balance policy of the 1707 traffic out of order, depending upon the balance policy of the
1708 switch. Many switches do not support any modes that stripe 1708 switch. Many switches do not support any modes that stripe
1709 traffic (instead choosing a port based upon IP or MAC level 1709 traffic (instead choosing a port based upon IP or MAC level
1710 addresses); for those devices, traffic flowing from the 1710 addresses); for those devices, traffic flowing from the
1711 gigabit device to the many 100Mb devices will only utilize one 1711 gigabit device to the many 100Mb devices will only utilize one
1712 interface. 1712 interface.
1713 1713
1714 If you are utilizing protocols other than TCP/IP, UDP for 1714 If you are utilizing protocols other than TCP/IP, UDP for
1715 example, and your application can tolerate out of order 1715 example, and your application can tolerate out of order
1716 delivery, then this mode can allow for single stream datagram 1716 delivery, then this mode can allow for single stream datagram
1717 performance that scales near linearly as interfaces are added 1717 performance that scales near linearly as interfaces are added
1718 to the bond. 1718 to the bond.
1719 1719
1720 This mode requires the switch to have the appropriate ports 1720 This mode requires the switch to have the appropriate ports
1721 configured for "etherchannel" or "trunking." 1721 configured for "etherchannel" or "trunking."
1722 1722
1723 active-backup: There is not much advantage in this network topology to 1723 active-backup: There is not much advantage in this network topology to
1724 the active-backup mode, as the inactive backup devices are all 1724 the active-backup mode, as the inactive backup devices are all
1725 connected to the same peer as the primary. In this case, a 1725 connected to the same peer as the primary. In this case, a
1726 load balancing mode (with link monitoring) will provide the 1726 load balancing mode (with link monitoring) will provide the
1727 same level of network availability, but with increased 1727 same level of network availability, but with increased
1728 available bandwidth. On the plus side, active-backup mode 1728 available bandwidth. On the plus side, active-backup mode
1729 does not require any configuration of the switch, so it may 1729 does not require any configuration of the switch, so it may
1730 have value if the hardware available does not support any of 1730 have value if the hardware available does not support any of
1731 the load balance modes. 1731 the load balance modes.
1732 1732
1733 balance-xor: This mode will limit traffic such that packets destined 1733 balance-xor: This mode will limit traffic such that packets destined
1734 for specific peers will always be sent over the same 1734 for specific peers will always be sent over the same
1735 interface. Since the destination is determined by the MAC 1735 interface. Since the destination is determined by the MAC
1736 addresses involved, this mode works best in a "local" network 1736 addresses involved, this mode works best in a "local" network
1737 configuration (as described above), with destinations all on 1737 configuration (as described above), with destinations all on
1738 the same local network. This mode is likely to be suboptimal 1738 the same local network. This mode is likely to be suboptimal
1739 if all your traffic is passed through a single router (i.e., a 1739 if all your traffic is passed through a single router (i.e., a
1740 "gatewayed" network configuration, as described above). 1740 "gatewayed" network configuration, as described above).
1741 1741
1742 As with balance-rr, the switch ports need to be configured for 1742 As with balance-rr, the switch ports need to be configured for
1743 "etherchannel" or "trunking." 1743 "etherchannel" or "trunking."
1744 1744
1745 broadcast: Like active-backup, there is not much advantage to this 1745 broadcast: Like active-backup, there is not much advantage to this
1746 mode in this type of network topology. 1746 mode in this type of network topology.
1747 1747
1748 802.3ad: This mode can be a good choice for this type of network 1748 802.3ad: This mode can be a good choice for this type of network
1749 topology. The 802.3ad mode is an IEEE standard, so all peers 1749 topology. The 802.3ad mode is an IEEE standard, so all peers
1750 that implement 802.3ad should interoperate well. The 802.3ad 1750 that implement 802.3ad should interoperate well. The 802.3ad
1751 protocol includes automatic configuration of the aggregates, 1751 protocol includes automatic configuration of the aggregates,
1752 so minimal manual configuration of the switch is needed 1752 so minimal manual configuration of the switch is needed
1753 (typically only to designate that some set of devices is 1753 (typically only to designate that some set of devices is
1754 available for 802.3ad). The 802.3ad standard also mandates 1754 available for 802.3ad). The 802.3ad standard also mandates
1755 that frames be delivered in order (within certain limits), so 1755 that frames be delivered in order (within certain limits), so
1756 in general single connections will not see misordering of 1756 in general single connections will not see misordering of
1757 packets. The 802.3ad mode does have some drawbacks: the 1757 packets. The 802.3ad mode does have some drawbacks: the
1758 standard mandates that all devices in the aggregate operate at 1758 standard mandates that all devices in the aggregate operate at
1759 the same speed and duplex. Also, as with all bonding load 1759 the same speed and duplex. Also, as with all bonding load
1760 balance modes other than balance-rr, no single connection will 1760 balance modes other than balance-rr, no single connection will
1761 be able to utilize more than a single interface's worth of 1761 be able to utilize more than a single interface's worth of
1762 bandwidth. 1762 bandwidth.
1763 1763
1764 Additionally, the Linux bonding 802.3ad implementation 1764 Additionally, the Linux bonding 802.3ad implementation
1765 distributes traffic by peer (using an XOR of MAC addresses), 1765 distributes traffic by peer (using an XOR of MAC addresses),
1766 so in a "gatewayed" configuration, all outgoing traffic will 1766 so in a "gatewayed" configuration, all outgoing traffic will
1767 generally use the same device. Incoming traffic may also end 1767 generally use the same device. Incoming traffic may also end
1768 up on a single device, but that is dependent upon the 1768 up on a single device, but that is dependent upon the
1769 balancing policy of the peer's 802.3ad implementation. In a 1769 balancing policy of the peer's 802.3ad implementation. In a
1770 "local" configuration, traffic will be distributed across the 1770 "local" configuration, traffic will be distributed across the
1771 devices in the bond. 1771 devices in the bond.
1772 1772
1773 Finally, the 802.3ad mode mandates the use of the MII monitor; 1773 Finally, the 802.3ad mode mandates the use of the MII monitor;
1774 therefore, the ARP monitor is not available in this mode. 1774 therefore, the ARP monitor is not available in this mode.
1775 1775
1776 balance-tlb: The balance-tlb mode balances outgoing traffic by peer. 1776 balance-tlb: The balance-tlb mode balances outgoing traffic by peer.
1777 Since the balancing is done according to MAC address, in a 1777 Since the balancing is done according to MAC address, in a
1778 "gatewayed" configuration (as described above), this mode will 1778 "gatewayed" configuration (as described above), this mode will
1779 send all traffic across a single device. However, in a 1779 send all traffic across a single device. However, in a
1780 "local" network configuration, this mode balances multiple 1780 "local" network configuration, this mode balances multiple
1781 local network peers across devices in a vaguely intelligent 1781 local network peers across devices in a vaguely intelligent
1782 manner (not a simple XOR as in balance-xor or 802.3ad mode), 1782 manner (not a simple XOR as in balance-xor or 802.3ad mode),
1783 so that mathematically unlucky MAC addresses (i.e., ones that 1783 so that mathematically unlucky MAC addresses (i.e., ones that
1784 XOR to the same value) will not all "bunch up" on a single 1784 XOR to the same value) will not all "bunch up" on a single
1785 interface. 1785 interface.
1786 1786
1787 Unlike 802.3ad, interfaces may be of differing speeds, and no 1787 Unlike 802.3ad, interfaces may be of differing speeds, and no
1788 special switch configuration is required. On the down side, 1788 special switch configuration is required. On the down side,
1789 in this mode all incoming traffic arrives over a single 1789 in this mode all incoming traffic arrives over a single
1790 interface, this mode requires certain ethtool support in the 1790 interface, this mode requires certain ethtool support in the
1791 network device driver of the slave interfaces, and the ARP 1791 network device driver of the slave interfaces, and the ARP
1792 monitor is not available. 1792 monitor is not available.
1793 1793
1794 balance-alb: This mode is everything that balance-tlb is, and more. 1794 balance-alb: This mode is everything that balance-tlb is, and more.
1795 It has all of the features (and restrictions) of balance-tlb, 1795 It has all of the features (and restrictions) of balance-tlb,
1796 and will also balance incoming traffic from local network 1796 and will also balance incoming traffic from local network
1797 peers (as described in the Bonding Module Options section, 1797 peers (as described in the Bonding Module Options section,
1798 above). 1798 above).
1799 1799
1800 The only additional down side to this mode is that the network 1800 The only additional down side to this mode is that the network
1801 device driver must support changing the hardware address while 1801 device driver must support changing the hardware address while
1802 the device is open. 1802 the device is open.
1803 1803
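The balance-rr entry above refers to the net.ipv4.tcp_reordering
sysctl parameter. As a minimal sketch, assuming a Linux system that
exposes sysctl values under /proc/sys, the current value can be
inspected before any tuning is attempted. The helper names here are
hypothetical, and the sketch only reads the value; changing it requires
root privileges.

```python
from pathlib import PurePosixPath
from typing import Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> PurePosixPath:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return PurePosixPath(root).joinpath(*name.split("."))


def read_sysctl_int(name: str, root: str = "/proc/sys") -> Optional[int]:
    """Return the integer value of a sysctl, or None if unavailable."""
    try:
        with open(str(sysctl_path(name, root))) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # Not Linux, no such parameter, or not a plain integer.
        return None


if __name__ == "__main__":
    value = read_sysctl_int("net.ipv4.tcp_reordering")
    if value is None:
        print("net.ipv4.tcp_reordering is not readable on this system")
    else:
        print("net.ipv4.tcp_reordering =", value)
```
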
1804 12.1.2 MT Link Monitoring for Single Switch Topology 1804 12.1.2 MT Link Monitoring for Single Switch Topology
1805 ---------------------------------------------------- 1805 ----------------------------------------------------
1806 1806
1807 The choice of link monitoring may largely depend upon which 1807 The choice of link monitoring may largely depend upon which
1808 mode you choose to use. The more advanced load balancing modes do not 1808 mode you choose to use. The more advanced load balancing modes do not
1809 support the use of the ARP monitor, and are thus restricted to using 1809 support the use of the ARP monitor, and are thus restricted to using
1810 the MII monitor (which does not provide as high a level of end to end 1810 the MII monitor (which does not provide as high a level of end to end
1811 assurance as the ARP monitor). 1811 assurance as the ARP monitor).
1812 1812
1813 12.2 Maximum Throughput in a Multiple Switch Topology 1813 12.2 Maximum Throughput in a Multiple Switch Topology
1814 ----------------------------------------------------- 1814 -----------------------------------------------------
1815 1815
1816 Multiple switches may be utilized to optimize for throughput 1816 Multiple switches may be utilized to optimize for throughput
1817 when they are configured in parallel as part of an isolated network 1817 when they are configured in parallel as part of an isolated network
1818 between two or more systems, for example: 1818 between two or more systems, for example:
1819 1819
1820 +-----------+ 1820 +-----------+
1821 | Host A | 1821 | Host A |
1822 +-+---+---+-+ 1822 +-+---+---+-+
1823 | | | 1823 | | |
1824 +--------+ | +---------+ 1824 +--------+ | +---------+
1825 | | | 1825 | | |
1826 +------+---+ +-----+----+ +-----+----+ 1826 +------+---+ +-----+----+ +-----+----+
1827 | Switch A | | Switch B | | Switch C | 1827 | Switch A | | Switch B | | Switch C |
1828 +------+---+ +-----+----+ +-----+----+ 1828 +------+---+ +-----+----+ +-----+----+
1829 | | | 1829 | | |
1830 +--------+ | +---------+ 1830 +--------+ | +---------+
1831 | | | 1831 | | |
1832 +-+---+---+-+ 1832 +-+---+---+-+
1833 | Host B | 1833 | Host B |
1834 +-----------+ 1834 +-----------+
1835 1835
1836 In this configuration, the switches are isolated from one 1836 In this configuration, the switches are isolated from one
1837 another. One reason to employ a topology such as this is cost: for an 1837 another. One reason to employ a topology such as this is cost: for an
1838 isolated network with many hosts (a cluster configured for high 1838 isolated network with many hosts (a cluster configured for high
1839 performance, for example), using multiple smaller switches can be more 1839 performance, for example), using multiple smaller switches can be more
1840 cost effective than a single larger switch. For example, on a network 1840 cost effective than a single larger switch. For example, on a network
1841 with 24 hosts, three 24 port switches can be significantly less 1841 with 24 hosts, three 24 port switches can be significantly less
1842 expensive than a single 72 port switch. 1842 expensive than a single 72 port switch.
1843 1843
1844 If access beyond the network is required, an individual host 1844 If access beyond the network is required, an individual host
1845 can be equipped with an additional network device connected to an 1845 can be equipped with an additional network device connected to an
1846 external network; this host then additionally acts as a gateway. 1846 external network; this host then additionally acts as a gateway.
1847 1847
1848 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology 1848 12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
1849 ------------------------------------------------------------- 1849 -------------------------------------------------------------
1850 1850
1851 In actual practice, the bonding mode typically employed in 1851 In actual practice, the bonding mode typically employed in
1852 configurations of this type is balance-rr. Historically, in this 1852 configurations of this type is balance-rr. Historically, in this
1853 network configuration, the usual caveats about out of order packet 1853 network configuration, the usual caveats about out of order packet
1854 delivery are mitigated by the use of network adapters that do not do 1854 delivery are mitigated by the use of network adapters that do not do
1855 any kind of packet coalescing (via the use of NAPI, or because the 1855 any kind of packet coalescing (via the use of NAPI, or because the
1856 device itself does not generate interrupts until some number of 1856 device itself does not generate interrupts until some number of
1857 packets has arrived). When employed in this fashion, the balance-rr 1857 packets has arrived). When employed in this fashion, the balance-rr
1858 mode allows individual connections between two hosts to effectively 1858 mode allows individual connections between two hosts to effectively
1859 utilize greater than one interface's bandwidth. 1859 utilize greater than one interface's bandwidth.
1860 1860
1861 12.2.2 MT Link Monitoring for Multiple Switch Topology 1861 12.2.2 MT Link Monitoring for Multiple Switch Topology
1862 ------------------------------------------------------ 1862 ------------------------------------------------------
1863 1863
1864 Again, in actual practice, the MII monitor is most often used 1864 Again, in actual practice, the MII monitor is most often used
1865 in this configuration, as performance is given preference over 1865 in this configuration, as performance is given preference over
1866 availability. The ARP monitor will function in this topology, but its 1866 availability. The ARP monitor will function in this topology, but its
1867 advantages over the MII monitor are mitigated by the volume of probes 1867 advantages over the MII monitor are mitigated by the volume of probes
1868 needed as the number of systems involved grows (remember that each 1868 needed as the number of systems involved grows (remember that each
1869 host in the network is configured with bonding). 1869 host in the network is configured with bonding).
1870 1870
1871 13. Switch Behavior Issues 1871 13. Switch Behavior Issues
1872 ========================== 1872 ==========================
1873 1873
1874 13.1 Link Establishment and Failover Delays 1874 13.1 Link Establishment and Failover Delays
1875 ------------------------------------------- 1875 -------------------------------------------
1876 1876
1877 Some switches exhibit undesirable behavior with regard to the 1877 Some switches exhibit undesirable behavior with regard to the
1878 timing of link up and down reporting by the switch. 1878 timing of link up and down reporting by the switch.
1879 1879
1880 First, when a link comes up, some switches may indicate that 1880 First, when a link comes up, some switches may indicate that
1881 the link is up (carrier available), but not pass traffic over the 1881 the link is up (carrier available), but not pass traffic over the
1882 interface for some period of time. This delay is typically due to 1882 interface for some period of time. This delay is typically due to
1883 some type of autonegotiation or routing protocol, but may also occur 1883 some type of autonegotiation or routing protocol, but may also occur
1884 during switch initialization (e.g., during recovery after a switch 1884 during switch initialization (e.g., during recovery after a switch
1885 failure). If you find this to be a problem, specify an appropriate 1885 failure). If you find this to be a problem, specify an appropriate
1886 value to the updelay bonding module option to delay the use of the 1886 value to the updelay bonding module option to delay the use of the
1887 relevant interface(s). 1887 relevant interface(s).
1888 1888
1889 Second, some switches may "bounce" the link state one or more 1889 Second, some switches may "bounce" the link state one or more
1890 times while a link is changing state. This occurs most commonly while 1890 times while a link is changing state. This occurs most commonly while
1891 the switch is initializing. Again, an appropriate updelay value may 1891 the switch is initializing. Again, an appropriate updelay value may
1892 help. 1892 help.
1893 1893
1894 Note that when a bonding interface has no active links, the 1894 Note that when a bonding interface has no active links, the
1895 driver will immediately reuse the first link that goes up, even if the 1895 driver will immediately reuse the first link that goes up, even if the
1896 updelay parameter has been specified (the updelay is ignored in this 1896 updelay parameter has been specified (the updelay is ignored in this
1897 case). If there are slave interfaces waiting for the updelay timeout 1897 case). If there are slave interfaces waiting for the updelay timeout
1898 to expire, the interface that first went into that state will be 1898 to expire, the interface that first went into that state will be
1899 immediately reused. This reduces down time of the network if the 1899 immediately reused. This reduces down time of the network if the
1900 value of updelay has been overestimated, and since this occurs only in 1900 value of updelay has been overestimated, and since this occurs only in
1901 cases with no connectivity, there is no additional penalty for 1901 cases with no connectivity, there is no additional penalty for
1902 ignoring the updelay. 1902 ignoring the updelay.
1903 1903
1904 In addition to the concerns about switch timings, if your 1904 In addition to the concerns about switch timings, if your
1905 switches take a long time to go into backup mode, it may be desirable 1905 switches take a long time to go into backup mode, it may be desirable
1906 to not activate a backup interface immediately after a link goes down. 1906 to not activate a backup interface immediately after a link goes down.
1907 Failover may be delayed via the downdelay bonding module option. 1907 Failover may be delayed via the downdelay bonding module option.
1908 1908
1909 13.2 Duplicated Incoming Packets 1909 13.2 Duplicated Incoming Packets
1910 -------------------------------- 1910 --------------------------------
1911 1911
1912 It is not uncommon to observe a short burst of duplicated 1912 It is not uncommon to observe a short burst of duplicated
1913 traffic when the bonding device is first used, or after it has been 1913 traffic when the bonding device is first used, or after it has been
1914 idle for some period of time. This is most easily observed by issuing 1914 idle for some period of time. This is most easily observed by issuing
1915 a "ping" to some other host on the network, and noticing that the 1915 a "ping" to some other host on the network, and noticing that the
1916 output from ping flags duplicates (typically one per slave). 1916 output from ping flags duplicates (typically one per slave).
1917 1917
1918 For example, on a bond in active-backup mode with five slaves 1918 For example, on a bond in active-backup mode with five slaves
1919 all connected to one switch, the output may appear as follows: 1919 all connected to one switch, the output may appear as follows:
1920 1920
1921 # ping -n 10.0.4.2 1921 # ping -n 10.0.4.2
1922 PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. 1922 PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
1923 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms 1923 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
1924 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 1924 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
1925 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 1925 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
1926 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 1926 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
1927 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) 1927 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
1928 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms 1928 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
1929 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms 1929 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
1930 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms 1930 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
1931 1931
1932 This is not due to an error in the bonding driver; rather, it 1932 This is not due to an error in the bonding driver; rather, it
1933 is a side effect of how many switches update their MAC forwarding 1933 is a side effect of how many switches update their MAC forwarding
1934 tables. Initially, the switch does not associate the MAC address in 1934 tables. Initially, the switch does not associate the MAC address in
1935 the packet with a particular switch port, and so it may send the 1935 the packet with a particular switch port, and so it may send the
1936 traffic to all ports until its MAC forwarding table is updated. Since 1936 traffic to all ports until its MAC forwarding table is updated. Since
1937 the interfaces attached to the bond may occupy multiple ports on a 1937 the interfaces attached to the bond may occupy multiple ports on a
1938 single switch, when the switch (temporarily) floods the traffic to all 1938 single switch, when the switch (temporarily) floods the traffic to all
1939 ports, the bond device receives multiple copies of the same packet 1939 ports, the bond device receives multiple copies of the same packet
1940 (one per slave device). 1940 (one per slave device).
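
To check whether a given switch shows this behavior, the duplicates
that ping flags can simply be counted. The following is a minimal
sketch; count_dups is a hypothetical helper name, and it assumes that
ping marks duplicate replies with "(DUP!)" as in the output above.

```shell
# Count replies that ping flagged as duplicates.
# Reads ping output on standard input and prints the number of lines
# containing "(DUP!)". grep -c exits non-zero when the count is 0, so
# "|| true" keeps the function's exit status clean in that case.
count_dups() {
    grep -c '(DUP!)' || true
}

# Example: ping -c 4 -n 10.0.4.2 | count_dups
```

A count close to the number of slaves minus one for the first reply,
dropping to zero afterwards, matches the flooding described above.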
1941 1941
1942 The duplicated packet behavior is switch dependent; some 1942 The duplicated packet behavior is switch dependent; some
1943 switches exhibit this, and some do not. On switches that display this 1943 switches exhibit this, and some do not. On switches that display this
1944 behavior, it can be induced by clearing the MAC forwarding table (on 1944 behavior, it can be induced by clearing the MAC forwarding table (on
1945 most Cisco switches, the privileged command "clear mac address-table 1945 most Cisco switches, the privileged command "clear mac address-table
1946 dynamic" will accomplish this). 1946 dynamic" will accomplish this).
1947 1947
1948 14. Hardware Specific Considerations 1948 14. Hardware Specific Considerations
1949 ==================================== 1949 ====================================
1950 1950
1951 This section contains additional information for configuring 1951 This section contains additional information for configuring
1952 bonding on specific hardware platforms, or for interfacing bonding 1952 bonding on specific hardware platforms, or for interfacing bonding
1953 with particular switches or other devices. 1953 with particular switches or other devices.
1954 1954
1955 14.1 IBM BladeCenter 1955 14.1 IBM BladeCenter
1956 -------------------- 1956 --------------------
1957 1957
1958 This applies to the JS20 and similar systems. 1958 This applies to the JS20 and similar systems.
1959 1959
1960 On the JS20 blades, the bonding driver supports only 1960 On the JS20 blades, the bonding driver supports only
1961 balance-rr, active-backup, balance-tlb and balance-alb modes. This is 1961 balance-rr, active-backup, balance-tlb and balance-alb modes. This is
1962 largely due to the network topology inside the BladeCenter, detailed 1962 largely due to the network topology inside the BladeCenter, detailed
1963 below. 1963 below.
1964 1964
1965 JS20 network adapter information 1965 JS20 network adapter information
1966 -------------------------------- 1966 --------------------------------
1967 1967
1968 All JS20s come with two Broadcom Gigabit Ethernet ports 1968 All JS20s come with two Broadcom Gigabit Ethernet ports
1969 integrated on the planar (that's "motherboard" in IBM-speak). In the 1969 integrated on the planar (that's "motherboard" in IBM-speak). In the
1970 BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to 1970 BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
1971 I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. 1971 I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
1972 An add-on Broadcom daughter card can be installed on a JS20 to provide 1972 An add-on Broadcom daughter card can be installed on a JS20 to provide
1973 two more Gigabit Ethernet ports. These ports, eth2 and eth3, are 1973 two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
1974 wired to I/O Modules 3 and 4, respectively. 1974 wired to I/O Modules 3 and 4, respectively.
1975 1975
1976 Each I/O Module may contain either a switch or a passthrough 1976 Each I/O Module may contain either a switch or a passthrough
1977 module (which allows ports to be directly connected to an external 1977 module (which allows ports to be directly connected to an external
1978 switch). Some bonding modes require a specific BladeCenter internal 1978 switch). Some bonding modes require a specific BladeCenter internal
1979 network topology in order to function; these are detailed below. 1979 network topology in order to function; these are detailed below.
1980 1980
1981 Additional BladeCenter-specific networking information can be 1981 Additional BladeCenter-specific networking information can be
1982 found in two IBM Redbooks (www.ibm.com/redbooks): 1982 found in two IBM Redbooks (www.ibm.com/redbooks):
1983 1983
1984 "IBM eServer BladeCenter Networking Options" 1984 "IBM eServer BladeCenter Networking Options"
1985 "IBM eServer BladeCenter Layer 2-7 Network Switching" 1985 "IBM eServer BladeCenter Layer 2-7 Network Switching"
1986 1986
1987 BladeCenter networking configuration 1987 BladeCenter networking configuration
1988 ------------------------------------ 1988 ------------------------------------
1989 1989
1990 Because a BladeCenter can be configured in a very large number 1990 Because a BladeCenter can be configured in a very large number
1991 of ways, this discussion will be confined to describing basic 1991 of ways, this discussion will be confined to describing basic
1992 configurations. 1992 configurations.
1993 1993
1994 Normally, Ethernet Switch Modules (ESMs) are used in I/O 1994 Normally, Ethernet Switch Modules (ESMs) are used in I/O
1995 modules 1 and 2. In this configuration, the eth0 and eth1 ports of a 1995 modules 1 and 2. In this configuration, the eth0 and eth1 ports of a
1996 JS20 will be connected to different internal switches (in the 1996 JS20 will be connected to different internal switches (in the
1997 respective I/O modules). 1997 respective I/O modules).
1998 1998
1999 A passthrough module (OPM or CPM, optical or copper, 1999 A passthrough module (OPM or CPM, optical or copper,
2000 passthrough module) connects the I/O module directly to an external 2000 passthrough module) connects the I/O module directly to an external
2001 switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 2001 switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
2002 interfaces of a JS20 can be redirected to the outside world and 2002 interfaces of a JS20 can be redirected to the outside world and
2003 connected to a common external switch. 2003 connected to a common external switch.
2004 2004
2005 Depending upon the mix of ESMs and PMs, the network will 2005 Depending upon the mix of ESMs and PMs, the network will
2006 appear to bonding as either a single switch topology (all PMs) or as a 2006 appear to bonding as either a single switch topology (all PMs) or as a
2007 multiple switch topology (one or more ESMs, zero or more PMs). It is 2007 multiple switch topology (one or more ESMs, zero or more PMs). It is
2008 also possible to connect ESMs together, resulting in a configuration 2008 also possible to connect ESMs together, resulting in a configuration
2009 much like the example in "High Availability in a Multiple Switch 2009 much like the example in "High Availability in a Multiple Switch
2010 Topology," above. 2010 Topology," above.
2011 2011
2012 Requirements for specific modes 2012 Requirements for specific modes
2013 ------------------------------- 2013 -------------------------------
2014 2014
2015 The balance-rr mode requires the use of passthrough modules 2015 The balance-rr mode requires the use of passthrough modules
2016 for devices in the bond, all connected to a common external switch. 2016 for devices in the bond, all connected to a common external switch.
2017 That switch must be configured for "etherchannel" or "trunking" on the 2017 That switch must be configured for "etherchannel" or "trunking" on the
2018 appropriate ports, as is usual for balance-rr. 2018 appropriate ports, as is usual for balance-rr.
2019 2019
2020 The balance-alb and balance-tlb modes will function with 2020 The balance-alb and balance-tlb modes will function with
2021 either switch modules or passthrough modules (or a mix). The only 2021 either switch modules or passthrough modules (or a mix). The only
2022 specific requirement for these modes is that all network interfaces 2022 specific requirement for these modes is that all network interfaces
2023 must be able to reach all destinations for traffic sent over the 2023 must be able to reach all destinations for traffic sent over the
2024 bonding device (i.e., the network must converge at some point outside 2024 bonding device (i.e., the network must converge at some point outside
2025 the BladeCenter). 2025 the BladeCenter).
2026 2026
2027 The active-backup mode has no additional requirements. 2027 The active-backup mode has no additional requirements.
2028 2028
2029 Link monitoring issues 2029 Link monitoring issues
2030 ---------------------- 2030 ----------------------
2031 2031
2032 When an Ethernet Switch Module is in place, only the ARP 2032 When an Ethernet Switch Module is in place, only the ARP
2033 monitor will reliably detect link loss to an external switch. This is 2033 monitor will reliably detect link loss to an external switch. This is
2034 nothing unusual, but examination of the BladeCenter cabinet would 2034 nothing unusual, but examination of the BladeCenter cabinet would
2035 suggest that the "external" network ports are the ethernet ports for 2035 suggest that the "external" network ports are the ethernet ports for
2036 the system, when in fact there is a switch between these "external" 2036 the system, when in fact there is a switch between these "external"
2037 ports and the devices on the JS20 system itself. The MII monitor is 2037 ports and the devices on the JS20 system itself. The MII monitor is
2038 only able to detect link failures between the ESM and the JS20 system. 2038 only able to detect link failures between the ESM and the JS20 system.
2039 2039
2040 When a passthrough module is in place, the MII monitor does 2040 When a passthrough module is in place, the MII monitor does
2041 detect failures to the "external" port, which is then directly 2041 detect failures to the "external" port, which is then directly
2042 connected to the JS20 system. 2042 connected to the JS20 system.
2043 2043
2044 Other concerns 2044 Other concerns
2045 -------------- 2045 --------------
2046 2046
2047 The Serial Over LAN (SoL) link is established over the primary 2047 The Serial Over LAN (SoL) link is established over the primary
2048 ethernet (eth0) only; therefore, any loss of link to eth0 will result 2048 ethernet (eth0) only; therefore, any loss of link to eth0 will result
2049 in losing your SoL connection. It will not fail over with other 2049 in losing your SoL connection. It will not fail over with other
2050 network traffic, as the SoL system is beyond the control of the 2050 network traffic, as the SoL system is beyond the control of the
2051 bonding driver. 2051 bonding driver.
2052 2052
2053 It may be desirable to disable spanning tree on the switch 2053 It may be desirable to disable spanning tree on the switch
2054 (either the internal Ethernet Switch Module, or an external switch) to 2054 (either the internal Ethernet Switch Module, or an external switch) to
2055 avoid fail-over delay issues when using bonding. 2055 avoid fail-over delay issues when using bonding.
2056 2056
2057 2057
2058 15. Frequently Asked Questions 2058 15. Frequently Asked Questions
2059 ============================== 2059 ==============================
2060 2060
2061 1. Is it SMP safe? 2061 1. Is it SMP safe?
2062 2062
2063 Yes. The old 2.0.xx channel bonding patch was not SMP safe. 2063 Yes. The old 2.0.xx channel bonding patch was not SMP safe.
2064 The new driver was designed to be SMP safe from the start. 2064 The new driver was designed to be SMP safe from the start.
2065 2065
2066 2. What type of cards will work with it? 2066 2. What type of cards will work with it?
2067 2067
2068 Any Ethernet type cards (you can even mix cards - an Intel 2068 Any Ethernet type cards (you can even mix cards - an Intel
2069 EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, 2069 EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
2070 devices need not be of the same speed. 2070 devices need not be of the same speed.
2071 2071
2072 3. How many bonding devices can I have? 2072 3. How many bonding devices can I have?
2073 2073
2074 There is no limit. 2074 There is no limit.
2075 2075
2076 4. How many slaves can a bonding device have? 2076 4. How many slaves can a bonding device have?
2077 2077
2078 This is limited only by the number of network interfaces Linux 2078 This is limited only by the number of network interfaces Linux
2079 supports and/or the number of network cards you can place in your 2079 supports and/or the number of network cards you can place in your
2080 system. 2080 system.
2081 2081
2082 5. What happens when a slave link dies? 2082 5. What happens when a slave link dies?
2083 2083
2084 If link monitoring is enabled, then the failing device will be 2084 If link monitoring is enabled, then the failing device will be
2085 disabled. The active-backup mode will fail over to a backup link, and 2085 disabled. The active-backup mode will fail over to a backup link, and
2086 other modes will ignore the failed link. The link will continue to be 2086 other modes will ignore the failed link. The link will continue to be
2087 monitored, and should it recover, it will rejoin the bond (in whatever 2087 monitored, and should it recover, it will rejoin the bond (in whatever
2088 manner is appropriate for the mode). See the sections on High 2088 manner is appropriate for the mode). See the sections on High
2089 Availability and the documentation for each mode for additional 2089 Availability and the documentation for each mode for additional
2090 information. 2090 information.
2091 2091
2092 Link monitoring can be enabled via either the miimon or 2092 Link monitoring can be enabled via either the miimon or
2093 arp_interval parameters (described in the module parameters section, 2093 arp_interval parameters (described in the module parameters section,
2094 above). In general, miimon monitors the carrier state as sensed by 2094 above). In general, miimon monitors the carrier state as sensed by
2095 the underlying network device, and the arp monitor (arp_interval) 2095 the underlying network device, and the arp monitor (arp_interval)
2096 monitors connectivity to another host on the local network. 2096 monitors connectivity to another host on the local network.
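
When monitoring is enabled, the per-slave link state that the driver
has recorded can be read back from the bonding status file. The
sketch below is illustrative only: slave_status is a hypothetical
helper name, and it assumes the status file (normally
/proc/net/bonding/bond0) lists each slave as a "Slave Interface:" line
followed by a "MII Status:" line.

```shell
# Print "<slave> <MII status>" for each slave listed in a bonding status file.
# $1: path to the status file (default: /proc/net/bonding/bond0).
# The bond-wide "MII Status:" line appears before any slave and is skipped.
slave_status() {
    awk -F': ' '
        /^Slave Interface:/ { slave = $2 }
        /^MII Status:/ && slave != "" { print slave, $2; slave = "" }
    ' "${1:-/proc/net/bonding/bond0}"
}

# Example: slave_status            (reads /proc/net/bonding/bond0)
#          slave_status /tmp/copy  (reads a saved copy)
```

Taking the path as an argument allows the parser to be tried against a
saved copy of the file on a machine without a bond.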
2097 2097
2098 If no link monitoring is configured, the bonding driver will 2098 If no link monitoring is configured, the bonding driver will
2099 be unable to detect link failures, and will assume that all links are 2099 be unable to detect link failures, and will assume that all links are
2100 always available. This will likely result in lost packets, and a 2100 always available. This will likely result in lost packets, and a
2101 resulting degradation of performance. The precise performance loss 2101 resulting degradation of performance. The precise performance loss
2102 depends upon the bonding mode and network configuration. 2102 depends upon the bonding mode and network configuration.
2103 2103
2104 6. Can bonding be used for High Availability? 2104 6. Can bonding be used for High Availability?
2105 2105
2106 Yes. See the section on High Availability for details. 2106 Yes. See the section on High Availability for details.
2107 2107
2108 7. Which switches/systems does it work with? 2108 7. Which switches/systems does it work with?
2109 2109
2110 The full answer to this depends upon the desired mode. 2110 The full answer to this depends upon the desired mode.
2111 2111
2112 In the basic balance modes (balance-rr and balance-xor), it 2112 In the basic balance modes (balance-rr and balance-xor), it
2113 works with any system that supports etherchannel (also called 2113 works with any system that supports etherchannel (also called
2114 trunking). Most managed switches currently available have such 2114 trunking). Most managed switches currently available have such
2115 support, and many unmanaged switches as well. 2115 support, and many unmanaged switches as well.
2116 2116
2117 The advanced balance modes (balance-tlb and balance-alb) do 2117 The advanced balance modes (balance-tlb and balance-alb) do
2118 not have special switch requirements, but do need device drivers that 2118 not have special switch requirements, but do need device drivers that
2119 support specific features (described in the appropriate section under 2119 support specific features (described in the appropriate section under
2120 module parameters, above). 2120 module parameters, above).
2121 2121
2122 In 802.3ad mode, it works with systems that support IEEE 2122 In 802.3ad mode, it works with systems that support IEEE
2123 802.3ad Dynamic Link Aggregation. Most managed and many unmanaged 2123 802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
2124 switches currently available support 802.3ad. 2124 switches currently available support 802.3ad.
2125 2125
2126 The active-backup mode should work with any Layer-II switch. 2126 The active-backup mode should work with any Layer-II switch.
2127 2127
2128 8. Where does a bonding device get its MAC address from? 2128 8. Where does a bonding device get its MAC address from?
2129 2129
2130 If not explicitly configured (with ifconfig or ip link), the 2130 If not explicitly configured (with ifconfig or ip link), the
2131 MAC address of the bonding device is taken from its first slave 2131 MAC address of the bonding device is taken from its first slave
2132 device. This MAC address is then passed to all following slaves and 2132 device. This MAC address is then passed to all following slaves and
2133 remains persistent (even if the first slave is removed) until the 2133 remains persistent (even if the first slave is removed) until the
2134 bonding device is brought down or reconfigured. 2134 bonding device is brought down or reconfigured.
2135 2135
2136 If you wish to change the MAC address, you can set it with 2136 If you wish to change the MAC address, you can set it with
2137 ifconfig or ip link: 2137 ifconfig or ip link:
2138 2138
2139 # ifconfig bond0 hw ether 00:11:22:33:44:55 2139 # ifconfig bond0 hw ether 00:11:22:33:44:55
2140 2140
2141 # ip link set bond0 address 66:77:88:99:aa:bb 2141 # ip link set bond0 address 66:77:88:99:aa:bb
2142 2142
2143 The MAC address can also be changed by bringing down/up the 2143 The MAC address can also be changed by bringing down/up the
2144 device and then changing its slaves (or their order): 2144 device and then changing its slaves (or their order):
2145 2145
2146 # ifconfig bond0 down ; modprobe -r bonding 2146 # ifconfig bond0 down ; modprobe -r bonding
2147 # ifconfig bond0 .... up 2147 # ifconfig bond0 .... up
2148 # ifenslave bond0 eth... 2148 # ifenslave bond0 eth...
2149 2149
2150 This method will automatically take the address from the next 2150 This method will automatically take the address from the next
2151 slave that is added. 2151 slave that is added.
2152 2152
2153 To restore your slaves' MAC addresses, you need to detach them 2153 To restore your slaves' MAC addresses, you need to detach them
2154 from the bond (`ifenslave -d bond0 eth0'). The bonding driver will 2154 from the bond (`ifenslave -d bond0 eth0'). The bonding driver will
2155 then restore the MAC addresses that the slaves had before they were 2155 then restore the MAC addresses that the slaves had before they were
2156 enslaved. 2156 enslaved.
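
One way to observe the behavior described above is to compare the
addresses that sysfs reports before enslaving, while enslaved, and
after detaching. The helpers below are a sketch under assumptions:
mac_of and same_mac_as_bond are hypothetical names, the SYSFS variable
exists only so the functions can be pointed at a test directory, and
the comparison applies only to modes in which every slave carries the
bond's address.

```shell
# Print the current MAC address of a network device as reported by sysfs.
# $1: device name; $2: sysfs mount point (default: /sys).
mac_of() {
    cat "${2:-/sys}/class/net/$1/address"
}

# Succeed if every listed device currently carries the bond's MAC address.
# $1: bond name; remaining arguments: slave names.
# SYSFS may be set to override the /sys mount point (assumption for testing).
same_mac_as_bond() {
    bond=$1; shift
    bond_mac=$(mac_of "$bond" "${SYSFS:-/sys}") || return 1
    for dev in "$@"; do
        [ "$(mac_of "$dev" "${SYSFS:-/sys}")" = "$bond_mac" ] || return 1
    done
}

# Example: same_mac_as_bond bond0 eth0 eth1 && echo "slaves share bond0's MAC"
```

After `ifenslave -d bond0 eth0', mac_of eth0 would be expected to show
the address the device had before it was enslaved.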
2157 2157
2158 16. Resources and Links 2158 16. Resources and Links
2159 ======================= 2159 =======================
2160 2160
2161 The latest version of the bonding driver can be found in the latest 2161 The latest version of the bonding driver can be found in the latest
2162 version of the Linux kernel, found on http://kernel.org 2162 version of the Linux kernel, found on http://kernel.org
2163 2163
2164 The latest version of this document can be found in either the latest 2164 The latest version of this document can be found in either the latest
2165 kernel source (named Documentation/networking/bonding.txt), or on the 2165 kernel source (named Documentation/networking/bonding.txt), or on the
2166 bonding sourceforge site: 2166 bonding sourceforge site:
2167 2167
2168 http://www.sourceforge.net/projects/bonding 2168 http://www.sourceforge.net/projects/bonding
2169 2169
2170 Discussions regarding the bonding driver take place primarily on the 2170 Discussions regarding the bonding driver take place primarily on the
2171 bonding-devel mailing list, hosted at sourceforge.net. If you have 2171 bonding-devel mailing list, hosted at sourceforge.net. If you have
2172 questions or problems, post them to the list. The list address is: 2172 questions or problems, post them to the list. The list address is:
2173 2173
2174 bonding-devel@lists.sourceforge.net 2174 bonding-devel@lists.sourceforge.net
2175 2175
2176 The administrative interface (to subscribe or unsubscribe) can 2176 The administrative interface (to subscribe or unsubscribe) can
2177 be found at: 2177 be found at:
2178 2178
2179 https://lists.sourceforge.net/lists/listinfo/bonding-devel 2179 https://lists.sourceforge.net/lists/listinfo/bonding-devel
2180 2180
2181 Donald Becker's Ethernet Drivers and diag programs may be found at: 2181 Donald Becker's Ethernet Drivers and diag programs may be found at:
2182 - http://www.scyld.com/network/ 2182 - http://www.scyld.com/network/
2183 2183
2184 You will also find a lot of information regarding Ethernet, NWay, MII, 2184 You will also find a lot of information regarding Ethernet, NWay, MII,
2185 etc. at www.scyld.com. 2185 etc. at www.scyld.com.
2186 2186
2187 -- END -- 2187 -- END --
2188 2188
Documentation/networking/cs89x0.txt
1 1
2 NOTE 2 NOTE
3 ---- 3 ----
4 4
5 This document was contributed by Cirrus Logic for kernel 2.2.5. This version 5 This document was contributed by Cirrus Logic for kernel 2.2.5. This version
6 has been updated for 2.3.48 by Andrew Morton <andrewm@uow.edu.au> 6 has been updated for 2.3.48 by Andrew Morton <andrewm@uow.edu.au>
7 7
8 Cirrus make a copy of this driver available at their website, as 8 Cirrus make a copy of this driver available at their website, as
9 described below. In general, you should use the driver version which 9 described below. In general, you should use the driver version which
10 comes with your Linux distribution. 10 comes with your Linux distribution.
11 11
12 12
13 13
14 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS 14 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
15 Linux Network Interface Driver ver. 2.00 <kernel 2.3.48> 15 Linux Network Interface Driver ver. 2.00 <kernel 2.3.48>
16 =============================================================================== 16 ===============================================================================
17 17
18 18
19 TABLE OF CONTENTS 19 TABLE OF CONTENTS
20 20
21 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS 21 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
22 1.1 Product Overview 22 1.1 Product Overview
23 1.2 Driver Description 23 1.2 Driver Description
24 1.2.1 Driver Name 24 1.2.1 Driver Name
25 1.2.2 Files in the Driver Archive 25 1.2.2 Files in the Driver Archive
26 1.3 System Requirements 26 1.3 System Requirements
27 1.4 Licensing Information 27 1.4 Licensing Information
28 28
29 2.0 ADAPTER INSTALLATION and CONFIGURATION 29 2.0 ADAPTER INSTALLATION and CONFIGURATION
30 2.1 CS8900-based Adapter Configuration 30 2.1 CS8900-based Adapter Configuration
31 2.2 CS8920-based Adapter Configuration 31 2.2 CS8920-based Adapter Configuration
32 32
33 3.0 LOADING THE DRIVER AS A MODULE 33 3.0 LOADING THE DRIVER AS A MODULE
34 34
35 4.0 COMPILING THE DRIVER 35 4.0 COMPILING THE DRIVER
36 4.1 Compiling the Driver as a Loadable Module 36 4.1 Compiling the Driver as a Loadable Module
37 4.2 Compiling the driver to support memory mode 37 4.2 Compiling the driver to support memory mode
38 4.3 Compiling the driver to support Rx DMA 38 4.3 Compiling the driver to support Rx DMA
39 4.4 Compiling the Driver into the Kernel 39 4.4 Compiling the Driver into the Kernel
40 40
41 5.0 TESTING AND TROUBLESHOOTING 41 5.0 TESTING AND TROUBLESHOOTING
42 5.1 Known Defects and Limitations 42 5.1 Known Defects and Limitations
43 5.2 Testing the Adapter 43 5.2 Testing the Adapter
44 5.2.1 Diagnostic Self-Test 44 5.2.1 Diagnostic Self-Test
45 5.2.2 Diagnostic Network Test 45 5.2.2 Diagnostic Network Test
46 5.3 Using the Adapter's LEDs 46 5.3 Using the Adapter's LEDs
47 5.4 Resolving I/O Conflicts 47 5.4 Resolving I/O Conflicts
48 48
49 6.0 TECHNICAL SUPPORT 49 6.0 TECHNICAL SUPPORT
50 6.1 Contacting Cirrus Logic's Technical Support 50 6.1 Contacting Cirrus Logic's Technical Support
51 6.2 Information Required Before Contacting Technical Support 51 6.2 Information Required Before Contacting Technical Support
52 6.3 Obtaining the Latest Driver Version 52 6.3 Obtaining the Latest Driver Version
53 6.4 Current maintainer 53 6.4 Current maintainer
54 6.5 Kernel boot parameters 54 6.5 Kernel boot parameters
55 55
56 56
57 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS 57 1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
58 =============================================================================== 58 ===============================================================================
59 59
60 60
61 1.1 PRODUCT OVERVIEW 61 1.1 PRODUCT OVERVIEW
62 62
63 The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow 63 The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow
64 IEEE 802.3 standards and support half or full-duplex operation in ISA bus 64 IEEE 802.3 standards and support half or full-duplex operation in ISA bus
65 computers on 10 Mbps Ethernet networks. The adapters are designed for operation 65 computers on 10 Mbps Ethernet networks. The adapters are designed for operation
66 in 16-bit ISA or EISA bus expansion slots and are available in 66 in 16-bit ISA or EISA bus expansion slots and are available in
67 10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5 67 10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5
68 or fiber networks). 68 or fiber networks).
69 69
70 CS8920-based adapters are similar to the CS8900-based adapter with additional 70 CS8920-based adapters are similar to the CS8900-based adapter with additional
71 features for Plug and Play (PnP) support and Wakeup Frame recognition. As 71 features for Plug and Play (PnP) support and Wakeup Frame recognition. As
72 such, the configuration procedures differ somewhat between the two types of 72 such, the configuration procedures differ somewhat between the two types of
73 adapters. Refer to the "Adapter Configuration" section for details on 73 adapters. Refer to the "Adapter Configuration" section for details on
74 configuring both types of adapters. 74 configuring both types of adapters.
75 75
76 76
77 1.2 DRIVER DESCRIPTION 77 1.2 DRIVER DESCRIPTION
78 78
79 The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux 79 The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux
80 v2.3.48 or greater kernel. It can be compiled directly into the kernel 80 v2.3.48 or greater kernel. It can be compiled directly into the kernel
81 or loaded at run-time as a device driver module. 81 or loaded at run-time as a device driver module.
82 82
83 1.2.1 Driver Name: cs89x0 83 1.2.1 Driver Name: cs89x0
84 84
85 1.2.2 Files in the Driver Archive: 85 1.2.2 Files in the Driver Archive:
86 86
87 The files in the driver at Cirrus' website include: 87 The files in the driver at Cirrus' website include:
88 88
89 readme.txt - this file 89 readme.txt - this file
90 build - batch file to compile cs89x0.c. 90 build - batch file to compile cs89x0.c.
91 cs89x0.c - driver C code 91 cs89x0.c - driver C code
92 cs89x0.h - driver header file 92 cs89x0.h - driver header file
93 cs89x0.o - pre-compiled module (for v2.2.5 kernel) 93 cs89x0.o - pre-compiled module (for v2.2.5 kernel)
94 config/Config.in - sample file to include cs89x0 driver in the kernel. 94 config/Config.in - sample file to include cs89x0 driver in the kernel.
95 config/Makefile - sample file to include cs89x0 driver in the kernel. 95 config/Makefile - sample file to include cs89x0 driver in the kernel.
96 config/Space.c - sample file to include cs89x0 driver in the kernel. 96 config/Space.c - sample file to include cs89x0 driver in the kernel.
97 97
98 98
99 99
100 1.3 SYSTEM REQUIREMENTS 100 1.3 SYSTEM REQUIREMENTS
101 101
102 The following hardware is required: 102 The following hardware is required:
103 103
104 * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter 104 * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter
105 105
106 * IBM or IBM-compatible PC with: 106 * IBM or IBM-compatible PC with:
107 * An 80386 or higher processor 107 * An 80386 or higher processor
108 * 16 bytes of contiguous IO space available between 210h - 370h 108 * 16 bytes of contiguous IO space available between 210h - 370h
109 * One available IRQ (5, 10, 11, or 12 for the CS8900; 3-7 or 9-15 for the CS8920). 109 * One available IRQ (5, 10, 11, or 12 for the CS8900; 3-7 or 9-15 for the CS8920).
110 110
111 * Appropriate cable (and connector for AUI, 10BASE-2) for your network 111 * Appropriate cable (and connector for AUI, 10BASE-2) for your network
112 topology. 112 topology.
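The IRQ constraints in the hardware list above can be written down as a small lookup table. This is a minimal Python sketch for illustration only; it is not part of the driver or the Setup Utility, and the names VALID_IRQS and irq_is_valid are hypothetical. Only the IRQ numbers themselves are taken from the list.

```python
# Hypothetical helper: the IRQ sets are copied from the requirements list,
# everything else (names, structure) is an assumption for illustration.
VALID_IRQS = {
    "CS8900": {5, 10, 11, 12},
    "CS8920": set(range(3, 8)) | set(range(9, 16)),  # 3-7 and 9-15
}


def irq_is_valid(chip, irq):
    """Return True if `irq` is listed as usable for the given chip."""
    return irq in VALID_IRQS[chip]


if __name__ == "__main__":
    for chip, irq in (("CS8900", 10), ("CS8900", 3), ("CS8920", 8)):
        print(chip, irq, irq_is_valid(chip, irq))
```

A check like this only says whether the adapter can use an IRQ at all; it cannot tell whether another device in the PC already occupies it.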
113 113
114 The following software is required: 114 The following software is required:
115 115
116 * LINUX kernel version 2.3.48 or higher 116 * LINUX kernel version 2.3.48 or higher
117 117
118 * CS8900/20 Setup Utility (DOS-based) 118 * CS8900/20 Setup Utility (DOS-based)
119 119
120 * LINUX kernel sources for your kernel (if compiling into kernel) 120 * LINUX kernel sources for your kernel (if compiling into kernel)
121 121
122 * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel 122 * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel
123 or a module) 123 or a module)
124 124
125 125
126 126
127 1.4 LICENSING INFORMATION 127 1.4 LICENSING INFORMATION
128 128
129 This program is free software; you can redistribute it and/or modify it under 129 This program is free software; you can redistribute it and/or modify it under
130 the terms of the GNU General Public License as published by the Free Software 130 the terms of the GNU General Public License as published by the Free Software
131 Foundation, version 1. 131 Foundation, version 1.
132 132
133 This program is distributed in the hope that it will be useful, but WITHOUT 133 This program is distributed in the hope that it will be useful, but WITHOUT
134 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 134 ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
135 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for 135 FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
136 more details. 136 more details.
137 137
138 For a full copy of the GNU General Public License, write to the Free Software 138 For a full copy of the GNU General Public License, write to the Free Software
139 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 139 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
140 140
141 141
142 142
143 2.0 ADAPTER INSTALLATION and CONFIGURATION 143 2.0 ADAPTER INSTALLATION and CONFIGURATION
144 =============================================================================== 144 ===============================================================================
145 145
146 Both the CS8900 and CS8920-based adapters can be configured using parameters 146 Both the CS8900 and CS8920-based adapters can be configured using parameters
147 stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup 147 stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup
148 Utility if you want to change the adapter's configuration in EEPROM. 148 Utility if you want to change the adapter's configuration in EEPROM.
149 149
150 When loading the driver as a module, you can specify many of the adapter's 150 When loading the driver as a module, you can specify many of the adapter's
151 configuration parameters on the command-line to override the EEPROM's settings 151 configuration parameters on the command-line to override the EEPROM's settings
152 or for interface configuration when an EEPROM is not used. (CS8920-based 152 or for interface configuration when an EEPROM is not used. (CS8920-based
153 adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE. 153 adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE.
154 154
155 Since the CS8900/20 Setup Utility is a DOS-based application, you must install 155 Since the CS8900/20 Setup Utility is a DOS-based application, you must install
156 and configure the adapter in a DOS-based system using the CS8900/20 Setup 156 and configure the adapter in a DOS-based system using the CS8900/20 Setup
157 Utility before installation in the target LINUX system. (Not required if 157 Utility before installation in the target LINUX system. (Not required if
158 installing a CS8900-based adapter and the default configuration is acceptable.) 158 installing a CS8900-based adapter and the default configuration is acceptable.)
159 159
160 160
161 2.1 CS8900-BASED ADAPTER CONFIGURATION 161 2.1 CS8900-BASED ADAPTER CONFIGURATION
162 162
163 CS8900-based adapters shipped from Cirrus Logic have been configured 163 CS8900-based adapters shipped from Cirrus Logic have been configured
164 with the following "default" settings: 164 with the following "default" settings:
165 165
166 Operation Mode: Memory Mode 166 Operation Mode: Memory Mode
167 IRQ: 10 167 IRQ: 10
168 Base I/O Address: 300 168 Base I/O Address: 300
169 Memory Base Address: D0000 169 Memory Base Address: D0000
170 Optimization: DOS Client 170 Optimization: DOS Client
171 Transmission Mode: Half-duplex 171 Transmission Mode: Half-duplex
172 BootProm: None 172 BootProm: None
173 Media Type: Autodetect (3-media cards) or 173 Media Type: Autodetect (3-media cards) or
174 10BASE-T (10BASE-T only adapter) 174 10BASE-T (10BASE-T only adapter)
175 175
176 You should only change the default configuration settings if a conflict with 176 You should only change the default configuration settings if a conflict with
177 another adapter exists. To change the adapter's configuration, run the 177 another adapter exists. To change the adapter's configuration, run the
178 CS8900/20 Setup Utility. 178 CS8900/20 Setup Utility.
179 179
180 180
181 2.2 CS8920-BASED ADAPTER CONFIGURATION 181 2.2 CS8920-BASED ADAPTER CONFIGURATION
182 182
183 CS8920-based adapters are shipped from Cirrus Logic configured as Plug 183 CS8920-based adapters are shipped from Cirrus Logic configured as Plug
184 and Play (PnP) enabled. However, since the cs89x0 driver does NOT 184 and Play (PnP) enabled. However, since the cs89x0 driver does NOT
185 support PnP, you must install the CS8920 adapter in a DOS-based PC and 185 support PnP, you must install the CS8920 adapter in a DOS-based PC and
186 run the CS8900/20 Setup Utility to disable PnP and configure the 186 run the CS8900/20 Setup Utility to disable PnP and configure the
187 adapter before installation in the target Linux system. Failure to do 187 adapter before installation in the target Linux system. Failure to do
188 this will leave the adapter inactive and the driver will be unable to 188 this will leave the adapter inactive and the driver will be unable to
189 communicate with the adapter. 189 communicate with the adapter.
190 190
191 191
192 **************************************************************** 192 ****************************************************************
193 * CS8920-BASED ADAPTERS: * 193 * CS8920-BASED ADAPTERS: *
194 * * 194 * *
195 * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. * 195 * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. *
196 * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST * 196 * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST *
197 * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND * 197 * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND *
198 * TO ACTIVATE THE ADAPTER. * 198 * TO ACTIVATE THE ADAPTER. *
199 **************************************************************** 199 ****************************************************************
200 200
201 201
202 202
203 203
204 3.0 LOADING THE DRIVER AS A MODULE 204 3.0 LOADING THE DRIVER AS A MODULE
205 =============================================================================== 205 ===============================================================================
206 206
207 If the driver is compiled as a loadable module, you can load the driver module 207 If the driver is compiled as a loadable module, you can load the driver module
208 with the 'modprobe' command. Many of the adapter's configuration parameters can 208 with the 'modprobe' command. Many of the adapter's configuration parameters can
209 be specified as command-line arguments to the load command. This facility 209 be specified as command-line arguments to the load command. This facility
210 provides a means to override the EEPROM's settings or for interface 210 provides a means to override the EEPROM's settings or for interface
211 configuration when an EEPROM is not used. 211 configuration when an EEPROM is not used.
212 212
213 Example: 213 Example:
214 214
215 insmod cs89x0.o io=0x200 irq=0xA media=aui 215 insmod cs89x0.o io=0x200 irq=0xA media=aui
216 216
217 This example loads the module and configures the adapter to use an IO port base 217 This example loads the module and configures the adapter to use an IO port base
218 address of 200h, interrupt 10, and use the AUI media connection. The following 218 address of 200h, interrupt 10, and use the AUI media connection. The following
219 configuration options are available on the command line: 219 configuration options are available on the command line:
220 220
221 * io=### - specify IO address (200h-360h) 221 * io=### - specify IO address (200h-360h)
222 * irq=## - specify interrupt level 222 * irq=## - specify interrupt level
223 * use_dma=1 - Enable DMA 223 * use_dma=1 - Enable DMA
224 * dma=# - specify dma channel (Driver is compiled to support 224 * dma=# - specify dma channel (Driver is compiled to support
225 Rx DMA only) 225 Rx DMA only)
226 * dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16. 226 * dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16.
227 * media=rj45 - specify media type 227 * media=rj45 - specify media type
228 or media=bnc 228 or media=bnc
229 or media=aui 229 or media=aui
230 or media=auto 230 or media=auto
231 * duplex=full - specify forced half/full/autonegotiate duplex 231 * duplex=full - specify forced half/full/autonegotiate duplex
232 or duplex=half 232 or duplex=half
233 or duplex=auto 233 or duplex=auto
234 * debug=# - debug level (only available if the driver was compiled 234 * debug=# - debug level (only available if the driver was compiled
235 for debugging) 235 for debugging)
236 236
237 NOTES: 237 NOTES:
238 238
239 a) If an EEPROM is present, any specified command-line parameter 239 a) If an EEPROM is present, any specified command-line parameter
240 will override the corresponding configuration value stored in 240 will override the corresponding configuration value stored in
241 EEPROM. 241 EEPROM.
242 242
243 b) The "io" parameter must be specified on the command-line. 243 b) The "io" parameter must be specified on the command-line.
244 244
245 c) The driver's hardware probe routine is designed to avoid 245 c) The driver's hardware probe routine is designed to avoid
246 writing to I/O space until it knows that there is a cs89x0 246 writing to I/O space until it knows that there is a cs89x0
247 card at the written addresses. This could cause problems 247 card at the written addresses. This could cause problems
248 with device probing. To avoid this behaviour, add one 248 with device probing. To avoid this behaviour, add one
249 to the `io=' module parameter. This doesn't actually change 249 to the `io=' module parameter. This doesn't actually change
250 the I/O address, but it is a flag to tell the driver 250 the I/O address, but it is a flag to tell the driver
251 topartially initialise the hardware before trying to 251 to partially initialise the hardware before trying to
252 identify the card. This could be dangerous if you are 252 identify the card. This could be dangerous if you are
253 not sure that there is a cs89x0 card at the provided address. 253 not sure that there is a cs89x0 card at the provided address.
254 254
255 For example, to scan for an adapter located at IO base 0x300, 255 For example, to scan for an adapter located at IO base 0x300,
256 specify an IO address of 0x301. 256 specify an IO address of 0x301.
257 257
258 d) The "duplex=auto" parameter is only supported for the CS8920. 258 d) The "duplex=auto" parameter is only supported for the CS8920.
259 259
260 e) The minimum command-line configuration required if an EEPROM is 260 e) The minimum command-line configuration required if an EEPROM is
261 not present is: 261 not present is:
262 262
263 io 263 io
264 irq 264 irq
265 media type (no autodetect) 265 media type (no autodetect)
266 266
267 f) The following additional parameters are CS89XX defaults (values 267 f) The following additional parameters are CS89XX defaults (values
268 used with no EEPROM or command-line argument). 268 used with no EEPROM or command-line argument).
269 269
270 * DMA Burst = enabled 270 * DMA Burst = enabled
271 * IOCHRDY Enabled = enabled 271 * IOCHRDY Enabled = enabled
272 * UseSA = enabled 272 * UseSA = enabled
273 * CS8900 defaults to half-duplex if not specified on command-line 273 * CS8900 defaults to half-duplex if not specified on command-line
274 * CS8920 defaults to autoneg if not specified on command-line 274 * CS8920 defaults to autoneg if not specified on command-line
275 * Use reset defaults for other config parameters 275 * Use reset defaults for other config parameters
276 * dma_mode = 0 276 * dma_mode = 0
277 277
278 g) You can use ifconfig to set the adapter's Ethernet address. 278 g) You can use ifconfig to set the adapter's Ethernet address.
279 279
280 h) Many Linux distributions use the 'modprobe' command to load 280 h) Many Linux distributions use the 'modprobe' command to load
281 modules. This program uses the '/etc/conf.modules' file to 281 modules. This program uses the '/etc/conf.modules' file to
282 determine configuration information which is passed to a driver 282 determine configuration information which is passed to a driver
283 module when it is loaded. All the configuration options which are 283 module when it is loaded. All the configuration options which are
284 described above may be placed within /etc/conf.modules. 284 described above may be placed within /etc/conf.modules.
285 285
286 For example: 286 For example:
287 287
288 > cat /etc/conf.modules 288 > cat /etc/conf.modules
289 ... 289 ...
290 alias eth0 cs89x0 290 alias eth0 cs89x0
291 options cs89x0 io=0x0200 dma=5 use_dma=1 291 options cs89x0 io=0x0200 dma=5 use_dma=1
292 ... 292 ...
293 293
294 In this example we are telling the module system that the 294 In this example we are telling the module system that the
295 ethernet driver for this machine should use the cs89x0 driver. We 295 ethernet driver for this machine should use the cs89x0 driver. We
296 are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma' 296 are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma'
297 arguments to the driver when it is loaded. 297 arguments to the driver when it is loaded.
298 298
299 i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or 299 i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or
300 7. You will probably find that other DMA channels will not work. 300 7. You will probably find that other DMA channels will not work.
301 301
302 j) The cs89x0 supports DMA for receiving only. DMA mode is 302 j) The cs89x0 supports DMA for receiving only. DMA mode is
303 significantly more efficient. Flooding a 400 MHz Celeron machine 303 significantly more efficient. Flooding a 400 MHz Celeron machine
304 with large ping packets consumes 82% of its CPU capacity in non-DMA 304 with large ping packets consumes 82% of its CPU capacity in non-DMA
305 mode. With DMA this is reduced to 45%. 305 mode. With DMA this is reduced to 45%.
306 306
307 k) If your Linux kernel was compiled with inbuilt plug-and-play 307 k) If your Linux kernel was compiled with inbuilt plug-and-play
308 support you will be able to find information about the cs89x0 card 308 support you will be able to find information about the cs89x0 card
309 with the command 309 with the command
310 310
311 cat /proc/isapnp 311 cat /proc/isapnp
312 312
313 l) If during DMA operation you find erratic behavior or network data 313 l) If during DMA operation you find erratic behavior or network data
314 corruption you should use your PC's BIOS to slow the EISA bus clock. 314 corruption you should use your PC's BIOS to slow the EISA bus clock.
315 315
316 m) If the cs89x0 driver is compiled directly into the kernel 316 m) If the cs89x0 driver is compiled directly into the kernel
317 (non-modular) then its I/O address is automatically determined by 317 (non-modular) then its I/O address is automatically determined by
318 ISA bus probing. The IRQ number, media options, etc are determined 318 ISA bus probing. The IRQ number, media options, etc are determined
319 from the card's EEPROM. 319 from the card's EEPROM.
320 320
321 n) If the cs89x0 driver is compiled directly into the kernel, DMA 321 n) If the cs89x0 driver is compiled directly into the kernel, DMA
322 mode may be selected by providing the kernel with a boot option 322 mode may be selected by providing the kernel with a boot option
323 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7). 323 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7).
324 324
325 Kernel boot options may be provided on the LILO command line: 325 Kernel boot options may be provided on the LILO command line:
326 326
327 LILO boot: linux cs89x0_dma=5 327 LILO boot: linux cs89x0_dma=5
328 328
329 or they may be placed in /etc/lilo.conf: 329 or they may be placed in /etc/lilo.conf:
330 330
331 image=/boot/bzImage-2.3.48 331 image=/boot/bzImage-2.3.48
332 append="cs89x0_dma=5" 332 append="cs89x0_dma=5"
333 label=linux 333 label=linux
334 root=/dev/hda5 334 root=/dev/hda5
335 read-only 335 read-only
336 336
337 The DMA Rx buffer size is hardwired to 16 kbytes in this mode. 337 The DMA Rx buffer size is hardwired to 16 kbytes in this mode.
338 (64k mode is not available). 338 (64k mode is not available).
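Notes b) through i) amount to a handful of consistency rules on the module options. The sketch below restates them in Python as a reading aid, under stated assumptions: check_options and split_io are hypothetical names, the option dictionary is an invented representation, and the real driver may decode the odd-address flag from note c) differently. Nothing here talks to the hardware.

```python
# Hypothetical checker for cs89x0 module options; rules paraphrase notes b)-i).
DMA_CHANNELS = {5, 6, 7}            # note i): recommended ISA DMA channels
DMA_SIZES_KB = {16, 64}             # dmasize option
MEDIA = {"rj45", "bnc", "aui", "auto"}
DUPLEX = {"full", "half", "auto"}


def split_io(io):
    """Note c): an odd io value is a flag, not a different base address."""
    return io & ~1, bool(io & 1)


def check_options(opts, chip="CS8900", has_eeprom=True):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if "io" not in opts:                                  # note b)
        problems.append("io must be specified")
    media = opts.get("media")
    if media is not None and media not in MEDIA:
        problems.append("unknown media type")
    duplex = opts.get("duplex")
    if duplex is not None and duplex not in DUPLEX:
        problems.append("unknown duplex setting")
    if duplex == "auto" and chip != "CS8920":             # note d)
        problems.append("duplex=auto is only supported on the CS8920")
    if not has_eeprom:                                    # note e)
        for name in ("irq", "media"):
            if name not in opts:
                problems.append(name + " is required without an EEPROM")
        if media == "auto":
            problems.append("media autodetect is not available without an EEPROM")
    if opts.get("use_dma") == 1:
        if opts.get("dma") not in DMA_CHANNELS:           # note i)
            problems.append("dma should be 5, 6 or 7")
        if opts.get("dmasize", 16) not in DMA_SIZES_KB:
            problems.append("dmasize must be 16 or 64")
    return problems


if __name__ == "__main__":
    print(split_io(0x301))
    print(check_options({"io": 0x200, "irq": 10, "media": "aui"}, has_eeprom=False))
```

The options from the earlier insmod example (io=0x200 irq=0xA media=aui) pass these checks even with no EEPROM present, because io, irq and an explicit media type are all given.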
339 339
340 340
341 4.0 COMPILING THE DRIVER 341 4.0 COMPILING THE DRIVER
342 =============================================================================== 342 ===============================================================================
343 343
344 The cs89x0 driver can be compiled directly into the kernel or compiled into 344 The cs89x0 driver can be compiled directly into the kernel or compiled into
345 a loadable device driver module. 345 a loadable device driver module.
346 346
347 347
348 4.1 COMPILING THE DRIVER AS A LOADABLE MODULE 348 4.1 COMPILING THE DRIVER AS A LOADABLE MODULE
349 349
350 To compile the driver into a loadable module, use the following command 350 To compile the driver into a loadable module, use the following command
351 (single command line, without quotes): 351 (single command line, without quotes):
352 352
353 "gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall 353 "gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall
354 -Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS 354 -Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS
355 -c cs89x0.c" 355 -c cs89x0.c"
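Because the command above is wrapped over three lines in this file, it is easy to paste it with stray line breaks or with the surrounding quotes. As a small illustration (an assumption about workflow, not something the driver archive provides), the following Python sketch assembles the same flags into one line; the variable names are hypothetical and the flags are copied unchanged from the text.

```python
import shlex

# Flags copied verbatim from the compile command in section 4.1.
GCC_ARGS = [
    "gcc",
    "-D__KERNEL__",
    "-I/usr/src/linux/include",
    "-I/usr/src/linux/net/inet",
    "-Wall",
    "-Wstrict-prototypes",
    "-O2",
    "-fomit-frame-pointer",
    "-DMODULE",
    "-DCONFIG_MODVERSIONS",
    "-c",
    "cs89x0.c",
]

# shlex.join (Python 3.8+) produces a single shell-safe command line.
command = shlex.join(GCC_ARGS)

if __name__ == "__main__":
    print(command)
```

The printed line can be pasted into a shell as is; the include paths assume the kernel sources live under /usr/src/linux, as in the rest of this section.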
356 356
357 4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE 357 4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE
358 358
359 Support for memory mode was not carried over into the 2.3 series kernels. 359 Support for memory mode was not carried over into the 2.3 series kernels.
360 360
361 4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA 361 4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA
362 362
363 The compile-time optionality for DMA was removed in the 2.3 kernel 363 The compile-time optionality for DMA was removed in the 2.3 kernel
364 series. DMA support is now unconditionally part of the driver. It is 364 series. DMA support is now unconditionally part of the driver. It is
365 enabled by the 'use_dma=1' module option. 365 enabled by the 'use_dma=1' module option.
366 366
367 4.4 COMPILING THE DRIVER INTO THE KERNEL 367 4.4 COMPILING THE DRIVER INTO THE KERNEL
368 368
369 If your Linux distribution already has support for the cs89x0 driver 369 If your Linux distribution already has support for the cs89x0 driver
370 then simply copy the source files to the /usr/src/linux/drivers/net 370 then simply copy the source files to the /usr/src/linux/drivers/net
371 directory to replace the original ones and run the make utility to 371 directory to replace the original ones and run the make utility to
372 rebuild the kernel. See Step 3 for rebuilding the kernel. 372 rebuild the kernel. See Step 3 for rebuilding the kernel.
373 373
374 If your Linux does not include the cs89x0 driver, you need to edit three 374 If your Linux does not include the cs89x0 driver, you need to edit three
375 configuration files, copy the source file to the /usr/src/linux/drivers/net 375 configuration files, copy the source file to the /usr/src/linux/drivers/net
376 directory, and then run the make utility to rebuild the kernel. 376 directory, and then run the make utility to rebuild the kernel.
377 377
378 1. Edit the following configuration files by adding the statements as 378 1. Edit the following configuration files by adding the statements as
379 indicated. (When possible, try to locate the added text to the section of the 379 indicated. (When possible, try to locate the added text to the section of the
380 file containing similar statements). 380 file containing similar statements).
381 381
382 382
383 a.) In /usr/src/linux/drivers/net/Config.in, add: 383 a.) In /usr/src/linux/drivers/net/Config.in, add:
384 384
385 tristate 'CS89x0 support' CONFIG_CS89x0 385 tristate 'CS89x0 support' CONFIG_CS89x0
386 386
387 Example: 387 Example:
388 388
389 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then 389 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
390 tristate 'ICL EtherTeam 16i/32 support' CONFIG_ETH16I 390 tristate 'ICL EtherTeam 16i/32 support' CONFIG_ETH16I
391 fi 391 fi
392 392
393 tristate 'CS89x0 support' CONFIG_CS89x0 393 tristate 'CS89x0 support' CONFIG_CS89x0
394 394
395 tristate 'NE2000/NE1000 support' CONFIG_NE2000 395 tristate 'NE2000/NE1000 support' CONFIG_NE2000
396 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then 396 if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then
397 tristate 'NI5210 support' CONFIG_NI52 397 tristate 'NI5210 support' CONFIG_NI52
398 398
399 399
400 b.) In /usr/src/linux/drivers/net/Makefile, add the following lines: 400 b.) In /usr/src/linux/drivers/net/Makefile, add the following lines:
401 401
402 ifeq ($(CONFIG_CS89x0),y) 402 ifeq ($(CONFIG_CS89x0),y)
403 L_OBJS += cs89x0.o 403 L_OBJS += cs89x0.o
404 else 404 else
405 ifeq ($(CONFIG_CS89x0),m) 405 ifeq ($(CONFIG_CS89x0),m)
406 M_OBJS += cs89x0.o 406 M_OBJS += cs89x0.o
407 endif 407 endif
408 endif 408 endif
409 409
410 410
411 c.) In /usr/src/linux/drivers/net/Space.c, add the line: 411 c.) In /usr/src/linux/drivers/net/Space.c, add the line:
412 412
413 extern int cs89x0_probe(struct device *dev); 413 extern int cs89x0_probe(struct device *dev);
414 414
415 415
416 Example: 416 Example:
417 417
418 extern int ultra_probe(struct device *dev); 418 extern int ultra_probe(struct device *dev);
419 extern int wd_probe(struct device *dev); 419 extern int wd_probe(struct device *dev);
420 extern int el2_probe(struct device *dev); 420 extern int el2_probe(struct device *dev);
421 421
422 extern int cs89x0_probe(struct device *dev); 422 extern int cs89x0_probe(struct device *dev);
423 423
424 extern int ne_probe(struct device *dev); 424 extern int ne_probe(struct device *dev);
425 extern int hp_probe(struct device *dev); 425 extern int hp_probe(struct device *dev);
426 extern int hp_plus_probe(struct device *dev); 426 extern int hp_plus_probe(struct device *dev);
427 427
428 428
429 Also add: 429 Also add:
430 430
431 #ifdef CONFIG_CS89x0 431 #ifdef CONFIG_CS89x0
432 { cs89x0_probe,0 }, 432 { cs89x0_probe,0 },
433 #endif 433 #endif
434 434
435 435
436 2.) Copy the driver source files (cs89x0.c and cs89x0.h) 436 2.) Copy the driver source files (cs89x0.c and cs89x0.h)
437 into the /usr/src/linux/drivers/net directory. 437 into the /usr/src/linux/drivers/net directory.
438 438
439 439
440 3.) Go to /usr/src/linux directory and run 'make config' followed by 'make' 440 3.) Go to /usr/src/linux directory and run 'make config' followed by 'make'
441 (or make bzImage) to rebuild the kernel. 441 (or make bzImage) to rebuild the kernel.
442 442
443 4.) Use the DOS 'setup' utility to disable plug and play on the NIC. 443 4.) Use the DOS 'setup' utility to disable plug and play on the NIC.
444 444
445 445
446 5.0 TESTING AND TROUBLESHOOTING 446 5.0 TESTING AND TROUBLESHOOTING
447 =============================================================================== 447 ===============================================================================
448 448
449 5.1 KNOWN DEFECTS and LIMITATIONS 449 5.1 KNOWN DEFECTS and LIMITATIONS
450 450
451 Refer to the RELEASE.TXT file distributed as part of this archive for a list of 451 Refer to the RELEASE.TXT file distributed as part of this archive for a list of
452 known defects, driver limitations, and workarounds. 452 known defects, driver limitations, and workarounds.
453 453
454 454
455 5.2 TESTING THE ADAPTER 455 5.2 TESTING THE ADAPTER
456 456
457 Once the adapter has been installed and configured, the diagnostic option of 457 Once the adapter has been installed and configured, the diagnostic option of
458 the CS8900/20 Setup Utility can be used to test the functionality of the 458 the CS8900/20 Setup Utility can be used to test the functionality of the
459 adapter and its network connection. Use the diagnostics 'Self Test' option to 459 adapter and its network connection. Use the diagnostics 'Self Test' option to
460 test the functionality of the adapter with the hardware configuration you have 460 test the functionality of the adapter with the hardware configuration you have
461 assigned. You can use the diagnostics 'Network Test' to test the ability of the 461 assigned. You can use the diagnostics 'Network Test' to test the ability of the
462 adapter to communicate across the Ethernet with another PC equipped with a 462 adapter to communicate across the Ethernet with another PC equipped with a
463 CS8900/20-based adapter card (it must also be running the CS8900/20 Setup 463 CS8900/20-based adapter card (it must also be running the CS8900/20 Setup
464 Utility). 464 Utility).
465 465
466 NOTE: The Setup Utility's diagnostics are designed to run in a 466 NOTE: The Setup Utility's diagnostics are designed to run in a
467 DOS-only operating system environment. DO NOT run the diagnostics 467 DOS-only operating system environment. DO NOT run the diagnostics
468 from a DOS or command prompt session under Windows 95, Windows NT, 468 from a DOS or command prompt session under Windows 95, Windows NT,
469 OS/2, or other operating system. 469 OS/2, or other operating system.
470 470
471 To run the diagnostics tests on the CS8900/20 adapter: 471 To run the diagnostics tests on the CS8900/20 adapter:
472 472
473 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility. 473 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility.
474 474
475 2.) The adapter's current configuration is displayed. Hit the ENTER key to 475 2.) The adapter's current configuration is displayed. Hit the ENTER key to
476 get to the main menu. 476 get to the main menu.
477 477
478 3.) Select 'Diagnostics' (ALT-G) from the main menu. 478 3.) Select 'Diagnostics' (ALT-G) from the main menu.
479 * Select 'Self-Test' to test the adapter's basic functionality. 479 * Select 'Self-Test' to test the adapter's basic functionality.
480 * Select 'Network Test' to test the network connection and cabling. 480 * Select 'Network Test' to test the network connection and cabling.
481 481
482 482
483 5.2.1 DIAGNOSTIC SELF-TEST 483 5.2.1 DIAGNOSTIC SELF-TEST
484 484
485 The diagnostic self-test checks the adapter's basic functionality as well as 485 The diagnostic self-test checks the adapter's basic functionality as well as
486 its ability to communicate across the ISA bus based on the system resources 486 its ability to communicate across the ISA bus based on the system resources
487 assigned during hardware configuration. The following tests are performed: 487 assigned during hardware configuration. The following tests are performed:
488 488
489 * IO Register Read/Write Test 489 * IO Register Read/Write Test
490 The IO Register Read/Write test ensures that the CS8900/20 can be 490 The IO Register Read/Write test ensures that the CS8900/20 can be
491 accessed in IO mode, and that the IO base address is correct. 491 accessed in IO mode, and that the IO base address is correct.
492 492
493 * Shared Memory Test 493 * Shared Memory Test
494 The Shared Memory test ensures the CS8900/20 can be accessed in memory 494 The Shared Memory test ensures the CS8900/20 can be accessed in memory
495 mode and that the range of memory addresses assigned does not conflict 495 mode and that the range of memory addresses assigned does not conflict
496 with other devices in the system. 496 with other devices in the system.
497 497
498 * Interrupt Test 498 * Interrupt Test
499 The Interrupt test ensures there are no conflicts with the assigned IRQ 499 The Interrupt test ensures there are no conflicts with the assigned IRQ
500 signal. 500 signal.
501 501
502 * EEPROM Test 502 * EEPROM Test
503 The EEPROM test ensures the EEPROM can be read. 503 The EEPROM test ensures the EEPROM can be read.
504 504
505 * Chip RAM Test 505 * Chip RAM Test
506 The Chip RAM test ensures the 4K of memory internal to the CS8900/20 is 506 The Chip RAM test ensures the 4K of memory internal to the CS8900/20 is
507 working properly. 507 working properly.
508 508
509 * Internal Loop-back Test 509 * Internal Loop-back Test
510 The Internal Loop-back test ensures the adapter's transmitter and 510 The Internal Loop-back test ensures the adapter's transmitter and
511 receiver are operating properly. If this test fails, make sure the 511 receiver are operating properly. If this test fails, make sure the
512 adapter's cable is connected to the network (check for LED activity for 512 adapter's cable is connected to the network (check for LED activity for
513 example). 513 example).
514 514
515 * Boot PROM Test 515 * Boot PROM Test
516 The Boot PROM test ensures the Boot PROM is present, and can be read. 516 The Boot PROM test ensures the Boot PROM is present, and can be read.
517 Failure indicates the Boot PROM was not successfully read due to a 517 Failure indicates the Boot PROM was not successfully read due to a
518 hardware problem or due to a conflict on the Boot PROM address 518 hardware problem or due to a conflict on the Boot PROM address
519 assignment. (Test only applies if the adapter is configured to use the 519 assignment. (Test only applies if the adapter is configured to use the
520 Boot PROM option.) 520 Boot PROM option.)
521 521
522 Failure of a test item indicates a possible system resource conflict with 522 Failure of a test item indicates a possible system resource conflict with
523 another device on the ISA bus. In this case, you should use the Manual Setup 523 another device on the ISA bus. In this case, you should use the Manual Setup
524 option to reconfigure the adapter by selecting a different value for the system 524 option to reconfigure the adapter by selecting a different value for the system
525 resource that failed. 525 resource that failed.
526 526
527 527
528 5.2.2 DIAGNOSTIC NETWORK TEST 528 5.2.2 DIAGNOSTIC NETWORK TEST
529 529
530 The Diagnostic Network Test verifies a working network connection by 530 The Diagnostic Network Test verifies a working network connection by
531 transferring data between two CS8900/20 adapters installed in different PCs 531 transferring data between two CS8900/20 adapters installed in different PCs
532 on the same network. (Note: the diagnostic network test should not be run 532 on the same network. (Note: the diagnostic network test should not be run
533 between two nodes across a router.) 533 between two nodes across a router.)
534 534
535 This test requires that each of the two PCs have a CS8900/20-based adapter 535 This test requires that each of the two PCs have a CS8900/20-based adapter
536 installed and have the CS8900/20 Setup Utility running. The first PC is 536 installed and have the CS8900/20 Setup Utility running. The first PC is
537 configured as a Responder and the other PC is configured as an Initiator. 537 configured as a Responder and the other PC is configured as an Initiator.
538 Once the Initiator is started, it sends data frames to the Responder which 538 Once the Initiator is started, it sends data frames to the Responder which
539 returns the frames to the Initiator. 539 returns the frames to the Initiator.
540 540
541 The total number of frames received and transmitted are displayed on the 541 The total number of frames received and transmitted are displayed on the
542 Initiator's display, along with a count of the number of frames received and 542 Initiator's display, along with a count of the number of frames received and
543 transmitted OK or in error. The test can be terminated anytime by the user at 543 transmitted OK or in error. The test can be terminated anytime by the user at
544 either PC. 544 either PC.
545 545
546 To set up the Diagnostic Network Test: 546 To set up the Diagnostic Network Test:
547 547
548 1.) Select a PC with a CS8900/20-based adapter and a known working network 548 1.) Select a PC with a CS8900/20-based adapter and a known working network
549 connection to act as the Responder. Run the CS8900/20 Setup Utility 549 connection to act as the Responder. Run the CS8900/20 Setup Utility
550 and select 'Diagnostics -> Network Test -> Responder' from the main 550 and select 'Diagnostics -> Network Test -> Responder' from the main
551 menu. Hit ENTER to start the Responder. 551 menu. Hit ENTER to start the Responder.
552 552
553 2.) Return to the PC with the CS8900/20-based adapter you want to test and 553 2.) Return to the PC with the CS8900/20-based adapter you want to test and
554 start the CS8900/20 Setup Utility. 554 start the CS8900/20 Setup Utility.
555 555
556 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'. 556 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'.
557 Hit ENTER to start the test. 557 Hit ENTER to start the test.
558 558
559 You may stop the test on the Initiator at any time while allowing the Responder 559 You may stop the test on the Initiator at any time while allowing the Responder
560 to continue running. In this manner, you can move to additional PCs and test 560 to continue running. In this manner, you can move to additional PCs and test
561 them by starting the Initiator on another PC without having to stop/start the 561 them by starting the Initiator on another PC without having to stop/start the
562 Responder. 562 Responder.
563 563
564 564
565 565
566 5.3 USING THE ADAPTER'S LEDs 566 5.3 USING THE ADAPTER'S LEDs
567 567
568 The 2 and 3-media adapters have two LEDs visible on the back end of the board 568 The 2 and 3-media adapters have two LEDs visible on the back end of the board
569 located near the 10Base-T connector. 569 located near the 10Base-T connector.
570 570
571 Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T 571 Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T
572 connection. (Only applies to 10Base-T. The green LED has no significance for 572 connection. (Only applies to 10Base-T. The green LED has no significance for
573 a 10Base-2 or AUI connection.) 573 a 10Base-2 or AUI connection.)
574 574
575 TX/RX LED: The yellow LED lights briefly each time the adapter transmits or 575 TX/RX LED: The yellow LED lights briefly each time the adapter transmits or
576 receives data. (The yellow LED will appear to "flicker" on a typical network.) 576 receives data. (The yellow LED will appear to "flicker" on a typical network.)
577 577
578 578
579 5.4 RESOLVING I/O CONFLICTS 579 5.4 RESOLVING I/O CONFLICTS
580 580
581 An IO conflict occurs when two or more adapters use the same ISA resource (IO 581 An IO conflict occurs when two or more adapters use the same ISA resource (IO
582 address, memory address or IRQ). You can usually detect an IO conflict in one 582 address, memory address or IRQ). You can usually detect an IO conflict in one
583 of four ways after installing and/or configuring the CS8900/20-based adapter: 583 of four ways after installing and/or configuring the CS8900/20-based adapter:
584 584
585 1.) The system does not boot properly (or at all). 585 1.) The system does not boot properly (or at all).
586 586
587 2.) The driver cannot communicate with the adapter, reporting an "Adapter 587 2.) The driver cannot communicate with the adapter, reporting an "Adapter
588 not found" error message. 588 not found" error message.
589 589
590 3.) You cannot connect to the network or the driver will not load. 590 3.) You cannot connect to the network or the driver will not load.
591 591
592 4.) If you have configured the adapter to run in memory mode but the driver 592 4.) If you have configured the adapter to run in memory mode but the driver
593 reports it is using IO mode when loading, this is an indication of a 593 reports it is using IO mode when loading, this is an indication of a
594 memory address conflict. 594 memory address conflict.
595 595
596 If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a 596 If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a
597 diagnostic self-test. Normally, the ISA resource in conflict will fail the 597 diagnostic self-test. Normally, the ISA resource in conflict will fail the
598 self-test. If so, reconfigure the adapter selecting another choice for the 598 self-test. If so, reconfigure the adapter selecting another choice for the
599 resource in conflict. Run the diagnostics again to check for further IO 599 resource in conflict. Run the diagnostics again to check for further IO
600 conflicts. 600 conflicts.
601 601
602 In some cases, such as when the PC will not boot, it may be necessary to remove 602 In some cases, such as when the PC will not boot, it may be necessary to remove
603 the adapter and reconfigure it by installing it in another PC to run the 603 the adapter and reconfigure it by installing it in another PC to run the
604 CS8900/20 Setup Utility. Once reinstalled in the target system, run the 604 CS8900/20 Setup Utility. Once reinstalled in the target system, run the
605 diagnostics self-test to ensure the new configuration is free of conflicts 605 diagnostics self-test to ensure the new configuration is free of conflicts
606 before loading the driver again. 606 before loading the driver again.
607 607
608 When manually configuring the adapter, keep in mind the typical ISA system 608 When manually configuring the adapter, keep in mind the typical ISA system
609 resource usage as indicated in the tables below. 609 resource usage as indicated in the tables below.
610 610
611 I/O Address Device IRQ Device 611 I/O Address Device IRQ Device
612 ----------- -------- --- -------- 612 ----------- -------- --- --------
613 200-20F Game I/O adapter 3 COM2, Bus Mouse 613 200-20F Game I/O adapter 3 COM2, Bus Mouse
614 230-23F Bus Mouse 4 COM1 614 230-23F Bus Mouse 4 COM1
615 270-27F LPT3: third parallel port 5 LPT2 615 270-27F LPT3: third parallel port 5 LPT2
616 2F0-2FF COM2: second serial port 6 Floppy Disk controller 616 2F0-2FF COM2: second serial port 6 Floppy Disk controller
617 320-32F Fixed disk controller 7 LPT1 617 320-32F Fixed disk controller 7 LPT1
618 8 Real-time Clock 618 8 Real-time Clock
619 9 EGA/VGA display adapter 619 9 EGA/VGA display adapter
620 12 Mouse (PS/2) 620 12 Mouse (PS/2)
621 Memory Address Device 13 Math Coprocessor 621 Memory Address Device 13 Math Coprocessor
622 -------------- --------------------- 14 Hard Disk controller 622 -------------- --------------------- 14 Hard Disk controller
623 A000-BFFF EGA Graphics Adapter 623 A000-BFFF EGA Graphics Adapter
624 A000-C7FF VGA Graphics Adapter 624 A000-C7FF VGA Graphics Adapter
625 B000-BFFF Mono Graphics Adapter 625 B000-BFFF Mono Graphics Adapter
626 B800-BFFF Color Graphics Adapter 626 B800-BFFF Color Graphics Adapter
627 E000-FFFF AT BIOS 627 E000-FFFF AT BIOS
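
As a rough illustration of how the I/O address column above can be used when picking a value in Manual Setup, the sketch below checks a candidate I/O base against the listed ranges. The function name io_conflicts and the default window of 16 I/O locations are assumptions made for this example, not values taken from this document; adjust the length to match your adapter.

```python
# Typical ISA I/O ranges copied from the table above (inclusive, hexadecimal).
TYPICAL_IO_RANGES = [
    (0x200, 0x20F, "Game I/O adapter"),
    (0x230, 0x23F, "Bus Mouse"),
    (0x270, 0x27F, "LPT3: third parallel port"),
    (0x2F0, 0x2FF, "COM2: second serial port"),
    (0x320, 0x32F, "Fixed disk controller"),
]


def io_conflicts(base, length=16):
    """Return the devices from the table whose range overlaps [base, base+length).

    `length` is an assumed size of the adapter's I/O window (hypothetical default).
    """
    last = base + length - 1
    return [name for lo, hi, name in TYPICAL_IO_RANGES
            if base <= hi and lo <= last]
```

An empty list only means the candidate avoids the typical assignments listed here; other devices in a given PC may still claim the range, so the diagnostic self-test remains the authoritative check.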
628 628
629 629
630 630
631 631
632 6.0 TECHNICAL SUPPORT 632 6.0 TECHNICAL SUPPORT
633 =============================================================================== 633 ===============================================================================
634 634
635 6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT 635 6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT
636 636
637 Cirrus Logic's CS89XX Technical Application Support can be reached at: 637 Cirrus Logic's CS89XX Technical Application Support can be reached at:
638 638
639 Telephone :(800) 888-5016 (from inside U.S. and Canada) 639 Telephone :(800) 888-5016 (from inside U.S. and Canada)
640 :(512) 442-7555 (from outside the U.S. and Canada) 640 :(512) 442-7555 (from outside the U.S. and Canada)
641 Fax :(512) 912-3871 641 Fax :(512) 912-3871
642 Email :ethernet@crystal.cirrus.com 642 Email :ethernet@crystal.cirrus.com
643 WWW :http://www.cirrus.com 643 WWW :http://www.cirrus.com
644 644
645 645
646 6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT 646 6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT
647 647
648 Before contacting Cirrus Logic for technical support, be prepared to provide as 648 Before contacting Cirrus Logic for technical support, be prepared to provide as
649 much of the following information as possible. 649 much of the following information as possible.
650 650
651 1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.) 651 1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.)
652 652
653 2.) Adapter configuration 653 2.) Adapter configuration
654 654
655 * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel 655 * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel
656 * Plug and Play enabled/disabled (CS8920-based adapters only) 656 * Plug and Play enabled/disabled (CS8920-based adapters only)
657 * Configured for media auto-detect or specific media type (which type). 657 * Configured for media auto-detect or specific media type (which type).
658 658
659 3.) PC System's Configuration 659 3.) PC System's Configuration
660 660
661 * Plug and Play system (yes/no) 661 * Plug and Play system (yes/no)
662 * BIOS (make and version) 662 * BIOS (make and version)
663 * System make and model 663 * System make and model
664 * CPU (type and speed) 664 * CPU (type and speed)
665 * System RAM 665 * System RAM
666 * SCSI Adapter 666 * SCSI Adapter
667 667
668 4.) Software 668 4.) Software
669 669
670 * CS89XX driver and version 670 * CS89XX driver and version
671 * Your network operating system and version 671 * Your network operating system and version
672 * Your system's OS version 672 * Your system's OS version
673 * Version of all protocol support files 673 * Version of all protocol support files
674 674
675 5.) Any Error Message displayed. 675 5.) Any Error Message displayed.
676 676
677 677
678 678
679 6.3 OBTAINING THE LATEST DRIVER VERSION 679 6.3 OBTAINING THE LATEST DRIVER VERSION
680 680
681 You can obtain the latest CS89XX drivers and support software from Cirrus Logic's 681 You can obtain the latest CS89XX drivers and support software from Cirrus Logic's
682 Web site. You can also contact Cirrus Logic's Technical Support (email: 682 Web site. You can also contact Cirrus Logic's Technical Support (email:
683 ethernet@crystal.cirrus.com) and request that you be registered for automatic 683 ethernet@crystal.cirrus.com) and request that you be registered for automatic
684 software-update notification. 684 software-update notification.
685 685
686 Cirrus Logic maintains a web page at http://www.cirrus.com with the 686 Cirrus Logic maintains a web page at http://www.cirrus.com with the
687 the latest drivers and technical publications. 687 latest drivers and technical publications.
688 688
689 689
690 6.4 Current maintainer 690 6.4 Current maintainer
691 691
692 In February 2000 the maintenance of this driver was assumed by Andrew 692 In February 2000 the maintenance of this driver was assumed by Andrew
693 Morton <akpm@zip.com.au> 693 Morton <akpm@zip.com.au>
694 694
695 6.5 Kernel module parameters 695 6.5 Kernel module parameters
696 696
697 For use in embedded environments with no cs89x0 EEPROM, the kernel boot 697 For use in embedded environments with no cs89x0 EEPROM, the kernel boot
698 parameter `cs89x0_media=' has been implemented. Usage is: 698 parameter `cs89x0_media=' has been implemented. Usage is:
699 699
700 cs89x0_media=rj45 or 700 cs89x0_media=rj45 or
701 cs89x0_media=aui or 701 cs89x0_media=aui or
702 cs89x0_media=bnc 702 cs89x0_media=bnc
703 703
704 704
Documentation/networking/decnet.txt
1 Linux DECnet Networking Layer Information 1 Linux DECnet Networking Layer Information
2 =========================================== 2 ===========================================
3 3
4 1) Other documentation.... 4 1) Other documentation....
5 5
6 o Project Home Pages 6 o Project Home Pages
7 http://www.chygwyn.com/DECnet/ - Kernel info 7 http://www.chygwyn.com/DECnet/ - Kernel info
8 http://linux-decnet.sourceforge.net/ - Userland tools 8 http://linux-decnet.sourceforge.net/ - Userland tools
9 http://www.sourceforge.net/projects/linux-decnet/ - Status page 9 http://www.sourceforge.net/projects/linux-decnet/ - Status page
10 10
11 2) Configuring the kernel 11 2) Configuring the kernel
12 12
13 Be sure to turn on the following options: 13 Be sure to turn on the following options:
14 14
15 CONFIG_DECNET (obviously) 15 CONFIG_DECNET (obviously)
16 CONFIG_PROC_FS (to see what's going on) 16 CONFIG_PROC_FS (to see what's going on)
17 CONFIG_SYSCTL (for easy configuration) 17 CONFIG_SYSCTL (for easy configuration)
18 18
19 if you want to try out router support (not properly debugged yet) 19 if you want to try out router support (not properly debugged yet)
20 you'll need the following options as well... 20 you'll need the following options as well...
21 21
22 CONFIG_DECNET_ROUTER (to be able to add/delete routes) 22 CONFIG_DECNET_ROUTER (to be able to add/delete routes)
23 CONFIG_NETFILTER (will be required for the DECnet routing daemon) 23 CONFIG_NETFILTER (will be required for the DECnet routing daemon)
24 24
25 CONFIG_DECNET_ROUTE_FWMARK is optional 25 CONFIG_DECNET_ROUTE_FWMARK is optional
26 26
27 Don't turn on SIOCGIFCONF support for DECnet unless you are really sure 27 Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
28 that you need it, in general you won't and it can cause ifconfig to 28 that you need it, in general you won't and it can cause ifconfig to
29 malfunction. 29 malfunction.
30 30
31 Run time configuration has changed slightly from the 2.4 system. If you 31 Run time configuration has changed slightly from the 2.4 system. If you
32 want to configure an endnode, then the simplified procedure is as follows: 32 want to configure an endnode, then the simplified procedure is as follows:
33 33
34 o Set the MAC address on your ethernet card before starting _any_ other 34 o Set the MAC address on your ethernet card before starting _any_ other
35 network protocols. 35 network protocols.
36 36
37 As soon as your network card is brought into the UP state, DECnet should 37 As soon as your network card is brought into the UP state, DECnet should
38 start working. If you need something more complicated or are unsure how 38 start working. If you need something more complicated or are unsure how
39 to set the MAC address, see the next section. Also all configurations which 39 to set the MAC address, see the next section. Also all configurations which
40 worked with 2.4 will work under 2.5 with no change. 40 worked with 2.4 will work under 2.5 with no change.
41 41
42 3) Command line options 42 3) Command line options
43 43
44 You can set a DECnet address on the kernel command line for compatibility 44 You can set a DECnet address on the kernel command line for compatibility
45 with the 2.4 configuration procedure, but in general it's not needed any more. 45 with the 2.4 configuration procedure, but in general it's not needed any more.
46 If you do set a DECnet address on the command line, it has only one purpose 46 If you do set a DECnet address on the command line, it has only one purpose
47 which is that it's added to the addresses on the loopback device. 47 which is that it's added to the addresses on the loopback device.
48 48
49 With 2.4 kernels, DECnet would only recognise addresses as local if they 49 With 2.4 kernels, DECnet would only recognise addresses as local if they
50 were added to the loopback device. In 2.5, any local interface address 50 were added to the loopback device. In 2.5, any local interface address
51 can be used to loop back to the local machine. Of course this does not 51 can be used to loop back to the local machine. Of course this does not
52 prevent you adding further addresses to the loopback device if you 52 prevent you adding further addresses to the loopback device if you
53 want to. 53 want to.
54 54
55 N.B. Since the address list of an interface determines the addresses for 55 N.B. Since the address list of an interface determines the addresses for
56 which "hello" messages are sent, if you don't set an address on the loopback 56 which "hello" messages are sent, if you don't set an address on the loopback
57 interface then you won't see any entries in /proc/net/neigh for the local 57 interface then you won't see any entries in /proc/net/neigh for the local
58 host until such time as you start a connection. This doesn't affect the 58 host until such time as you start a connection. This doesn't affect the
59 operation of the local communications in any other way though. 59 operation of the local communications in any other way though.
60 60
61 The kernel command line takes options looking like the following: 61 The kernel command line takes options looking like the following:
62 62
63 decnet=1,2 63 decnet=1,2
64 64
65 the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels 65 the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels
66 and early 2.3.xx kernels, you must use a comma when specifying the 66 and early 2.3.xx kernels, you must use a comma when specifying the
67 DECnet address like this. For more recent 2.3.xx kernels, you may 67 DECnet address like this. For more recent 2.3.xx kernels, you may
68 use almost any character except space, although a `.` would be the most 68 use almost any character except space, although a `.` would be the most
69 obvious choice :-) 69 obvious choice :-)
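
To make the option format concrete, here is a minimal sketch of splitting such an option into its area and node numbers. It is not the kernel's parser: the function name parse_decnet_option is hypothetical, the separator is assumed to be any single character that is neither a digit nor whitespace, and the range limits (area 1-63, node 1-1023) come from DECnet Phase IV addressing rather than from this document.

```python
import re


def parse_decnet_option(option):
    """Split 'decnet=1,2' or 'decnet=1.2' into an (area, node) tuple."""
    key, _, value = option.partition("=")
    if key != "decnet":
        raise ValueError("not a decnet= option: %r" % option)
    # Assumption: exactly one separator character, anything but a digit or space.
    match = re.fullmatch(r"(\d+)[^\d\s](\d+)", value)
    if match is None:
        raise ValueError("malformed DECnet address: %r" % value)
    area, node = int(match.group(1)), int(match.group(2))
    # DECnet Phase IV limits: 6-bit area, 10-bit node, zero not allowed.
    if not (1 <= area <= 63 and 1 <= node <= 1023):
        raise ValueError("DECnet address out of range: %d.%d" % (area, node))
    return area, node
```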
70 70
71 There used to be a third number specifying the node type. This option 71 There used to be a third number specifying the node type. This option
72 has gone away in favour of a per interface node type. This is now set 72 has gone away in favour of a per interface node type. This is now set
73 using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be 73 using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be
74 set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router. 74 set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router.
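
The digit-to-type mapping above can be sketched as a small lookup. The helper names forwarding_path and describe_forwarding are made up for this example; only the path layout and the three values are taken from the text.

```python
# Node types as listed in the text: 0=EndNode, 1=L1 Router, 2=L2 Router.
NODE_TYPES = {0: "EndNode", 1: "L1 Router", 2: "L2 Router"}


def forwarding_path(dev):
    """Build the per-interface forwarding file path for device `dev`."""
    return "/proc/sys/net/decnet/conf/%s/forwarding" % dev


def describe_forwarding(text):
    """Translate the contents of a forwarding file into a node type name."""
    value = int(text.strip())
    if value not in NODE_TYPES:
        raise ValueError("unexpected forwarding value: %d" % value)
    return NODE_TYPES[value]
```

On a system with DECnet loaded, the contents of forwarding_path("eth0") could be read and passed to describe_forwarding().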
75 75
76 There are also equivalent options for modules. The node address can 76 There are also equivalent options for modules. The node address can
77 also be set through the /proc/sys/net/decnet/ files, as can other system 77 also be set through the /proc/sys/net/decnet/ files, as can other system
78 parameters. 78 parameters.
79 79
80 Currently the only supported devices are ethernet and ip_gre. The 80 Currently the only supported devices are ethernet and ip_gre. The
81 ethernet address of your ethernet card has to be set according to the DECnet 81 ethernet address of your ethernet card has to be set according to the DECnet
82 address of the node in order for it to be autoconfigured (and then appear in 82 address of the node in order for it to be autoconfigured (and then appear in
83 /proc/net/decnet_dev). There is a utility available at the above 83 /proc/net/decnet_dev). There is a utility available at the above
84 FTP sites called dn2ethaddr which can compute the correct ethernet 84 FTP sites called dn2ethaddr which can compute the correct ethernet
85 address to use. The address can be set by ifconfig either before at 85 address to use. The address can be set by ifconfig either before or
86 at the time the device is brought up. If you are using RedHat you can 86 at the time the device is brought up. If you are using RedHat you can
87 add the line: 87 add the line:
88 88
89 MACADDR=AA:00:04:00:03:04 89 MACADDR=AA:00:04:00:03:04
90 90
91 or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or 91 or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or
92 wherever your network card's configuration lives. Setting the MAC address 92 wherever your network card's configuration lives. Setting the MAC address
93 of your ethernet card to an address starting with "hi-ord" will cause a 93 of your ethernet card to an address starting with "hi-ord" will cause a
94 DECnet address which matches to be added to the interface (which you can 94 DECnet address which matches to be added to the interface (which you can
95 verify with iproute2). 95 verify with iproute2).
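
The relationship between a DECnet address and its ethernet address can be sketched as follows. This is an illustration of the usual DECnet Phase IV rule (the 16-bit address is area * 1024 + node, appended low byte first to the AA:00:04:00 prefix), not the source of dn2ethaddr; the function names are hypothetical.

```python
HIORD = "AA:00:04:00"  # fixed prefix of DECnet-style ethernet addresses


def decnet_to_mac(area, node):
    """Return the ethernet address matching DECnet address area.node."""
    addr = (area << 10) | node          # 6-bit area, 10-bit node
    return "%s:%02X:%02X" % (HIORD, addr & 0xFF, addr >> 8)


def mac_to_decnet(mac):
    """Inverse of decnet_to_mac(): return (area, node) for a DECnet-style MAC."""
    octets = mac.upper().split(":")
    if len(octets) != 6 or ":".join(octets[:4]) != HIORD:
        raise ValueError("not a DECnet-style MAC address: %r" % mac)
    addr = int(octets[4], 16) | (int(octets[5], 16) << 8)
    return addr >> 10, addr & 0x3FF
```

Under this rule the MACADDR example above, AA:00:04:00:03:04, corresponds to DECnet address 1.3.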
96 96
97 The default device for routing can be set through the /proc filesystem 97 The default device for routing can be set through the /proc filesystem
98 by setting /proc/sys/net/decnet/default_device to the 98 by setting /proc/sys/net/decnet/default_device to the
99 device you want DECnet to route packets out of when no specific route 99 device you want DECnet to route packets out of when no specific route
100 is available. Usually this will be eth0, for example: 100 is available. Usually this will be eth0, for example:
101 101
102 echo -n "eth0" >/proc/sys/net/decnet/default_device 102 echo -n "eth0" >/proc/sys/net/decnet/default_device
103 103
104 If you don't set the default device, then it will default to the first 104 If you don't set the default device, then it will default to the first
105 ethernet card which has been autoconfigured as described above. You can 105 ethernet card which has been autoconfigured as described above. You can
106 confirm that by looking in the default_device file of course. 106 confirm that by looking in the default_device file of course.
107 107
108 There is a list of what the other files under /proc/sys/net/decnet/ do 108 There is a list of what the other files under /proc/sys/net/decnet/ do
109 on the kernel patch web site (shown above). 109 on the kernel patch web site (shown above).
110 110
111 4) Run time kernel configuration 111 4) Run time kernel configuration
112 112
113 This is either done through the sysctl/proc interface (see the kernel web 113 This is either done through the sysctl/proc interface (see the kernel web
114 pages for details on what the various options do) or through the iproute2 114 pages for details on what the various options do) or through the iproute2
115 package in the same way as IPv4/6 configuration is performed. 115 package in the same way as IPv4/6 configuration is performed.
116 116
117 Documentation for iproute2 is included with the package, although there is 117 Documentation for iproute2 is included with the package, although there is
118 as yet no specific section on DECnet, most of the features apply to both 118 as yet no specific section on DECnet, most of the features apply to both
119 IP and DECnet, albeit with DECnet addresses instead of IP addresses and 119 IP and DECnet, albeit with DECnet addresses instead of IP addresses and
120 a reduced functionality. 120 a reduced functionality.
121 121
122 If you want to configure a DECnet router you'll need the iproute2 package 122 If you want to configure a DECnet router you'll need the iproute2 package
123 since it's the _only_ way to add and delete routes currently. Eventually 123 since it's the _only_ way to add and delete routes currently. Eventually
124 there will be a routing daemon to send and receive routing messages for 124 there will be a routing daemon to send and receive routing messages for
125 each interface and update the kernel routing tables accordingly. The 125 each interface and update the kernel routing tables accordingly. The
126 routing daemon will use netfilter to listen to routing packets, and 126 routing daemon will use netfilter to listen to routing packets, and
127 rtnetlink to update the kernel's routing tables. 127 rtnetlink to update the kernel's routing tables.
128 128
129 The DECnet raw socket layer has been removed since it was there purely 129 The DECnet raw socket layer has been removed since it was there purely
130 for use by the routing daemon which will now use netfilter (a much cleaner 130 for use by the routing daemon which will now use netfilter (a much cleaner
131 and more generic solution) instead. 131 and more generic solution) instead.
132 132
133 5) How can I tell if it's working ? 133 5) How can I tell if it's working ?
134 134
135 Here is a quick guide of what to look for in order to know if your DECnet 135 Here is a quick guide of what to look for in order to know if your DECnet
136 kernel subsystem is working. 136 kernel subsystem is working.
137 137
138 - Is the node address set (see /proc/sys/net/decnet/node_address) 138 - Is the node address set (see /proc/sys/net/decnet/node_address)
139 - Is the node of the correct type 139 - Is the node of the correct type
140 (see /proc/sys/net/decnet/conf/<dev>/forwarding) 140 (see /proc/sys/net/decnet/conf/<dev>/forwarding)
141 - Is the Ethernet MAC address of each Ethernet card set to match 141 - Is the Ethernet MAC address of each Ethernet card set to match
142 the DECnet address. If in doubt use the dn2ethaddr utility available 142 the DECnet address. If in doubt use the dn2ethaddr utility available
143 at the ftp archive. 143 at the ftp archive.
144 - If the previous two steps are satisfied, and the Ethernet card is up, 144 - If the previous two steps are satisfied, and the Ethernet card is up,
145 you should find that it is listed in /proc/net/decnet_dev and also 145 you should find that it is listed in /proc/net/decnet_dev and also
146 that it appears as a directory in /proc/sys/net/decnet/conf/. The 146 that it appears as a directory in /proc/sys/net/decnet/conf/. The
147 loopback device (lo) should also appear and is required to communicate 147 loopback device (lo) should also appear and is required to communicate
148 within a node. 148 within a node.
149 - If you have any DECnet routers on your network, they should appear 149 - If you have any DECnet routers on your network, they should appear
150 in /proc/net/decnet_neigh, otherwise this file will only contain the 150 in /proc/net/decnet_neigh, otherwise this file will only contain the
151 entry for the node itself (if it doesn't check to see if lo is up). 151 entry for the node itself (if it doesn't check to see if lo is up).
152 - If you want to send to any node which is not listed in the 152 - If you want to send to any node which is not listed in the
153 /proc/net/decnet_neigh file, you'll need to set the default device 153 /proc/net/decnet_neigh file, you'll need to set the default device
154 to point to an Ethernet card with connection to a router. This is 154 to point to an Ethernet card with connection to a router. This is
155 again done with the /proc/sys/net/decnet/default_device file. 155 again done with the /proc/sys/net/decnet/default_device file.
156 - Try starting a simple server and client, like the dnping/dnmirror 156 - Try starting a simple server and client, like the dnping/dnmirror
157 over the loopback interface. With luck they should communicate. 157 over the loopback interface. With luck they should communicate.
158 For this step and those after, you'll need the DECnet library 158 For this step and those after, you'll need the DECnet library
159 which can be obtained from the above ftp sites as well as the 159 which can be obtained from the above ftp sites as well as the
160 actual utilities themselves. 160 actual utilities themselves.
161 - If this seems to work, then try talking to a node on your local 161 - If this seems to work, then try talking to a node on your local
162 network, and see if you can obtain the same results. 162 network, and see if you can obtain the same results.
163 - At this point you are on your own... :-) 163 - At this point you are on your own... :-)
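
The first few checks in the list above can be partly automated. The sketch below only reports which of the /proc entries named in the checklist are absent; it does not verify their contents. The function name missing_decnet_paths and the injectable `exists` argument are choices made for this example.

```python
import os


def missing_decnet_paths(dev, exists=os.path.exists):
    """Return the checklist paths for interface `dev` that do not exist.

    `exists` defaults to os.path.exists and can be replaced for testing.
    """
    paths = [
        "/proc/sys/net/decnet/node_address",
        "/proc/sys/net/decnet/conf/%s/forwarding" % dev,
        "/proc/net/decnet_dev",
        "/proc/net/decnet_neigh",
        "/proc/sys/net/decnet/default_device",
    ]
    return [p for p in paths if not exists(p)]


if __name__ == "__main__":
    for path in missing_decnet_paths("eth0"):
        print("missing:", path)
```

An empty result suggests the DECnet subsystem is loaded and the interface has been autoconfigured; the remaining steps (neighbour entries, dnping/dnmirror) still need to be checked by hand.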
164 164
165 6) How to send a bug report 165 6) How to send a bug report
166 166
167 If you've found a bug and want to report it, then there are several things 167 If you've found a bug and want to report it, then there are several things
168 you can do to help me work out exactly what it is that is wrong. Useful 168 you can do to help me work out exactly what it is that is wrong. Useful
169 information (_most_ of which _is_ _essential_) includes: 169 information (_most_ of which _is_ _essential_) includes:
170 170
171 - What kernel version are you running ? 171 - What kernel version are you running ?
172 - What version of the patch are you running ? 172 - What version of the patch are you running ?
173 - How far through the above set of tests can you get ? 173 - How far through the above set of tests can you get ?
174 - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ? 174 - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ?
175 - Which services are you running ? 175 - Which services are you running ?
176 - Which client caused the problem ? 176 - Which client caused the problem ?
177 - How much data was being transferred ? 177 - How much data was being transferred ?
178 - Was the network congested ? 178 - Was the network congested ?
179 - How can the problem be reproduced ? 179 - How can the problem be reproduced ?
180 - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of 180 - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of
181 tcpdump don't understand how to dump DECnet properly, so including 181 tcpdump don't understand how to dump DECnet properly, so including
182 the hex listing of the packet contents is _essential_, usually the -x flag. 182 the hex listing of the packet contents is _essential_, usually the -x flag.
183 You may also need to increase the length grabbed with the -s flag. The 183 You may also need to increase the length grabbed with the -s flag. The
184 -e flag also provides very useful information (ethernet MAC addresses)) 184 -e flag also provides very useful information (ethernet MAC addresses))
185 185
186 7) MAC FAQ 186 7) MAC FAQ
187 187
188 A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet 188 A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet
189 interact and how to get the best performance from your hardware. 189 interact and how to get the best performance from your hardware.
190 190
191 Ethernet cards are designed to normally only pass received network frames 191 Ethernet cards are designed to normally only pass received network frames
192 to a host computer when they are addressed to it, or to the broadcast address. 192 to a host computer when they are addressed to it, or to the broadcast address.
193 193
194 Linux has an interface which allows the setting of extra addresses for 194 Linux has an interface which allows the setting of extra addresses for
195 an ethernet card to listen to. If the ethernet card supports it, the 195 an ethernet card to listen to. If the ethernet card supports it, the
196 filtering operation will be done in hardware; if not, the extra unwanted packets 196 filtering operation will be done in hardware; if not, the extra unwanted packets
197 received will be discarded by the host computer. In the latter case, 197 received will be discarded by the host computer. In the latter case,
198 significant processor time and bus bandwidth can be used up on a busy 198 significant processor time and bus bandwidth can be used up on a busy
199 network (see the NAPI documentation for a longer explanation of these 199 network (see the NAPI documentation for a longer explanation of these
200 effects). 200 effects).
201 201
202 DECnet makes use of this interface to allow running DECnet on an ethernet 202 DECnet makes use of this interface to allow running DECnet on an ethernet
203 card which has already been configured using TCP/IP (presumably using the 203 card which has already been configured using TCP/IP (presumably using the
204 built in MAC address of the card, as usual) and/or to allow multiple DECnet 204 built in MAC address of the card, as usual) and/or to allow multiple DECnet
205 addresses on each physical interface. If you do this, be aware that if your 205 addresses on each physical interface. If you do this, be aware that if your
206 ethernet card doesn't support perfect hashing in its MAC address filter 206 ethernet card doesn't support perfect hashing in its MAC address filter
207 then your computer will be doing more work than required. Some cards 207 then your computer will be doing more work than required. Some cards
208 will simply set themselves into promiscuous mode in order to receive 208 will simply set themselves into promiscuous mode in order to receive
209 packets from the DECnet specified addresses. So if you have one of these 209 packets from the DECnet specified addresses. So if you have one of these
210 cards it's better to set the MAC address of the card as described above 210 cards it's better to set the MAC address of the card as described above
211 to gain the best efficiency. Better still is to use a card which supports 211 to gain the best efficiency. Better still is to use a card which supports
212 NAPI as well. 212 NAPI as well.
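
The section on setting the MAC address is outside this excerpt. As a rough
sketch only, assuming the usual DECnet Phase IV convention (the prefix
AA:00:04:00 followed by the 16-bit node address, area*1024 + node, low byte
first), the address the card would need can be computed as follows. The
function name decnet_mac is made up for this illustration.

```python
def decnet_mac(area, node):
    """Return the ethernet MAC address for DECnet node address area.node.

    Assumes the DECnet Phase IV convention: AA:00:04:00 followed by the
    16-bit address (area * 1024 + node) stored low byte first.
    """
    if not 1 <= area <= 63:
        raise ValueError("area must be in the range 1-63")
    if not 1 <= node <= 1023:
        raise ValueError("node must be in the range 1-1023")
    addr = (area << 10) | node
    low, high = addr & 0xFF, addr >> 8
    return "AA:00:04:00:%02X:%02X" % (low, high)


if __name__ == "__main__":
    print(decnet_mac(1, 1))  # prints AA:00:04:00:01:04
```

Setting the hardware address to this value lets the card's normal unicast
filter do the work, rather than relying on the extra-address interface.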
213 213
214 214
215 8) Mailing list 215 8) Mailing list
216 216
217 If you are keen to get involved in development, or want to ask questions 217 If you are keen to get involved in development, or want to ask questions
218 about configuration, or even just report bugs, then there is a mailing 218 about configuration, or even just report bugs, then there is a mailing
219 list that you can join, details are at: 219 list that you can join, details are at:
220 220
221 http://sourceforge.net/mail/?group_id=4993 221 http://sourceforge.net/mail/?group_id=4993
222 222
223 9) Legal Info 223 9) Legal Info
224 224
225 The Linux DECnet project team have placed their code under the GPL. The 225 The Linux DECnet project team have placed their code under the GPL. The
226 software is provided "as is" and without warranty express or implied. 226 software is provided "as is" and without warranty express or implied.
227 DECnet is a trademark of Compaq. This software is not a product of 227 DECnet is a trademark of Compaq. This software is not a product of
228 Compaq. We acknowledge the help of people at Compaq in providing extra 228 Compaq. We acknowledge the help of people at Compaq in providing extra
229 documentation above and beyond what was previously publicly available. 229 documentation above and beyond what was previously publicly available.
230 230
231 Steve Whitehouse <SteveW@ACM.org> 231 Steve Whitehouse <SteveW@ACM.org>
232 232
233 233
Documentation/networking/e1000.txt
1 Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters 1 Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters
2 =============================================================== 2 ===============================================================
3 3
4 November 15, 2005 4 November 15, 2005
5 5
6 6
7 Contents 7 Contents
8 ======== 8 ========
9 9
10 - In This Release 10 - In This Release
11 - Identifying Your Adapter 11 - Identifying Your Adapter
12 - Command Line Parameters 12 - Command Line Parameters
13 - Speed and Duplex Configuration 13 - Speed and Duplex Configuration
14 - Additional Configurations 14 - Additional Configurations
15 - Known Issues 15 - Known Issues
16 - Support 16 - Support
17 17
18 18
19 In This Release 19 In This Release
20 =============== 20 ===============
21 21
22 This file describes the Linux* Base Driver for the Intel(R) PRO/1000 Family 22 This file describes the Linux* Base Driver for the Intel(R) PRO/1000 Family
23 of Adapters. This driver includes support for Itanium(R)2-based systems. 23 of Adapters. This driver includes support for Itanium(R)2-based systems.
24 24
25 For questions related to hardware requirements, refer to the documentation 25 For questions related to hardware requirements, refer to the documentation
26 supplied with your Intel PRO/1000 adapter. All hardware requirements listed 26 supplied with your Intel PRO/1000 adapter. All hardware requirements listed
27 apply to use with Linux. 27 apply to use with Linux.
28 28
29 The following features are now available in supported kernels: 29 The following features are now available in supported kernels:
30 - Native VLANs 30 - Native VLANs
31 - Channel Bonding (teaming) 31 - Channel Bonding (teaming)
32 - SNMP 32 - SNMP
33 33
34 Channel Bonding documentation can be found in the Linux kernel source: 34 Channel Bonding documentation can be found in the Linux kernel source:
35 /Documentation/networking/bonding.txt 35 /Documentation/networking/bonding.txt
36 36
37 The driver information previously displayed in the /proc filesystem is not 37 The driver information previously displayed in the /proc filesystem is not
38 supported in this release. Alternatively, you can use ethtool (version 1.6 38 supported in this release. Alternatively, you can use ethtool (version 1.6
39 or later), lspci, and ifconfig to obtain the same information. 39 or later), lspci, and ifconfig to obtain the same information.
40 40
41 Instructions on updating ethtool can be found in the section "Additional 41 Instructions on updating ethtool can be found in the section "Additional
42 Configurations" later in this document. 42 Configurations" later in this document.
43 43
44 44
45 Identifying Your Adapter 45 Identifying Your Adapter
46 ======================== 46 ========================
47 47
48 For more information on how to identify your adapter, go to the Adapter & 48 For more information on how to identify your adapter, go to the Adapter &
49 Driver ID Guide at: 49 Driver ID Guide at:
50 50
51 http://support.intel.com/support/network/adapter/pro100/21397.htm 51 http://support.intel.com/support/network/adapter/pro100/21397.htm
52 52
53 For the latest Intel network drivers for Linux, refer to the following 53 For the latest Intel network drivers for Linux, refer to the following
54 website. In the search field, enter your adapter name or type, or use the 54 website. In the search field, enter your adapter name or type, or use the
55 networking link on the left to search for your adapter: 55 networking link on the left to search for your adapter:
56 56
57 http://downloadfinder.intel.com/scripts-df/support_intel.asp 57 http://downloadfinder.intel.com/scripts-df/support_intel.asp
58 58
59 59
60 Command Line Parameters ======================= 60 Command Line Parameters =======================
61 61
62 If the driver is built as a module, the following optional parameters 62 If the driver is built as a module, the following optional parameters
63 are used by entering them on the command line with the modprobe or insmod 63 are used by entering them on the command line with the modprobe or insmod
64 command using this syntax: 64 command using this syntax:
65 65
66 modprobe e1000 [<option>=<VAL1>,<VAL2>,...] 66 modprobe e1000 [<option>=<VAL1>,<VAL2>,...]
67 67
68 insmod e1000 [<option>=<VAL1>,<VAL2>,...] 68 insmod e1000 [<option>=<VAL1>,<VAL2>,...]
69 69
70 For example, with two PRO/1000 PCI adapters, entering: 70 For example, with two PRO/1000 PCI adapters, entering:
71 71
72 insmod e1000 TxDescriptors=80,128 72 insmod e1000 TxDescriptors=80,128
73 73
74 loads the e1000 driver with 80 TX descriptors for the first adapter and 128 74 loads the e1000 driver with 80 TX descriptors for the first adapter and 128
75 TX descriptors for the second adapter. 75 TX descriptors for the second adapter.
76 76
77 The default value for each parameter is generally the recommended setting, 77 The default value for each parameter is generally the recommended setting,
78 unless otherwise noted. 78 unless otherwise noted.
79 79
80 NOTES: For more information about the AutoNeg, Duplex, and Speed 80 NOTES: For more information about the AutoNeg, Duplex, and Speed
81 parameters, see the "Speed and Duplex Configuration" section in 81 parameters, see the "Speed and Duplex Configuration" section in
82 this document. 82 this document.
83 83
84 For more information about the InterruptThrottleRate, 84 For more information about the InterruptThrottleRate,
85 RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay 85 RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay
86 parameters, see the application note at: 86 parameters, see the application note at:
87 http://www.intel.com/design/network/applnots/ap450.htm 87 http://www.intel.com/design/network/applnots/ap450.htm
88 88
89 A descriptor describes a data buffer and attributes related to 89 A descriptor describes a data buffer and attributes related to
90 the data buffer. This information is accessed by the hardware. 90 the data buffer. This information is accessed by the hardware.
91 91
92 92
93 AutoNeg 93 AutoNeg
94 ------- 94 -------
95 (Supported only on adapters with copper connections) 95 (Supported only on adapters with copper connections)
96 Valid Range: 0x01-0x0F, 0x20-0x2F 96 Valid Range: 0x01-0x0F, 0x20-0x2F
97 Default Value: 0x2F 97 Default Value: 0x2F
98 98
99 This parameter is a bit mask that specifies which speed and duplex 99 This parameter is a bit mask that specifies which speed and duplex
100 settings the board advertises. When this parameter is used, the Speed 100 settings the board advertises. When this parameter is used, the Speed
101 and Duplex parameters must not be specified. 101 and Duplex parameters must not be specified.
102 102
103 NOTE: Refer to the Speed and Duplex section of this readme for more 103 NOTE: Refer to the Speed and Duplex section of this readme for more
104 information on the AutoNeg parameter. 104 information on the AutoNeg parameter.
105 105
106 106
107 Duplex 107 Duplex
108 ------ 108 ------
109 (Supported only on adapters with copper connections) 109 (Supported only on adapters with copper connections)
110 Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) 110 Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full)
111 Default Value: 0 111 Default Value: 0
112 112
113 Defines the direction in which data is allowed to flow. Can be either 113 Defines the direction in which data is allowed to flow. Can be either
114 one or two-directional. If both Duplex and the link partner are set to 114 one or two-directional. If both Duplex and the link partner are set to
115 auto-negotiate, the board auto-detects the correct duplex. If the link 115 auto-negotiate, the board auto-detects the correct duplex. If the link
116 partner is forced (either full or half), Duplex defaults to half-duplex. 116 partner is forced (either full or half), Duplex defaults to half-duplex.
117 117
118 118
119 FlowControl 119 FlowControl
120 ----------- 120 -----------
121 Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) 121 Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
122 Default Value: Reads flow control settings from the EEPROM 122 Default Value: Reads flow control settings from the EEPROM
123 123
124 This parameter controls the automatic generation(Tx) and response(Rx) 124 This parameter controls the automatic generation(Tx) and response(Rx)
125 to Ethernet PAUSE frames. 125 to Ethernet PAUSE frames.
126 126
127 127
128 InterruptThrottleRate 128 InterruptThrottleRate
129 --------------------- 129 ---------------------
130 (not supported on Intel 82542, 82543 or 82544-based adapters) 130 (not supported on Intel 82542, 82543 or 82544-based adapters)
131 Valid Range: 100-100000 (0=off, 1=dynamic) 131 Valid Range: 100-100000 (0=off, 1=dynamic)
132 Default Value: 8000 132 Default Value: 8000
133 133
134 This value represents the maximum number of interrupts per second the 134 This value represents the maximum number of interrupts per second the
135 controller generates. InterruptThrottleRate is another setting used in 135 controller generates. InterruptThrottleRate is another setting used in
136 interrupt moderation. Dynamic mode uses a heuristic algorithm to adjust 136 interrupt moderation. Dynamic mode uses a heuristic algorithm to adjust
137 InterruptThrottleRate based on the current traffic load. 137 InterruptThrottleRate based on the current traffic load.
138 138
139 NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and 139 NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and
140 RxAbsIntDelay parameters. In other words, minimizing the receive 140 RxAbsIntDelay parameters. In other words, minimizing the receive
141 and/or transmit absolute delays does not force the controller to 141 and/or transmit absolute delays does not force the controller to
142 generate more interrupts than what the Interrupt Throttle Rate 142 generate more interrupts than what the Interrupt Throttle Rate
143 allows. 143 allows.
144 144
145 CAUTION: If you are using the Intel PRO/1000 CT Network Connection 145 CAUTION: If you are using the Intel PRO/1000 CT Network Connection
146 (controller 82547), setting InterruptThrottleRate to a value 146 (controller 82547), setting InterruptThrottleRate to a value
147 greater than 75,000 may hang (stop transmitting) adapters 147 greater than 75,000 may hang (stop transmitting) adapters
148 under certain network conditions. If this occurs a NETDEV 148 under certain network conditions. If this occurs a NETDEV
149 WATCHDOG message is logged in the system event log. In 149 WATCHDOG message is logged in the system event log. In
150 addition, the controller is automatically reset, restoring 150 addition, the controller is automatically reset, restoring
151 the network connection. To eliminate the potential for the 151 the network connection. To eliminate the potential for the
152 hang, ensure that InterruptThrottleRate is set no greater 152 hang, ensure that InterruptThrottleRate is set no greater
153 than 75,000 and is not set to 0. 153 than 75,000 and is not set to 0.
154 154
155 NOTE: When e1000 is loaded with default settings and multiple adapters 155 NOTE: When e1000 is loaded with default settings and multiple adapters
156 are in use simultaneously, the CPU utilization may increase non- 156 are in use simultaneously, the CPU utilization may increase non-
157 linearly. In order to limit the CPU utilization without impacting 157 linearly. In order to limit the CPU utilization without impacting
158 the overall throughput, we recommend that you load the driver as 158 the overall throughput, we recommend that you load the driver as
159 follows: 159 follows:
160 160
161 insmod e1000.o InterruptThrottleRate=3000,3000,3000 161 insmod e1000.o InterruptThrottleRate=3000,3000,3000
162 162
163 This sets the InterruptThrottleRate to 3000 interrupts/sec for 163 This sets the InterruptThrottleRate to 3000 interrupts/sec for
164 the first, second, and third instances of the driver. The range 164 the first, second, and third instances of the driver. The range
165 of 2000 to 3000 interrupts per second works on a majority of 165 of 2000 to 3000 interrupts per second works on a majority of
166 systems and is a good starting point, but the optimal value will 166 systems and is a good starting point, but the optimal value will
167 be platform-specific. If CPU utilization is not a concern, use 167 be platform-specific. If CPU utilization is not a concern, use
168 RX_POLLING (NAPI) and default driver settings. 168 RX_POLLING (NAPI) and default driver settings.
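
To make the numbers above concrete, a fixed InterruptThrottleRate can be
read as a minimum spacing between interrupts. The sketch below does only
that arithmetic; the function name is hypothetical and nothing here is taken
from the driver source.

```python
def interrupt_interval_us(rate):
    """Minimum spacing between interrupts, in microseconds.

    rate is an InterruptThrottleRate value as documented above:
    0 = off, 1 = dynamic, otherwise 100-100000 interrupts per second.
    Returns None for the two special values, which have no fixed spacing.
    """
    if rate in (0, 1):
        return None
    if not 100 <= rate <= 100000:
        raise ValueError("InterruptThrottleRate must be 0, 1 or 100-100000")
    return 1000000 / rate


if __name__ == "__main__":
    for rate in (8000, 3000, 2000):
        print(rate, "interrupts/sec ->", round(interrupt_interval_us(rate), 1), "us")
```

At the default of 8000 the controller raises at most one interrupt every
125 microseconds; at 3000 that becomes roughly 333 microseconds.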
169 169
170 170
171 RxDescriptors 171 RxDescriptors
172 ------------- 172 -------------
173 Valid Range: 80-256 for 82542 and 82543-based adapters 173 Valid Range: 80-256 for 82542 and 82543-based adapters
174 80-4096 for all other supported adapters 174 80-4096 for all other supported adapters
175 Default Value: 256 175 Default Value: 256
176 176
177 This value specifies the number of receive descriptors allocated by the 177 This value specifies the number of receive descriptors allocated by the
178 driver. Increasing this value allows the driver to buffer more incoming 178 driver. Increasing this value allows the driver to buffer more incoming
179 packets. Each descriptor is 16 bytes. A receive buffer is also 179 packets. Each descriptor is 16 bytes. A receive buffer is also
180 allocated for each descriptor and is 2048 bytes. 180 allocated for each descriptor and is 2048 bytes.
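
As a rough guide to what a larger RxDescriptors value costs, the memory for
one adapter's receive ring can be estimated from the figures above. This is
a sketch only: it assumes the 2048 figure is in bytes and ignores alignment
and any per-buffer bookkeeping the kernel adds. The names are hypothetical.

```python
DESCRIPTOR_BYTES = 16    # size of one descriptor, from the text above
RX_BUFFER_BYTES = 2048   # receive buffer per descriptor (assumed to be bytes)


def rx_ring_bytes(rx_descriptors):
    """Estimate the memory used by one adapter's receive ring, in bytes."""
    if rx_descriptors < 0:
        raise ValueError("descriptor count cannot be negative")
    return rx_descriptors * (DESCRIPTOR_BYTES + RX_BUFFER_BYTES)


if __name__ == "__main__":
    for count in (80, 256, 4096):
        print(count, "descriptors ->", rx_ring_bytes(count), "bytes")
```

The default of 256 descriptors comes to 528384 bytes (about 516 KB), while
the maximum of 4096 comes to 8454144 bytes (about 8 MB) per adapter.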
181 181
182 182
183 RxIntDelay 183 RxIntDelay
184 ---------- 184 ----------
185 Valid Range: 0-65535 (0=off) 185 Valid Range: 0-65535 (0=off)
186 Default Value: 0 186 Default Value: 0
187 187
188 This value delays the generation of receive interrupts in units of 1.024 188 This value delays the generation of receive interrupts in units of 1.024
189 microseconds. Receive interrupt reduction can improve CPU efficiency if 189 microseconds. Receive interrupt reduction can improve CPU efficiency if
190 properly tuned for specific network traffic. Increasing this value adds 190 properly tuned for specific network traffic. Increasing this value adds
191 extra latency to frame reception and can end up decreasing the throughput 191 extra latency to frame reception and can end up decreasing the throughput
192 of TCP traffic. If the system is reporting dropped receives, this value 192 of TCP traffic. If the system is reporting dropped receives, this value
193 may be set too high, causing the driver to run out of available receive 193 may be set too high, causing the driver to run out of available receive
194 descriptors. 194 descriptors.
195 195
196 CAUTION: When setting RxIntDelay to a value other than 0, adapters may 196 CAUTION: When setting RxIntDelay to a value other than 0, adapters may
197 hang (stop transmitting) under certain network conditions. If 197 hang (stop transmitting) under certain network conditions. If
198 this occurs a NETDEV WATCHDOG message is logged in the system 198 this occurs a NETDEV WATCHDOG message is logged in the system
199 event log. In addition, the controller is automatically reset, 199 event log. In addition, the controller is automatically reset,
200 restoring the network connection. To eliminate the potential 200 restoring the network connection. To eliminate the potential
201 for the hang, ensure that RxIntDelay is set to 0. 201 for the hang, ensure that RxIntDelay is set to 0.
202 202
203 203
204 RxAbsIntDelay 204 RxAbsIntDelay
205 ------------- 205 -------------
206 (This parameter is supported only on 82540, 82545 and later adapters.) 206 (This parameter is supported only on 82540, 82545 and later adapters.)
207 Valid Range: 0-65535 (0=off) 207 Valid Range: 0-65535 (0=off)
208 Default Value: 128 208 Default Value: 128
209 209
210 This value, in units of 1.024 microseconds, limits the delay in which a 210 This value, in units of 1.024 microseconds, limits the delay in which a
211 receive interrupt is generated. Useful only if RxIntDelay is non-zero, 211 receive interrupt is generated. Useful only if RxIntDelay is non-zero,
212 this value ensures that an interrupt is generated after the initial 212 this value ensures that an interrupt is generated after the initial
213 packet is received within the set amount of time. Proper tuning, 213 packet is received within the set amount of time. Proper tuning,
214 along with RxIntDelay, may improve traffic throughput in specific network 214 along with RxIntDelay, may improve traffic throughput in specific network
215 conditions. 215 conditions.
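
The interrupt delay parameters (RxIntDelay, RxAbsIntDelay, TxIntDelay and
TxAbsIntDelay) are all expressed in units of 1.024 microseconds. A small
sketch of the conversion, using integer nanoseconds to avoid rounding; the
helper name is made up for this example.

```python
UNIT_NS = 1024  # one delay unit is 1.024 microseconds = 1024 nanoseconds


def delay_ns(value):
    """Convert an interrupt delay parameter value to nanoseconds."""
    if not 0 <= value <= 65535:
        raise ValueError("delay value must be in the range 0-65535")
    return value * UNIT_NS


if __name__ == "__main__":
    for name, value in (("RxAbsIntDelay default", 128), ("maximum", 65535)):
        print(name, "=", delay_ns(value) / 1000.0, "microseconds")
```

The RxAbsIntDelay default of 128 is therefore about 131 microseconds, and
the largest permitted value, 65535, is about 67 milliseconds.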
216 216
217 217
218 Speed 218 Speed
219 ----- 219 -----
220 (This parameter is supported only on adapters with copper connections.) 220 (This parameter is supported only on adapters with copper connections.)
221 Valid Settings: 0, 10, 100, 1000 221 Valid Settings: 0, 10, 100, 1000
222 Default Value: 0 (auto-negotiate at all supported speeds) 222 Default Value: 0 (auto-negotiate at all supported speeds)
223 223
224 Speed forces the line speed to the specified value in megabits per second 224 Speed forces the line speed to the specified value in megabits per second
225 (Mbps). If this parameter is not specified or is set to 0 and the link 225 (Mbps). If this parameter is not specified or is set to 0 and the link
226 partner is set to auto-negotiate, the board will auto-detect the correct 226 partner is set to auto-negotiate, the board will auto-detect the correct
227 speed. Duplex should also be set when Speed is set to either 10 or 100. 227 speed. Duplex should also be set when Speed is set to either 10 or 100.
228 228
229 229
230 TxDescriptors 230 TxDescriptors
231 ------------- 231 -------------
232 Valid Range: 80-256 for 82542 and 82543-based adapters 232 Valid Range: 80-256 for 82542 and 82543-based adapters
233 80-4096 for all other supported adapters 233 80-4096 for all other supported adapters
234 Default Value: 256 234 Default Value: 256
235 235
236 This value is the number of transmit descriptors allocated by the driver. 236 This value is the number of transmit descriptors allocated by the driver.
237 Increasing this value allows the driver to queue more transmits. Each 237 Increasing this value allows the driver to queue more transmits. Each
238 descriptor is 16 bytes. 238 descriptor is 16 bytes.
239 239
240 NOTE: Depending on the available system resources, the request for a 240 NOTE: Depending on the available system resources, the request for a
241 higher number of transmit descriptors may be denied. In this case, 241 higher number of transmit descriptors may be denied. In this case,
242 use a lower number. 242 use a lower number.
243 243
244 244
245 TxIntDelay 245 TxIntDelay
246 ---------- 246 ----------
247 Valid Range: 0-65535 (0=off) 247 Valid Range: 0-65535 (0=off)
248 Default Value: 64 248 Default Value: 64
249 249
250 This value delays the generation of transmit interrupts in units of 250 This value delays the generation of transmit interrupts in units of
251 1.024 microseconds. Transmit interrupt reduction can improve CPU 251 1.024 microseconds. Transmit interrupt reduction can improve CPU
252 efficiency if properly tuned for specific network traffic. If the 252 efficiency if properly tuned for specific network traffic. If the
253 system is reporting dropped transmits, this value may be set too high 253 system is reporting dropped transmits, this value may be set too high
254 causing the driver to run out of available transmit descriptors. 254 causing the driver to run out of available transmit descriptors.
255 255
256 256
257 TxAbsIntDelay 257 TxAbsIntDelay
258 ------------- 258 -------------
259 (This parameter is supported only on 82540, 82545 and later adapters.) 259 (This parameter is supported only on 82540, 82545 and later adapters.)
260 Valid Range: 0-65535 (0=off) 260 Valid Range: 0-65535 (0=off)
261 Default Value: 64 261 Default Value: 64
262 262
263 This value, in units of 1.024 microseconds, limits the delay in which a 263 This value, in units of 1.024 microseconds, limits the delay in which a
264 transmit interrupt is generated. Useful only if TxIntDelay is non-zero, 264 transmit interrupt is generated. Useful only if TxIntDelay is non-zero,
265 this value ensures that an interrupt is generated after the initial 265 this value ensures that an interrupt is generated after the initial
266 packet is sent on the wire within the set amount of time. Proper tuning, 266 packet is sent on the wire within the set amount of time. Proper tuning,
267 along with TxIntDelay, may improve traffic throughput in specific 267 along with TxIntDelay, may improve traffic throughput in specific
268 network conditions. 268 network conditions.
269 269
270 XsumRX 270 XsumRX
271 ------ 271 ------
272 (This parameter is NOT supported on the 82542-based adapter.) 272 (This parameter is NOT supported on the 82542-based adapter.)
273 Valid Range: 0-1 273 Valid Range: 0-1
274 Default Value: 1 274 Default Value: 1
275 275
276 A value of '1' indicates that the driver should enable IP checksum 276 A value of '1' indicates that the driver should enable IP checksum
277 offload for received packets (both UDP and TCP) to the adapter hardware. 277 offload for received packets (both UDP and TCP) to the adapter hardware.
278 278
279 279
280 Speed and Duplex Configuration 280 Speed and Duplex Configuration
281 ============================== 281 ==============================
282 282
283 Three keywords are used to control the speed and duplex configuration. 283 Three keywords are used to control the speed and duplex configuration.
284 These keywords are Speed, Duplex, and AutoNeg. 284 These keywords are Speed, Duplex, and AutoNeg.
285 285
286 If the board uses a fiber interface, these keywords are ignored, and the 286 If the board uses a fiber interface, these keywords are ignored, and the
287 fiber interface board only links at 1000 Mbps full-duplex. 287 fiber interface board only links at 1000 Mbps full-duplex.
288 288
289 For copper-based boards, the keywords interact as follows: 289 For copper-based boards, the keywords interact as follows:
290 290
291 The default operation is auto-negotiate. The board advertises all 291 The default operation is auto-negotiate. The board advertises all
292 supported speed and duplex combinations, and it links at the highest 292 supported speed and duplex combinations, and it links at the highest
293 common speed and duplex mode IF the link partner is set to auto-negotiate. 293 common speed and duplex mode IF the link partner is set to auto-negotiate.
294 294
295 If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps 295 If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps
296 is advertised (The 1000BaseT spec requires auto-negotiation.) 296 is advertised (The 1000BaseT spec requires auto-negotiation.)
297 297
298 If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- 298 If Speed = 10 or 100, then both Speed and Duplex should be set. Auto-
299 negotiation is disabled, and the AutoNeg parameter is ignored. Partner 299 negotiation is disabled, and the AutoNeg parameter is ignored. Partner
300 SHOULD also be forced. 300 SHOULD also be forced.
301 301
302 The AutoNeg parameter is used when more control is required over the 302 The AutoNeg parameter is used when more control is required over the
303 auto-negotiation process. It should be used when you wish to control which 303 auto-negotiation process. It should be used when you wish to control which
304 speed and duplex combinations are advertised during the auto-negotiation 304 speed and duplex combinations are advertised during the auto-negotiation
305 process. 305 process.
306 306
307 The parameter may be specified as either a decimal or hexadecimal value as 307 The parameter may be specified as either a decimal or hexadecimal value as
308 determined by the bitmap below. 308 determined by the bitmap below.
309 309
310 Bit position 7 6 5 4 3 2 1 0 310 Bit position 7 6 5 4 3 2 1 0
311 Decimal Value 128 64 32 16 8 4 2 1 311 Decimal Value 128 64 32 16 8 4 2 1
312 Hex value 80 40 20 10 8 4 2 1 312 Hex value 80 40 20 10 8 4 2 1
313 Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10 313 Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10
314 Duplex Full Full Half Full Half 314 Duplex Full Full Half Full Half
315 315
316 Some examples of using AutoNeg: 316 Some examples of using AutoNeg:
317 317
318 modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half) 318 modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half)
319 modprobe e1000 AutoNeg=1 (Same as above) 319 modprobe e1000 AutoNeg=1 (Same as above)
320 modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full) 320 modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full)
321 modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full) 321 modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full)
322 modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half) 322 modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half)
323 modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100 323 modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100
324 Half) 324 Half)
325 modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full) 325 modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full)
326 modprobe e1000 AutoNeg=32 (Same as above) 326 modprobe e1000 AutoNeg=32 (Same as above)
327 327
328 Note that when this parameter is used, Speed and Duplex must not be specified. 328 Note that when this parameter is used, Speed and Duplex must not be specified.
329 329
330 If the link partner is forced to a specific speed and duplex, then this 330 If the link partner is forced to a specific speed and duplex, then this
331 parameter should not be used. Instead, use the Speed and Duplex parameters 331 parameter should not be used. Instead, use the Speed and Duplex parameters
332 previously mentioned to force the adapter to the same speed and duplex. 332 previously mentioned to force the adapter to the same speed and duplex.
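
The AutoNeg bitmap above can be sketched as a small lookup. The bit values
are copied from the table; the function and table names are hypothetical and
only illustrate how a mask is assembled from the modes to be advertised.

```python
# Bit values taken from the AutoNeg bitmap in this document.
AUTONEG_BITS = {
    (10, "half"): 0x01,
    (10, "full"): 0x02,
    (100, "half"): 0x04,
    (100, "full"): 0x08,
    (1000, "full"): 0x20,
}


def autoneg_mask(modes):
    """Return the AutoNeg mask advertising the given (speed, duplex) pairs."""
    mask = 0
    for mode in modes:
        if mode not in AUTONEG_BITS:
            raise ValueError("no AutoNeg bit for %r" % (mode,))
        mask |= AUTONEG_BITS[mode]
    if mask == 0:
        raise ValueError("at least one mode must be advertised")
    return mask


if __name__ == "__main__":
    mask = autoneg_mask([(10, "half"), (100, "half")])
    print("modprobe e1000 AutoNeg=0x%02x" % mask)  # modprobe e1000 AutoNeg=0x05
```

Advertising all five modes gives 0x2F, which matches the documented default.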
333 333
334 334
335 Additional Configurations 335 Additional Configurations
336 ========================= 336 =========================
337 337
338 Configuring the Driver on Different Distributions 338 Configuring the Driver on Different Distributions
339 ------------------------------------------------- 339 -------------------------------------------------
340 340
341 Configuring a network driver to load properly when the system is started 341 Configuring a network driver to load properly when the system is started
342 is distribution dependent. Typically, the configuration process involves 342 is distribution dependent. Typically, the configuration process involves
343 adding an alias line to /etc/modules.conf or /etc/modprobe.conf as well 343 adding an alias line to /etc/modules.conf or /etc/modprobe.conf as well
344 as editing other system startup scripts and/or configuration files. Many 344 as editing other system startup scripts and/or configuration files. Many
345 popular Linux distributions ship with tools to make these changes for you. 345 popular Linux distributions ship with tools to make these changes for you.
346 To learn the proper way to configure a network device for your system, 346 To learn the proper way to configure a network device for your system,
347 refer to your distribution documentation. If during this process you are 347 refer to your distribution documentation. If during this process you are
348 asked for the driver or module name, the name for the Linux Base Driver 348 asked for the driver or module name, the name for the Linux Base Driver
349 for the Intel PRO/1000 Family of Adapters is e1000. 349 for the Intel PRO/1000 Family of Adapters is e1000.
350 350
351 As an example, if you install the e1000 driver for two PRO/1000 adapters 351 As an example, if you install the e1000 driver for two PRO/1000 adapters
352 (eth0 and eth1) and set the speed and duplex to 10full and 100half, add 352 (eth0 and eth1) and set the speed and duplex to 10full and 100half, add
353 the following to modules.conf or or modprobe.conf: 353 the following to modules.conf or modprobe.conf:
354 354
355 alias eth0 e1000 355 alias eth0 e1000
356 alias eth1 e1000 356 alias eth1 e1000
357 options e1000 Speed=10,100 Duplex=2,1 357 options e1000 Speed=10,100 Duplex=2,1
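
The Duplex values in the options line above follow the mapping implied by
the example: 2 for full duplex and 1 for half duplex. As a minimal sketch,
the line can be built from settings written as 10full or 100half. The
helper name e1000_options is hypothetical and is not part of the driver
package:

```sh
# Hypothetical helper: turn settings such as "10full 100half" into a
# modprobe "options" line. The mapping is assumed from the example
# above: half -> Duplex=1, full -> Duplex=2.
e1000_options() {
    speeds=""
    duplexes=""
    for setting in "$@"; do
        case "$setting" in
            *full) speed=${setting%full}; duplex=2 ;;
            *half) speed=${setting%half}; duplex=1 ;;
            *) echo "unrecognized setting: $setting" >&2; return 1 ;;
        esac
        # Join per-adapter values with commas, in adapter order.
        speeds="${speeds:+$speeds,}$speed"
        duplexes="${duplexes:+$duplexes,}$duplex"
    done
    echo "options e1000 Speed=$speeds Duplex=$duplexes"
}

e1000_options 10full 100half
```

The values are listed in adapter order, so the first setting applies to
eth0 and the second to eth1.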
358 358
359 Viewing Link Messages 359 Viewing Link Messages
360 --------------------- 360 ---------------------
361 361
362 Link messages will not be displayed to the console if the distribution is 362 Link messages will not be displayed to the console if the distribution is
363 restricting system messages. In order to see network driver link messages 363 restricting system messages. In order to see network driver link messages
364 on your console, set dmesg to eight by entering the following: 364 on your console, set dmesg to eight by entering the following:
365 365
366 dmesg -n 8 366 dmesg -n 8
367 367
368 NOTE: This setting is not saved across reboots. 368 NOTE: This setting is not saved across reboots.
369 369
370 Jumbo Frames 370 Jumbo Frames
371 ------------ 371 ------------
372 372
373 The driver supports Jumbo Frames for all adapters except 82542 and 373 The driver supports Jumbo Frames for all adapters except 82542 and
374 82573-based adapters. Jumbo Frames support is enabled by changing the 374 82573-based adapters. Jumbo Frames support is enabled by changing the
375 MTU to a value larger than the default of 1500. Use the ifconfig command 375 MTU to a value larger than the default of 1500. Use the ifconfig command
376 to increase the MTU size. For example: 376 to increase the MTU size. For example:
377 377
378 ifconfig eth<x> mtu 9000 up 378 ifconfig eth<x> mtu 9000 up
379 379
380 This setting is not saved across reboots. It can be made permanent if 380 This setting is not saved across reboots. It can be made permanent if
381 you add: 381 you add:
382 382
383 MTU=9000 383 MTU=9000
384 384
385 to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example 385 to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example
386 applies to the Red Hat distributions; other distributions may store this 386 applies to the Red Hat distributions; other distributions may store this
387 setting in a different location. 387 setting in a different location.
388 388
389 Notes: 389 Notes:
390 390
391 - To enable Jumbo Frames, increase the MTU size on the interface beyond 391 - To enable Jumbo Frames, increase the MTU size on the interface beyond
392 1500. 392 1500.
393 - The maximum MTU setting for Jumbo Frames is 16110. This value coincides 393 - The maximum MTU setting for Jumbo Frames is 16110. This value coincides
394 with the maximum Jumbo Frames size of 16128. 394 with the maximum Jumbo Frames size of 16128.
395 - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or 395 - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or
396 loss of link. 396 loss of link.
397 - Some Intel gigabit adapters that support Jumbo Frames have a frame size 397 - Some Intel gigabit adapters that support Jumbo Frames have a frame size
398 limit of 9238 bytes, with a corresponding MTU size limit of 9216 bytes. 398 limit of 9238 bytes, with a corresponding MTU size limit of 9216 bytes.
399 The adapters with this limitation are based on the Intel 82571EB and 399 The adapters with this limitation are based on the Intel 82571EB and
400 82572EI controllers, which correspond to these product names: 400 82572EI controllers, which correspond to these product names:
401 Intel® PRO/1000 PT Dual Port Server Adapter 401 Intel® PRO/1000 PT Dual Port Server Adapter
402 Intel® PRO/1000 PF Dual Port Server Adapter 402 Intel® PRO/1000 PF Dual Port Server Adapter
403 Intel® PRO/1000 PT Server Adapter 403 Intel® PRO/1000 PT Server Adapter
404 Intel® PRO/1000 PT Desktop Adapter 404 Intel® PRO/1000 PT Desktop Adapter
405 Intel® PRO/1000 PF Server Adapter 405 Intel® PRO/1000 PF Server Adapter
406 406
407 - The Intel PRO/1000 PM Network Connection does not support jumbo frames. 407 - The Intel PRO/1000 PM Network Connection does not support jumbo frames.
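
The MTU and frame size figures in the notes differ by the Ethernet framing
overhead. As a rough check, assuming a 14-byte Ethernet header and a 4-byte
frame check sequence, plus a 4-byte VLAN tag for the 82571EB/82572EI pair
(the breakdown is an assumption; the notes give only the totals):

```sh
ETH_HEADER=14   # destination MAC + source MAC + EtherType
ETH_FCS=4       # frame check sequence
VLAN_TAG=4      # 802.1Q tag, counted only when requested

frame_size() {
    # $1 = MTU, $2 = 1 to count a VLAN tag, 0 otherwise
    echo $(( $1 + ETH_HEADER + ETH_FCS + $2 * VLAN_TAG ))
}

frame_size 16110 0   # prints 16128
frame_size 9216 1    # prints 9238
```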
408 408
409 409
410 Ethtool 410 Ethtool
411 ------- 411 -------
412 412
413 The driver utilizes the ethtool interface for driver configuration and 413 The driver utilizes the ethtool interface for driver configuration and
414 diagnostics, as well as displaying statistical information. Ethtool 414 diagnostics, as well as displaying statistical information. Ethtool
415 version 1.6 or later is required for this functionality. 415 version 1.6 or later is required for this functionality.
416 416
417 The latest release of ethtool can be found from 417 The latest release of ethtool can be found from
418 http://sourceforge.net/projects/gkernel. 418 http://sourceforge.net/projects/gkernel.
419 419
420 NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support 420 NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support
421 for a more complete ethtool feature set can be enabled by upgrading 421 for a more complete ethtool feature set can be enabled by upgrading
422 ethtool to ethtool-1.8.1. 422 ethtool to ethtool-1.8.1.
423 423
424 Enabling Wake on LAN* (WoL) 424 Enabling Wake on LAN* (WoL)
425 --------------------------- 425 ---------------------------
426 426
427 WoL is configured through the Ethtool* utility. Ethtool is included with 427 WoL is configured through the Ethtool* utility. Ethtool is included with
428 all versions of Red Hat after Red Hat 7.2. For other Linux distributions, 428 all versions of Red Hat after Red Hat 7.2. For other Linux distributions,
429 download and install Ethtool from the following website: 429 download and install Ethtool from the following website:
430 http://sourceforge.net/projects/gkernel. 430 http://sourceforge.net/projects/gkernel.
431 431
432 For instructions on enabling WoL with Ethtool, refer to the website listed 432 For instructions on enabling WoL with Ethtool, refer to the website listed
433 above. 433 above.
434 434
435 WoL will be enabled on the system during the next shut down or reboot. 435 WoL will be enabled on the system during the next shut down or reboot.
436 For this driver version, in order to enable WoL, the e1000 driver must be 436 For this driver version, in order to enable WoL, the e1000 driver must be
437 loaded when shutting down or rebooting the system. 437 loaded when shutting down or rebooting the system.
438 438
439 NAPI 439 NAPI
440 ---- 440 ----
441 441
442 NAPI (Rx polling mode) is supported in the e1000 driver. NAPI is enabled 442 NAPI (Rx polling mode) is supported in the e1000 driver. NAPI is enabled
443 or disabled based on the configuration of the kernel. To override 443 or disabled based on the configuration of the kernel. To override
444 the default, use the following compile-time flags. 444 the default, use the following compile-time flags.
445 445
446 To enable NAPI, compile the driver module, passing in a configuration option: 446 To enable NAPI, compile the driver module, passing in a configuration option:
447 447
448 make CFLAGS_EXTRA=-DE1000_NAPI install 448 make CFLAGS_EXTRA=-DE1000_NAPI install
449 449
450 To disable NAPI, compile the driver module, passing in a configuration option: 450 To disable NAPI, compile the driver module, passing in a configuration option:
451 451
452 make CFLAGS_EXTRA=-DE1000_NO_NAPI install 452 make CFLAGS_EXTRA=-DE1000_NO_NAPI install
453 453
454 See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. 454 See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
455 455
456 456
457 Known Issues 457 Known Issues
458 ============ 458 ============
459 459
460 Jumbo Frames System Requirement 460 Jumbo Frames System Requirement
461 ------------------------------- 461 -------------------------------
462 462
463 Memory allocation failures have been observed on Linux systems with 64 MB 463 Memory allocation failures have been observed on Linux systems with 64 MB
464 of RAM or less that are running Jumbo Frames. If you are using Jumbo 464 of RAM or less that are running Jumbo Frames. If you are using Jumbo
465 Frames, your system may require more than the advertised minimum 465 Frames, your system may require more than the advertised minimum
466 requirement of 64 MB of system memory. 466 requirement of 64 MB of system memory.
467 467
468 Performance Degradation with Jumbo Frames 468 Performance Degradation with Jumbo Frames
469 ----------------------------------------- 469 -----------------------------------------
470 470
471 Degradation in throughput performance may be observed in some Jumbo frames 471 Degradation in throughput performance may be observed in some Jumbo frames
472 environments. If this is observed, increasing the application's socket 472 environments. If this is observed, increasing the application's socket
473 buffer size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values 473 buffer size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values
474 may help. See the specific application manual and 474 may help. See the specific application manual and
475 /usr/src/linux*/Documentation/ 475 /usr/src/linux*/Documentation/
476 networking/ip-sysctl.txt for more details. 476 networking/ip-sysctl.txt for more details.
477 477
478 Jumbo frames on Foundry BigIron 8000 switch 478 Jumbo frames on Foundry BigIron 8000 switch
479 ------------------------------------------- 479 -------------------------------------------
480 There is a known issue using Jumbo frames when connected to a Foundry 480 There is a known issue using Jumbo frames when connected to a Foundry
481 BigIron 8000 switch. This is a 3rd party limitation. If you experience 481 BigIron 8000 switch. This is a 3rd party limitation. If you experience
482 loss of packets, lower the MTU size. 482 loss of packets, lower the MTU size.
483 483
484 Multiple Interfaces on Same Ethernet Broadcast Network 484 Multiple Interfaces on Same Ethernet Broadcast Network
485 ------------------------------------------------------ 485 ------------------------------------------------------
486 486
487 Due to the default ARP behavior on Linux, it is not possible to have 487 Due to the default ARP behavior on Linux, it is not possible to have
488 one system on two IP networks in the same Ethernet broadcast domain 488 one system on two IP networks in the same Ethernet broadcast domain
489 (non-partitioned switch) behave as expected. All Ethernet interfaces 489 (non-partitioned switch) behave as expected. All Ethernet interfaces
490 will respond to IP traffic for any IP address assigned to the system. 490 will respond to IP traffic for any IP address assigned to the system.
491 This results in unbalanced receive traffic. 491 This results in unbalanced receive traffic.
492 492
493 If you have multiple interfaces in a server, either turn on ARP 493 If you have multiple interfaces in a server, either turn on ARP
494 filtering by entering: 494 filtering by entering:
495 495
496 echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter 496 echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
497 (this only works if your kernel's version is higher than 2.4.5), 497 (this only works if your kernel's version is higher than 2.4.5),
498 498
499 NOTE: This setting is not saved across reboots. The configuration 499 NOTE: This setting is not saved across reboots. The configuration
500 change can be made permanent by adding the line: 500 change can be made permanent by adding the line:
501 net.ipv4.conf.all.arp_filter = 1 501 net.ipv4.conf.all.arp_filter = 1
502 to the file /etc/sysctl.conf 502 to the file /etc/sysctl.conf
503 503
504 or, 504 or,
505 505
506 install the interfaces in separate broadcast domains (either in 506 install the interfaces in separate broadcast domains (either in
507 different switches or in a switch partitioned to VLANs). 507 different switches or in a switch partitioned to VLANs).
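
If you choose ARP filtering, the /etc/sysctl.conf change can be scripted so
that running it more than once does not add duplicate lines. This is a
minimal sketch; persist_arp_filter is a hypothetical helper name, and the
demonstration writes to a scratch file because editing /etc/sysctl.conf
needs root:

```sh
# Hypothetical helper: append the arp_filter line to a sysctl
# configuration file unless that exact line is already present.
persist_arp_filter() {
    conf=${1:-/etc/sysctl.conf}
    line="net.ipv4.conf.all.arp_filter = 1"
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}

# Demonstrate on a scratch file.
scratch=$(mktemp)
persist_arp_filter "$scratch"
persist_arp_filter "$scratch"   # second call adds nothing
cat "$scratch"
rm -f "$scratch"
```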
508 508
509 82541/82547 can't link or are slow to link with some link partners 509 82541/82547 can't link or are slow to link with some link partners
510 ----------------------------------------------------------------- 510 -----------------------------------------------------------------
511 511
512 There is a known compatibility issue with 82541/82547 and some 512 There is a known compatibility issue with 82541/82547 and some
513 low-end switches where the link will not be established, or will 513 low-end switches where the link will not be established, or will
514 be slow to establish. In particular, these switches are known to 514 be slow to establish. In particular, these switches are known to
515 be incompatible with 82541/82547: 515 be incompatible with 82541/82547:
516 516
517 Planex FXG-08TE 517 Planex FXG-08TE
518 I-O Data ETG-SH8 518 I-O Data ETG-SH8
519 519
520 To work around this issue, the driver can be compiled with an override 520 To work around this issue, the driver can be compiled with an override
521 of the PHY's master/slave setting. Forcing master or forcing slave 521 of the PHY's master/slave setting. Forcing master or forcing slave
522 mode will improve time-to-link. 522 mode will improve time-to-link.
523 523
524 # make EXTRA_CFLAGS=-DE1000_MASTER_SLAVE=<n> 524 # make EXTRA_CFLAGS=-DE1000_MASTER_SLAVE=<n>
525 525
526 Where <n> is: 526 Where <n> is:
527 527
528 0 = Hardware default 528 0 = Hardware default
529 1 = Master mode 529 1 = Master mode
530 2 = Slave mode 530 2 = Slave mode
531 3 = Auto master/slave 531 3 = Auto master/slave
532 532
533 Disable rx flow control with ethtool 533 Disable rx flow control with ethtool
534 ------------------------------------ 534 ------------------------------------
535 535
536 In order to disable receive flow control using ethtool, you must turn 536 In order to disable receive flow control using ethtool, you must turn
537 off auto-negotiation on the same command line. 537 off auto-negotiation on the same command line.
538 538
539 For example: 539 For example:
540 540
541 ethtool -A eth? autoneg off rx off 541 ethtool -A eth? autoneg off rx off
542 542
543 543
544 Support 544 Support
545 ======= 545 =======
546 546
547 For general information, go to the Intel support website at: 547 For general information, go to the Intel support website at:
548 548
549 http://support.intel.com 549 http://support.intel.com
550 550
551 or the Intel Wired Networking project hosted by Sourceforge at: 551 or the Intel Wired Networking project hosted by Sourceforge at:
552 552
553 http://sourceforge.net/projects/e1000 553 http://sourceforge.net/projects/e1000
554 554
555 If an issue is identified with the released source code on the supported 555 If an issue is identified with the released source code on the supported
556 kernel with a supported adapter, email the specific information related 556 kernel with a supported adapter, email the specific information related
557 to the issue to e1000-devel@lists.sourceforge.net 557 to the issue to e1000-devel@lists.sourceforge.net
558 558
559 559
560 License 560 License
561 ======= 561 =======
562 562
563 This software program is released under the terms of a license agreement 563 This software program is released under the terms of a license agreement
564 between you ('Licensee') and Intel. Do not use or load this software or any 564 between you ('Licensee') and Intel. Do not use or load this software or any
565 associated materials (collectively, the 'Software') until you have carefully 565 associated materials (collectively, the 'Software') until you have carefully
566 read the full terms and conditions of the file COPYING located in this software 566 read the full terms and conditions of the file COPYING located in this software
567 package. By loading or using the Software, you agree to the terms of this 567 package. By loading or using the Software, you agree to the terms of this
568 Agreement. If you do not agree with the terms of this Agreement, do not 568 Agreement. If you do not agree with the terms of this Agreement, do not
569 install or use the Software. 569 install or use the Software.
570 570
571 * Other names and brands may be claimed as the property of others. 571 * Other names and brands may be claimed as the property of others.
572 572
Documentation/networking/s2io.txt
1 Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver. 1 Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver.
2 2
3 Contents 3 Contents
4 ======= 4 =======
5 - 1. Introduction 5 - 1. Introduction
6 - 2. Identifying the adapter/interface 6 - 2. Identifying the adapter/interface
7 - 3. Features supported 7 - 3. Features supported
8 - 4. Command line parameters 8 - 4. Command line parameters
9 - 5. Performance suggestions 9 - 5. Performance suggestions
10 - 6. Available Downloads 10 - 6. Available Downloads
11 11
12 12
13 1. Introduction: 13 1. Introduction:
14 This Linux driver supports Neterion's Xframe I PCI-X 1.0 and 14 This Linux driver supports Neterion's Xframe I PCI-X 1.0 and
15 Xframe II PCI-X 2.0 adapters. It supports several features 15 Xframe II PCI-X 2.0 adapters. It supports several features
16 such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on. 16 such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on.
17 See below for complete list of features. 17 See below for complete list of features.
18 All features are supported for both IPv4 and IPv6. 18 All features are supported for both IPv4 and IPv6.
19 19
20 2. Identifying the adapter/interface: 20 2. Identifying the adapter/interface:
21 a. Insert the adapter(s) in your system. 21 a. Insert the adapter(s) in your system.
22 b. Build and load driver 22 b. Build and load driver
23 # insmod s2io.ko 23 # insmod s2io.ko
24 c. View log messages 24 c. View log messages
25 # dmesg | tail -40 25 # dmesg | tail -40
26 You will see messages similar to: 26 You will see messages similar to:
27 eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA 27 eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA
28 eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA 28 eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA
29 eth4: Device is on 64 bit 133MHz PCIX(M1) bus 29 eth4: Device is on 64 bit 133MHz PCIX(M1) bus
30 30
31 The above messages identify the adapter type(Xframe I/II), adapter revision, 31 The above messages identify the adapter type(Xframe I/II), adapter revision,
32 driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X). 32 driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X).
33 In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed 33 In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed
34 as well. 34 as well.
35 35
36 To associate an interface with a physical adapter use "ethtool -p <ethX>". 36 To associate an interface with a physical adapter use "ethtool -p <ethX>".
37 The corresponding adapter's LED will blink multiple times. 37 The corresponding adapter's LED will blink multiple times.
38 38
39 3. Features supported: 39 3. Features supported:
40 a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes, 40 a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes,
41 modifiable using ifconfig command. 41 modifiable using ifconfig command.
42 42
43 b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit 43 b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit
44 and receive, TSO. 44 and receive, TSO.
45 45
46 c. Multi-buffer receive mode. Scattering of packet across multiple 46 c. Multi-buffer receive mode. Scattering of packet across multiple
47 buffers. Currently driver supports 2-buffer mode which yields 47 buffers. Currently driver supports 2-buffer mode which yields
48 significant performance improvement on certain platforms(SGI Altix, 48 significant performance improvement on certain platforms(SGI Altix,
49 IBM xSeries). 49 IBM xSeries).
50 50
51 d. MSI/MSI-X. Can be enabled on platforms which support this feature 51 d. MSI/MSI-X. Can be enabled on platforms which support this feature
52 (IA64, Xeon) resulting in noticeable performance improvement (up to 7% 52 (IA64, Xeon) resulting in noticeable performance improvement (up to 7%
53 on certain platforms). 53 on certain platforms).
54 54
55 e. NAPI. Compile-time option(CONFIG_S2IO_NAPI) for better Rx interrupt 55 e. NAPI. Compile-time option(CONFIG_S2IO_NAPI) for better Rx interrupt
56 moderation. 56 moderation.
57 57
58 f. Statistics. Comprehensive MAC-level and software statistics displayed 58 f. Statistics. Comprehensive MAC-level and software statistics displayed
59 using "ethtool -S" option. 59 using "ethtool -S" option.
60 60
61 g. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings, 61 g. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings,
62 with multiple steering options. 62 with multiple steering options.
63 63
64 4. Command line parameters 64 4. Command line parameters
65 a. tx_fifo_num 65 a. tx_fifo_num
66 Number of transmit queues 66 Number of transmit queues
67 Valid range: 1-8 67 Valid range: 1-8
68 Default: 1 68 Default: 1
69 69
70 b. rx_ring_num 70 b. rx_ring_num
71 Number of receive rings 71 Number of receive rings
72 Valid range: 1-8 72 Valid range: 1-8
73 Default: 1 73 Default: 1
74 74
75 c. tx_fifo_len 75 c. tx_fifo_len
76 Size of each transmit queue 76 Size of each transmit queue
77 Valid range: Total length of all queues should not exceed 8192 77 Valid range: Total length of all queues should not exceed 8192
78 Default: 4096 78 Default: 4096
79 79
80 d. rx_ring_sz 80 d. rx_ring_sz
81 Size of each receive ring(in 4K blocks) 81 Size of each receive ring(in 4K blocks)
82 Valid range: Limited by memory on system 82 Valid range: Limited by memory on system
83 Default: 30 83 Default: 30
84 84
85 e. intr_type 85 e. intr_type
86 Specifies interrupt type. Possible values 1(INTA), 2(MSI), 3(MSI-X) 86 Specifies interrupt type. Possible values 1(INTA), 2(MSI), 3(MSI-X)
87 Valid range: 1-3 87 Valid range: 1-3
88 Default: 1 88 Default: 1
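
As an illustration, the parameters above can be passed as name=value pairs
when the module is loaded. The sketch below checks three of them against
the documented ranges and prints the resulting insmod command instead of
running it. The helper names in_range and s2io_insmod_cmd are hypothetical:

```sh
# Hypothetical helpers: validate values against the ranges documented
# above, then print the insmod command (loading a module needs root).
in_range() {
    # $1 = value, $2 = minimum, $3 = maximum
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

s2io_insmod_cmd() {
    # $1 = tx_fifo_num, $2 = rx_ring_num, $3 = intr_type
    in_range "$1" 1 8 || { echo "tx_fifo_num must be 1-8" >&2; return 1; }
    in_range "$2" 1 8 || { echo "rx_ring_num must be 1-8" >&2; return 1; }
    in_range "$3" 1 3 || { echo "intr_type must be 1-3" >&2; return 1; }
    echo "insmod s2io.ko tx_fifo_num=$1 rx_ring_num=$2 intr_type=$3"
}

s2io_insmod_cmd 4 4 1
```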
89 89
90 5. Performance suggestions 90 5. Performance suggestions
91 General: 91 General:
92 a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration) 92 a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration)
93 b. Set TCP window size to optimal value. 93 b. Set TCP window size to optimal value.
94 For instance, for MTU=1500 a value of 210K has been observed to result in 94 For instance, for MTU=1500 a value of 210K has been observed to result in
95 good performance. 95 good performance.
96 # sysctl -w net.ipv4.tcp_rmem="210000 210000 210000" 96 # sysctl -w net.ipv4.tcp_rmem="210000 210000 210000"
97 # sysctl -w net.ipv4.tcp_wmem="210000 210000 210000" 97 # sysctl -w net.ipv4.tcp_wmem="210000 210000 210000"
98 For MTU=9000, TCP window size of 10 MB is recommended. 98 For MTU=9000, TCP window size of 10 MB is recommended.
99 # sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000" 99 # sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
100 # sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000" 100 # sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
101 101
102 Transmit performance: 102 Transmit performance:
103 a. By default, the driver respects BIOS settings for PCI bus parameters. 103 a. By default, the driver respects BIOS settings for PCI bus parameters.
104 However, you may want to experiment with PCI bus parameters 104 However, you may want to experiment with PCI bus parameters
105 max-split-transactions(MOST) and MMRBC (use setpci command). 105 max-split-transactions(MOST) and MMRBC (use setpci command).
106 A MOST value of 2 has been found optimal for Opterons and 3 for Itanium. 106 A MOST value of 2 has been found optimal for Opterons and 3 for Itanium.
107 It could be different for your hardware. 107 It could be different for your hardware.
108 Set MMRBC to 4K**. 108 Set MMRBC to 4K**.
109 109
110 For example you can set 110 For example you can set
111 For opteron 111 For opteron
112 #setpci -d 17d5:* 62=1d 112 #setpci -d 17d5:* 62=1d
113 For Itanium 113 For Itanium
114 #setpci -d 17d5:* 62=3d 114 #setpci -d 17d5:* 62=3d
115 115
116 For detailed description of the PCI registers, please see Xframe User Guide. 116 For detailed description of the PCI registers, please see Xframe User Guide.
117 117
118 b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this 118 b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this
119 parameter. 119 parameter.
120 c. Turn on TSO(using "ethtool -K") 120 c. Turn on TSO(using "ethtool -K")
121 # ethtool -K <ethX> tso on 121 # ethtool -K <ethX> tso on
122 122
123 Receive performance: 123 Receive performance:
124 a. By default, the driver respects BIOS settings for PCI bus parameters. 124 a. By default, the driver respects BIOS settings for PCI bus parameters.
125 However, you may want to set PCI latency timer to 248. 125 However, you may want to set PCI latency timer to 248.
126 #setpci -d 17d5:* LATENCY_TIMER=f8 126 #setpci -d 17d5:* LATENCY_TIMER=f8
127 For detailed description of the PCI registers, please see Xframe User Guide. 127 For detailed description of the PCI registers, please see Xframe User Guide.
128 b. Use 2-buffer mode. This results in large performance boost on 128 b. Use 2-buffer mode. This results in large performance boost on
129 on certain platforms(eg. SGI Altix, IBM xSeries). 129 certain platforms(eg. SGI Altix, IBM xSeries).
130 c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to 130 c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to
131 set/verify this option. 131 set/verify this option.
132 d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network 132 d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network
133 device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to 133 device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to
134 bring down CPU utilization. 134 bring down CPU utilization.
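
setpci takes register values in hexadecimal, which is why the latency
timer value of 248 in item a is written as f8. A quick way to convert
other decimal values (to_hex is a hypothetical helper name used only for
illustration):

```sh
# Convert a decimal register value to the hexadecimal form used on
# the setpci command line.
to_hex() {
    printf '%x\n' "$1"
}

to_hex 248
```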
135 135
136 ** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are 136 ** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are
137 recommended as safe parameters. 137 recommended as safe parameters.
138 For more information, please review the AMD8131 errata at 138 For more information, please review the AMD8131 errata at
139 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf 139 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf
140 140
141 6. Available Downloads 141 6. Available Downloads
142 The Neterion "s2io" driver in Red Hat and Suse 2.6-based distributions is kept 142 The Neterion "s2io" driver in Red Hat and Suse 2.6-based distributions is kept
143 up to date. The latest "s2io" code (including support for 2.4 kernels) is 143 up to date. The latest "s2io" code (including support for 2.4 kernels) is
144 also available via the "Support" link on the Neterion site: http://www.neterion.com. 144 also available via the "Support" link on the Neterion site: http://www.neterion.com.
145 145
146 For Xframe User Guide (Programming manual), visit ftp site ns1.s2io.com, 146 For Xframe User Guide (Programming manual), visit ftp site ns1.s2io.com,
147 user: linuxdocs password: HALdocs 147 user: linuxdocs password: HALdocs
148 148
149 7. Support 149 7. Support
150 For further support please contact either your 10GbE Xframe NIC vendor (IBM, 150 For further support please contact either your 10GbE Xframe NIC vendor (IBM,
151 HP, SGI etc.) or click on the "Support" link on the Neterion site: 151 HP, SGI etc.) or click on the "Support" link on the Neterion site:
152 http://www.neterion.com. 152 http://www.neterion.com.
153 153
154 154
Documentation/networking/sk98lin.txt
1 (C)Copyright 1999-2004 Marvell(R). 1 (C)Copyright 1999-2004 Marvell(R).
2 All rights reserved 2 All rights reserved
3 =========================================================================== 3 ===========================================================================
4 4
5 sk98lin.txt created 13-Feb-2004 5 sk98lin.txt created 13-Feb-2004
6 6
7 Readme File for sk98lin v6.23 7 Readme File for sk98lin v6.23
8 Marvell Yukon/SysKonnect SK-98xx Gigabit Ethernet Adapter family driver for LINUX 8 Marvell Yukon/SysKonnect SK-98xx Gigabit Ethernet Adapter family driver for LINUX
9 9
10 This file contains 10 This file contains
11 1 Overview 11 1 Overview
12 2 Required Files 12 2 Required Files
13 3 Installation 13 3 Installation
14 3.1 Driver Installation 14 3.1 Driver Installation
15 3.2 Inclusion of adapter at system start 15 3.2 Inclusion of adapter at system start
16 4 Driver Parameters 16 4 Driver Parameters
17 4.1 Per-Port Parameters 17 4.1 Per-Port Parameters
18 4.2 Adapter Parameters 18 4.2 Adapter Parameters
19 5 Large Frame Support 19 5 Large Frame Support
20 6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad) 20 6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad)
21 7 Troubleshooting 21 7 Troubleshooting
22 22
23 =========================================================================== 23 ===========================================================================
24 24
25 25
26 1 Overview 26 1 Overview
27 =========== 27 ===========
28 28
29 The sk98lin driver supports the Marvell Yukon and SysKonnect 29 The sk98lin driver supports the Marvell Yukon and SysKonnect
30 SK-98xx/SK-95xx compliant Gigabit Ethernet Adapter on Linux. It has 30 SK-98xx/SK-95xx compliant Gigabit Ethernet Adapter on Linux. It has
31 been tested with Linux on Intel/x86 machines. 31 been tested with Linux on Intel/x86 machines.
32 *** 32 ***
33 33
34 34
35 2 Required Files 35 2 Required Files
36 ================= 36 =================
37 37
38 The linux kernel source. 38 The linux kernel source.
39 No additional files required. 39 No additional files required.
40 *** 40 ***
41 41
42 42
43 3 Installation 43 3 Installation
44 =============== 44 ===============
45 45
46 It is recommended to download the latest version of the driver from the 46 It is recommended to download the latest version of the driver from the
47 SysKonnect web site www.syskonnect.com. If you have downloaded the latest 47 SysKonnect web site www.syskonnect.com. If you have downloaded the latest
48 driver, the Linux kernel has to be patched before the driver can be 48 driver, the Linux kernel has to be patched before the driver can be
49 installed. For details on how to patch a Linux kernel, refer to the 49 installed. For details on how to patch a Linux kernel, refer to the
50 patch.txt file. 50 patch.txt file.
51 51
52 3.1 Driver Installation 52 3.1 Driver Installation
53 ------------------------ 53 ------------------------
54 54
55 The following steps describe the actions that are required to install 55 The following steps describe the actions that are required to install
56 the driver and to start it manually. These steps should be carried 56 the driver and to start it manually. These steps should be carried
57 out for the initial driver setup. Once confirmed to be ok, they can 57 out for the initial driver setup. Once confirmed to be ok, they can
58 be included in the system start. 58 be included in the system start.
59 59
60 NOTE 1: To perform the following tasks you need 'root' access. 60 NOTE 1: To perform the following tasks you need 'root' access.
61 61
62 NOTE 2: In case of problems, please read the section "Troubleshooting" 62 NOTE 2: In case of problems, please read the section "Troubleshooting"
63 below. 63 below.
64 64
65 The driver can either be integrated into the kernel or it can be compiled 65 The driver can either be integrated into the kernel or it can be compiled
66 as a module. Select the appropriate option during the kernel 66 as a module. Select the appropriate option during the kernel
67 configuration. 67 configuration.
68 68
69 Compile/use the driver as a module 69 Compile/use the driver as a module
70 ---------------------------------- 70 ----------------------------------
71 To compile the driver, go to the directory /usr/src/linux and 71 To compile the driver, go to the directory /usr/src/linux and
72 execute the command "make menuconfig" or "make xconfig", then follow 72 execute the command "make menuconfig" or "make xconfig", then follow
73 one of the two procedures below. 73 one of the two procedures below.
74 74
75 To integrate the driver permanently into the kernel, proceed as follows: 75 To integrate the driver permanently into the kernel, proceed as follows:
76 76
77 1. Select the menu "Network device support" and then "Ethernet(1000Mbit)" 77 1. Select the menu "Network device support" and then "Ethernet(1000Mbit)"
78 2. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support" 78 2. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support"
79 with (*) 79 with (*)
80 3. Build a new kernel when the configuration of the above options is 80 3. Build a new kernel when the configuration of the above options is
81 finished. 81 finished.
82 4. Install the new kernel. 82 4. Install the new kernel.
83 5. Reboot your system. 83 5. Reboot your system.
84 84
85 To use the driver as a module, proceed as follows: 85 To use the driver as a module, proceed as follows:
86 86
87 1. Enable 'loadable module support' in the kernel. 87 1. Enable 'loadable module support' in the kernel.
88 2. For automatic driver start, enable the 'Kernel module loader'. 88 2. For automatic driver start, enable the 'Kernel module loader'.
89 3. Select the menu "Network device support" and then "Ethernet(1000Mbit)" 89 3. Select the menu "Network device support" and then "Ethernet(1000Mbit)"
90 4. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support" 90 4. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support"
91 with (M) 91 with (M)
92 5. Execute the command "make modules". 92 5. Execute the command "make modules".
93 6. Execute the command "make modules_install". 93 6. Execute the command "make modules_install".
94 The appropriate modules will be installed. 94 The appropriate modules will be installed.
95 7. Reboot your system. 95 7. Reboot your system.
96 96
97 97
98 Load the module manually 98 Load the module manually
99 ------------------------ 99 ------------------------
100 To load the module manually, proceed as follows: 100 To load the module manually, proceed as follows:
101 101
102 1. Enter "modprobe sk98lin". 102 1. Enter "modprobe sk98lin".
103 2. If a Marvell Yukon or SysKonnect SK-98xx adapter is installed in 103 2. If a Marvell Yukon or SysKonnect SK-98xx adapter is installed in
104 your computer and you have a /proc file system, execute the command: 104 your computer and you have a /proc file system, execute the command:
105 "ls /proc/net/sk98lin/" 105 "ls /proc/net/sk98lin/"
106 This should produce an output containing a line with the following 106 This should produce an output containing a line with the following
107 format: 107 format:
108 eth0 eth1 ... 108 eth0 eth1 ...
109 which indicates that your adapter has been found and initialized. 109 which indicates that your adapter has been found and initialized.
110 110
111 NOTE 1: If you have more than one Marvell Yukon or SysKonnect SK-98xx 111 NOTE 1: If you have more than one Marvell Yukon or SysKonnect SK-98xx
112 adapter installed, the adapters will be listed as 'eth0', 112 adapter installed, the adapters will be listed as 'eth0',
113 'eth1', 'eth2', etc. 113 'eth1', 'eth2', etc.
114 For each adapter, repeat steps 3 and 4 below. 114 For each adapter, repeat steps 3 and 4 below.
115 115
116 NOTE 2: If you have other Ethernet adapters installed, your Marvell 116 NOTE 2: If you have other Ethernet adapters installed, your Marvell
117 Yukon or SysKonnect SK-98xx adapter will be mapped to the 117 Yukon or SysKonnect SK-98xx adapter will be mapped to the
118 next available number, e.g. 'eth1'. The mapping is executed 118 next available number, e.g. 'eth1'. The mapping is executed
119 automatically. 119 automatically.
120 The module installation message (displayed either in a system 120 The module installation message (displayed either in a system
121 log file or on the console) prints a line for each adapter 121 log file or on the console) prints a line for each adapter
122 found containing the corresponding 'ethX'. 122 found containing the corresponding 'ethX'.
123 123
124 3. Select an IP address and assign it to the respective adapter by 124 3. Select an IP address and assign it to the respective adapter by
125 entering: 125 entering:
126 ifconfig eth0 <ip-address> 126 ifconfig eth0 <ip-address>
127 With this command, the adapter is connected to the Ethernet. 127 With this command, the adapter is connected to the Ethernet.
128 128
129 SK-98xx Gigabit Ethernet Server Adapters: The yellow LED on the adapter 129 SK-98xx Gigabit Ethernet Server Adapters: The yellow LED on the adapter
130 is now active, the link status LED of the primary port is active and 130 is now active, the link status LED of the primary port is active and
131 the link status LED of the secondary port (on dual port adapters) is 131 the link status LED of the secondary port (on dual port adapters) is
132 blinking (if the ports are connected to a switch or hub). 132 blinking (if the ports are connected to a switch or hub).
133 SK-98xx V2.0 Gigabit Ethernet Adapters: The link status LED is active. 133 SK-98xx V2.0 Gigabit Ethernet Adapters: The link status LED is active.
134 In addition, you will receive a status message on the console stating 134 In addition, you will receive a status message on the console stating
135 "ethX: network connection up using port Y" and showing the selected 135 "ethX: network connection up using port Y" and showing the selected
136 connection parameters (X stands for the Ethernet device number 136 connection parameters (X stands for the Ethernet device number
137 (0, 1, 2, etc.), Y stands for the port name (A or B)). 137 (0, 1, 2, etc.), Y stands for the port name (A or B)).
138 138
139 NOTE: If you are in doubt about IP addresses, ask your network 139 NOTE: If you are in doubt about IP addresses, ask your network
140 administrator for assistance. 140 administrator for assistance.
141 141
142 4. Your adapter should now be fully operational. 142 4. Your adapter should now be fully operational.
143 Use 'ping <otherstation>' to verify the connection to other computers 143 Use 'ping <otherstation>' to verify the connection to other computers
144 on your network. 144 on your network.
145 5. To check the adapter configuration, view /proc/net/sk98lin/[devicename]. 145 5. To check the adapter configuration, view /proc/net/sk98lin/[devicename].
146 For example by executing: 146 For example by executing:
147 "cat /proc/net/sk98lin/eth0" 147 "cat /proc/net/sk98lin/eth0"
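The check in steps 2 and 5 can be scripted. The following is a minimal sketch, not part of the driver: the helper name sk98lin_ifaces is hypothetical, and the only thing taken from the text above is the location /proc/net/sk98lin. The optional directory argument exists only so the helper can be tried out without the driver loaded.

```shell
# Hypothetical helper: print the names of the interfaces that the driver
# has registered, one per line. The directory defaults to /proc/net/sk98lin
# as described above; another directory may be passed for experiments.
sk98lin_ifaces() {
    dir="${1:-/proc/net/sk98lin}"
    # Without the directory the driver is not loaded (or /proc is missing).
    [ -d "$dir" ] || return 1
    for entry in "$dir"/*; do
        # Skip the unexpanded pattern if the directory is empty.
        [ -e "$entry" ] || continue
        basename "$entry"
    done
}

# Usage (as root, after "modprobe sk98lin"):
#   for ifc in $(sk98lin_ifaces); do cat "/proc/net/sk98lin/$ifc"; done
```

If the helper returns a non-zero status, the directory does not exist, which usually means the module has not been loaded.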
148 148
149 Unload the module 149 Unload the module
150 ----------------- 150 -----------------
151 To stop and unload the driver modules, proceed as follows: 151 To stop and unload the driver modules, proceed as follows:
152 152
153 1. Execute the command "ifconfig eth0 down". 153 1. Execute the command "ifconfig eth0 down".
154 2. Execute the command "rmmod sk98lin". 154 2. Execute the command "rmmod sk98lin".
155 155
156 3.2 Inclusion of adapter at system start 156 3.2 Inclusion of adapter at system start
157 ----------------------------------------- 157 -----------------------------------------
158 158
159 Since a large number of different Linux distributions are 159 Since a large number of different Linux distributions are
160 available, we are unable to describe a general installation procedure 160 available, we are unable to describe a general installation procedure
161 for the driver module. 161 for the driver module.
162 Because the driver is now integrated in the kernel, installation should 162 Because the driver is now integrated in the kernel, installation should
163 be easy, using the standard mechanism of your distribution. 163 be easy, using the standard mechanism of your distribution.
164 Refer to the distribution's manual for installation of ethernet adapters. 164 Refer to the distribution's manual for installation of ethernet adapters.
165 165
166 *** 166 ***
167 167
168 4 Driver Parameters 168 4 Driver Parameters
169 ==================== 169 ====================
170 170
171 Parameters can be set at the command line after the module has been 171 Parameters can be set at the command line after the module has been
172 loaded with the command 'modprobe'. 172 loaded with the command 'modprobe'.
173 In some distributions, the configuration tools are able to pass parameters 173 In some distributions, the configuration tools are able to pass parameters
174 to the driver module. 174 to the driver module.
175 175
176 If you use the kernel module loader, you can set driver parameters 176 If you use the kernel module loader, you can set driver parameters
177 in the file /etc/modprobe.conf (or /etc/modules.conf in 2.4 or earlier). 177 in the file /etc/modprobe.conf (or /etc/modules.conf in 2.4 or earlier).
178 To set the driver parameters in this file, proceed as follows: 178 To set the driver parameters in this file, proceed as follows:
179 179
180 1. Insert a line of the form: 180 1. Insert a line of the form:
181 options sk98lin ... 181 options sk98lin ...
182 For "...", the same syntax is required as described for the command 182 For "...", the same syntax is required as described for the command
183 line parameters of modprobe below. 183 line parameters of modprobe below.
184 2. To activate the new parameters, either reboot your computer 184 2. To activate the new parameters, either reboot your computer
185 or 185 or
186 unload and reload the driver. 186 unload and reload the driver.
187 The syntax of the driver parameters is: 187 The syntax of the driver parameters is:
188 188
189 modprobe sk98lin parameter=value1[,value2[,value3...]] 189 modprobe sk98lin parameter=value1[,value2[,value3...]]
190 190
191 where value1 refers to the first adapter, value2 to the second etc. 191 where value1 refers to the first adapter, value2 to the second etc.
192 192
193 NOTE: All parameters are case sensitive. Write them exactly as shown 193 NOTE: All parameters are case sensitive. Write them exactly as shown
194 below. 194 below.
195 195
196 Example: 196 Example:
197 Suppose you have two adapters. You want to set auto-negotiation 197 Suppose you have two adapters. You want to set auto-negotiation
198 on the first adapter to ON and on the second adapter to OFF. 198 on the first adapter to ON and on the second adapter to OFF.
199 You also want to set DuplexCapabilities on the first adapter 199 You also want to set DuplexCapabilities on the first adapter
200 to FULL, and on the second adapter to HALF. 200 to FULL, and on the second adapter to HALF.
201 Then, you must enter: 201 Then, you must enter:
202 202
203 modprobe sk98lin AutoNeg_A=On,Off DupCap_A=Full,Half 203 modprobe sk98lin AutoNeg_A=On,Off DupCap_A=Full,Half
204 204
205 NOTE: The number of adapters that can be configured this way is 205 NOTE: The number of adapters that can be configured this way is
206 limited in the driver (file skge.c, constant SK_MAX_CARD_PARAM). 206 limited in the driver (file skge.c, constant SK_MAX_CARD_PARAM).
207 The current limit is 16. If you happen to install 207 The current limit is 16. If you happen to install
208 more adapters, adjust this and recompile. 208 more adapters, adjust this and recompile.
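The per-adapter value lists can be assembled mechanically. The sketch below is only an illustration: the helper name sk98lin_param is hypothetical, while the parameter names and values are the ones from the example above. The final line only prints the command, it does not load the module.

```shell
# Hypothetical helper: join per-adapter values into "parameter=v1,v2,...".
# The first argument is the parameter name; the remaining arguments are the
# values for the first, second, ... adapter, in that order.
sk98lin_param() {
    name="$1"
    shift
    values=""
    for v in "$@"; do
        # Put a comma in front of every value except the first.
        values="${values:+$values,}$v"
    done
    printf '%s=%s\n' "$name" "$values"
}

# Rebuilds the example above (printed, not executed).
echo "modprobe sk98lin $(sk98lin_param AutoNeg_A On Off) $(sk98lin_param DupCap_A Full Half)"
```

The same parameter strings can follow "options sk98lin" in /etc/modprobe.conf, since that file uses the same syntax.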
209 209
210 210
211 4.1 Per-Port Parameters 211 4.1 Per-Port Parameters
212 ------------------------ 212 ------------------------
213 213
214 These settings are available for each port on the adapter. 214 These settings are available for each port on the adapter.
215 In the following description, '?' stands for the port for 215 In the following description, '?' stands for the port for
216 which you set the parameter (A or B). 216 which you set the parameter (A or B).
217 217
218 Speed 218 Speed
219 ----- 219 -----
220 Parameter: Speed_? 220 Parameter: Speed_?
221 Values: 10, 100, 1000, Auto 221 Values: 10, 100, 1000, Auto
222 Default: Auto 222 Default: Auto
223 223
224 This parameter is used to set the speed capabilities. It is only valid 224 This parameter is used to set the speed capabilities. It is only valid
225 for the SK-98xx V2.0 copper adapters. 225 for the SK-98xx V2.0 copper adapters.
226 Usually, the speed is negotiated between the two ports during link 226 Usually, the speed is negotiated between the two ports during link
227 establishment. If this fails, a port can be forced to a specific setting 227 establishment. If this fails, a port can be forced to a specific setting
228 with this parameter. 228 with this parameter.
229 229
230 Auto-Negotiation 230 Auto-Negotiation
231 ---------------- 231 ----------------
232 Parameter: AutoNeg_? 232 Parameter: AutoNeg_?
233 Values: On, Off, Sense 233 Values: On, Off, Sense
234 Default: On 234 Default: On
235 235
236 The "Sense"-mode automatically detects whether the link partner supports 236 The "Sense"-mode automatically detects whether the link partner supports
237 auto-negotiation or not. 237 auto-negotiation or not.
238 238
239 Duplex Capabilities 239 Duplex Capabilities
240 ------------------- 240 -------------------
241 Parameter: DupCap_? 241 Parameter: DupCap_?
242 Values: Half, Full, Both 242 Values: Half, Full, Both
243 Default: Both 243 Default: Both
244 244
245 This parameter is only relevant if auto-negotiation for this port is 245 This parameter is only relevant if auto-negotiation for this port is
246 not set to "Sense". If auto-negotiation is set to "On", all three values 246 not set to "Sense". If auto-negotiation is set to "On", all three values
247 are possible. If it is set to "Off", only "Full" and "Half" are allowed. 247 are possible. If it is set to "Off", only "Full" and "Half" are allowed.
248 This parameter is useful if your link partner does not support all 248 This parameter is useful if your link partner does not support all
249 possible combinations. 249 possible combinations.
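The dependency between AutoNeg_? and DupCap_? can be written down as a small check. This is a sketch under the rules stated above only; the function name dupcap_allowed and its return codes are assumptions made for illustration, and the driver itself may handle invalid combinations differently.

```shell
# Hypothetical check: is a DupCap_? value usable with a given AutoNeg_? value?
# Return codes (an assumption of this sketch):
#   0 = combination allowed
#   1 = combination not allowed
#   2 = AutoNeg is "Sense", so DupCap is not relevant
dupcap_allowed() {
    autoneg="$1"
    dupcap="$2"
    case "$autoneg:$dupcap" in
        On:Half|On:Full|On:Both) return 0 ;;  # all three values possible
        Off:Half|Off:Full)       return 0 ;;  # "Both" is excluded when Off
        Sense:*)                 return 2 ;;  # parameter not relevant
        *)                       return 1 ;;
    esac
}

# Usage:
#   dupcap_allowed Off Both || echo "DupCap=Both needs AutoNeg=On"
```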
250 250
251 Flow Control 251 Flow Control
252 ------------ 252 ------------
253 Parameter: FlowCtrl_? 253 Parameter: FlowCtrl_?
254 Values: Sym, SymOrRem, LocSend, None 254 Values: Sym, SymOrRem, LocSend, None
255 Default: SymOrRem 255 Default: SymOrRem
256 256
257 This parameter can be used to set the flow control capabilities the 257 This parameter can be used to set the flow control capabilities the
258 port reports during auto-negotiation. It can be set for each port 258 port reports during auto-negotiation. It can be set for each port
259 individually. 259 individually.
260 Possible modes: 260 Possible modes:
261 -- Sym = Symmetric: both link partners are allowed to send 261 -- Sym = Symmetric: both link partners are allowed to send
262 PAUSE frames 262 PAUSE frames
263 -- SymOrRem = SymmetricOrRemote: both or only remote partner 263 -- SymOrRem = SymmetricOrRemote: both or only remote partner
264 are allowed to send PAUSE frames 264 are allowed to send PAUSE frames
265 -- LocSend = LocalSend: only local link partner is allowed 265 -- LocSend = LocalSend: only local link partner is allowed
266 to send PAUSE frames 266 to send PAUSE frames
267 -- None = no link partner is allowed to send PAUSE frames 267 -- None = no link partner is allowed to send PAUSE frames
268 268
269 NOTE: This parameter is ignored if auto-negotiation is set to "Off". 269 NOTE: This parameter is ignored if auto-negotiation is set to "Off".
270 270
271 Role in Master-Slave-Negotiation (1000Base-T only) 271 Role in Master-Slave-Negotiation (1000Base-T only)
272 -------------------------------------------------- 272 --------------------------------------------------
273 Parameter: Role_? 273 Parameter: Role_?
274 Values: Auto, Master, Slave 274 Values: Auto, Master, Slave
275 Default: Auto 275 Default: Auto
276 276
277 This parameter is only valid for the SK-9821 and SK-9822 adapters. 277 This parameter is only valid for the SK-9821 and SK-9822 adapters.
278 For two 1000Base-T ports to communicate, one must take the role of the 278 For two 1000Base-T ports to communicate, one must take the role of the
279 master (providing timing information), while the other must be the 279 master (providing timing information), while the other must be the
280 slave. Usually, this is negotiated between the two ports during link 280 slave. Usually, this is negotiated between the two ports during link
281 establishment. If this fails, a port can be forced to a specific setting 281 establishment. If this fails, a port can be forced to a specific setting
282 with this parameter. 282 with this parameter.
283 283
284 284
285 4.2 Adapter Parameters 285 4.2 Adapter Parameters
286 ----------------------- 286 -----------------------
287 287
288 Connection Type (SK-98xx V2.0 copper adapters only) 288 Connection Type (SK-98xx V2.0 copper adapters only)
289 --------------- 289 ---------------
290 Parameter: ConType 290 Parameter: ConType
291 Values: Auto, 100FD, 100HD, 10FD, 10HD 291 Values: Auto, 100FD, 100HD, 10FD, 10HD
292 Default: Auto 292 Default: Auto
293 293
294 The parameter 'ConType' is a combination of all five per-port parameters 294 The parameter 'ConType' is a combination of all five per-port parameters
295 within one single parameter. This simplifies the configuration of both ports 295 within one single parameter. This simplifies the configuration of both ports
296 of an adapter card! The different values of this variable reflect the most 296 of an adapter card! The different values of this variable reflect the most
297 meaningful combinations of port parameters. 297 meaningful combinations of port parameters.
298 298
299 The following table shows the values of 'ConType' and the corresponding 299 The following table shows the values of 'ConType' and the corresponding
300 combinations of the per-port parameters: 300 combinations of the per-port parameters:
301 301
302 ConType | DupCap AutoNeg FlowCtrl Role Speed 302 ConType | DupCap AutoNeg FlowCtrl Role Speed
303 ----------+------------------------------------------------------ 303 ----------+------------------------------------------------------
304 Auto | Both On SymOrRem Auto Auto 304 Auto | Both On SymOrRem Auto Auto
305 100FD | Full Off None Auto (ignored) 100 305 100FD | Full Off None Auto (ignored) 100
306 100HD | Half Off None Auto (ignored) 100 306 100HD | Half Off None Auto (ignored) 100
307 10FD | Full Off None Auto (ignored) 10 307 10FD | Full Off None Auto (ignored) 10
308 10HD | Half Off None Auto (ignored) 10 308 10HD | Half Off None Auto (ignored) 10
309 309
310 Stating any other port parameter together with this 'ConType' variable 310 Stating any other port parameter together with this 'ConType' variable
311 will result in a merged configuration of those settings. This is 311 will result in a merged configuration of those settings. This is
312 because the per-port parameters (e.g. Speed_?) have a higher 312 because the per-port parameters (e.g. Speed_?) have a higher
313 priority than the combined variable 'ConType'. 313 priority than the combined variable 'ConType'.
314 314
315 NOTE: This parameter is always used on both ports of the adapter card. 315 NOTE: This parameter is always used on both ports of the adapter card.
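The table above can be expressed as a lookup. The following sketch simply transcribes the table; the function name contype_expand and the "Name=Value" output format are hypothetical and are not modprobe syntax (real per-port parameters carry a port suffix such as _A or _B).

```shell
# Hypothetical lookup: print the per-port settings implied by a ConType value,
# transcribed from the table above. For the fixed-speed rows the Role value
# is listed as "Auto (ignored)" in the table.
contype_expand() {
    case "$1" in
        Auto)  echo "DupCap=Both AutoNeg=On FlowCtrl=SymOrRem Role=Auto Speed=Auto" ;;
        100FD) echo "DupCap=Full AutoNeg=Off FlowCtrl=None Role=Auto Speed=100" ;;
        100HD) echo "DupCap=Half AutoNeg=Off FlowCtrl=None Role=Auto Speed=100" ;;
        10FD)  echo "DupCap=Full AutoNeg=Off FlowCtrl=None Role=Auto Speed=10" ;;
        10HD)  echo "DupCap=Half AutoNeg=Off FlowCtrl=None Role=Auto Speed=10" ;;
        *)     echo "unknown ConType: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   contype_expand 100FD
```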
316 316
317 Interrupt Moderation 317 Interrupt Moderation
318 -------------------- 318 --------------------
319 Parameter: Moderation 319 Parameter: Moderation
320 Values: None, Static, Dynamic 320 Values: None, Static, Dynamic
321 Default: None 321 Default: None
322 322
323 Interrupt moderation is employed to limit the maximum number of interrupts 323 Interrupt moderation is employed to limit the maximum number of interrupts
324 the driver has to serve. That is, one or more interrupts (which indicate any 324 the driver has to serve. That is, one or more interrupts (which indicate any
325 transmit or receive packet to be processed) are queued until the driver 325 transmit or receive packet to be processed) are queued until the driver
326 processes them. The time at which queued interrupts are served is determined by 326 processes them. The time at which queued interrupts are served is determined by
327 the 'IntsPerSec' parameter, which is explained below. 327 the 'IntsPerSec' parameter, which is explained below.
328 328
329 Possible modes: 329 Possible modes:
330 330
331 -- None - No interrupt moderation is applied on the adapter card. 331 -- None - No interrupt moderation is applied on the adapter card.
332 Therefore, each transmit or receive interrupt is served immediately 332 Therefore, each transmit or receive interrupt is served immediately
333 as soon as it appears on the interrupt line of the adapter card. 333 as soon as it appears on the interrupt line of the adapter card.
334 334
335 -- Static - Interrupt moderation is applied on the adapter card. 335 -- Static - Interrupt moderation is applied on the adapter card.
336 All transmit and receive interrupts are queued until a complete 336 All transmit and receive interrupts are queued until a complete
337 moderation interval ends. If such a moderation interval ends, all 337 moderation interval ends. If such a moderation interval ends, all
338 queued interrupts are processed in one big bunch without any delay. 338 queued interrupts are processed in one big bunch without any delay.
339 The term 'static' reflects the fact that interrupt moderation is 339 The term 'static' reflects the fact that interrupt moderation is
340 always enabled, regardless of how much network load is currently 340 always enabled, regardless of how much network load is currently
341 passing via a particular interface. In addition, the duration of 341 passing via a particular interface. In addition, the duration of
342 the moderation interval has a fixed length that never changes while 342 the moderation interval has a fixed length that never changes while
343 the driver is operational. 343 the driver is operational.
344 344
345 -- Dynamic - Interrupt moderation might be applied on the adapter card, 345 -- Dynamic - Interrupt moderation might be applied on the adapter card,
346 depending on the load of the system. If the driver detects that the 346 depending on the load of the system. If the driver detects that the
347 system load is too high, the driver tries to shield the system against 347 system load is too high, the driver tries to shield the system against
348 too much network load by enabling interrupt moderation. If - at a later 348 too much network load by enabling interrupt moderation. If - at a later
349 time - the CPU utilization decreases again (or if the network load is 349 time - the CPU utilization decreases again (or if the network load is
350 negligible) the interrupt moderation will automatically be disabled. 350 negligible) the interrupt moderation will automatically be disabled.
351 351
352 Interrupt moderation should be used when the driver has to handle one or more 352 Interrupt moderation should be used when the driver has to handle one or more
353 interfaces with a high network load, which - as a consequence - also leads to a 353 interfaces with a high network load, which - as a consequence - also leads to a
354 high CPU utilization. When moderation is applied in such high network load 354 high CPU utilization. When moderation is applied in such high network load
355 situations, CPU load might be reduced by 20-30%. 355 situations, CPU load might be reduced by 20-30%.
356 356
357 NOTE: The drawback of using interrupt moderation is an increase of the round- 357 NOTE: The drawback of using interrupt moderation is an increase of the round-
358 trip-time (RTT), due to the queueing and serving of interrupts at dedicated 358 trip-time (RTT), due to the queueing and serving of interrupts at dedicated
359 moderation times. 359 moderation times.
360 360
361 Interrupts per second 361 Interrupts per second
362 --------------------- 362 ---------------------
363 Parameter: IntsPerSec 363 Parameter: IntsPerSec
364 Values: 30...40000 (interrupts per second) 364 Values: 30...40000 (interrupts per second)
365 Default: 2000 365 Default: 2000
366 366
367 This parameter is only used if either static or dynamic interrupt moderation 367 This parameter is only used if either static or dynamic interrupt moderation
368 is used on a network adapter card. Setting this parameter when no moderation 368 is used on a network adapter card. Setting this parameter when no moderation
369 is applied has no effect. 369 is applied has no effect.
370 370
371 This parameter determines the length of any interrupt moderation interval. 371 This parameter determines the length of any interrupt moderation interval.
372 Assuming that static interrupt moderation is to be used, an 'IntsPerSec' 372 Assuming that static interrupt moderation is to be used, an 'IntsPerSec'
373 parameter value of 2000 will lead to an interrupt moderation interval of 373 parameter value of 2000 will lead to an interrupt moderation interval of
374 500 microseconds. 374 500 microseconds.
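The interval follows directly from the parameter: one second (1000000 microseconds) divided by 'IntsPerSec'. The sketch below shows the arithmetic; the function name moderation_interval_us is hypothetical, the range check uses the 30...40000 limits given above, and the shell's integer division truncates the result.

```shell
# Hypothetical helper: moderation interval in microseconds for a given
# IntsPerSec value. Rejects values outside the documented range 30...40000.
moderation_interval_us() {
    ips="$1"
    case "$ips" in
        # Reject empty input, leading zeros and anything that is not a digit.
        ''|0*|*[!0-9]*) return 1 ;;
    esac
    [ "$ips" -ge 30 ] && [ "$ips" -le 40000 ] || return 1
    # Integer division: the result is truncated to whole microseconds.
    echo $((1000000 / ips))
}

# Usage:
#   moderation_interval_us 2000    # the default value; prints 500
```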
375 375
376 NOTE: The duration of the moderation interval is to be chosen with care. 376 NOTE: The duration of the moderation interval is to be chosen with care.
377 At first glance, selecting a very long duration (e.g. only 100 interrupts per 377 At first glance, selecting a very long duration (e.g. only 100 interrupts per
378 second) seems sensible, but the increase in packet-processing delay 378 second) seems sensible, but the increase in packet-processing delay
379 is tremendous. On the other hand, selecting a very short moderation time might 379 is tremendous. On the other hand, selecting a very short moderation time might
380 cancel out the benefit of applying moderation at all. 380 cancel out the benefit of applying moderation at all.
381 381
382 382
383 Preferred Port 383 Preferred Port
384 -------------- 384 --------------
385 Parameter: PrefPort 385 Parameter: PrefPort
386 Values: A, B 386 Values: A, B
387 Default: A 387 Default: A
388 388
389 This is used to force the preferred port to A or B (on dual-port network 389 This is used to force the preferred port to A or B (on dual-port network
390 adapters). The preferred port is the one that is used if both are detected 390 adapters). The preferred port is the one that is used if both are detected
391 as fully functional. 391 as fully functional.
392 392
393 RLMT Mode (Redundant Link Management Technology) 393 RLMT Mode (Redundant Link Management Technology)
394 ------------------------------------------------ 394 ------------------------------------------------
395 Parameter: RlmtMode 395 Parameter: RlmtMode
396 Values: CheckLinkState, CheckLocalPort, CheckSeg, DualNet 396 Values: CheckLinkState, CheckLocalPort, CheckSeg, DualNet
397 Default: CheckLinkState 397 Default: CheckLinkState
398 398
399 RLMT monitors the status of the port. If the link of the active port 399 RLMT monitors the status of the port. If the link of the active port
400 fails, RLMT switches immediately to the standby link. The virtual link is 400 fails, RLMT switches immediately to the standby link. The virtual link is
401 maintained as long as at least one 'physical' link is up. 401 maintained as long as at least one 'physical' link is up.
402 402
403 Possible modes: 403 Possible modes:
404 404
405 -- CheckLinkState - Check link state only: RLMT uses the link state 405 -- CheckLinkState - Check link state only: RLMT uses the link state
406 reported by the adapter hardware for each individual port to 406 reported by the adapter hardware for each individual port to
407 determine whether a port can be used for all network traffic or 407 determine whether a port can be used for all network traffic or
408 not. 408 not.
409 409
410 -- CheckLocalPort - In this mode, RLMT monitors the network path 410 -- CheckLocalPort - In this mode, RLMT monitors the network path
411 between the two ports of an adapter by regularly exchanging packets 411 between the two ports of an adapter by regularly exchanging packets
412 between them. This mode requires a network configuration in which 412 between them. This mode requires a network configuration in which
413 the two ports are able to "see" each other (i.e. there must not be 413 the two ports are able to "see" each other (i.e. there must not be
414 any router between the ports). 414 any router between the ports).
415 415
416 -- CheckSeg - Check local port and segmentation: This mode supports the 416 -- CheckSeg - Check local port and segmentation: This mode supports the
417 same functions as the CheckLocalPort mode and additionally checks 417 same functions as the CheckLocalPort mode and additionally checks
418 network segmentation between the ports. Therefore, this mode is only 418 network segmentation between the ports. Therefore, this mode is only
419 to be used if Gigabit Ethernet switches are installed on the network 419 to be used if Gigabit Ethernet switches are installed on the network
420 that have been configured to use the Spanning Tree protocol. 420 that have been configured to use the Spanning Tree protocol.
421 421
422 -- DualNet - In this mode, ports A and B are used as separate devices. 422 -- DualNet - In this mode, ports A and B are used as separate devices.
423 If you have a dual port adapter, port A will be configured as eth0 423 If you have a dual port adapter, port A will be configured as eth0
424 and port B as eth1. Both ports can be used independently with 424 and port B as eth1. Both ports can be used independently with
425 distinct IP addresses. The preferred port setting is not used. 425 distinct IP addresses. The preferred port setting is not used.
426 RLMT is turned off. 426 RLMT is turned off.
427 427
428 NOTE: RLMT modes CheckLocalPort (CLP) and CheckSeg (CLPSS) are designed to operate in configurations 428 NOTE: RLMT modes CheckLocalPort (CLP) and CheckSeg (CLPSS) are designed to operate in configurations
429 where a network path between the ports on one adapter exists. 429 where a network path between the ports on one adapter exists.
430 Moreover, they are not designed to work where adapters are connected 430 Moreover, they are not designed to work where adapters are connected
431 back-to-back. 431 back-to-back.
432 *** 432 ***
433 433
434 434
435 5 Large Frame Support 435 5 Large Frame Support
436 ====================== 436 ======================
437 437
438 The driver supports large frames (also called jumbo frames). Using large 438 The driver supports large frames (also called jumbo frames). Using large
439 frames can result in an improved throughput if transferring large amounts 439 frames can result in an improved throughput if transferring large amounts
440 of data. 440 of data.
441 To enable large frames, set the MTU (maximum transmission unit) of the 441 To enable large frames, set the MTU (maximum transmission unit) of the
442 interface to the desired value (up to 9000) by executing the following 442 interface to the desired value (up to 9000) by executing the following
443 command: 443 command:
444 ifconfig eth0 mtu 9000 444 ifconfig eth0 mtu 9000
445 This will only work if you have two adapters connected back-to-back 445 This will only work if you have two adapters connected back-to-back
446 or if you use a switch that supports large frames. When using a switch, 446 or if you use a switch that supports large frames. When using a switch,
447 it should be configured to allow large frames and auto-negotiation should 447 it should be configured to allow large frames and auto-negotiation should
448 be set to OFF. The setting must be configured on all adapters that can be 448 be set to OFF. The setting must be configured on all adapters that can be
449 reached by the large frames. If one adapter is not set to receive large 449 reached by the large frames. If one adapter is not set to receive large
450 frames, it will simply drop them. 450 frames, it will simply drop them.
451 451
452 You can switch back to the standard ethernet frame size by executing the 452 You can switch back to the standard ethernet frame size by executing the
453 following command: 453 following command:
454 ifconfig eth0 mtu 1500 454 ifconfig eth0 mtu 1500
455 455
456 To permanently configure this setting, add a script with the 'ifconfig' 456 To permanently configure this setting, add a script with the 'ifconfig'
457 line to the system startup sequence (named something like "S99sk98lin" 457 line to the system startup sequence (named something like "S99sk98lin"
458 in /etc/rc.d/rc2.d). 458 in /etc/rc.d/rc2.d).
459 *** 459 ***
460 460
461 461
462 6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad) 462 6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad)
463 ================================================================== 463 ==================================================================
464 464
465 The Marvell Yukon/SysKonnect Linux drivers are able to support VLAN and 465 The Marvell Yukon/SysKonnect Linux drivers are able to support VLAN and
466 Link Aggregation according to IEEE standards 802.1, 802.1q, and 802.3ad. 466 Link Aggregation according to IEEE standards 802.1, 802.1q, and 802.3ad.
467 These features are only available after installation of open source 467 These features are only available after installation of open source
468 modules available on the Internet: 468 modules available on the Internet:
469 For VLAN go to: http://www.candelatech.com/~greear/vlan.html 469 For VLAN go to: http://www.candelatech.com/~greear/vlan.html
470 For Link Aggregation go to: http://www.st.rim.or.jp/~yumo 470 For Link Aggregation go to: http://www.st.rim.or.jp/~yumo
471 471
472 NOTE: SysKonnect GmbH does not offer any support for these open source 472 NOTE: SysKonnect GmbH does not offer any support for these open source
473 modules and does not take the responsibility for any kind of 473 modules and does not take the responsibility for any kind of
474 failures or problems arising in connection with these modules. 474 failures or problems arising in connection with these modules.
475 475
476 NOTE: Configuring Link Aggregation on a SysKonnect dual link adapter may 476 NOTE: Configuring Link Aggregation on a SysKonnect dual link adapter may
477 cause problems when unloading the driver. 477 cause problems when unloading the driver.
478 478
479 479
480 7 Troubleshooting 480 7 Troubleshooting
481 ================== 481 ==================
482 482
483 If any problems occur during the installation process, check the 483 If any problems occur during the installation process, check the
484 following list: 484 following list:
485 485
486 486
487 Problem: The SK-98xx adapter cannot be found by the driver. 487 Problem: The SK-98xx adapter cannot be found by the driver.
488 Solution: In /proc/pci search for the following entry: 488 Solution: In /proc/pci search for the following entry:
489 'Ethernet controller: SysKonnect SK-98xx ...' 489 'Ethernet controller: SysKonnect SK-98xx ...'
490 If this entry exists, the SK-98xx or SK-98xx V2.0 adapter has 490 If this entry exists, the SK-98xx or SK-98xx V2.0 adapter has
491 been found by the system and should be operational. 491 been found by the system and should be operational.
492 If this entry does not exist or if the file '/proc/pci' is not 492 If this entry does not exist or if the file '/proc/pci' is not
493 found, there may be a hardware problem or the PCI support may 493 found, there may be a hardware problem or the PCI support may
494 not be enabled in your kernel. 494 not be enabled in your kernel.
495 The adapter can be checked using the diagnostics program which 495 The adapter can be checked using the diagnostics program which
496 is available on the SysKonnect web site: 496 is available on the SysKonnect web site:
497 www.syskonnect.com 497 www.syskonnect.com
498 498
499 Some COMPAQ machines have problems dealing with PCI under Linux. 499 Some COMPAQ machines have problems dealing with PCI under Linux.
500 Linux. This problem is described in the 'PCI howto' document 500 This problem is described in the 'PCI howto' document
501 (included in some distributions or available from the 501 (included in some distributions or available from the
502 web, e.g. at 'www.linux.org'). 502 web, e.g. at 'www.linux.org').
503 503
504 504
505 Problem: Programs such as 'ifconfig' or 'route' cannot be found or the 505 Problem: Programs such as 'ifconfig' or 'route' cannot be found or the
506 error message 'Operation not permitted' is displayed. 506 error message 'Operation not permitted' is displayed.
507 Reason: You are not logged in as user 'root'. 507 Reason: You are not logged in as user 'root'.
508 Solution: Log out and log in as 'root' or change to 'root' via 'su'. 508 Solution: Log out and log in as 'root' or change to 'root' via 'su'.
509 509
510 510
511 Problem: Upon use of the command 'ping <address>' the message 511 Problem: Upon use of the command 'ping <address>' the message
512 "ping: sendto: Network is unreachable" is displayed. 512 "ping: sendto: Network is unreachable" is displayed.
513 Reason: Your route is not set correctly. 513 Reason: Your route is not set correctly.
514 Solution: If you are using RedHat, you probably forgot to set up the 514 Solution: If you are using RedHat, you probably forgot to set up the
515 route in the 'network configuration'. 515 route in the 'network configuration'.
516 Check the existing routes with the 'route' command and check 516 Check the existing routes with the 'route' command and check
517 if an entry for 'eth0' exists, and if so, if it is set correctly. 517 if an entry for 'eth0' exists, and if so, if it is set correctly.
518 518
519 519
520 Problem: The driver can be started, the adapter is connected to the 520 Problem: The driver can be started, the adapter is connected to the
521 network, but you cannot receive or transmit any packets; 521 network, but you cannot receive or transmit any packets;
522 e.g. 'ping' does not work. 522 e.g. 'ping' does not work.
523 Reason: There is an incorrect route in your routing table. 523 Reason: There is an incorrect route in your routing table.
524 Solution: Check the routing table with the command 'route' and read the 524 Solution: Check the routing table with the command 'route' and read the
525 manual help pages dealing with routes (enter 'man route'). 525 manual help pages dealing with routes (enter 'man route').
526 526
527 NOTE: Although the 2.2.x kernel versions generate the routing entry 527 NOTE: Although the 2.2.x kernel versions generate the routing entry
528 automatically, problems of this kind may occur here as well. We've 528 automatically, problems of this kind may occur here as well. We've
529 come across a situation in which the driver started correctly at 529 come across a situation in which the driver started correctly at
530 system start, but after the driver has been removed and reloaded, 530 system start, but after the driver has been removed and reloaded,
531 the route of the adapter's network pointed to the 'dummy0' device 531 the route of the adapter's network pointed to the 'dummy0' device
532 and had to be corrected manually. 532 and had to be corrected manually.
533 533
534 534
535 Problem: Your computer should act as a router between multiple 535 Problem: Your computer should act as a router between multiple
536 IP subnetworks (using multiple adapters), but computers in 536 IP subnetworks (using multiple adapters), but computers in
537 other subnetworks cannot be reached. 537 other subnetworks cannot be reached.
538 Reason: Either the router's kernel is not configured for IP forwarding 538 Reason: Either the router's kernel is not configured for IP forwarding
539 or the routing table and gateway configuration of at least one 539 or the routing table and gateway configuration of at least one
540 computer is not working. 540 computer is not working.
541 541
542 Problem: Upon driver start, the following error message is displayed: 542 Problem: Upon driver start, the following error message is displayed:
543 "eth0: -- ERROR -- 543 "eth0: -- ERROR --
544 Class: internal Software error 544 Class: internal Software error
545 Nr: 0xcc 545 Nr: 0xcc
546 Msg: SkGeInitPort() cannot init running ports" 546 Msg: SkGeInitPort() cannot init running ports"
547 Reason: You are using a driver compiled for single processor machines 547 Reason: You are using a driver compiled for single processor machines
548 on a multiprocessor machine with SMP (Symmetric MultiProcessor) 548 on a multiprocessor machine with SMP (Symmetric MultiProcessor)
549 kernel. 549 kernel.
550 Solution: Configure your kernel appropriately and recompile the kernel or 550 Solution: Configure your kernel appropriately and recompile the kernel or
551 the modules. 551 the modules.
552 552
553 553
554 554
555 If your problem is not listed here, please contact SysKonnect's technical 555 If your problem is not listed here, please contact SysKonnect's technical
556 support for help (linux@syskonnect.de). 556 support for help (linux@syskonnect.de).
557 When contacting our technical support, please ensure that the following 557 When contacting our technical support, please ensure that the following
558 information is available: 558 information is available:
559 - System Manufacturer and HW Information (CPU, Memory... ) 559 - System Manufacturer and HW Information (CPU, Memory... )
560 - PCI-Boards in your system 560 - PCI-Boards in your system
561 - Distribution 561 - Distribution
562 - Kernel version 562 - Kernel version
563 - Driver version 563 - Driver version
564 *** 564 ***
565 565
566 566
567 567
568 ***End of Readme File*** 568 ***End of Readme File***
569 569
Documentation/pci-error-recovery.txt
1 1
2 PCI Error Recovery 2 PCI Error Recovery
3 ------------------ 3 ------------------
4 February 2, 2006 4 February 2, 2006
5 5
6 Current document maintainer: 6 Current document maintainer:
7 Linas Vepstas <linas@austin.ibm.com> 7 Linas Vepstas <linas@austin.ibm.com>
8 8
9 9
10 Many PCI bus controllers are able to detect a variety of hardware 10 Many PCI bus controllers are able to detect a variety of hardware
11 PCI errors on the bus, such as parity errors on the data and address 11 PCI errors on the bus, such as parity errors on the data and address
12 busses, as well as SERR and PERR errors. Some of the more advanced 12 busses, as well as SERR and PERR errors. Some of the more advanced
13 chipsets are able to deal with these errors; these include PCI-E chipsets, 13 chipsets are able to deal with these errors; these include PCI-E chipsets,
14 and the PCI-host bridges found on IBM Power4 and Power5-based pSeries 14 and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
15 boxes. A typical action taken is to disconnect the affected device, 15 boxes. A typical action taken is to disconnect the affected device,
16 halting all I/O to it. The goal of a disconnection is to avoid system 16 halting all I/O to it. The goal of a disconnection is to avoid system
17 corruption; for example, to halt system memory corruption due to DMA's 17 corruption; for example, to halt system memory corruption due to DMA's
18 to "wild" addresses. Typically, a reconnection mechanism is also 18 to "wild" addresses. Typically, a reconnection mechanism is also
19 offered, so that the affected PCI device(s) are reset and put back 19 offered, so that the affected PCI device(s) are reset and put back
20 into working condition. The reset phase requires coordination 20 into working condition. The reset phase requires coordination
21 between the affected device drivers and the PCI controller chip. 21 between the affected device drivers and the PCI controller chip.
22 This document describes a generic API for notifying device drivers 22 This document describes a generic API for notifying device drivers
23 of a bus disconnection, and then performing error recovery. 23 of a bus disconnection, and then performing error recovery.
24 This API is currently implemented in the 2.6.16 and later kernels. 24 This API is currently implemented in the 2.6.16 and later kernels.
25 25
26 Reporting and recovery are performed in several steps. First, when 26 Reporting and recovery are performed in several steps. First, when
27 a PCI hardware error has resulted in a bus disconnect, that event 27 a PCI hardware error has resulted in a bus disconnect, that event
28 is reported as soon as possible to all affected device drivers, 28 is reported as soon as possible to all affected device drivers,
29 including multiple instances of a device driver on multi-function 29 including multiple instances of a device driver on multi-function
30 cards. This allows device drivers to avoid deadlocking in spinloops, 30 cards. This allows device drivers to avoid deadlocking in spinloops,
31 waiting for some i/o-space register to change, when it never will. 31 waiting for some i/o-space register to change, when it never will.
32 It also gives the drivers a chance to defer incoming I/O as 32 It also gives the drivers a chance to defer incoming I/O as
33 needed. 33 needed.
34 34
35 Next, recovery is performed in several stages. Most of the complexity 35 Next, recovery is performed in several stages. Most of the complexity
36 is forced by the need to handle multi-function devices, that is, 36 is forced by the need to handle multi-function devices, that is,
37 devices that have multiple device drivers associated with them. 37 devices that have multiple device drivers associated with them.
38 In the first stage, each driver is allowed to indicate what type 38 In the first stage, each driver is allowed to indicate what type
39 of reset it desires, the choices being a simple re-enabling of I/O 39 of reset it desires, the choices being a simple re-enabling of I/O
40 or requesting a hard reset (a full electrical #RST of the PCI card). 40 or requesting a hard reset (a full electrical #RST of the PCI card).
41 If any driver requests a full reset, that is what will be done. 41 If any driver requests a full reset, that is what will be done.
42 42
43 After a full reset and/or a re-enabling of I/O, all drivers are 43 After a full reset and/or a re-enabling of I/O, all drivers are
44 again notified, so that they may then perform any device setup/config 44 again notified, so that they may then perform any device setup/config
45 that may be required. After these have all completed, a final 45 that may be required. After these have all completed, a final
46 "resume normal operations" event is sent out. 46 "resume normal operations" event is sent out.
47 47
48 The biggest reason for choosing a kernel-based implementation rather 48 The biggest reason for choosing a kernel-based implementation rather
49 than a user-space implementation was the need to deal with bus 49 than a user-space implementation was the need to deal with bus
50 disconnects of PCI devices attached to storage media, and, in particular, 50 disconnects of PCI devices attached to storage media, and, in particular,
51 disconnects from devices holding the root file system. If the root 51 disconnects from devices holding the root file system. If the root
52 file system is disconnected, a user-space mechanism would have to go 52 file system is disconnected, a user-space mechanism would have to go
53 through a large number of contortions to complete recovery. Almost all 53 through a large number of contortions to complete recovery. Almost all
54 of the current Linux file systems are not tolerant of disconnection 54 of the current Linux file systems are not tolerant of disconnection
55 from/reconnection to their underlying block device. By contrast, 55 from/reconnection to their underlying block device. By contrast,
56 bus errors are easy to manage in the device driver. Indeed, most 56 bus errors are easy to manage in the device driver. Indeed, most
57 device drivers already handle very similar recovery procedures; 57 device drivers already handle very similar recovery procedures;
58 for example, the SCSI-generic layer already provides significant 58 for example, the SCSI-generic layer already provides significant
59 mechanisms for dealing with SCSI bus errors and SCSI bus resets. 59 mechanisms for dealing with SCSI bus errors and SCSI bus resets.
60 60
61 61
62 Detailed Design 62 Detailed Design
63 --------------- 63 ---------------
64 Design and implementation details below, based on a chain of 64 Design and implementation details below, based on a chain of
65 public email discussions with Ben Herrenschmidt, circa 5 April 2005. 65 public email discussions with Ben Herrenschmidt, circa 5 April 2005.
66 66
67 The error recovery API support is exposed to the driver in the form of 67 The error recovery API support is exposed to the driver in the form of
68 a structure of function pointers pointed to by a new field in struct 68 a structure of function pointers pointed to by a new field in struct
69 pci_driver. A driver that fails to provide the structure is "non-aware", 69 pci_driver. A driver that fails to provide the structure is "non-aware",
70 and the actual recovery steps taken are platform dependent. The 70 and the actual recovery steps taken are platform dependent. The
71 arch/powerpc implementation will simulate a PCI hotplug remove/add. 71 arch/powerpc implementation will simulate a PCI hotplug remove/add.
72 72
73 This structure has the form: 73 This structure has the form:
74 struct pci_error_handlers 74 struct pci_error_handlers
75 { 75 {
76 int (*error_detected)(struct pci_dev *dev, enum pci_channel_state); 76 int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
77 int (*mmio_enabled)(struct pci_dev *dev); 77 int (*mmio_enabled)(struct pci_dev *dev);
78 int (*link_reset)(struct pci_dev *dev); 78 int (*link_reset)(struct pci_dev *dev);
79 int (*slot_reset)(struct pci_dev *dev); 79 int (*slot_reset)(struct pci_dev *dev);
80 void (*resume)(struct pci_dev *dev); 80 void (*resume)(struct pci_dev *dev);
81 }; 81 };
82 82
83 The possible channel states are: 83 The possible channel states are:
84 enum pci_channel_state { 84 enum pci_channel_state {
85 pci_channel_io_normal, /* I/O channel is in normal state */ 85 pci_channel_io_normal, /* I/O channel is in normal state */
86 pci_channel_io_frozen, /* I/O to channel is blocked */ 86 pci_channel_io_frozen, /* I/O to channel is blocked */
87 pci_channel_io_perm_failure, /* PCI card is dead */ 87 pci_channel_io_perm_failure, /* PCI card is dead */
88 }; 88 };
89 89
90 Possible return values are: 90 Possible return values are:
91 enum pci_ers_result { 91 enum pci_ers_result {
92 PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ 92 PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
93 PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ 93 PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
94 PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ 94 PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
95 PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ 95 PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
96 PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ 96 PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
97 }; 97 };
98 98
99 A driver does not have to implement all of these callbacks; however, 99 A driver does not have to implement all of these callbacks; however,
100 if it implements any, it must implement error_detected(). If a callback 100 if it implements any, it must implement error_detected(). If a callback
101 is not implemented, the corresponding feature is considered unsupported. 101 is not implemented, the corresponding feature is considered unsupported.
102 For example, if mmio_enabled() and resume() aren't there, then it 102 For example, if mmio_enabled() and resume() aren't there, then it
103 is assumed that the driver is not doing any direct recovery and requires 103 is assumed that the driver is not doing any direct recovery and requires
104 a reset. If link_reset() is not implemented, the card is assumed to 104 a reset. If link_reset() is not implemented, the card is assumed to
105 not care about link resets. Typically a driver will want to know about 105 not care about link resets. Typically a driver will want to know about
106 a slot_reset(). 106 a slot_reset().
107 107
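As a rough illustration of the structure and rules above, the following is
a minimal sketch of how a driver might fill in the handlers. It is a
userspace mock-up, not kernel code: struct pci_dev is a one-field stand-in,
the enums and the handler structure are copied from the listing above so
that the fragment compiles on its own, and all example_* names are
hypothetical. Always asking for a slot reset on a recoverable error is an
assumption made for brevity, not a recommendation.

```c
/* Userspace mock-up; a real driver gets these types from kernel headers. */
struct pci_dev { int io_stopped; };	/* stand-in, NOT the kernel struct */

enum pci_channel_state {
	pci_channel_io_normal,
	pci_channel_io_frozen,
	pci_channel_io_perm_failure,
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*link_reset)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/* Hypothetical callback: quiesce, then say what kind of recovery is wanted. */
static int example_error_detected(struct pci_dev *dev,
				  enum pci_channel_state state)
{
	dev->io_stopped = 1;		/* issue no new I/O from now on */
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	return PCI_ERS_RESULT_NEED_RESET;
}

/* Hypothetical callback: a real driver would re-initialize the card here. */
static int example_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Hypothetical callback: normal operation may restart. */
static void example_resume(struct pci_dev *dev)
{
	dev->io_stopped = 0;
}

/* mmio_enabled and link_reset stay NULL, i.e. "not implemented". */
static const struct pci_error_handlers example_err_handlers = {
	.error_detected	= example_error_detected,
	.slot_reset	= example_slot_reset,
	.resume		= example_resume,
};
```

Leaving mmio_enabled and link_reset unset corresponds to the "not
implemented" case described above: the driver does no direct recovery and
relies on a reset.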
108 The actual steps taken by a platform to recover from a PCI error 108 The actual steps taken by a platform to recover from a PCI error
109 event will be platform-dependent, but will follow the general 109 event will be platform-dependent, but will follow the general
110 sequence described below. 110 sequence described below.
111 111
112 STEP 0: Error Event 112 STEP 0: Error Event
113 ------------------- 113 -------------------
114 PCI bus error is detected by the PCI hardware. On powerpc, the slot 114 PCI bus error is detected by the PCI hardware. On powerpc, the slot
115 is isolated, in that all I/O is blocked: all reads return 0xffffffff, 115 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
116 all writes are ignored. 116 all writes are ignored.
117 117
118 118
119 STEP 1: Notification 119 STEP 1: Notification
120 -------------------- 120 --------------------
121 Platform calls the error_detected() callback on every instance of 121 Platform calls the error_detected() callback on every instance of
122 every driver affected by the error. 122 every driver affected by the error.
123 123
124 At this point, the device might not be accessible anymore, depending on 124 At this point, the device might not be accessible anymore, depending on
125 the platform (the slot will be isolated on powerpc). The driver may 125 the platform (the slot will be isolated on powerpc). The driver may
126 already have "noticed" the error because of a failing I/O, but this 126 already have "noticed" the error because of a failing I/O, but this
127 is the proper "synchronization point", that is, it gives the driver 127 is the proper "synchronization point", that is, it gives the driver
128 a chance to clean up, waiting for pending stuff (timers, whatever, etc...) 128 a chance to clean up, waiting for pending stuff (timers, whatever, etc...)
129 to complete; it can take semaphores, schedule, etc... everything but 129 to complete; it can take semaphores, schedule, etc... everything but
130 touch the device. Within this function and after it returns, the driver 130 touch the device. Within this function and after it returns, the driver
131 shouldn't do any new IOs. Called in task context. This is sort of a 131 shouldn't do any new IOs. Called in task context. This is sort of a
132 "quiesce" point. See note about interrupts at the end of this doc. 132 "quiesce" point. See note about interrupts at the end of this doc.
133 133
134 All drivers participating in this system must implement this call. 134 All drivers participating in this system must implement this call.
135 The driver must return one of the following result codes: 135 The driver must return one of the following result codes:
136 - PCI_ERS_RESULT_CAN_RECOVER: 136 - PCI_ERS_RESULT_CAN_RECOVER:
137 Driver returns this if it thinks it might be able to recover 137 Driver returns this if it thinks it might be able to recover
138 the HW by just banging IOs or if it wants to be given 138 the HW by just banging IOs or if it wants to be given
139 a chance to extract some diagnostic information (see 139 a chance to extract some diagnostic information (see
140 mmio_enabled, below). 140 mmio_enabled, below).
141 - PCI_ERS_RESULT_NEED_RESET: 141 - PCI_ERS_RESULT_NEED_RESET:
142 Driver returns this if it can't recover without a hard 142 Driver returns this if it can't recover without a hard
143 slot reset. 143 slot reset.
144 - PCI_ERS_RESULT_DISCONNECT: 144 - PCI_ERS_RESULT_DISCONNECT:
145 Driver returns this if it doesn't want to recover at all. 145 Driver returns this if it doesn't want to recover at all.
146 146
147 The next step taken will depend on the result codes returned by the 147 The next step taken will depend on the result codes returned by the
148 drivers. 148 drivers.
149 149
150 If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER, 150 If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
151 then the platform should re-enable IOs on the slot (or do nothing in 151 then the platform should re-enable IOs on the slot (or do nothing in
152 particular, if the platform doesn't isolate slots), and recovery 152 particular, if the platform doesn't isolate slots), and recovery
153 proceeds to STEP 2 (MMIO Enabled). 153 proceeds to STEP 2 (MMIO Enabled).
154 154
155 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET), 155 If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
156 then recovery proceeds to STEP 4 (Slot Reset). 156 then recovery proceeds to STEP 4 (Slot Reset).
157 157
158 If the platform is unable to recover the slot, the next step 158 If the platform is unable to recover the slot, the next step
159 is STEP 6 (Permanent Failure). 159 is STEP 6 (Permanent Failure).
160 160
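The way the platform picks the next step from the collected
error_detected() results can be sketched roughly as follows. This is a
standalone illustration, not the powerpc code: the helper name, the
next_step enum and the array-of-results interface are all hypothetical.
Two choices here are assumptions that the text above does not spell out:
a PCI_ERS_RESULT_DISCONNECT from any driver is treated as leading to
STEP 6, and any other unexpected value is treated like a reset request.

```c
#include <stddef.h>

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical names for the steps this document numbers 2, 4 and 6. */
enum next_step {
	STEP_MMIO_ENABLED = 2,
	STEP_SLOT_RESET   = 4,
	STEP_PERM_FAILURE = 6,
};

/*
 * Combine the per-driver results for one segment/slot.
 *  - every driver says CAN_RECOVER  -> STEP 2
 *  - any driver says NEED_RESET     -> STEP 4
 *  - any driver says DISCONNECT     -> STEP 6 (assumption, see lead-in)
 *  - anything else                  -> treated as a reset request (assumption)
 */
static enum next_step step_after_error_detected(const enum pci_ers_result *results,
						size_t n)
{
	size_t i;
	int need_reset = 0;

	for (i = 0; i < n; i++) {
		if (results[i] == PCI_ERS_RESULT_DISCONNECT)
			return STEP_PERM_FAILURE;
		if (results[i] != PCI_ERS_RESULT_CAN_RECOVER)
			need_reset = 1;
	}
	return need_reset ? STEP_SLOT_RESET : STEP_MMIO_ENABLED;
}
```

Note that a single driver asking for a reset overrides every driver that
reported PCI_ERS_RESULT_CAN_RECOVER, which matches the "if any driver
requests a full reset, that is what will be done" rule stated earlier.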
161 >>> The current powerpc implementation assumes that a device driver will 161 >>> The current powerpc implementation assumes that a device driver will
162 >>> *not* schedule or semaphore in this routine; the current powerpc 162 >>> *not* schedule or semaphore in this routine; the current powerpc
163 >>> implementation uses one kernel thread to notify all devices; 163 >>> implementation uses one kernel thread to notify all devices;
164 >>> thus, if one device sleeps/schedules, all devices are affected. 164 >>> thus, if one device sleeps/schedules, all devices are affected.
165 >>> Doing better requires complex multi-threaded logic in the error 165 >>> Doing better requires complex multi-threaded logic in the error
166 >>> recovery implementation (e.g. waiting for all notification threads 166 >>> recovery implementation (e.g. waiting for all notification threads
167 >>> to "join" before proceeding with recovery.) This seems excessively 167 >>> to "join" before proceeding with recovery.) This seems excessively
168 >>> complex and not worth implementing. 168 >>> complex and not worth implementing.
169 169
170 >>> The current powerpc implementation doesn't much care if the device 170 >>> The current powerpc implementation doesn't much care if the device
171 >>> attempts I/O at this point, or not. I/O's will fail, returning 171 >>> attempts I/O at this point, or not. I/O's will fail, returning
172 >>> a value of 0xff on read, and writes will be dropped. If the device 172 >>> a value of 0xff on read, and writes will be dropped. If the device
173 >>> driver attempts more than 10K I/O's to a frozen adapter, it will 173 >>> driver attempts more than 10K I/O's to a frozen adapter, it will
174 >>> assume that the device driver has gone into an infinite loop, and 174 >>> assume that the device driver has gone into an infinite loop, and
175 >>> it will panic the the kernel. There doesn't seem to be any other 175 >>> it will panic the kernel. There doesn't seem to be any other
176 >>> way of stopping a device driver that insists on spinning on I/O. 176 >>> way of stopping a device driver that insists on spinning on I/O.
177 177
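One way for a driver to avoid spinning forever on a frozen adapter is to
bound its polling loops and to give up as soon as a read comes back as all
ones. The fragment below is only a sketch under stated assumptions: the
register accessor type, poll_ready() and the two mock readers are
hypothetical and exist so the example can be compiled and exercised in
userspace. A real driver would read the device through the kernel's I/O
accessors, and must keep in mind that all ones can also be a legitimate
register value on some hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define ALL_ONES 0xffffffffu	/* what a 32-bit read of an isolated slot returns */

/* Hypothetical register accessor, standing in for a real MMIO read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until ready_bit is set. Returns 0 on success, -1 if the slot looks
 * isolated or if max_tries reads went by without the bit showing up.
 */
static int poll_ready(read_reg_fn read_reg, void *ctx,
		      uint32_t ready_bit, unsigned int max_tries)
{
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		uint32_t v = read_reg(ctx);

		if (v == ALL_ONES)
			return -1;	/* stop touching the device */
		if (v & ready_bit)
			return 0;
	}
	return -1;
}

/* Mock reader: behaves like an isolated slot. */
static uint32_t mock_read_frozen(void *ctx)
{
	(void)ctx;
	return ALL_ONES;
}

/* Mock reader: reports "ready" (bit 0) from the third read onwards. */
static uint32_t mock_read_counting(void *ctx)
{
	unsigned int *calls = ctx;

	return ++(*calls) >= 3 ? 0x1u : 0x0u;
}
```

With the mock readers, poll_ready(mock_read_frozen, NULL, 0x1u, 10) gives
up on the first read, while poll_ready(mock_read_counting, &calls, 0x1u, 10)
succeeds after three reads.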
178 STEP 2: MMIO Enabled 178 STEP 2: MMIO Enabled
179 -------------------- 179 --------------------
180 The platform re-enables MMIO to the device (but typically not the 180 The platform re-enables MMIO to the device (but typically not the
181 DMA), and then calls the mmio_enabled() callback on all affected 181 DMA), and then calls the mmio_enabled() callback on all affected
182 device drivers. 182 device drivers.
183 183
184 This is the "early recovery" call. IOs are allowed again, but DMA is 184 This is the "early recovery" call. IOs are allowed again, but DMA is
185 not (hrm... to be discussed, I prefer not), with some restrictions. This 185 not (hrm... to be discussed, I prefer not), with some restrictions. This
186 is NOT a callback for the driver to start operations again, only to 186 is NOT a callback for the driver to start operations again, only to
187 peek/poke at the device, extract diagnostic information, if any, and 187 peek/poke at the device, extract diagnostic information, if any, and
188 eventually do things like trigger a device local reset or some such, 188 eventually do things like trigger a device local reset or some such,
189 but not restart operations. This callback is made if all drivers on 189 but not restart operations. This callback is made if all drivers on
190 a segment agree that they can try to recover and if no automatic link reset 190 a segment agree that they can try to recover and if no automatic link reset
191 was performed by the HW. If the platform can't just re-enable IOs without 191 was performed by the HW. If the platform can't just re-enable IOs without
192 a slot reset or a link reset, it won't call this callback, and instead 192 a slot reset or a link reset, it won't call this callback, and instead
193 will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset). 193 will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
194 194
195 >>> The following is proposed; no platform implements this yet: 195 >>> The following is proposed; no platform implements this yet:
196 >>> Proposal: All I/O's should be done _synchronously_ from within 196 >>> Proposal: All I/O's should be done _synchronously_ from within
197 >>> this callback, errors triggered by them will be returned via 197 >>> this callback, errors triggered by them will be returned via
198 >>> the normal pci_check_whatever() API, no new error_detected() 198 >>> the normal pci_check_whatever() API, no new error_detected()
199 >>> callback will be issued due to an error happening here. However, 199 >>> callback will be issued due to an error happening here. However,
200 >>> such an error might cause IOs to be re-blocked for the whole 200 >>> such an error might cause IOs to be re-blocked for the whole
201 >>> segment, and thus invalidate the recovery that other devices 201 >>> segment, and thus invalidate the recovery that other devices
202 >>> on the same segment might have done, forcing the whole segment 202 >>> on the same segment might have done, forcing the whole segment
203 >>> into one of the next states, that is, link reset or slot reset. 203 >>> into one of the next states, that is, link reset or slot reset.
204 204
205 The driver should return one of the following result codes: 205 The driver should return one of the following result codes:
206 - PCI_ERS_RESULT_RECOVERED 206 - PCI_ERS_RESULT_RECOVERED
207 Driver returns this if it thinks the device is fully 207 Driver returns this if it thinks the device is fully
208 functional and thinks it is ready to start 208 functional and thinks it is ready to start
209 normal driver operations again. There is no 209 normal driver operations again. There is no
210 guarantee that the driver will actually be 210 guarantee that the driver will actually be
211 allowed to proceed, as another driver on the 211 allowed to proceed, as another driver on the
212 same segment might have failed and thus triggered a 212 same segment might have failed and thus triggered a
213 slot reset on platforms that support it. 213 slot reset on platforms that support it.
214 214
215 - PCI_ERS_RESULT_NEED_RESET 215 - PCI_ERS_RESULT_NEED_RESET
216 Driver returns this if it thinks the device is not 216 Driver returns this if it thinks the device is not
217 recoverable in its current state and it needs a slot 217 recoverable in its current state and it needs a slot
218 reset to proceed. 218 reset to proceed.
219 219
220 - PCI_ERS_RESULT_DISCONNECT 220 - PCI_ERS_RESULT_DISCONNECT
221 Same as above. Total failure, no recovery even after 221 Same as above. Total failure, no recovery even after
222 reset; driver dead. (To be defined more precisely) 222 reset; driver dead. (To be defined more precisely)
223 223
224 The next step taken depends on the results returned by the drivers. 224 The next step taken depends on the results returned by the drivers.
225 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform 225 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
226 proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations). 226 proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
227 227
228 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform 228 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
229 proceeds to STEP 4 (Slot Reset). 229 proceeds to STEP 4 (Slot Reset).
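
The decision described above can be sketched in plain user-space C. This is
only an illustration: the enum names are stand-ins invented for the example
(they are not the kernel's PCI_ERS_RESULT_* definitions), and the text does
not say what happens when a driver disconnects without any driver asking for
a reset, so that case is left as "platform policy" here.

```c
/* Stand-in result codes (hypothetical names, not the kernel's own). */
enum ers_result {
	ERS_RECOVERED,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
};

/* What the platform does next, following the two rules in the text. */
enum next_step {
	NEXT_LINK_RESET_OR_RESUME,	/* STEP 3 or STEP 5 */
	NEXT_SLOT_RESET,		/* STEP 4 */
	NEXT_PLATFORM_POLICY,		/* not specified by the text */
};

/* Combine the results of all n drivers on the affected segment. */
enum next_step pick_next_step(const enum ers_result *res, int n)
{
	int all_recovered = 1;
	int i;

	for (i = 0; i < n; i++) {
		/* Any driver asking for a reset forces STEP 4. */
		if (res[i] == ERS_NEED_RESET)
			return NEXT_SLOT_RESET;
		if (res[i] != ERS_RECOVERED)
			all_recovered = 0;
	}
	return all_recovered ? NEXT_LINK_RESET_OR_RESUME
			     : NEXT_PLATFORM_POLICY;
}
```

A single driver returning the "need reset" code is enough to send the whole
segment to slot reset, which is why a driver that reported recovery has no
guarantee that it will be allowed to proceed.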
230 230
231 >>> The current powerpc implementation does not implement this callback. 231 >>> The current powerpc implementation does not implement this callback.
232 232
233 233
234 STEP 3: Link Reset 234 STEP 3: Link Reset
235 ------------------ 235 ------------------
236 The platform resets the link, and then calls the link_reset() callback 236 The platform resets the link, and then calls the link_reset() callback
237 on all affected device drivers. This is a PCI-Express specific state 237 on all affected device drivers. This is a PCI-Express specific state
238 and is done whenever a non-fatal error has been detected that can be 238 and is done whenever a non-fatal error has been detected that can be
239 "solved" by resetting the link. This call informs the driver of the 239 "solved" by resetting the link. This call informs the driver of the
240 reset and the driver should check to see if the device appears to be 240 reset and the driver should check to see if the device appears to be
241 in working condition. 241 in working condition.
242 242
243 The driver is not supposed to restart normal driver I/O operations 243 The driver is not supposed to restart normal driver I/O operations
244 at this point. It should limit itself to "probing" the device to 244 at this point. It should limit itself to "probing" the device to
245 check its recoverability status. If all is right, then the platform 245 check its recoverability status. If all is right, then the platform
246 will call resume() once all drivers have ack'd link_reset(). 246 will call resume() once all drivers have ack'd link_reset().
247 247
248 Result codes: 248 Result codes:
249 (identical to STEP 2 (MMIO Enabled)) 249 (identical to STEP 2 (MMIO Enabled))
250 250
251 The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5 251 The platform then proceeds to either STEP 4 (Slot Reset) or STEP 5
252 (Resume Operations). 252 (Resume Operations).
253 253
254 >>> The current powerpc implementation does not implement this callback. 254 >>> The current powerpc implementation does not implement this callback.
255 255
256 256
257 STEP 4: Slot Reset 257 STEP 4: Slot Reset
258 ------------------ 258 ------------------
259 The platform performs a soft or hard reset of the device, and then 259 The platform performs a soft or hard reset of the device, and then
260 calls the slot_reset() callback. 260 calls the slot_reset() callback.
261 261
262 A soft reset consists of asserting the adapter #RST line and then 262 A soft reset consists of asserting the adapter #RST line and then
263 restoring the PCI BARs and PCI configuration header to a state 263 restoring the PCI BARs and PCI configuration header to a state
264 that is equivalent to what it would be after a fresh system 264 that is equivalent to what it would be after a fresh system
265 power-on followed by power-on BIOS/system firmware initialization. 265 power-on followed by power-on BIOS/system firmware initialization.
266 If the platform supports PCI hotplug, then the reset might be 266 If the platform supports PCI hotplug, then the reset might be
267 performed by toggling the slot electrical power off/on. 267 performed by toggling the slot electrical power off/on.
268 268
269 It is important for the platform to restore the PCI config space 269 It is important for the platform to restore the PCI config space
270 to the "fresh poweron" state, rather than the "last state". After 270 to the "fresh poweron" state, rather than the "last state". After
271 a slot reset, the device driver will almost always use its standard 271 a slot reset, the device driver will almost always use its standard
272 device initialization routines, and an unusual config space setup 272 device initialization routines, and an unusual config space setup
273 may result in hung devices, kernel panics, or silent data corruption. 273 may result in hung devices, kernel panics, or silent data corruption.
274 274
275 This call gives drivers the chance to re-initialize the hardware 275 This call gives drivers the chance to re-initialize the hardware
276 (re-download firmware, etc.). At this point, the driver may assume 276 (re-download firmware, etc.). At this point, the driver may assume
277 that the card is in a fresh state and is fully functional. In 277 that the card is in a fresh state and is fully functional. In
278 particular, interrupt generation should work normally. 278 particular, interrupt generation should work normally.
279 279
280 Drivers should not yet restart normal I/O processing operations 280 Drivers should not yet restart normal I/O processing operations
281 at this point. If all device drivers report success on this 281 at this point. If all device drivers report success on this
282 callback, the platform will call resume() to complete the sequence, 282 callback, the platform will call resume() to complete the sequence,
283 and let the driver restart normal I/O processing. 283 and let the driver restart normal I/O processing.
284 284
285 A driver can still return a critical failure for this function if 285 A driver can still return a critical failure for this function if
286 it can't get the device operational after reset. If the platform 286 it can't get the device operational after reset. If the platform
287 previously tried a soft reset, it might now try a hard reset (power 287 previously tried a soft reset, it might now try a hard reset (power
288 cycle) and then call slot_reset() again. If the device still can't 288 cycle) and then call slot_reset() again. If the device still can't
289 be recovered, there is nothing more that can be done; the platform 289 be recovered, there is nothing more that can be done; the platform
290 will typically report a "permanent failure" in such a case. The 290 will typically report a "permanent failure" in such a case. The
291 device will be considered "dead" in this case. 291 device will be considered "dead" in this case.
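
The soft-then-hard retry described above might look roughly like the
following user-space sketch. All names here (recover_slot, slot_reset_fn,
the two sample hooks) are hypothetical and exist only for illustration; a
real platform decides for itself whether a power cycle is attempted at all.

```c
#include <stddef.h>

enum reset_kind { RESET_SOFT, RESET_HARD };
enum outcome { OUTCOME_RESUME, OUTCOME_PERM_FAILURE };

/* Hypothetical driver hook: returns nonzero if the device is usable
 * after the given kind of reset. */
typedef int (*slot_reset_fn)(enum reset_kind kind, void *ctx);

/* Try a soft reset first, then a power cycle, then give up. */
enum outcome recover_slot(slot_reset_fn slot_reset, void *ctx)
{
	if (slot_reset(RESET_SOFT, ctx))
		return OUTCOME_RESUME;		/* on to STEP 5 */
	if (slot_reset(RESET_HARD, ctx))
		return OUTCOME_RESUME;		/* power cycle helped */
	return OUTCOME_PERM_FAILURE;		/* on to STEP 6 */
}

/* Sample hook: a card that only comes back after a power cycle. */
int needs_power_cycle(enum reset_kind kind, void *ctx)
{
	(void)ctx;
	return kind == RESET_HARD;
}

/* Sample hook: a card that never comes back. */
int dead_card(enum reset_kind kind, void *ctx)
{
	(void)kind;
	(void)ctx;
	return 0;
}
```

Calling recover_slot(needs_power_cycle, NULL) ends in a resume, while
recover_slot(dead_card, NULL) ends in a permanent failure.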
292 292
293 Drivers for multi-function cards will need to coordinate among 293 Drivers for multi-function cards will need to coordinate among
294 themselves as to which driver instance will perform any "one-shot" 294 themselves as to which driver instance will perform any "one-shot"
295 or global device initialization. For example, the Symbios sym53c8xx_2 295 or global device initialization. For example, the Symbios sym53c8xx_2
296 driver performs device init only from PCI function 0: 296 driver performs device init only from PCI function 0:
297 297
298 + if (PCI_FUNC(pdev->devfn) == 0) 298 + if (PCI_FUNC(pdev->devfn) == 0)
299 + sym_reset_scsi_bus(np, 0); 299 + sym_reset_scsi_bus(np, 0);
300 300
301 Result codes: 301 Result codes:
302 - PCI_ERS_RESULT_DISCONNECT 302 - PCI_ERS_RESULT_DISCONNECT
303 Same as above. 303 Same as above.
304 304
305 Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent 305 Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
306 Failure). 306 Failure).
307 307
308 >>> The current powerpc implementation does not currently try a 308 >>> The current powerpc implementation does not currently try a
309 >>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT. 309 >>> power-cycle reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
310 >>> However, it probably should. 310 >>> However, it probably should.
311 311
312 312
313 STEP 5: Resume Operations 313 STEP 5: Resume Operations
314 ------------------------- 314 -------------------------
315 The platform will call the resume() callback on all affected device 315 The platform will call the resume() callback on all affected device
316 drivers if all drivers on the segment have returned 316 drivers if all drivers on the segment have returned
317 PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks. 317 PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
318 The goal of this callback is to tell the driver to restart activity, 318 The goal of this callback is to tell the driver to restart activity,
319 that everything is back and running. This callback does not return 319 that everything is back and running. This callback does not return
320 a result code. 320 a result code.
321 321
322 At this point, if a new error happens, the platform will restart 322 At this point, if a new error happens, the platform will restart
323 a new error recovery sequence. 323 a new error recovery sequence.
324 324
325 STEP 6: Permanent Failure 325 STEP 6: Permanent Failure
326 ------------------------- 326 -------------------------
327 A "permanent failure" has occurred, and the platform cannot recover 327 A "permanent failure" has occurred, and the platform cannot recover
328 the device. The platform will call error_detected() with a 328 the device. The platform will call error_detected() with a
329 pci_channel_state value of pci_channel_io_perm_failure. 329 pci_channel_state value of pci_channel_io_perm_failure.
330 330
331 The device driver should, at this point, assume the worst. It should 331 The device driver should, at this point, assume the worst. It should
332 cancel all pending I/O, refuse all new I/O, returning -EIO to 332 cancel all pending I/O, refuse all new I/O, returning -EIO to
333 higher layers. The device driver should then clean up all of its 333 higher layers. The device driver should then clean up all of its
334 memory and remove itself from kernel operations, much as it would 334 memory and remove itself from kernel operations, much as it would
335 during system shutdown. 335 during system shutdown.
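
A minimal sketch of the "assume the worst" behaviour, written as ordinary
user-space C. The toy_dev structure and both functions are made up for this
example; a real driver would complete each cancelled request with an error
rather than just resetting a counter.

```c
#include <errno.h>

/* Hypothetical per-device state kept by a driver. */
struct toy_dev {
	int dead;	/* set once permanent failure is reported */
	int pending;	/* number of requests in flight */
};

/* Called when the platform reports a permanent failure. */
void toy_perm_failure(struct toy_dev *d)
{
	d->dead = 1;
	d->pending = 0;	/* cancel everything still in flight */
}

/* Submit path: refuse new I/O once the device is dead. */
int toy_submit(struct toy_dev *d)
{
	if (d->dead)
		return -EIO;	/* reported to higher layers */
	d->pending++;
	return 0;
}
```

Checking the flag at the top of the submit path keeps the refusal in one
place, so no new request can reach hardware that is no longer there.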
336 336
337 The platform will typically notify the system operator of the 337 The platform will typically notify the system operator of the
338 permanent failure in some way. If the device is hotplug-capable, 338 permanent failure in some way. If the device is hotplug-capable,
339 the operator will probably want to remove and replace the device. 339 the operator will probably want to remove and replace the device.
340 Note, however, not all failures are truly "permanent". Some are 340 Note, however, not all failures are truly "permanent". Some are
341 caused by over-heating, some by a poorly seated card. Many 341 caused by over-heating, some by a poorly seated card. Many
342 PCI error events are caused by software bugs, e.g. DMAs to 342 PCI error events are caused by software bugs, e.g. DMAs to
343 wild addresses or bogus split transactions due to programming 343 wild addresses or bogus split transactions due to programming
344 errors. See the discussion in powerpc/eeh-pci-error-recovery.txt 344 errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
345 for additional detail on real-life experience of the causes of 345 for additional detail on real-life experience of the causes of
346 software errors. 346 software errors.
347 347
348 348
349 Conclusion; General Remarks 349 Conclusion; General Remarks
350 --------------------------- 350 ---------------------------
351 The way those callbacks are called is platform policy. A platform with 351 The way those callbacks are called is platform policy. A platform with
352 no slot reset capability may want to just "ignore" drivers that can't 352 no slot reset capability may want to just "ignore" drivers that can't
353 recover (disconnect them) and try to let other cards on the same segment 353 recover (disconnect them) and try to let other cards on the same segment
354 recover. Keep in mind that in most real life cases, though, there will 354 recover. Keep in mind that in most real life cases, though, there will
355 be only one driver per segment. 355 be only one driver per segment.
356 356
357 Now, a note about interrupts. If you get an interrupt and your 357 Now, a note about interrupts. If you get an interrupt and your
358 device is dead or has been isolated, there is a problem :) 358 device is dead or has been isolated, there is a problem :)
359 The current policy is to turn this into a platform policy. 359 The current policy is to turn this into a platform policy.
360 That is, the recovery API only requires that: 360 That is, the recovery API only requires that:
361 361
362 - There is no guarantee that interrupt delivery can proceed from any 362 - There is no guarantee that interrupt delivery can proceed from any
363 device on the segment starting from the error detection and until the 363 device on the segment starting from the error detection and until the
364 resume callback is sent, at which point interrupts are expected to be 364 resume callback is sent, at which point interrupts are expected to be
365 fully operational. 365 fully operational.
366 366
367 - There is no guarantee that interrupt delivery is stopped, that is, 367 - There is no guarantee that interrupt delivery is stopped, that is,
368 a driver that gets an interrupt after detecting an error, or that detects 368 a driver that gets an interrupt after detecting an error, or that detects
369 an error within the interrupt handler such that it prevents proper 369 an error within the interrupt handler such that it prevents proper
370 ack'ing of the interrupt (and thus removal of the source) should just 370 ack'ing of the interrupt (and thus removal of the source) should just
371 return IRQ_NONE. It's up to the platform to deal with that 371 return IRQ_NONE. It's up to the platform to deal with that
372 condition, typically by masking the IRQ source during the duration of 372 condition, typically by masking the IRQ source during the duration of
373 the error handling. It is expected that the platform "knows" which 373 the error handling. It is expected that the platform "knows" which
374 interrupts are routed to error-management capable slots and can deal 374 interrupts are routed to error-management capable slots and can deal
375 with temporarily disabling that IRQ number during error processing (this 375 with temporarily disabling that IRQ number during error processing (this
376 isn't terribly complex). That means some IRQ latency for other devices 376 isn't terribly complex). That means some IRQ latency for other devices
377 sharing the interrupt, but there is simply no other way. High end 377 sharing the interrupt, but there is simply no other way. High end
378 platforms aren't supposed to share interrupts between many devices 378 platforms aren't supposed to share interrupts between many devices
379 anyway :) 379 anyway :)
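
As a rough illustration of the second rule, an interrupt handler can check a
"recovery in progress" flag before touching the hardware. The sketch below
is user-space C with invented names (toy_nic, toy_irq, TOY_IRQ_*); in the
kernel the corresponding "not handled" return value is IRQ_NONE from
irqreturn_t.

```c
/* Stand-ins for the kernel's interrupt handler return values. */
enum toy_irqreturn { TOY_IRQ_NONE, TOY_IRQ_HANDLED };

/* Hypothetical driver state. */
struct toy_nic {
	int in_recovery;	/* set between error detection and resume */
	int events;		/* interrupts actually serviced */
};

enum toy_irqreturn toy_irq(struct toy_nic *n)
{
	/* Device is isolated: do not touch it, let the platform cope. */
	if (n->in_recovery)
		return TOY_IRQ_NONE;

	n->events++;		/* normal servicing would happen here */
	return TOY_IRQ_HANDLED;
}
```

The handler makes no attempt to acknowledge the interrupt while recovery is
in progress; masking the source is left to the platform, as the text says.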
380 380
381 >>> Implementation details for the powerpc platform are discussed in 381 >>> Implementation details for the powerpc platform are discussed in
382 >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt 382 >>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
383 383
384 >>> As of this writing, there are six device drivers with patches 384 >>> As of this writing, there are six device drivers with patches
385 >>> implementing error recovery. Not all of these patches are in 385 >>> implementing error recovery. Not all of these patches are in
386 >>> mainline yet. These may be used as "examples": 386 >>> mainline yet. These may be used as "examples":
387 >>> 387 >>>
388 >>> drivers/scsi/ipr.c 388 >>> drivers/scsi/ipr.c
389 >>> drivers/scsi/sym53c8xx_2 389 >>> drivers/scsi/sym53c8xx_2
390 >>> drivers/net/e100.c 390 >>> drivers/net/e100.c
391 >>> drivers/net/e1000 391 >>> drivers/net/e1000
392 >>> drivers/net/ixgb 392 >>> drivers/net/ixgb
393 >>> drivers/net/s2io.c 393 >>> drivers/net/s2io.c
394 394
395 The End 395 The End
396 ------- 396 -------
397 397
Documentation/power/swsusp.txt
1 Some warnings, first. 1 Some warnings, first.
2 2
3 * BIG FAT WARNING ********************************************************* 3 * BIG FAT WARNING *********************************************************
4 * 4 *
5 * If you touch anything on disk between suspend and resume... 5 * If you touch anything on disk between suspend and resume...
6 * ...kiss your data goodbye. 6 * ...kiss your data goodbye.
7 * 7 *
8 * If you do resume from initrd after your filesystems are mounted... 8 * If you do resume from initrd after your filesystems are mounted...
9 * ...bye bye root partition. 9 * ...bye bye root partition.
10 * [this is actually same case as above] 10 * [this is actually same case as above]
11 * 11 *
12 * If you have unsupported (*) devices using DMA, you may have some 12 * If you have unsupported (*) devices using DMA, you may have some
13 * problems. If your disk driver does not support suspend... (IDE does), 13 * problems. If your disk driver does not support suspend... (IDE does),
14 * it may cause some problems, too. If you change kernel command line 14 * it may cause some problems, too. If you change kernel command line
15 * between suspend and resume, it may do something wrong. If you change 15 * between suspend and resume, it may do something wrong. If you change
16 * your hardware while the system is suspended... well, it was not a good idea; 16 * your hardware while the system is suspended... well, it was not a good idea;
17 * but it will probably only crash. 17 * but it will probably only crash.
18 * 18 *
19 * (*) suspend/resume support is needed to make it safe. 19 * (*) suspend/resume support is needed to make it safe.
20 * 20 *
21 * If you have any filesystems on USB devices mounted before software suspend, 21 * If you have any filesystems on USB devices mounted before software suspend,
22 * they won't be accessible after resume and you may lose data, as though 22 * they won't be accessible after resume and you may lose data, as though
23 * you have unplugged the USB devices with mounted filesystems on them; 23 * you have unplugged the USB devices with mounted filesystems on them;
24 * see the FAQ below for details. (This is not true for more traditional 24 * see the FAQ below for details. (This is not true for more traditional
25 * power states like "standby", which normally don't turn USB off.) 25 * power states like "standby", which normally don't turn USB off.)
26 26
27 You need to append resume=/dev/your_swap_partition to kernel command 27 You need to append resume=/dev/your_swap_partition to kernel command
28 line. Then you suspend by 28 line. Then you suspend by
29 29
30 echo shutdown > /sys/power/disk; echo disk > /sys/power/state 30 echo shutdown > /sys/power/disk; echo disk > /sys/power/state
31 31
32 . If you feel ACPI works pretty well on your system, you might try 32 . If you feel ACPI works pretty well on your system, you might try
33 33
34 echo platform > /sys/power/disk; echo disk > /sys/power/state 34 echo platform > /sys/power/disk; echo disk > /sys/power/state
35 35
36 . If you have SATA disks, you'll need recent kernels with SATA suspend 36 . If you have SATA disks, you'll need recent kernels with SATA suspend
37 support. For suspend and resume to work, make sure your disk drivers 37 support. For suspend and resume to work, make sure your disk drivers
38 are built into the kernel -- not modules. [There's a way to make 38 are built into the kernel -- not modules. [There's a way to make
39 suspend/resume work with modular disk drivers, see FAQ, but you probably 39 suspend/resume work with modular disk drivers, see FAQ, but you probably
40 should not do that.] 40 should not do that.]
41 41
42 If you want to limit the suspend image size to N bytes, do 42 If you want to limit the suspend image size to N bytes, do
43 43
44 echo N > /sys/power/image_size 44 echo N > /sys/power/image_size
45 45
46 before suspend (it is limited to 500 MB by default). 46 before suspend (it is limited to 500 MB by default).
47 47
48 48
49 Article about goals and implementation of Software Suspend for Linux 49 Article about goals and implementation of Software Suspend for Linux
50 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 50 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
51 Author: Gábor Kuti 51 Author: Gábor Kuti
52 Last revised: 2003-10-20 by Pavel Machek 52 Last revised: 2003-10-20 by Pavel Machek
53 53
54 Idea and goals to achieve 54 Idea and goals to achieve
55 55
56 Nowadays it is common in several laptops that they have a suspend button. It 56 Nowadays it is common in several laptops that they have a suspend button. It
57 saves the state of the machine to a filesystem or to a partition and switches 57 saves the state of the machine to a filesystem or to a partition and switches
58 to standby mode. When the machine is later resumed, the saved state is loaded back into 58 to standby mode. When the machine is later resumed, the saved state is loaded back into
59 RAM and the machine can continue its work. This has two real benefits. First, we 59 RAM and the machine can continue its work. This has two real benefits. First, we
60 save ourselves the time the machine takes to go down and later boot up, and energy costs 60 save ourselves the time the machine takes to go down and later boot up, and energy costs
61 are really high when running from batteries. The other gain is that we don't have to 61 are really high when running from batteries. The other gain is that we don't have to
62 interrupt our programs so processes that are calculating something for a long 62 interrupt our programs so processes that are calculating something for a long
63 time shouldn't need to be written interruptible. 63 time shouldn't need to be written interruptible.
64 64
65 swsusp saves the state of the machine into active swaps and then reboots or 65 swsusp saves the state of the machine into active swaps and then reboots or
66 powers down. You must explicitly specify the swap partition to resume from with the 66 powers down. You must explicitly specify the swap partition to resume from with the
67 ``resume='' kernel option. If a signature is found it loads and restores the saved 67 ``resume='' kernel option. If a signature is found it loads and restores the saved
68 state. If the option ``noresume'' is specified as a boot parameter, it skips 68 state. If the option ``noresume'' is specified as a boot parameter, it skips
69 the resuming. 69 the resuming.
70 70
71 In the meantime while the system is suspended you should not add/remove any 71 In the meantime while the system is suspended you should not add/remove any
72 of the hardware, write to the filesystems, etc. 72 of the hardware, write to the filesystems, etc.
73 73
74 Sleep states summary 74 Sleep states summary
75 ==================== 75 ====================
76 76
77 There are three different interfaces you can use, /proc/acpi should 77 There are three different interfaces you can use, /proc/acpi should
78 work like this: 78 work like this:
79 79
80 In a really perfect world: 80 In a really perfect world:
81 echo 1 > /proc/acpi/sleep # for standby 81 echo 1 > /proc/acpi/sleep # for standby
82 echo 2 > /proc/acpi/sleep # for suspend to ram 82 echo 2 > /proc/acpi/sleep # for suspend to ram
83 echo 3 > /proc/acpi/sleep # for suspend to ram, but more power conservative 83 echo 3 > /proc/acpi/sleep # for suspend to ram, but more power conservative
84 echo 4 > /proc/acpi/sleep # for suspend to disk 84 echo 4 > /proc/acpi/sleep # for suspend to disk
85 echo 5 > /proc/acpi/sleep # for an unfriendly shutdown of the system 85 echo 5 > /proc/acpi/sleep # for an unfriendly shutdown of the system
86 86
87 and perhaps 87 and perhaps
88 echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios 88 echo 4b > /proc/acpi/sleep # for suspend to disk via s4bios
89 89
90 Frequently Asked Questions 90 Frequently Asked Questions
91 ========================== 91 ==========================
92 92
93 Q: well, suspending a server is IMHO a really stupid thing, 93 Q: well, suspending a server is IMHO a really stupid thing,
94 but... (Diego Zuccato): 94 but... (Diego Zuccato):
95 95
96 A: You bought new UPS for your server. How do you install it without 96 A: You bought new UPS for your server. How do you install it without
97 bringing machine down? Suspend to disk, rearrange power cables, 97 bringing machine down? Suspend to disk, rearrange power cables,
98 resume. 98 resume.
99 99
100 You have your server on UPS. Power died, and UPS is indicating 30 100 You have your server on UPS. Power died, and UPS is indicating 30
101 seconds to failure. What do you do? Suspend to disk. 101 seconds to failure. What do you do? Suspend to disk.
102 102
103 103
104 Q: Maybe I'm missing something, but why don't the regular I/O paths work? 104 Q: Maybe I'm missing something, but why don't the regular I/O paths work?
105 105
106 A: We do use the regular I/O paths. However we cannot restore the data 106 A: We do use the regular I/O paths. However we cannot restore the data
107 to its original location as we load it. That would create an 107 to its original location as we load it. That would create an
108 inconsistent kernel state which would certainly result in an oops. 108 inconsistent kernel state which would certainly result in an oops.
109 Instead, we load the image into unused memory and then atomically copy 109 Instead, we load the image into unused memory and then atomically copy
110 it back to its original location. This implies, of course, a maximum 110 it back to its original location. This implies, of course, a maximum
111 image size of half the amount of memory. 111 image size of half the amount of memory.
112 112
113 There are two solutions to this: 113 There are two solutions to this:
114 114
115 * require half of memory to be free during suspend. That way you can 115 * require half of memory to be free during suspend. That way you can
116 read "new" data onto free spots, then cli and copy 116 read "new" data onto free spots, then cli and copy
117 117
118 * assume we had a special "polling" IDE driver that only uses memory 118 * assume we had a special "polling" IDE driver that only uses memory
119 between 0-640KB. That way, I'd have to make sure that 0-640KB is free 119 between 0-640KB. That way, I'd have to make sure that 0-640KB is free
120 during suspending, but otherwise it would work... 120 during suspending, but otherwise it would work...
121 121
122 suspend2 shares this fundamental limitation, but does not include user 122 suspend2 shares this fundamental limitation, but does not include user
123 data and disk caches into "used memory" by saving them in 123 data and disk caches into "used memory" by saving them in
124 advance. That means that the limitation goes away in practice. 124 advance. That means that the limitation goes away in practice.
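
The half-of-memory limit from this answer is simple arithmetic, sketched
here in C. The function names and the use of page counts are assumptions
made for the example; they are not taken from the swsusp code.

```c
/* The image is first loaded into free memory and then copied back
 * over its original location, so it can use at most half of RAM. */
unsigned long max_image_pages(unsigned long total_pages)
{
	return total_pages / 2;
}

/* Nonzero if an image of image_pages pages can be resumed. */
int image_fits(unsigned long image_pages, unsigned long total_pages)
{
	return image_pages <= max_image_pages(total_pages);
}
```

With 1000 pages of memory, an image of 500 pages fits and an image of 501
pages does not.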
125 125
126 Q: Does linux support ACPI S4? 126 Q: Does linux support ACPI S4?
127 127
128 A: Yes. That's what echo platform > /sys/power/disk does. 128 A: Yes. That's what echo platform > /sys/power/disk does.
129 129
130 Q: What is 'suspend2'? 130 Q: What is 'suspend2'?
131 131
132 A: suspend2 is 'Software Suspend 2', a forked implementation of 132 A: suspend2 is 'Software Suspend 2', a forked implementation of
133 suspend-to-disk which is available as separate patches for 2.4 and 2.6 133 suspend-to-disk which is available as separate patches for 2.4 and 2.6
134 kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB 134 kernels from swsusp.sourceforge.net. It includes support for SMP, 4GB
135 highmem and preemption. It also has an extensible architecture that 135 highmem and preemption. It also has an extensible architecture that
136 allows for arbitrary transformations on the image (compression, 136 allows for arbitrary transformations on the image (compression,
137 encryption) and arbitrary backends for writing the image (e.g. to swap 137 encryption) and arbitrary backends for writing the image (e.g. to swap
138 or an NFS share [Work In Progress]). Questions regarding suspend2 138 or an NFS share [Work In Progress]). Questions regarding suspend2
139 should be sent to the mailing list available through the suspend2 139 should be sent to the mailing list available through the suspend2
140 website, and not to the Linux Kernel Mailing List. We are working 140 website, and not to the Linux Kernel Mailing List. We are working
141 toward merging suspend2 into the mainline kernel. 141 toward merging suspend2 into the mainline kernel.
142 142
143 Q: A kernel thread must voluntarily freeze itself (call 'refrigerator'). 143 Q: A kernel thread must voluntarily freeze itself (call 'refrigerator').
144 I found some kernel threads that don't do it, and they don't freeze 144 I found some kernel threads that don't do it, and they don't freeze
145 so the system can't sleep. Is this a known behavior? 145 so the system can't sleep. Is this a known behavior?
146 146
147 A: All such kernel threads need to be fixed, one by one. Select the 147 A: All such kernel threads need to be fixed, one by one. Select the
148 place where the thread is safe to be frozen (no kernel semaphores 148 place where the thread is safe to be frozen (no kernel semaphores
149 should be held at that point and it must be safe to sleep there), and 149 should be held at that point and it must be safe to sleep there), and
150 add: 150 add:
151 151
152 try_to_freeze(); 152 try_to_freeze();
153 153
154 If the thread is needed for writing the image to storage, you should 154 If the thread is needed for writing the image to storage, you should
155 instead set the PF_NOFREEZE process flag when creating the thread (and 155 instead set the PF_NOFREEZE process flag when creating the thread (and
156 be very careful). 156 be very careful).
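
Where the freeze point goes in a thread's main loop can be shown with a
small user-space model. Everything below is a toy: toy_thread and
toy_try_to_freeze() are invented stand-ins that only mimic the idea of
try_to_freeze(), and the lock is modelled as a plain counter.

```c
/* Hypothetical model of a kernel thread. */
struct toy_thread {
	int freeze_requested;	/* set by the suspend code */
	int frozen;		/* set once the thread has parked itself */
	int locks_held;		/* must be zero at the freeze point */
	int work_done;
};

/* Stand-in for try_to_freeze(): park the thread if asked to.
 * The caller must hold no locks here. Returns 1 if it froze. */
int toy_try_to_freeze(struct toy_thread *t)
{
	if (!t->freeze_requested)
		return 0;
	t->frozen = 1;
	return 1;
}

/* One iteration of the thread's main loop. */
void toy_thread_iteration(struct toy_thread *t)
{
	t->locks_held++;	/* take a lock, do one unit of work */
	t->work_done++;
	t->locks_held--;	/* drop it again */

	toy_try_to_freeze(t);	/* safe point: nothing held, may sleep */
}
```

The call sits at the end of the iteration, after every lock has been
dropped, which is the kind of place the answer above asks you to select.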
157 157
158 158
159 Q: What is the difference between between "platform", "shutdown" and 159 Q: What is the difference between "platform", "shutdown" and
160 "firmware" in /sys/power/disk? 160 "firmware" in /sys/power/disk?
161 161
162 A: 162 A:
163 163
164 shutdown: save state in linux, then tell bios to powerdown 164 shutdown: save state in linux, then tell bios to powerdown
165 165
166 platform: save state in linux, then tell bios to powerdown and blink 166 platform: save state in linux, then tell bios to powerdown and blink
167 "suspended led" 167 "suspended led"
168 168
169 firmware: tell bios to save state itself [needs BIOS-specific suspend 169 firmware: tell bios to save state itself [needs BIOS-specific suspend
170 partition, and has very little to do with swsusp] 170 partition, and has very little to do with swsusp]
171 171
172 "platform" is actually the right thing to do, but "shutdown" is the most 172 "platform" is actually the right thing to do, but "shutdown" is the most
173 reliable. 173 reliable.
174 174
175 Q: I do not understand why you have such strong objections to the idea of 175 Q: I do not understand why you have such strong objections to the idea of
176 selective suspend. 176 selective suspend.
177 177
178 A: Do selective suspend during runtime power management, that's okay. But 178 A: Do selective suspend during runtime power management, that's okay. But
179 it's useless for suspend-to-disk. (And I do not see how you could use 179 it's useless for suspend-to-disk. (And I do not see how you could use
180 it for suspend-to-ram, I hope you do not want that). 180 it for suspend-to-ram, I hope you do not want that).
181 181
182 Let's see, so you suggest to 182 Let's see, so you suggest to
183 183
184 * SUSPEND all but swap device and parents 184 * SUSPEND all but swap device and parents
185 * Snapshot 185 * Snapshot
186 * Write image to disk 186 * Write image to disk
187 * SUSPEND swap device and parents 187 * SUSPEND swap device and parents
188 * Powerdown 188 * Powerdown
189 189
190 Oh no, that does not work, if swap device or its parents uses DMA, 190 Oh no, that does not work, if swap device or its parents uses DMA,
191 you've corrupted data. You'd have to do 191 you've corrupted data. You'd have to do
192 192
193 * SUSPEND all but swap device and parents 193 * SUSPEND all but swap device and parents
194 * FREEZE swap device and parents 194 * FREEZE swap device and parents
195 * Snapshot 195 * Snapshot
196 * UNFREEZE swap device and parents 196 * UNFREEZE swap device and parents
197 * Write 197 * Write
198 * SUSPEND swap device and parents 198 * SUSPEND swap device and parents
199 199
200 Which means that you still need that FREEZE state, and you get more 200 Which means that you still need that FREEZE state, and you get more
201 complicated code. (And I have not yet introduced details like system 201 complicated code. (And I have not yet introduced details like system
202 devices). 202 devices).
203 203
204 Q: There don't seem to be any generally useful behavioral 204 Q: There don't seem to be any generally useful behavioral
205 distinctions between SUSPEND and FREEZE. 205 distinctions between SUSPEND and FREEZE.
206 206
207 A: Doing SUSPEND when you are asked to do FREEZE is always correct, 207 A: Doing SUSPEND when you are asked to do FREEZE is always correct,
208 but it may be unnecessarily slow. If you want your driver to stay simple, 208 but it may be unnecessarily slow. If you want your driver to stay simple,
209 slowness may not matter to you. It can always be fixed later. 209 slowness may not matter to you. It can always be fixed later.
210 210
211 For devices like disk it does matter, you do not want to spindown for 211 For devices like disk it does matter, you do not want to spindown for
212 FREEZE. 212 FREEZE.
213 213
214 Q: After resuming, system is paging heavily, leading to very bad interactivity. 214 Q: After resuming, system is paging heavily, leading to very bad interactivity.
215 215
216 A: Try running 216 A: Try running
217 217
218 cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null 218 cat `cat /proc/[0-9]*/maps | grep / | sed 's:.* /:/:' | sort -u` > /dev/null
219 219
220 after resume. swapoff -a; swapon -a may also be useful. 220 after resume. swapoff -a; swapon -a may also be useful.
221 221
222 Q: What happens to devices during swsusp? They seem to be resumed 222 Q: What happens to devices during swsusp? They seem to be resumed
223 during system suspend? 223 during system suspend?
224 224
225 A: That's correct. We need to resume them if we want to write image to 225 A: That's correct. We need to resume them if we want to write image to
226 disk. Whole sequence goes like 226 disk. Whole sequence goes like
227 227
228 Suspend part 228 Suspend part
229 ~~~~~~~~~~~~ 229 ~~~~~~~~~~~~
230 running system, user asks for suspend-to-disk 230 running system, user asks for suspend-to-disk
231 231
232 user processes are stopped 232 user processes are stopped
233 233
234 suspend(PMSG_FREEZE): devices are frozen so that they don't interfere 234 suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
235 with state snapshot 235 with state snapshot
236 236
237 state snapshot: copy of whole used memory is taken with interrupts disabled 237 state snapshot: copy of whole used memory is taken with interrupts disabled
238 238
239 resume(): devices are woken up so that we can write image to swap 239 resume(): devices are woken up so that we can write image to swap
240 240
241 write image to swap 241 write image to swap
242 242
243 suspend(PMSG_SUSPEND): suspend devices so that we can power off 243 suspend(PMSG_SUSPEND): suspend devices so that we can power off
244 244
245 turn the power off 245 turn the power off
246 246
247 Resume part 247 Resume part
248 ~~~~~~~~~~~ 248 ~~~~~~~~~~~
249 (is actually pretty similar) 249 (is actually pretty similar)
250 250
251 running system, user asks for suspend-to-disk 251 running system, user asks for suspend-to-disk
252 252
253 user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows) 253 user processes are stopped (in common case there are none, but with resume-from-initrd, no one knows)
254 254
255 read image from disk 255 read image from disk
256 256
257 suspend(PMSG_FREEZE): devices are frozen so that they don't interfere 257 suspend(PMSG_FREEZE): devices are frozen so that they don't interfere
258 with image restoration 258 with image restoration
259 259
260 image restoration: rewrite memory with image 260 image restoration: rewrite memory with image
261 261
262 resume(): devices are woken up so that system can continue 262 resume(): devices are woken up so that system can continue
263 263
264 thaw all user processes 264 thaw all user processes
265 265
266 Q: What is this 'Encrypt suspend image' for? 266 Q: What is this 'Encrypt suspend image' for?
267 267
268 A: First of all: it is not a replacement for dm-crypt encrypted swap. 268 A: First of all: it is not a replacement for dm-crypt encrypted swap.
269 It cannot protect your computer while it is suspended. Instead it does 269 It cannot protect your computer while it is suspended. Instead it does
270 protect from leaking sensitive data after resume from suspend. 270 protect from leaking sensitive data after resume from suspend.
271 271
272 Think of the following: you suspend while an application is running 272 Think of the following: you suspend while an application is running
273 that keeps sensitive data in memory. The application itself prevents 273 that keeps sensitive data in memory. The application itself prevents
274 the data from being swapped out. Suspend, however, must write these 274 the data from being swapped out. Suspend, however, must write these
275 data to swap to be able to resume later on. Without suspend encryption 275 data to swap to be able to resume later on. Without suspend encryption
276 your sensitive data are then stored in plaintext on disk. This means 276 your sensitive data are then stored in plaintext on disk. This means
277 that after resume your sensitive data are accessible to all 277 that after resume your sensitive data are accessible to all
278 applications having direct access to the swap device which was used 278 applications having direct access to the swap device which was used
279 for suspend. If you don't need swap after resume these data can remain 279 for suspend. If you don't need swap after resume these data can remain
280 on disk virtually forever. Thus it can happen that your system gets 280 on disk virtually forever. Thus it can happen that your system gets
281 broken in weeks later and sensitive data which you thought were 281 broken in weeks later and sensitive data which you thought were
282 encrypted and protected are retrieved and stolen from the swap device. 282 encrypted and protected are retrieved and stolen from the swap device.
283 To prevent this situation you should use 'Encrypt suspend image'. 283 To prevent this situation you should use 'Encrypt suspend image'.
284 284
285 During suspend a temporary key is created and this key is used to 285 During suspend a temporary key is created and this key is used to
286 encrypt the data written to disk. When, during resume, the data was 286 encrypt the data written to disk. When, during resume, the data was
287 read back into memory the temporary key is destroyed which simply 287 read back into memory the temporary key is destroyed which simply
288 means that all data written to disk during suspend are then 288 means that all data written to disk during suspend are then
289 inaccessible so they can't be stolen later on. The only thing that 289 inaccessible so they can't be stolen later on. The only thing that
290 you must then take care of is that you call 'mkswap' for the swap 290 you must then take care of is that you call 'mkswap' for the swap
291 partition used for suspend as early as possible during regular 291 partition used for suspend as early as possible during regular
292 boot. This asserts that any temporary key from an oopsed suspend or 292 boot. This asserts that any temporary key from an oopsed suspend or
293 from a failed or aborted resume is erased from the swap device. 293 from a failed or aborted resume is erased from the swap device.
294 294
295 As a rule of thumb use encrypted swap to protect your data while your 295 As a rule of thumb use encrypted swap to protect your data while your
296 system is shut down or suspended. Additionally use the encrypted 296 system is shut down or suspended. Additionally use the encrypted
297 suspend image to prevent sensitive data from being stolen after 297 suspend image to prevent sensitive data from being stolen after
298 resume. 298 resume.
299 299
300 Q: Why can't we suspend to a swap file? 300 Q: Why can't we suspend to a swap file?
301 301
302 A: Because accessing swap file needs the filesystem mounted, and 302 A: Because accessing swap file needs the filesystem mounted, and
303 filesystem might do something wrong (like replaying the journal) 303 filesystem might do something wrong (like replaying the journal)
304 during mount. 304 during mount.
305 305
306 There are few ways to get that fixed: 306 There are few ways to get that fixed:
307 307
308 1) Probably could be solved by modifying every filesystem to support 308 1) Probably could be solved by modifying every filesystem to support
309 some kind of "really read-only!" option. Patches welcome. 309 some kind of "really read-only!" option. Patches welcome.
310 310
311 2) suspend2 gets around that by storing absolute positions in on-disk 311 2) suspend2 gets around that by storing absolute positions in on-disk
312 image (and blocksize), with resume parameter pointing directly to 312 image (and blocksize), with resume parameter pointing directly to
313 suspend header. 313 suspend header.
314 314
315 Q: Is there a maximum system RAM size that is supported by swsusp? 315 Q: Is there a maximum system RAM size that is supported by swsusp?
316 316
317 A: It should work okay with highmem. 317 A: It should work okay with highmem.
318 318
319 Q: Does swsusp (to disk) use only one swap partition or can it use 319 Q: Does swsusp (to disk) use only one swap partition or can it use
320 multiple swap partitions (aggregate them into one logical space)? 320 multiple swap partitions (aggregate them into one logical space)?
321 321
322 A: Only one swap partition, sorry. 322 A: Only one swap partition, sorry.
323 323
324 Q: If my application(s) causes lots of memory & swap space to be used 324 Q: If my application(s) causes lots of memory & swap space to be used
325 (over half of the total system RAM), is it correct that it is likely 325 (over half of the total system RAM), is it correct that it is likely
326 to be useless to try to suspend to disk while that app is running? 326 to be useless to try to suspend to disk while that app is running?
327 327
328 A: No, it should work okay, as long as your app does not mlock() 328 A: No, it should work okay, as long as your app does not mlock()
329 it. Just prepare big enough swap partition. 329 it. Just prepare big enough swap partition.
330 330
331 Q: What information is useful for debugging suspend-to-disk problems? 331 Q: What information is useful for debugging suspend-to-disk problems?
332 332
333 A: Well, last messages on the screen are always useful. If something 333 A: Well, last messages on the screen are always useful. If something
334 is broken, it is usually some kernel driver, therefore trying with as 334 is broken, it is usually some kernel driver, therefore trying with as
335 little as possible modules loaded helps a lot. I also prefer people to 335 little as possible modules loaded helps a lot. I also prefer people to
336 suspend from console, preferably without X running. Booting with 336 suspend from console, preferably without X running. Booting with
337 init=/bin/bash, then swapon and starting suspend sequence manually 337 init=/bin/bash, then swapon and starting suspend sequence manually
338 usually does the trick. Then it is good idea to try with latest 338 usually does the trick. Then it is good idea to try with latest
339 vanilla kernel. 339 vanilla kernel.
340 340
341 Q: How can distributions ship a swsusp-supporting kernel with modular 341 Q: How can distributions ship a swsusp-supporting kernel with modular
342 disk drivers (especially SATA)? 342 disk drivers (especially SATA)?
343 343
344 A: Well, it can be done, load the drivers, then do echo into 344 A: Well, it can be done, load the drivers, then do echo into
345 /sys/power/disk/resume file from initrd. Be sure not to mount 345 /sys/power/disk/resume file from initrd. Be sure not to mount
346 anything, not even read-only mount, or you are going to lose your 346 anything, not even read-only mount, or you are going to lose your
347 data. 347 data.
348 348
349 Q: How do I make suspend more verbose? 349 Q: How do I make suspend more verbose?
350 350
351 A: If you want to see any non-error kernel messages on the virtual 351 A: If you want to see any non-error kernel messages on the virtual
352 terminal the kernel switches to during suspend, you have to set the 352 terminal the kernel switches to during suspend, you have to set the
353 kernel console loglevel to at least 4 (KERN_WARNING), for example by 353 kernel console loglevel to at least 4 (KERN_WARNING), for example by
354 doing 354 doing
355 355
356 # save the old loglevel 356 # save the old loglevel
357 read LOGLEVEL DUMMY < /proc/sys/kernel/printk 357 read LOGLEVEL DUMMY < /proc/sys/kernel/printk
358 # set the loglevel so we see the progress bar. 358 # set the loglevel so we see the progress bar.
359 # if the level is higher than needed, we leave it alone. 359 # if the level is higher than needed, we leave it alone.
360 if [ $LOGLEVEL -lt 5 ]; then 360 if [ $LOGLEVEL -lt 5 ]; then
361 echo 5 > /proc/sys/kernel/printk 361 echo 5 > /proc/sys/kernel/printk
362 fi 362 fi
363 363
364 IMG_SZ=0 364 IMG_SZ=0
365 read IMG_SZ < /sys/power/image_size 365 read IMG_SZ < /sys/power/image_size
366 echo -n disk > /sys/power/state 366 echo -n disk > /sys/power/state
367 RET=$? 367 RET=$?
368 # 368 #
369 # the logic here is: 369 # the logic here is:
370 # if image_size > 0 (without kernel support, IMG_SZ will be zero), 370 # if image_size > 0 (without kernel support, IMG_SZ will be zero),
371 # then try again with image_size set to zero. 371 # then try again with image_size set to zero.
372 if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size 372 if [ $RET -ne 0 -a $IMG_SZ -ne 0 ]; then # try again with minimal image size
373 echo 0 > /sys/power/image_size 373 echo 0 > /sys/power/image_size
374 echo -n disk > /sys/power/state 374 echo -n disk > /sys/power/state
375 RET=$? 375 RET=$?
376 fi 376 fi
377 377
378 # restore previous loglevel 378 # restore previous loglevel
379 echo $LOGLEVEL > /proc/sys/kernel/printk 379 echo $LOGLEVEL > /proc/sys/kernel/printk
380 exit $RET 380 exit $RET
381 381
382 Q: Is this true that if I have a mounted filesystem on a USB device and 382 Q: Is this true that if I have a mounted filesystem on a USB device and
383 I suspend to disk, I can lose data unless the filesystem has been mounted 383 I suspend to disk, I can lose data unless the filesystem has been mounted
384 with "sync"? 384 with "sync"?
385 385
386 A: That's right ... if you disconnect that device, you may lose data. 386 A: That's right ... if you disconnect that device, you may lose data.
387 In fact, even with "-o sync" you can lose data if your programs have 387 In fact, even with "-o sync" you can lose data if your programs have
388 information in buffers they haven't written out to a disk you disconnect, 388 information in buffers they haven't written out to a disk you disconnect,
389 or if you disconnect before the device finished saving data you wrote. 389 or if you disconnect before the device finished saving data you wrote.
390 390
391 Software suspend normally powers down USB controllers, which is equivalent 391 Software suspend normally powers down USB controllers, which is equivalent
392 to disconnecting all USB devices attached to your system. 392 to disconnecting all USB devices attached to your system.
393 393
394 Your system might well support low-power modes for its USB controllers 394 Your system might well support low-power modes for its USB controllers
395 while the system is asleep, maintaining the connection, using true sleep 395 while the system is asleep, maintaining the connection, using true sleep
396 modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the 396 modes like "suspend-to-RAM" or "standby". (Don't write "disk" to the
397 /sys/power/state file; write "standby" or "mem".) We've not seen any 397 /sys/power/state file; write "standby" or "mem".) We've not seen any
398 hardware that can use these modes through software suspend, although in 398 hardware that can use these modes through software suspend, although in
399 theory some systems might support "platform" or "firmware" modes that 399 theory some systems might support "platform" or "firmware" modes that
400 won't break the USB connections. 400 won't break the USB connections.
401 401
402 Remember that it's always a bad idea to unplug a disk drive containing a 402 Remember that it's always a bad idea to unplug a disk drive containing a
403 mounted filesystem. That's true even when your system is asleep! The 403 mounted filesystem. That's true even when your system is asleep! The
404 safest thing is to unmount all filesystems on removable media (such as USB, 404 safest thing is to unmount all filesystems on removable media (such as USB,
405 Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays) 405 Firewire, CompactFlash, MMC, external SATA, or even IDE hotplug bays)
406 before suspending; then remount them after resuming. 406 before suspending; then remount them after resuming.
407 407
408 Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were 408 Q: I upgraded the kernel from 2.6.15 to 2.6.16. Both kernels were
409 compiled with the similar configuration files. Anyway I found that 409 compiled with the similar configuration files. Anyway I found that
410 suspend to disk (and resume) is much slower on 2.6.16 compared to 410 suspend to disk (and resume) is much slower on 2.6.16 compared to
411 2.6.15. Any idea for why that might happen or how can I speed it up? 411 2.6.15. Any idea for why that might happen or how can I speed it up?
412 412
413 A: This is because the size of the suspend image is now greater than 413 A: This is because the size of the suspend image is now greater than
414 for 2.6.15 (by saving more data we can get more responsive system 414 for 2.6.15 (by saving more data we can get more responsive system
415 after resume). 415 after resume).
416 416
417 There's the /sys/power/image_size knob that controls the size of the 417 There's the /sys/power/image_size knob that controls the size of the
418 image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as 418 image. If you set it to 0 (eg. by echo 0 > /sys/power/image_size as
419 root), the 2.6.15 behavior should be restored. If it is still too 419 root), the 2.6.15 behavior should be restored. If it is still too
420 slow, take a look at suspend.sf.net -- userland suspend is faster and 420 slow, take a look at suspend.sf.net -- userland suspend is faster and
421 supports LZF compression to speed it up further. 421 supports LZF compression to speed it up further.
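
A minimal sketch of using the image_size knob described above, so that the
previous value can be put back after resume. The function name
set_image_size and its optional second argument (a path defaulting to
/sys/power/image_size) are assumptions for this example; the path override
is there only so the function can be exercised on a regular file.

```shell
# Hypothetical helper: set the image size and print the previous value.
# $1 = new size
# $2 = knob file (assumed default: /sys/power/image_size)
set_image_size() {
	knob=${2:-/sys/power/image_size}
	old=0
	# without kernel support the knob is missing and "old" stays 0
	if [ -r "$knob" ]; then
		read -r old < "$knob" || true
	fi
	echo "$1" > "$knob" || return 1
	echo "$old"
}
```

One possible use is OLD=$(set_image_size 0) before suspending, followed by
set_image_size "$OLD" > /dev/null after resume.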
422 422
Documentation/prio_tree.txt
1 The prio_tree.c code indexes vmas using 3 different indexes: 1 The prio_tree.c code indexes vmas using 3 different indexes:
2 * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff 2 * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff
3 * radix_index = vm_pgoff : start_vm_pgoff 3 * radix_index = vm_pgoff : start_vm_pgoff
4 * size_index = vm_size_in_pages 4 * size_index = vm_size_in_pages
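
The three indexes above can be computed with plain shell arithmetic. This is
only an illustration of the formulas as written in this file; the function
name prio_indexes is made up for the example and does not exist in
prio_tree.c.

```shell
# Hypothetical helper: print "radix_index size_index heap_index"
# for a vma given its vm_pgoff ($1) and vm_size_in_pages ($2).
prio_indexes() {
	radix_index=$1                  # start_vm_pgoff
	size_index=$2                   # vm_size_in_pages
	heap_index=$(( $1 + $2 ))       # end_vm_pgoff
	echo "$radix_index $size_index $heap_index"
}
```

For instance, prio_indexes 1 6 yields the triple that the example figure
later in this file writes as [1,6,7].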
5 5
6 A regular radix-priority-search-tree indexes vmas using only heap_index and 6 A regular radix-priority-search-tree indexes vmas using only heap_index and
7 radix_index. The conditions for indexing are: 7 radix_index. The conditions for indexing are:
8 * ->heap_index >= ->left->heap_index && 8 * ->heap_index >= ->left->heap_index &&
9 ->heap_index >= ->right->heap_index 9 ->heap_index >= ->right->heap_index
10 * if (->heap_index == ->left->heap_index) 10 * if (->heap_index == ->left->heap_index)
11 then ->radix_index < ->left->radix_index; 11 then ->radix_index < ->left->radix_index;
12 * if (->heap_index == ->right->heap_index) 12 * if (->heap_index == ->right->heap_index)
13 then ->radix_index < ->right->radix_index; 13 then ->radix_index < ->right->radix_index;
14 * nodes are hashed to left or right subtree using radix_index 14 * nodes are hashed to left or right subtree using radix_index
15 similar to a pure binary radix tree. 15 similar to a pure binary radix tree.
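
The heap conditions above can be sketched as a small parent/child check. The
function name prio_parent_ok is hypothetical, and the sketch covers only the
heap_index ordering and the radix_index tie-break, not the hashing of nodes
into the left or right subtree.

```shell
# Hypothetical check: may a node with (heap, radix) = ($1, $2) be the
# parent of a child with (heap, radix) = ($3, $4)?
# Returns 0 (true) when the conditions listed above hold.
prio_parent_ok() {
	# strictly larger heap_index: always acceptable
	if [ "$1" -gt "$3" ]; then
		return 0
	fi
	# equal heap_index: parent must have the smaller radix_index
	[ "$1" -eq "$3" ] && [ "$2" -lt "$4" ]
}
```

With the nodes from the example figure, prio_parent_ok 7 0 7 1 succeeds
(root [0,7,7] over [1,6,7]), while swapping the two nodes fails.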
16 16
17 A regular radix-priority-search-tree helps to store and query 17 A regular radix-priority-search-tree helps to store and query
18 intervals (vmas). However, a regular radix-priority-search-tree is only 18 intervals (vmas). However, a regular radix-priority-search-tree is only
19 suitable for storing vmas with different radix indices (vm_pgoff). 19 suitable for storing vmas with different radix indices (vm_pgoff).
20 20
21 Therefore, the prio_tree.c extends the regular radix-priority-search-tree 21 Therefore, the prio_tree.c extends the regular radix-priority-search-tree
22 to handle many vmas with the same vm_pgoff. Such vmas are handled in 22 to handle many vmas with the same vm_pgoff. Such vmas are handled in
23 2 different ways: 1) All vmas with the same radix _and_ heap indices are 23 2 different ways: 1) All vmas with the same radix _and_ heap indices are
24 linked using vm_set.list, 2) if there are many vmas with the same radix 24 linked using vm_set.list, 2) if there are many vmas with the same radix
25 index, but different heap indices and if the regular radix-priority-search 25 index, but different heap indices and if the regular radix-priority-search
26 tree cannot index them all, we build an overflow-sub-tree that indexes such 26 tree cannot index them all, we build an overflow-sub-tree that indexes such
27 vmas using heap and size indices instead of heap and radix indices. For 27 vmas using heap and size indices instead of heap and radix indices. For
28 example, in the figure below some vmas with vm_pgoff = 0 (zero) are 28 example, in the figure below some vmas with vm_pgoff = 0 (zero) are
29 indexed by regular radix-priority-search-tree whereas others are pushed 29 indexed by regular radix-priority-search-tree whereas others are pushed
30 into an overflow-subtree. Note that all vmas in an overflow-sub-tree have 30 into an overflow-subtree. Note that all vmas in an overflow-sub-tree have
31 the same vm_pgoff (radix_index) and if necessary we build different 31 the same vm_pgoff (radix_index) and if necessary we build different
32 overflow-sub-trees to handle each possible radix_index. For example, 32 overflow-sub-trees to handle each possible radix_index. For example,
33 in figure we have 3 overflow-sub-trees corresponding to radix indices 33 in figure we have 3 overflow-sub-trees corresponding to radix indices
34 0, 2, and 4. 34 0, 2, and 4.
35 35
36 In the final tree the first few (prio_tree_root->index_bits) levels 36 In the final tree the first few (prio_tree_root->index_bits) levels
37 are indexed using heap and radix indices whereas the overflow-sub-trees below 37 are indexed using heap and radix indices whereas the overflow-sub-trees below
38 those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are 38 those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are
39 indexed using heap and size indices. In overflow-sub-trees the size_index 39 indexed using heap and size indices. In overflow-sub-trees the size_index
40 is used for hashing the nodes to appropriate places. 40 is used for hashing the nodes to appropriate places.
41 41
42 Now, an example prio_tree: 42 Now, an example prio_tree:
43 43
44 vmas are represented [radix_index, size_index, heap_index] 44 vmas are represented [radix_index, size_index, heap_index]
45 i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff] 45 i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff]
46 46
47 level prio_tree_root->index_bits = 3 47 level prio_tree_root->index_bits = 3
48 ----- 48 -----
49 _ 49 _
50 0 [0,7,7] | 50 0 [0,7,7] |
51 / \ | 51 / \ |
52 ------------------ ------------ | Regular 52 ------------------ ------------ | Regular
53 / \ | radix priority 53 / \ | radix priority
54 1 [1,6,7] [4,3,7] | search tree 54 1 [1,6,7] [4,3,7] | search tree
55 / \ / \ | 55 / \ / \ |
56 ------- ----- ------ ----- | heap-and-radix 56 ------- ----- ------ ----- | heap-and-radix
57 / \ / \ | indexed 57 / \ / \ | indexed
58 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] | 58 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] |
59 / \ / \ / \ / \ | 59 / \ / \ / \ / \ |
60 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] | 60 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] |
61 / / / _ 61 / / / _
62 / / / _ 62 / / / _
63 4 [0,4,4] [2,3,5] [4,1,5] | 63 4 [0,4,4] [2,3,5] [4,1,5] |
64 / / / | 64 / / / |
65 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees 65 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees
66 / / | 66 / / |
67 6 [0,2,2] [2,1,3] | heap-and-size 67 6 [0,2,2] [2,1,3] | heap-and-size
68 / / | indexed 68 / / | indexed
69 7 [0,1,1] [2,0,2] | 69 7 [0,1,1] [2,0,2] |
70 / | 70 / |
71 8 [0,0,0] | 71 8 [0,0,0] |
72 _ 72 _
73 73
74 Note that we use prio_tree_root->index_bits to optimize the height 74 Note that we use prio_tree_root->index_bits to optimize the height
75 of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is 75 of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is
76 set according to the maximum end_vm_pgoff mapped, we are sure that all 76 set according to the maximum end_vm_pgoff mapped, we are sure that all
77 bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore, 77 bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore,
78 we only use the first prio_tree_root->index_bits as radix_index. 78 we only use the first prio_tree_root->index_bits as radix_index.
79 Whenever index_bits is increased in prio_tree_expand, we shuffle the tree 79 Whenever index_bits is increased in prio_tree_expand, we shuffle the tree
80 to make sure that the first prio_tree_root->index_bits levels of the tree 80 to make sure that the first prio_tree_root->index_bits levels of the tree
81 is indexed properly using heap and radix indices. 81 is indexed properly using heap and radix indices.
82 82
83 We do not optimize the height of overflow-sub-trees using index_bits. 83 We do not optimize the height of overflow-sub-trees using index_bits.
84 The reason is: there can be many such overflow-sub-trees and all of 84 The reason is: there can be many such overflow-sub-trees and all of
85 them have to be shuffled whenever the index_bits increases. This may involve 85 them have to be shuffled whenever the index_bits increases. This may involve
86 walking the whole prio_tree in prio_tree_insert->prio_tree_expand code 86 walking the whole prio_tree in prio_tree_insert->prio_tree_expand code
87 path which is not desirable. Hence, we do not optimize the height of the 87 path which is not desirable. Hence, we do not optimize the height of the
88 heap-and-size indexed overflow-sub-trees using prio_tree->index_bits. 88 heap-and-size indexed overflow-sub-trees using prio_tree->index_bits.
89 Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits 89 Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits
90 of size_index. This may lead to skewed sub-trees because most of the 90 of size_index. This may lead to skewed sub-trees because most of the
91 higher significant bits of the size_index are likely to be be 0 (zero). In 91 higher significant bits of the size_index are likely to be 0 (zero). In
92 the example above, all 3 overflow-sub-trees are skewed. This may marginally 92 the example above, all 3 overflow-sub-trees are skewed. This may marginally
93 affect the performance. However, processes rarely map many vmas with the 93 affect the performance. However, processes rarely map many vmas with the
94 same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally 94 same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally
95 do not require overflow-sub-trees to index all vmas. 95 do not require overflow-sub-trees to index all vmas.
96 96
97 From the above discussion it is clear that the maximum height of 97 From the above discussion it is clear that the maximum height of
98 a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG. 98 a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG.
99 However, in most of the common cases we do not need overflow-sub-trees, 99 However, in most of the common cases we do not need overflow-sub-trees,
100 so the tree height in the common cases will be prio_tree_root->index_bits. 100 so the tree height in the common cases will be prio_tree_root->index_bits.
101 101
102 It is fair to mention here that the prio_tree_root->index_bits 102 It is fair to mention here that the prio_tree_root->index_bits
103 is increased on demand, however, the index_bits is not decreased when 103 is increased on demand, however, the index_bits is not decreased when
104 vmas are removed from the prio_tree. That's tricky to do. Hence, it's 104 vmas are removed from the prio_tree. That's tricky to do. Hence, it's
105 left as a home work problem. 105 left as a home work problem.
106 106
107 107
108 108
Documentation/rpc-cache.txt
1 This document gives a brief introduction to the caching 1 This document gives a brief introduction to the caching
2 mechanisms in the sunrpc layer that is used, in particular, 2 mechanisms in the sunrpc layer that is used, in particular,
3 for NFS authentication. 3 for NFS authentication.
4 4
5 CACHES 5 CACHES
6 ====== 6 ======
7 The caching replaces the old exports table and allows for 7 The caching replaces the old exports table and allows for
8 a wide variety of values to be cached. 8 a wide variety of values to be cached.
9 9
10 There are a number of caches that are similar in structure though 10 There are a number of caches that are similar in structure though
11 quite possibly very different in content and use. There is a corpus 11 quite possibly very different in content and use. There is a corpus
12 of common code for managing these caches. 12 of common code for managing these caches.
13 13
14 Examples of caches that are likely to be needed are: 14 Examples of caches that are likely to be needed are:
15 - mapping from IP address to client name 15 - mapping from IP address to client name
16 - mapping from client name and filesystem to export options 16 - mapping from client name and filesystem to export options
17 - mapping from UID to list of GIDs, to work around NFS's limitation 17 - mapping from UID to list of GIDs, to work around NFS's limitation
18 of 16 gids. 18 of 16 gids.
19 - mappings between local UID/GID and remote UID/GID for sites that 19 - mappings between local UID/GID and remote UID/GID for sites that
20 do not have uniform uid assignment 20 do not have uniform uid assignment
21 - mapping from network identity to public key for crypto authentication. 21 - mapping from network identity to public key for crypto authentication.
22 22
23 The common code handles such things as: 23 The common code handles such things as:
24 - general cache lookup with correct locking 24 - general cache lookup with correct locking
25 - supporting 'NEGATIVE' as well as positive entries 25 - supporting 'NEGATIVE' as well as positive entries
26 - allowing an EXPIRED time on cache items, and removing 26 - allowing an EXPIRED time on cache items, and removing
27 items after they expire, and are no longer in-use. 27 items after they expire, and are no longer in-use.
28 - making requests to user-space to fill in cache entries 28 - making requests to user-space to fill in cache entries
29 - allowing user-space to directly set entries in the cache 29 - allowing user-space to directly set entries in the cache
30 - delaying RPC requests that depend on as-yet incomplete 30 - delaying RPC requests that depend on as-yet incomplete
31 cache entries, and replaying those requests when the cache entry 31 cache entries, and replaying those requests when the cache entry
32 is complete. 32 is complete.
33 - clean out old entries as they expire. 33 - clean out old entries as they expire.
34 34
35 Creating a Cache 35 Creating a Cache
36 ---------------- 36 ----------------
37 37
38 1/ A cache needs a datum to store. This is in the form of a 38 1/ A cache needs a datum to store. This is in the form of a
39 structure definition that must contain a 39 structure definition that must contain a
40 struct cache_head 40 struct cache_head
41 as an element, usually the first. 41 as an element, usually the first.
42 It will also contain a key and some content. 42 It will also contain a key and some content.
43 Each cache element is reference counted and contains 43 Each cache element is reference counted and contains
44 expiry and update times for use in cache management. 44 expiry and update times for use in cache management.
45 2/ A cache needs a "cache_detail" structure that 45 2/ A cache needs a "cache_detail" structure that
46 describes the cache. This stores the hash table, some 46 describes the cache. This stores the hash table, some
47 parameters for cache management, and some operations detailing how 47 parameters for cache management, and some operations detailing how
48 to work with particular cache items. 48 to work with particular cache items.
49 The operations required are: 49 The operations required are:
50 struct cache_head *alloc(void) 50 struct cache_head *alloc(void)
51 This simply allocates appropriate memory and returns 51 This simply allocates appropriate memory and returns
52 a pointer to the cache_head embedded within the 52 a pointer to the cache_head embedded within the
53 structure 53 structure
54 void cache_put(struct kref *) 54 void cache_put(struct kref *)
55 This is called when the last reference to an item is 55 This is called when the last reference to an item is
56 is dropped. The pointer passed is to the 'ref' field 56 dropped. The pointer passed is to the 'ref' field
57 in the cache_head. cache_put should release any 57 in the cache_head. cache_put should release any
58 references created by 'cache_init' and, if CACHE_VALID 58 references created by 'cache_init' and, if CACHE_VALID
59 is set, any references created by cache_update. 59 is set, any references created by cache_update.
60 It should then release the memory allocated by 60 It should then release the memory allocated by
61 'alloc'. 61 'alloc'.
62 int match(struct cache_head *orig, struct cache_head *new) 62 int match(struct cache_head *orig, struct cache_head *new)
63 test if the keys in the two structures match. Return 63 test if the keys in the two structures match. Return
64 1 if they do, 0 if they don't. 64 1 if they do, 0 if they don't.
65 void init(struct cache_head *orig, struct cache_head *new) 65 void init(struct cache_head *orig, struct cache_head *new)
66 Set the 'key' fields in 'new' from 'orig'. This may 66 Set the 'key' fields in 'new' from 'orig'. This may
67 include taking references to shared objects. 67 include taking references to shared objects.
68 void update(struct cache_head *orig, struct cache_head *new) 68 void update(struct cache_head *orig, struct cache_head *new)
69 Set the 'content' fields in 'new' from 'orig'. 69 Set the 'content' fields in 'new' from 'orig'.
70 int cache_show(struct seq_file *m, struct cache_detail *cd, 70 int cache_show(struct seq_file *m, struct cache_detail *cd,
71 struct cache_head *h) 71 struct cache_head *h)
72 Optional. Used to provide a /proc file that lists the 72 Optional. Used to provide a /proc file that lists the
73 contents of a cache. This should show one item, 73 contents of a cache. This should show one item,
74 usually on just one line. 74 usually on just one line.
75 int cache_request(struct cache_detail *cd, struct cache_head *h, 75 int cache_request(struct cache_detail *cd, struct cache_head *h,
76 char **bpp, int *blen) 76 char **bpp, int *blen)
77 Format a request to be sent to user-space for an item 77 Format a request to be sent to user-space for an item
78 to be instantiated. *bpp is a buffer of size *blen. 78 to be instantiated. *bpp is a buffer of size *blen.
79 bpp should be moved forward over the encoded message, 79 bpp should be moved forward over the encoded message,
80 and *blen should be reduced to show how much free 80 and *blen should be reduced to show how much free
81 space remains. Return 0 on success or <0 if not 81 space remains. Return 0 on success or <0 if not
82 enough room or other problem. 82 enough room or other problem.
83 int cache_parse(struct cache_detail *cd, char *buf, int len) 83 int cache_parse(struct cache_detail *cd, char *buf, int len)
84 A message from user space has arrived to fill out a 84 A message from user space has arrived to fill out a
85 cache entry. It is in 'buf' of length 'len'. 85 cache entry. It is in 'buf' of length 'len'.
86 cache_parse should parse this, find the item in the 86 cache_parse should parse this, find the item in the
87 cache with sunrpc_cache_lookup, and update the item 87 cache with sunrpc_cache_lookup, and update the item
88 with sunrpc_cache_update. 88 with sunrpc_cache_update.
89 89
90 90
91 3/ A cache needs to be registered using cache_register(). This 91 3/ A cache needs to be registered using cache_register(). This
92 includes it on a list of caches that will be regularly 92 includes it on a list of caches that will be regularly
93 cleaned to discard old data. 93 cleaned to discard old data.
94 94
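As a rough illustration of steps 1/ and 2/ above, the following user-space
sketch shows an item with an embedded head and the match, init and update
operations. It is not kernel code: struct cache_head here is a cut-down
stand-in for the kernel structure, and struct ip_map, item_of() and the
ip_map_* names are hypothetical, chosen only for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for the kernel's struct cache_head (assumption:
 * only the fields mentioned in the text are modelled here). */
struct cache_head {
	long expiry_time;
	long last_refresh;
	int refcount;
};

/* Hypothetical item: maps an IP address (key) to a client name (content). */
struct ip_map {
	struct cache_head h;	/* embedded head, placed first */
	char addr[16];		/* key */
	char client[32];	/* content */
};

/* Recover the enclosing item from a pointer to its embedded head. */
#define item_of(head) \
	((struct ip_map *)((char *)(head) - offsetof(struct ip_map, h)))

/* alloc: allocate an item, return a pointer to its embedded head. */
static struct cache_head *ip_map_alloc(void)
{
	struct ip_map *im = calloc(1, sizeof(*im));

	return im ? &im->h : NULL;
}

/* match: 1 if the keys in the two items are equal, 0 if not. */
static int ip_map_match(struct cache_head *orig, struct cache_head *new_item)
{
	return strcmp(item_of(orig)->addr, item_of(new_item)->addr) == 0;
}

/* init: set the key fields in 'new_item' from 'orig'. */
static void ip_map_init(struct cache_head *orig, struct cache_head *new_item)
{
	memcpy(item_of(new_item)->addr, item_of(orig)->addr,
	       sizeof(item_of(orig)->addr));
}

/* update: set the content fields in 'new_item' from 'orig'. */
static void ip_map_update(struct cache_head *orig, struct cache_head *new_item)
{
	memcpy(item_of(new_item)->client, item_of(orig)->client,
	       sizeof(item_of(orig)->client));
}
```

Keeping the head as the first member makes the conversion between head and
item trivial, but item_of() works for any member position.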
95 Using a cache 95 Using a cache
96 ------------- 96 -------------
97 97
98 To find a value in a cache, call sunrpc_cache_lookup passing a pointer 98 To find a value in a cache, call sunrpc_cache_lookup passing a pointer
99 to the cache_head in a sample item with the 'key' fields filled in. 99 to the cache_head in a sample item with the 'key' fields filled in.
100 This will be passed to ->match to identify the target entry. If no 100 This will be passed to ->match to identify the target entry. If no
101 entry is found, a new entry will be created, added to the cache, and 101 entry is found, a new entry will be created, added to the cache, and
102 marked as not containing valid data. 102 marked as not containing valid data.
103 103
104 The item returned is typically passed to cache_check which will check 104 The item returned is typically passed to cache_check which will check
105 if the data is valid, and may initiate an up-call to get fresh data. 105 if the data is valid, and may initiate an up-call to get fresh data.
106 cache_check will return -ENOENT if the entry is negative or if an up 106 cache_check will return -ENOENT if the entry is negative or if an up
107 call is needed but not possible, -EAGAIN if an upcall is pending, 107 call is needed but not possible, -EAGAIN if an upcall is pending,
108 or 0 if the data is valid. 108 or 0 if the data is valid.
109 109
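The three documented return values can be handled with a simple dispatch.
The sketch below is user-space C and only models the decision; the enum and
the function name action_for() are hypothetical, and any other negative
value is treated as a plain failure by assumption.

```c
#include <errno.h>

/* Hypothetical names for what a caller does after a cache_check()-style
 * call; they are not part of the sunrpc API. */
enum lookup_action {
	USE_ENTRY,	/* 0: data is valid, go ahead */
	DENY_REQUEST,	/* -ENOENT: negative entry, or upcall impossible */
	DEFER_REQUEST,	/* -EAGAIN: upcall pending, retry later */
	FAIL_REQUEST	/* anything else (assumption) */
};

/* Map a return code to the caller's next step. */
static enum lookup_action action_for(int rc)
{
	switch (rc) {
	case 0:
		return USE_ENTRY;
	case -ENOENT:
		return DENY_REQUEST;
	case -EAGAIN:
		return DEFER_REQUEST;
	default:
		return FAIL_REQUEST;
	}
}
```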
110 cache_check can be passed a "struct cache_req *". This structure is 110 cache_check can be passed a "struct cache_req *". This structure is
111 typically embedded in the actual request and can be used to create a 111 typically embedded in the actual request and can be used to create a
112 deferred copy of the request (struct cache_deferred_req). This is 112 deferred copy of the request (struct cache_deferred_req). This is
113 done when the found cache item is not up to date, but there is reason to 113 done when the found cache item is not up to date, but there is reason to
114 believe that userspace might provide information soon. When the cache 114 believe that userspace might provide information soon. When the cache
115 item does become valid, the deferred copy of the request will be 115 item does become valid, the deferred copy of the request will be
116 revisited (->revisit). It is expected that this method will 116 revisited (->revisit). It is expected that this method will
117 reschedule the request for processing. 117 reschedule the request for processing.
118 118
119 The value returned by sunrpc_cache_lookup can also be passed to 119 The value returned by sunrpc_cache_lookup can also be passed to
120 sunrpc_cache_update to set the content for the item. A second item is 120 sunrpc_cache_update to set the content for the item. A second item is
121 passed which should hold the content. If the item found by _lookup 121 passed which should hold the content. If the item found by _lookup
122 has valid data, then it is discarded and a new item is created. This 122 has valid data, then it is discarded and a new item is created. This
123 saves any user of an item from worrying about content changing while 123 saves any user of an item from worrying about content changing while
124 it is being inspected. If the item found by _lookup does not contain 124 it is being inspected. If the item found by _lookup does not contain
125 valid data, then the content is copied across and CACHE_VALID is set. 125 valid data, then the content is copied across and CACHE_VALID is set.
126 126
127 Populating a cache 127 Populating a cache
128 ------------------ 128 ------------------
129 129
130 Each cache has a name, and when the cache is registered, a directory 130 Each cache has a name, and when the cache is registered, a directory
131 with that name is created in /proc/net/rpc 131 with that name is created in /proc/net/rpc
132 132
133 This directory contains a file called 'channel' which is a channel 133 This directory contains a file called 'channel' which is a channel
134 for communicating between kernel and user for populating the cache. 134 for communicating between kernel and user for populating the cache.
135 This directory may later contain other files for interacting 135 This directory may later contain other files for interacting
136 with the cache. 136 with the cache.
137 137
138 The 'channel' works a bit like a datagram socket. Each 'write' is 138 The 'channel' works a bit like a datagram socket. Each 'write' is
139 passed as a whole to the cache for parsing and interpretation. 139 passed as a whole to the cache for parsing and interpretation.
140 Each cache can treat the write requests differently, but it is 140 Each cache can treat the write requests differently, but it is
141 expected that a message written will contain: 141 expected that a message written will contain:
142 - a key 142 - a key
143 - an expiry time 143 - an expiry time
144 - a content. 144 - a content.
145 with the intention that an item in the cache with the given key 145 with the intention that an item in the cache with the given key
146 should be created or updated to have the given content, and the 146 should be created or updated to have the given content, and the
147 expiry time should be set on that item. 147 expiry time should be set on that item.
148 148
149 Reading from a channel is a bit more interesting. When a cache 149 Reading from a channel is a bit more interesting. When a cache
150 lookup fails, or when it succeeds but finds an entry that may soon 150 lookup fails, or when it succeeds but finds an entry that may soon
151 expire, a request is lodged for that cache item to be updated by 151 expire, a request is lodged for that cache item to be updated by
152 user-space. These requests appear in the channel file. 152 user-space. These requests appear in the channel file.
153 153
154 Successive reads will return successive requests. 154 Successive reads will return successive requests.
155 If there are no more requests to return, read will return EOF, but a 155 If there are no more requests to return, read will return EOF, but a
156 select or poll for read will block waiting for another request to be 156 select or poll for read will block waiting for another request to be
157 added. 157 added.
158 158
159 Thus a user-space helper is likely to: 159 Thus a user-space helper is likely to:
160 open the channel. 160 open the channel.
161 select for readable 161 select for readable
162 read a request 162 read a request
163 write a response 163 write a response
164 loop. 164 loop.
165 165
166 If it dies and needs to be restarted, any requests that have not been 166 If it dies and needs to be restarted, any requests that have not been
167 answered will still appear in the file and will be read by the new 167 answered will still appear in the file and will be read by the new
168 instance of the helper. 168 instance of the helper.
169 169
170 Each cache should define a "cache_parse" method which takes a message 170 Each cache should define a "cache_parse" method which takes a message
171 written from user-space and processes it. It should return an error 171 written from user-space and processes it. It should return an error
172 (which propagates back to the write syscall) or 0. 172 (which propagates back to the write syscall) or 0.
173 173
174 Each cache should also define a "cache_request" method which 174 Each cache should also define a "cache_request" method which
175 takes a cache item and encodes a request into the buffer 175 takes a cache item and encodes a request into the buffer
176 provided. 176 provided.
177 177
178 Note: If a cache has no active readers on the channel, and has had no 178 Note: If a cache has no active readers on the channel, and has had no
179 active readers for more than 60 seconds, further requests will not be 179 active readers for more than 60 seconds, further requests will not be
180 added to the channel but instead all lookups that do not find a valid 180 added to the channel but instead all lookups that do not find a valid
181 entry will fail. This is partly for backward compatibility: The 181 entry will fail. This is partly for backward compatibility: The
182 previous nfs exports table was deemed to be authoritative and a 182 previous nfs exports table was deemed to be authoritative and a
183 failed lookup meant a definite 'no'. 183 failed lookup meant a definite 'no'.
184 184
185 request/response format 185 request/response format
186 ----------------------- 186 -----------------------
187 187
188 While each cache is free to use its own format for requests 188 While each cache is free to use its own format for requests
189 and responses over channel, the following is recommended as 189 and responses over channel, the following is recommended as
190 appropriate and support routines are available to help: 190 appropriate and support routines are available to help:
191 Each request or response record should be printable ASCII 191 Each request or response record should be printable ASCII
192 with precisely one newline character which should be at the end. 192 with precisely one newline character which should be at the end.
193 Fields within the record should be separated by spaces, normally one. 193 Fields within the record should be separated by spaces, normally one.
194 If spaces, newlines, or nul characters are needed in a field they 194 If spaces, newlines, or nul characters are needed in a field they
195 must be quoted. Two mechanisms are available: 195 must be quoted. Two mechanisms are available:
196 1/ If a field begins '\x' then it must contain an even number of 196 1/ If a field begins '\x' then it must contain an even number of
197 hex digits, and pairs of these digits provide the bytes in the 197 hex digits, and pairs of these digits provide the bytes in the
198 field. 198 field.
199 2/ otherwise a \ in the field must be followed by 3 octal digits 199 2/ otherwise a \ in the field must be followed by 3 octal digits
200 which give the code for a byte. Other characters are treated 200 which give the code for a byte. Other characters are treated
201 as themselves. At the very least, space, newline, nul, and 201 as themselves. At the very least, space, newline, nul, and
202 '\' must be quoted in this way. 202 '\' must be quoted in this way.
203 203
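The two quoting mechanisms above can be illustrated with a small user-space
decoder. This is a sketch written from the description only, not the
kernel's support routine; decode_field() and hex_val() are hypothetical
names, and restricting the first octal digit to 0-3 (so the value fits in
one byte) is an assumption.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if the character is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Decode one nul-terminated field (already split on spaces) into 'out'.
 * Returns the number of bytes produced, or -1 if the field is malformed
 * or does not fit in 'cap' bytes. */
static int decode_field(const char *field, unsigned char *out, size_t cap)
{
	const char *p = field;
	size_t n = 0;

	if (p[0] == '\\' && p[1] == 'x') {
		/* Mechanism 1: "\x" then an even number of hex digits. */
		for (p += 2; *p; p += 2) {
			int hi = hex_val(p[0]);
			int lo = (hi < 0) ? -1 : hex_val(p[1]);

			if (lo < 0 || n >= cap)
				return -1;
			out[n++] = (unsigned char)(hi << 4 | lo);
		}
		return (int)n;
	}

	/* Mechanism 2: "\ooo" octal escapes, other characters literal. */
	while (*p) {
		unsigned int byte;

		if (*p == '\\') {
			if (p[1] < '0' || p[1] > '3' ||
			    p[2] < '0' || p[2] > '7' ||
			    p[3] < '0' || p[3] > '7')
				return -1;
			byte = (unsigned int)((p[1] - '0') << 6 |
					      (p[2] - '0') << 3 |
					      (p[3] - '0'));
			p += 4;
		} else {
			byte = (unsigned char)*p++;
		}
		if (n >= cap)
			return -1;
		out[n++] = (unsigned char)byte;
	}
	return (int)n;
}
```

With this sketch the field \x6162 decodes to the two bytes "ab", and a\040b
decodes to "a b".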
Documentation/s390/Debugging390.txt
1 1
2 Debugging on Linux for s/390 & z/Architecture 2 Debugging on Linux for s/390 & z/Architecture
3 by 3 by
4 Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com) 4 Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com)
5 Copyright (C) 2000-2001 IBM Deutschland Entwicklung GmbH, IBM Corporation 5 Copyright (C) 2000-2001 IBM Deutschland Entwicklung GmbH, IBM Corporation
6 Best viewed with fixed width fonts 6 Best viewed with fixed width fonts
7 7
8 Overview of Document: 8 Overview of Document:
9 ===================== 9 =====================
10 This document is intended to give a good overview of how to debug 10 This document is intended to give a good overview of how to debug
11 Linux for s/390 & z/Architecture. It isn't intended as a complete reference & not a 11 Linux for s/390 & z/Architecture. It isn't intended as a complete reference & not a
12 tutorial on the fundamentals of C & assembly. It doesn't go into 12 tutorial on the fundamentals of C & assembly. It doesn't go into
13 390 IO in any detail. It is intended to complement the documents in the 13 390 IO in any detail. It is intended to complement the documents in the
14 reference section below & any other worthwhile references you get. 14 reference section below & any other worthwhile references you get.
15 15
16 It is intended like the Enterprise Systems Architecture/390 Reference Summary 16 It is intended like the Enterprise Systems Architecture/390 Reference Summary
17 to be printed out & used as a quick cheat sheet self help style reference when 17 to be printed out & used as a quick cheat sheet self help style reference when
18 problems occur. 18 problems occur.
19 19
20 Contents 20 Contents
21 ======== 21 ========
22 Register Set 22 Register Set
23 Address Spaces on Intel Linux 23 Address Spaces on Intel Linux
24 Address Spaces on Linux for s/390 & z/Architecture 24 Address Spaces on Linux for s/390 & z/Architecture
25 The Linux for s/390 & z/Architecture Kernel Task Structure 25 The Linux for s/390 & z/Architecture Kernel Task Structure
26 Register Usage & Stackframes on Linux for s/390 & z/Architecture 26 Register Usage & Stackframes on Linux for s/390 & z/Architecture
27 A sample program with comments 27 A sample program with comments
28 Compiling programs for debugging on Linux for s/390 & z/Architecture 28 Compiling programs for debugging on Linux for s/390 & z/Architecture
29 Figuring out gcc compile errors 29 Figuring out gcc compile errors
30 Debugging Tools 30 Debugging Tools
31 objdump 31 objdump
32 strace 32 strace
33 Performance Debugging 33 Performance Debugging
34 Debugging under VM 34 Debugging under VM
35 s/390 & z/Architecture IO Overview 35 s/390 & z/Architecture IO Overview
36 Debugging IO on s/390 & z/Architecture under VM 36 Debugging IO on s/390 & z/Architecture under VM
37 GDB on s/390 & z/Architecture 37 GDB on s/390 & z/Architecture
38 Stack chaining in gdb by hand 38 Stack chaining in gdb by hand
39 Examining core dumps 39 Examining core dumps
40 ldd 40 ldd
41 Debugging modules 41 Debugging modules
42 The proc file system 42 The proc file system
43 Starting points for debugging scripting languages etc. 43 Starting points for debugging scripting languages etc.
44 Dumptool & Lcrash 44 Dumptool & Lcrash
45 SysRq 45 SysRq
46 References 46 References
47 Special Thanks 47 Special Thanks
48 48
49 Register Set 49 Register Set
50 ============ 50 ============
51 The current architectures have the following registers. 51 The current architectures have the following registers.
52 52
53 16 General purpose registers, 32 bit on s/390 64 bit on z/Architecture, r0-r15 or gpr0-gpr15 used for arithmetic & addressing. 53 16 General purpose registers, 32 bit on s/390 64 bit on z/Architecture, r0-r15 or gpr0-gpr15 used for arithmetic & addressing.
54 54
55 16 Control registers, 32 bit on s/390 64 bit on z/Architecture, ( cr0-cr15 kernel usage only ) used for memory management, 55 16 Control registers, 32 bit on s/390 64 bit on z/Architecture, ( cr0-cr15 kernel usage only ) used for memory management,
56 interrupt control,debugging control etc. 56 interrupt control,debugging control etc.
57 57
58 16 Access registers ( ar0-ar15 ) 32 bit on s/390 & z/Architecture 58 16 Access registers ( ar0-ar15 ) 32 bit on s/390 & z/Architecture
59 not used by normal programs but potentially could 59 not used by normal programs but potentially could
60 be used as temporary storage. Their main purpose is their 1 to 1 60 be used as temporary storage. Their main purpose is their 1 to 1
61 association with general purpose registers and are used in 61 association with general purpose registers and are used in
62 the kernel for copying data between kernel & user address spaces. 62 the kernel for copying data between kernel & user address spaces.
63 Access register 0 ( & access register 1 on z/Architecture ( needs 64 bit 63 Access register 0 ( & access register 1 on z/Architecture ( needs 64 bit
64 pointer ) ) is currently used by the pthread library as a pointer to 64 pointer ) ) is currently used by the pthread library as a pointer to
65 the currently running thread's private area. 65 the currently running thread's private area.
66 66
67 16 64 bit floating point registers (fp0-fp15 ) IEEE & HFP floating 67 16 64 bit floating point registers (fp0-fp15 ) IEEE & HFP floating
68 point format compliant on G5 upwards & a Floating point control reg (FPC) 68 point format compliant on G5 upwards & a Floating point control reg (FPC)
69 4 64 bit registers (fp0,fp2,fp4 & fp6) HFP only on older machines. 69 4 64 bit registers (fp0,fp2,fp4 & fp6) HFP only on older machines.
70 Note: 70 Note:
71 Linux (currently) always uses IEEE & emulates G5 IEEE format on older machines, 71 Linux (currently) always uses IEEE & emulates G5 IEEE format on older machines,
72 ( provided the kernel is configured for this ). 72 ( provided the kernel is configured for this ).
73 73
74 74
75 The PSW is the most important register on the machine it 75 The PSW is the most important register on the machine it
76 is 64 bit on s/390 & 128 bit on z/Architecture & serves the roles of 76 is 64 bit on s/390 & 128 bit on z/Architecture & serves the roles of
77 a program counter (pc), condition code register,memory space designator. 77 a program counter (pc), condition code register,memory space designator.
78 In IBM standard notation I am counting bit 0 as the MSB. 78 In IBM standard notation I am counting bit 0 as the MSB.
79 It has several advantages over a normal program counter 79 It has several advantages over a normal program counter
80 in that you can change address translation & program counter 80 in that you can change address translation & program counter
81 in a single instruction. To change address translation, 81 in a single instruction. To change address translation,
82 e.g. switching address translation off requires that you 82 e.g. switching address translation off requires that you
83 have a logical=physical mapping for the address you are 83 have a logical=physical mapping for the address you are
84 currently running at. 84 currently running at.
85 85
86 Bit Value 86 Bit Value
87 s/390 z/Architecture 87 s/390 z/Architecture
88 0 0 Reserved ( must be 0 ) otherwise specification exception occurs. 88 0 0 Reserved ( must be 0 ) otherwise specification exception occurs.
89 89
90 1 1 Program Event Recording 1 PER enabled, 90 1 1 Program Event Recording 1 PER enabled,
91 PER is used to facilitate debugging e.g. single stepping. 91 PER is used to facilitate debugging e.g. single stepping.
92 92
93 2-4 2-4 Reserved ( must be 0 ). 93 2-4 2-4 Reserved ( must be 0 ).
94 94
95 5 5 Dynamic address translation 1=DAT on. 95 5 5 Dynamic address translation 1=DAT on.
96 96
97 6 6 Input/Output interrupt Mask 97 6 6 Input/Output interrupt Mask
98 98
99 7 7 External interrupt Mask used primarily for interprocessor signalling & 99 7 7 External interrupt Mask used primarily for interprocessor signalling &
100 clock interrupts. 100 clock interrupts.
101 101
102 8-11 8-11 PSW Key used for complex memory protection mechanism not used under linux 102 8-11 8-11 PSW Key used for complex memory protection mechanism not used under linux
103 103
104 12 12 1 on s/390 0 on z/Architecture 104 12 12 1 on s/390 0 on z/Architecture
105 105
106 13 13 Machine Check Mask 1=enable machine check interrupts 106 13 13 Machine Check Mask 1=enable machine check interrupts
107 107
108 14 14 Wait State set this to 1 to stop the processor except for interrupts & give 108 14 14 Wait State set this to 1 to stop the processor except for interrupts & give
109 time to other LPARS used in CPU idle in the kernel to increase overall 109 time to other LPARS used in CPU idle in the kernel to increase overall
110 usage of processor resources. 110 usage of processor resources.
111 111
112 15 15 Problem state ( if set to 1 certain instructions are disabled ) 112 15 15 Problem state ( if set to 1 certain instructions are disabled )
113 all linux user programs run with this bit 1 113 all linux user programs run with this bit 1
114 ( useful info for debugging under VM ). 114 ( useful info for debugging under VM ).
115 115
116 16-17 16-17 Address Space Control 116 16-17 16-17 Address Space Control
117 117
118 00 Primary Space Mode when DAT on 118 00 Primary Space Mode when DAT on
119 The linux kernel currently runs in this mode, CR1 is affiliated with 119 The linux kernel currently runs in this mode, CR1 is affiliated with
120 this mode & points to the primary segment table origin etc. 120 this mode & points to the primary segment table origin etc.
121 121
122 01 Access register mode this mode is used in functions to 122 01 Access register mode this mode is used in functions to
123 copy data between kernel & user space. 123 copy data between kernel & user space.
124 124
125 10 Secondary space mode not used in linux however CR7 the 125 10 Secondary space mode not used in linux however CR7 the
126 register affiliated with this mode is & this & normally 126 register affiliated with this mode is & this & normally
127 CR13=CR7 to allow us to copy data between kernel & user space. 127 CR13=CR7 to allow us to copy data between kernel & user space.
128 We do this as follows: 128 We do this as follows:
129 We set ar2 to 0 to designate its 129 We set ar2 to 0 to designate its
130 affiliated gpr ( gpr2 )to point to primary=kernel space. 130 affiliated gpr ( gpr2 )to point to primary=kernel space.
131 We set ar4 to 1 to designate its 131 We set ar4 to 1 to designate its
132 affiliated gpr ( gpr4 ) to point to secondary=home=user space 132 affiliated gpr ( gpr4 ) to point to secondary=home=user space
133 & then essentially do a memcopy(gpr2,gpr4,size) to 133 & then essentially do a memcopy(gpr2,gpr4,size) to
134 copy data between the address spaces, the reason we use home space for the 134 copy data between the address spaces, the reason we use home space for the
135 kernel & don't keep secondary space free is that code will not run in 135 kernel & don't keep secondary space free is that code will not run in
136 secondary space. 136 secondary space.
137 137
138 11 Home Space Mode all user programs run in this mode. 138 11 Home Space Mode all user programs run in this mode.
139 it is affiliated with CR13. 139 it is affiliated with CR13.
140 140
141 18-19 18-19 Condition codes (CC) 141 18-19 18-19 Condition codes (CC)
142 142
143 20 20 Fixed point overflow mask if 1=FPU exceptions for this event 143 20 20 Fixed point overflow mask if 1=FPU exceptions for this event
144 occur ( normally 0 ) 144 occur ( normally 0 )
145 145
146 21 21 Decimal overflow mask if 1=FPU exceptions for this event occur 146 21 21 Decimal overflow mask if 1=FPU exceptions for this event occur
147 ( normally 0 ) 147 ( normally 0 )
148 148
149 22 22 Exponent underflow mask if 1=FPU exceptions for this event occur 149 22 22 Exponent underflow mask if 1=FPU exceptions for this event occur
150 ( normally 0 ) 150 ( normally 0 )
151 151
152 23 23 Significance Mask if 1=FPU exceptions for this event occur 152 23 23 Significance Mask if 1=FPU exceptions for this event occur
153 ( normally 0 ) 153 ( normally 0 )
154 154
155 24-31 24-30 Reserved Must be 0. 155 24-31 24-30 Reserved Must be 0.
156 156
157 31 Extended Addressing Mode 157 31 Extended Addressing Mode
158 32 Basic Addressing Mode 158 32 Basic Addressing Mode
159 Used to set addressing mode 159 Used to set addressing mode
160 PSW 31 PSW 32 160 PSW 31 PSW 32
161 0 0 24 bit 161 0 0 24 bit
162 0 1 31 bit 162 0 1 31 bit
163 1 1 64 bit 163 1 1 64 bit
164 164
165 32 1=31 bit addressing mode 0=24 bit addressing mode (for backward 165 32 1=31 bit addressing mode 0=24 bit addressing mode (for backward
166 compatibility), linux always runs with this bit set to 1 166 compatibility), linux always runs with this bit set to 1
167 167
168 33-64 Instruction address. 168 33-64 Instruction address.
169 33-63 Reserved must be 0 169 33-63 Reserved must be 0
170 64-127 Address 170 64-127 Address
171 In 24 bits mode bits 64-103=0 bits 104-127 Address 171 In 24 bits mode bits 64-103=0 bits 104-127 Address
172 In 31 bits mode bits 64-96=0 bits 97-127 Address 172 In 31 bits mode bits 64-96=0 bits 97-127 Address
173 Note: unlike 31 bit mode on s/390 bit 96 must be zero 173 Note: unlike 31 bit mode on s/390 bit 96 must be zero
174 when loading the address with LPSWE otherwise a 174 when loading the address with LPSWE otherwise a
175 specification exception occurs, LPSW is fully backward 175 specification exception occurs, LPSW is fully backward
176 compatible. 176 compatible.
177 177
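Because bit 0 is the MSB in this notation, picking fields out of a PSW value
seen in a dump takes a little care. The sketch below decodes a few of the
s/390 (64 bit PSW) fields from the table above. The helper names are
hypothetical; the bit positions are taken from the table.

```c
#include <stdint.h>

/* Return bit 'bit' of a 64 bit s/390 PSW, counting bit 0 as the MSB. */
static unsigned int psw_bit(uint64_t psw, unsigned int bit)
{
	return (unsigned int)((psw >> (63 - bit)) & 1);
}

/* Return bits first..last inclusive (IBM numbering) as a small integer. */
static unsigned int psw_field(uint64_t psw, unsigned int first,
			      unsigned int last)
{
	unsigned int width = last - first + 1;

	return (unsigned int)((psw >> (63 - last)) &
			      ((UINT64_C(1) << width) - 1));
}

/* Bit 5: dynamic address translation, 1 = DAT on. */
static unsigned int psw_dat_on(uint64_t psw)
{
	return psw_bit(psw, 5);
}

/* Bit 15: problem state, 1 for all linux user programs. */
static unsigned int psw_problem_state(uint64_t psw)
{
	return psw_bit(psw, 15);
}

/* Bits 16-17: address space control, 0 = primary, 1 = access register,
 * 2 = secondary, 3 = home. */
static unsigned int psw_addr_space(uint64_t psw)
{
	return psw_field(psw, 16, 17);
}

/* Bits 18-19: condition code. */
static unsigned int psw_cond_code(uint64_t psw)
{
	return psw_field(psw, 18, 19);
}
```

For example, a PSW whose first word is 0x070dc000 has DAT on, the problem
state bit set and address space control 11 (home space mode), which is what
the table leads one to expect for a user program.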
178 178
179 Prefix Page(s) 179 Prefix Page(s)
180 -------------- 180 --------------
181 This per cpu memory area is too intimately tied to the processor not to mention. 181 This per cpu memory area is too intimately tied to the processor not to mention.
182 It exists between the real addresses 0-4096 on s/390 & 0-8192 z/Architecture & is exchanged 182 It exists between the real addresses 0-4096 on s/390 & 0-8192 z/Architecture & is exchanged
183 with a 1 page on s/390 or 2 pages on z/Architecture in absolute storage by the set 183 with a 1 page on s/390 or 2 pages on z/Architecture in absolute storage by the set
184 prefix instruction in linux'es startup. 184 prefix instruction in linux'es startup.
185 This page is mapped to a different prefix for each processor in an SMP configuration 185 This page is mapped to a different prefix for each processor in an SMP configuration
186 ( assuming the os designer is sane of course :-) ). 186 ( assuming the os designer is sane of course :-) ).
187 Bytes 0-512 ( 200 hex ) on s/390 & 0-512,4096-4544,4604-5119 currently on z/Architecture 187 Bytes 0-512 ( 200 hex ) on s/390 & 0-512,4096-4544,4604-5119 currently on z/Architecture
188 are used by the processor itself for holding such information as exception indications & 188 are used by the processor itself for holding such information as exception indications &
189 entry points for exceptions. 189 entry points for exceptions.
190 Bytes after 0xc00 hex are used by linux for per processor globals on s/390 & z/Architecture 190 Bytes after 0xc00 hex are used by linux for per processor globals on s/390 & z/Architecture
191 ( there is a gap on z/Architecture too currently between 0xc00 & 1000 which linux uses ). 191 ( there is a gap on z/Architecture too currently between 0xc00 & 1000 which linux uses ).
192 The closest thing to this on traditional architectures is the interrupt 192 The closest thing to this on traditional architectures is the interrupt
193 vector table. This is a good thing & does simplify some of the kernel coding 193 vector table. This is a good thing & does simplify some of the kernel coding
194 however it means that we now cannot catch stray NULL pointers in the 194 however it means that we now cannot catch stray NULL pointers in the
195 kernel without hard coded checks. 195 kernel without hard coded checks.
196 196
197 197
198 198
199 Address Spaces on Intel Linux 199 Address Spaces on Intel Linux
200 ============================= 200 =============================
201 201
202 The traditional Intel Linux is approximately mapped as follows ( forgive 202 The traditional Intel Linux is approximately mapped as follows ( forgive
203 the ascii art ). 203 the ascii art ).
204 0xFFFFFFFF 4GB Himem ***************** 204 0xFFFFFFFF 4GB Himem *****************
205 * * 205 * *
206 * Kernel Space * 206 * Kernel Space *
207 * * 207 * *
208 ***************** **************** 208 ***************** ****************
209 User Space Himem (typically 0xC0000000 3GB )* User Stack * * * 209 User Space Himem (typically 0xC0000000 3GB )* User Stack * * *
210 ***************** * * 210 ***************** * *
211 * Shared Libs * * Next Process * 211 * Shared Libs * * Next Process *
212 ***************** * to * 212 ***************** * to *
213 * * <== * Run * <== 213 * * <== * Run * <==
214 * User Program * * * 214 * User Program * * *
215 * Data BSS * * * 215 * Data BSS * * *
216 * Text * * * 216 * Text * * *
217 * Sections * * * 217 * Sections * * *
218 0x00000000 ***************** **************** 218 0x00000000 ***************** ****************
219 219
220 On Intel it is quite easy to recognise a kernel address 220 On Intel it is quite easy to recognise a kernel address
221 as being one greater than user space himem ( in this case 0xC0000000 ), 221 as being one greater than user space himem ( in this case 0xC0000000 ),
222 & addresses less than this are the ones in the program currently running on this 222 & addresses less than this are the ones in the program currently running on this
223 processor ( if an smp box ). 223 processor ( if an smp box ).
224 If using the virtual machine ( VM ) as a debugger it is quite difficult to 224 If using the virtual machine ( VM ) as a debugger it is quite difficult to
225 know which user process is running as the address space you are looking at 225 know which user process is running as the address space you are looking at
226 could be from any process in the run queue. 226 could be from any process in the run queue.
227 227
228 The limitation of Intel's addressing technique is that the linux 228 The limitation of Intel's addressing technique is that the linux
229 kernel uses a very simple real address to virtual addressing technique 229 kernel uses a very simple real address to virtual addressing technique
230 of Real Address=Virtual Address-User Space Himem. 230 of Real Address=Virtual Address-User Space Himem.
231 This means that on Intel the linux kernel can typically only address 231 This means that on Intel the linux kernel can typically only address
232 Himem=0xFFFFFFFF-0xC0000000=1GB & this is all the RAM these machines 232 Himem=0xFFFFFFFF-0xC0000000=1GB & this is all the RAM these machines
233 can typically use. 233 can typically use.
234 They can lower User Himem to 2GB or lower & thus be 234 They can lower User Himem to 2GB or lower & thus be
235 able to use 2GB of RAM, however this shrinks the maximum size 235 able to use 2GB of RAM, however this shrinks the maximum size
236 of User Space from 3GB to 2GB; they have a no win limit of 4GB unless 236 of User Space from 3GB to 2GB; they have a no win limit of 4GB unless
237 they go to 64 Bit. 237 they go to 64 Bit.
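
The simple mapping described above can be sketched in C. This is only an illustration of the arithmetic, not kernel code: USER_SPACE_HIMEM, kernel_virt_to_real and kernel_window_bytes are hypothetical names, and the 3GB split is the typical value quoted in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the typical 3GB user / 1GB kernel split quoted in the text. */
#define USER_SPACE_HIMEM 0xC0000000u

/* Real Address = Virtual Address - User Space Himem ( kernel addresses only ). */
static uint32_t kernel_virt_to_real(uint32_t vaddr)
{
    assert(vaddr >= USER_SPACE_HIMEM);  /* anything lower is a user address */
    return vaddr - USER_SPACE_HIMEM;
}

/* Number of bytes the kernel can reach through this fixed offset mapping. */
static uint64_t kernel_window_bytes(void)
{
    return (uint64_t)0xFFFFFFFFu - USER_SPACE_HIMEM + 1;
}
```

Lowering USER_SPACE_HIMEM in this sketch grows the kernel window by exactly the amount taken away from user space, which is the trade-off the paragraph describes.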
238 238
239 239
240 On 390 our limitations & strengths make us slightly different. 240 On 390 our limitations & strengths make us slightly different.
241 For backward compatibility we are only allowed to use 31 bits (2GB) 241 For backward compatibility we are only allowed to use 31 bits (2GB)
242 of our 32 bit addresses, however, we use entirely separate address 242 of our 32 bit addresses, however, we use entirely separate address
243 spaces for the user & kernel. 243 spaces for the user & kernel.
244 244
245 This means we can support 2GB of non Extended RAM on s/390, & more 245 This means we can support 2GB of non Extended RAM on s/390, & more
246 with the Extended memory management swap device, & 246 with the Extended memory management swap device, &
247 currently 4TB of physical memory on z/Architecture. 247 currently 4TB of physical memory on z/Architecture.
248 248
249 249
250 Address Spaces on Linux for s/390 & z/Architecture 250 Address Spaces on Linux for s/390 & z/Architecture
251 ================================================== 251 ==================================================
252 252
253 Our addressing scheme is as follows 253 Our addressing scheme is as follows
254 254
255 255
256 Himem 0x7fffffff 2GB on s/390 ***************** **************** 256 Himem 0x7fffffff 2GB on s/390 ***************** ****************
257 currently 0x3ffffffffff (2^42)-1 * User Stack * * * 257 currently 0x3ffffffffff (2^42)-1 * User Stack * * *
258 on z/Architecture. ***************** * * 258 on z/Architecture. ***************** * *
259 * Shared Libs * * * 259 * Shared Libs * * *
260 ***************** * * 260 ***************** * *
261 * * * Kernel * 261 * * * Kernel *
262 * User Program * * * 262 * User Program * * *
263 * Data BSS * * * 263 * Data BSS * * *
264 * Text * * * 264 * Text * * *
265 * Sections * * * 265 * Sections * * *
266 0x00000000 ***************** **************** 266 0x00000000 ***************** ****************
267 267
268 This also means that we need to look at the PSW problem state bit 268 This also means that we need to look at the PSW problem state bit
269 or the addressing mode to decide whether we are looking at 269 or the addressing mode to decide whether we are looking at
270 user or kernel space. 270 user or kernel space.
271 271
272 Virtual Addresses on s/390 & z/Architecture 272 Virtual Addresses on s/390 & z/Architecture
273 =========================================== 273 ===========================================
274 274
275 A virtual address on s/390 is made up of 3 parts 275 A virtual address on s/390 is made up of 3 parts
276 The SX ( segment index, roughly corresponding to the PGD & PMD in linux terminology ) 276 The SX ( segment index, roughly corresponding to the PGD & PMD in linux terminology )
277 being bits 1-11. 277 being bits 1-11.
278 The PX ( page index, corresponding to the page table entry (pte) in linux terminology ) 278 The PX ( page index, corresponding to the page table entry (pte) in linux terminology )
279 being bits 12-19. 279 being bits 12-19.
280 The remaining bits BX ( the byte index ) are the offset in the page 280 The remaining bits BX ( the byte index ) are the offset in the page
281 i.e. bits 20 to 31. 281 i.e. bits 20 to 31.
282 282
283 On z/Architecture in linux we currently make up an address from 4 parts. 283 On z/Architecture in linux we currently make up an address from 4 parts.
284 The region index bits (RX) 0-32 we currently use bits 22-32 284 The region index bits (RX) 0-32 we currently use bits 22-32
285 The segment index (SX) being bits 33-43 285 The segment index (SX) being bits 33-43
286 The page index (PX) being bits 44-51 286 The page index (PX) being bits 44-51
287 The byte index (BX) being bits 52-63 287 The byte index (BX) being bits 52-63
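
The bit ranges above use IBM numbering, where bit 0 is the most significant bit. A minimal C sketch of how the fields could be extracted follows; the struct and function names are hypothetical, and the shifts are derived only from the bit ranges listed above.

```c
#include <assert.h>
#include <stdint.h>

struct s390_vaddr  { unsigned sx, px, bx; };      /* hypothetical names */
struct zarch_vaddr { unsigned rx, sx, px, bx; };

/* s/390: bit 0 is the MSB of a 32 bit word, bit 31 the LSB. */
static struct s390_vaddr split_s390(uint32_t addr)
{
    struct s390_vaddr v;

    assert((addr & 0x80000000u) == 0);  /* bit 0 is not part of a 31 bit address */
    v.sx = (addr >> 20) & 0x7ff;        /* bits 1-11,  11 bits */
    v.px = (addr >> 12) & 0xff;         /* bits 12-19,  8 bits */
    v.bx = addr & 0xfff;                /* bits 20-31, 12 bits */
    return v;
}

/* z/Architecture: bit 0 is the MSB of a 64 bit word, bit 63 the LSB. */
static struct zarch_vaddr split_zarch(uint64_t addr)
{
    struct zarch_vaddr v;

    v.rx = (unsigned)((addr >> 31) & 0x7ff);  /* bits 22-32 ( the part of RX in use ) */
    v.sx = (unsigned)((addr >> 20) & 0x7ff);  /* bits 33-43 */
    v.px = (unsigned)((addr >> 12) & 0xff);   /* bits 44-51 */
    v.bx = (unsigned)(addr & 0xfff);          /* bits 52-63 */
    return v;
}
```

The four z/Architecture fields together cover 11+11+8+12 = 42 bits of the address.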
288 288
289 Notes: 289 Notes:
290 1) s/390 has no PMD so the PMD is really the PGD also. 290 1) s/390 has no PMD so the PMD is really the PGD also.
291 A lot of this stuff is defined in pgtable.h. 291 A lot of this stuff is defined in pgtable.h.
292 292
293 2) Also seeing as s/390's page indexes are only 1k in size 293 2) Also seeing as s/390's page indexes are only 1k in size
294 ( bits 12-19 x 4 bytes per pte ) we use 1 page ( 4k ) 294 ( bits 12-19 x 4 bytes per pte ) we use 1 page ( 4k )
295 to make the best use of memory by updating 4 segment index 295 to make the best use of memory by updating 4 segment index
296 entries each time we mess with a PMD & use offsets 296 entries each time we mess with a PMD & use offsets
297 0,1024,2048 & 3072 in this page for our segment indexes. 297 0,1024,2048 & 3072 in this page for our segment indexes.
298 On z/Architecture our page indexes are now 2k in size 298 On z/Architecture our page indexes are now 2k in size
299 ( bits 44-51 x 8 bytes per pte ) we do a similar trick 299 ( bits 44-51 x 8 bytes per pte ) we do a similar trick
300 but only mess with 2 segment indices each time we mess with 300 but only mess with 2 segment indices each time we mess with
301 a PMD. 301 a PMD.
302 302
303 3) Although z/Architecture supports up to a massive 5-level page table lookup we 303 3) Although z/Architecture supports up to a massive 5-level page table lookup we
304 can only use 3 currently on Linux ( as this is all the generic kernel 304 can only use 3 currently on Linux ( as this is all the generic kernel
305 currently supports ), however this may change in future. 305 currently supports ), however this may change in future.
306 This allows us to access ( according to my sums ) 306 This allows us to access ( according to my sums )
307 4TB of virtual storage per process i.e. 307 4TB of virtual storage per process i.e.
308 4096*512(PTES)*1024(PMDS)*2048(PGD) = 4398046511104 bytes, 308 4096*512(PTES)*1024(PMDS)*2048(PGD) = 4398046511104 bytes,
309 enough for another 2 or 3 years I think :-). 309 enough for another 2 or 3 years I think :-).
310 To do this we use a region-third-table designation type in 310 To do this we use a region-third-table designation type in
311 our address space control registers. 311 our address space control registers.
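
The sum in note 3 can be checked with a few lines of C. The function name is hypothetical; the four factors are exactly the ones quoted in the note.

```c
#include <stdint.h>

/* Bytes of virtual storage reachable with the 3 levels described in note 3. */
static uint64_t zarch_virtual_bytes(void)
{
    const uint64_t page_size = 4096;  /* bytes per page             */
    const uint64_t ptes      = 512;   /* page table entries         */
    const uint64_t pmds      = 1024;  /* PMD entries                */
    const uint64_t pgds      = 2048;  /* PGD entries                */

    /* 2^12 * 2^9 * 2^10 * 2^11 = 2^42 */
    return page_size * ptes * pmds * pgds;
}
```

The product is 2^42 bytes, i.e. 4TB, which agrees with a highest address of (2^42)-1.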
312 312
313 313
314 The Linux for s/390 & z/Architecture Kernel Task Structure 314 The Linux for s/390 & z/Architecture Kernel Task Structure
315 ========================================================== 315 ==========================================================
316 Each process/thread under Linux for S390 has its own kernel task_struct 316 Each process/thread under Linux for S390 has its own kernel task_struct
317 defined in linux/include/linux/sched.h 317 defined in linux/include/linux/sched.h
318 The S390 on initialisation & resuming of a process on a cpu sets 318 The S390 on initialisation & resuming of a process on a cpu sets
319 the __LC_KERNEL_STACK variable in the spare prefix area for this cpu 319 the __LC_KERNEL_STACK variable in the spare prefix area for this cpu
320 (which we use for per-processor globals). 320 (which we use for per-processor globals).
321 321
322 The kernel stack pointer is intimately tied with the task structure for 322 The kernel stack pointer is intimately tied with the task structure for
323 each processor as follows. 323 each processor as follows.
324 324
325 s/390 325 s/390
326 ************************ 326 ************************
327 * 1 page kernel stack * 327 * 1 page kernel stack *
328 * ( 4K ) * 328 * ( 4K ) *
329 ************************ 329 ************************
330 * 1 page task_struct * 330 * 1 page task_struct *
331 * ( 4K ) * 331 * ( 4K ) *
332 8K aligned ************************ 332 8K aligned ************************
333 333
334 z/Architecture 334 z/Architecture
335 ************************ 335 ************************
336 * 2 page kernel stack * 336 * 2 page kernel stack *
337 * ( 8K ) * 337 * ( 8K ) *
338 ************************ 338 ************************
339 * 2 page task_struct * 339 * 2 page task_struct *
340 * ( 8K ) * 340 * ( 8K ) *
341 16K aligned ************************ 341 16K aligned ************************
342 342
343 What this means is that we don't need to dedicate any register or global variable 343 What this means is that we don't need to dedicate any register or global variable
344 to point to the current running process & can retrieve it with the following 344 to point to the current running process & can retrieve it with the following
345 very simple construct for s/390 & one very similar for z/Architecture. 345 very simple construct for s/390 & one very similar for z/Architecture.
346 346
347 static inline struct task_struct * get_current(void) 347 static inline struct task_struct * get_current(void)
348 { 348 {
349 struct task_struct *current; 349 struct task_struct *current;
350 __asm__("lhi %0,-8192\n\t" 350 __asm__("lhi %0,-8192\n\t"
351 "nr %0,15" 351 "nr %0,15"
352 : "=r" (current) ); 352 : "=r" (current) );
353 return current; 353 return current;
354 } 354 }
355 355
356 i.e. just anding the current kernel stack pointer with the mask -8192. 356 i.e. just anding the current kernel stack pointer with the mask -8192.
357 Thankfully because Linux doesn't have support for nested IO interrupts 357 Thankfully because Linux doesn't have support for nested IO interrupts
358 & our devices have large buffers & can survive interrupts being shut for 358 & our devices have large buffers & can survive interrupts being shut for
359 short amounts of time we don't need a separate stack for interrupts. 359 short amounts of time we don't need a separate stack for interrupts.
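
The same masking step can be sketched in portable C to show why the alignment matters. This is an illustration only, not the kernel's code: task_from_sp and its parameters are hypothetical names, and the 8K and 16K sizes come from the two diagrams above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a kernel stack pointer down to the start of the aligned block
 * that holds both the task_struct and the kernel stack.
 * size is 8192 on s/390 and 16384 on z/Architecture ( see the diagrams ).
 */
static uintptr_t task_from_sp(uintptr_t sp, uintptr_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);  /* must be a power of 2 */
    return sp & ~(size - 1);  /* for size 8192 this is the same as anding with -8192 */
}
```

Because the block is aligned to its own size, every stack pointer value inside it masks down to the same task_struct address.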
360 360
361 361
362 362
363 363
364 Register Usage & Stackframes on Linux for s/390 & z/Architecture 364 Register Usage & Stackframes on Linux for s/390 & z/Architecture
365 ================================================================= 365 =================================================================
366 Overview: 366 Overview:
367 --------- 367 ---------
368 This is the code that gcc produces at the top & the bottom of 368 This is the code that gcc produces at the top & the bottom of
369 each function. It usually is fairly consistent & similar from 369 each function. It usually is fairly consistent & similar from
370 function to function & if you know its layout you can probably 370 function to function & if you know its layout you can probably
371 make some headway in finding the ultimate cause of a problem 371 make some headway in finding the ultimate cause of a problem
372 after a crash without a source level debugger. 372 after a crash without a source level debugger.
373 373
374 Note: To follow stackframes requires a knowledge of C or Pascal & 374 Note: To follow stackframes requires a knowledge of C or Pascal &
375 limited knowledge of one assembly language. 375 limited knowledge of one assembly language.
376 376
377 It should be noted that there are some differences between the 377 It should be noted that there are some differences between the
378 s/390 & z/Architecture stack layouts as the z/Architecture stack layout didn't have 378 s/390 & z/Architecture stack layouts as the z/Architecture stack layout didn't have
379 to maintain compatibility with older linkage formats. 379 to maintain compatibility with older linkage formats.
380 380
381 Glossary: 381 Glossary:
382 --------- 382 ---------
383 alloca: 383 alloca:
384 This is a built in compiler function for runtime allocation 384 This is a built in compiler function for runtime allocation
385 of extra space on the callers stack which is obviously freed 385 of extra space on the callers stack which is obviously freed
386 up on function exit ( e.g. the caller may choose to allocate nothing 386 up on function exit ( e.g. the caller may choose to allocate nothing
387 or a buffer of 4k if required for temporary purposes ), it generates 387 or a buffer of 4k if required for temporary purposes ), it generates
388 very efficient code ( a few cycles ) when compared to alternatives 388 very efficient code ( a few cycles ) when compared to alternatives
389 like malloc. 389 like malloc.
390 390
391 automatics: These are local variables on the stack, 391 automatics: These are local variables on the stack,
392 i.e. they aren't in registers & they aren't static. 392 i.e. they aren't in registers & they aren't static.
393 393
394 back-chain: 394 back-chain:
395 This is a pointer to the stack pointer before entering a 395 This is a pointer to the stack pointer before entering a
396 framed function's ( see frameless function ) prologue, got by 396 framed function's ( see frameless function ) prologue, got by
397 dereferencing the address of the current stack pointer, 397 dereferencing the address of the current stack pointer,
398 i.e. got by accessing the 32 bit value at the stack pointer's 398 i.e. got by accessing the 32 bit value at the stack pointer's
399 current location. 399 current location.
400 400
401 base-pointer: 401 base-pointer:
402 This is a pointer to the back of the literal pool which 402 This is a pointer to the back of the literal pool which
403 is an area just behind each procedure used to store constants 403 is an area just behind each procedure used to store constants
404 in each function. 404 in each function.
405 405
406 call-clobbered: The caller probably needs to save these registers if there 406 call-clobbered: The caller probably needs to save these registers if there
407 is something of value in them, on the stack or elsewhere before making a 407 is something of value in them, on the stack or elsewhere before making a
408 call to another procedure so that it can restore it later. 408 call to another procedure so that it can restore it later.
409 409
410 epilogue: 410 epilogue:
411 The code generated by the compiler to return to the caller. 411 The code generated by the compiler to return to the caller.
412 412
413 frameless-function 413 frameless-function
414 A frameless function in Linux for s390 & z/Architecture is one which doesn't 414 A frameless function in Linux for s390 & z/Architecture is one which doesn't
415 need more than the register save area ( 96 bytes on s/390, 160 on z/Architecture ) 415 need more than the register save area ( 96 bytes on s/390, 160 on z/Architecture )
416 given to it by the caller. 416 given to it by the caller.
417 A frameless function never: 417 A frameless function never:
418 1) Sets up a back chain. 418 1) Sets up a back chain.
419 2) Calls alloca. 419 2) Calls alloca.
420 3) Calls other normal functions 420 3) Calls other normal functions
421 4) Has automatics. 421 4) Has automatics.
422 422
423 GOT-pointer: 423 GOT-pointer:
424 This is a pointer to the global-offset-table in ELF 424 This is a pointer to the global-offset-table in ELF
425 ( Executable Linkable Format, Linux's most common executable format ), 425 ( Executable Linkable Format, Linux's most common executable format ),
426 all globals & shared library objects are found using this pointer. 426 all globals & shared library objects are found using this pointer.
427 427
428 lazy-binding 428 lazy-binding
429 ELF shared libraries are typically only loaded when routines in the shared 429 ELF shared libraries are typically only loaded when routines in the shared
430 library are actually first called at runtime. This is lazy binding. 430 library are actually first called at runtime. This is lazy binding.
431 431
432 procedure-linkage-table 432 procedure-linkage-table
433 This is a table found from the GOT which contains pointers to routines 433 This is a table found from the GOT which contains pointers to routines
434 in other shared libraries which can't be called to by easier means. 434 in other shared libraries which can't be called to by easier means.
435 435
436 prologue: 436 prologue:
437 The code generated by the compiler to set up the stack frame. 437 The code generated by the compiler to set up the stack frame.
438 438
439 outgoing-args: 439 outgoing-args:
440 This is extra area allocated on the stack of the calling function if the 440 This is extra area allocated on the stack of the calling function if the
441 parameters for the callees cannot all be put in registers, the same 441 parameters for the callees cannot all be put in registers, the same
442 area can be reused by each function the caller calls. 442 area can be reused by each function the caller calls.
443 443
444 routine-descriptor: 444 routine-descriptor:
445 A COFF executable format based concept of a procedure reference 445 A COFF executable format based concept of a procedure reference
446 actually being 8 bytes or more as opposed to a simple pointer to the routine. 446 actually being 8 bytes or more as opposed to a simple pointer to the routine.
447 This is typically defined as follows 447 This is typically defined as follows
448 Routine Descriptor offset 0=Pointer to Function 448 Routine Descriptor offset 0=Pointer to Function
449 Routine Descriptor offset 4=Pointer to Table of Contents 449 Routine Descriptor offset 4=Pointer to Table of Contents
450 The table of contents/TOC is roughly equivalent to a GOT pointer. 450 The table of contents/TOC is roughly equivalent to a GOT pointer.
451 & it means that shared libraries etc. can be shared between several 451 & it means that shared libraries etc. can be shared between several
452 environments each with their own TOC. 452 environments each with their own TOC.
453 453
454 454
455 static-chain: This is used in nested functions, a concept adopted from pascal 455 static-chain: This is used in nested functions, a concept adopted from pascal
456 by gcc & not used in ansi C or C++ ( although quite useful ), basically it 456 by gcc & not used in ansi C or C++ ( although quite useful ), basically it
457 is a pointer used to reference local variables of enclosing functions. 457 is a pointer used to reference local variables of enclosing functions.
458 You might come across this stuff once or twice in your lifetime. 458 You might come across this stuff once or twice in your lifetime.
459 459
460 e.g. 460 e.g.
461 The function below should return 11 though gcc may get upset & toss warnings 461 The function below should return 11 though gcc may get upset & toss warnings
462 about unused variables. 462 about unused variables.
463 int FunctionA(int a) 463 int FunctionA(int a)
464 { 464 {
465 int b; 465 int b;
466 void FunctionC(int c) 466 void FunctionC(int c)
467 { 467 {
468 b=c+1; 468 b=c+1;
469 } 469 }
470 FunctionC(10); 470 FunctionC(10);
471 return(b); 471 return(b);
472 } 472 }
473 473
474 474
475 s/390 & z/Architecture Register usage 475 s/390 & z/Architecture Register usage
476 ===================================== 476 =====================================
477 r0 used by syscalls/assembly call-clobbered 477 r0 used by syscalls/assembly call-clobbered
478 r1 used by syscalls/assembly call-clobbered 478 r1 used by syscalls/assembly call-clobbered
479 r2 argument 0 / return value 0 call-clobbered 479 r2 argument 0 / return value 0 call-clobbered
480 r3 argument 1 / return value 1 (if long long) call-clobbered 480 r3 argument 1 / return value 1 (if long long) call-clobbered
481 r4 argument 2 call-clobbered 481 r4 argument 2 call-clobbered
482 r5 argument 3 call-clobbered 482 r5 argument 3 call-clobbered
483 r6 argument 4 saved 483 r6 argument 4 saved
484 r7 pointer-to arguments 5 to ... saved 484 r7 pointer-to arguments 5 to ... saved
485 r8 this & that saved 485 r8 this & that saved
486 r9 this & that saved 486 r9 this & that saved
487 r10 static-chain ( if nested function ) saved 487 r10 static-chain ( if nested function ) saved
488 r11 frame-pointer ( if function used alloca ) saved 488 r11 frame-pointer ( if function used alloca ) saved
489 r12 got-pointer saved 489 r12 got-pointer saved
490 r13 base-pointer saved 490 r13 base-pointer saved
491 r14 return-address saved 491 r14 return-address saved
492 r15 stack-pointer saved 492 r15 stack-pointer saved
493 493
494 f0 argument 0 / return value ( float/double ) call-clobbered 494 f0 argument 0 / return value ( float/double ) call-clobbered
495 f2 argument 1 call-clobbered 495 f2 argument 1 call-clobbered
496 f4 z/Architecture argument 2 saved 496 f4 z/Architecture argument 2 saved
497 f6 z/Architecture argument 3 saved 497 f6 z/Architecture argument 3 saved
498 The remaining floating points 498 The remaining floating points
499 f1,f3,f5 f7-f15 are call-clobbered. 499 f1,f3,f5 f7-f15 are call-clobbered.
500 500
501 Notes: 501 Notes:
502 ------ 502 ------
503 1) The only requirement is that registers which are used 503 1) The only requirement is that registers which are used
504 by the callee are saved, e.g. the compiler is perfectly 504 by the callee are saved, e.g. the compiler is perfectly
505 capable of using r11 for purposes other than a 505 capable of using r11 for purposes other than a
506 frame pointer if a frame pointer is not needed. 506 frame pointer if a frame pointer is not needed.
507 2) In functions with variable arguments e.g. printf the calling procedure 507 2) In functions with variable arguments e.g. printf the calling procedure
508 is identical to one without variable arguments & the same number of 508 is identical to one without variable arguments & the same number of
509 parameters. However, the prologue of this function is somewhat more 509 parameters. However, the prologue of this function is somewhat more
510 hairy owing to it having to move these parameters to the stack to 510 hairy owing to it having to move these parameters to the stack to
511 get va_start, va_arg & va_end to work. 511 get va_start, va_arg & va_end to work.
512 3) Access registers are currently unused by gcc but are used in 512 3) Access registers are currently unused by gcc but are used in
513 the kernel. Possibilities exist to use them at the moment for 513 the kernel. Possibilities exist to use them at the moment for
514 temporary storage but it isn't recommended. 514 temporary storage but it isn't recommended.
515 4) Only 4 of the floating point registers are used for 515 4) Only 4 of the floating point registers are used for
516 parameter passing as older machines such as G3 only have 4 516 parameter passing as older machines such as G3 only have 4
517 & it keeps the stack frame compatible with other compilers. 517 & it keeps the stack frame compatible with other compilers.
518 However with IEEE floating point emulation under linux on the 518 However with IEEE floating point emulation under linux on the
519 older machines you are free to use the other 12. 519 older machines you are free to use the other 12.
520 5) A long long or double parameter cannot have the 520 5) A long long or double parameter cannot have the
521 first 4 bytes in a register & the second four bytes in the 521 first 4 bytes in a register & the second four bytes in the
522 outgoing args area. It must be purely in the outgoing args 522 outgoing args area. It must be purely in the outgoing args
523 area if crossing this boundary. 523 area if crossing this boundary.
524 6) Floating point parameters are mixed with outgoing args 524 6) Floating point parameters are mixed with outgoing args
525 on the outgoing args area in the order they are passed in as parameters. 525 on the outgoing args area in the order they are passed in as parameters.
526 7) Floating point arguments 2 & 3 are saved in the outgoing args area for 526 7) Floating point arguments 2 & 3 are saved in the outgoing args area for
527 z/Architecture 527 z/Architecture
528 528
529 529
530 Stack Frame Layout 530 Stack Frame Layout
531 ------------------ 531 ------------------
532 s/390 z/Architecture 532 s/390 z/Architecture
533 0 0 back chain ( a 0 here signifies end of back chain ) 533 0 0 back chain ( a 0 here signifies end of back chain )
534 4 8 eos ( end of stack, not used on Linux for S390 used in other linkage formats ) 534 4 8 eos ( end of stack, not used on Linux for S390 used in other linkage formats )
535 8 16 glue used in other s/390 linkage formats for saved routine descriptors etc. 535 8 16 glue used in other s/390 linkage formats for saved routine descriptors etc.
536 12 24 glue used in other s/390 linkage formats for saved routine descriptors etc. 536 12 24 glue used in other s/390 linkage formats for saved routine descriptors etc.
537 16 32 scratch area 537 16 32 scratch area
538 20 40 scratch area 538 20 40 scratch area
539 24 48 saved r6 of caller function 539 24 48 saved r6 of caller function
540 28 56 saved r7 of caller function 540 28 56 saved r7 of caller function
541 32 64 saved r8 of caller function 541 32 64 saved r8 of caller function
542 36 72 saved r9 of caller function 542 36 72 saved r9 of caller function
543 40 80 saved r10 of caller function 543 40 80 saved r10 of caller function
544 44 88 saved r11 of caller function 544 44 88 saved r11 of caller function
545 48 96 saved r12 of caller function 545 48 96 saved r12 of caller function
546 52 104 saved r13 of caller function 546 52 104 saved r13 of caller function
547 56 112 saved r14 of caller function 547 56 112 saved r14 of caller function
548 60 120 saved r15 of caller function 548 60 120 saved r15 of caller function
549 64 128 saved f4 of caller function 549 64 128 saved f4 of caller function
550 72 136 saved f6 of caller function 550 72 136 saved f6 of caller function
551 80 undefined 551 80 undefined
552 96 160 outgoing args passed from caller to callee 552 96 160 outgoing args passed from caller to callee
553 96+x 160+x possible stack alignment ( 8 bytes desirable ) 553 96+x 160+x possible stack alignment ( 8 bytes desirable )
554 96+x+y 160+x+y alloca space of caller ( if used ) 554 96+x+y 160+x+y alloca space of caller ( if used )
555 96+x+y+z 160+x+y+z automatics of caller ( if used ) 555 96+x+y+z 160+x+y+z automatics of caller ( if used )
556 0 back-chain 556 0 back-chain
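
The s/390 column of the table can be written down as a C structure, which also makes the back chain easy to follow. This is a sketch under stated assumptions: struct s390_stack_frame, read_be32 and backchain_depth are hypothetical names, the memory image is a plain byte array standing in for a dump, and values are read big endian as on s/390.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The first 96 bytes of an s/390 stack frame, offsets as in the table. */
struct s390_stack_frame {
    uint32_t back_chain;     /*  0: 0 here signifies end of back chain */
    uint32_t eos;            /*  4: end of stack, unused on Linux      */
    uint32_t glue[2];        /*  8, 12                                 */
    uint32_t scratch[2];     /* 16, 20                                 */
    uint32_t gprs[10];       /* 24..60: saved r6..r15 of caller        */
    uint64_t f4;             /* 64                                     */
    uint64_t f6;             /* 72                                     */
    uint8_t  undefined[16];  /* 80..95                                 */
};

/* Read a 32 bit big endian value from a memory image. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Count the frames reachable from sp by following the back chain. */
static unsigned backchain_depth(const uint8_t *mem, size_t size, uint32_t sp)
{
    unsigned depth = 0;

    assert(mem != NULL);
    /* Stop at a 0 back chain, at anything outside the image, or after
     * more frames than the image could possibly hold ( corrupt chain ). */
    while (sp != 0 && size >= 4 && sp <= size - 4 && depth < size / 4) {
        depth++;
        sp = read_be32(mem + sp);
    }
    return depth;
}
```

When reading a dump by hand the same walk applies: take the 32 bit value at the stack pointer, treat it as the previous stack pointer, and repeat until a 0 is found.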
557 557
558 A sample program with comments. 558 A sample program with comments.
559 =============================== 559 ===============================
560 560
561 Comments on the function test 561 Comments on the function test
562 ----------------------------- 562 -----------------------------
563 1) It didn't need to set up a pointer to the constant pool gpr13 as it isn't used 563 1) It didn't need to set up a pointer to the constant pool gpr13 as it isn't used
564 ( :-( ). 564 ( :-( ).
565 2) This is a frameless function & no stack is bought. 565 2) This is a frameless function & no stack is bought.
566 3) The compiler was clever enough to recognise that it could return the 566 3) The compiler was clever enough to recognise that it could return the
567 value in r2 as well as use it for the passed in parameter ( :-) ). 567 value in r2 as well as use it for the passed in parameter ( :-) ).
568 4) The basr ( branch & save register ) trick works as follows: the instruction 568 4) The basr ( branch & save register ) trick works as follows: the instruction
569 has a special case with r0 as the branch target ( r0 in some instruction operands is understood as 569 has a special case with r0 as the branch target ( r0 in some instruction operands is understood as
570 the literal value 0, some risc architectures also do this ). So now 570 the literal value 0, some risc architectures also do this ). So now
571 we are branching to the next address & the new program counter is 571 we are branching to the next address & the new program counter is
572 in r13, so now we subtract the size of the function prologue we have executed 572 in r13, so now we subtract the size of the function prologue we have executed
573 + the size of the literal pool to get to the top of the literal pool. 573 + the size of the literal pool to get to the top of the literal pool.
574 0040037c int test(int b) 574 0040037c int test(int b)
575 { # Function prologue below 575 { # Function prologue below
576 40037c: 90 de f0 34 stm %r13,%r14,52(%r15) # Save registers r13 & r14 576 40037c: 90 de f0 34 stm %r13,%r14,52(%r15) # Save registers r13 & r14
577 400380: 0d d0 basr %r13,%r0 # Set up pointer to constant pool using 577 400380: 0d d0 basr %r13,%r0 # Set up pointer to constant pool using
578 400382: a7 da ff fa ahi %r13,-6 # basr trick 578 400382: a7 da ff fa ahi %r13,-6 # basr trick
579 return(5+b); 579 return(5+b);
580 # Huge main program 580 # Huge main program
581 400386: a7 2a 00 05 ahi %r2,5 # add 5 to r2 581 400386: a7 2a 00 05 ahi %r2,5 # add 5 to r2
582 582
583 # Function epilogue below 583 # Function epilogue below
584 40038a: 98 de f0 34 lm %r13,%r14,52(%r15) # restore registers r13 & 14 584 40038a: 98 de f0 34 lm %r13,%r14,52(%r15) # restore registers r13 & 14
585 40038e: 07 fe br %r14 # return 585 40038e: 07 fe br %r14 # return
586 } 586 }
587 587
588 Comments on the function main 588 Comments on the function main
589 ----------------------------- 589 -----------------------------
590 1) The compiler did this function optimally ( 8-) ) 590 1) The compiler did this function optimally ( 8-) )
591 591
592 Literal pool for main. 592 Literal pool for main.
593 400390: ff ff ff ec .long 0xffffffec 593 400390: ff ff ff ec .long 0xffffffec
594 main(int argc,char *argv[]) 594 main(int argc,char *argv[])
595 { # Function prologue below 595 { # Function prologue below
596 400394: 90 bf f0 2c stm %r11,%r15,44(%r15) # Save necessary registers 596 400394: 90 bf f0 2c stm %r11,%r15,44(%r15) # Save necessary registers
597 400398: 18 0f lr %r0,%r15 # copy stack pointer to r0 597 400398: 18 0f lr %r0,%r15 # copy stack pointer to r0
598 40039a: a7 fa ff a0 ahi %r15,-96 # Make area for callee saving 598 40039a: a7 fa ff a0 ahi %r15,-96 # Make area for callee saving
599 40039e: 0d d0 basr %r13,%r0 # Set up r13 to point to 599 40039e: 0d d0 basr %r13,%r0 # Set up r13 to point to
600 4003a0: a7 da ff f0 ahi %r13,-16 # literal pool 600 4003a0: a7 da ff f0 ahi %r13,-16 # literal pool
601 4003a4: 50 00 f0 00 st %r0,0(%r15) # Save backchain 601 4003a4: 50 00 f0 00 st %r0,0(%r15) # Save backchain
602 602
603 return(test(5)); # Main Program Below 603 return(test(5)); # Main Program Below
604 4003a8: 58 e0 d0 00 l %r14,0(%r13) # load relative address of test from 604 4003a8: 58 e0 d0 00 l %r14,0(%r13) # load relative address of test from
605 # literal pool 605 # literal pool
606 4003ac: a7 28 00 05 lhi %r2,5 # Set first parameter to 5 606 4003ac: a7 28 00 05 lhi %r2,5 # Set first parameter to 5
607 4003b0: 4d ee d0 00 bas %r14,0(%r14,%r13) # jump to test setting r14 as return 607 4003b0: 4d ee d0 00 bas %r14,0(%r14,%r13) # jump to test setting r14 as return
608 # address using branch & save instruction. 608 # address using branch & save instruction.
609 609
610 # Function Epilogue below 610 # Function Epilogue below
611 4003b4: 98 bf f0 8c lm %r11,%r15,140(%r15)# Restore necessary registers. 611 4003b4: 98 bf f0 8c lm %r11,%r15,140(%r15)# Restore necessary registers.
612 4003b8: 07 fe br %r14 # return to do program exit 612 4003b8: 07 fe br %r14 # return to do program exit
613 } 613 }
614 614
615 615
616 Compiler updates 616 Compiler updates
617 ---------------- 617 ----------------
618 618
619 main(int argc,char *argv[]) 619 main(int argc,char *argv[])
620 { 620 {
621 4004fc: 90 7f f0 1c stm %r7,%r15,28(%r15) 621 4004fc: 90 7f f0 1c stm %r7,%r15,28(%r15)
622 400500: a7 d5 00 04 bras %r13,400508 <main+0xc> 622 400500: a7 d5 00 04 bras %r13,400508 <main+0xc>
623 400504: 00 40 04 f4 .long 0x004004f4 623 400504: 00 40 04 f4 .long 0x004004f4
624 # compiler now puts constant pool in code too so it saves an instruction 624 # compiler now puts constant pool in code too so it saves an instruction
625 400508: 18 0f lr %r0,%r15 625 400508: 18 0f lr %r0,%r15
626 40050a: a7 fa ff a0 ahi %r15,-96 626 40050a: a7 fa ff a0 ahi %r15,-96
627 40050e: 50 00 f0 00 st %r0,0(%r15) 627 40050e: 50 00 f0 00 st %r0,0(%r15)
628 return(test(5)); 628 return(test(5));
629 400512: 58 10 d0 00 l %r1,0(%r13) 629 400512: 58 10 d0 00 l %r1,0(%r13)
630 400516: a7 28 00 05 lhi %r2,5 630 400516: a7 28 00 05 lhi %r2,5
631 40051a: 0d e1 basr %r14,%r1 631 40051a: 0d e1 basr %r14,%r1
632 # compiler adds 1 extra instruction to epilogue this is done to 632 # compiler adds 1 extra instruction to epilogue this is done to
633 # avoid processor pipeline stalls owing to data dependencies on g5 & 633 # avoid processor pipeline stalls owing to data dependencies on g5 &
634 # above as register 14 in the old code was needed directly after being loaded 634 # above as register 14 in the old code was needed directly after being loaded
635 # by the lm %r11,%r15,140(%r15) for the br %r14. 635 # by the lm %r11,%r15,140(%r15) for the br %r14.
636 40051c: 58 40 f0 98 l %r4,152(%r15) 636 40051c: 58 40 f0 98 l %r4,152(%r15)
637 400520: 98 7f f0 7c lm %r7,%r15,124(%r15) 637 400520: 98 7f f0 7c lm %r7,%r15,124(%r15)
638 400524: 07 f4 br %r4 638 400524: 07 f4 br %r4
639 } 639 }
640 640
641 641
642 Hartmut ( our compiler developer ) also has been threatening to take out the 642 Hartmut ( our compiler developer ) also has been threatening to take out the
643 stack backchain in optimised code as this also causes pipeline stalls, you 643 stack backchain in optimised code as this also causes pipeline stalls, you
644 have been warned. 644 have been warned.
645 645
646 64 bit z/Architecture code disassembly 646 64 bit z/Architecture code disassembly
647 -------------------------------------- 647 --------------------------------------
648 648
649 If you understand the stuff above you'll understand the stuff 649 If you understand the stuff above you'll understand the stuff
650 below too so I'll avoid repeating myself & just say that 650 below too so I'll avoid repeating myself & just say that
651 some of the instructions have g's on the end of them to indicate 651 some of the instructions have g's on the end of them to indicate
652 they are 64 bit & the stack offsets are a bit bigger, 652 they are 64 bit & the stack offsets are a bit bigger,
653 the only other difference you'll find between 32 & 64 bit is that 653 the only other difference you'll find between 32 & 64 bit is that
654 we now use f4 & f6 for floating point arguments on 64 bit. 654 we now use f4 & f6 for floating point arguments on 64 bit.
655 00000000800005b0 <test>: 655 00000000800005b0 <test>:
656 int test(int b) 656 int test(int b)
657 { 657 {
658 return(5+b); 658 return(5+b);
659 800005b0: a7 2a 00 05 ahi %r2,5 659 800005b0: a7 2a 00 05 ahi %r2,5
660 800005b4: b9 14 00 22 lgfr %r2,%r2 # sign extend the 32 bit int result to 64 bits 660 800005b4: b9 14 00 22 lgfr %r2,%r2 # sign extend the 32 bit int result to 64 bits
661 800005b8: 07 fe br %r14 661 800005b8: 07 fe br %r14
662 800005ba: 07 07 bcr 0,%r7 662 800005ba: 07 07 bcr 0,%r7
663 663
664 664
665 } 665 }
666 666
667 00000000800005bc <main>: 667 00000000800005bc <main>:
668 main(int argc,char *argv[]) 668 main(int argc,char *argv[])
669 { 669 {
670 800005bc: eb bf f0 58 00 24 stmg %r11,%r15,88(%r15) 670 800005bc: eb bf f0 58 00 24 stmg %r11,%r15,88(%r15)
671 800005c2: b9 04 00 1f lgr %r1,%r15 671 800005c2: b9 04 00 1f lgr %r1,%r15
672 800005c6: a7 fb ff 60 aghi %r15,-160 672 800005c6: a7 fb ff 60 aghi %r15,-160
673 800005ca: e3 10 f0 00 00 24 stg %r1,0(%r15) 673 800005ca: e3 10 f0 00 00 24 stg %r1,0(%r15)
674 return(test(5)); 674 return(test(5));
675 800005d0: a7 29 00 05 lghi %r2,5 675 800005d0: a7 29 00 05 lghi %r2,5
676 # brasl allows jumps > 64k & is overkill here bras would do fine 676 # brasl allows jumps > 64k & is overkill here bras would do fine
677 800005d4: c0 e5 ff ff ff ee brasl %r14,800005b0 <test> 677 800005d4: c0 e5 ff ff ff ee brasl %r14,800005b0 <test>
678 800005da: e3 40 f1 10 00 04 lg %r4,272(%r15) 678 800005da: e3 40 f1 10 00 04 lg %r4,272(%r15)
679 800005e0: eb bf f0 f8 00 04 lmg %r11,%r15,248(%r15) 679 800005e0: eb bf f0 f8 00 04 lmg %r11,%r15,248(%r15)
680 800005e6: 07 f4 br %r4 680 800005e6: 07 f4 br %r4
681 } 681 }
682 682
683 683
684 684
685 Compiling programs for debugging on Linux for s/390 & z/Architecture 685 Compiling programs for debugging on Linux for s/390 & z/Architecture
686 ==================================================================== 686 ====================================================================
687 -gdwarf-2 now works & should be considered the default debugging 687 -gdwarf-2 now works & should be considered the default debugging
688 format for s/390 & z/Architecture as it is more reliable for debugging 688 format for s/390 & z/Architecture as it is more reliable for debugging
689 shared libraries. Normal -g debugging works much better now, 689 shared libraries. Normal -g debugging works much better now,
690 thanks to bug reports from the IBM java compiler developers. 690 thanks to bug reports from the IBM java compiler developers.
691 691
692 This is typically done by adding/appending the flags -g or -gdwarf-2 to the 692 This is typically done by adding/appending the flags -g or -gdwarf-2 to the
693 CFLAGS & LDFLAGS variables in the Makefile of the program concerned. 693 CFLAGS & LDFLAGS variables in the Makefile of the program concerned.
694 694
695 If using gdb & you would like accurate displays of registers & 695 If using gdb & you would like accurate displays of registers &
696 stack traces compile without optimisation i.e. make sure 696 stack traces compile without optimisation i.e. make sure
697 that there is no -O2 or similar on the CFLAGS line of the Makefile & 697 that there is no -O2 or similar on the CFLAGS line of the Makefile &
698 the emitted gcc commands, obviously this will produce worse code 698 the emitted gcc commands, obviously this will produce worse code
699 ( not advisable for shipment ) but it is an aid to the debugging process. 699 ( not advisable for shipment ) but it is an aid to the debugging process.
700 700
701 This aids debugging because the compiler will copy parameters passed in 701 This aids debugging because the compiler will copy parameters passed in
702 in registers onto the stack so backtracing & looking at passed in 702 in registers onto the stack so backtracing & looking at passed in
703 parameters will work, however some larger programs which use inline functions 703 parameters will work, however some larger programs which use inline functions
704 will not compile without optimisation. 704 will not compile without optimisation.
705 705
706 Debugging with optimisation has since much improved after fixing 706 Debugging with optimisation has since much improved after fixing
707 some bugs, please make sure you are using gdb-5.0 or later developed 707 some bugs, please make sure you are using gdb-5.0 or later developed
708 after Nov'2000. 708 after Nov'2000.
709 709
710 Figuring out gcc compile errors 710 Figuring out gcc compile errors
711 =============================== 711 ===============================
712 If you are getting a lot of syntax errors compiling a program & the problem 712 If you are getting a lot of syntax errors compiling a program & the problem
713 isn't blatantly obvious from the source, 713 isn't blatantly obvious from the source,
714 it often helps to just preprocess the file. This is done with the -E 714 it often helps to just preprocess the file. This is done with the -E
715 option in gcc. 715 option in gcc.
716 What this does is that it runs through the very first phase of compilation 716 What this does is that it runs through the very first phase of compilation
717 ( compilation in gcc is done in several stages & gcc calls many programs to 717 ( compilation in gcc is done in several stages & gcc calls many programs to
718 achieve its end result ) with the -E option gcc just calls the gcc preprocessor (cpp). 718 achieve its end result ) with the -E option gcc just calls the gcc preprocessor (cpp).
719 The c preprocessor does the following, it joins all the files #included together 719 The c preprocessor does the following, it joins all the files #included together
720 recursively ( #include files can #include other files ) & also the c file you wish to compile. 720 recursively ( #include files can #include other files ) & also the c file you wish to compile.
721 It puts a fully qualified path of the #included files in a comment & it 721 It puts a fully qualified path of the #included files in a comment & it
722 does macro expansion. 722 does macro expansion.
723 This is useful for debugging because 723 This is useful for debugging because
724 1) You can double check whether the files you expect to be included are the ones 724 1) You can double check whether the files you expect to be included are the ones
725 that are being included ( e.g. double check that you aren't going to the i386 asm directory ). 725 that are being included ( e.g. double check that you aren't going to the i386 asm directory ).
726 2) Check that macro definitions aren't clashing with typedefs, 726 2) Check that macro definitions aren't clashing with typedefs,
727 3) Check that definitions aren't being used before they are being included. 727 3) Check that definitions aren't being used before they are being included.
728 4) Helps put the line emitting the error under the microscope if it contains macros. 728 4) Helps put the line emitting the error under the microscope if it contains macros.
729 729
730 For convenience the Linux kernel's makefile will do preprocessing automatically for you 730 For convenience the Linux kernel's makefile will do preprocessing automatically for you
731 by suffixing the file you want built with .i ( instead of .o ) 731 by suffixing the file you want built with .i ( instead of .o )
732 732
733 e.g. 733 e.g.
734 from the linux directory type 734 from the linux directory type
735 make arch/s390/kernel/signal.i 735 make arch/s390/kernel/signal.i
736 this will build 736 this will build
737 737
738 s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer 738 s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
739 -fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -E arch/s390/kernel/signal.c 739 -fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -E arch/s390/kernel/signal.c
740 > arch/s390/kernel/signal.i 740 > arch/s390/kernel/signal.i
741 741
742 Now look at signal.i you should see something like. 742 Now look at signal.i you should see something like.
743 743
744 744
745 # 1 "/home1/barrow/linux/include/asm/types.h" 1 745 # 1 "/home1/barrow/linux/include/asm/types.h" 1
746 typedef unsigned short umode_t; 746 typedef unsigned short umode_t;
747 typedef __signed__ char __s8; 747 typedef __signed__ char __s8;
748 typedef unsigned char __u8; 748 typedef unsigned char __u8;
749 typedef __signed__ short __s16; 749 typedef __signed__ short __s16;
750 typedef unsigned short __u16; 750 typedef unsigned short __u16;
751 751
752 If instead you are getting errors further down e.g. 752 If instead you are getting errors further down e.g.
753 unknown instruction:2515 "move.l" or better still unknown instruction:2515 753 unknown instruction:2515 "move.l" or better still unknown instruction:2515
754 "Fixme not implemented yet, call Martin" you are probably attempting to compile some code 754 "Fixme not implemented yet, call Martin" you are probably attempting to compile some code
755 meant for another architecture or code that is simply not implemented, with a fixme statement 755 meant for another architecture or code that is simply not implemented, with a fixme statement
756 stuck into the inline assembly code so that the author of the file now knows he has work to do. 756 stuck into the inline assembly code so that the author of the file now knows he has work to do.
757 To look at the assembly emitted by gcc just before it is about to call gas ( the gnu assembler ) 757 To look at the assembly emitted by gcc just before it is about to call gas ( the gnu assembler )
758 use the -S option. 758 use the -S option.
759 Again for your convenience the Linux kernel's Makefile will hold your hand & 759 Again for your convenience the Linux kernel's Makefile will hold your hand &
760 do all this donkey work for you also by building the file with the .s suffix. 760 do all this donkey work for you also by building the file with the .s suffix.
761 e.g. 761 e.g.
762 from the Linux directory type 762 from the Linux directory type
763 make arch/s390/kernel/signal.s 763 make arch/s390/kernel/signal.s
764 764
765 s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer 765 s390-gcc -D__KERNEL__ -I/home1/barrow/linux/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
766 -fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -S arch/s390/kernel/signal.c 766 -fno-strict-aliasing -D__SMP__ -pipe -fno-strength-reduce -S arch/s390/kernel/signal.c
767 -o arch/s390/kernel/signal.s 767 -o arch/s390/kernel/signal.s
768 768
769 769
770 This will output something like, ( please note the constant pool & the useful comments 770 This will output something like, ( please note the constant pool & the useful comments
771 in the prologue to give you a hand at interpreting it ). 771 in the prologue to give you a hand at interpreting it ).
772 772
773 .LC54: 773 .LC54:
774 .string "misaligned (__u16 *) in __xchg\n" 774 .string "misaligned (__u16 *) in __xchg\n"
775 .LC57: 775 .LC57:
776 .string "misaligned (__u32 *) in __xchg\n" 776 .string "misaligned (__u32 *) in __xchg\n"
777 .L$PG1: # Pool sys_sigsuspend 777 .L$PG1: # Pool sys_sigsuspend
778 .LC192: 778 .LC192:
779 .long -262401 779 .long -262401
780 .LC193: 780 .LC193:
781 .long -1 781 .long -1
782 .LC194: 782 .LC194:
783 .long schedule-.L$PG1 783 .long schedule-.L$PG1
784 .LC195: 784 .LC195:
785 .long do_signal-.L$PG1 785 .long do_signal-.L$PG1
786 .align 4 786 .align 4
787 .globl sys_sigsuspend 787 .globl sys_sigsuspend
788 .type sys_sigsuspend,@function 788 .type sys_sigsuspend,@function
789 sys_sigsuspend: 789 sys_sigsuspend:
790 # leaf function 0 790 # leaf function 0
791 # automatics 16 791 # automatics 16
792 # outgoing args 0 792 # outgoing args 0
793 # need frame pointer 0 793 # need frame pointer 0
794 # call alloca 0 794 # call alloca 0
795 # has varargs 0 795 # has varargs 0
796 # incoming args (stack) 0 796 # incoming args (stack) 0
797 # function length 168 797 # function length 168
798 STM 8,15,32(15) 798 STM 8,15,32(15)
799 LR 0,15 799 LR 0,15
800 AHI 15,-112 800 AHI 15,-112
801 BASR 13,0 801 BASR 13,0
802 .L$CO1: AHI 13,.L$PG1-.L$CO1 802 .L$CO1: AHI 13,.L$PG1-.L$CO1
803 ST 0,0(15) 803 ST 0,0(15)
804 LR 8,2 804 LR 8,2
805 N 5,.LC192-.L$PG1(13) 805 N 5,.LC192-.L$PG1(13)
806 806
807 Adding -g to the above output makes the output even more useful 807 Adding -g to the above output makes the output even more useful
808 e.g. typing 808 e.g. typing
809 make CC:="s390-gcc -g" kernel/sched.s 809 make CC:="s390-gcc -g" kernel/sched.s
810 810
811 which compiles. 811 which compiles.
812 s390-gcc -g -D__KERNEL__ -I/home/barrow/linux-2.3/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -fno-strength-reduce -S kernel/sched.c -o kernel/sched.s 812 s390-gcc -g -D__KERNEL__ -I/home/barrow/linux-2.3/include -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe -fno-strength-reduce -S kernel/sched.c -o kernel/sched.s
813 813
814 also outputs stabs ( debugger ) info, from this info you can find out the 814 also outputs stabs ( debugger ) info, from this info you can find out the
815 offsets & sizes of various elements in structures. 815 offsets & sizes of various elements in structures.
816 e.g. the stab for the structure 816 e.g. the stab for the structure
817 struct rlimit { 817 struct rlimit {
818 unsigned long rlim_cur; 818 unsigned long rlim_cur;
819 unsigned long rlim_max; 819 unsigned long rlim_max;
820 }; 820 };
821 is 821 is
822 .stabs "rlimit:T(151,2)=s8rlim_cur:(0,5),0,32;rlim_max:(0,5),32,32;;",128,0,0,0 822 .stabs "rlimit:T(151,2)=s8rlim_cur:(0,5),0,32;rlim_max:(0,5),32,32;;",128,0,0,0
823 from this stab you can see that 823 from this stab you can see that
824 rlim_cur starts at bit offset 0 & is 32 bits in size 824 rlim_cur starts at bit offset 0 & is 32 bits in size
825 rlim_max starts at bit offset 32 & is 32 bits in size. 825 rlim_max starts at bit offset 32 & is 32 bits in size.
826 826
827 827
828 Debugging Tools: 828 Debugging Tools:
829 ================ 829 ================
830 830
831 objdump 831 objdump
832 ======= 832 =======
833 This is a tool with many options the most useful being ( if compiled with -g). 833 This is a tool with many options the most useful being ( if compiled with -g).
834 objdump --source <victim program or object file> > <victims debug listing > 834 objdump --source <victim program or object file> > <victims debug listing >
835 835
836 836
837 The whole kernel can be compiled like this ( Doing this will make a 17MB kernel 837 The whole kernel can be compiled like this ( Doing this will make a 17MB kernel
838 & a 200 MB listing ) however you have to strip it before building the image 838 & a 200 MB listing ) however you have to strip it before building the image
839 using the strip command to make it a more reasonable size to boot it. 839 using the strip command to make it a more reasonable size to boot it.
840 840
841 A source/assembly mixed dump of the kernel can be done with the line 841 A source/assembly mixed dump of the kernel can be done with the line
842 objdump --source vmlinux > vmlinux.lst 842 objdump --source vmlinux > vmlinux.lst
843 Also, if the file isn't compiled -g, this will output as much debugging information 843 Also, if the file isn't compiled -g, this will output as much debugging information
844 as it can (e.g. function names). This is very slow as it spends lots 844 as it can (e.g. function names). This is very slow as it spends lots
845 of time searching for debugging info. The following self explanatory line should be used 845 of time searching for debugging info. The following self explanatory line should be used
846 instead if the code isn't compiled -g, as it is much faster: 846 instead if the code isn't compiled -g, as it is much faster:
847 objdump --disassemble-all --syms vmlinux > vmlinux.lst 847 objdump --disassemble-all --syms vmlinux > vmlinux.lst
848 848
849 As hard drive space is valuable most of us use the following approach. 849 As hard drive space is valuable most of us use the following approach.
850 1) Look at the emitted psw on the console to find the crash address in the kernel. 850 1) Look at the emitted psw on the console to find the crash address in the kernel.
851 2) Look at the file System.map ( in the linux directory ) produced when building 851 2) Look at the file System.map ( in the linux directory ) produced when building
852 the kernel to find the closest address less than the current PSW to find the 852 the kernel to find the closest address less than the current PSW to find the
853 offending function. 853 offending function.
854 3) use grep or similar to search the source tree looking for the source file 854 3) use grep or similar to search the source tree looking for the source file
855 with this function if you don't know where it is. 855 with this function if you don't know where it is.
856 4) rebuild this object file with -g on, as an example suppose the file was 856 4) rebuild this object file with -g on, as an example suppose the file was
857 ( /arch/s390/kernel/signal.o ) 857 ( /arch/s390/kernel/signal.o )
858 5) Assuming the file with the erroneous function is signal.c Move to the base of the 858 5) Assuming the file with the erroneous function is signal.c Move to the base of the
859 Linux source tree. 859 Linux source tree.
860 6) rm arch/s390/kernel/signal.o 860 6) rm arch/s390/kernel/signal.o
861 7) make arch/s390/kernel/signal.o 861 7) make arch/s390/kernel/signal.o
862 8) watch the gcc command line emitted 862 8) watch the gcc command line emitted
863 9) type it in again or alternatively cut & paste it on the console adding the -g option. 863 9) type it in again or alternatively cut & paste it on the console adding the -g option.
864 10) objdump --source arch/s390/kernel/signal.o > signal.lst 864 10) objdump --source arch/s390/kernel/signal.o > signal.lst
865 This will output the source & the assembly intermixed, as the snippet below shows. 865 This will output the source & the assembly intermixed, as the snippet below shows.
866 This will unfortunately output addresses which aren't the same 866 This will unfortunately output addresses which aren't the same
867 as the kernel ones, you should be able to get around the mental arithmetic 867 as the kernel ones, you should be able to get around the mental arithmetic
868 by playing with the --adjust-vma parameter to objdump. 868 by playing with the --adjust-vma parameter to objdump.
869 869
870 870
871 871
872 872
873 static inline void spin_lock(spinlock_t *lp) 873 static inline void spin_lock(spinlock_t *lp)
874 { 874 {
875 a0: 18 34 lr %r3,%r4 875 a0: 18 34 lr %r3,%r4
876 a2: a7 3a 03 bc ahi %r3,956 876 a2: a7 3a 03 bc ahi %r3,956
877 __asm__ __volatile(" lhi 1,-1\n" 877 __asm__ __volatile(" lhi 1,-1\n"
878 a6: a7 18 ff ff lhi %r1,-1 878 a6: a7 18 ff ff lhi %r1,-1
879 aa: 1f 00 slr %r0,%r0 879 aa: 1f 00 slr %r0,%r0
880 ac: ba 01 30 00 cs %r0,%r1,0(%r3) 880 ac: ba 01 30 00 cs %r0,%r1,0(%r3)
881 b0: a7 44 ff fd jm aa <sys_sigsuspend+0x2e> 881 b0: a7 44 ff fd jm aa <sys_sigsuspend+0x2e>
882 saveset = current->blocked; 882 saveset = current->blocked;
883 b4: d2 07 f0 68 mvc 104(8,%r15),972(%r4) 883 b4: d2 07 f0 68 mvc 104(8,%r15),972(%r4)
884 b8: 43 cc 884 b8: 43 cc
885 return (set->sig[0] & mask) != 0; 885 return (set->sig[0] & mask) != 0;
886 } 886 }
887 887
888 11) If debugging under VM go down to that section in the document for more info. 888 11) If debugging under VM go down to that section in the document for more info.
889 889
890 890
891 I now have a tool which takes the pain out of --adjust-vma 891 I now have a tool which takes the pain out of --adjust-vma
892 & you are able to do something like 892 & you are able to do something like
893 make arch/s390/kernel/traps.lst 893 make arch/s390/kernel/traps.lst
894 & it automatically generates the correctly relocated entries for 894 & it automatically generates the correctly relocated entries for
895 the text segment in traps.lst. 895 the text segment in traps.lst.
896 This tool is now standard in Linux distros in scripts/makelst 896 This tool is now standard in Linux distros in scripts/makelst
897 897
898 strace: 898 strace:
899 ------- 899 -------
900 Q. What is it ? 900 Q. What is it ?
901 A. It is a tool for intercepting calls to the kernel & logging them 901 A. It is a tool for intercepting calls to the kernel & logging them
902 to a file & on the screen. 902 to a file & on the screen.
903 903
904 Q. What use is it ? 904 Q. What use is it ?
905 A. You can use it to find out what files a particular program opens. 905 A. You can use it to find out what files a particular program opens.
906 906
907 907
908 908
909 Example 1 909 Example 1
910 --------- 910 ---------
911 If you wanted to know how ping works but didn't have the source 911 If you wanted to know how ping works but didn't have the source
912 strace ping -c 1 127.0.0.1 912 strace ping -c 1 127.0.0.1
913 & then look at the man pages for each of the syscalls below, 913 & then look at the man pages for each of the syscalls below,
914 ( In fact this is sometimes easier than looking at some spaghetti 914 ( In fact this is sometimes easier than looking at some spaghetti
915 source which conditionally compiles for several architectures ). 915 source which conditionally compiles for several architectures ).
916 Not everything that it throws out needs to make sense immediately. 916 Not everything that it throws out needs to make sense immediately.
917 917
918 Just looking quickly you can see that it is making up a RAW socket 918 Just looking quickly you can see that it is making up a RAW socket
919 for the ICMP protocol. 919 for the ICMP protocol.
920 Doing an alarm(10) for a 10 second timeout 920 Doing an alarm(10) for a 10 second timeout
921 & doing a gettimeofday call before & after each read to see 921 & doing a gettimeofday call before & after each read to see
922 how long the replies took, & writing some text to stdout so the user 922 how long the replies took, & writing some text to stdout so the user
923 has an idea what is going on. 923 has an idea what is going on.
924 924
925 socket(PF_INET, SOCK_RAW, IPPROTO_ICMP) = 3 925 socket(PF_INET, SOCK_RAW, IPPROTO_ICMP) = 3
926 getuid() = 0 926 getuid() = 0
927 setuid(0) = 0 927 setuid(0) = 0
928 stat("/usr/share/locale/C/libc.cat", 0xbffff134) = -1 ENOENT (No such file or directory) 928 stat("/usr/share/locale/C/libc.cat", 0xbffff134) = -1 ENOENT (No such file or directory)
929 stat("/usr/share/locale/libc/C", 0xbffff134) = -1 ENOENT (No such file or directory) 929 stat("/usr/share/locale/libc/C", 0xbffff134) = -1 ENOENT (No such file or directory)
930 stat("/usr/local/share/locale/C/libc.cat", 0xbffff134) = -1 ENOENT (No such file or directory) 930 stat("/usr/local/share/locale/C/libc.cat", 0xbffff134) = -1 ENOENT (No such file or directory)
931 getpid() = 353 931 getpid() = 353
932 setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0 932 setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
933 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [49152], 4) = 0 933 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [49152], 4) = 0
934 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(3, 1), ...}) = 0 934 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(3, 1), ...}) = 0
935 mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40008000 935 mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40008000
936 ioctl(1, TCGETS, {B9600 opost isig icanon echo ...}) = 0 936 ioctl(1, TCGETS, {B9600 opost isig icanon echo ...}) = 0
937 write(1, "PING 127.0.0.1 (127.0.0.1): 56 d"..., 42PING 127.0.0.1 (127.0.0.1): 56 data bytes 937 write(1, "PING 127.0.0.1 (127.0.0.1): 56 d"..., 42PING 127.0.0.1 (127.0.0.1): 56 data bytes
938 ) = 42 938 ) = 42
939 sigaction(SIGINT, {0x8049ba0, [], SA_RESTART}, {SIG_DFL}) = 0 939 sigaction(SIGINT, {0x8049ba0, [], SA_RESTART}, {SIG_DFL}) = 0
940 sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {SIG_DFL}) = 0 940 sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {SIG_DFL}) = 0
941 gettimeofday({948904719, 138951}, NULL) = 0 941 gettimeofday({948904719, 138951}, NULL) = 0
942 sendto(3, "\10\0D\201a\1\0\0\17#\2178\307\36"..., 64, 0, {sin_family=AF_INET, 942 sendto(3, "\10\0D\201a\1\0\0\17#\2178\307\36"..., 64, 0, {sin_family=AF_INET,
943 sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 64 943 sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 64
944 sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0 944 sigaction(SIGALRM, {0x8049600, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0
945 sigaction(SIGALRM, {0x8049ba0, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0 945 sigaction(SIGALRM, {0x8049ba0, [], SA_RESTART}, {0x8049600, [], SA_RESTART}) = 0
946 alarm(10) = 0 946 alarm(10) = 0
947 recvfrom(3, "E\0\0T\0005\0\0@\1|r\177\0\0\1\177"..., 192, 0, 947 recvfrom(3, "E\0\0T\0005\0\0@\1|r\177\0\0\1\177"..., 192, 0,
948 {sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("127.0.0.1")}, [16]) = 84 948 {sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("127.0.0.1")}, [16]) = 84
949 gettimeofday({948904719, 160224}, NULL) = 0 949 gettimeofday({948904719, 160224}, NULL) = 0
950 recvfrom(3, "E\0\0T\0006\0\0\377\1\275p\177\0"..., 192, 0, 950 recvfrom(3, "E\0\0T\0006\0\0\377\1\275p\177\0"..., 192, 0,
951 {sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("127.0.0.1")}, [16]) = 84 951 {sin_family=AF_INET, sin_port=htons(50882), sin_addr=inet_addr("127.0.0.1")}, [16]) = 84
952 gettimeofday({948904719, 166952}, NULL) = 0 952 gettimeofday({948904719, 166952}, NULL) = 0
953 write(1, "64 bytes from 127.0.0.1: icmp_se"..., 953 write(1, "64 bytes from 127.0.0.1: icmp_se"...,
954 5764 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=28.0 ms 954 5764 bytes from 127.0.0.1: icmp_seq=0 ttl=255 time=28.0 ms
955 955
956 Example 2 956 Example 2
957 --------- 957 ---------
958 strace passwd 2>&1 | grep open 958 strace passwd 2>&1 | grep open
959 produces the following output 959 produces the following output
960 open("/etc/ld.so.cache", O_RDONLY) = 3 960 open("/etc/ld.so.cache", O_RDONLY) = 3
961 open("/opt/kde/lib/libc.so.5", O_RDONLY) = -1 ENOENT (No such file or directory) 961 open("/opt/kde/lib/libc.so.5", O_RDONLY) = -1 ENOENT (No such file or directory)
962 open("/lib/libc.so.5", O_RDONLY) = 3 962 open("/lib/libc.so.5", O_RDONLY) = 3
963 open("/dev", O_RDONLY) = 3 963 open("/dev", O_RDONLY) = 3
964 open("/var/run/utmp", O_RDONLY) = 3 964 open("/var/run/utmp", O_RDONLY) = 3
965 open("/etc/passwd", O_RDONLY) = 3 965 open("/etc/passwd", O_RDONLY) = 3
966 open("/etc/shadow", O_RDONLY) = 3 966 open("/etc/shadow", O_RDONLY) = 3
967 open("/etc/login.defs", O_RDONLY) = 4 967 open("/etc/login.defs", O_RDONLY) = 4
968 open("/dev/tty", O_RDONLY) = 4 968 open("/dev/tty", O_RDONLY) = 4
969 969
970 The 2>&1 is done to redirect stderr to stdout & grep is then filtering this input 970 The 2>&1 is done to redirect stderr to stdout & grep is then filtering this input
971 through the pipe for each line containing the string open. 971 through the pipe for each line containing the string open.
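
The same filtering can be sketched in Python. This is only a minimal illustration, assuming strace is installed; the helper names filter_trace and trace_opens are made up for this example and are not part of strace.

```python
import subprocess


def filter_trace(lines, needle="open"):
    """Keep only the trace lines containing needle, like `grep open`."""
    return [line for line in lines if needle in line]


def trace_opens(command):
    """Run command under strace and return the lines that mention open.

    strace writes its trace to stderr, so stderr is merged into stdout
    here, which is the job 2>&1 does in the shell pipeline.
    """
    result = subprocess.run(
        ["strace"] + list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return filter_trace(result.stdout.splitlines())
```

Calling trace_opens(["passwd"]) would correspond to the shell pipeline in this example; filter_trace can also be used on its own with a trace file that was written with -o.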
972 972
973 973
974 Example 3 974 Example 3
975 --------- 975 ---------
976 Getting sophisticated 976 Getting sophisticated
977 telnetd crashes & I don't know why 977 telnetd crashes & I don't know why
978 978
979 Steps 979 Steps
980 ----- 980 -----
981 1) Replace the following line in /etc/inetd.conf 981 1) Replace the following line in /etc/inetd.conf
982 telnet stream tcp nowait root /usr/sbin/in.telnetd -h 982 telnet stream tcp nowait root /usr/sbin/in.telnetd -h
983 with 983 with
984 telnet stream tcp nowait root /blah 984 telnet stream tcp nowait root /blah
985 985
986 2) Create the file /blah with the following contents to start tracing telnetd 986 2) Create the file /blah with the following contents to start tracing telnetd
987 #!/bin/bash 987 #!/bin/bash
988 /usr/bin/strace -o/t1 -f /usr/sbin/in.telnetd -h 988 /usr/bin/strace -o/t1 -f /usr/sbin/in.telnetd -h
989 3) chmod 700 /blah to make it executable only to root 989 3) chmod 700 /blah to make it executable only to root
990 4) 990 4)
991 killall -HUP inetd 991 killall -HUP inetd
992 or ps aux | grep inetd 992 or ps aux | grep inetd
993 get inetd's process id 993 get inetd's process id
994 & kill -HUP <pid> to restart it. 994 & kill -HUP <pid> to restart it.
995 995
996 Important options 996 Important options
997 ----------------- 997 -----------------
998 -o is used to tell strace to output to a file in our case t1 in the root directory 998 -o is used to tell strace to output to a file in our case t1 in the root directory
999 -f is to follow children i.e. 999 -f is to follow children i.e.
1000 e.g in our case above telnetd will start the login process & subsequently a shell like bash. 1000 e.g in our case above telnetd will start the login process & subsequently a shell like bash.
1001 You will be able to tell which is which from the process ID's listed on the left hand side 1001 You will be able to tell which is which from the process ID's listed on the left hand side
1002 of the strace output. 1002 of the strace output.
1003 -p<pid> will tell strace to attach to a running process, yup this can be done provided 1003 -p<pid> will tell strace to attach to a running process, yup this can be done provided
1004 it isn't being traced or debugged already & you have enough privileges, 1004 it isn't being traced or debugged already & you have enough privileges,
1005 the reason 2 processes cannot trace or debug the same program is that strace 1005 the reason 2 processes cannot trace or debug the same program is that strace
1006 becomes the parent process of the one being debugged & processes ( unlike people ) 1006 becomes the parent process of the one being debugged & processes ( unlike people )
1007 can have only one parent. 1007 can have only one parent.
1008 1008
1009 1009
1010 However the file /t1 will get big quite quickly 1010 However the file /t1 will get big quite quickly
1011 to test it telnet 127.0.0.1 1011 to test it telnet 127.0.0.1
1012 1012
1013 now look at what files in.telnetd execve'd 1013 now look at what files in.telnetd execve'd
1014 413 execve("/usr/sbin/in.telnetd", ["/usr/sbin/in.telnetd", "-h"], [/* 17 vars */]) = 0 1014 413 execve("/usr/sbin/in.telnetd", ["/usr/sbin/in.telnetd", "-h"], [/* 17 vars */]) = 0
1015 414 execve("/bin/login", ["/bin/login", "-h", "localhost", "-p"], [/* 2 vars */]) = 0 1015 414 execve("/bin/login", ["/bin/login", "-h", "localhost", "-p"], [/* 2 vars */]) = 0
1016 1016
1017 Whey it worked!. 1017 Whey it worked!.
1018 1018
1019 1019
1020 Other hints: 1020 Other hints:
1021 ------------ 1021 ------------
1022 If the program is not very interactive ( i.e. not much keyboard input ) 1022 If the program is not very interactive ( i.e. not much keyboard input )
1023 & is crashing in one architecture but not in another you can do 1023 & is crashing in one architecture but not in another you can do
1024 an strace of both programs under as identical a scenario as you can 1024 an strace of both programs under as identical a scenario as you can
1025 on both architectures outputting to a file then. 1025 on both architectures outputting to a file then.
1026 do a diff of the two traces using the diff program 1026 do a diff of the two traces using the diff program
1027 i.e. 1027 i.e.
1028 diff output1 output2 1028 diff output1 output2
1029 & maybe you'll be able to see where the call paths differed, this 1029 & maybe you'll be able to see where the call paths differed, this
1030 is possibly near the cause of the crash. 1030 is possibly near the cause of the crash.
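
If the diff program isn't to hand, the comparison can be sketched with Python's difflib. This is a rough sketch only: the file names output1 and output2 are taken from the example above, and it assumes process IDs and addresses were stripped from the traces beforehand, otherwise nearly every line will differ.

```python
import difflib


def diff_traces(trace_a, trace_b):
    """Return a unified diff of two traces given as lists of lines."""
    return list(
        difflib.unified_diff(
            trace_a,
            trace_b,
            fromfile="output1",
            tofile="output2",
            lineterm="",
        )
    )


def first_difference(trace_a, trace_b):
    """Return the index of the first line where the traces diverge, or None."""
    for index, (left, right) in enumerate(zip(trace_a, trace_b)):
        if left != right:
            return index
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None
```

first_difference gives the spot where the call paths first part ways, which is usually the most interesting line to start reading from.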
1031 1031
1032 More info 1032 More info
1033 --------- 1033 ---------
1034 Look at man pages for strace & the various syscalls 1034 Look at man pages for strace & the various syscalls
1035 e.g. man strace, man alarm, man socket. 1035 e.g. man strace, man alarm, man socket.
1036 1036
1037 1037
1038 Performance Debugging 1038 Performance Debugging
1039 ===================== 1039 =====================
1040 gcc is capable of compiling in profiling code, just add the -p option 1040 gcc is capable of compiling in profiling code, just add the -p option
1041 to the CFLAGS, this obviously affects program size & performance. 1041 to the CFLAGS, this obviously affects program size & performance.
1042 This can be used by the gprof gnu profiling tool or the 1042 This can be used by the gprof gnu profiling tool or the
1043 gcov the gnu code coverage tool ( code coverage is a means of testing 1043 gcov the gnu code coverage tool ( code coverage is a means of testing
1044 code quality by checking if all the code in an executable is exercised by 1044 code quality by checking if all the code in an executable is exercised by
1045 a tester ). 1045 a tester ).
1046 1046
1047 1047
1048 Using top to find out where processes are sleeping in the kernel 1048 Using top to find out where processes are sleeping in the kernel
1049 ---------------------------------------------------------------- 1049 ----------------------------------------------------------------
1050 To do this copy the System.map from the root directory where 1050 To do this copy the System.map from the root directory where
1051 the linux kernel was built to the /boot directory on your 1051 the linux kernel was built to the /boot directory on your
1052 linux machine. 1052 linux machine.
1053 Start top 1053 Start top
1054 Now type fU<return> 1054 Now type fU<return>
1055 You should see a new field called WCHAN which 1055 You should see a new field called WCHAN which
1056 tells you where each process is sleeping here is a typical output. 1056 tells you where each process is sleeping here is a typical output.
1057 1057
1058 6:59pm up 41 min, 1 user, load average: 0.00, 0.00, 0.00 1058 6:59pm up 41 min, 1 user, load average: 0.00, 0.00, 0.00
1059 28 processes: 27 sleeping, 1 running, 0 zombie, 0 stopped 1059 28 processes: 27 sleeping, 1 running, 0 zombie, 0 stopped
1060 CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle 1060 CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle
1061 Mem: 254900K av, 45976K used, 208924K free, 0K shrd, 28636K buff 1061 Mem: 254900K av, 45976K used, 208924K free, 0K shrd, 28636K buff
1062 Swap: 0K av, 0K used, 0K free 8620K cached 1062 Swap: 0K av, 0K used, 0K free 8620K cached
1063 1063
1064 PID USER PRI NI SIZE RSS SHARE WCHAN STAT LIB %CPU %MEM TIME COMMAND 1064 PID USER PRI NI SIZE RSS SHARE WCHAN STAT LIB %CPU %MEM TIME COMMAND
1065 750 root 12 0 848 848 700 do_select S 0 0.1 0.3 0:00 in.telnetd 1065 750 root 12 0 848 848 700 do_select S 0 0.1 0.3 0:00 in.telnetd
1066 767 root 16 0 1140 1140 964 R 0 0.1 0.4 0:00 top 1066 767 root 16 0 1140 1140 964 R 0 0.1 0.4 0:00 top
1067 1 root 8 0 212 212 180 do_select S 0 0.0 0.0 0:00 init 1067 1 root 8 0 212 212 180 do_select S 0 0.0 0.0 0:00 init
1068 2 root 9 0 0 0 0 down_inte SW 0 0.0 0.0 0:00 kmcheck 1068 2 root 9 0 0 0 0 down_inte SW 0 0.0 0.0 0:00 kmcheck
1069 1069
1070 The time command 1070 The time command
1071 ---------------- 1071 ----------------
1072 Another related command is the time command which gives you an indication 1072 Another related command is the time command which gives you an indication
1073 of where a process is spending the majority of its time. 1073 of where a process is spending the majority of its time.
1074 e.g. 1074 e.g.
1075 time ping -c 5 nc 1075 time ping -c 5 nc
1076 outputs 1076 outputs
1077 real 0m4.054s 1077 real 0m4.054s
1078 user 0m0.010s 1078 user 0m0.010s
1079 sys 0m0.010s 1079 sys 0m0.010s
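
To read these numbers: real is wall clock time, user & sys are CPU time spent in user space & in the kernel. A minimal Python sketch of the arithmetic follows; the function names are invented for this example & the parser only assumes the NmS.SSSs layout shown above.

```python
import re

# Matches lines such as "real 0m4.054s" in the output of the time command.
TIME_LINE = re.compile(r"^(real|user|sys)[ \t]+(\d+)m(\d+(?:\.\d+)?)s[ \t]*$",
                       re.MULTILINE)


def parse_time_output(text):
    """Return a dict mapping real/user/sys to seconds."""
    times = {}
    for label, minutes, seconds in TIME_LINE.findall(text):
        times[label] = int(minutes) * 60 + float(seconds)
    return times


def waiting_fraction(times):
    """Fraction of wall clock time not spent on the CPU (sleeping or waiting)."""
    cpu = times["user"] + times["sys"]
    return (times["real"] - cpu) / times["real"]
```

For the ping example above nearly all of the real time is spent waiting, which is what you would expect from a program that sleeps between packets.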
1080 1080
1081 Debugging under VM 1081 Debugging under VM
1082 ================== 1082 ==================
1083 1083
1084 Notes 1084 Notes
1085 ----- 1085 -----
1086 Addresses & values in the VM debugger are always hex never decimal 1086 Addresses & values in the VM debugger are always hex never decimal
1087 Address ranges are of the format <HexValue1>-<HexValue2> or <HexValue1>.<HexValue2> 1087 Address ranges are of the format <HexValue1>-<HexValue2> or <HexValue1>.<HexValue2>
1088 e.g. The address range 0x2000 to 0x3000 can be described described as 1088 e.g. The address range 0x2000 to 0x3000 can be described as 2000-3000 or 2000.1000
1089 2000-3000 or 2000.1000
1090 1089
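
A small Python sketch of the two range notations, following the description above ( start-end & start.length ). The function name parse_vm_range is made up for this example & the treatment of the end address simply mirrors the 2000-3000 / 2000.1000 equivalence given in the text.

```python
def parse_vm_range(text):
    """Convert a VM debugger range into (start, end) integers.

    "2000-3000" is start-end, "2000.1000" is start.length.
    All values are hex, and int(..., 16) accepts either case.
    """
    if "-" in text:
        start, end = text.split("-", 1)
        return int(start, 16), int(end, 16)
    if "." in text:
        start, length = text.split(".", 1)
        start_value = int(start, 16)
        return start_value, start_value + int(length, 16)
    raise ValueError("expected <hex>-<hex> or <hex>.<hex>: %r" % text)
```

Both parse_vm_range("2000-3000") & parse_vm_range("2000.1000") describe the range 0x2000 to 0x3000.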
1091 The VM Debugger is case insensitive. 1090 The VM Debugger is case insensitive.
1092 1091
1093 VM's strengths are usually other debuggers' weaknesses, you can get at any resource 1092 VM's strengths are usually other debuggers' weaknesses, you can get at any resource
1094 no matter how sensitive e.g. memory management resources,change address translation 1093 no matter how sensitive e.g. memory management resources,change address translation
1095 in the PSW. For kernel hacking you will reap dividends if you get good at it. 1094 in the PSW. For kernel hacking you will reap dividends if you get good at it.
1096 1095
1097 The VM Debugger displays operators but not operands, probably because some 1096 The VM Debugger displays operators but not operands, probably because some
1098 of it was written when memory was expensive & the programmer was probably proud that 1097 of it was written when memory was expensive & the programmer was probably proud that
1099 it fitted into 2k of memory & the programmers didn't want to shock hardcore VM'ers by 1098 it fitted into 2k of memory & the programmers didn't want to shock hardcore VM'ers by
1100 changing the interface :-), also the debugger displays useful information on the same line & 1099 changing the interface :-), also the debugger displays useful information on the same line &
1101 the author of the code probably felt that it was a good idea not to go over 1100 the author of the code probably felt that it was a good idea not to go over
1102 the 80 columns on the screen. 1101 the 80 columns on the screen.
1103 1102
1104 As some of you are probably in a panic now this isn't as unintuitive as it may seem 1103 As some of you are probably in a panic now this isn't as unintuitive as it may seem
1105 as the 390 instructions are easy to decode mentally & you can make a good guess at a lot 1104 as the 390 instructions are easy to decode mentally & you can make a good guess at a lot
1106 of them as all the operands are nibble ( half byte aligned ) & if you have an objdump listing 1105 of them as all the operands are nibble ( half byte aligned ) & if you have an objdump listing
1107 also it is quite easy to follow, if you don't have an objdump listing keep a copy of 1106 also it is quite easy to follow, if you don't have an objdump listing keep a copy of
1108 the s/390 Reference Summary & look at between pages 2 & 7 or alternatively the 1107 the s/390 Reference Summary & look at between pages 2 & 7 or alternatively the
1109 s/390 principles of operation. 1108 s/390 principles of operation.
1110 e.g. even I can guess that 1109 e.g. even I can guess that
1111 0001AFF8' LR 180F CC 0 1110 0001AFF8' LR 180F CC 0
1112 is a ( load register ) lr r0,r15 1111 is a ( load register ) lr r0,r15
1113 1112
1114 Also it is very easy to tell the length of a 390 instruction from the 2 most significant 1113 Also it is very easy to tell the length of a 390 instruction from the 2 most significant
1115 bits in the instruction ( not that this info is really useful except if you are trying to 1114 bits in the instruction ( not that this info is really useful except if you are trying to
1116 make sense of a hexdump of code ). 1115 make sense of a hexdump of code ).
1117 Here is a table 1116 Here is a table
1118 Bits Instruction Length 1117 Bits Instruction Length
1119 ------------------------------------------ 1118 ------------------------------------------
1120 00 2 Bytes 1119 00 2 Bytes
1121 01 4 Bytes 1120 01 4 Bytes
1122 10 4 Bytes 1121 10 4 Bytes
1123 11 6 Bytes 1122 11 6 Bytes
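
The table can be written as a few lines of Python. This is only a sketch, the function name instruction_length is invented for this example; it takes the first byte of the instruction, whose top 2 bits are the ones listed in the table.

```python
# Instruction length in bytes, indexed by the 2 most significant opcode bits.
LENGTH_BY_TOP_BITS = {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}


def instruction_length(first_byte):
    """Return the length in bytes of a 390 instruction from its first byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be a single byte")
    return LENGTH_BY_TOP_BITS[first_byte >> 6]
```

e.g. the LR shown earlier starts with 0x18 ( top bits 00 ) so it is 2 bytes long, which matches the 4 hex digits 180F in the trace line.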
1124 1123
1125 1124
1126 1125
1127 1126
1128 The debugger also displays other useful info on the same line such as the 1127 The debugger also displays other useful info on the same line such as the
1129 addresses being operated on destination addresses of branches & condition codes. 1128 addresses being operated on destination addresses of branches & condition codes.
1130 e.g. 1129 e.g.
1131 00019736' AHI A7DAFF0E CC 1 1130 00019736' AHI A7DAFF0E CC 1
1132 000198BA' BRC A7840004 -> 000198C2' CC 0 1131 000198BA' BRC A7840004 -> 000198C2' CC 0
1133 000198CE' STM 900EF068 >> 0FA95E78 CC 2 1132 000198CE' STM 900EF068 >> 0FA95E78 CC 2
1134 1133
1135 1134
1136 1135
1137 Useful VM debugger commands 1136 Useful VM debugger commands
1138 --------------------------- 1137 ---------------------------
1139 1138
1140 I suppose I'd better mention this before I start 1139 I suppose I'd better mention this before I start
1141 to list the current active traces do 1140 to list the current active traces do
1142 Q TR 1141 Q TR
1143 there can be a maximum of 255 of these per set 1142 there can be a maximum of 255 of these per set
1144 ( more about trace sets later ). 1143 ( more about trace sets later ).
1145 To stop traces issue a 1144 To stop traces issue a
1146 TR END. 1145 TR END.
1147 To delete a particular breakpoint issue 1146 To delete a particular breakpoint issue
1148 TR DEL <breakpoint number> 1147 TR DEL <breakpoint number>
1149 1148
1150 The PA1 key drops to CP mode so you can issue debugger commands, 1149 The PA1 key drops to CP mode so you can issue debugger commands,
1151 Doing alt c (on my 3270 console at least ) clears the screen. 1150 Doing alt c (on my 3270 console at least ) clears the screen.
1152 hitting b <enter> comes back to the running operating system 1151 hitting b <enter> comes back to the running operating system
1153 from cp mode ( in our case linux ). 1152 from cp mode ( in our case linux ).
1154 It is typically useful to add shortcuts to your profile.exec file 1153 It is typically useful to add shortcuts to your profile.exec file
1155 if you have one ( this is roughly equivalent to autoexec.bat in DOS ). 1154 if you have one ( this is roughly equivalent to autoexec.bat in DOS ).
1156 Here are a few from mine. 1155 Here are a few from mine.
1157 /* this gives me command history on issuing f12 */ 1156 /* this gives me command history on issuing f12 */
1158 set pf12 retrieve 1157 set pf12 retrieve
1159 /* this continues */ 1158 /* this continues */
1160 set pf8 imm b 1159 set pf8 imm b
1161 /* goes to trace set a */ 1160 /* goes to trace set a */
1162 set pf1 imm tr goto a 1161 set pf1 imm tr goto a
1163 /* goes to trace set b */ 1162 /* goes to trace set b */
1164 set pf2 imm tr goto b 1163 set pf2 imm tr goto b
1165 /* goes to trace set c */ 1164 /* goes to trace set c */
1166 set pf3 imm tr goto c 1165 set pf3 imm tr goto c
1167 1166
1168 1167
1169 1168
1170 Instruction Tracing 1169 Instruction Tracing
1171 ------------------- 1170 -------------------
1172 Setting a simple breakpoint 1171 Setting a simple breakpoint
1173 TR I PSWA <address> 1172 TR I PSWA <address>
1174 To debug a particular function try 1173 To debug a particular function try
1175 TR I R <function address range> 1174 TR I R <function address range>
1176 TR I on its own will single step. 1175 TR I on its own will single step.
1177 TR I DATA <MNEMONIC> <OPTIONAL RANGE> will trace for particular mnemonics 1176 TR I DATA <MNEMONIC> <OPTIONAL RANGE> will trace for particular mnemonics
1178 e.g. 1177 e.g.
1179 TR I DATA 4D R 0197BC.4000 1178 TR I DATA 4D R 0197BC.4000
1180 will trace for BAS'es ( opcode 4D ) in the range 0197BC.4000 1179 will trace for BAS'es ( opcode 4D ) in the range 0197BC.4000
1181 if you were inclined you could add traces for all branch instructions & 1180 if you were inclined you could add traces for all branch instructions &
1182 suffix them with the run prefix so you would have a backtrace on screen 1181 suffix them with the run prefix so you would have a backtrace on screen
1183 when a program crashes. 1182 when a program crashes.
1184 TR BR <INTO OR FROM> will trace branches into or out of an address. 1183 TR BR <INTO OR FROM> will trace branches into or out of an address.
1185 e.g. 1184 e.g.
1186 TR BR INTO 0 is often quite useful if a program is getting awkward & deciding 1185 TR BR INTO 0 is often quite useful if a program is getting awkward & deciding
1187 to branch to 0 & crashing as this will stop at the address before it jumps to 0. 1186 to branch to 0 & crashing as this will stop at the address before it jumps to 0.
1188 TR I R <address range> RUN cmd d g 1187 TR I R <address range> RUN cmd d g
1189 single steps a range of addresses but stays running & 1188 single steps a range of addresses but stays running &
1190 displays the gprs on each step. 1189 displays the gprs on each step.
1191 1190
1192 1191
1193 1192
1194 Displaying & modifying Registers 1193 Displaying & modifying Registers
1195 -------------------------------- 1194 --------------------------------
1196 D G will display all the gprs 1195 D G will display all the gprs
1197 Adding an extra G to all the commands is necessary to access the full 64 bit 1196 Adding an extra G to all the commands is necessary to access the full 64 bit
1198 content in VM on z/Architecture obviously this isn't required for access registers 1197 content in VM on z/Architecture obviously this isn't required for access registers
1199 as these are still 32 bit. 1198 as these are still 32 bit.
1200 e.g. DGG instead of DG 1199 e.g. DGG instead of DG
1201 D X will display all the control registers 1200 D X will display all the control registers
1202 D AR will display all the access registers 1201 D AR will display all the access registers
1203 D AR4-7 will display access registers 4 to 7 1202 D AR4-7 will display access registers 4 to 7
1204 CPU ALL D G will display the GPRs of all CPUs in the configuration 1203 CPU ALL D G will display the GPRs of all CPUs in the configuration
1205 D PSW will display the current PSW 1204 D PSW will display the current PSW
1206 st PSW 2000 will put the value 2000 into the PSW & 1205 st PSW 2000 will put the value 2000 into the PSW &
1207 crash your machine. 1206 crash your machine.
1208 D PREFIX displays the prefix offset 1207 D PREFIX displays the prefix offset
1209 1208
1210 1209
1211 Displaying Memory 1210 Displaying Memory
1212 ----------------- 1211 -----------------
1213 To display memory mapped using the current PSW's mapping try 1212 To display memory mapped using the current PSW's mapping try
1214 D <range> 1213 D <range>
1215 To make VM display a message each time it hits a particular address & continue try 1214 To make VM display a message each time it hits a particular address & continue try
1216 D I<range> will disassemble/display a range of instructions. 1215 D I<range> will disassemble/display a range of instructions.
1217 ST addr 32 bit word will store a 32 bit aligned address 1216 ST addr 32 bit word will store a 32 bit aligned address
1218 D T<range> will display the EBCDIC in an address ( if you are that way inclined ) 1217 D T<range> will display the EBCDIC in an address ( if you are that way inclined )
1219 D R<range> will display real addresses ( without DAT ) but with prefixing. 1218 D R<range> will display real addresses ( without DAT ) but with prefixing.
1220 There are other complex options to display if you need to get at say home space 1219 There are other complex options to display if you need to get at say home space
1221 but are in primary space the easiest thing to do is to temporarily 1220 but are in primary space the easiest thing to do is to temporarily
1222 modify the PSW to the other addressing mode, display the stuff & then 1221 modify the PSW to the other addressing mode, display the stuff & then
1223 restore it. 1222 restore it.
1224 1223
1225 1224
1226 1225
1227 Hints 1226 Hints
1228 ----- 1227 -----
1229 If you want to issue a debugger command without halting your virtual machine with the 1228 If you want to issue a debugger command without halting your virtual machine with the
1230 PA1 key try prefixing the command with #CP e.g. 1229 PA1 key try prefixing the command with #CP e.g.
1231 #cp tr i pswa 2000 1230 #cp tr i pswa 2000
1232 also suffixing most debugger commands with RUN will cause them not 1231 also suffixing most debugger commands with RUN will cause them not
1233 to stop just display the mnemonic at the current instruction on the console. 1232 to stop just display the mnemonic at the current instruction on the console.
1234 If you have several breakpoints you want to put into your program & 1233 If you have several breakpoints you want to put into your program &
1235 you get fed up of cross referencing with System.map 1234 you get fed up of cross referencing with System.map
1236 you can do the following trick for several symbols. 1235 you can do the following trick for several symbols.
1237 grep do_signal System.map 1236 grep do_signal System.map
1238 which emits the following among other things 1237 which emits the following among other things
1239 0001f4e0 T do_signal 1238 0001f4e0 T do_signal
1240 now you can do 1239 now you can do
1241 1240
1242 TR I PSWA 0001f4e0 cmd msg * do_signal 1241 TR I PSWA 0001f4e0 cmd msg * do_signal
1243 This sends a message to your own console each time do_signal is entered. 1242 This sends a message to your own console each time do_signal is entered.
1244 ( As an aside I wrote a perl script once which automatically generated a REXX 1243 ( As an aside I wrote a perl script once which automatically generated a REXX
1245 script with breakpoints on every kernel procedure, this isn't a good idea 1244 script with breakpoints on every kernel procedure, this isn't a good idea
1246 because there are thousands of these routines & VM can only set 255 breakpoints 1245 because there are thousands of these routines & VM can only set 255 breakpoints
1247 at a time so you nearly had to spend as long pruning the file down as you would 1246 at a time so you nearly had to spend as long pruning the file down as you would
1248 entering the msg's by hand ),however, the trick might be useful for a single object file. 1247 entering the msg's by hand ),however, the trick might be useful for a single object file.
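
For a handful of symbols the System.map trick can be sketched in Python. This is an illustration rather than the perl script mentioned above: the function name trace_commands is made up, & it assumes System.map lines of the form "address type symbol" as in the grep output shown.

```python
MAX_TRACES = 255  # limit of traces per set, as noted in the text


def trace_commands(system_map_lines, wanted):
    """Build TR I PSWA commands for the symbols named in wanted."""
    commands = []
    for line in system_map_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not "address type symbol"
        address, _kind, symbol = parts
        if symbol in wanted:
            commands.append("TR I PSWA %s cmd msg * %s" % (address, symbol))
    if len(commands) > MAX_TRACES:
        raise ValueError("more than %d traces requested" % MAX_TRACES)
    return commands
```

Feeding it the line 0001f4e0 T do_signal with wanted set to {"do_signal"} yields the same command as typed by hand above.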
1249 On linux'es 3270 emulator x3270 there is a very useful option under the file menu 1248 On linux'es 3270 emulator x3270 there is a very useful option under the file menu
1250 Save Screens In File this is very good for keeping a copy of traces. 1249 Save Screens In File this is very good for keeping a copy of traces.
1251 1250
1252 From CMS help <command name> will give you online help on a particular command. 1251 From CMS help <command name> will give you online help on a particular command.
1253 e.g. 1252 e.g.
1254 HELP DISPLAY 1253 HELP DISPLAY
1255 1254
1256 Also CP has a file called profile.exec which automatically gets called 1255 Also CP has a file called profile.exec which automatically gets called
1257 on startup of CMS ( like autoexec.bat ), keeping on a DOS analogy session 1256 on startup of CMS ( like autoexec.bat ), keeping on a DOS analogy session
1258 CP has a feature similar to doskey, it may be useful for you to 1257 CP has a feature similar to doskey, it may be useful for you to
1259 use profile.exec to define some keystrokes. 1258 use profile.exec to define some keystrokes.
1260 e.g. 1259 e.g.
1261 SET PF9 IMM B 1260 SET PF9 IMM B
1262 This does a single step in VM on pressing F9. 1261 This does a single step in VM on pressing F9.
1263 SET PF10 ^ 1262 SET PF10 ^
1264 This sets up the ^ key. 1263 This sets up the ^ key.
1265 which can be used for ^c (ctrl-c),^z (ctrl-z) which can't be typed directly into some 3270 consoles. 1264 which can be used for ^c (ctrl-c),^z (ctrl-z) which can't be typed directly into some 3270 consoles.
1266 SET PF11 ^- 1265 SET PF11 ^-
1267 This types the starting keystrokes for a sysrq see SysRq below. 1266 This types the starting keystrokes for a sysrq see SysRq below.
1268 SET PF12 RETRIEVE 1267 SET PF12 RETRIEVE
1269 This retrieves command history on pressing F12. 1268 This retrieves command history on pressing F12.
1270 1269
1271 1270
1272 Sometimes in VM the display is set up to scroll automatically this 1271 Sometimes in VM the display is set up to scroll automatically this
1273 can be very annoying if there are messages you wish to look at 1272 can be very annoying if there are messages you wish to look at
1274 to stop this do 1273 to stop this do
1275 TERM MORE 255 255 1274 TERM MORE 255 255
1276 This will nearly stop automatic screen updates, however it will 1275 This will nearly stop automatic screen updates, however it will
1277 cause a denial of service if lots of messages go to the 3270 console, 1276 cause a denial of service if lots of messages go to the 3270 console,
1278 so it would be foolish to use this as the default on a production machine. 1277 so it would be foolish to use this as the default on a production machine.
1279 1278
1280 1279
1281 Tracing particular processes 1280 Tracing particular processes
1282 ---------------------------- 1281 ----------------------------
1283 The kernel's text segment is intentionally at an address in memory that it will 1282 The kernel's text segment is intentionally at an address in memory that it will
1284 very seldom collide with text segments of user programs ( thanks Martin ), 1283 very seldom collide with text segments of user programs ( thanks Martin ),
1285 this simplifies debugging the kernel. 1284 this simplifies debugging the kernel.
1286 However it is quite common for user processes to have addresses which collide 1285 However it is quite common for user processes to have addresses which collide
1287 this can make debugging a particular process under VM painful under normal 1286 this can make debugging a particular process under VM painful under normal
1288 circumstances as the process may change when doing a 1287 circumstances as the process may change when doing a
1289 TR I R <address range>. 1288 TR I R <address range>.
1290 Thankfully after reading VM's online help I figured out how to debug 1289 Thankfully after reading VM's online help I figured out how to debug
1291 a particular process. 1290 a particular process.
1292 1291
1293 Your first problem is to find the STD ( segment table designation ) 1292 Your first problem is to find the STD ( segment table designation )
1294 of the program you wish to debug. 1293 of the program you wish to debug.
1295 There are several ways you can do this here are a few 1294 There are several ways you can do this here are a few
1296 1) objdump --syms <program to be debugged> | grep main 1295 1) objdump --syms <program to be debugged> | grep main
1297 To get the address of main in the program. 1296 To get the address of main in the program.
1298 tr i pswa <address of main> 1297 tr i pswa <address of main>
1299 Start the program, if VM drops to CP on what looks like the entry 1298 Start the program, if VM drops to CP on what looks like the entry
1300 point of the main function this is most likely the process you wish to debug. 1299 point of the main function this is most likely the process you wish to debug.
1301 Now do a D X13 or D XG13 on z/Architecture. 1300 Now do a D X13 or D XG13 on z/Architecture.
1302 On 31 bit the STD is bits 1-19 ( the STO segment table origin ) 1301 On 31 bit the STD is bits 1-19 ( the STO segment table origin )
1303 & 25-31 ( the STL segment table length ) of CR13. 1302 & 25-31 ( the STL segment table length ) of CR13.
1304 now type 1303 now type
1305 TR I R STD <CR13's value> 0.7fffffff 1304 TR I R STD <CR13's value> 0.7fffffff
1306 e.g. 1305 e.g.
1307 TR I R STD 8F32E1FF 0.7fffffff 1306 TR I R STD 8F32E1FF 0.7fffffff
1308 Another very useful variation is 1307 Another very useful variation is
1309 TR STORE INTO STD <CR13's value> <address range> 1308 TR STORE INTO STD <CR13's value> <address range>
1310 for finding out when a particular variable changes. 1309 for finding out when a particular variable changes.
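
To see what is inside a 31 bit CR13 value, the two fields named above can be picked apart in Python. A sketch only: the function name is made up, & the masks assume the usual 390 bit numbering where bit 0 is the most significant bit of the 32 bit register.

```python
STO_MASK = 0x7FFFF000  # bits 1-19, segment table origin
STL_MASK = 0x0000007F  # bits 25-31, segment table length


def decode_cr13_31bit(value):
    """Split a 31 bit CR13 value into (segment table origin, table length)."""
    return value & STO_MASK, value & STL_MASK
```

For the example value 8F32E1FF this gives an origin of 0F32E000 & a length field of 7F.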
1311 1310
1312 An alternative way of finding the STD of a currently running process 1311 An alternative way of finding the STD of a currently running process
1313 is to do the following, ( this method is more complex but 1312 is to do the following, ( this method is more complex but
1314 could be quite convenient if you aren't updating the kernel much & 1313 could be quite convenient if you aren't updating the kernel much &
1315 so your kernel structures will stay constant for a reasonable period of 1314 so your kernel structures will stay constant for a reasonable period of
1316 time ). 1315 time ).
1317 1316
1318 grep task /proc/<pid>/status 1317 grep task /proc/<pid>/status
1319 from this you should see something like 1318 from this you should see something like
1320 task: 0f160000 ksp: 0f161de8 pt_regs: 0f161f68 1319 task: 0f160000 ksp: 0f161de8 pt_regs: 0f161f68
1321 This now gives you a pointer to the task structure. 1320 This now gives you a pointer to the task structure.
1322 Now make CC:="s390-gcc -g" kernel/sched.s 1321 Now make CC:="s390-gcc -g" kernel/sched.s
1323 To get the task_struct stabinfo. 1322 To get the task_struct stabinfo.
1324 ( task_struct is defined in include/linux/sched.h ). 1323 ( task_struct is defined in include/linux/sched.h ).
1325 Now we want to look at 1324 Now we want to look at
1326 task->active_mm->pgd 1325 task->active_mm->pgd
1327 on my machine the active_mm in the task structure stab is 1326 on my machine the active_mm in the task structure stab is
1328 active_mm:(4,12),672,32 1327 active_mm:(4,12),672,32
1329 its offset is 672/8=84=0x54 1328 its offset is 672/8=84=0x54
1330 the pgd member in the mm_struct stab is 1329 the pgd member in the mm_struct stab is
1331 pgd:(4,6)=*(29,5),96,32 1330 pgd:(4,6)=*(29,5),96,32
1332 so its offset is 96/8=12=0xc 1331 so its offset is 96/8=12=0xc
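
The offset arithmetic can be checked with a few lines of Python. This is a sketch under the assumption that the stab entries look like the two shown above, i.e. they end in ",<bit offset>,<bit size>"; the function name stab_byte_offset is invented for this example.

```python
def stab_byte_offset(stab_entry):
    """Return the byte offset of a struct member from its stab entry.

    The last two comma separated fields are the bit offset and the bit
    size, e.g. "active_mm:(4,12),672,32" has a bit offset of 672.
    """
    _prefix, bit_offset, _bit_size = stab_entry.rsplit(",", 2)
    return int(bit_offset) // 8


# Values taken from the example in the text ( they will differ per kernel ).
task = 0x0F160000
active_mm_address = task + stab_byte_offset("active_mm:(4,12),672,32")
```

Here active_mm_address comes out as 0xf160054, the address of the active_mm member inside the task structure on the machine used for this example.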
1333 1332
So we'll do
hexdump -s 0xf160054 /dev/mem | more
i.e. task_struct+active_mm offset,
to look at the active_mm member. We get something like
f160054 0fee cc60 0019 e334 0000 0000 0000 0011
Then
hexdump -s 0x0feecc6c /dev/mem | more
i.e. active_mm+pgd offset, gives
feecc6c 0f2c 0000 0000 0001 0000 0001 0000 0010
Now do
TR I R STD <pgd|0x7f> 0.7fffffff
The 0x7f is ORed in because the pgd only
gives the page table origin & we need to set the low bits
to the maximum possible segment table length.
TR I R STD 0f2c007f 0.7fffffff
On z/Architecture you'll probably need to do
TR I R STD <pgd|0x7> 0.ffffffffffffffff
to set the TableType to 0x1 & the Table length to 3.
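
As a rough sketch, the STD values used in the TR commands above are just the pgd with the low bits ORed in. The function and macro names below are invented for this example; only the 0x7f and 0x7 constants come from the text.

```c
#include <stdint.h>

/* Low bits for the 31 bit case: maximum possible segment table length. */
#define STD_MAX_SEG_TABLE_LEN	0x7fu
/* Low bits for z/Architecture: TableType 0x1 & Table length 3. */
#define STD_ZARCH_TYPE_AND_LEN	0x7u

/* Value to pass to TR I R STD on a 31 bit kernel. */
uint32_t std_from_pgd_31bit(uint32_t pgd)
{
	return pgd | STD_MAX_SEG_TABLE_LEN;
}

/* Value to pass to TR I R STD on z/Architecture. */
uint64_t std_from_pgd_64bit(uint64_t pgd)
{
	return pgd | STD_ZARCH_TYPE_AND_LEN;
}
```

For the pgd read out of /dev/mem above ( 0f2c0000 ), std_from_pgd_31bit() gives the 0f2c007f used in the example command.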
1352 1351
1353 1352
1354 1353
Tracing Program Exceptions
--------------------------
If you get a crash which says something like
illegal operation or specification exception followed by a register dump,
you can restart linux & trace these using the tr prog <range or value> trace option.

The most common ones you will normally be tracing for are
1=operation exception
2=privileged operation exception
4=protection exception
5=addressing exception
6=specification exception
10=segment translation exception
11=page translation exception
( these are the program interruption codes, given in hex ).

The full list of these is on page 22 of the current s/390 Reference Summary.
e.g.
tr prog 10 will trace segment translation exceptions.
tr prog on its own will trace all program interruption codes.
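
If you are decoding program interruption codes out of a dump by hand, a small lookup table along the following lines may help. It is only a sketch covering the codes listed above; the names are invented for this example, and it assumes the 10 and 11 in the list are hexadecimal ( 0x10 and 0x11 ), as program interruption codes are normally written.

```c
#include <stddef.h>

struct prog_code {
	unsigned int code;	/* program interruption code */
	const char *name;
};

/* Only the common codes listed in the text, not the full table. */
static const struct prog_code prog_codes[] = {
	{ 0x01, "operation exception" },
	{ 0x02, "privileged operation exception" },
	{ 0x04, "protection exception" },
	{ 0x05, "addressing exception" },
	{ 0x06, "specification exception" },
	{ 0x10, "segment translation exception" },
	{ 0x11, "page translation exception" },
};

/* Returns the name for a code, or NULL if it is not in the short list. */
const char *prog_exception_name(unsigned int code)
{
	size_t i;

	for (i = 0; i < sizeof(prog_codes) / sizeof(prog_codes[0]); i++)
		if (prog_codes[i].code == code)
			return prog_codes[i].name;
	return NULL;
}
```

Anything that returns NULL here needs looking up in the full list in the Reference Summary.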
1376 1375
Trace Sets
----------
On starting VM you are initially in the INITIAL trace set.
You can do a Q TR to verify this.
Trace sets help in a complex tracing situation, for instance where you wish
to wait till a driver is open before you start tracing IO, but know in your
heart that you are going to have to make several runs through the code till you
have a clue what's going on.

What you can do is
TR I PSWA <Driver open address>
hit b to continue till you reach the breakpoint,
now do your
TR GOTO B
TR IO 7c08-7c09 inst int run
or whatever the IO channels you wish to trace are & hit b.

To get back to the initial trace set do
TR GOTO INITIAL
& the TR I PSWA <Driver open address> will be the only active breakpoint again.
1398 1397
1399 1398
Tracing linux syscalls under VM
-------------------------------
Syscalls are implemented on Linux for S390 by the Supervisor call instruction (SVC).
There are 256 possibilities of these, as the instruction is made up of a 0x0A opcode
followed by a second byte holding the syscall number. They are traced using the simple command
TR SVC <Optional value or range>
The syscalls are defined in linux/include/asm-s390/unistd.h
e.g. to trace all file opens just do
TR SVC 5 ( as this is the syscall number of open )
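
The instruction layout described above ( opcode byte 0x0A followed by the syscall number ) can be decoded as in the sketch below. The function name is made up for this example.

```c
#include <stdint.h>

#define SVC_OPCODE	0x0Au	/* first byte of an SVC instruction */

/*
 * Takes the two bytes of an instruction as a 16 bit value, e.g. 0x0A05.
 * Returns the syscall number (0-255) if it is an SVC, or -1 if it isn't.
 */
int svc_number(uint16_t insn)
{
	if ((insn >> 8) != SVC_OPCODE)
		return -1;
	return insn & 0xff;
}
```

So an 0A05 seen in a trace decodes to syscall 5, which is open.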
1409 1408
1410 1409
SMP Specific commands
---------------------
To find out how many cpus you have
Q CPUS displays all the CPU's available to your virtual machine
To find the cpu that VM debugger commands are currently being directed at do
Q CPU
To change the cpu that VM debugger commands are being directed at do
CPU <desired cpu no>
1418 1417
On an SMP guest, to issue a command to all CPUs try prefixing the command with cpu all.
To issue a command to a particular cpu try cpu <cpu number> e.g.
CPU 01 TR I R 2000.3000
If you are running on a guest with several cpus & you have an IO related problem
& cannot follow the flow of code, but you know it isn't smp related,
from the bash prompt issue
shutdown -h now or halt.
Do a Q CPUS to find out how many cpus you have,
detach each one of them from cp except cpu 0
by issuing a
DETACH CPU 01-(number of cpus in configuration)
& boot linux again.
TR SIGP will trace inter processor signal processor instructions.
DEFINE CPU 01-(number in configuration)
will get your guest's cpus back.
1434 1433
1435 1434
Help for displaying ascii textstrings
-------------------------------------
On the very latest VM nucleuses, VM can now display ascii
( thanks Neale for the hint ) by doing
D TX<lowaddr>.<len>
e.g.
D TX0.100

Alternatively
=============
Under older VM debuggers ( I love EBCDIC too ) you can use this little program I wrote, which
converts a command line of hex digits to ascii text. It can be compiled under linux, &
you can copy the hex digits from your x3270 terminal to your xterm if you are debugging
from a linuxbox.

This is quite useful when looking at a parameter passed in as a text string
under VM ( unless you are good at decoding ASCII in your head ).
e.g. consider tracing an open syscall
TR SVC 5
We have stopped at a breakpoint
000151B0' SVC 0A05 -> 0001909A' CC 0

D 20.8 to check the SVC old psw in the prefix area & see whether it was from userspace
( for the layout of the prefix area consult P18 of the s/390 Reference Summary
if you have it available ).
V00000020 070C2000 800151B2
The problem state bit wasn't set & it's also too early in the boot sequence
for it to be a userspace SVC. If it was, we would have to temporarily switch the
psw to user space addressing so we could get at the first parameter of the open in
gpr2.
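
Checking the old psw by eye gets tedious, so here is a small sketch of the two tests made above. It assumes the 31 bit PSW layout, where the problem state bit is bit 15 of the first word ( mask 0x00010000 ) and the top bit of the second word is the addressing mode rather than part of the address. The names are made up for this example.

```c
#include <stdint.h>

#define PSW_PROBLEM_STATE	0x00010000u	/* bit 15 of the first PSW word */
#define PSW_ADDR_MASK		0x7fffffffu	/* strips the addressing mode bit */

/* Non zero if the PSW was in problem state, i.e. userspace. */
int psw_is_problem_state(uint32_t psw_word0)
{
	return (psw_word0 & PSW_PROBLEM_STATE) != 0;
}

/* Instruction address held in the second PSW word. */
uint32_t psw_address(uint32_t psw_word1)
{
	return psw_word1 & PSW_ADDR_MASK;
}
```

For the old psw shown above, 070C2000 has the problem state bit clear, & 800151B2 gives address 000151B2, which is the instruction after the 2 byte SVC at 000151B0.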
Next do a
D G2
GPR 2 = 00014CB4
Now display what gpr2 is pointing to
D 00014CB4.20
V00014CB4 2F646576 2F636F6E 736F6C65 00001BF5
V00014CC4 FC00014C B4001001 E0001000 B8070707
Now copy the text up to the first 00 hex ( which is the end of the string )
to an xterm & do hex2ascii on it.
hex2ascii 2F646576 2F636F6E 736F6C65 00
outputs
Decoded Hex:=/ d e v / c o n s o l e 0x00
We were opening the console device.
1480 1479
You can compile the code below yourself for practice :-)
/*
 * hex2ascii.c
 * a useful little tool for converting a hexadecimal command line to ascii
 *
 * Author(s): Denis Joseph Barrow (djbarrow@de.ibm.com,barrow_dj@yahoo.com)
 * (C) 2000 IBM Deutschland Entwicklung GmbH, IBM Corporation.
 */
#include <stdio.h>
#include <string.h>	/* strcmp() & strlen() */

int main(int argc,char *argv[])
{
	int cnt1,cnt2,len,toggle=0;
	int startcnt=1;
	unsigned char c,hex=0;

	/* -a gives compact output: no spaces & '.' for unprintable bytes */
	if(argc>1&&(strcmp(argv[1],"-a")==0))
		startcnt=2;
	printf("Decoded Hex:=");
	for(cnt1=startcnt;cnt1<argc;cnt1++)
	{
		len=(int)strlen(argv[cnt1]);
		for(cnt2=0;cnt2<len;cnt2++)
		{
			/* convert one hex digit to its 4 bit value */
			c=argv[cnt1][cnt2];
			if(c>='0'&&c<='9')
				c=c-'0';
			else if(c>='A'&&c<='F')
				c=c-'A'+10;
			else if(c>='a'&&c<='f')
				c=c-'a'+10;
			switch(toggle)
			{
			case 0:
				/* first digit of a pair is the high nibble */
				hex=c<<4;
				toggle=1;
				break;
			case 1:
				/* second digit completes the byte */
				hex+=c;
				/* 127 (DEL) isn't printable either */
				if(hex<32||hex>126)
				{
					if(startcnt==1)
						printf("0x%02X ",(int)hex);
					else
						printf(".");
				}
				else
				{
					printf("%c",hex);
					if(startcnt==1)
						printf(" ");
				}
				toggle=0;
				break;
			}
		}
	}
	printf("\n");
	return 0;
}
1540 1539
1541 1540
1542 1541
1543 1542
Stack tracing under VM
----------------------
A basic backtrace
-----------------

Here are the tricks I use; 9 out of 10 times they work pretty well.

When your backchain reaches a dead end
--------------------------------------
This can happen when an exception happens in the kernel & the kernel is entered twice.
If you reach the NULL pointer at the end of the back chain you should be
able to sniff further back if you follow the following tricks.
1) A kernel address should be easy to recognise since it is in
primary space & the problem state bit isn't set & also
the Hi bit of the address is set.
2) Another backchain should also be easy to recognise since it is an
address pointing to another address approximately 100 bytes ( 0x60 to 0x70 hex )
behind the current stackpointer.
1562 1561
1563 1562
Here is some practice.
Boot the kernel & hit PA1 at some random time,
then d g to display the gprs; this should display something like
GPR 0 = 00000001 00156018 0014359C 00000000
GPR 4 = 00000001 001B8888 000003E0 00000000
GPR 8 = 00100080 00100084 00000000 000FE000
GPR 12 = 00010400 8001B2DC 8001B36A 000FFED8
Note that GPR14 is a return address but as we are real men we are going to
trace the stack.
Display 0x40 bytes after the stack pointer ( GPR15, here 000FFED8 ).
d 000FFED8.40

V000FFED8 000FFF38 8001B838 80014C8E 000FFF38
V000FFEE8 00000000 00000000 000003E0 00000000
V000FFEF8 00100080 00100084 00000000 000FE000
V000FFF08 00010400 8001B2DC 8001B36A 000FFED8


Ah, now look at what's in sp+56 (sp+0x38): this is 8001B36A, our saved r14 if
you look above at our stackframe, & it also agrees with GPR14.

Now backchain
d 000FFF38.40
We are now taking the contents of SP to get our first backchain.

V000FFF38 000FFFA0 00000000 00014995 00147094
V000FFF48 00147090 001470A0 000003E0 00000000
V000FFF58 00100080 00100084 00000000 001BF1D0
V000FFF68 00010400 800149BA 80014CA6 000FFF38

This displays a 2nd return address of 80014CA6

Now do d 000FFFA0.40 for our 3rd backchain

V000FFFA0 04B52002 0001107F 00000000 00000000
V000FFFB0 00000000 00000000 FF000000 0001107F
V000FFFC0 00000000 00000000 00000000 00000000
V000FFFD0 00010400 80010802 8001085A 000FFFA0


Our 3rd return address is 8001085A

As the 04B52002 looks suspiciously like rubbish it is fair to assume that the kernel entry routines
for the sake of optimisation don't set up a backchain.
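
The walk just done by hand can be sketched in C. This is a toy: it runs over a small table holding only the words the walk needs, copied from the dumps above, rather than over real memory. All names are invented for this example, & the only layout it assumes is what the text shows ( backchain at sp+0, saved r14 at sp+0x38 ).

```c
#include <stddef.h>
#include <stdint.h>

#define SAVED_R14_OFFSET	0x38u	/* sp+56 holds the saved r14 */

struct mem_word {
	uint32_t addr;
	uint32_t value;
};

/* The backchain & saved r14 words from the three dumps above. */
static const struct mem_word dump[] = {
	{ 0x000FFED8u, 0x000FFF38u },	/* frame 1 backchain */
	{ 0x000FFF10u, 0x8001B36Au },	/* frame 1 saved r14 */
	{ 0x000FFF38u, 0x000FFFA0u },	/* frame 2 backchain */
	{ 0x000FFF70u, 0x80014CA6u },	/* frame 2 saved r14 */
	{ 0x000FFFA0u, 0x04B52002u },	/* frame 3 backchain ( rubbish ) */
	{ 0x000FFFD8u, 0x8001085Au },	/* frame 3 saved r14 */
};

/* Returns 0 & fills *value if addr is in the table, -1 otherwise. */
static int read_word(uint32_t addr, uint32_t *value)
{
	size_t i;

	for (i = 0; i < sizeof(dump) / sizeof(dump[0]); i++) {
		if (dump[i].addr == addr) {
			*value = dump[i].value;
			return 0;
		}
	}
	return -1;
}

/*
 * Follows the backchain starting at sp & stores up to max return
 * addresses in ret[]. Stops at a NULL backchain or when it runs off
 * the known memory. Returns the number of return addresses found.
 */
size_t walk_backchain(uint32_t sp, uint32_t *ret, size_t max)
{
	size_t n = 0;
	uint32_t r14, next;

	while (n < max && read_word(sp + SAVED_R14_OFFSET, &r14) == 0) {
		ret[n++] = r14;
		if (read_word(sp, &next) != 0 || next == 0)
			break;
		sp = next;
	}
	return n;
}
```

Starting from sp 000FFED8 this finds the same three return addresses as above, then gives up when the rubbish backchain 04B52002 points outside the table.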
1607 1606
Now look at System.map to see if the addresses make any sense.

grep -i 0001b3 System.map
outputs among other things
0001b304 T cpu_idle
so 8001B36A
is cpu_idle+0x66 ( quiet the cpu is asleep, don't wake it )


grep -i 00014 System.map
produces among other things
00014a78 T start_kernel
so 80014CA6 is start_kernel+0x22e ( a hex number I can't add in my head ).

grep -i 00108 System.map
this produces
00010800 T _stext
so 8001085A is _stext+0x5a

Congrats you've done your first backchain.
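
For hex subtraction you can't do in your head, the symbol+offset arithmetic is easy to sketch. The function name is made up for this example; the only assumption is that the Hi bit of a 31 bit return address is the addressing mode bit & has to be masked off before comparing with System.map.

```c
#include <stdint.h>

/*
 * Offset of a return address from the System.map symbol that precedes it.
 * ret_addr is the value as saved on the stack, Hi bit included.
 */
uint32_t symbol_offset(uint32_t ret_addr, uint32_t sym_addr)
{
	return (ret_addr & 0x7fffffffu) - sym_addr;
}
```

e.g. symbol_offset(0x80014CA6, 0x00014a78) works out the start_kernel offset for the 2nd return address.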
1628 1627
1629 1628
1630 1629
s/390 & z/Architecture IO Overview
==================================

I am not going to give a course in 390 IO architecture as this would take me quite a
while & I'm no expert. Instead I'll give a 390 IO architecture summary for Dummies. If you have
the s/390 principles of operation available, read that instead. If nothing else you may find a few
useful keywords in here & be able to use them on a web search engine like altavista to find
more useful information.

Unlike other bus architectures, modern 390 systems do their IO using mostly
fibre optics, & devices such as tapes & disks can be shared between several mainframes.
Also, S390 can support up to 65536 devices while a high end PC based system might be choking
with around 64. Here is some of the common IO terminology.
1644 1643
Subchannel:
This is the logical number most IO commands use to talk to an IO device. There can be up to
0x10000 (65536) of these in a configuration; typically there are a few hundred. Under VM
for simplicity they are allocated contiguously; on the native hardware they are not, but
they typically stay consistent between boots provided no new hardware is inserted or removed.
Under Linux for 390 we use these as IRQ's, & also when issuing an IO command (CLEAR SUBCHANNEL,
HALT SUBCHANNEL, MODIFY SUBCHANNEL, RESUME SUBCHANNEL, START SUBCHANNEL, STORE SUBCHANNEL &
TEST SUBCHANNEL ) we use this as the ID of the device we wish to talk to. The most
important of these instructions are START SUBCHANNEL ( to start IO ), TEST SUBCHANNEL ( to check
whether the IO completed successfully ) & HALT SUBCHANNEL ( to kill IO ). A subchannel
can have up to 8 channel paths to a device; this offers redundancy if one is not available.
1656 1655
1657 1656
Device Number:
This number remains static & is closely tied to the hardware. There are 65536 of these;
they are made up of a CHPID ( Channel Path ID, the most significant 8 bits )
& another lsb 8 bits. These remain static even if more devices are inserted or removed
from the hardware. There is a 1 to 1 mapping between Subchannels & Device Numbers provided
devices aren't inserted or removed.
1664 1663
Channel Control Words:
CCWS are linked lists of instructions initially pointed to by an operation request block (ORB),
which is initially given to the Start Subchannel (SSCH) command along with the subchannel number
for the IO subsystem to process while the CPU continues executing normal code.
These come in two flavours, Format 0 ( 24 bit, for backward compatibility )
& Format 1 ( 31 bit ). These are typically used to issue read & write
( & many other instructions ); they consist of a length field & an absolute address field.
For each IO you typically get 1 or 2 interrupts, one for channel end ( primary status ) when the
channel is idle & the second for device end ( secondary status ); sometimes you get both
concurrently. You check how the IO went by issuing a TEST SUBCHANNEL at each interrupt,
from which you receive an Interruption response block (IRB). If you get channel & device end
status in the IRB without channel checks etc. your IO probably went okay. If you didn't you
probably need a doctor to examine the IRB & extended status word etc.
If an error occurs, more sophisticated control units have a facility known as
concurrent sense. This means that if an error occurs, Extended sense information will
be presented in the Extended status word in the IRB; if not, you have to issue a
subsequent SENSE CCW command after the test subchannel.
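
The "did my IO probably go okay" check described above might look something like the sketch below. Treat it as an illustration only: the function name is made up, the device status bit values are assumptions based on my reading of the s390 cio definitions in the Linux sources, & a real driver looks at a good deal more of the IRB than this.

```c
#include <stdint.h>

/* Device status bits ( assumed values, check them against your headers ). */
#define DEV_STAT_CHN_END	0x08u	/* channel end, primary status */
#define DEV_STAT_DEV_END	0x04u	/* device end, secondary status */
#define DEV_STAT_UNIT_CHECK	0x02u	/* error, sense data is wanted */

/*
 * dstat is the device status ORed together over the interrupts seen for
 * this IO ( channel end & device end may arrive separately ), cstat is
 * the channel status, where any bit set counts as a channel check here.
 * Returns 1 if the IO probably went okay, 0 if somebody should look at
 * the IRB more closely.
 */
int io_probably_ok(uint8_t dstat, uint8_t cstat)
{
	const uint8_t both_ends = DEV_STAT_CHN_END | DEV_STAT_DEV_END;

	if (cstat != 0)
		return 0;
	if (dstat & DEV_STAT_UNIT_CHECK)
		return 0;
	return (dstat & both_ends) == both_ends;
}
```

A unit check without concurrent sense is the case where you would follow up with a SENSE CCW, as described above.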
1682 1681
1683 1682
TPI ( Test pending interrupt ) can also be used for polled IO, but in multitasking multiprocessor
systems it isn't recommended except for checking special cases ( i.e. non looping checks for
pending IO etc. ).
1687 1686
1688 Store Subchannel & Modify Subchannel can be used to examine & modify operating characteristics 1687 Store Subchannel & Modify Subchannel can be used to examine & modify operating characteristics
1689 of a subchannel ( e.g. channel paths ). 1688 of a subchannel ( e.g. channel paths ).
1690 1689
1691 Other IO related Terms: 1690 Other IO related Terms:
1692 Sysplex: S390's Clustering Technology 1691 Sysplex: S390's Clustering Technology
1693 QDIO: S390's new high speed IO architecture to support devices such as gigabit ethernet, 1692 QDIO: S390's new high speed IO architecture to support devices such as gigabit ethernet,
1694 this architecture is also designed to be forward compatible with up & coming 64 bit machines. 1693 this architecture is also designed to be forward compatible with up & coming 64 bit machines.
1695 1694
1696 1695
General Concepts

Input Output Processors (IOP's) are responsible for communicating between
the mainframe CPU's & the channel & relieve the mainframe CPU's from the
burden of communicating with IO devices directly; this allows the CPU's to
concentrate on data processing.

IOP's can use one or more links ( known as channel paths ) to talk to each
IO device. An IOP first checks for path availability & chooses an available one,
then starts ( & sometimes terminates ) IO.
There are two types of channel path: ESCON & the Parallel IO interface.

IO devices are attached to control units. Control units provide the
logic to interface the channel paths & channel path IO protocols to
the IO devices; they can be integrated with the devices or housed separately
& often talk to several similar devices ( typical examples would be raid
controllers or a control unit which connects to 1000 3270 terminals ).
1714 1713
1715 1714
  +---------------------------------------------------------------+
  | +-----+ +-----+ +-----+ +-----+  +----------+  +----------+   |
  | | CPU | | CPU | | CPU | | CPU |  |   Main   |  | Expanded |   |
  | |     | |     | |     | |     |  |  Memory  |  | Storage  |   |
  | +-----+ +-----+ +-----+ +-----+  +----------+  +----------+   |
  |---------------------------------------------------------------+
  |         IOP         |         IOP         |        IOP        |
  |---------------------------------------------------------------+
  | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C | C |
  +---------------------------------------------------------------+
         ||                                             ||
         ||  Bus & Tag Channel Path                     || ESCON
         ||  ======================                     || Channel
         ||  ||                  ||                     || Path
  +----------+               +----------+          +----------+
  |          |               |          |          |          |
  |    CU    |               |    CU    |          |    CU    |
  |          |               |          |          |          |
  +----------+               +----------+          +----------+
     |       |                    |                  |      |
+----------+ +----------+    +----------+   +----------+ +----------+
|I/O Device| |I/O Device|    |I/O Device|   |I/O Device| |I/O Device|
+----------+ +----------+    +----------+   +----------+ +----------+
  CPU = Central Processing Unit
  C = Channel
  IOP = IO Processor
  CU = Control Unit

The 390 IO systems come in 2 flavours; the current 390 machines support both.

The older 360 & 370 interface, sometimes called the Parallel I/O interface,
sometimes called Bus-and-Tag & sometimes Original Equipment Manufacturers
Interface (OEMI).

This byte wide parallel channel path/bus has parity & data on the "Bus" cable
& control lines on the "Tag" cable. These can operate in byte multiplex mode for
sharing between several slow devices, or in burst mode where one device monopolizes
the channel for the whole burst. Up to 256 devices can be addressed on one of these
cables. These cables are about one inch in diameter. The maximum unextended length
supported by these cables is 125 meters but this can be extended up to 2km with a
fibre optic channel extender such as a 3044. The maximum burst speed supported is
4.5 megabytes per second, however some really old processors support only transfer
rates of 3.0, 2.0 & 1.0 MB/sec.
One of these paths can be daisy chained to up to 8 control units.


ESCON ( fibre optic; its later successor is called FICON )
Was introduced by IBM in 1990. Has 2 fibre optic cables & uses either LEDs or lasers
for communication at a signaling rate of up to 200 megabits/sec. As 10 bits are transferred
for every 8 bits of info this drops to 160 megabits/sec & to 18.6 megabytes/sec once
control info & CRC are added. ESCON only operates in burst mode.
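
The arithmetic behind those figures can be checked with a few lines of Python.
This is only a sketch: the function names are made up for illustration, and the
overhead fraction is simply derived from the 18.6 megabytes/sec quoted above, not
taken from any ESCON specification.

```python
def payload_rate_mbit(signalling_mbit, data_bits=8, coded_bits=10):
    """Usable bit rate when coded_bits are sent for every data_bits of info."""
    return signalling_mbit * data_bits / coded_bits


def mbit_to_mbyte(mbit):
    """Convert megabits/sec to megabytes/sec (8 bits per byte)."""
    return mbit / 8


def overhead_fraction(raw_mbyte, effective_mbyte):
    """Share of the raw byte rate eaten by control info & CRC."""
    return 1 - effective_mbyte / raw_mbyte


data_rate = payload_rate_mbit(200)        # 200 megabits/sec signalling rate
raw_bytes = mbit_to_mbyte(data_rate)      # before control info & CRC
print(data_rate, "megabits/sec after 8-in-10 coding")
print(raw_bytes, "megabytes/sec before control info & CRC")
print(round(overhead_fraction(raw_bytes, 18.6) * 100, 1), "% protocol overhead")
```

So roughly 7% of the 20 megabytes/sec left after coding goes on control info & CRC.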

ESCON's typical max cable length is 3km for the LED version & 20km for the laser version
known as XDF ( extended distance facility ). This can be further extended by using an
ESCON director which triples the above mentioned ranges. Unlike Bus & Tag, as ESCON is
serial it uses a packet switching architecture; the standard Bus & Tag control protocol
is however present within the packets. Up to 256 devices can be attached to each control
unit that uses one of these interfaces.

Common 390 Devices include:
Network adapters typically OSA2, 3172's, 2116's & OSA-E gigabit ethernet adapters,
Consoles 3270 & 3215 ( a teletype emulated under linux for a line mode console ).
DASD's direct access storage devices ( otherwise known as hard disks ).
Tape Drives.
CTC ( Channel to Channel Adapters ),
ESCON or Parallel Cables used as a very high speed serial link
between 2 machines. We use 2 cables under linux to do a bi-directional serial link.


Debugging IO on s/390 & z/Architecture under VM
===============================================

Now we are ready to go on with IO tracing commands under VM

A few self explanatory queries:
Q OSA
Q CTC
Q DISK ( This command is CMS specific )
Q DASD

Q OSA on my machine returns
OSA 7C08 ON OSA 7C08 SUBCHANNEL = 0000
OSA 7C09 ON OSA 7C09 SUBCHANNEL = 0001
OSA 7C14 ON OSA 7C14 SUBCHANNEL = 0002
OSA 7C15 ON OSA 7C15 SUBCHANNEL = 0003
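
If you capture this output into a file, a small script can turn it into a device
number to subchannel table. The following is only a sketch: the regular expression
is guessed from the four lines shown above ( other VM levels may format the
response differently ) & the function name is made up.

```python
import re

# Assumed layout, taken from the sample response above:
#   OSA <devno> ON OSA <devno> SUBCHANNEL = <subchannel>
Q_OSA_LINE = re.compile(
    r"^OSA\s+([0-9A-F]{4})\s+ON\s+OSA\s+([0-9A-F]{4})\s+"
    r"SUBCHANNEL\s*=\s*([0-9A-F]{4})$"
)


def parse_q_osa(text):
    """Map the first device number on each line to its subchannel number."""
    table = {}
    for line in text.splitlines():
        match = Q_OSA_LINE.match(line.strip())
        if match:
            devno, _second_devno, subchannel = match.groups()
            table[devno] = int(subchannel, 16)
    return table


SAMPLE = """\
OSA 7C08 ON OSA 7C08 SUBCHANNEL = 0000
OSA 7C09 ON OSA 7C09 SUBCHANNEL = 0001
OSA 7C14 ON OSA 7C14 SUBCHANNEL = 0002
OSA 7C15 ON OSA 7C15 SUBCHANNEL = 0003
"""

for devno, subchannel in parse_q_osa(SAMPLE).items():
    print(devno, "-> subchannel", subchannel)
```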

If you have a guest with certain privileges you may be able to see devices
which don't belong to you. To avoid this, add the option V.
e.g.
Q V OSA

Now using the device numbers returned by this command we will
trace the IO starting up on the first devices 7c08 & 7c09.
In our simplest case we can trace the
start subchannels
like TR SSCH 7C08-7C09
or the halt subchannels
or TR HSCH 7C08-7C09
MSCH's, STSCH's, I think you can guess the rest

Ingo's favourite trick is tracing all the IO's & CCWs & spooling them into the reader of another
VM guest so he can ftp the logfile back to his own machine. I'll do a small bit of this & give you
a look at the output.

1) Spool stdout to VM reader
SP PRT TO (another vm guest ) or * for the local vm guest
2) Fill the reader with the trace
TR IO 7c08-7c09 INST INT CCW PRT RUN
3) Start up linux
i 00c
4) Finish the trace
TR END
5) close the reader
C PRT
6) list reader contents
RDRLIST
7) copy it to linux4's minidisk
RECEIVE / LOG TXT A1 ( replace
8) filel & press F11 to look at it
You should see something like:

00020942' SSCH B2334000 0048813C CC 0 SCH 0000 DEV 7C08
          CPA 000FFDF0 PARM 00E2C9C4 KEY 0 FPI C0 LPM 80
          CCW 000FFDF0 E4200100 00487FE8 0000 E4240100 ........
          IDAL 43D8AFE8
          IDAL 0FB76000
00020B0A' I/O DEV 7C08 -> 000197BC' SCH 0000 PARM 00E2C9C4
00021628' TSCH B2354000 >> 00488164 CC 0 SCH 0000 DEV 7C08
          CCWA 000FFDF8 DEV STS 0C SCH STS 00 CNT 00EC
          KEY 0 FPI C0 CC 0 CTLS 4007
00022238' STSCH B2344000 >> 00488108 CC 0 SCH 0000 DEV 7C08
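
Once the log is on your own machine it is handy to boil it down to one line per
IO instruction. Below is a rough sketch of how that might be done in Python. The
pattern is an assumption based purely on the sample trace above ( it only knows
the SSCH, TSCH, STSCH, HSCH & MSCH lines & skips everything else ) & the names
are invented for this example.

```python
import re

# Assumed shape of an instruction line in the trace shown above:
#   <address>' <op> ... CC <n> SCH <subchannel> DEV <devno>
TRACE_LINE = re.compile(
    r"^(?P<addr>[0-9A-F]{8})'\s+"
    r"(?P<op>SSCH|TSCH|STSCH|HSCH|MSCH)\s+"
    r".*?CC\s+(?P<cc>\d)\s+"
    r"SCH\s+(?P<sch>[0-9A-F]{4})\s+"
    r"DEV\s+(?P<dev>[0-9A-F]{4})"
)


def parse_trace(text):
    """Return one dict per traced IO instruction, skipping detail lines."""
    events = []
    for line in text.splitlines():
        match = TRACE_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


SAMPLE = """\
00020942' SSCH B2334000 0048813C CC 0 SCH 0000 DEV 7C08
CPA 000FFDF0 PARM 00E2C9C4 KEY 0 FPI C0 LPM 80
00020B0A' I/O DEV 7C08 -> 000197BC' SCH 0000 PARM 00E2C9C4
00021628' TSCH B2354000 >> 00488164 CC 0 SCH 0000 DEV 7C08
00022238' STSCH B2344000 >> 00488108 CC 0 SCH 0000 DEV 7C08
"""

for event in parse_trace(SAMPLE):
    print(event["addr"], event["op"], "dev", event["dev"], "cc", event["cc"])
```

A non zero condition code on an SSCH is usually the first thing worth looking for
in the summary.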

If you don't like messing up your reader ( because you possibly booted from it )
you can alternatively spool it to another guest's reader.


Other common VM device related commands
---------------------------------------
These commands are listed only because they have
been of use to me in the past & may be of use to
you too. For more complete info on each of the commands
type HELP <command> from CMS.
detaching devices
DET <devno range>
ATT <devno range> <guest>
attach a device to guest * for your own guest
READY <devno> cause VM to issue a fake interrupt.

The VARY command is normally only available to VM administrators.
VARY ON PATH <path> TO <devno range>
VARY OFF PATH <PATH> FROM <devno range>
This is used to switch on or off channel paths to devices.

Q CHPID <channel path ID>
This displays state of devices using this channel path
D SCHIB <subchannel>
This displays the subchannel information SCHIB block for the device.
This I believe is also only available to administrators.
DEFINE CTC <devno>
defines a virtual CTC channel to channel connection
2 need to be defined on each guest for the CTC driver to use.
COUPLE devno userid remote devno
Joins a local virtual device to a remote virtual device
( commonly used for the CTC driver ).

Building a VM ramdisk under CMS which linux can use
def vfb-<blocksize> <subchannel> <number blocks>
blocksize is commonly 4096 for linux.
Formatting it
format <subchannel> <drive letter e.g. x> (blksize <blocksize>

Sharing a disk between multiple guests
LINK userid devno1 devno2 mode password



GDB on S390
===========
N.B. if compiling for debugging gdb works better without optimisation
( see Compiling programs for debugging )

invocation
----------
gdb <victim program> <optional corefile>

Online help
-----------
help: gives help on commands
e.g.
help
help display
Note gdb's online help is very good, use it.


Assembly
--------
info registers: displays registers other than floating point.
info all-registers: displays floating points as well.
disassemble: disassembles
e.g.
disassemble without parameters will disassemble the current function
disassemble $pc $pc+10

Viewing & modifying variables
-----------------------------
print or p: displays variable or register
e.g. p/x $sp will display the stack pointer

display: prints variable or register each time program stops
e.g.
display/x $pc will display the program counter
display argc

undisplay: undoes displays

info breakpoints: shows all current breakpoints

info stack: shows stack back trace ( if this doesn't work too well, I'll show you the
stacktrace by hand below ).

info locals: displays local variables.

info args: display current procedure arguments.

set args: will set argc & argv each time the victim program is invoked.

set <variable>=value
set argc=100
set $pc=0



Modifying execution
-------------------
step: steps n lines of sourcecode
step steps 1 line.
step 100 steps 100 lines of code.

next: like step except this will not step into subroutines

stepi: steps a single machine code instruction.
e.g. stepi 100

nexti: steps a single machine code instruction but will not step into subroutines.

finish: will run until exit of the current routine

run: (re)starts a program

cont: continues a program

quit: exits gdb.


breakpoints
-----------

break
sets a breakpoint
e.g.

break main

break *$pc

break *0x400618

here's a really useful one for large programs
rbr
Set a breakpoint for all functions matching REGEXP
e.g.
rbr 390
will set a breakpoint on all functions with 390 in their name.

info breakpoints
lists all breakpoints

delete: delete breakpoint by number or delete them all
e.g.
delete 1 will delete the first breakpoint
delete will delete them all

watch: This will set a watchpoint ( usually hardware assisted ),
This will watch a variable till it changes
e.g.
watch cnt, will watch the variable cnt till it changes.
As an aside unfortunately gdb's architecture independent watchpoint code
is inconsistent & not very good, watchpoints usually work but not always.

info watchpoints: Display currently active watchpoints

condition: ( another useful one )
Specify breakpoint number N to break only if COND is true.
Usage is `condition N COND', where N is an integer and COND is an
expression to be evaluated whenever breakpoint N is reached.



User defined functions/macros
-----------------------------
define: ( Note this is very very useful, simple & powerful )
usage define <name> <list of commands> end

examples which you should consider putting into .gdbinit in your home directory
define d
stepi
disassemble $pc $pc+10
end

define e
nexti
disassemble $pc $pc+10
end


Other hard to classify stuff
----------------------------
signal n:
sends the victim program a signal.
e.g. signal 3 will send a SIGQUIT.

info signals:
what gdb does when the victim receives certain signals.

list:
e.g.
list lists current function source
list 1,10 list first 10 lines of current file.
list test.c:1,10


directory:
Adds directories to be searched for source if gdb cannot find the source.
(note it is a bit sensitive about slashes)
e.g. To add the root of the filesystem to the searchpath do
directory //


call <function>
This calls a function in the victim program, this is pretty powerful
e.g.
(gdb) call printf("hello world")
outputs:
$1 = 11

You might now be thinking that the line above didn't work, something extra had to be done.
(gdb) call fflush(stdout)
hello world$2 = 0
As an aside the debugger also calls malloc & free under the hood
to make space for the "hello world" string.



hints
-----
1) command completion works just like bash
( if you are a bad typist like me this really helps )
e.g. hit br <TAB> & cursor up & down :-).

2) if you have a debugging problem that takes a few steps to recreate
put the steps into a file called .gdbinit in your current working directory.
If you have defined a few extra useful user defined commands put these in
the .gdbinit in your home directory & they will be read each time gdb is launched.

A typical .gdbinit file might be.
break main
run
break runtime_exception
cont


stack chaining in gdb by hand
-----------------------------
This is done using the same trick described for VM
p/x (*($sp+56))&0x7fffffff get the first backchain.

For z/Architecture
Replace 56 with 112 & ignore the &0x7fffffff
in the macros below & do nasty casts to longs like the following
as gdb unfortunately deals with printed arguments as ints which
messes up everything.
i.e. here is a 3rd backchain dereference
p/x *(long *)(***(long ***)$sp+112)


this outputs
$5 = 0x528f18
on my machine.
Now you can use
info symbol (*($sp+56))&0x7fffffff
you might see something like.
rl_getc + 36 in section .text telling you what is located at address 0x528f18
Now do.
p/x (*(*$sp+56))&0x7fffffff
This outputs
$6 = 0x528ed0
Now do.
info symbol (*(*$sp+56))&0x7fffffff
rl_read_key + 180 in section .text
now do
p/x (*(**$sp+56))&0x7fffffff
& so on.
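
What those repeated expressions do is easier to see as a loop. The sketch below
models it in Python for the 31 bit case: memory is just a dictionary of address to
32 bit word, the word at the stack pointer is taken to be the backchain & the word
at offset 56 the saved return address, exactly as in the expressions above. The
stack addresses in the sample are invented; only the two return addresses come
from the gdb session shown.

```python
ADDR_MASK_31 = 0x7FFFFFFF   # strips the top ( addressing mode ) bit
RET_ADDR_OFFSET = 56        # 31 bit offset used above; the text uses 112 on z/Architecture


def walk_backchain(memory, sp, max_frames):
    """Follow the backchain from sp, collecting masked return addresses."""
    return_addresses = []
    for _ in range(max_frames):
        # same as p/x (*($sp+56))&0x7fffffff for the current frame
        return_addresses.append(memory[sp + RET_ADDR_OFFSET] & ADDR_MASK_31)
        sp = memory[sp]     # same as adding one more * in front of $sp
        if sp == 0:         # a zero backchain ends the chain
            break
    return return_addresses


# Made-up stack contents with three frames.
memory = {
    0x7FFFF000: 0x7FFFF060, 0x7FFFF000 + 56: 0x80528F18,
    0x7FFFF060: 0x7FFFF0C0, 0x7FFFF060 + 56: 0x80528ED0,
    0x7FFFF0C0: 0x00000000, 0x7FFFF0C0 + 56: 0x80400618,
}

for address in walk_backchain(memory, 0x7FFFF000, 10):
    print(hex(address))
```

Each address printed is what you would feed to info symbol.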

Disassembling instructions without debug info
---------------------------------------------
gdb typically complains if there is a lack of debugging
symbols in the disassemble command with
"No function contains specified address." To get around
this do
x/<number lines to disassemble>xi <address>
e.g.
x/20xi 0x400730



Note: Remember gdb has history just like bash, you don't need to retype the
whole line, just use the up & down arrows.



For more info
-------------
From your linuxbox do
man gdb or info gdb.

core dumps
----------
What is a core dump ?
A core dump is a file generated by the kernel ( if allowed ) which contains the registers
& all active pages of the program which has crashed.
From this file gdb will allow you to look at the registers, stack trace & memory of the
program as if it had just crashed on your system. It is usually called core & created in the
current working directory.
This is very useful in that a customer can mail a core dump to a technical support department
& the technical support department can reconstruct what happened,
provided they have an identical copy of this program with debugging symbols compiled in &
the source base of this build is available.
In short it is far more useful than something like a crash log could ever hope to be.

In theory all that is missing to restart a core dumped program is a kernel patch which
will do the following.
1) Make a new kernel task structure
2) Reload all the dumped pages back into the kernel's memory management structures.
3) Do the required clock fixups
4) Get all files & network connections for the process back into an identical state ( really difficult ).
5) A few more difficult things I haven't thought of.



Why have I never seen one ?
Probably because you haven't used the command
ulimit -c unlimited in bash
to allow core dumps, now do
ulimit -a
to verify that the limit was accepted.
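
The same limit can also be inspected & raised from inside a program. Here is a
minimal sketch using Python's standard resource module ( Unix only ). The helper
name is made up, & note that it can only raise the soft limit as far as the hard
limit, so it is not quite the same as ulimit -c unlimited if the hard limit has
been lowered.

```python
import resource


def enable_core_dumps():
    """Raise the soft core file size limit to the hard limit & return both."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = enable_core_dumps()
if soft == resource.RLIM_INFINITY:
    print("core dumps: unlimited")
else:
    print("core dumps limited to", soft, "bytes")
```

The new limit applies to the calling process & anything it starts afterwards.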
2176 2175
2177 A sample core dump 2176 A sample core dump
2178 To create this I'm going to do 2177 To create this I'm going to do
2179 ulimit -c unlimited 2178 ulimit -c unlimited
2180 gdb 2179 gdb
2181 to launch gdb (my victim app. ) now be bad & do the following from another 2180 to launch gdb (my victim app. ) now be bad & do the following from another
2182 telnet/xterm session to the same machine 2181 telnet/xterm session to the same machine
2183 ps -aux | grep gdb 2182 ps -aux | grep gdb
2184 kill -SIGSEGV <gdb's pid> 2183 kill -SIGSEGV <gdb's pid>
2185 or alternatively use killall -SIGSEGV gdb if you have the killall command. 2184 or alternatively use killall -SIGSEGV gdb if you have the killall command.
2186 Now look at the core dump. 2185 Now look at the core dump.
2187 ./gdb ./gdb core 2186 ./gdb core
2188 Displays the following 2187 Displays the following
2189 GNU gdb 4.18 2188 GNU gdb 4.18
2190 Copyright 1998 Free Software Foundation, Inc. 2189 Copyright 1998 Free Software Foundation, Inc.
2191 GDB is free software, covered by the GNU General Public License, and you are 2190 GDB is free software, covered by the GNU General Public License, and you are
2192 welcome to change it and/or distribute copies of it under certain conditions. 2191 welcome to change it and/or distribute copies of it under certain conditions.
2193 Type "show copying" to see the conditions. 2192 Type "show copying" to see the conditions.
2194 There is absolutely no warranty for GDB. Type "show warranty" for details. 2193 There is absolutely no warranty for GDB. Type "show warranty" for details.
2195 This GDB was configured as "s390-ibm-linux"... 2194 This GDB was configured as "s390-ibm-linux"...
2196 Core was generated by `./gdb'. 2195 Core was generated by `./gdb'.
2197 Program terminated with signal 11, Segmentation fault. 2196 Program terminated with signal 11, Segmentation fault.
2198 Reading symbols from /usr/lib/libncurses.so.4...done. 2197 Reading symbols from /usr/lib/libncurses.so.4...done.
2199 Reading symbols from /lib/libm.so.6...done. 2198 Reading symbols from /lib/libm.so.6...done.
2200 Reading symbols from /lib/libc.so.6...done. 2199 Reading symbols from /lib/libc.so.6...done.
2201 Reading symbols from /lib/ld-linux.so.2...done. 2200 Reading symbols from /lib/ld-linux.so.2...done.
2202 #0 0x40126d1a in read () from /lib/libc.so.6 2201 #0 0x40126d1a in read () from /lib/libc.so.6
2203 Setting up the environment for debugging gdb. 2202 Setting up the environment for debugging gdb.
2204 Breakpoint 1 at 0x4dc6f8: file utils.c, line 471. 2203 Breakpoint 1 at 0x4dc6f8: file utils.c, line 471.
2205 Breakpoint 2 at 0x4d87a4: file top.c, line 2609. 2204 Breakpoint 2 at 0x4d87a4: file top.c, line 2609.
2206 (top-gdb) info stack 2205 (top-gdb) info stack
2207 #0 0x40126d1a in read () from /lib/libc.so.6 2206 #0 0x40126d1a in read () from /lib/libc.so.6
2208 #1 0x528f26 in rl_getc (stream=0x7ffffde8) at input.c:402 2207 #1 0x528f26 in rl_getc (stream=0x7ffffde8) at input.c:402
2209 #2 0x528ed0 in rl_read_key () at input.c:381 2208 #2 0x528ed0 in rl_read_key () at input.c:381
2210 #3 0x5167e6 in readline_internal_char () at readline.c:454 2209 #3 0x5167e6 in readline_internal_char () at readline.c:454
2211 #4 0x5168ee in readline_internal_charloop () at readline.c:507 2210 #4 0x5168ee in readline_internal_charloop () at readline.c:507
2212 #5 0x51692c in readline_internal () at readline.c:521 2211 #5 0x51692c in readline_internal () at readline.c:521
2213 #6 0x5164fe in readline (prompt=0x7ffff810 "\177ÿøx\177ÿ÷Ø\177ÿøxÀ") 2212 #6 0x5164fe in readline (prompt=0x7ffff810 "\177ÿøx\177ÿ÷Ø\177ÿøxÀ")
2214 at readline.c:349 2213 at readline.c:349
2215 #7 0x4d7a8a in command_line_input (prrompt=0x564420 "(gdb) ", repeat=1, 2214 #7 0x4d7a8a in command_line_input (prrompt=0x564420 "(gdb) ", repeat=1,
2216 annotation_suffix=0x4d6b44 "prompt") at top.c:2091 2215 annotation_suffix=0x4d6b44 "prompt") at top.c:2091
2217 #8 0x4d6cf0 in command_loop () at top.c:1345 2216 #8 0x4d6cf0 in command_loop () at top.c:1345
2218 #9 0x4e25bc in main (argc=1, argv=0x7ffffdf4) at main.c:635 2217 #9 0x4e25bc in main (argc=1, argv=0x7ffffdf4) at main.c:635
2219 2218
2220 2219
2221 LDD 2220 LDD
2222 === 2221 ===
2223 This is a program which lists the shared libraries which a library needs, 2222 This is a program which lists the shared libraries which a library needs,
2224 Note you also get the relocations of the shared library text segments which 2223 Note you also get the relocations of the shared library text segments which
2225 help when using objdump --source. 2224 help when using objdump --source.
2226 e.g. 2225 e.g.
2227 ldd ./gdb 2226 ldd ./gdb
2228 outputs 2227 outputs
2229 libncurses.so.4 => /usr/lib/libncurses.so.4 (0x40018000) 2228 libncurses.so.4 => /usr/lib/libncurses.so.4 (0x40018000)
2230 libm.so.6 => /lib/libm.so.6 (0x4005e000) 2229 libm.so.6 => /lib/libm.so.6 (0x4005e000)
2231 libc.so.6 => /lib/libc.so.6 (0x40084000) 2230 libc.so.6 => /lib/libc.so.6 (0x40084000)
2232 /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000) 2231 /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
2233 2232
2234 2233
2235 Debugging shared libraries 2234 Debugging shared libraries
2236 ========================== 2235 ==========================
2237 Most programs use shared libraries, however it can be very painful 2236 Most programs use shared libraries, however it can be very painful
2238 when you single step instruction into a function like printf for the 2237 when you single step instruction into a function like printf for the
2239 first time & you end up in functions like _dl_runtime_resolve this is 2238 first time & you end up in functions like _dl_runtime_resolve this is
2240 the ld.so doing lazy binding, lazy binding is a concept in ELF where 2239 the ld.so doing lazy binding, lazy binding is a concept in ELF where
2241 shared library functions are not loaded into memory unless they are 2240 shared library functions are not loaded into memory unless they are
2242 actually used, great for saving memory but a pain to debug. 2241 actually used, great for saving memory but a pain to debug.
2243 To get around this either relink the program -static or exit gdb type 2242 To get around this either relink the program -static or exit gdb type
2244 export LD_BIND_NOW=true this will stop lazy binding & restart the gdb'ing 2243 export LD_BIND_NOW=true this will stop lazy binding & restart the gdb'ing
2245 the program in question. 2244 the program in question.
2246 2245
2247 2246
2248 2247
2249 Debugging modules 2248 Debugging modules
2250 ================= 2249 =================
2251 As modules are dynamically loaded into the kernel their address can be 2250 As modules are dynamically loaded into the kernel their address can be
2252 anywhere to get around this use the -m option with insmod to emit a load 2251 anywhere to get around this use the -m option with insmod to emit a load
2253 map which can be piped into a file if required. 2252 map which can be piped into a file if required.
2254 2253
2255 The proc file system 2254 The proc file system
2256 ==================== 2255 ====================
2257 What is it ?. 2256 What is it ?.
2258 It is a filesystem created by the kernel with files which are created on demand 2257 It is a filesystem created by the kernel with files which are created on demand
2259 by the kernel if read, or can be used to modify kernel parameters, 2258 by the kernel if read, or can be used to modify kernel parameters,
2260 it is a powerful concept. 2259 it is a powerful concept.
2261 2260
2262 e.g. 2261 e.g.
2263 2262
2264 cat /proc/sys/net/ipv4/ip_forward 2263 cat /proc/sys/net/ipv4/ip_forward
2265 On my machine outputs 2264 On my machine outputs
2266 0 2265 0
2267 telling me ip_forwarding is not on to switch it on I can do 2266 telling me ip_forwarding is not on to switch it on I can do
2268 echo 1 > /proc/sys/net/ipv4/ip_forward 2267 echo 1 > /proc/sys/net/ipv4/ip_forward
2269 cat it again 2268 cat it again
2270 cat /proc/sys/net/ipv4/ip_forward 2269 cat /proc/sys/net/ipv4/ip_forward
2271 On my machine now outputs 2270 On my machine now outputs
2272 1 2271 1
2273 IP forwarding is on. 2272 IP forwarding is on.
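The cat and echo steps above amount to reading and writing a one-line text file, so any language can do the same. A small sketch in Python follows; read_flag and write_flag are hypothetical helper names, and writing under /proc/sys is assumed to need root.

```python
from pathlib import Path

# Path taken from the text above; other /proc/sys entries work the same way.
IP_FORWARD = Path("/proc/sys/net/ipv4/ip_forward")


def read_flag(path):
    """Return the integer stored in a one-line /proc/sys style file."""
    return int(Path(path).read_text().strip())


def write_flag(path, value):
    """Write an integer to the file; on /proc/sys this normally needs root."""
    Path(path).write_text(f"{int(value)}\n")


if __name__ == "__main__":
    print("ip_forward =", read_flag(IP_FORWARD))
```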
2274 There is a lot of useful info in here best found by going in & having a look around, 2273 There is a lot of useful info in here best found by going in & having a look around,
2275 so I'll take you through some entries I consider important. 2274 so I'll take you through some entries I consider important.
2276 2275
2277 All the processes running on the machine have their own entry defined by 2276 All the processes running on the machine have their own entry defined by
2278 /proc/<pid> 2277 /proc/<pid>
2279 So let's have a look at the init process 2278 So let's have a look at the init process
2280 cd /proc/1 2279 cd /proc/1
2281 2280
2282 cat cmdline 2281 cat cmdline
2283 emits 2282 emits
2284 init [2] 2283 init [2]
2285 2284
2286 cd /proc/1/fd 2285 cd /proc/1/fd
2287 This contains numerical entries of all the open files, 2286 This contains numerical entries of all the open files,
2288 some of these you can cat e.g. stdout (1) 2287 some of these you can cat e.g. stdout (1)
2289 2288
2290 cat /proc/29/maps 2289 cat /proc/29/maps
2291 on my machine emits 2290 on my machine emits
2292 2291
2293 00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash 2292 00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash
2294 00478000-0047e000 rw-p 00077000 5f:00 4103 /bin/bash 2293 00478000-0047e000 rw-p 00077000 5f:00 4103 /bin/bash
2295 0047e000-00492000 rwxp 00000000 00:00 0 2294 0047e000-00492000 rwxp 00000000 00:00 0
2296 40000000-40015000 r-xp 00000000 5f:00 14382 /lib/ld-2.1.2.so 2295 40000000-40015000 r-xp 00000000 5f:00 14382 /lib/ld-2.1.2.so
2297 40015000-40016000 rw-p 00014000 5f:00 14382 /lib/ld-2.1.2.so 2296 40015000-40016000 rw-p 00014000 5f:00 14382 /lib/ld-2.1.2.so
2298 40016000-40017000 rwxp 00000000 00:00 0 2297 40016000-40017000 rwxp 00000000 00:00 0
2299 40017000-40018000 rw-p 00000000 00:00 0 2298 40017000-40018000 rw-p 00000000 00:00 0
2300 40018000-4001b000 r-xp 00000000 5f:00 14435 /lib/libtermcap.so.2.0.8 2299 40018000-4001b000 r-xp 00000000 5f:00 14435 /lib/libtermcap.so.2.0.8
2301 4001b000-4001c000 rw-p 00002000 5f:00 14435 /lib/libtermcap.so.2.0.8 2300 4001b000-4001c000 rw-p 00002000 5f:00 14435 /lib/libtermcap.so.2.0.8
2302 4001c000-4010d000 r-xp 00000000 5f:00 14387 /lib/libc-2.1.2.so 2301 4001c000-4010d000 r-xp 00000000 5f:00 14387 /lib/libc-2.1.2.so
2303 4010d000-40111000 rw-p 000f0000 5f:00 14387 /lib/libc-2.1.2.so 2302 4010d000-40111000 rw-p 000f0000 5f:00 14387 /lib/libc-2.1.2.so
2304 40111000-40114000 rw-p 00000000 00:00 0 2303 40111000-40114000 rw-p 00000000 00:00 0
2305 40114000-4011e000 r-xp 00000000 5f:00 14408 /lib/libnss_files-2.1.2.so 2304 40114000-4011e000 r-xp 00000000 5f:00 14408 /lib/libnss_files-2.1.2.so
2306 4011e000-4011f000 rw-p 00009000 5f:00 14408 /lib/libnss_files-2.1.2.so 2305 4011e000-4011f000 rw-p 00009000 5f:00 14408 /lib/libnss_files-2.1.2.so
2307 7fffd000-80000000 rwxp ffffe000 00:00 0 2306 7fffd000-80000000 rwxp ffffe000 00:00 0
2308 2307
2309 2308
2310 Showing us the shared libraries init uses where they are in memory 2309 Showing us the shared libraries init uses where they are in memory
2311 & memory access permissions for each virtual memory area. 2310 & memory access permissions for each virtual memory area.
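Each line of the maps output has the same columns: address range, permissions, file offset, device, inode and an optional path. The sketch below splits one such line in Python; the field names in Mapping are chosen here for readability and are not kernel names.

```python
from collections import namedtuple

# Field names are illustrative; the kernel does not label the columns.
Mapping = namedtuple("Mapping", "start end perms offset dev inode path")


def parse_maps_line(line):
    """Split one line of /proc/<pid>/maps into its columns."""
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    # Anonymous mappings have no path column at all.
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


if __name__ == "__main__":
    sample = "00400000-00478000 r-xp 00000000 5f:00 4103 /bin/bash"
    m = parse_maps_line(sample)
    print(hex(m.start), hex(m.end), m.perms, m.path)
```

Subtracting start from end gives the size of the area, which is handy when matching an address from a crash against the library it falls in.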
2312 2311
2313 /proc/1/cwd is a softlink to the current working directory. 2312 /proc/1/cwd is a softlink to the current working directory.
2314 /proc/1/root is the root of the filesystem for this process. 2313 /proc/1/root is the root of the filesystem for this process.
2315 2314
2316 /proc/1/mem is the current running process's memory which you 2315 /proc/1/mem is the current running process's memory which you
2317 can read & write to like a file. 2316 can read & write to like a file.
2318 strace uses this sometimes as it is a bit faster than the 2317 strace uses this sometimes as it is a bit faster than the
2319 rather inefficient ptrace interface for peeking at DATA. 2318 rather inefficient ptrace interface for peeking at DATA.
2320 2319
2321 2320
2322 cat status 2321 cat status
2323 2322
2324 Name: init 2323 Name: init
2325 State: S (sleeping) 2324 State: S (sleeping)
2326 Pid: 1 2325 Pid: 1
2327 PPid: 0 2326 PPid: 0
2328 Uid: 0 0 0 0 2327 Uid: 0 0 0 0
2329 Gid: 0 0 0 0 2328 Gid: 0 0 0 0
2330 Groups: 2329 Groups:
2331 VmSize: 408 kB 2330 VmSize: 408 kB
2332 VmLck: 0 kB 2331 VmLck: 0 kB
2333 VmRSS: 208 kB 2332 VmRSS: 208 kB
2334 VmData: 24 kB 2333 VmData: 24 kB
2335 VmStk: 8 kB 2334 VmStk: 8 kB
2336 VmExe: 368 kB 2335 VmExe: 368 kB
2337 VmLib: 0 kB 2336 VmLib: 0 kB
2338 SigPnd: 0000000000000000 2337 SigPnd: 0000000000000000
2339 SigBlk: 0000000000000000 2338 SigBlk: 0000000000000000
2340 SigIgn: 7fffffffd7f0d8fc 2339 SigIgn: 7fffffffd7f0d8fc
2341 SigCgt: 00000000280b2603 2340 SigCgt: 00000000280b2603
2342 CapInh: 00000000fffffeff 2341 CapInh: 00000000fffffeff
2343 CapPrm: 00000000ffffffff 2342 CapPrm: 00000000ffffffff
2344 CapEff: 00000000fffffeff 2343 CapEff: 00000000fffffeff
2345 2344
2346 User PSW: 070de000 80414146 2345 User PSW: 070de000 80414146
2347 task: 004b6000 tss: 004b62d8 ksp: 004b7ca8 pt_regs: 004b7f68 2346 task: 004b6000 tss: 004b62d8 ksp: 004b7ca8 pt_regs: 004b7f68
2348 User GPRS: 2347 User GPRS:
2349 00000400 00000000 0000000b 7ffffa90 2348 00000400 00000000 0000000b 7ffffa90
2350 00000000 00000000 00000000 0045d9f4 2349 00000000 00000000 00000000 0045d9f4
2351 0045cafc 7ffffa90 7fffff18 0045cb08 2350 0045cafc 7ffffa90 7fffff18 0045cb08
2352 00010400 804039e8 80403af8 7ffff8b0 2351 00010400 804039e8 80403af8 7ffff8b0
2353 User ACRS: 2352 User ACRS:
2354 00000000 00000000 00000000 00000000 2353 00000000 00000000 00000000 00000000
2355 00000001 00000000 00000000 00000000 2354 00000001 00000000 00000000 00000000
2356 00000000 00000000 00000000 00000000 2355 00000000 00000000 00000000 00000000
2357 00000000 00000000 00000000 00000000 2356 00000000 00000000 00000000 00000000
2358 Kernel BackChain CallChain BackChain CallChain 2357 Kernel BackChain CallChain BackChain CallChain
2359 004b7ca8 8002bd0c 004b7d18 8002b92c 2358 004b7ca8 8002bd0c 004b7d18 8002b92c
2360 004b7db8 8005cd50 004b7e38 8005d12a 2359 004b7db8 8005cd50 004b7e38 8005d12a
2361 004b7f08 80019114 2360 004b7f08 80019114
2362 Showing among other things memory usage & status of some signals & 2361 Showing among other things memory usage & status of some signals &
2363 the process's registers from the kernel task_structure 2362 the process's registers from the kernel task_structure
2364 as well as a backchain which may be useful if a process crashes 2363 as well as a backchain which may be useful if a process crashes
2365 in the kernel for some unknown reason. 2364 in the kernel for some unknown reason.
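The SigPnd, SigBlk, SigIgn and SigCgt values are hexadecimal bit masks in which bit 0 stands for signal 1, bit 1 for signal 2 and so on. A short Python sketch for decoding one of them follows; signals_in_mask is a hypothetical helper name.

```python
def signals_in_mask(hex_mask):
    """Return the signal numbers whose bits are set in a status mask."""
    mask = int(hex_mask, 16)
    # Bit 0 stands for signal 1, bit 1 for signal 2, and so on.
    return [bit + 1 for bit in range(mask.bit_length()) if mask >> bit & 1]


if __name__ == "__main__":
    # SigCgt value copied from the status output above.
    print(signals_in_mask("00000000280b2603"))
```

Run on the SigCgt value above, the list includes signal 11, so this process installs its own handler for segmentation faults.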
2366 2365
2367 Some driver debugging techniques 2366 Some driver debugging techniques
2368 ================================ 2367 ================================
2369 debug feature 2368 debug feature
2370 ------------- 2369 -------------
2371 Some of our drivers now support a "debug feature" in 2370 Some of our drivers now support a "debug feature" in
2372 /proc/s390dbf see s390dbf.txt in the linux/Documentation directory 2371 /proc/s390dbf see s390dbf.txt in the linux/Documentation directory
2373 for more info. 2372 for more info.
2374 e.g. 2373 e.g.
2375 to switch on the lcs "debug feature" 2374 to switch on the lcs "debug feature"
2376 echo 5 > /proc/s390dbf/lcs/level 2375 echo 5 > /proc/s390dbf/lcs/level
2377 & then after the error occurred. 2376 & then after the error occurred.
2378 cat /proc/s390dbf/lcs/sprintf >/logfile 2377 cat /proc/s390dbf/lcs/sprintf >/logfile
2379 the logfile now contains some information which may help 2378 the logfile now contains some information which may help
2380 tech support resolve a problem in the field. 2379 tech support resolve a problem in the field.
2381 2380
2382 2381
2383 2382
2384 high level debugging network drivers 2383 high level debugging network drivers
2385 ------------------------------------ 2384 ------------------------------------
2386 ifconfig is a quite useful command 2385 ifconfig is a quite useful command
2387 it gives the current state of network drivers. 2386 it gives the current state of network drivers.
2388 2387
2389 If you suspect your network device driver is dead 2388 If you suspect your network device driver is dead
2390 one way to check is type 2389 one way to check is type
2391 ifconfig <network device> 2390 ifconfig <network device>
2392 e.g. tr0 2391 e.g. tr0
2393 You should see something like 2392 You should see something like
2394 tr0 Link encap:16/4 Mbps Token Ring (New) HWaddr 00:04:AC:20:8E:48 2393 tr0 Link encap:16/4 Mbps Token Ring (New) HWaddr 00:04:AC:20:8E:48
2395 inet addr:9.164.185.132 Bcast:9.164.191.255 Mask:255.255.224.0 2394 inet addr:9.164.185.132 Bcast:9.164.191.255 Mask:255.255.224.0
2396 UP BROADCAST RUNNING MULTICAST MTU:2000 Metric:1 2395 UP BROADCAST RUNNING MULTICAST MTU:2000 Metric:1
2397 RX packets:246134 errors:0 dropped:0 overruns:0 frame:0 2396 RX packets:246134 errors:0 dropped:0 overruns:0 frame:0
2398 TX packets:5 errors:0 dropped:0 overruns:0 carrier:0 2397 TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
2399 collisions:0 txqueuelen:100 2398 collisions:0 txqueuelen:100
2400 2399
2401 if the device doesn't say up 2400 if the device doesn't say up
2402 try 2401 try
2403 /etc/rc.d/init.d/network start 2402 /etc/rc.d/init.d/network start
2404 ( this starts the network stack & hopefully calls ifconfig tr0 up ). 2403 ( this starts the network stack & hopefully calls ifconfig tr0 up ).
2405 ifconfig looks at the output of /proc/net/dev & presents it in a more presentable form 2404 ifconfig looks at the output of /proc/net/dev & presents it in a more presentable form
2406 Now ping the device from a machine in the same subnet. 2405 Now ping the device from a machine in the same subnet.
2407 if the RX packets count & TX packets counts don't increment you probably 2406 if the RX packets count & TX packets counts don't increment you probably
2408 have problems. 2407 have problems.
2409 next 2408 next
2410 cat /proc/net/arp 2409 cat /proc/net/arp
2411 Do you see any hardware addresses in the cache if not you may have problems. 2410 Do you see any hardware addresses in the cache if not you may have problems.
2412 Next try 2411 Next try
2413 ping -c 5 <broadcast_addr> i.e. the Bcast field above in the output of 2412 ping -c 5 <broadcast_addr> i.e. the Bcast field above in the output of
2414 ifconfig. Do you see any replies from machines other than the local machine 2413 ifconfig. Do you see any replies from machines other than the local machine
2415 if not you may have problems. also if the TX packets count in ifconfig 2414 if not you may have problems. also if the TX packets count in ifconfig
2416 hasn't incremented either you have serious problems in your driver 2415 hasn't incremented either you have serious problems in your driver
2417 (e.g. the txbusy field of the network device being stuck on ) 2416 (e.g. the txbusy field of the network device being stuck on )
2418 or you may have multiple network devices connected. 2417 or you may have multiple network devices connected.
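Checking whether the RX and TX packet counts move can be automated by reading /proc/net/dev directly. The following Python sketch assumes the usual layout of that file (eight receive counters followed by eight transmit counters per interface, with the packet count second in each group); packet_counts is a hypothetical helper name.

```python
def packet_counts(dev_text, ifname):
    """Return (rx_packets, tx_packets) for ifname from /proc/net/dev text."""
    for line in dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # The two header lines contain no colon and are skipped here.
        if not sep or name.strip() != ifname:
            continue
        counters = [int(x) for x in rest.split()]
        # Assumed layout: 8 receive counters, then 8 transmit counters.
        return counters[1], counters[9]
    raise KeyError(ifname)


if __name__ == "__main__":
    with open("/proc/net/dev") as f:
        print(packet_counts(f.read(), "lo"))
```

Take one sample before the ping and one after; if neither number has changed, the driver is probably not passing traffic.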
2419 2418
2420 2419
2421 chandev 2420 chandev
2422 ------- 2421 -------
2423 There is a new device layer for channel devices, some 2422 There is a new device layer for channel devices, some
2424 drivers e.g. lcs are registered with this layer. 2423 drivers e.g. lcs are registered with this layer.
2425 If the device uses the channel device layer you'll be 2424 If the device uses the channel device layer you'll be
2426 able to find what interrupts it uses & the current state 2425 able to find what interrupts it uses & the current state
2427 of the device. 2426 of the device.
2428 See the manpage chandev.8 & type cat /proc/chandev for more info. 2427 See the manpage chandev.8 & type cat /proc/chandev for more info.
2429 2428
2430 2429
2431 2430
2432 Starting points for debugging scripting languages etc. 2431 Starting points for debugging scripting languages etc.
2433 ====================================================== 2432 ======================================================
2434 2433
2435 bash/sh 2434 bash/sh
2436 2435
2437 bash -x <scriptname> 2436 bash -x <scriptname>
2438 e.g. bash -x /usr/bin/bashbug 2437 e.g. bash -x /usr/bin/bashbug
2439 displays the following lines as it executes them. 2438 displays the following lines as it executes them.
2440 + MACHINE=i586 2439 + MACHINE=i586
2441 + OS=linux-gnu 2440 + OS=linux-gnu
2442 + CC=gcc 2441 + CC=gcc
2443 + CFLAGS= -DPROGRAM='bash' -DHOSTTYPE='i586' -DOSTYPE='linux-gnu' -DMACHTYPE='i586-pc-linux-gnu' -DSHELL -DHAVE_CONFIG_H -I. -I. -I./lib -O2 -pipe 2442 + CFLAGS= -DPROGRAM='bash' -DHOSTTYPE='i586' -DOSTYPE='linux-gnu' -DMACHTYPE='i586-pc-linux-gnu' -DSHELL -DHAVE_CONFIG_H -I. -I. -I./lib -O2 -pipe
2444 + RELEASE=2.01 2443 + RELEASE=2.01
2445 + PATCHLEVEL=1 2444 + PATCHLEVEL=1
2446 + RELSTATUS=release 2445 + RELSTATUS=release
2447 + MACHTYPE=i586-pc-linux-gnu 2446 + MACHTYPE=i586-pc-linux-gnu
2448 2447
2449 perl -d <scriptname> runs the perlscript in a fully interactive debugger 2448 perl -d <scriptname> runs the perlscript in a fully interactive debugger
2450 <like gdb>. 2449 <like gdb>.
2451 Type 'h' in the debugger for help. 2450 Type 'h' in the debugger for help.
2452 2451
2453 for debugging java type 2452 for debugging java type
2454 jdb <filename> another fully interactive gdb style debugger. 2453 jdb <filename> another fully interactive gdb style debugger.
2455 & type ? in the debugger for help. 2454 & type ? in the debugger for help.
2456 2455
2457 2456
2458 2457
2459 Dumptool & Lcrash ( lkcd ) 2458 Dumptool & Lcrash ( lkcd )
2460 ========================== 2459 ==========================
2461 Michael Holzheu & others here at IBM have a fairly mature port of 2460 Michael Holzheu & others here at IBM have a fairly mature port of
2462 SGI's lcrash tool which allows one to look at kernel structures in a 2461 SGI's lcrash tool which allows one to look at kernel structures in a
2463 running kernel. 2462 running kernel.
2464 2463
2465 It also complements a tool called dumptool which dumps all the kernel's 2464 It also complements a tool called dumptool which dumps all the kernel's
2466 memory pages & registers to either a tape or a disk. 2465 memory pages & registers to either a tape or a disk.
2467 This can be used by tech support or an ambitious end user to do 2466 This can be used by tech support or an ambitious end user to do
2468 post mortem debugging of a machine like gdb core dumps. 2467 post mortem debugging of a machine like gdb core dumps.
2469 2468
2470 Going into how to use this tool in detail will be explained 2469 Going into how to use this tool in detail will be explained
2471 in other documentation supplied by IBM with the patches & the 2470 in other documentation supplied by IBM with the patches & the
2472 lcrash homepage http://oss.sgi.com/projects/lkcd/ & the lcrash manpage. 2471 lcrash homepage http://oss.sgi.com/projects/lkcd/ & the lcrash manpage.
2473 2472
2474 How they work 2473 How they work
2475 ------------- 2474 -------------
2476 Lcrash is a perfectly normal program,however, it requires 2 2475 Lcrash is a perfectly normal program,however, it requires 2
2477 additional files, Kerntypes which is built using a patch to the 2476 additional files, Kerntypes which is built using a patch to the
2478 linux kernel sources in the linux root directory & the System.map. 2477 linux kernel sources in the linux root directory & the System.map.
2479 2478
2480 Kerntypes is an an objectfile whose sole purpose in life 2479 Kerntypes is an objectfile whose sole purpose in life
2481 is to provide stabs debug info to lcrash, to do this 2480 is to provide stabs debug info to lcrash, to do this
2482 Kerntypes is built from kerntypes.c which just includes the most commonly 2481 Kerntypes is built from kerntypes.c which just includes the most commonly
2483 referenced header files used when debugging, lcrash can then read the 2482 referenced header files used when debugging, lcrash can then read the
2484 .stabs section of this file. 2483 .stabs section of this file.
2485 2484
2486 Debugging a live system it uses /dev/mem 2485 Debugging a live system it uses /dev/mem
2487 alternatively for post mortem debugging it uses the data 2486 alternatively for post mortem debugging it uses the data
2488 collected by dumptool. 2487 collected by dumptool.
2489 2488
2490 2489
2491 2490
2492 SysRq 2491 SysRq
2493 ===== 2492 =====
2494 This is now supported by linux for s/390 & z/Architecture. 2493 This is now supported by linux for s/390 & z/Architecture.
2495 To enable it do compile the kernel with 2494 To enable it do compile the kernel with
2496 Kernel Hacking -> Magic SysRq Key Enabled 2495 Kernel Hacking -> Magic SysRq Key Enabled
2497 echo "1" > /proc/sys/kernel/sysrq 2496 echo "1" > /proc/sys/kernel/sysrq
2498 also type 2497 also type
2499 echo "8" >/proc/sys/kernel/printk 2498 echo "8" >/proc/sys/kernel/printk
2500 To make printk output go to console. 2499 To make printk output go to console.
2501 On 390 all commands are prefixed with 2500 On 390 all commands are prefixed with
2502 ^- 2501 ^-
2503 e.g. 2502 e.g.
2504 ^-t will show tasks. 2503 ^-t will show tasks.
2505 ^-? or some unknown command will display help. 2504 ^-? or some unknown command will display help.
2506 The sysrq key reading is very picky ( I have to type the keys in an 2505 The sysrq key reading is very picky ( I have to type the keys in an
2507 xterm session & paste them into the x3270 console ) 2506 xterm session & paste them into the x3270 console )
2508 & it may be wise to predefine the keys as described in the VM hints above 2507 & it may be wise to predefine the keys as described in the VM hints above
2509 2508
2510 This is particularly useful for syncing disks unmounting & rebooting 2509 This is particularly useful for syncing disks unmounting & rebooting
2511 if the machine gets partially hung. 2510 if the machine gets partially hung.
2512 2511
2513 Read Documentation/sysrq.txt for more info 2512 Read Documentation/sysrq.txt for more info
2514 2513
2515 References: 2514 References:
2516 =========== 2515 ===========
2517 Enterprise Systems Architecture Reference Summary 2516 Enterprise Systems Architecture Reference Summary
2518 Enterprise Systems Architecture Principles of Operation 2517 Enterprise Systems Architecture Principles of Operation
2519 Hartmut Penners s390 stack frame sheet. 2518 Hartmut Penners s390 stack frame sheet.
2520 IBM Mainframe Channel Attachment a technology brief from a CISCO webpage 2519 IBM Mainframe Channel Attachment a technology brief from a CISCO webpage
2521 Various bits of man & info pages of Linux. 2520 Various bits of man & info pages of Linux.
2522 Linux & GDB source. 2521 Linux & GDB source.
2523 Various info & man pages. 2522 Various info & man pages.
2524 CMS Help on tracing commands. 2523 CMS Help on tracing commands.
2525 Linux for s/390 Elf Application Binary Interface 2524 Linux for s/390 Elf Application Binary Interface
2526 Linux for z/Series Elf Application Binary Interface ( Both Highly Recommended ) 2525 Linux for z/Series Elf Application Binary Interface ( Both Highly Recommended )
2527 z/Architecture Principles of Operation SA22-7832-00 2526 z/Architecture Principles of Operation SA22-7832-00
2528 Enterprise Systems Architecture/390 Reference Summary SA22-7209-01 & the 2527 Enterprise Systems Architecture/390 Reference Summary SA22-7209-01 & the
2529 Enterprise Systems Architecture/390 Principles of Operation SA22-7201-05 2528 Enterprise Systems Architecture/390 Principles of Operation SA22-7201-05
2530 2529
2531 Special Thanks 2530 Special Thanks
2532 ============== 2531 ==============
2533 Special thanks to Neale Ferguson who maintains a much 2532 Special thanks to Neale Ferguson who maintains a much
2534 prettier HTML version of this page at 2533 prettier HTML version of this page at
2535 http://penguinvm.princeton.edu/notes.html#Debug390 2534 http://penguinvm.princeton.edu/notes.html#Debug390
2536 Bob Grainger Stefan Bader & others for reporting bugs 2535 Bob Grainger Stefan Bader & others for reporting bugs
2537 2536
Documentation/s390/s390dbf.txt
1 S390 Debug Feature 1 S390 Debug Feature
2 ================== 2 ==================
3 3
4 files: arch/s390/kernel/debug.c 4 files: arch/s390/kernel/debug.c
5 include/asm-s390/debug.h 5 include/asm-s390/debug.h
6 6
7 Description: 7 Description:
8 ------------ 8 ------------
9 The goal of this feature is to provide a kernel debug logging API 9 The goal of this feature is to provide a kernel debug logging API
10 where log records can be stored efficiently in memory, where each component 10 where log records can be stored efficiently in memory, where each component
11 (e.g. device drivers) can have one separate debug log. 11 (e.g. device drivers) can have one separate debug log.
12 One purpose of this is to inspect the debug logs after a production system crash 12 One purpose of this is to inspect the debug logs after a production system crash
13 in order to analyze the reason for the crash. 13 in order to analyze the reason for the crash.
14 If the system still runs but only a subcomponent which uses dbf fails, 14 If the system still runs but only a subcomponent which uses dbf fails,
15 it is possible to look at the debug logs on a live system via the Linux 15 it is possible to look at the debug logs on a live system via the Linux
16 debugfs filesystem. 16 debugfs filesystem.
17 The debug feature may also be very useful for kernel and driver development. 17 The debug feature may also be very useful for kernel and driver development.
18 18
19 Design: 19 Design:
20 ------- 20 -------
21 Kernel components (e.g. device drivers) can register themselves at the debug 21 Kernel components (e.g. device drivers) can register themselves at the debug
22 feature with the function call debug_register(). This function initializes a 22 feature with the function call debug_register(). This function initializes a
23 debug log for the caller. For each debug log exists a number of debug areas 23 debug log for the caller. For each debug log exists a number of debug areas
24 where exactly one is active at one time. Each debug area consists of contiguous 24 where exactly one is active at one time. Each debug area consists of contiguous
25 pages in memory. In the debug areas there are stored debug entries (log records) 25 pages in memory. In the debug areas there are stored debug entries (log records)
26 which are written by event- and exception-calls. 26 which are written by event- and exception-calls.
27 27
28 An event-call writes the specified debug entry to the active debug 28 An event-call writes the specified debug entry to the active debug
29 area and updates the log pointer for the active area. If the end 29 area and updates the log pointer for the active area. If the end
30 of the active debug area is reached, a wrap around is done (ring buffer) 30 of the active debug area is reached, a wrap around is done (ring buffer)
31 and the next debug entry will be written at the beginning of the active 31 and the next debug entry will be written at the beginning of the active
32 debug area. 32 debug area.
33 33
34 An exception-call writes the specified debug entry to the log and 34 An exception-call writes the specified debug entry to the log and
35 switches to the next debug area. This is done in order to be sure 35 switches to the next debug area. This is done in order to be sure
36 that the records which describe the origin of the exception are not 36 that the records which describe the origin of the exception are not
37 overwritten when a wrap around for the current area occurs. 37 overwritten when a wrap around for the current area occurs.
38 38
39 The debug areas themselves are also arranged as a ring buffer. 39 The debug areas themselves are also arranged as a ring buffer.
40 When an exception is thrown in the last debug area, the following debug 40 When an exception is thrown in the last debug area, the following debug
41 entries are then written again in the very first area. 41 entries are then written again in the very first area.
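
The interplay of event-calls, exception-calls and the two ring buffers
described above can be sketched with a small user-space model. This is only
an illustration and not the kernel implementation: all names (model_log,
model_event(), model_exception(), MODEL_AREAS, MODEL_ENTRIES) are
hypothetical, and the model assumes a fixed number of fixed-size entries per
area.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define MODEL_AREAS   3   /* number of debug areas                      */
#define MODEL_ENTRIES 4   /* entries that fit into one area (assumed)   */

struct model_entry {
	int value;        /* payload of the log record                  */
	int exception;    /* 1 if written by an exception-call          */
};

struct model_log {
	struct model_entry area[MODEL_AREAS][MODEL_ENTRIES];
	int active_area;            /* index of the active debug area   */
	int next[MODEL_AREAS];      /* per-area position of next write  */
};

void model_init(struct model_log *log)
{
	assert(log != NULL);
	memset(log, 0, sizeof(*log));
}

/* Write one entry into the active area; wrap around at the end of it. */
void model_write(struct model_log *log, int value, int exception)
{
	int a = log->active_area;
	int pos = log->next[a];

	log->area[a][pos].value = value;
	log->area[a][pos].exception = exception;
	log->next[a] = (pos + 1) % MODEL_ENTRIES;
}

/* Event-call: write the entry and stay in the active area. */
void model_event(struct model_log *log, int value)
{
	model_write(log, value, 0);
}

/* Exception-call: write the entry, then switch to the next area.
 * After the last area the first one becomes active again. */
void model_exception(struct model_log *log, int value)
{
	model_write(log, value, 1);
	log->active_area = (log->active_area + 1) % MODEL_AREAS;
}
```

In this model, events written after an exception land in a different area, so
the entries that led up to the exception are not overwritten until the ring
of areas wraps around to that area again.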
42 42
43 There are three versions for the event- and exception-calls: One for 43 There are three versions for the event- and exception-calls: One for
44 logging raw data, one for text and one for numbers. 44 logging raw data, one for text and one for numbers.
45 45
46 Each debug entry contains the following data: 46 Each debug entry contains the following data:
47 47
48 - Timestamp 48 - Timestamp
49 - Cpu-Number of calling task 49 - Cpu-Number of calling task
50 - Level of debug entry (0...6) 50 - Level of debug entry (0...6)
51 - Return Address to caller 51 - Return Address to caller
52 - Flag, if entry is an exception or not 52 - Flag, if entry is an exception or not
53 53
54 The debug logs can be inspected in a live system through entries in 54 The debug logs can be inspected in a live system through entries in
55 the debugfs-filesystem. Under the toplevel directory "s390dbf" there is 55 the debugfs-filesystem. Under the toplevel directory "s390dbf" there is
56 a directory for each registered component, which is named like the 56 a directory for each registered component, which is named like the
57 corresponding component. Debugfs is normally mounted at 57 corresponding component. Debugfs is normally mounted at
58 /sys/kernel/debug, so the debug feature can be accessed under 58 /sys/kernel/debug, so the debug feature can be accessed under
59 /sys/kernel/debug/s390dbf. 59 /sys/kernel/debug/s390dbf.
60 60
61 The directories contain files which represent different views 61 The directories contain files which represent different views
62 of the debug log. Each component can decide which views should be 62 of the debug log. Each component can decide which views should be
63 used through registering them with the function debug_register_view(). 63 used through registering them with the function debug_register_view().
64 Predefined views for hex/ascii, sprintf and raw binary data are provided. 64 Predefined views for hex/ascii, sprintf and raw binary data are provided.
65 It is also possible to define other views. The content of 65 It is also possible to define other views. The content of
66 a view can be inspected simply by reading the corresponding debugfs file. 66 a view can be inspected simply by reading the corresponding debugfs file.
67 67
68 All debug logs have an an actual debug level (range from 0 to 6). 68 All debug logs have an actual debug level (range from 0 to 6).
69 The default level is 3. Event and Exception functions have a 'level' 69 The default level is 3. Event and Exception functions have a 'level'
70 parameter. Only debug entries with a level that is lower than or equal 70 parameter. Only debug entries with a level that is lower than or equal
71 to the actual level are written to the log. This means, when 71 to the actual level are written to the log. This means, when
72 writing events, high priority log entries should have a low level 72 writing events, high priority log entries should have a low level
73 value whereas low priority entries should have a high one. 73 value whereas low priority entries should have a high one.
74 The actual debug level can be changed with the help of the debugfs-filesystem 74 The actual debug level can be changed with the help of the debugfs-filesystem
75 through writing a number string "x" to the 'level' debugfs file which is 75 through writing a number string "x" to the 'level' debugfs file which is
76 provided for every debug log. Debugging can be switched off completely 76 provided for every debug log. Debugging can be switched off completely
77 by using "-" on the 'level' debugfs file. 77 by using "-" on the 'level' debugfs file.
78 78
79 Example: 79 Example:
80 80
81 > echo "-" > /sys/kernel/debug/s390dbf/dasd/level 81 > echo "-" > /sys/kernel/debug/s390dbf/dasd/level
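
The level rule above can be sketched in user-space C. This is a minimal
sketch, not the kernel code: model_parse_level(), model_is_logged() and
MODEL_LEVEL_OFF are hypothetical names, and representing "switched off" as -1
is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_LEVEL_OFF (-1)  /* assumed marker for "debugging switched off" */
#define MODEL_LEVEL_MAX 6     /* highest level named in the text             */

/*
 * Interpret a string as it might be written to the 'level' file.
 * Returns 0 and stores the new level on success, -1 on invalid input.
 */
int model_parse_level(const char *s, int *level)
{
	char *end;
	long v;

	assert(s != NULL && level != NULL);
	if (strcmp(s, "-") == 0 || strcmp(s, "-\n") == 0) {
		*level = MODEL_LEVEL_OFF;
		return 0;
	}
	v = strtol(s, &end, 10);
	if (end == s)
		return -1;              /* no digits at all */
	if (*end == '\n')
		end++;                  /* tolerate the newline added by echo */
	if (*end != '\0')
		return -1;              /* trailing garbage */
	if (v < 0 || v > MODEL_LEVEL_MAX)
		return -1;              /* outside 0..6 */
	*level = (int)v;
	return 0;
}

/* An entry is logged only if its level is <= the actual level. */
int model_is_logged(int entry_level, int actual_level)
{
	return entry_level <= actual_level;
}
```

With the default level 3, an entry of level 2 is logged and an entry of
level 4 is dropped. With the level switched off, no entry passes the check,
because every valid entry level is greater than the off marker.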
82 82
83 It is also possible to deactivate the debug feature globally for every 83 It is also possible to deactivate the debug feature globally for every
84 debug log. You can change the behavior using 2 sysctl parameters in 84 debug log. You can change the behavior using 2 sysctl parameters in
85 /proc/sys/s390dbf: 85 /proc/sys/s390dbf:
86 There are currently 2 possible triggers, which stop the debug feature 86 There are currently 2 possible triggers, which stop the debug feature
87 globally. The first possibility is to use the "debug_active" sysctl. If 87 globally. The first possibility is to use the "debug_active" sysctl. If
88 set to 1 the debug feature is running. If "debug_active" is set to 0 the 88 set to 1 the debug feature is running. If "debug_active" is set to 0 the
89 debug feature is turned off. 89 debug feature is turned off.
90 The second trigger which stops the debug feature is a kernel oops. 90 The second trigger which stops the debug feature is a kernel oops.
91 That prevents the debug feature from overwriting debug information that 91 That prevents the debug feature from overwriting debug information that
92 happened before the oops. After an oops you can reactivate the debug feature 92 happened before the oops. After an oops you can reactivate the debug feature
93 by piping 1 to /proc/sys/s390dbf/debug_active. Nevertheless, it is not 93 by piping 1 to /proc/sys/s390dbf/debug_active. Nevertheless, it is not
94 recommended to use an oopsed kernel in a production environment. 94 recommended to use an oopsed kernel in a production environment.
95 If you want to disallow the deactivation of the debug feature, you can use 95 If you want to disallow the deactivation of the debug feature, you can use
96 the "debug_stoppable" sysctl. If you set "debug_stoppable" to 0 the debug 96 the "debug_stoppable" sysctl. If you set "debug_stoppable" to 0 the debug
97 feature cannot be stopped. If the debug feature is already stopped, it 97 feature cannot be stopped. If the debug feature is already stopped, it
98 will stay deactivated. 98 will stay deactivated.
99 99
100 Kernel Interfaces: 100 Kernel Interfaces:
101 ------------------ 101 ------------------
102 102
103 ---------------------------------------------------------------------------- 103 ----------------------------------------------------------------------------
104 debug_info_t *debug_register(char *name, int pages, int nr_areas, 104 debug_info_t *debug_register(char *name, int pages, int nr_areas,
105 int buf_size); 105 int buf_size);
106 106
107 Parameter: name: Name of debug log (e.g. used for debugfs entry) 107 Parameter: name: Name of debug log (e.g. used for debugfs entry)
108 pages: number of pages, which will be allocated per area 108 pages: number of pages, which will be allocated per area
109 nr_areas: number of debug areas 109 nr_areas: number of debug areas
110 buf_size: size of data area in each debug entry 110 buf_size: size of data area in each debug entry
111 111
112 Return Value: Handle for generated debug area 112 Return Value: Handle for generated debug area
113 NULL if register failed 113 NULL if register failed
114 114
115 Description: Allocates memory for a debug log 115 Description: Allocates memory for a debug log
116 Must not be called within an interrupt handler 116 Must not be called within an interrupt handler
117 117
118 --------------------------------------------------------------------------- 118 ---------------------------------------------------------------------------
119 void debug_unregister (debug_info_t * id); 119 void debug_unregister (debug_info_t * id);
120 120
121 Parameter: id: handle for debug log 121 Parameter: id: handle for debug log
122 122
123 Return Value: none 123 Return Value: none
124 124
125 Description: frees memory for a debug log 125 Description: frees memory for a debug log
126 Must not be called within an interrupt handler 126 Must not be called within an interrupt handler
127 127
128 --------------------------------------------------------------------------- 128 ---------------------------------------------------------------------------
129 void debug_set_level (debug_info_t * id, int new_level); 129 void debug_set_level (debug_info_t * id, int new_level);
130 130
131 Parameter: id: handle for debug log 131 Parameter: id: handle for debug log
132 new_level: new debug level 132 new_level: new debug level
133 133
134 Return Value: none 134 Return Value: none
135 135
136 Description: Sets new actual debug level if new_level is valid. 136 Description: Sets new actual debug level if new_level is valid.
137 137
138 --------------------------------------------------------------------------- 138 ---------------------------------------------------------------------------
139 void debug_stop_all(void); 139 void debug_stop_all(void);
140 140
141 Parameter: none 141 Parameter: none
142 142
143 Return Value: none 143 Return Value: none
144 144
145 Description: stops the debug feature if stopping is allowed. Currently 145 Description: stops the debug feature if stopping is allowed. Currently
146 used in case of a kernel oops. 146 used in case of a kernel oops.
147 147
148 --------------------------------------------------------------------------- 148 ---------------------------------------------------------------------------
149 debug_entry_t* debug_event (debug_info_t* id, int level, void* data, 149 debug_entry_t* debug_event (debug_info_t* id, int level, void* data,
150 int length); 150 int length);
151 151
152 Parameter: id: handle for debug log 152 Parameter: id: handle for debug log
153 level: debug level 153 level: debug level
154 data: pointer to data for debug entry 154 data: pointer to data for debug entry
155 length: length of data in bytes 155 length: length of data in bytes
156 156
157 Return Value: Address of written debug entry 157 Return Value: Address of written debug entry
158 158
159 Description: writes debug entry to active debug area (if level <= actual 159 Description: writes debug entry to active debug area (if level <= actual
160 debug level) 160 debug level)
161 161
162 --------------------------------------------------------------------------- 162 ---------------------------------------------------------------------------
163 debug_entry_t* debug_int_event (debug_info_t * id, int level, 163 debug_entry_t* debug_int_event (debug_info_t * id, int level,
164 unsigned int data); 164 unsigned int data);
165 debug_entry_t* debug_long_event(debug_info_t * id, int level, 165 debug_entry_t* debug_long_event(debug_info_t * id, int level,
166 unsigned long data); 166 unsigned long data);
167 167
168 Parameter: id: handle for debug log 168 Parameter: id: handle for debug log
169 level: debug level 169 level: debug level
170 data: integer value for debug entry 170 data: integer value for debug entry
171 171
172 Return Value: Address of written debug entry 172 Return Value: Address of written debug entry
173 173
174 Description: writes debug entry to active debug area (if level <= actual 174 Description: writes debug entry to active debug area (if level <= actual
175 debug level) 175 debug level)
176 176
177 --------------------------------------------------------------------------- 177 ---------------------------------------------------------------------------
178 debug_entry_t* debug_text_event (debug_info_t * id, int level, 178 debug_entry_t* debug_text_event (debug_info_t * id, int level,
179 const char* data); 179 const char* data);
180 180
181 Parameter: id: handle for debug log 181 Parameter: id: handle for debug log
182 level: debug level 182 level: debug level
183 data: string for debug entry 183 data: string for debug entry
184 184
185 Return Value: Address of written debug entry 185 Return Value: Address of written debug entry
186 186
187 Description: writes debug entry in ascii format to active debug area 187 Description: writes debug entry in ascii format to active debug area
188 (if level <= actual debug level) 188 (if level <= actual debug level)
189 189
190 --------------------------------------------------------------------------- 190 ---------------------------------------------------------------------------
191 debug_entry_t* debug_sprintf_event (debug_info_t * id, int level, 191 debug_entry_t* debug_sprintf_event (debug_info_t * id, int level,
192 char* string,...); 192 char* string,...);
193 193
194 Parameter: id: handle for debug log 194 Parameter: id: handle for debug log
195 level: debug level 195 level: debug level
196 string: format string for debug entry 196 string: format string for debug entry
197 ...: varargs used as in sprintf() 197 ...: varargs used as in sprintf()
198 198
199 Return Value: Address of written debug entry 199 Return Value: Address of written debug entry
200 200
201 Description: writes debug entry with format string and varargs (longs) to 201 Description: writes debug entry with format string and varargs (longs) to
202 active debug area (if level <= actual debug level). 202 active debug area (if level <= actual debug level).
203 floats and long long datatypes cannot be used as varargs. 203 floats and long long datatypes cannot be used as varargs.
204 204
205 --------------------------------------------------------------------------- 205 ---------------------------------------------------------------------------
206 206
207 debug_entry_t* debug_exception (debug_info_t* id, int level, void* data, 207 debug_entry_t* debug_exception (debug_info_t* id, int level, void* data,
208 int length); 208 int length);
209 209
210 Parameter: id: handle for debug log 210 Parameter: id: handle for debug log
211 level: debug level 211 level: debug level
212 data: pointer to data for debug entry 212 data: pointer to data for debug entry
213 length: length of data in bytes 213 length: length of data in bytes
214 214
215 Return Value: Address of written debug entry 215 Return Value: Address of written debug entry
216 216
217 Description: writes debug entry to active debug area (if level <= actual 217 Description: writes debug entry to active debug area (if level <= actual
218 debug level) and switches to next debug area 218 debug level) and switches to next debug area
219 219
220 --------------------------------------------------------------------------- 220 ---------------------------------------------------------------------------
221 debug_entry_t* debug_int_exception (debug_info_t * id, int level, 221 debug_entry_t* debug_int_exception (debug_info_t * id, int level,
222 unsigned int data); 222 unsigned int data);
223 debug_entry_t* debug_long_exception(debug_info_t * id, int level, 223 debug_entry_t* debug_long_exception(debug_info_t * id, int level,
224 unsigned long data); 224 unsigned long data);
225 225
226 Parameter: id: handle for debug log 226 Parameter: id: handle for debug log
227 level: debug level 227 level: debug level
228 data: integer value for debug entry 228 data: integer value for debug entry
229 229
230 Return Value: Address of written debug entry 230 Return Value: Address of written debug entry
231 231
232 Description: writes debug entry to active debug area (if level <= actual 232 Description: writes debug entry to active debug area (if level <= actual
233 debug level) and switches to next debug area 233 debug level) and switches to next debug area
234 234
235 --------------------------------------------------------------------------- 235 ---------------------------------------------------------------------------
236 debug_entry_t* debug_text_exception (debug_info_t * id, int level, 236 debug_entry_t* debug_text_exception (debug_info_t * id, int level,
237 const char* data); 237 const char* data);
238 238
239 Parameter: id: handle for debug log 239 Parameter: id: handle for debug log
240 level: debug level 240 level: debug level
241 data: string for debug entry 241 data: string for debug entry
242 242
243 Return Value: Address of written debug entry 243 Return Value: Address of written debug entry
244 244
245 Description: writes debug entry in ascii format to active debug area 245 Description: writes debug entry in ascii format to active debug area
246 (if level <= actual debug level) and switches to next debug 246 (if level <= actual debug level) and switches to next debug
247 area 247 area
248 248
249 --------------------------------------------------------------------------- 249 ---------------------------------------------------------------------------
250 debug_entry_t* debug_sprintf_exception (debug_info_t * id, int level, 250 debug_entry_t* debug_sprintf_exception (debug_info_t * id, int level,
251 char* string,...); 251 char* string,...);
252 252
253 Parameter: id: handle for debug log 253 Parameter: id: handle for debug log
254 level: debug level 254 level: debug level
255 string: format string for debug entry 255 string: format string for debug entry
256 ...: varargs used as in sprintf() 256 ...: varargs used as in sprintf()
257 257
258 Return Value: Address of written debug entry 258 Return Value: Address of written debug entry
259 259
260 Description: writes debug entry with format string and varargs (longs) to 260 Description: writes debug entry with format string and varargs (longs) to
261 active debug area (if level <= actual debug level) and 261 active debug area (if level <= actual debug level) and
262 switches to next debug area. 262 switches to next debug area.
263 floats and long long datatypes cannot be used as varargs. 263 floats and long long datatypes cannot be used as varargs.
264 264
265 --------------------------------------------------------------------------- 265 ---------------------------------------------------------------------------
266 266
267 int debug_register_view (debug_info_t * id, struct debug_view *view); 267 int debug_register_view (debug_info_t * id, struct debug_view *view);
268 268
269 Parameter: id: handle for debug log 269 Parameter: id: handle for debug log
270 view: pointer to debug view struct 270 view: pointer to debug view struct
271 271
272 Return Value: 0 : ok 272 Return Value: 0 : ok
273 < 0: Error 273 < 0: Error
274 274
275 Description: registers new debug view and creates debugfs dir entry 275 Description: registers new debug view and creates debugfs dir entry
276 276
277 --------------------------------------------------------------------------- 277 ---------------------------------------------------------------------------
278 int debug_unregister_view (debug_info_t * id, struct debug_view *view); 278 int debug_unregister_view (debug_info_t * id, struct debug_view *view);
279 279
280 Parameter: id: handle for debug log 280 Parameter: id: handle for debug log
281 view: pointer to debug view struct 281 view: pointer to debug view struct
282 282
283 Return Value: 0 : ok 283 Return Value: 0 : ok
284 < 0: Error 284 < 0: Error
285 285
286 Description: unregisters debug view and removes debugfs dir entry 286 Description: unregisters debug view and removes debugfs dir entry
287 287
288 288
289 289
290 Predefined views: 290 Predefined views:
291 ----------------- 291 -----------------
292 292
293 extern struct debug_view debug_hex_ascii_view; 293 extern struct debug_view debug_hex_ascii_view;
294 extern struct debug_view debug_raw_view; 294 extern struct debug_view debug_raw_view;
295 extern struct debug_view debug_sprintf_view; 295 extern struct debug_view debug_sprintf_view;
296 296
297 Examples 297 Examples
298 -------- 298 --------
299 299
300 /* 300 /*
301 * hex_ascii- + raw-view Example 301 * hex_ascii- + raw-view Example
302 */ 302 */
303 303
304 #include <linux/init.h> 304 #include <linux/init.h>
305 #include <asm/debug.h> 305 #include <asm/debug.h>
306 306
307 static debug_info_t* debug_info; 307 static debug_info_t* debug_info;
308 308
309 static int init(void) 309 static int init(void)
310 { 310 {
311 /* register 4 debug areas with one page each and 4 byte data field */ 311 /* register 4 debug areas with one page each and 4 byte data field */
312 312
313 debug_info = debug_register ("test", 1, 4, 4 ); 313 debug_info = debug_register ("test", 1, 4, 4 );
314 debug_register_view(debug_info,&debug_hex_ascii_view); 314 debug_register_view(debug_info,&debug_hex_ascii_view);
315 debug_register_view(debug_info,&debug_raw_view); 315 debug_register_view(debug_info,&debug_raw_view);
316 316
317 debug_text_event(debug_info, 4 , "one "); 317 debug_text_event(debug_info, 4 , "one ");
318 debug_int_exception(debug_info, 4, 4711); 318 debug_int_exception(debug_info, 4, 4711);
319 debug_event(debug_info, 3, &debug_info, 4); 319 debug_event(debug_info, 3, &debug_info, 4);
320 320
321 return 0; 321 return 0;
322 } 322 }
323 323
324 static void cleanup(void) 324 static void cleanup(void)
325 { 325 {
326 debug_unregister (debug_info); 326 debug_unregister (debug_info);
327 } 327 }
328 328
329 module_init(init); 329 module_init(init);
330 module_exit(cleanup); 330 module_exit(cleanup);
331 331
332 --------------------------------------------------------------------------- 332 ---------------------------------------------------------------------------
333 333
334 /* 334 /*
335 * sprintf-view Example 335 * sprintf-view Example
336 */ 336 */
337 337
338 #include <linux/init.h> 338 #include <linux/init.h>
339 #include <asm/debug.h> 339 #include <asm/debug.h>
340 340
341 static debug_info_t* debug_info; 341 static debug_info_t* debug_info;
342 342
343 static int init(void) 343 static int init(void)
344 { 344 {
345 /* register 4 debug areas with one page each and data field for */ 345 /* register 4 debug areas with one page each and data field for */
346 /* format string pointer + 2 varargs (= 3 * sizeof(long)) */ 346 /* format string pointer + 2 varargs (= 3 * sizeof(long)) */
347 347
348 debug_info = debug_register ("test", 1, 4, sizeof(long) * 3); 348 debug_info = debug_register ("test", 1, 4, sizeof(long) * 3);
349 debug_register_view(debug_info,&debug_sprintf_view); 349 debug_register_view(debug_info,&debug_sprintf_view);
350 350
351 debug_sprintf_event(debug_info, 2 , "first event in %s:%i\n",__FILE__,__LINE__); 351 debug_sprintf_event(debug_info, 2 , "first event in %s:%i\n",__FILE__,__LINE__);
352 debug_sprintf_exception(debug_info, 1, "pointer to debug info: %p\n",&debug_info); 352 debug_sprintf_exception(debug_info, 1, "pointer to debug info: %p\n",&debug_info);
353 353
354 return 0; 354 return 0;
355 } 355 }
356 356
357 static void cleanup(void) 357 static void cleanup(void)
358 { 358 {
359 debug_unregister (debug_info); 359 debug_unregister (debug_info);
360 } 360 }
361 361
362 module_init(init); 362 module_init(init);
363 module_exit(cleanup); 363 module_exit(cleanup);
364 364
365 365
366 366
367 Debugfs Interface 367 Debugfs Interface
368 ---------------- 368 ----------------
369 Views to the debug logs can be investigated through reading the corresponding 369 Views to the debug logs can be investigated through reading the corresponding
370 debugfs-files: 370 debugfs-files:
371 371
372 Example: 372 Example:
373 373
374 > ls /sys/kernel/debug/s390dbf/dasd 374 > ls /sys/kernel/debug/s390dbf/dasd
375 flush hex_ascii level pages raw 375 flush hex_ascii level pages raw
376 > cat /sys/kernel/debug/s390dbf/dasd/hex_ascii | sort -k2 376 > cat /sys/kernel/debug/s390dbf/dasd/hex_ascii | sort -k2
377 00 00974733272:680099 2 - 02 0006ad7e 07 ea 4a 90 | .... 377 00 00974733272:680099 2 - 02 0006ad7e 07 ea 4a 90 | ....
378 00 00974733272:682210 2 - 02 0006ade6 46 52 45 45 | FREE 378 00 00974733272:682210 2 - 02 0006ade6 46 52 45 45 | FREE
379 00 00974733272:682213 2 - 02 0006adf6 07 ea 4a 90 | .... 379 00 00974733272:682213 2 - 02 0006adf6 07 ea 4a 90 | ....
380 00 00974733272:682281 1 * 02 0006ab08 41 4c 4c 43 | EXCP 380 00 00974733272:682281 1 * 02 0006ab08 41 4c 4c 43 | EXCP
381 01 00974733272:682284 2 - 02 0006ab16 45 43 4b 44 | ECKD 381 01 00974733272:682284 2 - 02 0006ab16 45 43 4b 44 | ECKD
382 01 00974733272:682287 2 - 02 0006ab28 00 00 00 04 | .... 382 01 00974733272:682287 2 - 02 0006ab28 00 00 00 04 | ....
383 01 00974733272:682289 2 - 02 0006ab3e 00 00 00 20 | ... 383 01 00974733272:682289 2 - 02 0006ab3e 00 00 00 20 | ...
384 01 00974733272:682297 2 - 02 0006ad7e 07 ea 4a 90 | .... 384 01 00974733272:682297 2 - 02 0006ad7e 07 ea 4a 90 | ....
385 01 00974733272:684384 2 - 00 0006ade6 46 52 45 45 | FREE 385 01 00974733272:684384 2 - 00 0006ade6 46 52 45 45 | FREE
386 01 00974733272:684388 2 - 00 0006adf6 07 ea 4a 90 | .... 386 01 00974733272:684388 2 - 00 0006adf6 07 ea 4a 90 | ....
387 387
388 See section about predefined views for explanation of the above output! 388 See section about predefined views for explanation of the above output!
389 389
390 Changing the debug level 390 Changing the debug level
391 ------------------------ 391 ------------------------
392 392
393 Example: 393 Example:
394 394
395 395
396 > cat /sys/kernel/debug/s390dbf/dasd/level 396 > cat /sys/kernel/debug/s390dbf/dasd/level
397 3 397 3
398 > echo "5" > /sys/kernel/debug/s390dbf/dasd/level 398 > echo "5" > /sys/kernel/debug/s390dbf/dasd/level
399 > cat /sys/kernel/debug/s390dbf/dasd/level 399 > cat /sys/kernel/debug/s390dbf/dasd/level
400 5 400 5
401 401
402 Flushing debug areas 402 Flushing debug areas
403 -------------------- 403 --------------------
404 Debug areas can be flushed by piping the number of the desired 404 Debug areas can be flushed by piping the number of the desired
405 area (0...n) to the debugfs file "flush". When using "-" all debug areas 405 area (0...n) to the debugfs file "flush". When using "-" all debug areas
406 are flushed. 406 are flushed.
407 407
408 Examples: 408 Examples:
409 409
410 1. Flush debug area 0: 410 1. Flush debug area 0:
411 > echo "0" > /sys/kernel/debug/s390dbf/dasd/flush 411 > echo "0" > /sys/kernel/debug/s390dbf/dasd/flush
412 412
413 2. Flush all debug areas: 413 2. Flush all debug areas:
414 > echo "-" > /sys/kernel/debug/s390dbf/dasd/flush 414 > echo "-" > /sys/kernel/debug/s390dbf/dasd/flush
415 415
416 Changing the size of debug areas 416 Changing the size of debug areas
417 ------------------------------------ 417 ------------------------------------
418 It is possible to change the size of debug areas by piping 418 It is possible to change the size of debug areas by piping
419 the number of pages to the debugfs file "pages". The resize request will 419 the number of pages to the debugfs file "pages". The resize request will
420 also flush the debug areas. 420 also flush the debug areas.
421 421
422 Example: 422 Example:
423 423
424 Define 4 pages for the debug areas of debug feature "dasd": 424 Define 4 pages for the debug areas of debug feature "dasd":
425 > echo "4" > /sys/kernel/debug/s390dbf/dasd/pages 425 > echo "4" > /sys/kernel/debug/s390dbf/dasd/pages
426 426
427 Stopping the debug feature 427 Stopping the debug feature
428 -------------------------- 428 --------------------------
429 Example: 429 Example:
430 430
431 1. Check if stopping is allowed 431 1. Check if stopping is allowed
432 > cat /proc/sys/s390dbf/debug_stoppable 432 > cat /proc/sys/s390dbf/debug_stoppable
433 2. Stop debug feature 433 2. Stop debug feature
434 > echo 0 > /proc/sys/s390dbf/debug_active 434 > echo 0 > /proc/sys/s390dbf/debug_active
435 435
436 lcrash Interface 436 lcrash Interface
437 ---------------- 437 ----------------
438 It is planned that the dump analysis tool lcrash gets an additional command 438 It is planned that the dump analysis tool lcrash gets an additional command
439 's390dbf' to display all the debug logs. With this tool it will be possible 439 's390dbf' to display all the debug logs. With this tool it will be possible
440 to investigate the debug logs on a live system and with a memory dump after 440 to investigate the debug logs on a live system and with a memory dump after
441 a system crash. 441 a system crash.
442 442
443 Investigating raw memory 443 Investigating raw memory
444 ------------------------ 444 ------------------------
445 One last possibility to investigate the debug logs on a live 445 One last possibility to investigate the debug logs on a live
446 system and after a system crash is to look at the raw memory 446 system and after a system crash is to look at the raw memory
447 under VM or at the Service Element. 447 under VM or at the Service Element.
448 It is possible to find the anchor of the debug-logs through 448 It is possible to find the anchor of the debug-logs through
449 the 'debug_area_first' symbol in the System map. Then one has 449 the 'debug_area_first' symbol in the System map. Then one has
450 to follow the correct pointers of the data-structures defined 450 to follow the correct pointers of the data-structures defined
451 in debug.h and find the debug-areas in memory. 451 in debug.h and find the debug-areas in memory.
452 Normally modules which use the debug feature will also have 452 Normally modules which use the debug feature will also have
453 a global variable with the pointer to the debug-logs. Following 453 a global variable with the pointer to the debug-logs. Following
454 this pointer it will also be possible to find the debug logs in 454 this pointer it will also be possible to find the debug logs in
455 memory. 455 memory.
456 456
457 For this method it is recommended to use '16 * x + 4' bytes (x = 0..n) 457 For this method it is recommended to use '16 * x + 4' bytes (x = 0..n)
458 for the length of the data field in debug_register() in 458 for the length of the data field in debug_register() in
459 order to see the debug entries well formatted. 459 order to see the debug entries well formatted.
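
The '16 * x + 4' recommendation can be turned into a small helper that rounds
a desired data length up to the next value of that form. This is a sketch
only; model_round_buf_size() is a hypothetical name and not part of the debug
feature.

```c
#include <assert.h>

/* Smallest length of the form 16 * x + 4 (x >= 0) that holds 'len' bytes. */
int model_round_buf_size(int len)
{
	assert(len >= 0);
	if (len <= 4)
		return 4;                       /* x = 0 */
	return ((len - 4 + 15) / 16) * 16 + 4;  /* round (len - 4) up to 16 */
}
```

The result could then be passed as the buf_size argument of debug_register().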
460 460
461 461
462 Predefined Views 462 Predefined Views
463 ---------------- 463 ----------------
464 464
465 There are three predefined views: hex_ascii, raw and sprintf. 465 There are three predefined views: hex_ascii, raw and sprintf.
466 The hex_ascii view shows the data field in hex and ascii representation 466 The hex_ascii view shows the data field in hex and ascii representation
467 (e.g. '45 43 4b 44 | ECKD'). 467 (e.g. '45 43 4b 44 | ECKD').
468 The raw view returns a bytestream as the debug areas are stored in memory. 468 The raw view returns a bytestream as the debug areas are stored in memory.
469 469
470 The sprintf view formats the debug entries in the same way as the sprintf 470 The sprintf view formats the debug entries in the same way as the sprintf
471 function would do. The sprintf event/exception functions write to the 471 function would do. The sprintf event/exception functions write to the
472 debug entry a pointer to the format string (size = sizeof(long)) 472 debug entry a pointer to the format string (size = sizeof(long))
473 and for each vararg a long value. So e.g. for a debug entry with a format 473 and for each vararg a long value. So e.g. for a debug entry with a format
474 string plus two varargs one would need to allocate a (3 * sizeof(long)) 474 string plus two varargs one would need to allocate a (3 * sizeof(long))
475 byte data area in the debug_register() function. 475 byte data area in the debug_register() function.
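
The size rule for the sprintf view can be written down as a one-line
calculation. The helper below is a hypothetical illustration
(model_sprintf_buf_size() does not exist in the kernel); it only restates
the arithmetic from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>

/* One long for the format string pointer plus one long per vararg. */
size_t model_sprintf_buf_size(unsigned int nr_varargs)
{
	return (1 + (size_t)nr_varargs) * sizeof(long);
}

/* Example from the text: format string plus two varargs. */
void model_sprintf_example(void)
{
	assert(model_sprintf_buf_size(2) == 3 * sizeof(long));
}
```

A component whose largest sprintf entry takes two varargs would therefore
pass at least 3 * sizeof(long) as buf_size to debug_register().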
476 476
477 477
478 NOTE: If using the sprintf view do NOT use other event/exception functions 478 NOTE: If using the sprintf view do NOT use other event/exception functions
479 than the sprintf-event and -exception functions. 479 than the sprintf-event and -exception functions.
480 480
481 The format of the hex_ascii and sprintf view is as follows: 481 The format of the hex_ascii and sprintf view is as follows:
482 - Number of area 482 - Number of area
483 - Timestamp (formatted as seconds and microseconds since 00:00:00 Coordinated 483 - Timestamp (formatted as seconds and microseconds since 00:00:00 Coordinated
484 Universal Time (UTC), January 1, 1970) 484 Universal Time (UTC), January 1, 1970)
485 - level of debug entry 485 - level of debug entry
486 - Exception flag (* = Exception) 486 - Exception flag (* = Exception)
487 - Cpu-Number of calling task 487 - Cpu-Number of calling task
488 - Return Address to caller 488 - Return Address to caller
489 - data field 489 - data field
490 490
491 The format of the raw view is: 491 The format of the raw view is:
492 - Header as described in debug.h 492 - Header as described in debug.h
493 - datafield 493 - datafield
494 494
495 A typical line of the hex_ascii view will look like the following (first line 495 A typical line of the hex_ascii view will look like the following (first line
496 is only for explanation and will not be displayed when 'cating' the view): 496 is only for explanation and will not be displayed when 'cating' the view):
497 497
498 area time level exception cpu caller data (hex + ascii) 498 area time level exception cpu caller data (hex + ascii)
499 -------------------------------------------------------------------------- 499 --------------------------------------------------------------------------
500 00 00964419409:440690 1 - 00 88023fe 500 00 00964419409:440690 1 - 00 88023fe
501 501
502 502
503 Defining views 503 Defining views
504 -------------- 504 --------------
505 505
506 Views are specified with the 'debug_view' structure. There are defined 506 Views are specified with the 'debug_view' structure. There are defined
507 callback functions which are used for reading and writing the debugfs files: 507 callback functions which are used for reading and writing the debugfs files:
508 508
509 struct debug_view { 509 struct debug_view {
510 char name[DEBUG_MAX_PROCF_LEN]; 510 char name[DEBUG_MAX_PROCF_LEN];
511 debug_prolog_proc_t* prolog_proc; 511 debug_prolog_proc_t* prolog_proc;
512 debug_header_proc_t* header_proc; 512 debug_header_proc_t* header_proc;
513 debug_format_proc_t* format_proc; 513 debug_format_proc_t* format_proc;
514 debug_input_proc_t* input_proc; 514 debug_input_proc_t* input_proc;
515 void* private_data; 515 void* private_data;
516 }; 516 };
517 517
518 where 518 where
519 519
520 typedef int (debug_header_proc_t) (debug_info_t* id, 520 typedef int (debug_header_proc_t) (debug_info_t* id,
521 struct debug_view* view, 521 struct debug_view* view,
522 int area, 522 int area,
523 debug_entry_t* entry, 523 debug_entry_t* entry,
524 char* out_buf); 524 char* out_buf);
525 525
526 typedef int (debug_format_proc_t) (debug_info_t* id, 526 typedef int (debug_format_proc_t) (debug_info_t* id,
527 struct debug_view* view, char* out_buf, 527 struct debug_view* view, char* out_buf,
528 const char* in_buf); 528 const char* in_buf);
529 typedef int (debug_prolog_proc_t) (debug_info_t* id, 529 typedef int (debug_prolog_proc_t) (debug_info_t* id,
530 struct debug_view* view, 530 struct debug_view* view,
531 char* out_buf); 531 char* out_buf);
532 typedef int (debug_input_proc_t) (debug_info_t* id, 532 typedef int (debug_input_proc_t) (debug_info_t* id,
533 struct debug_view* view, 533 struct debug_view* view,
534 struct file* file, const char* user_buf, 534 struct file* file, const char* user_buf,
535 size_t in_buf_size, loff_t* offset); 535 size_t in_buf_size, loff_t* offset);
536 536
537 537
538 The "private_data" member can be used as pointer to view specific data. 538 The "private_data" member can be used as pointer to view specific data.
539 It is not used by the debug feature itself. 539 It is not used by the debug feature itself.
540 540
541 The output when reading a debugfs file is structured like this: 541 The output when reading a debugfs file is structured like this:
542 542
543 "prolog_proc output" 543 "prolog_proc output"
544 544
545 "header_proc output 1" "format_proc output 1" 545 "header_proc output 1" "format_proc output 1"
546 "header_proc output 2" "format_proc output 2" 546 "header_proc output 2" "format_proc output 2"
547 "header_proc output 3" "format_proc output 3" 547 "header_proc output 3" "format_proc output 3"
548 ... 548 ...
549 549
550 When a view is read from the debugfs, the Debug Feature calls the 550 When a view is read from the debugfs, the Debug Feature calls the
551 'prolog_proc' once for writing the prolog. 551 'prolog_proc' once for writing the prolog.
552 Then 'header_proc' and 'format_proc' are called for each 552 Then 'header_proc' and 'format_proc' are called for each
553 existing debug entry. 553 existing debug entry.
554 554
555 The input_proc can be used to implement functionality when it is written to 555 The input_proc can be used to implement functionality when it is written to
556 the view (e.g. like with 'echo "0" > /sys/kernel/debug/s390dbf/dasd/level). 556 the view (e.g. like with 'echo "0" > /sys/kernel/debug/s390dbf/dasd/level).
557 557
558 For header_proc there can be used the default function 558 For header_proc there can be used the default function
559 debug_dflt_header_fn() which is defined in in debug.h. 559 debug_dflt_header_fn() which is defined in debug.h.
560 and which produces the same header output as the predefined views. 560 and which produces the same header output as the predefined views.
561 E.g: 561 E.g:
562 00 00964419409:440761 2 - 00 88023ec 562 00 00964419409:440761 2 - 00 88023ec
563 563
564 In order to see how to use the callback functions check the implementation 564 In order to see how to use the callback functions check the implementation
565 of the default views! 565 of the default views!
566 566
567 Example 567 Example
568 568
569 #include <asm/debug.h> 569 #include <asm/debug.h>
570 570
571 #define UNKNOWNSTR "data: %08x" 571 #define UNKNOWNSTR "data: %08x"
572 572
573 const char* messages[] = 573 const char* messages[] =
574 {"This error...........\n", 574 {"This error...........\n",
575 "That error...........\n", 575 "That error...........\n",
576 "Problem..............\n", 576 "Problem..............\n",
577 "Something went wrong.\n", 577 "Something went wrong.\n",
578 "Everything ok........\n", 578 "Everything ok........\n",
579 NULL 579 NULL
580 }; 580 };
581 581
582 static int debug_test_format_fn( 582 static int debug_test_format_fn(
583 debug_info_t * id, struct debug_view *view, 583 debug_info_t * id, struct debug_view *view,
584 char *out_buf, const char *in_buf 584 char *out_buf, const char *in_buf
585 ) 585 )
586 { 586 {
587 int i, rc = 0; 587 int i, rc = 0;
588 588
589 if(id->buf_size >= 4) { 589 if(id->buf_size >= 4) {
590 int msg_nr = *((int*)in_buf); 590 int msg_nr = *((int*)in_buf);
591 if(msg_nr < sizeof(messages)/sizeof(char*) - 1) 591 if(msg_nr < sizeof(messages)/sizeof(char*) - 1)
592 rc += sprintf(out_buf, "%s", messages[msg_nr]); 592 rc += sprintf(out_buf, "%s", messages[msg_nr]);
593 else 593 else
594 rc += sprintf(out_buf, UNKNOWNSTR, msg_nr); 594 rc += sprintf(out_buf, UNKNOWNSTR, msg_nr);
595 } 595 }
596 out: 596 out:
597 return rc; 597 return rc;
598 } 598 }
599 599
600 struct debug_view debug_test_view = { 600 struct debug_view debug_test_view = {
601 "myview", /* name of view */ 601 "myview", /* name of view */
602 NULL, /* no prolog */ 602 NULL, /* no prolog */
603 &debug_dflt_header_fn, /* default header for each entry */ 603 &debug_dflt_header_fn, /* default header for each entry */
604 &debug_test_format_fn, /* our own format function */ 604 &debug_test_format_fn, /* our own format function */
605 NULL, /* no input function */ 605 NULL, /* no input function */
606 NULL /* no private data */ 606 NULL /* no private data */
607 }; 607 };
608 608
609 ===== 609 =====
610 test: 610 test:
611 ===== 611 =====
612 debug_info_t *debug_info; 612 debug_info_t *debug_info;
613 ... 613 ...
614 debug_info = debug_register ("test", 0, 4, 4 )); 614 debug_info = debug_register ("test", 0, 4, 4 ));
615 debug_register_view(debug_info, &debug_test_view); 615 debug_register_view(debug_info, &debug_test_view);
616 for(i = 0; i < 10; i ++) debug_int_event(debug_info, 1, i); 616 for(i = 0; i < 10; i ++) debug_int_event(debug_info, 1, i);
617 617
618 > cat /sys/kernel/debug/s390dbf/test/myview 618 > cat /sys/kernel/debug/s390dbf/test/myview
619 00 00964419734:611402 1 - 00 88042ca This error........... 619 00 00964419734:611402 1 - 00 88042ca This error...........
620 00 00964419734:611405 1 - 00 88042ca That error........... 620 00 00964419734:611405 1 - 00 88042ca That error...........
621 00 00964419734:611408 1 - 00 88042ca Problem.............. 621 00 00964419734:611408 1 - 00 88042ca Problem..............
622 00 00964419734:611411 1 - 00 88042ca Something went wrong. 622 00 00964419734:611411 1 - 00 88042ca Something went wrong.
623 00 00964419734:611414 1 - 00 88042ca Everything ok........ 623 00 00964419734:611414 1 - 00 88042ca Everything ok........
624 00 00964419734:611417 1 - 00 88042ca data: 00000005 624 00 00964419734:611417 1 - 00 88042ca data: 00000005
625 00 00964419734:611419 1 - 00 88042ca data: 00000006 625 00 00964419734:611419 1 - 00 88042ca data: 00000006
626 00 00964419734:611422 1 - 00 88042ca data: 00000007 626 00 00964419734:611422 1 - 00 88042ca data: 00000007
627 00 00964419734:611425 1 - 00 88042ca data: 00000008 627 00 00964419734:611425 1 - 00 88042ca data: 00000008
628 00 00964419734:611428 1 - 00 88042ca data: 00000009 628 00 00964419734:611428 1 - 00 88042ca data: 00000009
629 629
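Note on the s390 debug feature example above: the logic of debug_test_format_fn() can be tried outside the kernel. The following is a minimal user-space sketch, not the kernel implementation. The name format_entry() and its explicit size parameters are assumptions made for illustration; the real callback has the debug_format_proc_t signature shown in the document and reads the entry size from id->buf_size. The sketch also rejects negative message numbers and uses snprintf() to bound the output, which the original example does not do.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the message table of the documentation example. */
static const char *const messages[] = {
    "This error...........\n",
    "That error...........\n",
    "Problem..............\n",
    "Something went wrong.\n",
    "Everything ok........\n",
};

#define NUM_MESSAGES (sizeof(messages) / sizeof(messages[0]))

/*
 * format_entry() is a hypothetical helper, not part of the s390 debug API.
 * It interprets the first sizeof(int) bytes of in_buf as a message number
 * and writes either the matching message or a hex fallback to out_buf.
 * Returns the snprintf() result, or 0 if the entry is too small.
 */
static int format_entry(char *out_buf, size_t out_size,
                        const char *in_buf, size_t in_size)
{
    int msg_nr;

    if (in_size < sizeof(msg_nr))
        return 0;

    /* memcpy avoids the unaligned access a pointer cast could cause. */
    memcpy(&msg_nr, in_buf, sizeof(msg_nr));

    if (msg_nr >= 0 && (size_t)msg_nr < NUM_MESSAGES)
        return snprintf(out_buf, out_size, "%s", messages[msg_nr]);

    return snprintf(out_buf, out_size, "data: %08x", (unsigned int)msg_nr);
}
```

Calling format_entry() with the values 0 to 9 should reproduce the data column of the 'myview' output listed above: the five messages first, then the "data: ..." fallback for numbers without a message.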
Documentation/scsi/ChangeLog.1992-1997
1 Sat Jan 18 15:51:45 1997 Richard Henderson <rth@tamu.edu> 1 Sat Jan 18 15:51:45 1997 Richard Henderson <rth@tamu.edu>
2 2
3 * Don't play with usage_count directly, instead hand around 3 * Don't play with usage_count directly, instead hand around
4 the module header and use the module macros. 4 the module header and use the module macros.
5 5
6 Fri May 17 00:00:00 1996 Leonard N. Zubkoff <lnz@dandelion.com> 6 Fri May 17 00:00:00 1996 Leonard N. Zubkoff <lnz@dandelion.com>
7 7
8 * BusLogic Driver Version 2.0.3 Released. 8 * BusLogic Driver Version 2.0.3 Released.
9 9
10 Tue Apr 16 21:00:00 1996 Leonard N. Zubkoff <lnz@dandelion.com> 10 Tue Apr 16 21:00:00 1996 Leonard N. Zubkoff <lnz@dandelion.com>
11 11
12 * BusLogic Driver Version 1.3.2 Released. 12 * BusLogic Driver Version 1.3.2 Released.
13 13
14 Sun Dec 31 23:26:00 1995 Leonard N. Zubkoff <lnz@dandelion.com> 14 Sun Dec 31 23:26:00 1995 Leonard N. Zubkoff <lnz@dandelion.com>
15 15
16 * BusLogic Driver Version 1.3.1 Released. 16 * BusLogic Driver Version 1.3.1 Released.
17 17
18 Fri Nov 10 15:29:49 1995 Leonard N. Zubkoff <lnz@dandelion.com> 18 Fri Nov 10 15:29:49 1995 Leonard N. Zubkoff <lnz@dandelion.com>
19 19
20 * Released new BusLogic driver. 20 * Released new BusLogic driver.
21 21
22 Wed Aug 9 22:37:04 1995 Andries Brouwer <aeb@cwi.nl> 22 Wed Aug 9 22:37:04 1995 Andries Brouwer <aeb@cwi.nl>
23 23
24 As a preparation for new device code, separated the various 24 As a preparation for new device code, separated the various
25 functions the request->dev field had into the device proper, 25 functions the request->dev field had into the device proper,
26 request->rq_dev and a status field request->rq_status. 26 request->rq_dev and a status field request->rq_status.
27 27
28 The 2nd argument of bios_param is now a kdev_t. 28 The 2nd argument of bios_param is now a kdev_t.
29 29
30 Wed Jul 19 10:43:15 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 30 Wed Jul 19 10:43:15 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
31 31
32 * scsi.c (scsi_proc_info): /proc/scsi/scsi now also lists all 32 * scsi.c (scsi_proc_info): /proc/scsi/scsi now also lists all
33 attached devices. 33 attached devices.
34 34
35 * scsi_proc.c (proc_print_scsidevice): Added. Used by scsi.c and 35 * scsi_proc.c (proc_print_scsidevice): Added. Used by scsi.c and
36 eata_dma_proc.c to produce some device info for /proc/scsi. 36 eata_dma_proc.c to produce some device info for /proc/scsi.
37 37
38 * eata_dma.c (eata_queue)(eata_int_handler)(eata_scsi_done): 38 * eata_dma.c (eata_queue)(eata_int_handler)(eata_scsi_done):
39 Changed handling of internal SCSI commands send to the HBA. 39 Changed handling of internal SCSI commands send to the HBA.
40 40
41 41
42 Wed Jul 19 10:09:17 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 42 Wed Jul 19 10:09:17 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
43 43
44 * Linux 1.3.11 released. 44 * Linux 1.3.11 released.
45 45
46 * eata_dma.c (eata_queue)(eata_int_handler): Added code to do 46 * eata_dma.c (eata_queue)(eata_int_handler): Added code to do
47 command latency measurements if requested by root through 47 command latency measurements if requested by root through
48 /proc/scsi interface. 48 /proc/scsi interface.
49 Throughout Use HZ constant for time references. 49 Throughout Use HZ constant for time references.
50 50
51 * eata_pio.c: Use HZ constant for time references. 51 * eata_pio.c: Use HZ constant for time references.
52 52
53 * aic7xxx.c, aic7xxx.h, aic7xxx_asm.c: Changed copyright from BSD 53 * aic7xxx.c, aic7xxx.h, aic7xxx_asm.c: Changed copyright from BSD
54 to GNU style. 54 to GNU style.
55 55
56 * scsi.h: Added READ_12 command opcode constant 56 * scsi.h: Added READ_12 command opcode constant
57 57
58 Wed Jul 19 09:25:30 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 58 Wed Jul 19 09:25:30 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
59 59
60 * Linux 1.3.10 released. 60 * Linux 1.3.10 released.
61 61
62 * scsi_proc.c (dispatch_scsi_info): Removed unused variable. 62 * scsi_proc.c (dispatch_scsi_info): Removed unused variable.
63 63
64 Wed Jul 19 09:25:30 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 64 Wed Jul 19 09:25:30 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
65 65
66 * Linux 1.3.9 released. 66 * Linux 1.3.9 released.
67 67
68 * scsi.c Blacklist concept expanded to 'support' more device 68 * scsi.c Blacklist concept expanded to 'support' more device
69 deficiencies. blacklist[] renamed to device_list[] 69 deficiencies. blacklist[] renamed to device_list[]
70 (scan_scsis): Code cleanup. 70 (scan_scsis): Code cleanup.
71 71
72 * scsi_debug.c (scsi_debug_proc_info): Added support to control 72 * scsi_debug.c (scsi_debug_proc_info): Added support to control
73 device lockup simulation via /proc/scsi interface. 73 device lockup simulation via /proc/scsi interface.
74 74
75 75
76 Wed Jul 19 09:22:34 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 76 Wed Jul 19 09:22:34 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
77 77
78 * Linux 1.3.7 released. 78 * Linux 1.3.7 released.
79 79
80 * scsi_proc.c: Fixed a number of bugs in directory handling 80 * scsi_proc.c: Fixed a number of bugs in directory handling
81 81
82 Wed Jul 19 09:18:28 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 82 Wed Jul 19 09:18:28 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
83 83
84 * Linux 1.3.5 released. 84 * Linux 1.3.5 released.
85 85
86 * Native wide, multichannel and /proc/scsi support now in official 86 * Native wide, multichannel and /proc/scsi support now in official
87 kernel distribution. 87 kernel distribution.
88 88
89 * scsi.c/h, hosts.c/h et al reindented to increase readability 89 * scsi.c/h, hosts.c/h et al reindented to increase readability
90 (especially on 80 column wide terminals). 90 (especially on 80 column wide terminals).
91 91
92 * scsi.c, scsi_proc.c, ../../fs/proc/inode.c: Added 92 * scsi.c, scsi_proc.c, ../../fs/proc/inode.c: Added
93 /proc/scsi/scsi which allows root to scan for hotplugged devices. 93 /proc/scsi/scsi which allows root to scan for hotplugged devices.
94 94
95 * scsi.c (scsi_proc_info): Added, to support /proc/scsi/scsi. 95 * scsi.c (scsi_proc_info): Added, to support /proc/scsi/scsi.
96 (scan_scsis): Added some 'spaghetti' code to allow scanning for 96 (scan_scsis): Added some 'spaghetti' code to allow scanning for
97 single devices. 97 single devices.
98 98
99 99
100 Thu Jun 20 15:20:27 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 100 Thu Jun 20 15:20:27 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
101 101
102 * proc.c: Renamed to scsi_proc.c 102 * proc.c: Renamed to scsi_proc.c
103 103
104 Mon Jun 12 20:32:45 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 104 Mon Jun 12 20:32:45 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
105 105
106 * Linux 1.3.0 released. 106 * Linux 1.3.0 released.
107 107
108 Mon May 15 19:33:14 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 108 Mon May 15 19:33:14 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
109 109
110 * scsi.c: Added native multichannel and wide scsi support. 110 * scsi.c: Added native multichannel and wide scsi support.
111 111
112 * proc.c (dispatch_scsi_info) (build_proc_dir_hba_entries): 112 * proc.c (dispatch_scsi_info) (build_proc_dir_hba_entries):
113 Updated /proc/scsi interface. 113 Updated /proc/scsi interface.
114 114
115 Thu May 4 17:58:48 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de> 115 Thu May 4 17:58:48 1995 Michael Neuffer <neuffer@goofy.zdv.uni-mainz.de>
116 116
117 * sd.c (requeue_sd_request): Zero out the scatterlist only if 117 * sd.c (requeue_sd_request): Zero out the scatterlist only if
118 scsi_malloc returned memory for it. 118 scsi_malloc returned memory for it.
119 119
120 * eata_dma.c (register_HBA) (eata_queue): Add support for 120 * eata_dma.c (register_HBA) (eata_queue): Add support for
121 large scatter/gather tables and set use_clustering accordingly 121 large scatter/gather tables and set use_clustering accordingly
122 122
123 * hosts.c: Make use_clustering changeable in the Scsi_Host structure. 123 * hosts.c: Make use_clustering changeable in the Scsi_Host structure.
124 124
125 Wed Apr 12 15:25:52 1995 Eric Youngdale (eric@andante) 125 Wed Apr 12 15:25:52 1995 Eric Youngdale (eric@andante)
126 126
127 * Linux 1.2.5 released. 127 * Linux 1.2.5 released.
128 128
129 * buslogic.c: Update to version 1.15 (From Leonard N. Zubkoff). 129 * buslogic.c: Update to version 1.15 (From Leonard N. Zubkoff).
130 Fixed interrupt routine to avoid races when handling multiple 130 Fixed interrupt routine to avoid races when handling multiple
131 complete commands per interrupt. Seems to come up with faster 131 complete commands per interrupt. Seems to come up with faster
132 cards. 132 cards.
133 133
134 * eata_dma.c: Update to 2.3.5r. Modularize. Improved error handling 134 * eata_dma.c: Update to 2.3.5r. Modularize. Improved error handling
135 throughout and fixed bug interrupt routine which resulted in shifted 135 throughout and fixed bug interrupt routine which resulted in shifted
136 status bytes. Added blink LED state checks for ISA and EISA HBAs. 136 status bytes. Added blink LED state checks for ISA and EISA HBAs.
137 Memory management bug seems to have disappeared ==> increasing 137 Memory management bug seems to have disappeared ==> increasing
138 C_P_L_CURRENT_MAX to 16 for now. Decreasing C_P_L_DIV to 3 for 138 C_P_L_CURRENT_MAX to 16 for now. Decreasing C_P_L_DIV to 3 for
139 performance reasons. 139 performance reasons.
140 140
141 * scsi.c: If we get a FMK, EOM, or ILI when attempting to scan 141 * scsi.c: If we get a FMK, EOM, or ILI when attempting to scan
142 the bus, assume that it was just noise on the bus, and ignore 142 the bus, assume that it was just noise on the bus, and ignore
143 the device. 143 the device.
144 144
145 * scsi.h: Update and add a bunch of missing commands which we 145 * scsi.h: Update and add a bunch of missing commands which we
146 were never using. 146 were never using.
147 147
148 * sd.c: Use restore_flags in do_sd_request - this may result in 148 * sd.c: Use restore_flags in do_sd_request - this may result in
149 latency conditions, but it gets rid of races and crashes. 149 latency conditions, but it gets rid of races and crashes.
150 Do not save flags again when searching for a second command to 150 Do not save flags again when searching for a second command to
151 queue. 151 queue.
152 152
153 * st.c: Use bytes, not STP->buffer->buffer_size when reading 153 * st.c: Use bytes, not STP->buffer->buffer_size when reading
154 from tape. 154 from tape.
155 155
156 156
157 Tue Apr 4 09:42:08 1995 Eric Youngdale (eric@andante) 157 Tue Apr 4 09:42:08 1995 Eric Youngdale (eric@andante)
158 158
159 * Linux 1.2.4 released. 159 * Linux 1.2.4 released.
160 160
161 * st.c: Fix typo - restoring wrong flags. 161 * st.c: Fix typo - restoring wrong flags.
162 162
163 Wed Mar 29 06:55:12 1995 Eric Youngdale (eric@andante) 163 Wed Mar 29 06:55:12 1995 Eric Youngdale (eric@andante)
164 164
165 * Linux 1.2.3 released. 165 * Linux 1.2.3 released.
166 166
167 * st.c: Perform some waiting operations with interrupts off. 167 * st.c: Perform some waiting operations with interrupts off.
168 Is this correct??? 168 Is this correct???
169 169
170 Wed Mar 22 10:34:26 1995 Eric Youngdale (eric@andante) 170 Wed Mar 22 10:34:26 1995 Eric Youngdale (eric@andante)
171 171
172 * Linux 1.2.2 released. 172 * Linux 1.2.2 released.
173 173
174 * aha152x.c: Modularize. Add support for PCMCIA. 174 * aha152x.c: Modularize. Add support for PCMCIA.
175 175
176 * eata.c: Update to version 2.0. Fixed bug preventing media 176 * eata.c: Update to version 2.0. Fixed bug preventing media
177 detection. If scsi_register_host returns NULL, fail gracefully. 177 detection. If scsi_register_host returns NULL, fail gracefully.
178 178
179 * scsi.c: Detect as NEC (for photo-cd purposes) for the 84 179 * scsi.c: Detect as NEC (for photo-cd purposes) for the 84
180 and 25 models as "NEC_OLDCDR". 180 and 25 models as "NEC_OLDCDR".
181 181
182 * scsi.h: Add define for NEC_OLDCDR 182 * scsi.h: Add define for NEC_OLDCDR
183 183
184 * sr.c: Add handling for NEC_OLDCDR. Treat as unknown. 184 * sr.c: Add handling for NEC_OLDCDR. Treat as unknown.
185 185
186 * u14-34f.c: Update to version 2.0. Fixed same bug as in 186 * u14-34f.c: Update to version 2.0. Fixed same bug as in
187 eata.c. 187 eata.c.
188 188
189 189
190 Mon Mar 6 11:11:20 1995 Eric Youngdale (eric@andante) 190 Mon Mar 6 11:11:20 1995 Eric Youngdale (eric@andante)
191 191
192 * Linux 1.2.0 released. Yeah!!! 192 * Linux 1.2.0 released. Yeah!!!
193 193
194 * Minor spelling/punctuation changes throughout. Nothing 194 * Minor spelling/punctuation changes throughout. Nothing
195 substantive. 195 substantive.
196 196
197 Mon Feb 20 21:33:03 1995 Eric Youngdale (eric@andante) 197 Mon Feb 20 21:33:03 1995 Eric Youngdale (eric@andante)
198 198
199 * Linux 1.1.95 released. 199 * Linux 1.1.95 released.
200 200
201 * qlogic.c: Update to version 0.41. 201 * qlogic.c: Update to version 0.41.
202 202
203 * seagate.c: Change some message to be more descriptive about what 203 * seagate.c: Change some message to be more descriptive about what
204 we detected. 204 we detected.
205 205
206 * sr.c: spelling/whitespace changes. 206 * sr.c: spelling/whitespace changes.
207 207
208 Mon Feb 20 21:33:03 1995 Eric Youngdale (eric@andante) 208 Mon Feb 20 21:33:03 1995 Eric Youngdale (eric@andante)
209 209
210 * Linux 1.1.94 released. 210 * Linux 1.1.94 released.
211 211
212 Mon Feb 20 08:57:17 1995 Eric Youngdale (eric@andante) 212 Mon Feb 20 08:57:17 1995 Eric Youngdale (eric@andante)
213 213
214 * Linux 1.1.93 released. 214 * Linux 1.1.93 released.
215 215
216 * hosts.h: Change io_port to long int from short. 216 * hosts.h: Change io_port to long int from short.
217 217
218 * 53c7,8xx.c: crash on AEN fixed, SCSI reset is no longer a NOP, 218 * 53c7,8xx.c: crash on AEN fixed, SCSI reset is no longer a NOP,
219 NULL pointer panic on odd UDCs fixed, two bugs in diagnostic output 219 NULL pointer panic on odd UDCs fixed, two bugs in diagnostic output
220 fixed, should initialize correctly if left running, now loadable, 220 fixed, should initialize correctly if left running, now loadable,
221 new memory allocation, extraneous diagnostic output suppressed, 221 new memory allocation, extraneous diagnostic output suppressed,
222 splx() replaced with save/restore flags. [ Drew ] 222 splx() replaced with save/restore flags. [ Drew ]
223 223
224 * hosts.c, hosts.h, scsi_ioctl.c, sd.c, sd_ioctl.c, sg.c, sr.c, 224 * hosts.c, hosts.h, scsi_ioctl.c, sd.c, sd_ioctl.c, sg.c, sr.c,
225 sr_ioctl.c: Add special junk at end that Emacs will use for 225 sr_ioctl.c: Add special junk at end that Emacs will use for
226 formatting the file. 226 formatting the file.
227 227
228 * qlogic.c: Update to v0.40a. Improve parity handling. 228 * qlogic.c: Update to v0.40a. Improve parity handling.
229 229
230 * scsi.c: Add Hitachi DK312C to blacklist. Change "};" to "}" in 230 * scsi.c: Add Hitachi DK312C to blacklist. Change "};" to "}" in
231 many places. Use scsi_init_malloc to get command block - may 231 many places. Use scsi_init_malloc to get command block - may
232 need this to be dma compatible for some host adapters. 232 need this to be dma compatible for some host adapters.
233 Restore interrupts after unregistering a host. 233 Restore interrupts after unregistering a host.
234 234
235 * sd.c: Use sti instead of restore flags - causes latency problems. 235 * sd.c: Use sti instead of restore flags - causes latency problems.
236 236
237 * seagate.c: Use controller_type to determine string used when 237 * seagate.c: Use controller_type to determine string used when
238 registering irq. 238 registering irq.
239 239
240 * sr.c: More photo-cd hacks to make sure we get the xa stuff right. 240 * sr.c: More photo-cd hacks to make sure we get the xa stuff right.
241 * sr.h, sr.c: Change is_xa to xa_flags field. 241 * sr.h, sr.c: Change is_xa to xa_flags field.
242 242
243 * st.c: Disable retries for write operations. 243 * st.c: Disable retries for write operations.
244 244
245 Wed Feb 15 10:52:56 1995 Eric Youngdale (eric@andante) 245 Wed Feb 15 10:52:56 1995 Eric Youngdale (eric@andante)
246 246
247 * Linux 1.1.92 released. 247 * Linux 1.1.92 released.
248 248
249 * eata.c: Update to 1.17. 249 * eata.c: Update to 1.17.
250 250
251 * eata_dma.c: Update to 2.31a. Add more support for /proc/scsi. 251 * eata_dma.c: Update to 2.31a. Add more support for /proc/scsi.
252 Continuing modularization. Less crashes because of the bug in the 252 Continuing modularization. Less crashes because of the bug in the
253 memory management ==> increase C_P_L_CURRENT_MAX to 10 253 memory management ==> increase C_P_L_CURRENT_MAX to 10
254 and decrease C_P_L_DIV to 4. 254 and decrease C_P_L_DIV to 4.
255 255
256 * hosts.c: If we remove last host registered, reuse host number. 256 * hosts.c: If we remove last host registered, reuse host number.
257 When freeing memory from host being deregistered, free extra_bytes 257 When freeing memory from host being deregistered, free extra_bytes
258 too. 258 too.
259 259
260 * scsi.c (scan_scsis): memset(SDpnt, 0) and set SCmd.device to SDpnt. 260 * scsi.c (scan_scsis): memset(SDpnt, 0) and set SCmd.device to SDpnt.
261 Change memory allocation to work around bugs in __get_dma_pages. 261 Change memory allocation to work around bugs in __get_dma_pages.
262 Do not free host if usage count is not zero (for modules). 262 Do not free host if usage count is not zero (for modules).
263 263
264 * sr_ioctl.c: Increase IOCTL_TIMEOUT to 3000. 264 * sr_ioctl.c: Increase IOCTL_TIMEOUT to 3000.
265 265
266 * st.c: Allow for ST_EXTRA_DEVS in st data structures. 266 * st.c: Allow for ST_EXTRA_DEVS in st data structures.
267 267
268 * u14-34f.c: Update to 1.17. 268 * u14-34f.c: Update to 1.17.
269 269
270 Thu Feb 9 10:11:16 1995 Eric Youngdale (eric@andante) 270 Thu Feb 9 10:11:16 1995 Eric Youngdale (eric@andante)
271 271
272 * Linux 1.1.91 released. 272 * Linux 1.1.91 released.
273 273
274 * eata.c: Update to 1.16. Use wish_block instead of host->block. 274 * eata.c: Update to 1.16. Use wish_block instead of host->block.
275 275
276 * hosts.c: Initialize wish_block to 0. 276 * hosts.c: Initialize wish_block to 0.
277 277
278 * hosts.h: Add wish_block. 278 * hosts.h: Add wish_block.
279 279
280 * scsi.c: Use wish_block as indicator that the host should be added 280 * scsi.c: Use wish_block as indicator that the host should be added
281 to block list. 281 to block list.
282 282
283 * sg.c: Add SG_EXTRA_DEVS to number of slots. 283 * sg.c: Add SG_EXTRA_DEVS to number of slots.
284 284
285 * u14-34f.c: Use wish_block. 285 * u14-34f.c: Use wish_block.
286 286
287 Tue Feb 7 11:46:04 1995 Eric Youngdale (eric@andante) 287 Tue Feb 7 11:46:04 1995 Eric Youngdale (eric@andante)
288 288
289 * Linux 1.1.90 released. 289 * Linux 1.1.90 released.
290 290
291 * eata.c: Change naming from eata_* to eata2x_*. Now at vers 1.15. 291 * eata.c: Change naming from eata_* to eata2x_*. Now at vers 1.15.
292 Update interrupt handler to take pt_regs as arg. Allow blocking 292 Update interrupt handler to take pt_regs as arg. Allow blocking
293 even if loaded as module. Initialize target_time_out array. 293 even if loaded as module. Initialize target_time_out array.
294 Do not put sti(); in timing loop. 294 Do not put sti(); in timing loop.
295 295
296 * hosts.c: Do not reuse host numbers. 296 * hosts.c: Do not reuse host numbers.
297 Use scsi_make_blocked_list to generate blocking list. 297 Use scsi_make_blocked_list to generate blocking list.
298 298
299 * script_asm.pl: Beats me. Don't know perl. Something to do with 299 * script_asm.pl: Beats me. Don't know perl. Something to do with
300 phase index. 300 phase index.
301 301
302 * scsi.c (scsi_make_blocked_list): New function - code copied from 302 * scsi.c (scsi_make_blocked_list): New function - code copied from
303 hosts.c. 303 hosts.c.
304 304
305 * scsi.c: Update code to disable photo CD for Toshiba cdroms. 305 * scsi.c: Update code to disable photo CD for Toshiba cdroms.
306 Use just manufacturer name, not model number. 306 Use just manufacturer name, not model number.
307 307
308 * sr.c: Fix setting density for Toshiba drives. 308 * sr.c: Fix setting density for Toshiba drives.
309 309
310 * u14-34f.c: Clear target_time_out array during reset. 310 * u14-34f.c: Clear target_time_out array during reset.
311 311
312 Wed Feb 1 09:20:45 1995 Eric Youngdale (eric@andante) 312 Wed Feb 1 09:20:45 1995 Eric Youngdale (eric@andante)
313 313
314 * Linux 1.1.89 released. 314 * Linux 1.1.89 released.
315 315
316 * Makefile, u14-34f.c: Modularize. 316 * Makefile, u14-34f.c: Modularize.
317 317
318 * Makefile, eata.c: Modularize. Now version 1.14 318 * Makefile, eata.c: Modularize. Now version 1.14
319 319
320 * NCR5380.c: Update interrupt handler with new arglist. Minor 320 * NCR5380.c: Update interrupt handler with new arglist. Minor
321 cleanups. 321 cleanups.
322 322
323 * eata_dma.c: Begin to modularize. Add hooks for /proc/scsi. 323 * eata_dma.c: Begin to modularize. Add hooks for /proc/scsi.
324 New version 2.3.0a. Add code in interrupt handler to allow 324 New version 2.3.0a. Add code in interrupt handler to allow
325 certain CDROM drivers to be detected which return a 325 certain CDROM drivers to be detected which return a
326 CHECK_CONDITION during SCSI bus scan. Add opcode check to get 326 CHECK_CONDITION during SCSI bus scan. Add opcode check to get
327 all DATA IN and DATA OUT phases right. Utilize HBA_interpret flag. 327 all DATA IN and DATA OUT phases right. Utilize HBA_interpret flag.
328 Improvements in HBA identification. Various other minor stuff. 328 Improvements in HBA identification. Various other minor stuff.
329 329
330 * hosts.c: Initialize ->dma_channel and ->io_port when registering 330 * hosts.c: Initialize ->dma_channel and ->io_port when registering
331 a new host. 331 a new host.
332 332
333 * qlogic.c: Modularize and add PCMCIA support. 333 * qlogic.c: Modularize and add PCMCIA support.
334 334
335 * scsi.c: Add Hitachi to blacklist. 335 * scsi.c: Add Hitachi to blacklist.
336 336
337 * scsi.c: Change default to no lun scan (too many problem devices). 337 * scsi.c: Change default to no lun scan (too many problem devices).
338 338
339 * scsi.h: Define QUEUE_FULL condition. 339 * scsi.h: Define QUEUE_FULL condition.
340 340
341 * sd.c: Do not check for non-existent partition until after 341 * sd.c: Do not check for non-existent partition until after
342 new media check. 342 new media check.
343 343
344 * sg.c: Undo previous change which was wrong. 344 * sg.c: Undo previous change which was wrong.
345 345
346 * sr_ioctl.c: Increase IOCTL_TIMEOUT to 2000. 346 * sr_ioctl.c: Increase IOCTL_TIMEOUT to 2000.
347 347
348 * st.c: Patches from Kai - improve filemark handling. 348 * st.c: Patches from Kai - improve filemark handling.
349 349
350 Tue Jan 31 17:32:12 1995 Eric Youngdale (eric@andante) 350 Tue Jan 31 17:32:12 1995 Eric Youngdale (eric@andante)
351 351
352 * Linux 1.1.88 released. 352 * Linux 1.1.88 released.
353 353
354 * Throughout - spelling/grammar fixups. 354 * Throughout - spelling/grammar fixups.
355 355
356 * scsi.c: Make sure that all buffers are 16 byte aligned - some 356 * scsi.c: Make sure that all buffers are 16 byte aligned - some
357 drivers (buslogic) need this. 357 drivers (buslogic) need this.
358 358
359 * scsi.c (scan_scsis): Remove message printed. 359 * scsi.c (scan_scsis): Remove message printed.
360 360
361 * scsi.c (scsi_init): Move message here. 361 * scsi.c (scsi_init): Move message here.
362 362
363 Mon Jan 30 06:40:25 1995 Eric Youngdale (eric@andante) 363 Mon Jan 30 06:40:25 1995 Eric Youngdale (eric@andante)
364 364
365 * Linux 1.1.87 released. 365 * Linux 1.1.87 released.
366 366
367 * sr.c: Photo-cd related changes. (Gerd Knorr??). 367 * sr.c: Photo-cd related changes. (Gerd Knorr??).
368 368
369 * st.c: Changes from Kai related to EOM detection. 369 * st.c: Changes from Kai related to EOM detection.
370 370
371 Mon Jan 23 23:53:10 1995 Eric Youngdale (eric@andante) 371 Mon Jan 23 23:53:10 1995 Eric Youngdale (eric@andante)
372 372
373 * Linux 1.1.86 released. 373 * Linux 1.1.86 released.
374 374
375 * 53c7,8xx.h: Change SG size to 127. 375 * 53c7,8xx.h: Change SG size to 127.
376 376
377 * eata_dma: Update to version 2.10i. Remove bug in the registration 377 * eata_dma: Update to version 2.10i. Remove bug in the registration
378 of multiple HBAs and channels. Minor other improvements and stylistic 378 of multiple HBAs and channels. Minor other improvements and stylistic
379 changes. 379 changes.
380 380
381 * scsi.c: Test for Toshiba XM-3401TA and exclude from detection 381 * scsi.c: Test for Toshiba XM-3401TA and exclude from detection
382 as toshiba drive - photo cd does not work with this drive. 382 as toshiba drive - photo cd does not work with this drive.
383 383
384 * sr.c: Update photocd code. 384 * sr.c: Update photocd code.
385 385
386 Mon Jan 23 23:53:10 1995 Eric Youngdale (eric@andante) 386 Mon Jan 23 23:53:10 1995 Eric Youngdale (eric@andante)
387 387
388 * Linux 1.1.85 released. 388 * Linux 1.1.85 released.
389 389
390 * st.c, st_ioctl.c, sg.c, sd_ioctl.c, scsi_ioctl.c, hosts.c: 390 * st.c, st_ioctl.c, sg.c, sd_ioctl.c, scsi_ioctl.c, hosts.c:
391 include linux/mm.h 391 include linux/mm.h
392 392
393 * qlogic.c, buslogic.c, aha1542.c: Include linux/module.h. 393 * qlogic.c, buslogic.c, aha1542.c: Include linux/module.h.
394 394
395 Sun Jan 22 22:08:46 1995 Eric Youngdale (eric@andante) 395 Sun Jan 22 22:08:46 1995 Eric Youngdale (eric@andante)
396 396
397 * Linux 1.1.84 released. 397 * Linux 1.1.84 released.
398 398
399 * Makefile: Support for loadable QLOGIC boards. 399 * Makefile: Support for loadable QLOGIC boards.
400 400
401 * aha152x.c: Update to version 1.8 from Juergen. 401 * aha152x.c: Update to version 1.8 from Juergen.
402 402
403 * eata_dma.c: Update from Michael Neuffer. 403 * eata_dma.c: Update from Michael Neuffer.
404 Remove hard limit of 2 commands per lun and make it better 404 Remove hard limit of 2 commands per lun and make it better
405 configurable. Improvements in HBA identification. 405 configurable. Improvements in HBA identification.
406 406
407 * in2000.c: Fix biosparam to support large disks. 407 * in2000.c: Fix biosparam to support large disks.
408 408
409 * qlogic.c: Minor changes (change sti -> restore_flags). 409 * qlogic.c: Minor changes (change sti -> restore_flags).
410 410
411 Wed Jan 18 23:33:09 1995 Eric Youngdale (eric@andante) 411 Wed Jan 18 23:33:09 1995 Eric Youngdale (eric@andante)
412 412
413 * Linux 1.1.83 released. 413 * Linux 1.1.83 released.
414 414
415 * aha1542.c(aha1542_intr_handle): Use arguments handed down to find 415 * aha1542.c(aha1542_intr_handle): Use arguments handed down to find
416 which irq. 416 which irq.
417 417
418 * buslogic.c: Likewise. 418 * buslogic.c: Likewise.
419 419
420 * eata_dma.c: Use min of 2 cmd_per_lun for OCS_enabled boards. 420 * eata_dma.c: Use min of 2 cmd_per_lun for OCS_enabled boards.
421 421
422 * scsi.c: Make RECOVERED_ERROR a SUGGEST_IS_OK. 422 * scsi.c: Make RECOVERED_ERROR a SUGGEST_IS_OK.
423 423
424 * sd.c: Fail if we are opening a non-existent partition. 424 * sd.c: Fail if we are opening a non-existent partition.
425 425
426 * sr.c: Bump SR_TIMEOUT to 15000. 426 * sr.c: Bump SR_TIMEOUT to 15000.
427 Do not probe for media size at boot time (hard on changers). 427 Do not probe for media size at boot time (hard on changers).
428 Flag device as needing sector size instead. 428 Flag device as needing sector size instead.
429 429
430 * sr_ioctl.c: Remove CDROMMULTISESSION_SYS ioctl. 430 * sr_ioctl.c: Remove CDROMMULTISESSION_SYS ioctl.
431 431
432 * ultrastor.c: Fix bug in call to ultrastor_interrupt (wrong #args). 432 * ultrastor.c: Fix bug in call to ultrastor_interrupt (wrong #args).
433 433
434 Mon Jan 16 07:18:23 1995 Eric Youngdale (eric@andante) 434 Mon Jan 16 07:18:23 1995 Eric Youngdale (eric@andante)
435 435
436 * Linux 1.1.82 released. 436 * Linux 1.1.82 released.
437 437
438 Throughout. 438 Throughout.
439 - Change all interrupt handlers to accept new calling convention. 439 - Change all interrupt handlers to accept new calling convention.
440 In particular, we now receive the irq number as one of the arguments. 440 In particular, we now receive the irq number as one of the arguments.
441 441
442 * More minor spelling corrections in some of the new files. 442 * More minor spelling corrections in some of the new files.
443 443
444 * aha1542.c, buslogic.c: Clean up interrupt handler a little now 444 * aha1542.c, buslogic.c: Clean up interrupt handler a little now
445 that we receive the irq as an arg. 445 that we receive the irq as an arg.
446 446
447 * aha274x.c: s/snarf_region/request_region/ 447 * aha274x.c: s/snarf_region/request_region/
448 448
449 * eata.c: Update to version 1.12. Fix some comments and display a 449 * eata.c: Update to version 1.12. Fix some comments and display a
450 message if we cannot reserve the port addresses. 450 message if we cannot reserve the port addresses.
451 451
452 * u14-34f.c: Update to version 1.13. Fix some comments and display a 452 * u14-34f.c: Update to version 1.13. Fix some comments and display a
453 message if we cannot reserve the port addresses. 453 message if we cannot reserve the port addresses.
454 454
455 * eata_dma.c: Define get_board_data function (send INQUIRY command). 455 * eata_dma.c: Define get_board_data function (send INQUIRY command).
456 Use to improve detection of variants of different DPT boards. Change 456 Use to improve detection of variants of different DPT boards. Change
457 version subnumber to "0g". 457 version subnumber to "0g".
458 458
459 * fdomain.c: Update to version 5.26. Improve detection of some boards 459 * fdomain.c: Update to version 5.26. Improve detection of some boards
460 repackaged by IBM. 460 repackaged by IBM.
461 461
462 * scsi.c (scsi_register_host): Change "name" to const char *. 462 * scsi.c (scsi_register_host): Change "name" to const char *.
463 463
464 * sr.c: Fix problem in set mode command for Toshiba drives. 464 * sr.c: Fix problem in set mode command for Toshiba drives.
465 465
466 * sr.c: Fix typo from patch 81. 466 * sr.c: Fix typo from patch 81.
467 467
468 Fri Jan 13 12:54:46 1995 Eric Youngdale (eric@andante) 468 Fri Jan 13 12:54:46 1995 Eric Youngdale (eric@andante)
469 469
470 * Linux 1.1.81 released. Codefreeze for 1.2 release announced. 470 * Linux 1.1.81 released. Codefreeze for 1.2 release announced.
471 471
472 Big changes here. 472 Big changes here.
473 473
474 * eata_dma.*: New files from Michael Neuffer. 474 * eata_dma.*: New files from Michael Neuffer.
475 (neuffer@goofy.zdv.uni-mainz.de). Should support 475 (neuffer@goofy.zdv.uni-mainz.de). Should support
476 all eata/dpt cards. 476 all eata/dpt cards.
477 477
478 * hosts.c, Makefile: Add eata_dma. 478 * hosts.c, Makefile: Add eata_dma.
479 479
480 * README.st: Document MTEOM. 480 * README.st: Document MTEOM.
481 481
482 Patches from me (ERY) to finish support for low-level loadable scsi. 482 Patches from me (ERY) to finish support for low-level loadable scsi.
483 It now works, and is actually useful. 483 It now works, and is actually useful.
484 484
485 * Throughout - add new argument to scsi_init_malloc that takes an 485 * Throughout - add new argument to scsi_init_malloc that takes an
486 additional parameter. This is used as a priority to kmalloc, 486 additional parameter. This is used as a priority to kmalloc,
487 and you can specify the GFP_DMA flag if you need DMA-able memory. 487 and you can specify the GFP_DMA flag if you need DMA-able memory.
488 488
489 * Makefile: For source files that are loadable, always add name 489 * Makefile: For source files that are loadable, always add name
490 to SCSI_SRCS. Fill in modules: target. 490 to SCSI_SRCS. Fill in modules: target.
491 491
492 * hosts.c: Change next_host to next_scsi_host, and make global. 492 * hosts.c: Change next_host to next_scsi_host, and make global.
493 Print hosts after we have identified all of them. Use info() 493 Print hosts after we have identified all of them. Use info()
494 function if present, otherwise use name field. 494 function if present, otherwise use name field.
495 495
496 * hosts.h: Change attach function to return int, not void. 496 * hosts.h: Change attach function to return int, not void.
497 Define number of device slots to allow for loadable devices. 497 Define number of device slots to allow for loadable devices.
498 Define tags to tell scsi module code what type of module we 498 Define tags to tell scsi module code what type of module we
499 are loading. 499 are loading.
500 500
501 * scsi.c: Fix scan_scsis so that it can be run by a user process. 501 * scsi.c: Fix scan_scsis so that it can be run by a user process.
502 Do not use waiting loops - use up and down mechanism as long 502 Do not use waiting loops - use up and down mechanism as long
503 as current != task[0]. 503 as current != task[0].
504 504
505 * scsi.c(scan_scsis): Do not use stack variables for I/O - this 505 * scsi.c(scan_scsis): Do not use stack variables for I/O - this
506 could be > 16Mb if we are loading a module at runtime (i.e. use 506 could be > 16Mb if we are loading a module at runtime (i.e. use
507 scsi_init_malloc to get some memory we know will be safe). 507 scsi_init_malloc to get some memory we know will be safe).
508 508
509 * scsi.c: Change dma freelist to be a set of pages. This allows 509 * scsi.c: Change dma freelist to be a set of pages. This allows
510 us to dynamically adjust the size of the list by adding more pages 510 us to dynamically adjust the size of the list by adding more pages
511 to the pagelist. Fix scsi_malloc and scsi_free accordingly. 511 to the pagelist. Fix scsi_malloc and scsi_free accordingly.
512 512
513 * scsi_module.c: Fix include. 513 * scsi_module.c: Fix include.
514 514
515 * sd.c: Declare detach function. Increment/decrement module usage 515 * sd.c: Declare detach function. Increment/decrement module usage
516 count as required. Fix init functions to allow loaded devices. 516 count as required. Fix init functions to allow loaded devices.
517 Revalidate all new disks so we get the partition tables. Define 517 Revalidate all new disks so we get the partition tables. Define
518 detach function. 518 detach function.
519 519
520 * sr.c: Likewise. 520 * sr.c: Likewise.
521 521
522 * sg.c: Declare detach function. Allow attachment of devices on 522 * sg.c: Declare detach function. Allow attachment of devices on
523 loaded drivers. 523 loaded drivers.
524 524
525 * st.c: Declare detach function. Increment/decrement module usage 525 * st.c: Declare detach function. Increment/decrement module usage
526 count as required. 526 count as required.
527 527
528 Tue Jan 10 10:09:58 1995 Eric Youngdale (eric@andante) 528 Tue Jan 10 10:09:58 1995 Eric Youngdale (eric@andante)
529 529
530 * Linux 1.1.79 released. 530 * Linux 1.1.79 released.
531 531
532 Patch from some undetermined individual who needs to get a life :-). 532 Patch from some undetermined individual who needs to get a life :-).
533 533
534 * sr.c: Attacked by spelling bee... 534 * sr.c: Attacked by spelling bee...
535 535
536 Patches from Gerd Knorr: 536 Patches from Gerd Knorr:
537 537
538 * sr.c: make printk messages for photoCD a little more informative. 538 * sr.c: make printk messages for photoCD a little more informative.
539 539
540 * sr_ioctl.c: Fix CDROMMULTISESSION_SYS ioctl. 540 * sr_ioctl.c: Fix CDROMMULTISESSION_SYS ioctl.
541 541
542 Mon Jan 9 10:01:37 1995 Eric Youngdale (eric@andante) 542 Mon Jan 9 10:01:37 1995 Eric Youngdale (eric@andante)
543 543
544 * Linux 1.1.78 released. 544 * Linux 1.1.78 released.
545 545
546 * Makefile: Add empty modules: target. 546 * Makefile: Add empty modules: target.
547 547
548 * Wheee. Now change register_iomem to request_region. 548 * Wheee. Now change register_iomem to request_region.
549 549
550 * in2000.c: Bugfix - apparently this is the fix that we have 550 * in2000.c: Bugfix - apparently this is the fix that we have
551 all been waiting for. It fixes a problem whereby the driver 551 all been waiting for. It fixes a problem whereby the driver
552 is not stable under heavy load. Race condition and all that. 552 is not stable under heavy load. Race condition and all that.
553 Patch from Peter Lu. 553 Patch from Peter Lu.
554 554
555 Wed Jan 4 21:17:40 1995 Eric Youngdale (eric@andante) 555 Wed Jan 4 21:17:40 1995 Eric Youngdale (eric@andante)
556 556
557 * Linux 1.1.77 released. 557 * Linux 1.1.77 released.
558 558
559 * 53c7,8xx.c: Fix from Linus - emulate splx. 559 * 53c7,8xx.c: Fix from Linus - emulate splx.
560 560
561 Throughout: 561 Throughout:
562 562
563 Replace "snarf_region" with "register_iomem". 563 Replace "snarf_region" with "register_iomem".
564 564
565 * scsi_module.c: New file. Contains support for low-level loadable 565 * scsi_module.c: New file. Contains support for low-level loadable
566 scsi drivers. [ERY]. 566 scsi drivers. [ERY].
567 567
568 * sd.c: More s/int/long/ changes. 568 * sd.c: More s/int/long/ changes.
569 569
570 * seagate.c: Explicitly include linux/config.h 570 * seagate.c: Explicitly include linux/config.h
571 571
572 * sg.c: Increment/decrement module usage count on open/close. 572 * sg.c: Increment/decrement module usage count on open/close.
573 573
574 * sg.c: Be a bit more careful about the user not supplying enough 574 * sg.c: Be a bit more careful about the user not supplying enough
575 information for a valid command. Pass correct size down to 575 information for a valid command. Pass correct size down to
576 scsi_do_cmd. 576 scsi_do_cmd.
577 577
578 * sr.c: More changes for Photo-CD. This apparently breaks NEC drives. 578 * sr.c: More changes for Photo-CD. This apparently breaks NEC drives.
579 579
580 * sr_ioctl.c: Support CDROMMULTISESSION ioctl. 580 * sr_ioctl.c: Support CDROMMULTISESSION ioctl.
581 581
582 582
583 Sun Jan 1 19:55:21 1995 Eric Youngdale (eric@andante) 583 Sun Jan 1 19:55:21 1995 Eric Youngdale (eric@andante)
584 584
585 * Linux 1.1.76 released. 585 * Linux 1.1.76 released.
586 586
587 * constants.c: Add type cast in switch statement. 587 * constants.c: Add type cast in switch statement.
588 588
589 * scsi.c (scsi_free): Change datatype of "offset" to long. 589 * scsi.c (scsi_free): Change datatype of "offset" to long.
590 (scsi_malloc): Change a few more variables to long. Who 590 (scsi_malloc): Change a few more variables to long. Who
591 did this and why was it important? 64 bit machines? 591 did this and why was it important? 64 bit machines?
592 592
593 593
594 Lots of changes to use save_flags/restore_flags instead of cli/sti. 594 Lots of changes to use save_flags/restore_flags instead of cli/sti.
595 Files changed include: 595 Files changed include:
596 596
597 * aha1542.c: 597 * aha1542.c:
598 * aha1740.c: 598 * aha1740.c:
599 * buslogic.c: 599 * buslogic.c:
600 * in2000.c: 600 * in2000.c:
601 * scsi.c: 601 * scsi.c:
602 * scsi_debug.c: 602 * scsi_debug.c:
603 * sd.c: 603 * sd.c:
604 * sr.c: 604 * sr.c:
605 * st.c: 605 * st.c:
606 606
607 Wed Dec 28 16:38:29 1994 Eric Youngdale (eric@andante) 607 Wed Dec 28 16:38:29 1994 Eric Youngdale (eric@andante)
608 608
609 * Linux 1.1.75 released. 609 * Linux 1.1.75 released.
610 610
611 * buslogic.c: Spelling fix. 611 * buslogic.c: Spelling fix.
612 612
613 * scsi.c: Add HP C1790A and C2500A scanjet to blacklist. 613 * scsi.c: Add HP C1790A and C2500A scanjet to blacklist.
614 614
615 * scsi.c: Spelling fixup. 615 * scsi.c: Spelling fixup.
616 616
617 * sd.c: Add support for sd_hardsizes (hard sector sizes). 617 * sd.c: Add support for sd_hardsizes (hard sector sizes).
618 618
619 * ultrastor.c: Use save_flags/restore_flags instead of cli/sti. 619 * ultrastor.c: Use save_flags/restore_flags instead of cli/sti.
620 620
621 Fri Dec 23 13:36:25 1994 Eric Youngdale (eric@andante) 621 Fri Dec 23 13:36:25 1994 Eric Youngdale (eric@andante)
622 622
623 * Linux 1.1.74 released. 623 * Linux 1.1.74 released.
624 624
625 * README.st: Update from Kai Makisara. 625 * README.st: Update from Kai Makisara.
626 626
627 * eata.c: New version from Dario - version 1.11. 627 * eata.c: New version from Dario - version 1.11.
628 use scsicam bios_param routine. Add support for 2011 628 use scsicam bios_param routine. Add support for 2011
629 and 2021 boards. 629 and 2021 boards.
630 630
631 * hosts.c: Add support for blocking. Linked list automatically 631 * hosts.c: Add support for blocking. Linked list automatically
632 generated when shpnt->block is set. 632 generated when shpnt->block is set.
633 633
634 * scsi.c: Add sankyo & HP scanjet to blacklist. Add support for 634 * scsi.c: Add sankyo & HP scanjet to blacklist. Add support for
635 kicking things loose when we deadlock. 635 kicking things loose when we deadlock.
636 636
637 * scsi.c: Recognize scanners and processors in scan_scsis. 637 * scsi.c: Recognize scanners and processors in scan_scsis.
638 638
639 * scsi_ioctl.h: Increase timeout to 9 seconds. 639 * scsi_ioctl.h: Increase timeout to 9 seconds.
640 640
641 * st.c: New version from Kai - add better support for backspace. 641 * st.c: New version from Kai - add better support for backspace.
642 642
643 * u14-34f.c: New version from Dario. Supports blocking. 643 * u14-34f.c: New version from Dario. Supports blocking.
644 644
645 Wed Dec 14 14:46:30 1994 Eric Youngdale (eric@andante) 645 Wed Dec 14 14:46:30 1994 Eric Youngdale (eric@andante)
646 646
647 * Linux 1.1.73 released. 647 * Linux 1.1.73 released.
648 648
649 * buslogic.c: Update from Dave Gentzel. Version 1.14. 649 * buslogic.c: Update from Dave Gentzel. Version 1.14.
650 Add module related stuff. More fault tolerant if out of 650 Add module related stuff. More fault tolerant if out of
651 DMA memory. 651 DMA memory.
652 652
653 * fdomain.c: New version from Rik Faith - version 5.22. Add support 653 * fdomain.c: New version from Rik Faith - version 5.22. Add support
654 for ISA-200S SCSI adapter. 654 for ISA-200S SCSI adapter.
655 655
656 * hosts.c: Spelling. 656 * hosts.c: Spelling.
657 657
658 * qlogic.c: Update to version 0.38a. Add more support for PCMCIA. 658 * qlogic.c: Update to version 0.38a. Add more support for PCMCIA.
659 659
660 * scsi.c: Mask device type with 0x1f during scan_scsis. 660 * scsi.c: Mask device type with 0x1f during scan_scsis.
661 Add support for deadlocking, err, make that getting out of 661 Add support for deadlocking, err, make that getting out of
662 deadlock situations that are created when we allow the user 662 deadlock situations that are created when we allow the user
663 to limit requests to one host adapter at a time. 663 to limit requests to one host adapter at a time.
664 664
665 * scsi.c: Bugfix - pass pid, not SCpnt as second arg to 665 * scsi.c: Bugfix - pass pid, not SCpnt as second arg to
666 scsi_times_out. 666 scsi_times_out.
667 667
668 * scsi.c: Restore interrupt state to previous value instead of using 668 * scsi.c: Restore interrupt state to previous value instead of using
669 cli/sti pairs. 669 cli/sti pairs.
670 670
671 * scsi.c: Add a bunch of module stuff (all commented out for now). 671 * scsi.c: Add a bunch of module stuff (all commented out for now).
672 672
673 * scsi.c: Clean up scsi_dump_status. 673 * scsi.c: Clean up scsi_dump_status.
674 674
675 Tue Dec 6 12:34:20 1994 Eric Youngdale (eric@andante) 675 Tue Dec 6 12:34:20 1994 Eric Youngdale (eric@andante)
676 676
677 * Linux 1.1.72 released. 677 * Linux 1.1.72 released.
678 678
679 * sg.c: Bugfix - always use sg_free, since we might have big buff. 679 * sg.c: Bugfix - always use sg_free, since we might have big buff.
680 680
681 Fri Dec 2 11:24:53 1994 Eric Youngdale (eric@andante) 681 Fri Dec 2 11:24:53 1994 Eric Youngdale (eric@andante)
682 682
683 * Linux 1.1.71 released. 683 * Linux 1.1.71 released.
684 684
685 * sg.c: Clear buff field when not in use. Only call scsi_free if 685 * sg.c: Clear buff field when not in use. Only call scsi_free if
686 non-null. 686 non-null.
687 687
688 * scsi.h: Call wake_up(&wait_for_request) when done with a 688 * scsi.h: Call wake_up(&wait_for_request) when done with a
689 command. 689 command.
690 690
691 * scsi.c (scsi_times_out): Pass pid down so that we can protect 691 * scsi.c (scsi_times_out): Pass pid down so that we can protect
692 against race conditions. 692 against race conditions.
693 693
694 * scsi.c (scsi_abort): Zero timeout field if we get the 694 * scsi.c (scsi_abort): Zero timeout field if we get the
695 NOT_RUNNING message back from low-level driver. 695 NOT_RUNNING message back from low-level driver.
696 696
697 697
698 * scsi.c (scsi_done): Restore cmd_len, use_sg here. 698 * scsi.c (scsi_done): Restore cmd_len, use_sg here.
699 699
700 * scsi.c (request_sense): Not here. 700 * scsi.c (request_sense): Not here.
701 701
702 * hosts.h: Add new forbidden_addr, forbidden_size fields. Who 702 * hosts.h: Add new forbidden_addr, forbidden_size fields. Who
703 added these and why???? 703 added these and why????
704 704
705 * hosts.c (scsi_mem_init): Mark pages as reserved if they fall in 705 * hosts.c (scsi_mem_init): Mark pages as reserved if they fall in
706 the forbidden regions. I am not sure - I think this is so that 706 the forbidden regions. I am not sure - I think this is so that
707 we can deal with boards that do incomplete decoding of their 707 we can deal with boards that do incomplete decoding of their
708 address lines for the bios chips, but I am not entirely sure. 708 address lines for the bios chips, but I am not entirely sure.
709 709
710 * buslogic.c: Set forbidden_addr stuff if using a buggy board. 710 * buslogic.c: Set forbidden_addr stuff if using a buggy board.
711 711
712 * aha1740.c: Test for NULL pointer in SCtmp. This should not 712 * aha1740.c: Test for NULL pointer in SCtmp. This should not
713 occur, but a nice message is better than a kernel segfault. 713 occur, but a nice message is better than a kernel segfault.
714 714
715 * 53c7,8xx.c: Add new PCI chip ID for 815. 715 * 53c7,8xx.c: Add new PCI chip ID for 815.
716 716
717 Fri Dec 2 11:24:53 1994 Eric Youngdale (eric@andante) 717 Fri Dec 2 11:24:53 1994 Eric Youngdale (eric@andante)
718 718
719 * Linux 1.1.70 released. 719 * Linux 1.1.70 released.
720 720
721 * ChangeLog, st.c: Spelling. 721 * ChangeLog, st.c: Spelling.
722 722
723 Tue Nov 29 18:48:42 1994 Eric Youngdale (eric@andante) 723 Tue Nov 29 18:48:42 1994 Eric Youngdale (eric@andante)
724 724
725 * Linux 1.1.69 released. 725 * Linux 1.1.69 released.
726 726
727 * u14-34f.h: Non-functional change. [Dario]. 727 * u14-34f.h: Non-functional change. [Dario].
728 728
729 * u14-34f.c: Use block field in Scsi_Host to prevent commands from 729 * u14-34f.c: Use block field in Scsi_Host to prevent commands from
730 being queued to more than one host at the same time (used when 730 being queued to more than one host at the same time (used when
731 motherboard does not deal with multiple bus-masters very well). 731 motherboard does not deal with multiple bus-masters very well).
732 Only when SINGLE_HOST_OPERATIONS is defined. 732 Only when SINGLE_HOST_OPERATIONS is defined.
733 Use new cmd_per_lun field. [Dario] 733 Use new cmd_per_lun field. [Dario]
734 734
735 * eata.c: Likewise. 735 * eata.c: Likewise.
736 736
737 * st.c: More changes from Kai. Add ready flag to indicate drive 737 * st.c: More changes from Kai. Add ready flag to indicate drive
738 status. 738 status.
739 739
740 * README.st: Document this. 740 * README.st: Document this.
741 741
742 * sr.c: Bugfix (do not subtract CD_BLOCK_OFFSET) for photo-cd 742 * sr.c: Bugfix (do not subtract CD_BLOCK_OFFSET) for photo-cd
743 code. 743 code.
744 744
745 * sg.c: Bugfix - fix problem where opcode is not correctly set up. 745 * sg.c: Bugfix - fix problem where opcode is not correctly set up.
746 746
747 * seagate.[c,h]: Use #defines to set driver name. 747 * seagate.[c,h]: Use #defines to set driver name.
748 748
749 * scsi_ioctl.c: Zero buffer before executing command. 749 * scsi_ioctl.c: Zero buffer before executing command.
750 750
751 * scsi.c: Use new cmd_per_lun field in Scsi_Hosts as appropriate. 751 * scsi.c: Use new cmd_per_lun field in Scsi_Hosts as appropriate.
752 Add Sony CDU55S to blacklist. 752 Add Sony CDU55S to blacklist.
753 753
754 * hosts.h: Add new cmd_per_lun field to Scsi_Hosts. 754 * hosts.h: Add new cmd_per_lun field to Scsi_Hosts.
755 755
756 * hosts.c: Initialize cmd_per_lun in Scsi_Hosts from template. 756 * hosts.c: Initialize cmd_per_lun in Scsi_Hosts from template.
757 757
758 * buslogic.c: Use cmd_per_lun field - initialize to different 758 * buslogic.c: Use cmd_per_lun field - initialize to different
759 values depending upon bus type (i.e. use 1 if ISA, so we do not 759 values depending upon bus type (i.e. use 1 if ISA, so we do not
760 hog memory). Use other patches which got lost from 1.1.68. 760 hog memory). Use other patches which got lost from 1.1.68.
761 761
762 * aha1542.c: Spelling. 762 * aha1542.c: Spelling.
763 763
764 Tue Nov 29 15:43:50 1994 Eric Youngdale (eric@andante.aib.com) 764 Tue Nov 29 15:43:50 1994 Eric Youngdale (eric@andante.aib.com)
765 765
766 * Linux 1.1.68 released. 766 * Linux 1.1.68 released.
767 767
768 Add support for 12 byte vendor specific commands in scsi-generics, 768 Add support for 12 byte vendor specific commands in scsi-generics,
769 more (i.e. the last mandatory) low-level changes to support 769 more (i.e. the last mandatory) low-level changes to support
770 loadable modules, plus a few other changes people have requested 770 loadable modules, plus a few other changes people have requested
771 lately. Changes by me (ERY) unless otherwise noted. Spelling 771 lately. Changes by me (ERY) unless otherwise noted. Spelling
772 changes appear from some unknown corner of the universe. 772 changes appear from some unknown corner of the universe.
773 773
774 * Throughout: Change COMMAND_SIZE() to use SCpnt->cmd_len. 774 * Throughout: Change COMMAND_SIZE() to use SCpnt->cmd_len.
775 775
776 * Throughout: Change info() low level function to take a Scsi_Host 776 * Throughout: Change info() low level function to take a Scsi_Host
777 pointer. This way the info function can return specific 777 pointer. This way the info function can return specific
778 information about the host in question, if desired. 778 information about the host in question, if desired.
779 779
780 * All low-level drivers: Add NULL in initializer for the 780 * All low-level drivers: Add NULL in initializer for the
781 usage_count field added to Scsi_Host_Template. 781 usage_count field added to Scsi_Host_Template.
782 782
783 * aha152x.[c,h]: Remove redundant info() function. 783 * aha152x.[c,h]: Remove redundant info() function.
784 784
785 * aha1542.[c,h]: Likewise. 785 * aha1542.[c,h]: Likewise.
786 786
787 * aha1740.[c,h]: Likewise. 787 * aha1740.[c,h]: Likewise.
788 788
789 * aha274x.[c,h]: Likewise. 789 * aha274x.[c,h]: Likewise.
790 790
791 * eata.[c,h]: Likewise. 791 * eata.[c,h]: Likewise.
792 792
793 * pas16.[c,h]: Likewise. 793 * pas16.[c,h]: Likewise.
794 794
795 * scsi_debug.[c,h]: Likewise. 795 * scsi_debug.[c,h]: Likewise.
796 796
797 * t128.[c,h]: Likewise. 797 * t128.[c,h]: Likewise.
798 798
799 * u14-34f.[c,h]: Likewise. 799 * u14-34f.[c,h]: Likewise.
800 800
801 * ultrastor.[c,h]: Likewise. 801 * ultrastor.[c,h]: Likewise.
802 802
803 * wd7000.[c,h]: Likewise. 803 * wd7000.[c,h]: Likewise.
804 804
805 * aha1542.c: Add support for command line options with lilo to set 805 * aha1542.c: Add support for command line options with lilo to set
806 DMA parameters, I/O port. From Matt Aarnio. 806 DMA parameters, I/O port. From Matt Aarnio.
807 807
808 * buslogic.[c,h]: New version (1.13) from Dave Gentzel. 808 * buslogic.[c,h]: New version (1.13) from Dave Gentzel.
809 809
810 * hosts.h: Add new field to Scsi_Hosts "block" to allow blocking 810 * hosts.h: Add new field to Scsi_Hosts "block" to allow blocking
811 all I/O to certain other cards. Helps prevent problems with some 811 all I/O to certain other cards. Helps prevent problems with some
812 ISA motherboards. 812 ISA motherboards.
813 813
814 * hosts.h: Add usage_count to Scsi_Host_Template. 814 * hosts.h: Add usage_count to Scsi_Host_Template.
815 815
816 * hosts.h: Add n_io_port to Scsi_Host (used when releasing module). 816 * hosts.h: Add n_io_port to Scsi_Host (used when releasing module).
817 817
818 * hosts.c: Initialize block field. 818 * hosts.c: Initialize block field.
819 819
820 * in2000.c: Remove "static" declarations from exported functions. 820 * in2000.c: Remove "static" declarations from exported functions.
821 821
822 * in2000.h: Likewise. 822 * in2000.h: Likewise.
823 823
824 * scsi.c: Correctly set cmd_len field as required. Save and 824 * scsi.c: Correctly set cmd_len field as required. Save and
825 change setting when doing a request_sense, restore when done. 825 change setting when doing a request_sense, restore when done.
826 Move abort timeout message. Fix panic in request_queueable to 826 Move abort timeout message. Fix panic in request_queueable to
827 print correct function name. 827 print correct function name.
828 828
829 * scsi.c: When incrementing usage count, walk block linked list 829 * scsi.c: When incrementing usage count, walk block linked list
830 for host, and OR in the SCSI_HOST_BLOCK bit. When decrementing usage 830 for host, and OR in the SCSI_HOST_BLOCK bit. When decrementing usage
831 count to 0, clear this bit to allow usage to continue, wake up 831 count to 0, clear this bit to allow usage to continue, wake up
832 processes waiting. 832 processes waiting.
833 833
834 834
835 * scsi_ioctl.c: If we have an info() function, call it, otherwise 835 * scsi_ioctl.c: If we have an info() function, call it, otherwise
836 if we have a "name" field, use it, else do nothing. 836 if we have a "name" field, use it, else do nothing.
837 837
838 * sd.c, sr.c: Clear cmd_len field prior to each command we 838 * sd.c, sr.c: Clear cmd_len field prior to each command we
839 generate. 839 generate.
840 840
841 * sd.h: Add "has_part_table" bit to rscsi_disks. 841 * sd.h: Add "has_part_table" bit to rscsi_disks.
842 842
843 * sg.[c,h]: Add support for vendor specific 12 byte commands (i.e. 843 * sg.[c,h]: Add support for vendor specific 12 byte commands (i.e.
844 override command length in COMMAND_SIZE). 844 override command length in COMMAND_SIZE).
845 845
846 * sr.c: Bugfix from Gerd in photocd code. 846 * sr.c: Bugfix from Gerd in photocd code.
847 847
848 * sr.c: Bugfix in get_sectorsize - always use scsi_malloc buffer - 848 * sr.c: Bugfix in get_sectorsize - always use scsi_malloc buffer -
849 we cannot guarantee that the stack is < 16Mb. 849 we cannot guarantee that the stack is < 16Mb.
850 850
851 Tue Nov 22 15:40:46 1994 Eric Youngdale (eric@andante.aib.com) 851 Tue Nov 22 15:40:46 1994 Eric Youngdale (eric@andante.aib.com)
852 852
853 * Linux 1.1.67 released. 853 * Linux 1.1.67 released.
854 854
855 * sr.c: Change spelling of manufactor to manufacturer. 855 * sr.c: Change spelling of manufactor to manufacturer.
856 856
857 * scsi.h: Likewise. 857 * scsi.h: Likewise.
858 858
859 * scsi.c: Likewise. 859 * scsi.c: Likewise.
860 860
861 * qlogic.c: Spelling corrections. 861 * qlogic.c: Spelling corrections.
862 862
863 * in2000.h: Spelling corrections. 863 * in2000.h: Spelling corrections.
864 864
865 * in2000.c: Update from Bill Earnest, change from 865 * in2000.c: Update from Bill Earnest, change from
866 jshiffle@netcom.com. Support new bios versions. 866 jshiffle@netcom.com. Support new bios versions.
867 867
868 * README.qlogic: Spelling correction. 868 * README.qlogic: Spelling correction.
869 869
870 Tue Nov 22 15:40:46 1994 Eric Youngdale (eric@andante.aib.com) 870 Tue Nov 22 15:40:46 1994 Eric Youngdale (eric@andante.aib.com)
871 871
872 * Linux 1.1.66 released. 872 * Linux 1.1.66 released.
873 873
874 * u14-34f.c: Spelling corrections. 874 * u14-34f.c: Spelling corrections.
875 875
876 * sr.[h,c]: Add support for multi-session CDs from Gerd Knorr. 876 * sr.[h,c]: Add support for multi-session CDs from Gerd Knorr.
877 877
878 * scsi.h: Add manufactor field for keeping track of device 878 * scsi.h: Add manufactor field for keeping track of device
879 manufacturer. 879 manufacturer.
880 880
881 * scsi.c: More spelling corrections. 881 * scsi.c: More spelling corrections.
882 882
883 * qlogic.h, qlogic.c, README.qlogic: New driver from Tom Zerucha. 883 * qlogic.h, qlogic.c, README.qlogic: New driver from Tom Zerucha.
884 884
885 * in2000.c, in2000.h: New driver from Brad McLean/Bill Earnest. 885 * in2000.c, in2000.h: New driver from Brad McLean/Bill Earnest.
886 886
887 * fdomain.c: Spelling correction. 887 * fdomain.c: Spelling correction.
888 888
889 * eata.c: Spelling correction. 889 * eata.c: Spelling correction.
890 890
891 Fri Nov 18 15:22:44 1994 Eric Youngdale (eric@andante.aib.com) 891 Fri Nov 18 15:22:44 1994 Eric Youngdale (eric@andante.aib.com)
892 892
893 * Linux 1.1.65 released. 893 * Linux 1.1.65 released.
894 894
895 * eata.h: Update version string to 1.08.00. 895 * eata.h: Update version string to 1.08.00.
896 896
897 * eata.c: Set sg_tablesize correctly for DPT PM2012 boards. 897 * eata.c: Set sg_tablesize correctly for DPT PM2012 boards.
898 898
899 * aha274x.seq: Spell checking. 899 * aha274x.seq: Spell checking.
900 900
901 * README.st: Likewise. 901 * README.st: Likewise.
902 902
903 * README.aha274x: Likewise. 903 * README.aha274x: Likewise.
904 904
905 * ChangeLog: Likewise. 905 * ChangeLog: Likewise.
906 906
907 Tue Nov 15 15:35:08 1994 Eric Youngdale (eric@andante.aib.com) 907 Tue Nov 15 15:35:08 1994 Eric Youngdale (eric@andante.aib.com)
908 908
909 * Linux 1.1.64 released. 909 * Linux 1.1.64 released.
910 910
911 * u14-34f.h: Update version number to 1.10.01. 911 * u14-34f.h: Update version number to 1.10.01.
912 912
913 * u14-34f.c: Use Scsi_Host can_queue variable instead of one from template. 913 * u14-34f.c: Use Scsi_Host can_queue variable instead of one from template.
914 914
915 * eata.[c,h]: New driver for DPT boards from Dario Ballabio. 915 * eata.[c,h]: New driver for DPT boards from Dario Ballabio.
916 916
917 * buslogic.c: Use can_queue field. 917 * buslogic.c: Use can_queue field.
918 918
919 Wed Nov 30 12:09:09 1994 Eric Youngdale (eric@andante.aib.com) 919 Wed Nov 30 12:09:09 1994 Eric Youngdale (eric@andante.aib.com)
920 920
921 * Linux 1.1.63 released. 921 * Linux 1.1.63 released.
922 922
923 * sd.c: Give I/O error if we attempt 512 byte I/O to a disk with 923 * sd.c: Give I/O error if we attempt 512 byte I/O to a disk with
924 1024 byte sectors. 924 1024 byte sectors.
925 925
926 * scsicam.c: Make sure we do read from whole disk (mask off 926 * scsicam.c: Make sure we do read from whole disk (mask off
927 partition). 927 partition).
928 928
929 * scsi.c: Use can_queue in Scsi_Host structure. 929 * scsi.c: Use can_queue in Scsi_Host structure.
930 Fix panic message about invalid host. 930 Fix panic message about invalid host.
931 931
932 * hosts.c: Initialize can_queue from template. 932 * hosts.c: Initialize can_queue from template.
933 933
934 * hosts.h: Add can_queue to Scsi_Host structure. 934 * hosts.h: Add can_queue to Scsi_Host structure.
935 935
936 * aha1740.c: Print out warning about NULL ecbptr. 936 * aha1740.c: Print out warning about NULL ecbptr.
937 937
938 Fri Nov 4 12:40:30 1994 Eric Youngdale (eric@andante.aib.com) 938 Fri Nov 4 12:40:30 1994 Eric Youngdale (eric@andante.aib.com)
939 939
940 * Linux 1.1.62 released. 940 * Linux 1.1.62 released.
941 941
942 * fdomain.c: Update to version 5.20. (From Rik Faith). Support 942 * fdomain.c: Update to version 5.20. (From Rik Faith). Support
943 BIOS version 3.5. 943 BIOS version 3.5.
944 944
945 * st.h: Add ST_EOD symbol. 945 * st.h: Add ST_EOD symbol.
946 946
947 * st.c: Patches from Kai Makisara - support additional densities, 947 * st.c: Patches from Kai Makisara - support additional densities,
948 add support for MTFSS, MTBSS, MTWSM commands. 948 add support for MTFSS, MTBSS, MTWSM commands.
949 949
950 * README.st: Update to document new commands. 950 * README.st: Update to document new commands.
951 951
952 * scsi.c: Add Mediavision CDR-H93MV to blacklist. 952 * scsi.c: Add Mediavision CDR-H93MV to blacklist.
953 953
954 Sat Oct 29 20:57:36 1994 Eric Youngdale (eric@andante.aib.com) 954 Sat Oct 29 20:57:36 1994 Eric Youngdale (eric@andante.aib.com)
955 955
956 * Linux 1.1.60 released. 956 * Linux 1.1.60 released.
957 957
958 * u14-34f.[c,h]: New driver from Dario Ballabio. 958 * u14-34f.[c,h]: New driver from Dario Ballabio.
959 959
960 * aic7770.c, aha274x_seq.h, aha274x.seq, aha274x.h, aha274x.c, 960 * aic7770.c, aha274x_seq.h, aha274x.seq, aha274x.h, aha274x.c,
961 README.aha274x: New files, new driver from John Aycock. 961 README.aha274x: New files, new driver from John Aycock.
962 962
963 963
964 Tue Oct 11 08:47:39 1994 Eric Youngdale (eric@andante) 964 Tue Oct 11 08:47:39 1994 Eric Youngdale (eric@andante)
965 965
966 * Linux 1.1.54 released. 966 * Linux 1.1.54 released.
967 967
968 * Add third PCI chip id. [Drew] 968 * Add third PCI chip id. [Drew]
969 969
970 * buslogic.c: Set BUSLOGIC_CMDLUN back to 1 [Eric]. 970 * buslogic.c: Set BUSLOGIC_CMDLUN back to 1 [Eric].
971 971
972 * ultrastor.c: Fix asm directives for new GCC. 972 * ultrastor.c: Fix asm directives for new GCC.
973 973
974 * sr.c, sd.c: Use new end_scsi_request function. 974 * sr.c, sd.c: Use new end_scsi_request function.
975 975
976 * scsi.h(end_scsi_request): Return pointer to block if still 976 * scsi.h(end_scsi_request): Return pointer to block if still
977 active, else return NULL if inactive. Fixes race condition. 977 active, else return NULL if inactive. Fixes race condition.
978 978
979 Sun Oct 9 20:23:14 1994 Eric Youngdale (eric@andante) 979 Sun Oct 9 20:23:14 1994 Eric Youngdale (eric@andante)
980 980
981 * Linux 1.1.53 released. 981 * Linux 1.1.53 released.
982 982
983 * scsi.c: Do not allocate dma bounce buffers if we have exactly 983 * scsi.c: Do not allocate dma bounce buffers if we have exactly
984 16Mb. 984 16Mb.
985 985
986 Fri Sep 9 05:35:30 1994 Eric Youngdale (eric@andante) 986 Fri Sep 9 05:35:30 1994 Eric Youngdale (eric@andante)
987 987
988 * Linux 1.1.51 released. 988 * Linux 1.1.51 released.
989 989
990 * aha152x.c: Add support for disabling the parity check. Update 990 * aha152x.c: Add support for disabling the parity check. Update
991 to version 1.4. [Juergen]. 991 to version 1.4. [Juergen].
992 992
993 * seagate.c: Tweak debugging message. 993 * seagate.c: Tweak debugging message.
994 994
995 Wed Aug 31 10:15:55 1994 Eric Youngdale (eric@andante) 995 Wed Aug 31 10:15:55 1994 Eric Youngdale (eric@andante)
996 996
997 * Linux 1.1.50 released. 997 * Linux 1.1.50 released.
998 998
999 * aha152x.c: Add eb800 for Vtech Platinum SMP boards. [Juergen]. 999 * aha152x.c: Add eb800 for Vtech Platinum SMP boards. [Juergen].
1000 1000
1001 * scsi.c: Add Quantum PD1225S to blacklist. 1001 * scsi.c: Add Quantum PD1225S to blacklist.
1002 1002
1003 Fri Aug 26 09:38:45 1994 Eric Youngdale (eric@andante) 1003 Fri Aug 26 09:38:45 1994 Eric Youngdale (eric@andante)
1004 1004
1005 * Linux 1.1.49 released. 1005 * Linux 1.1.49 released.
1006 1006
1007 * sd.c: Fix bug when we were deleting the wrong entry if we 1007 * sd.c: Fix bug when we were deleting the wrong entry if we
1008 get an unsupported sector size device. 1008 get an unsupported sector size device.
1009 1009
1010 * sr.c: Another spelling patch. 1010 * sr.c: Another spelling patch.
1011 1011
1012 Thu Aug 25 09:15:27 1994 Eric Youngdale (eric@andante) 1012 Thu Aug 25 09:15:27 1994 Eric Youngdale (eric@andante)
1013 1013
1014 * Linux 1.1.48 released. 1014 * Linux 1.1.48 released.
1015 1015
1016 * Throughout: Use new semantics for request_dma, as appropriate. 1016 * Throughout: Use new semantics for request_dma, as appropriate.
1017 1017
1018 * sr.c: Print correct device number. 1018 * sr.c: Print correct device number.
1019 1019
1020 Sun Aug 21 17:49:23 1994 Eric Youngdale (eric@andante) 1020 Sun Aug 21 17:49:23 1994 Eric Youngdale (eric@andante)
1021 1021
1022 * Linux 1.1.47 released. 1022 * Linux 1.1.47 released.
1023 1023
1024 * NCR5380.c: Add support for LIMIT_TRANSFERSIZE. 1024 * NCR5380.c: Add support for LIMIT_TRANSFERSIZE.
1025 1025
1026 * constants.h: Add prototype for print_Scsi_Cmnd. 1026 * constants.h: Add prototype for print_Scsi_Cmnd.
1027 1027
1028 * pas16.c: Some more minor tweaks. Test for Mediavision board. 1028 * pas16.c: Some more minor tweaks. Test for Mediavision board.
1029 Allow for disks > 1Gb. [Drew??] 1029 Allow for disks > 1Gb. [Drew??]
1030 1030
1031 * sr.c: Set SCpnt->transfersize. 1031 * sr.c: Set SCpnt->transfersize.
1032 1032
1033 Tue Aug 16 17:29:35 1994 Eric Youngdale (eric@andante) 1033 Tue Aug 16 17:29:35 1994 Eric Youngdale (eric@andante)
1034 1034
1035 * Linux 1.1.46 released. 1035 * Linux 1.1.46 released.
1036 1036
1037 * Throughout: More spelling fixups. 1037 * Throughout: More spelling fixups.
1038 1038
1039 * buslogic.c: Add a few more fixups from Dave. Disk translation 1039 * buslogic.c: Add a few more fixups from Dave. Disk translation
1040 mainly. 1040 mainly.
1041 1041
1042 * pas16.c: Add a few patches (Drew?). 1042 * pas16.c: Add a few patches (Drew?).
1043 1043
1044 1044
1045 Thu Aug 11 20:45:15 1994 Eric Youngdale (eric@andante) 1045 Thu Aug 11 20:45:15 1994 Eric Youngdale (eric@andante)
1046 1046
1047 * Linux 1.1.44 released. 1047 * Linux 1.1.44 released.
1048 1048
1049 * hosts.c: Add type casts for scsi_init_malloc. 1049 * hosts.c: Add type casts for scsi_init_malloc.
1050 1050
1051 * scsicam.c: Add type cast. 1051 * scsicam.c: Add type cast.
1052 1052
1053 Wed Aug 10 19:23:01 1994 Eric Youngdale (eric@andante) 1053 Wed Aug 10 19:23:01 1994 Eric Youngdale (eric@andante)
1054 1054
1055 * Linux 1.1.43 released. 1055 * Linux 1.1.43 released.
1056 1056
1057 * Throughout: Spelling cleanups. [??] 1057 * Throughout: Spelling cleanups. [??]
1058 1058
1059 * aha152x.c, NCR53*.c, fdomain.c, g_NCR5380.c, pas16.c, seagate.c, 1059 * aha152x.c, NCR53*.c, fdomain.c, g_NCR5380.c, pas16.c, seagate.c,
1060 t128.c: Use request_irq, not irqaction. [??] 1060 t128.c: Use request_irq, not irqaction. [??]
1061 1061
1062 * aha1542.c: Move test for shost before we start to use shost. 1062 * aha1542.c: Move test for shost before we start to use shost.
1063 1063
1064 * aha1542.c, aha1740.c, ultrastor.c, wd7000.c: Use new 1064 * aha1542.c, aha1740.c, ultrastor.c, wd7000.c: Use new
1065 calling sequence for request_irq. 1065 calling sequence for request_irq.
1066 1066
1067 * buslogic.c: Update from Dave Gentzel. 1067 * buslogic.c: Update from Dave Gentzel.
1068 1068
1069 Tue Aug 9 09:32:59 1994 Eric Youngdale (eric@andante) 1069 Tue Aug 9 09:32:59 1994 Eric Youngdale (eric@andante)
1070 1070
1071 * Linux 1.1.42 released. 1071 * Linux 1.1.42 released.
1072 1072
1073 * NCR5380.c: Change NCR5380_print_status to static. 1073 * NCR5380.c: Change NCR5380_print_status to static.
1074 1074
1075 * seagate.c: A few more bugfixes. Only Drew knows what they are 1075 * seagate.c: A few more bugfixes. Only Drew knows what they are
1076 for. 1076 for.
1077 1077
1078 * ultrastor.c: Tweak some __asm__ directives so that it works 1078 * ultrastor.c: Tweak some __asm__ directives so that it works
1079 with newer compilers. [??] 1079 with newer compilers. [??]
1080 1080
1081 Sat Aug 6 21:29:36 1994 Eric Youngdale (eric@andante) 1081 Sat Aug 6 21:29:36 1994 Eric Youngdale (eric@andante)
1082 1082
1083 * Linux 1.1.40 released. 1083 * Linux 1.1.40 released.
1084 1084
1085 * NCR5380.c: Return SCSI_RESET_WAKEUP from reset function. 1085 * NCR5380.c: Return SCSI_RESET_WAKEUP from reset function.
1086 1086
1087 * aha1542.c: Reset mailbox status after a bus device reset. 1087 * aha1542.c: Reset mailbox status after a bus device reset.
1088 1088
1089 * constants.c: Fix typo (;;). 1089 * constants.c: Fix typo (;;).
1090 1090
1091 * g_NCR5380.c: 1091 * g_NCR5380.c:
1092 * pas16.c: Correct usage of NCR5380_init. 1092 * pas16.c: Correct usage of NCR5380_init.
1093 1093
1094 * scsi.c: Remove redundant (and unused variables). 1094 * scsi.c: Remove redundant (and unused variables).
1095 1095
1096 * sd.c: Use memset to clear all of rscsi_disks before we use it. 1096 * sd.c: Use memset to clear all of rscsi_disks before we use it.
1097 1097
1098 * sg.c: Ditto, except for scsi_generics. 1098 * sg.c: Ditto, except for scsi_generics.
1099 1099
1100 * sr.c: Ditto, except for scsi_CDs. 1100 * sr.c: Ditto, except for scsi_CDs.
1101 1101
1102 * st.c: Initialize STp->device. 1102 * st.c: Initialize STp->device.
1103 1103
1104 * seagate.c: Fix bug. [Drew] 1104 * seagate.c: Fix bug. [Drew]
1105 1105
1106 Thu Aug 4 08:47:27 1994 Eric Youngdale (eric@andante) 1106 Thu Aug 4 08:47:27 1994 Eric Youngdale (eric@andante)
1107 1107
1108 * Linux 1.1.39 released. 1108 * Linux 1.1.39 released.
1109 1109
1110 * Makefile: Fix typo in NCR53C7xx. 1110 * Makefile: Fix typo in NCR53C7xx.
1111 1111
1112 * st.c: Print correct number for device. 1112 * st.c: Print correct number for device.
1113 1113
1114 Tue Aug 2 11:29:14 1994 Eric Youngdale (eric@esp22) 1114 Tue Aug 2 11:29:14 1994 Eric Youngdale (eric@esp22)
1115 1115
1116 * Linux 1.1.38 released. 1116 * Linux 1.1.38 released.
1117 1117
1118 Lots of changes in 1.1.38. All from Drew unless otherwise noted. 1118 Lots of changes in 1.1.38. All from Drew unless otherwise noted.
1119 1119
1120 * 53c7,8xx.c: New file from Drew. PCI driver. 1120 * 53c7,8xx.c: New file from Drew. PCI driver.
1121 1121
1122 * 53c7,8xx.h: Likewise. 1122 * 53c7,8xx.h: Likewise.
1123 1123
1124 * 53c7,8xx.scr: Likewise. 1124 * 53c7,8xx.scr: Likewise.
1125 1125
1126 * 53c8xx_d.h, 53c8xx_u.h, script_asm.pl: Likewise. 1126 * 53c8xx_d.h, 53c8xx_u.h, script_asm.pl: Likewise.
1127 1127
1128 * scsicam.c: New file from Drew. Read block 0 on the disk and 1128 * scsicam.c: New file from Drew. Read block 0 on the disk and
1129 read the partition table. Attempt to deduce the geometry from 1129 read the partition table. Attempt to deduce the geometry from
1130 the partition table if possible. Only used by 53c[7,8]xx right 1130 the partition table if possible. Only used by 53c[7,8]xx right
1131 now, but could be used by any device for which we have no way 1131 now, but could be used by any device for which we have no way
1132 of identifying the geometry. 1132 of identifying the geometry.
1133 1133
1134 * sd.c: Use device letters instead of sd%d in a lot of messages. 1134 * sd.c: Use device letters instead of sd%d in a lot of messages.
1135 1135
1136 * seagate.c: Fix bug that resulted in lockups with some devices. 1136 * seagate.c: Fix bug that resulted in lockups with some devices.
1137 1137
1138 * sr.c (sr_open): Return -EROFS, not -EACCES if we attempt to open 1138 * sr.c (sr_open): Return -EROFS, not -EACCES if we attempt to open
1139 device for write. 1139 device for write.
1140 1140
1141 * hosts.c, Makefile: Update for new driver. 1141 * hosts.c, Makefile: Update for new driver.
1142 1142
1143 * NCR5380.c, NCR5380.h, g_NCR5380.h: Update from Drew to support 1143 * NCR5380.c, NCR5380.h, g_NCR5380.h: Update from Drew to support
1144 53C400 chip. 1144 53C400 chip.
1145 1145
1146 * constants.c: Define CONST_CMND and CONST_MSG. Other minor 1146 * constants.c: Define CONST_CMND and CONST_MSG. Other minor
1147 cleanups along the way. Improve handling of CONST_MSG. 1147 cleanups along the way. Improve handling of CONST_MSG.
1148 1148
1149 * fdomain.c, fdomain.h: New version from Rik Faith. Update to 1149 * fdomain.c, fdomain.h: New version from Rik Faith. Update to
1150 5.18. Should now support TMC-3260 PCI card with 18C30 chip. 1150 5.18. Should now support TMC-3260 PCI card with 18C30 chip.
1151 1151
1152 * pas16.c: Update with new irq initialization. 1152 * pas16.c: Update with new irq initialization.
1153 1153
1154 * t128.c: Update with minor cleanups. 1154 * t128.c: Update with minor cleanups.
1155 1155
1156 * scsi.c (scsi_pid): New variable - gives each command a unique 1156 * scsi.c (scsi_pid): New variable - gives each command a unique
1157 id. Add Quantum LPS5235S to blacklist. Change in_scan to 1157 id. Add Quantum LPS5235S to blacklist. Change in_scan to
1158 in_scan_scsis and make global. 1158 in_scan_scsis and make global.
1159 1159
1160 * scsi.h: Add some defines for extended message handling, 1160 * scsi.h: Add some defines for extended message handling,
1161 INITIATE/RELEASE_RECOVERY. Add a few new fields to support sync 1161 INITIATE/RELEASE_RECOVERY. Add a few new fields to support sync
1162 transfers. 1162 transfers.
1163 1163
1164 * scsi_ioctl.h: Add ioctl to request synchronous transfers. 1164 * scsi_ioctl.h: Add ioctl to request synchronous transfers.
1165 1165
1166 1166
1167 Tue Jul 26 21:36:58 1994 Eric Youngdale (eric@esp22) 1167 Tue Jul 26 21:36:58 1994 Eric Youngdale (eric@esp22)
1168 1168
1169 * Linux 1.1.37 released. 1169 * Linux 1.1.37 released.
1170 1170
1171 * aha1542.c: Always call aha1542_mbenable, use new udelay 1171 * aha1542.c: Always call aha1542_mbenable, use new udelay
1172 mechanism so we do not wait a long time if the board does not 1172 mechanism so we do not wait a long time if the board does not
1173 implement this command. 1173 implement this command.
1174 1174
1175 * g_NCR5380.c: Remove #include <linux/config.h> and #if 1175 * g_NCR5380.c: Remove #include <linux/config.h> and #if
1176 defined(CONFIG_SCSI_*). 1176 defined(CONFIG_SCSI_*).
1177 1177
1178 * seagate.c: Likewise. 1178 * seagate.c: Likewise.
1179 1179
1180 Next round of changes to support loadable modules. Getting closer 1180 Next round of changes to support loadable modules. Getting closer
1181 now, still not possible to do anything remotely usable. 1181 now, still not possible to do anything remotely usable.
1182 1182
1183 hosts.c: Create a linked list of detected high level devices. 1183 hosts.c: Create a linked list of detected high level devices.
1184 (scsi_register_device): New function to insert into this list. 1184 (scsi_register_device): New function to insert into this list.
1185 (scsi_init): Call scsi_register_device for each of the known high 1185 (scsi_init): Call scsi_register_device for each of the known high
1186 level drivers. 1186 level drivers.
1187 1187
1188 hosts.h: Add prototype for linked list header. Add structure 1188 hosts.h: Add prototype for linked list header. Add structure
1189 definition for device template structure which defines the linked 1189 definition for device template structure which defines the linked
1190 list. 1190 list.
1191 1191
1192 scsi.c: (scan_scsis): Use linked list instead of knowledge about 1192 scsi.c: (scan_scsis): Use linked list instead of knowledge about
1193 existing high level device drivers. 1193 existing high level device drivers.
1194 (scsi_dev_init): Use init functions for drivers on linked list 1194 (scsi_dev_init): Use init functions for drivers on linked list
1195 instead of explicit list to initialize and attach devices to high 1195 instead of explicit list to initialize and attach devices to high
1196 level drivers. 1196 level drivers.
1197 1197
1198 scsi.h: Add new field "attached" to scsi_device - count of number 1198 scsi.h: Add new field "attached" to scsi_device - count of number
1199 of high level devices attached. 1199 of high level devices attached.
1200 1200
1201 sd.c, sr.c, sg.c, st.c: Adjust init/attach functions to use new 1201 sd.c, sr.c, sg.c, st.c: Adjust init/attach functions to use new
1202 scheme. 1202 scheme.
1203 1203
1204 Sat Jul 23 13:03:17 1994 Eric Youngdale (eric@esp22) 1204 Sat Jul 23 13:03:17 1994 Eric Youngdale (eric@esp22)
1205 1205
1206 * Linux 1.1.35 released. 1206 * Linux 1.1.35 released.
1207 1207
1208 * ultrastor.c: Change constraint on asm() operand so that it works 1208 * ultrastor.c: Change constraint on asm() operand so that it works
1209 with gcc 2.6.0. 1209 with gcc 2.6.0.
1210 1210
1211 Thu Jul 21 10:37:39 1994 Eric Youngdale (eric@esp22) 1211 Thu Jul 21 10:37:39 1994 Eric Youngdale (eric@esp22)
1212 1212
1213 * Linux 1.1.33 released. 1213 * Linux 1.1.33 released.
1214 1214
1215 * sr.c(sr_open): Do not allow opens with write access. 1215 * sr.c(sr_open): Do not allow opens with write access.
1216 1216
1217 Mon Jul 18 09:51:22 1994 1994 Eric Youngdale (eric@esp22) 1217 Mon Jul 18 09:51:22 1994 Eric Youngdale (eric@esp22)
1218 1218
1219 * Linux 1.1.31 released. 1219 * Linux 1.1.31 released.
1220 1220
1221 * sd.c: Increase SD_TIMEOUT from 300 to 600. 1221 * sd.c: Increase SD_TIMEOUT from 300 to 600.
1222 1222
1223 * sr.c: Remove stray task_struct* variable that was no longer 1223 * sr.c: Remove stray task_struct* variable that was no longer
1224 used. 1224 used.
1225 1225
1226 * sr_ioctl.c: Fix typo in up() call. 1226 * sr_ioctl.c: Fix typo in up() call.
1227 1227
1228 Sun Jul 17 16:25:29 1994 Eric Youngdale (eric@esp22) 1228 Sun Jul 17 16:25:29 1994 Eric Youngdale (eric@esp22)
1229 1229
1230 * Linux 1.1.30 released. 1230 * Linux 1.1.30 released.
1231 1231
1232 * scsi.c (scan_scsis): Fix detection of some Toshiba CDROM drives 1232 * scsi.c (scan_scsis): Fix detection of some Toshiba CDROM drives
1233 that report themselves as disk drives. 1233 that report themselves as disk drives.
1234 1234
1235 * (Throughout): Use request.sem instead of request.waiting. 1235 * (Throughout): Use request.sem instead of request.waiting.
1236 Should fix swap problem with fdomain. 1236 Should fix swap problem with fdomain.
1237 1237
1238 Thu Jul 14 10:51:42 1994 Eric Youngdale (eric@esp22) 1238 Thu Jul 14 10:51:42 1994 Eric Youngdale (eric@esp22)
1239 1239
1240 * Linux 1.1.29 released. 1240 * Linux 1.1.29 released.
1241 1241
1242 * scsi.c (scan_scsis): Add new devices to end of linked list, not 1242 * scsi.c (scan_scsis): Add new devices to end of linked list, not
1243 to the beginning. 1243 to the beginning.
1244 1244
1245 * scsi.h (SCSI_SLEEP): Remove brain dead hack to try to save 1245 * scsi.h (SCSI_SLEEP): Remove brain dead hack to try to save
1246 the task state before sleeping. 1246 the task state before sleeping.
1247 1247
1248 Sat Jul 9 15:01:03 1994 Eric Youngdale (eric@esp22) 1248 Sat Jul 9 15:01:03 1994 Eric Youngdale (eric@esp22)
1249 1249
1250 More changes to eventually support loadable modules. Mainly 1250 More changes to eventually support loadable modules. Mainly
1251 we want to use linked lists instead of arrays because it is easier 1251 we want to use linked lists instead of arrays because it is easier
1252 to dynamically add and remove things this way. 1252 to dynamically add and remove things this way.
1253 1253
1254 Quite a bit more work is needed before loadable modules are 1254 Quite a bit more work is needed before loadable modules are
1255 possible (and usable) with scsi, but this is most of the grunge 1255 possible (and usable) with scsi, but this is most of the grunge
1256 work. 1256 work.
1257 1257
1258 * Linux 1.1.28 released. 1258 * Linux 1.1.28 released.
1259 1259
1260 * scsi.c, scsi.h (allocate_device, request_queueable): Change 1260 * scsi.c, scsi.h (allocate_device, request_queueable): Change
1261 argument from index into scsi_devices to a pointer to the 1261 argument from index into scsi_devices to a pointer to the
1262 Scsi_Device struct. 1262 Scsi_Device struct.
1263 1263
1264 * Throughout: Change all calls to allocate_device, 1264 * Throughout: Change all calls to allocate_device,
1265 request_queueable to use new calling sequence. 1265 request_queueable to use new calling sequence.
1266 1266
1267 * Throughout: Use SCpnt->device instead of 1267 * Throughout: Use SCpnt->device instead of
1268 scsi_devices[SCpnt->index]. Ugh - the pointer was there all along 1268 scsi_devices[SCpnt->index]. Ugh - the pointer was there all along
1269 - much cleaner this way. 1269 - much cleaner this way.
1270 1270
1271 * scsi.c (scsi_init_malloc, scsi_free_malloc): New functions - 1271 * scsi.c (scsi_init_malloc, scsi_free_malloc): New functions -
1272 allow us to pretend that we have a working malloc when we 1272 allow us to pretend that we have a working malloc when we
1273 initialize. Use this instead of passing memory_start, memory_end 1273 initialize. Use this instead of passing memory_start, memory_end
1274 around all over the place. 1274 around all over the place.
1275 1275
1276 * scsi.h, st.c, sr.c, sd.c, sg.c: Change *_init1 functions to use 1276 * scsi.h, st.c, sr.c, sd.c, sg.c: Change *_init1 functions to use
1277 scsi_init_malloc, remove all arguments, no return value. 1277 scsi_init_malloc, remove all arguments, no return value.
1278 1278
1279 * scsi.h: Remove index field from Scsi_Device and Scsi_Cmnd 1279 * scsi.h: Remove index field from Scsi_Device and Scsi_Cmnd
1280 structs. 1280 structs.
1281 1281
1282 * scsi.c (scsi_dev_init): Set up for scsi_init_malloc. 1282 * scsi.c (scsi_dev_init): Set up for scsi_init_malloc.
1283 (scan_scsis): Get SDpnt from scsi_init_malloc, and refresh 1283 (scan_scsis): Get SDpnt from scsi_init_malloc, and refresh
1284 when we discover a device. Free pointer before returning. 1284 when we discover a device. Free pointer before returning.
1285 Change scsi_devices into a linked list. 1285 Change scsi_devices into a linked list.
1286 1286
1287 * scsi.c (scan_scsis): Change to only scan one host. 1287 * scsi.c (scan_scsis): Change to only scan one host.
1288 (scsi_dev_init): Loop over all detected hosts, and scan them. 1288 (scsi_dev_init): Loop over all detected hosts, and scan them.
1289 1289
1290 * hosts.c (scsi_init_free): Change so that number of extra bytes 1290 * hosts.c (scsi_init_free): Change so that number of extra bytes
1291 is stored in struct, and we do not have to pass it each time. 1291 is stored in struct, and we do not have to pass it each time.
1292 1292
1293 * hosts.h: Change Scsi_Host_Template struct to include "next" and 1293 * hosts.h: Change Scsi_Host_Template struct to include "next" and
1294 "release" functions. Initialize to NULL in all low level 1294 "release" functions. Initialize to NULL in all low level
1295 adapters. 1295 adapters.
1296 1296
1297 * hosts.c: Rename scsi_hosts to builtin_scsi_hosts, create linked 1297 * hosts.c: Rename scsi_hosts to builtin_scsi_hosts, create linked
1298 list scsi_hosts, linked together with the new "next" field. 1298 list scsi_hosts, linked together with the new "next" field.
1299 1299
1300 Wed Jul 6 05:45:02 1994 Eric Youngdale (eric@esp22) 1300 Wed Jul 6 05:45:02 1994 Eric Youngdale (eric@esp22)
1301 1301
1302 * Linux 1.1.25 released. 1302 * Linux 1.1.25 released.
1303 1303
1304 * aha152x.c: Changes from Juergen - cleanups and updates. 1304 * aha152x.c: Changes from Juergen - cleanups and updates.
1305 1305
1306 * sd.c, sr.c: Use new check_media_change and revalidate 1306 * sd.c, sr.c: Use new check_media_change and revalidate
1307 file_operations fields. 1307 file_operations fields.
1308 1308
1309 * st.c, st.h: Add changes from Kai Makisara, dated Jun 22. 1309 * st.c, st.h: Add changes from Kai Makisara, dated Jun 22.
1310 1310
1311 * hosts.h: Change SG_ALL back to 0xff. Apparently soft error 1311 * hosts.h: Change SG_ALL back to 0xff. Apparently soft error
1312 in /dev/brain resulted in having this bumped up. 1312 in /dev/brain resulted in having this bumped up.
1313 Change first parameter in bios_param function to be Disk * instead 1313 Change first parameter in bios_param function to be Disk * instead
1314 of index into rscsi_disks. 1314 of index into rscsi_disks.
1315 1315
1316 * sd_ioctl.c: Pass pointer to rscsi_disks element instead of index 1316 * sd_ioctl.c: Pass pointer to rscsi_disks element instead of index
1317 to array. 1317 to array.
1318 1318
1319 * sd.h: Add struct name "scsi_disk" to typedef for Scsi_Disk. 1319 * sd.h: Add struct name "scsi_disk" to typedef for Scsi_Disk.
1320 1320
1321 * scsi.c: Remove redundant Maxtor XT8760S from blacklist. 1321 * scsi.c: Remove redundant Maxtor XT8760S from blacklist.
1322 In scsi_reset, add printk when DEBUG defined. 1322 In scsi_reset, add printk when DEBUG defined.
1323 1323
1324 * All low level drivers: Modify definitions of bios_param in 1324 * All low level drivers: Modify definitions of bios_param in
1325 appropriate way. 1325 appropriate way.
1326 1326
1327 Thu Jun 16 10:31:59 1994 Eric Youngdale (eric@esp22) 1327 Thu Jun 16 10:31:59 1994 Eric Youngdale (eric@esp22)
1328 1328
1329 * Linux 1.1.20 released. 1329 * Linux 1.1.20 released.
1330 1330
1331 * scsi_ioctl.c: Only pass down the actual number of characters 1331 * scsi_ioctl.c: Only pass down the actual number of characters
1332 required to scsi_do_cmd, not the one rounded up to a even number 1332 required to scsi_do_cmd, not the one rounded up to a even number
1333 of sectors. 1333 of sectors.
1334 1334
1335 * ultrastor.c: Changes from Caleb Epstein for 24f cards. Support 1335 * ultrastor.c: Changes from Caleb Epstein for 24f cards. Support
1336 larger SG lists. 1336 larger SG lists.
1337 1337
1338 * ultrastor.c: Changes from me - use scsi_register to register 1338 * ultrastor.c: Changes from me - use scsi_register to register
1339 host. Add some consistency checking, 1339 host. Add some consistency checking,
1340 1340
1341 Wed Jun 1 21:12:13 1994 Eric Youngdale (eric@esp22) 1341 Wed Jun 1 21:12:13 1994 Eric Youngdale (eric@esp22)
1342 1342
1343 * Linux 1.1.19 released. 1343 * Linux 1.1.19 released.
1344 1344
1345 * scsi.h: Add new return code for reset() function: 1345 * scsi.h: Add new return code for reset() function:
1346 SCSI_RESET_PUNT. 1346 SCSI_RESET_PUNT.
1347 1347
1348 * scsi.c: Make SCSI_RESET_PUNT the same as SCSI_RESET_WAKEUP for 1348 * scsi.c: Make SCSI_RESET_PUNT the same as SCSI_RESET_WAKEUP for
1349 now. 1349 now.
1350 1350
1351 * aha1542.c: If the command responsible for the reset is not 1351 * aha1542.c: If the command responsible for the reset is not
1352 pending, return SCSI_RESET_PUNT. 1352 pending, return SCSI_RESET_PUNT.
1353 1353
1354 * aha1740.c, buslogic.c, wd7000.c, ultrastor.c: Return 1354 * aha1740.c, buslogic.c, wd7000.c, ultrastor.c: Return
1355 SCSI_RESET_PUNT instead of SCSI_RESET_SNOOZE. 1355 SCSI_RESET_PUNT instead of SCSI_RESET_SNOOZE.
1356 1356
1357 Tue May 31 19:36:01 1994 Eric Youngdale (eric@esp22) 1357 Tue May 31 19:36:01 1994 Eric Youngdale (eric@esp22)
1358 1358
1359 * buslogic.c: Do not print out message about "must be Adaptec" 1359 * buslogic.c: Do not print out message about "must be Adaptec"
1360 if we have detected a buslogic card. Print out a warning message 1360 if we have detected a buslogic card. Print out a warning message
1361 if we are configuring for >16Mb, since the 445S at board level 1361 if we are configuring for >16Mb, since the 445S at board level
1362 D or earlier does not work right. The "D" level board can be made 1362 D or earlier does not work right. The "D" level board can be made
1363 to work by flipping an undocumented switch, but this is too subtle. 1363 to work by flipping an undocumented switch, but this is too subtle.
1364 1364
1365 Changes based upon patches in Yggdrasil distribution. 1365 Changes based upon patches in Yggdrasil distribution.
1366 1366
1367 * sg.c, sg.h: Return sense data to user. 1367 * sg.c, sg.h: Return sense data to user.
1368 1368
1369 * aha1542.c, aha1740.c, buslogic.c: Do not panic if 1369 * aha1542.c, aha1740.c, buslogic.c: Do not panic if
1370 sense buffer is wrong size. 1370 sense buffer is wrong size.
1371 1371
1372 * hosts.c: Test for ultrastor card before any of the others. 1372 * hosts.c: Test for ultrastor card before any of the others.
1373 1373
1374 * scsi.c: Allow boot-time option for max_scsi_luns=? so that 1374 * scsi.c: Allow boot-time option for max_scsi_luns=? so that
1375 buggy firmware has an easy work-around. 1375 buggy firmware has an easy work-around.
1376 1376
1377 Sun May 15 20:24:34 1994 Eric Youngdale (eric@esp22) 1377 Sun May 15 20:24:34 1994 Eric Youngdale (eric@esp22)
1378 1378
1379 * Linux 1.1.15 released. 1379 * Linux 1.1.15 released.
1380 1380
1381 Post-codefreeze thaw... 1381 Post-codefreeze thaw...
1382 1382
1383 * buslogic.[c,h]: New driver from David Gentzel. 1383 * buslogic.[c,h]: New driver from David Gentzel.
1384 1384
1385 * hosts.h: Add use_clustering field to explicitly say whether 1385 * hosts.h: Add use_clustering field to explicitly say whether
1386 clustering should be used for devices attached to this host 1386 clustering should be used for devices attached to this host
1387 adapter. The buslogic board apparently supports large SG lists, 1387 adapter. The buslogic board apparently supports large SG lists,
1388 but it is apparently faster if sd.c condenses this into a smaller 1388 but it is apparently faster if sd.c condenses this into a smaller
1389 list. 1389 list.
1390 1390
1391 * sd.c: Use this field instead of heuristic. 1391 * sd.c: Use this field instead of heuristic.
1392 1392
1393 * All host adapter include files: Add appropriate initializer for 1393 * All host adapter include files: Add appropriate initializer for
1394 use_clustering field. 1394 use_clustering field.
1395 1395
1396 * scsi.h: Add #defines for return codes for the abort and reset 1396 * scsi.h: Add #defines for return codes for the abort and reset
1397 functions. There is now a specific set of return codes to fully 1397 functions. There is now a specific set of return codes to fully
1398 specify all of the possible things that the low-level adapter 1398 specify all of the possible things that the low-level adapter
1399 could do. 1399 could do.
1400 1400
1401 * scsi.c: Act based upon return codes from abort/reset functions. 1401 * scsi.c: Act based upon return codes from abort/reset functions.
1402 1402
1403 * All host adapter abort/reset functions: Return new return code. 1403 * All host adapter abort/reset functions: Return new return code.
1404 1404
1405 * Add code in scsi.c to help debug timeouts. Use #define 1405 * Add code in scsi.c to help debug timeouts. Use #define
1406 DEBUG_TIMEOUT to enable this. 1406 DEBUG_TIMEOUT to enable this.
1407 1407
1408 * scsi.c: If the host->irq field is set, use 1408 * scsi.c: If the host->irq field is set, use
1409 disable_irq/enable_irq before calling queuecommand if we 1409 disable_irq/enable_irq before calling queuecommand if we
1410 are not already in an interrupt. Reduce races, and we 1410 are not already in an interrupt. Reduce races, and we
1411 can be sloppier about cli/sti in the interrupt routines now 1411 can be sloppier about cli/sti in the interrupt routines now
1412 (reduce interrupt latency). 1412 (reduce interrupt latency).
1413 1413
1414 * constants.c: Fix some things to eliminate warnings. Add some 1414 * constants.c: Fix some things to eliminate warnings. Add some
1415 sense descriptions that were omitted before. 1415 sense descriptions that were omitted before.
1416 1416
1417 * aha1542.c: Watch for SCRD from host adapter - if we see it, set 1417 * aha1542.c: Watch for SCRD from host adapter - if we see it, set
1418 a flag. Currently we only print out the number of pending 1418 a flag. Currently we only print out the number of pending
1419 commands that might need to be restarted. 1419 commands that might need to be restarted.
1420 1420
1421 * aha1542.c (aha1542_abort): Look for lost interrupts, OGMB still 1421 * aha1542.c (aha1542_abort): Look for lost interrupts, OGMB still
1422 full, and attempt to recover. Otherwise give up. 1422 full, and attempt to recover. Otherwise give up.
1423 1423
1424 * aha1542.c (aha1542_reset): Try BUS DEVICE RESET, and then pass 1424 * aha1542.c (aha1542_reset): Try BUS DEVICE RESET, and then pass
1425 DID_RESET back up to the upper level code for all commands running 1425 DID_RESET back up to the upper level code for all commands running
1426 on this target (even on different LUNs). 1426 on this target (even on different LUNs).
1427 1427
1428 Sat May 7 14:54:01 1994 1428 Sat May 7 14:54:01 1994
1429 1429
1430 * Linux 1.1.12 released. 1430 * Linux 1.1.12 released.
1431 1431
1432 * st.c, st.h: New version from Kai. Supports boot time 1432 * st.c, st.h: New version from Kai. Supports boot time
1433 specification of number of buffers. 1433 specification of number of buffers.
1434 1434
1435 * wd7000.[c,h]: Updated driver from John Boyd. Now supports 1435 * wd7000.[c,h]: Updated driver from John Boyd. Now supports
1436 more than one wd7000 board in machine at one time, among other things. 1436 more than one wd7000 board in machine at one time, among other things.
1437 1437
1438 Wed Apr 20 22:20:35 1994 1438 Wed Apr 20 22:20:35 1994
1439 1439
1440 * Linux 1.1.8 released. 1440 * Linux 1.1.8 released.
1441 1441
1442 * sd.c: Add a few type casts where scsi_malloc is called. 1442 * sd.c: Add a few type casts where scsi_malloc is called.
1443 1443
1444 Wed Apr 13 12:53:29 1994 1444 Wed Apr 13 12:53:29 1994
1445 1445
1446 * Linux 1.1.4 released. 1446 * Linux 1.1.4 released.
1447 1447
1448 * scsi.c: Clean up a few printks (use %p to print pointers). 1448 * scsi.c: Clean up a few printks (use %p to print pointers).
1449 1449
1450 Wed Apr 13 11:33:02 1994 1450 Wed Apr 13 11:33:02 1994
1451 1451
1452 * Linux 1.1.3 released. 1452 * Linux 1.1.3 released.
1453 1453
1454 * fdomain.c: Update to version 5.16 (Handle different FIFO sizes 1454 * fdomain.c: Update to version 5.16 (Handle different FIFO sizes
1455 better). 1455 better).
1456 1456
1457 Fri Apr 8 08:57:19 1994 1457 Fri Apr 8 08:57:19 1994
1458 1458
1459 * Linux 1.1.2 released. 1459 * Linux 1.1.2 released.
1460 1460
1461 * Throughout: SCSI portion of cluster diffs added. 1461 * Throughout: SCSI portion of cluster diffs added.
1462 1462
1463 Tue Apr 5 07:41:50 1994 1463 Tue Apr 5 07:41:50 1994
1464 1464
1465 * Linux 1.1 development tree initiated. 1465 * Linux 1.1 development tree initiated.
1466 1466
1467 * The linux 1.0 development tree is now effectively frozen except 1467 * The linux 1.0 development tree is now effectively frozen except
1468 for obvious bugfixes. 1468 for obvious bugfixes.
1469 1469
1470 ****************************************************************** 1470 ******************************************************************
1471 ****************************************************************** 1471 ******************************************************************
1472 ****************************************************************** 1472 ******************************************************************
1473 ****************************************************************** 1473 ******************************************************************
1474 1474
1475 Sun Apr 17 00:17:39 1994 1475 Sun Apr 17 00:17:39 1994
1476 1476
1477 * Linux 1.0, patchlevel 9 released. 1477 * Linux 1.0, patchlevel 9 released.
1478 1478
1479 * fdomain.c: Update to version 5.16 (Handle different FIFO sizes 1479 * fdomain.c: Update to version 5.16 (Handle different FIFO sizes
1480 better). 1480 better).
1481 1481
1482 Thu Apr 7 08:36:20 1994 1482 Thu Apr 7 08:36:20 1994
1483 1483
1484 * Linux 1.0, patchlevel8 released. 1484 * Linux 1.0, patchlevel8 released.
1485 1485
1486 * fdomain.c: Update to version 5.15 from 5.9. Handles 3.4 bios. 1486 * fdomain.c: Update to version 5.15 from 5.9. Handles 3.4 bios.
1487 1487
1488 Sun Apr 3 14:43:03 1994 1488 Sun Apr 3 14:43:03 1994
1489 1489
1490 * Linux 1.0, patchlevel6 released. 1490 * Linux 1.0, patchlevel6 released.
1491 1491
1492 * wd7000.c: Make stab at fixing race condition. 1492 * wd7000.c: Make stab at fixing race condition.
1493 1493
1494 Sat Mar 26 14:14:50 1994 1494 Sat Mar 26 14:14:50 1994
1495 1495
1496 * Linux 1.0, patchlevel5 released. 1496 * Linux 1.0, patchlevel5 released.
1497 1497
1498 * aha152x.c, Makefile: Fix a few bugs (too much data message). 1498 * aha152x.c, Makefile: Fix a few bugs (too much data message).
1499 Add a few more bios signatures. (Patches from Juergen). 1499 Add a few more bios signatures. (Patches from Juergen).
1500 1500
1501 * aha1542.c: Fix race condition in aha1542_out. 1501 * aha1542.c: Fix race condition in aha1542_out.
1502 1502
1503 Mon Mar 21 16:36:20 1994 1503 Mon Mar 21 16:36:20 1994
1504 1504
1505 * Linux 1.0, patchlevel3 released. 1505 * Linux 1.0, patchlevel3 released.
1506 1506
1507 * sd.c, st.c, sr.c, sg.c: Return -ENXIO, not -ENODEV if we attempt 1507 * sd.c, st.c, sr.c, sg.c: Return -ENXIO, not -ENODEV if we attempt
1508 to open a non-existent device. 1508 to open a non-existent device.
1509 1509
1510 * scsi.c: Add Chinon cdrom to blacklist. 1510 * scsi.c: Add Chinon cdrom to blacklist.
1511 1511
1512 * sr_ioctl.c: Check return status of verify_area. 1512 * sr_ioctl.c: Check return status of verify_area.
1513 1513
1514 Sat Mar 6 16:06:19 1994 1514 Sat Mar 6 16:06:19 1994
1515 1515
1516 * Linux 1.0 released (technically a pre-release). 1516 * Linux 1.0 released (technically a pre-release).
1517 1517
1518 * scsi.c: Add IMS CDD521, Maxtor XT-8760S to blacklist. 1518 * scsi.c: Add IMS CDD521, Maxtor XT-8760S to blacklist.
1519 1519
1520 Tue Feb 15 10:58:20 1994 1520 Tue Feb 15 10:58:20 1994
1521 1521
1522 * pl15e released. 1522 * pl15e released.
1523 1523
1524 * aha1542.c: For 1542C, allow dynamic device scan with >1Gb turned 1524 * aha1542.c: For 1542C, allow dynamic device scan with >1Gb turned
1525 off. 1525 off.
1526 1526
1527 * constants.c: Fix typo in definition of CONSTANTS. 1527 * constants.c: Fix typo in definition of CONSTANTS.
1528 1528
1529 * pl15d released. 1529 * pl15d released.
1530 1530
1531 Fri Feb 11 10:10:16 1994 1531 Fri Feb 11 10:10:16 1994
1532 1532
1533 * pl15c released. 1533 * pl15c released.
1534 1534
1535 * scsi.c: Add Maxtor XT-3280 and Rodime RO3000S to blacklist. 1535 * scsi.c: Add Maxtor XT-3280 and Rodime RO3000S to blacklist.
1536 1536
1537 * scsi.c: Allow tagged queueing for scsi 3 devices as well. 1537 * scsi.c: Allow tagged queueing for scsi 3 devices as well.
1538 Some really old devices report a version number of 0. Disallow 1538 Some really old devices report a version number of 0. Disallow
1539 LUN != 0 for these. 1539 LUN != 0 for these.
1540 1540
1541 Thu Feb 10 09:48:57 1994 1541 Thu Feb 10 09:48:57 1994
1542 1542
1543 * pl15b released. 1543 * pl15b released.
1544 1544
1545 Sun Feb 6 12:19:46 1994 1545 Sun Feb 6 12:19:46 1994
1546 1546
1547 * pl15a released. 1547 * pl15a released.
1548 1548
1549 Fri Feb 4 09:02:17 1994 1549 Fri Feb 4 09:02:17 1994
1550 1550
1551 * scsi.c: Add Teac cdrom to blacklist. 1551 * scsi.c: Add Teac cdrom to blacklist.
1552 1552
1553 Thu Feb 3 14:16:43 1994 1553 Thu Feb 3 14:16:43 1994
1554 1554
1555 * pl15 released. 1555 * pl15 released.
1556 1556
1557 Tue Feb 1 15:47:43 1994 1557 Tue Feb 1 15:47:43 1994
1558 1558
1559 * pl14w released. 1559 * pl14w released.
1560 1560
1561 * wd7000.c (wd_bases): Fix typo in last change. 1561 * wd7000.c (wd_bases): Fix typo in last change.
1562 1562
1563 Mon Jan 24 17:37:23 1994 1563 Mon Jan 24 17:37:23 1994
1564 1564
1565 * pl14u released. 1565 * pl14u released.
1566 1566
1567 * aha1542.c: Support 1542CF/extended bios. Different from 1542C. 1567 * aha1542.c: Support 1542CF/extended bios. Different from 1542C.
1568 1568
1569 * wd7000.c: Allow bios at 0xd8000 as well. 1569 * wd7000.c: Allow bios at 0xd8000 as well.
1570 1570
1571 * ultrastor.c: Do not truncate cylinders to 1024. 1571 * ultrastor.c: Do not truncate cylinders to 1024.
1572 1572
1573 * fdomain.c: Update to version 5.9 (add new bios signature). 1573 * fdomain.c: Update to version 5.9 (add new bios signature).
1574 1574
1575 * NCR5380.c: Update from Drew - should work a lot better now. 1575 * NCR5380.c: Update from Drew - should work a lot better now.
1576 1576
1577 Sat Jan 8 15:13:10 1994 1577 Sat Jan 8 15:13:10 1994
1578 1578
1579 * pl14o released. 1579 * pl14o released.
1580 1580
1581 * sr_ioctl.c: Zero reserved field before trying to set audio volume. 1581 * sr_ioctl.c: Zero reserved field before trying to set audio volume.
1582 1582
1583 Wed Jan 5 13:21:10 1994 1583 Wed Jan 5 13:21:10 1994
1584 1584
1585 * pl14m released. 1585 * pl14m released.
1586 1586
1587 * fdomain.c: Update to version 5.8. No functional difference??? 1587 * fdomain.c: Update to version 5.8. No functional difference???
1588 1588
1589 Tue Jan 4 14:26:13 1994 1589 Tue Jan 4 14:26:13 1994
1590 1590
1591 * pl14l released. 1591 * pl14l released.
1592 1592
1593 * ultrastor.c: Remove outl, inl functions (now provided elsewhere). 1593 * ultrastor.c: Remove outl, inl functions (now provided elsewhere).
1594 1594
1595 Mon Jan 3 12:27:25 1994 1595 Mon Jan 3 12:27:25 1994
1596 1596
1597 * pl14k released. 1597 * pl14k released.
1598 1598
1599 * aha152x.c: Remove insw and outsw functions. 1599 * aha152x.c: Remove insw and outsw functions.
1600 1600
1601 * fdomain.c: Ditto. 1601 * fdomain.c: Ditto.
1602 1602
1603 Wed Dec 29 09:47:20 1993 1603 Wed Dec 29 09:47:20 1993
1604 1604
1605 * pl14i released. 1605 * pl14i released.
1606 1606
1607 * scsi.c: Support RECOVERED_ERROR for tape drives. 1607 * scsi.c: Support RECOVERED_ERROR for tape drives.
1608 1608
1609 * st.c: Update of tape driver from Kai. 1609 * st.c: Update of tape driver from Kai.
1610 1610
1611 Tue Dec 21 09:18:30 1993 1611 Tue Dec 21 09:18:30 1993
1612 1612
1613 * pl14g released. 1613 * pl14g released.
1614 1614
1615 * aha1542.[c,h]: Support extended BIOS stuff. 1615 * aha1542.[c,h]: Support extended BIOS stuff.
1616 1616
1617 * scsi.c: Clean up messages about disks, so they are displayed as 1617 * scsi.c: Clean up messages about disks, so they are displayed as
1618 sda, sdb, etc instead of sd0, sd1, etc. 1618 sda, sdb, etc instead of sd0, sd1, etc.
1619 1619
1620 * sr.c: Force reread of capacity if disk was changed. 1620 * sr.c: Force reread of capacity if disk was changed.
1621 Clear buffer before asking for capacity/sectorsize (some drives 1621 Clear buffer before asking for capacity/sectorsize (some drives
1622 do not report this properly). Set needs_sector_size flag if 1622 do not report this properly). Set needs_sector_size flag if
1623 drive did not return sensible sector size. 1623 drive did not return sensible sector size.
1624 1624
1625 Mon Dec 13 12:13:47 1993 1625 Mon Dec 13 12:13:47 1993
1626 1626
1627 * aha152x.c: Update to version .101 from Juergen. 1627 * aha152x.c: Update to version .101 from Juergen.
1628 1628
1629 Mon Nov 29 03:03:00 1993 1629 Mon Nov 29 03:03:00 1993
1630 1630
1631 * linux 0.99.14 released. 1631 * linux 0.99.14 released.
1632 1632
1633 * All scsi stuff moved from kernel/blk_drv/scsi to drivers/scsi. 1633 * All scsi stuff moved from kernel/blk_drv/scsi to drivers/scsi.
1634 1634
1635 * Throughout: Grammatical corrections to various comments. 1635 * Throughout: Grammatical corrections to various comments.
1636 1636
1637 * Makefile: fix so that we do not need to compile things we are 1637 * Makefile: fix so that we do not need to compile things we are
1638 not going to use. 1638 not going to use.
1639 1639
1640 * NCR5380.c, NCR5380.h, g_NCR5380.c, g_NCR5380.h, pas16.c, 1640 * NCR5380.c, NCR5380.h, g_NCR5380.c, g_NCR5380.h, pas16.c,
1641 pas16.h, t128.c, t128.h: New files from Drew. 1641 pas16.h, t128.c, t128.h: New files from Drew.
1642 1642
1643 * aha152x.c, aha152x.h: New files from Juergen Fischer. 1643 * aha152x.c, aha152x.h: New files from Juergen Fischer.
1644 1644
1645 * aha1542.c: Support for more than one 1542 in the machine 1645 * aha1542.c: Support for more than one 1542 in the machine
1646 at the same time. Make functions static that do not need 1646 at the same time. Make functions static that do not need
1647 visibility. 1647 visibility.
1648 1648
1649 * aha1740.c: Set NEEDS_JUMPSTART flag in reset function, so we 1649 * aha1740.c: Set NEEDS_JUMPSTART flag in reset function, so we
1650 know to restart the command. Change prototype of aha1740_reset 1650 know to restart the command. Change prototype of aha1740_reset
1651 to take a command pointer. 1651 to take a command pointer.
1652 1652
1653 * constants.c: Clean up a few things. 1653 * constants.c: Clean up a few things.
1654 1654
1655 * fdomain.c: Update to version 5.6. Move snarf_region. Allow 1655 * fdomain.c: Update to version 5.6. Move snarf_region. Allow
1656 board to be set at different SCSI ids. Remove support for 1656 board to be set at different SCSI ids. Remove support for
1657 reselection (did not work well). Set JUMPSTART flag in reset 1657 reselection (did not work well). Set JUMPSTART flag in reset
1658 code. 1658 code.
1659 1659
1660 * hosts.c: Support new low-level adapters. Allow for more than 1660 * hosts.c: Support new low-level adapters. Allow for more than
1661 one adapter of a given type. 1661 one adapter of a given type.
1662 1662
1663 * hosts.h: Allow for more than one adapter of a given type. 1663 * hosts.h: Allow for more than one adapter of a given type.
1664 1664
1665 * scsi.c: Add scsi_device_types array, if NEEDS_JUMPSTART is set 1665 * scsi.c: Add scsi_device_types array, if NEEDS_JUMPSTART is set
1666 after a low-level reset, start the command again. Sort blacklist, 1666 after a low-level reset, start the command again. Sort blacklist,
1667 and add Maxtor MXT-1240S, XT-4170S, NEC CDROM 84, Seagate ST157N. 1667 and add Maxtor MXT-1240S, XT-4170S, NEC CDROM 84, Seagate ST157N.
1668 1668
1669 * scsi.h: Add constants for tagged queueing. 1669 * scsi.h: Add constants for tagged queueing.
1670 1670
1671 * Throughout: Use constants from major.h instead of hardcoded 1671 * Throughout: Use constants from major.h instead of hardcoded
1672 numbers for major numbers. 1672 numbers for major numbers.
1673 1673
1674 * scsi_ioctl.c: Fix bug in buffer length in ioctl_command. Use 1674 * scsi_ioctl.c: Fix bug in buffer length in ioctl_command. Use
1675 verify_area in GET_IDLUN ioctl. Add new ioctls for 1675 verify_area in GET_IDLUN ioctl. Add new ioctls for
1676 TAGGED_QUEUE_ENABLE, DISABLE. Only allow IOCTL_SEND_COMMAND by 1676 TAGGED_QUEUE_ENABLE, DISABLE. Only allow IOCTL_SEND_COMMAND by
1677 superuser. 1677 superuser.
1678 1678
1679 * sd.c: Only pay attention to UNIT_ATTENTION for removable disks. 1679 * sd.c: Only pay attention to UNIT_ATTENTION for removable disks.
1680 Fix bug where sometimes portions of blocks would get lost 1680 Fix bug where sometimes portions of blocks would get lost
1681 resulting in processes hanging. Add messages when we spin up a 1681 resulting in processes hanging. Add messages when we spin up a
1682 disk, and fix a bug in the timing. Increase read-ahead for disks 1682 disk, and fix a bug in the timing. Increase read-ahead for disks
1683 that are on a scatter-gather capable host adapter. 1683 that are on a scatter-gather capable host adapter.
1684 1684
1685 * seagate.c: Fix so that some parameters can be set from the lilo 1685 * seagate.c: Fix so that some parameters can be set from the lilo
1686 prompt. Supply jumpstart flag if we are resetting and need the 1686 prompt. Supply jumpstart flag if we are resetting and need the
1687 command restarted. Fix so that we return 1 if we detect a card 1687 command restarted. Fix so that we return 1 if we detect a card
1688 so that multiple card detection works correctly. Add yet another 1688 so that multiple card detection works correctly. Add yet another
1689 signature for FD cards (950). Add another signature for ST0x. 1689 signature for FD cards (950). Add another signature for ST0x.
1690 1690
1691 * sg.c, sg.h: New files from Lawrence Foard for generic scsi 1691 * sg.c, sg.h: New files from Lawrence Foard for generic scsi
1692 access. 1692 access.
1693 1693
1694 * sr.c: Add type casts for (void*) so that we can do pointer 1694 * sr.c: Add type casts for (void*) so that we can do pointer
1695 arithmetic. Works with GCC without this, but it is not strictly 1695 arithmetic. Works with GCC without this, but it is not strictly
1696 correct. Same bugfix as was in sd.c. Increase read-ahead a la 1696 correct. Same bugfix as was in sd.c. Increase read-ahead a la
1697 disk driver. 1697 disk driver.
1698 1698
1699 * sr_ioctl.c: Use scsi_malloc buffer instead of buffer from stack 1699 * sr_ioctl.c: Use scsi_malloc buffer instead of buffer from stack
1700 since we cannot guarantee that the stack is < 16Mb. 1700 since we cannot guarantee that the stack is < 16Mb.
1701 1701
1702 * ultrastor.c: Update to support 24f properly (JFC's driver). 1702 * ultrastor.c: Update to support 24f properly (JFC's driver).
1703 1703
1704 * wd7000.c: Supply jumpstart flag for reset. Do not round up 1704 * wd7000.c: Supply jumpstart flag for reset. Do not round up
1705 number of cylinders in biosparam function. 1705 number of cylinders in biosparam function.
1706 1706
1707 Sat Sep 4 20:49:56 1993 1707 Sat Sep 4 20:49:56 1993
1708 1708
1709 * 0.99pl13 released. 1709 * 0.99pl13 released.
1710 1710
1711 * Throughout: Use check_region/snarf_region for all low-level 1711 * Throughout: Use check_region/snarf_region for all low-level
1712 drivers. 1712 drivers.
1713 1713
1714 * aha1542.c: Do hard reset instead of soft (some ethercard probes 1714 * aha1542.c: Do hard reset instead of soft (some ethercard probes
1715 screw us up). 1715 screw us up).
1716 1716
1717 * scsi.c: Add new flag ASKED_FOR_SENSE so that we can tell if we are 1717 * scsi.c: Add new flag ASKED_FOR_SENSE so that we can tell if we are
1718 in a loop whereby the device returns null sense data. 1718 in a loop whereby the device returns null sense data.
1719 1719
1720 * sd.c: Add code to spin up a drive if it is not already spinning. 1720 * sd.c: Add code to spin up a drive if it is not already spinning.
1721 Do this one at a time to make it easier on power supplies. 1721 Do this one at a time to make it easier on power supplies.
1722 1722
1723 * sd_ioctl.c: Use sync_dev instead of fsync_dev in BLKFLSBUF ioctl. 1723 * sd_ioctl.c: Use sync_dev instead of fsync_dev in BLKFLSBUF ioctl.
1724 1724
1725 * seagate.c: Switch around DATA/CONTROL lines. 1725 * seagate.c: Switch around DATA/CONTROL lines.
1726 1726
1727 * st.c: Change sense to unsigned. 1727 * st.c: Change sense to unsigned.
1728 1728
1729 Thu Aug 5 11:59:18 1993 1729 Thu Aug 5 11:59:18 1993
1730 1730
1731 * 0.99pl12 released. 1731 * 0.99pl12 released.
1732 1732
1733 * constants.c, constants.h: New files with ascii descriptions of 1733 * constants.c, constants.h: New files with ascii descriptions of
1734 various conditions. 1734 various conditions.
1735 1735
1736 * Makefile: Do not try to count the number of low-level drivers, 1736 * Makefile: Do not try to count the number of low-level drivers,
1737 just generate the list of .o files. 1737 just generate the list of .o files.
1738 1738
1739 * aha1542.c: Replace 16 with sizeof(SCpnt->sense_buffer). Add tests 1739 * aha1542.c: Replace 16 with sizeof(SCpnt->sense_buffer). Add tests
1740 for addresses > 16Mb, panic if we find one. 1740 for addresses > 16Mb, panic if we find one.
1741 1741
1742 * aha1740.c: Ditto with sizeof(). 1742 * aha1740.c: Ditto with sizeof().
1743 1743
1744 * fdomain.c: Update to version 3.18. Add new signature, register IRQ 1744 * fdomain.c: Update to version 3.18. Add new signature, register IRQ
1745 with irqaction. Use ID 7 for new board. Be more intelligent about 1745 with irqaction. Use ID 7 for new board. Be more intelligent about
1746 obtaining the h/s/c numbers for biosparam. 1746 obtaining the h/s/c numbers for biosparam.
1747 1747
1748 * hosts.c: Do not depend upon Makefile generated count of the number 1748 * hosts.c: Do not depend upon Makefile generated count of the number
1749 of low-level host adapters. 1749 of low-level host adapters.
1750 1750
1751 * scsi.c: Use array for scsi_command_size instead of a function. Add 1751 * scsi.c: Use array for scsi_command_size instead of a function. Add
1752 Texel cdrom and Maxtor XT-4380S to blacklist. Allow compile time 1752 Texel cdrom and Maxtor XT-4380S to blacklist. Allow compile time
1753 option for no-multi lun scan. Add semaphore for possible problems 1753 option for no-multi lun scan. Add semaphore for possible problems
1754 with handshaking, assume device is faulty until we know it not to be 1754 with handshaking, assume device is faulty until we know it not to be
1755 the case. Add DEBUG_INIT symbol to dump info as we scan for devices. 1755 the case. Add DEBUG_INIT symbol to dump info as we scan for devices.
1756 Zero sense buffer so we can tell if we need to request it. When 1756 Zero sense buffer so we can tell if we need to request it. When
1757 examining sense information, request sense if buffer is all zero. 1757 examining sense information, request sense if buffer is all zero.
1758 If RESET, request sense information to see what to do next. 1758 If RESET, request sense information to see what to do next.
1759 1759
1760 * scsi_debug.c: Change some constants to use symbols like INT_MAX. 1760 * scsi_debug.c: Change some constants to use symbols like INT_MAX.
1761 1761
1762 * scsi_ioctl.c (kernel_scsi_ioctl): New function for making ioctl 1762 * scsi_ioctl.c (kernel_scsi_ioctl): New function for making ioctl
1763 calls from kernel space. 1763 calls from kernel space.
1764 1764
1765 * sd.c: Increase timeout to 300. Use functions in constants.h to 1765 * sd.c: Increase timeout to 300. Use functions in constants.h to
1766 display info. Use scsi_malloc buffer for READ_CAPACITY, since 1766 display info. Use scsi_malloc buffer for READ_CAPACITY, since
1767 we cannot guarantee that a stack based buffer is < 16Mb. 1767 we cannot guarantee that a stack based buffer is < 16Mb.
1768 1768
1769 * sd_ioctl.c: Add BLKFLSBUF ioctl. 1769 * sd_ioctl.c: Add BLKFLSBUF ioctl.
1770 1770
1771 * seagate.c: Add new compile time options for ARBITRATE, 1771 * seagate.c: Add new compile time options for ARBITRATE,
1772 SLOW_HANDSHAKE, and SLOW_RATE. Update assembly loops for transferring 1772 SLOW_HANDSHAKE, and SLOW_RATE. Update assembly loops for transferring
1773 data. Use kernel_scsi_ioctl to request mode page with geometry. 1773 data. Use kernel_scsi_ioctl to request mode page with geometry.
1774 1774
1775 * sr.c: Use functions in constants.c to display messages. 1775 * sr.c: Use functions in constants.c to display messages.
1776 1776
1777 * st.c: Support for variable block size. 1777 * st.c: Support for variable block size.
1778 1778
1779 * ultrastor.c: Do not use cache for tape drives. Set 1779 * ultrastor.c: Do not use cache for tape drives. Set
1780 unchecked_isa_dma flag, even though this may not be needed (gets set 1780 unchecked_isa_dma flag, even though this may not be needed (gets set
1781 later). 1781 later).
1782 1782
1783 Sat Jul 17 18:32:44 1993 1783 Sat Jul 17 18:32:44 1993
1784 1784
1785 * 0.99pl11 released. C++ compilable. 1785 * 0.99pl11 released. C++ compilable.
1786 1786
1787 * Throughout: Add type casts all over the place, and use "ip" instead 1787 * Throughout: Add type casts all over the place, and use "ip" instead
1788 of "info" in the various biosparam functions. 1788 of "info" in the various biosparam functions.
1789 1789
1790 * Makefile: Compile seagate.c with C++ compiler. 1790 * Makefile: Compile seagate.c with C++ compiler.
1791 1791
1792 * aha1542.c: Always set ccb pointer as this gets trashed somehow on 1792 * aha1542.c: Always set ccb pointer as this gets trashed somehow on
1793 some systems. Add a few type casts. Update biosparam function a little. 1793 some systems. Add a few type casts. Update biosparam function a little.
1794 1794
1795 * aha1740.c: Add a few type casts. 1795 * aha1740.c: Add a few type casts.
1796 1796
1797 * fdomain.c: Update to version 3.17 from 3.6. Now works with 1797 * fdomain.c: Update to version 3.17 from 3.6. Now works with
1798 TMC-18C50. 1798 TMC-18C50.
1799 1799
1800 * scsi.c: Minor changes here and there with datatypes. Save use_sg 1800 * scsi.c: Minor changes here and there with datatypes. Save use_sg
1801 when requesting sense information so that this can properly be 1801 when requesting sense information so that this can properly be
1802 restored if we retry the command. Set aside dma buffers assuming each 1802 restored if we retry the command. Set aside dma buffers assuming each
1803 block is 1 page, not 1Kb minix block. 1803 block is 1 page, not 1Kb minix block.
1804 1804
1805 * scsi_ioctl.c: Add a few type casts. Other minor changes. 1805 * scsi_ioctl.c: Add a few type casts. Other minor changes.
1806 1806
1807 * sd.c: Correctly free all scsi_malloc'd memory if we run out of 1807 * sd.c: Correctly free all scsi_malloc'd memory if we run out of
1808 dma_pool. Store blocksize information for each partition. 1808 dma_pool. Store blocksize information for each partition.
1809 1809
1810 * seagate.c: Minor cleanups here and there. 1810 * seagate.c: Minor cleanups here and there.
1811 1811
1812 * sr.c: Set up blocksize array for all discs. Fix bug in freeing 1812 * sr.c: Set up blocksize array for all discs. Fix bug in freeing
1813 buffers if we run out of dma pool. 1813 buffers if we run out of dma pool.
1814 1814
1815 Thu Jun 2 17:58:11 1993 1815 Thu Jun 2 17:58:11 1993
1816 1816
1817 * 0.99pl10 released. 1817 * 0.99pl10 released.
1818 1818
1819 * aha1542.c: Support for BT 445S (VL-bus board with no dma channel). 1819 * aha1542.c: Support for BT 445S (VL-bus board with no dma channel).
1820 1820
1821 * fdomain.c: Upgrade to version 3.6. Preliminary support for TMC-18C50. 1821 * fdomain.c: Upgrade to version 3.6. Preliminary support for TMC-18C50.
1822 1822
1823 * scsi.c: First attempt to fix problem with old_use_sg. Change 1823 * scsi.c: First attempt to fix problem with old_use_sg. Change
1824 NOT_READY to a SUGGEST_ABORT. Fix timeout race where time might 1824 NOT_READY to a SUGGEST_ABORT. Fix timeout race where time might
1825 get decremented past zero. 1825 get decremented past zero.
1826 1826
1827 * sd.c: Add block_fsync function to dispatch table. 1827 * sd.c: Add block_fsync function to dispatch table.
1828 1828
1829 * sr.c: Increase timeout to 500 from 250. Add entry for sync in 1829 * sr.c: Increase timeout to 500 from 250. Add entry for sync in
1830 dispatch table (supply NULL). If we do not have a sectorsize, 1830 dispatch table (supply NULL). If we do not have a sectorsize,
1831 try to get it in the sd_open function. Add new function just to 1831 try to get it in the sd_open function. Add new function just to
1832 obtain sectorsize. 1832 obtain sectorsize.
1833 1833
1834 * sr.h: Add needs_sector_size semaphore. 1834 * sr.h: Add needs_sector_size semaphore.
1835 1835
1836 * st.c: Add NULL for fsync in dispatch table. 1836 * st.c: Add NULL for fsync in dispatch table.
1837 1837
1838 * wd7000.c: Allow another condition for power on that are normal 1838 * wd7000.c: Allow another condition for power on that are normal
1839 and do not require a panic. 1839 and do not require a panic.
1840 1840
1841 Thu Apr 22 23:10:11 1993 1841 Thu Apr 22 23:10:11 1993
1842 1842
1843 * 0.99pl9 released. 1843 * 0.99pl9 released.
1844 1844
1845 * aha1542.c: Use (void) instead of () in setup_mailboxes. 1845 * aha1542.c: Use (void) instead of () in setup_mailboxes.
1846 1846
1847 * scsi.c: Initialize transfersize and underflow fields in SCmd to 0. 1847 * scsi.c: Initialize transfersize and underflow fields in SCmd to 0.
1848 Do not panic for unsupported message bytes. 1848 Do not panic for unsupported message bytes.
1849 1849
1850 * scsi.h: Allocate 12 bytes instead of 10 for commands. Add 1850 * scsi.h: Allocate 12 bytes instead of 10 for commands. Add
1851 transfersize and underflow fields. 1851 transfersize and underflow fields.
1852 1852
1853 * scsi_ioctl.c: Further bugfix to ioctl_probe. 1853 * scsi_ioctl.c: Further bugfix to ioctl_probe.
1854 1854
1855 * sd.c: Use long instead of int for last parameter in sd_ioctl. 1855 * sd.c: Use long instead of int for last parameter in sd_ioctl.
1856 Initialize transfersize and underflow fields. 1856 Initialize transfersize and underflow fields.
1857 1857
1858 * sd_ioctl.c: Ditto for sd_ioctl(,,,,); 1858 * sd_ioctl.c: Ditto for sd_ioctl(,,,,);
1859 1859
1860 * seagate.c: New version from Drew. Includes new signatures for FD 1860 * seagate.c: New version from Drew. Includes new signatures for FD
1861 cards. Support for 0ws jumper. Correctly initialize 1861 cards. Support for 0ws jumper. Correctly initialize
1862 scsi_hosts[hostnum].this_id. Improved handing of 1862 scsi_hosts[hostnum].this_id. Improved handing of
1863 disconnect/reconnect, and support command linking. Use 1863 disconnect/reconnect, and support command linking. Use
1864 transfersize and underflow fields. Support scatter-gather. 1864 transfersize and underflow fields. Support scatter-gather.
1865 1865
1866 * sr.c, sr_ioctl.c: Use long instead of int for last parameter in sr_ioctl. 1866 * sr.c, sr_ioctl.c: Use long instead of int for last parameter in sr_ioctl.
1867 Use buffer and buflength in do_ioctl. Patches from Chris Newbold for 1867 Use buffer and buflength in do_ioctl. Patches from Chris Newbold for
1868 scsi-2 audio commands. 1868 scsi-2 audio commands.
1869 1869
1870 * ultrastor.c: Comment out in_byte (compiler warning). 1870 * ultrastor.c: Comment out in_byte (compiler warning).
1871 1871
1872 * wd7000.c: Change () to (void) in wd7000_enable_dma. 1872 * wd7000.c: Change () to (void) in wd7000_enable_dma.
1873 1873
1874 Wed Mar 31 16:36:25 1993 1874 Wed Mar 31 16:36:25 1993
1875 1875
1876 * 0.99pl8 released. 1876 * 0.99pl8 released.
1877 1877
1878 * aha1542.c: Handle mailboxes better for 1542C. 1878 * aha1542.c: Handle mailboxes better for 1542C.
1879 Do not truncate number of cylinders at 1024 for biosparam call. 1879 Do not truncate number of cylinders at 1024 for biosparam call.
1880 1880
1881 * aha1740.c: Fix a few minor bugs for multiple devices. 1881 * aha1740.c: Fix a few minor bugs for multiple devices.
1882 Same as above for biosparam. 1882 Same as above for biosparam.
1883 1883
1884 * scsi.c: Add lockable semaphore for removable devices that can have 1884 * scsi.c: Add lockable semaphore for removable devices that can have
1885 media removal prevented. Add another signature for flopticals. 1885 media removal prevented. Add another signature for flopticals.
1886 (allocate_device): Fix race condition. Allow more space in dma pool 1886 (allocate_device): Fix race condition. Allow more space in dma pool
1887 for blocksizes of up to 4Kb. 1887 for blocksizes of up to 4Kb.
1888 1888
1889 * scsi.h: Define COMMAND_SIZE. Define a SCSI specific version of 1889 * scsi.h: Define COMMAND_SIZE. Define a SCSI specific version of
1890 INIT_REQUEST that can run with interrupts off. 1890 INIT_REQUEST that can run with interrupts off.
1891 1891
1892 * scsi_ioctl.c: Make ioctl_probe function more idiot-proof. If 1892 * scsi_ioctl.c: Make ioctl_probe function more idiot-proof. If
1893 a removable device says ILLEGAL REQUEST to a door-locking command, 1893 a removable device says ILLEGAL REQUEST to a door-locking command,
1894 clear lockable flag. Add SCSI_IOCTL_GET_IDLUN ioctl. Do not attempt 1894 clear lockable flag. Add SCSI_IOCTL_GET_IDLUN ioctl. Do not attempt
1895 to lock door for devices that do not have lockable semaphore set. 1895 to lock door for devices that do not have lockable semaphore set.
1896 1896
1897 * sd.c: Fix race condition for multiple disks. Use INIT_SCSI_REQUEST 1897 * sd.c: Fix race condition for multiple disks. Use INIT_SCSI_REQUEST
1898 instead of INIT_REQUEST. Allow sector sizes of 1024 and 256. For 1898 instead of INIT_REQUEST. Allow sector sizes of 1024 and 256. For
1899 removable disks that are not ready, mark them as having a media change 1899 removable disks that are not ready, mark them as having a media change
1900 (some drives do not report this later). 1900 (some drives do not report this later).
1901 1901
1902 * seagate.c: Use volatile keyword for memory-mapped register pointers. 1902 * seagate.c: Use volatile keyword for memory-mapped register pointers.
1903 1903
1904 * sr.c: Fix race condition, a la sd.c. Increase the number of retries 1904 * sr.c: Fix race condition, a la sd.c. Increase the number of retries
1905 to 1. Use INIT_SCSI_REQUEST. Allow 512 byte sector sizes. Do a 1905 to 1. Use INIT_SCSI_REQUEST. Allow 512 byte sector sizes. Do a
1906 read_capacity when we init the device so we know the size and 1906 read_capacity when we init the device so we know the size and
1907 sectorsize. 1907 sectorsize.
1908 1908
1909 * st.c: If ioctl not found in st.c, try scsi_ioctl for others. 1909 * st.c: If ioctl not found in st.c, try scsi_ioctl for others.
1910 1910
1911 * ultrastor.c: Do not truncate number of cylinders at 1024 for 1911 * ultrastor.c: Do not truncate number of cylinders at 1024 for
1912 biosparam call. 1912 biosparam call.
1913 1913
1914 * wd7000.c: Ditto. 1914 * wd7000.c: Ditto.
1915 Throughout: Use COMMAND_SIZE macro to determine length of scsi 1915 Throughout: Use COMMAND_SIZE macro to determine length of scsi
1916 command. 1916 command.
1917 1917
1918 1918
1919 1919
1920 Sat Mar 13 17:31:29 1993 1920 Sat Mar 13 17:31:29 1993
1921 1921
1922 * 0.99pl7 released. 1922 * 0.99pl7 released.
1923 1923
1924 Throughout: Improve punctuation in some messages, and use new 1924 Throughout: Improve punctuation in some messages, and use new
1925 verify_area syntax. 1925 verify_area syntax.
1926 1926
1927 * aha1542.c: Handle unexpected interrupts better. 1927 * aha1542.c: Handle unexpected interrupts better.
1928 1928
1929 * scsi.c: Ditto. Handle reset conditions a bit better, asking for 1929 * scsi.c: Ditto. Handle reset conditions a bit better, asking for
1930 sense information and retrying if required. 1930 sense information and retrying if required.
1931 1931
1932 * scsi_ioctl.c: Allow for 12 byte scsi commands. 1932 * scsi_ioctl.c: Allow for 12 byte scsi commands.
1933 1933
1934 * ultrastor.c: Update to use scatter-gather. 1934 * ultrastor.c: Update to use scatter-gather.
1935 1935
1936 Sat Feb 20 17:57:15 1993 1936 Sat Feb 20 17:57:15 1993
1937 1937
1938 * 0.99pl6 released. 1938 * 0.99pl6 released.
1939 1939
1940 * fdomain.c: Update to version 3.5. Handle spurious interrupts 1940 * fdomain.c: Update to version 3.5. Handle spurious interrupts
1941 better. 1941 better.
1942 1942
1943 * sd.c: Use register_blkdev function. 1943 * sd.c: Use register_blkdev function.
1944 1944
1945 * sr.c: Ditto. 1945 * sr.c: Ditto.
1946 1946
1947 * st.c: Use register_chrdev function. 1947 * st.c: Use register_chrdev function.
1948 1948
1949 * wd7000.c: Undo previous change. 1949 * wd7000.c: Undo previous change.
1950 1950
1951 Sat Feb 6 11:20:43 1993 1951 Sat Feb 6 11:20:43 1993
1952 1952
1953 * 0.99pl5 released. 1953 * 0.99pl5 released.
1954 1954
1955 * scsi.c: Fix bug in testing for UNIT_ATTENTION. 1955 * scsi.c: Fix bug in testing for UNIT_ATTENTION.
1956 1956
1957 * wd7000.c: Check at more addresses for bios. Fix bug in biosparam 1957 * wd7000.c: Check at more addresses for bios. Fix bug in biosparam
1958 (heads & sectors turned around). 1958 (heads & sectors turned around).
1959 1959
1960 Wed Jan 20 18:13:59 1993 1960 Wed Jan 20 18:13:59 1993
1961 1961
1962 * 0.99pl4 released. 1962 * 0.99pl4 released.
1963 1963
1964 * scsi.c: Ignore leading spaces when looking for blacklisted devices. 1964 * scsi.c: Ignore leading spaces when looking for blacklisted devices.
1965 1965
1966 * seagate.c: Add a few new signatures for FD cards. Another patch 1966 * seagate.c: Add a few new signatures for FD cards. Another patch
1967 with SCint to fix race condition. Use recursion_depth to keep track 1967 with SCint to fix race condition. Use recursion_depth to keep track
1968 of how many times we have been recursively called, and do not start 1968 of how many times we have been recursively called, and do not start
1969 another command unless we are on the outer level. Fixes bug 1969 another command unless we are on the outer level. Fixes bug
1970 with Syquest cartridge drives (used to crash kernel), because 1970 with Syquest cartridge drives (used to crash kernel), because
1971 they do not disconnect with large data transfers. 1971 they do not disconnect with large data transfers.
1972 1972
1973 Tue Jan 12 14:33:36 1993 1973 Tue Jan 12 14:33:36 1993
1974 1974
1975 * 0.99pl3 released. 1975 * 0.99pl3 released.
1976 1976
1977 * fdomain.c: Update to version 3.3 (a few new signatures). 1977 * fdomain.c: Update to version 3.3 (a few new signatures).
1978 1978
1979 * scsi.c: Add CDU-541, Denon DRD-25X to blacklist. 1979 * scsi.c: Add CDU-541, Denon DRD-25X to blacklist.
1980 (allocate_request, request_queueable): Init request.waiting to NULL if 1980 (allocate_request, request_queueable): Init request.waiting to NULL if
1981 non-buffer type of request. 1981 non-buffer type of request.
1982 1982
1983 * seagate.c: Allow controller to be overridden with CONTROLLER symbol. 1983 * seagate.c: Allow controller to be overridden with CONTROLLER symbol.
1984 Set SCint=NULL when we are done, to remove race condition. 1984 Set SCint=NULL when we are done, to remove race condition.
1985 1985
1986 * st.c: Changes from Kai. 1986 * st.c: Changes from Kai.
1987 1987
1988 Wed Dec 30 20:03:47 1992 1988 Wed Dec 30 20:03:47 1992
1989 1989
1990 * 0.99pl2 released. 1990 * 0.99pl2 released.
1991 1991
1992 * scsi.c: Blacklist back in. Remove Newbury drive as other bugfix 1992 * scsi.c: Blacklist back in. Remove Newbury drive as other bugfix
1993 eliminates need for it here. 1993 eliminates need for it here.
1994 1994
1995 * sd.c: Return ENODEV instead of EACCES if no such device available. 1995 * sd.c: Return ENODEV instead of EACCES if no such device available.
1996 (sd_init) Init blkdev_fops earlier so that sd_open is available sooner. 1996 (sd_init) Init blkdev_fops earlier so that sd_open is available sooner.
1997 1997
1998 * sr.c: Same as above for sd.c. 1998 * sr.c: Same as above for sd.c.
1999 1999
2000 * st.c: Return ENODEV instead of ENXIO if no device. Init chrdev_fops 2000 * st.c: Return ENODEV instead of ENXIO if no device. Init chrdev_fops
2001 sooner, so that it is always there even if no tapes. 2001 sooner, so that it is always there even if no tapes.
2002 2002
2003 * seagate.c (controller_type): New variable to keep track of ST0x or 2003 * seagate.c (controller_type): New variable to keep track of ST0x or
2004 FD. Modify signatures list to indicate controller type, and init 2004 FD. Modify signatures list to indicate controller type, and init
2005 controller_type once we find a match. 2005 controller_type once we find a match.
2006 2006
2007 * wd7000.c (wd7000_set_sync): Remove redundant function. 2007 * wd7000.c (wd7000_set_sync): Remove redundant function.
2008 2008
2009 Sun Dec 20 16:26:24 1992 2009 Sun Dec 20 16:26:24 1992
2010 2010
2011 * 0.99pl1 released. 2011 * 0.99pl1 released.
2012 2012
2013 * scsi_ioctl.c: Bugfix - check dev->index, not dev->id against 2013 * scsi_ioctl.c: Bugfix - check dev->index, not dev->id against
2014 NR_SCSI_DEVICES. 2014 NR_SCSI_DEVICES.
2015 2015
2016 * sr_ioctl.c: Verify that device exists before allowing an ioctl. 2016 * sr_ioctl.c: Verify that device exists before allowing an ioctl.
2017 2017
2018 * st.c: Patches from Kai - change timeout values, improve end of tape 2018 * st.c: Patches from Kai - change timeout values, improve end of tape
2019 handling. 2019 handling.
2020 2020
2021 Sun Dec 13 18:15:23 1992 2021 Sun Dec 13 18:15:23 1992
2022 2022
2023 * 0.99 kernel released. Baseline for this ChangeLog. 2023 * 0.99 kernel released. Baseline for this ChangeLog.
2024 2024
Documentation/scsi/st.txt
1 This file contains brief information about the SCSI tape driver. 1 This file contains brief information about the SCSI tape driver.
2 The driver is currently maintained by Kai Mäkisara (email 2 The driver is currently maintained by Kai Mäkisara (email
3 Kai.Makisara@kolumbus.fi) 3 Kai.Makisara@kolumbus.fi)
4 4
5 Last modified: Mon Mar 7 21:14:44 2005 by kai.makisara 5 Last modified: Mon Mar 7 21:14:44 2005 by kai.makisara
6 6
7 7
8 BASICS 8 BASICS
9 9
10 The driver is generic, i.e., it does not contain any code tailored 10 The driver is generic, i.e., it does not contain any code tailored
11 to any specific tape drive. The tape parameters can be specified with 11 to any specific tape drive. The tape parameters can be specified with
12 one of the following three methods: 12 one of the following three methods:
13 13
14 1. Each user can specify the tape parameters he/she wants to use 14 1. Each user can specify the tape parameters he/she wants to use
15 directly with ioctls. This is administratively a very simple and 15 directly with ioctls. This is administratively a very simple and
16 flexible method and applicable to single-user workstations. However, 16 flexible method and applicable to single-user workstations. However,
17 in a multiuser environment the next user finds the tape parameters in 17 in a multiuser environment the next user finds the tape parameters in
18 state the previous user left them. 18 state the previous user left them.
19 19
20 2. The system manager (root) can define default values for some tape 20 2. The system manager (root) can define default values for some tape
21 parameters, like block size and density using the MTSETDRVBUFFER ioctl. 21 parameters, like block size and density using the MTSETDRVBUFFER ioctl.
22 These parameters can be programmed to come into effect either when a 22 These parameters can be programmed to come into effect either when a
23 new tape is loaded into the drive or if writing begins at the 23 new tape is loaded into the drive or if writing begins at the
24 beginning of the tape. The second method is applicable if the tape 24 beginning of the tape. The second method is applicable if the tape
25 drive performs auto-detection of the tape format well (like some 25 drive performs auto-detection of the tape format well (like some
26 QIC-drives). The result is that any tape can be read, writing can be 26 QIC-drives). The result is that any tape can be read, writing can be
27 continued using existing format, and the default format is used if 27 continued using existing format, and the default format is used if
28 the tape is rewritten from the beginning (or a new tape is written 28 the tape is rewritten from the beginning (or a new tape is written
29 for the first time). The first method is applicable if the drive 29 for the first time). The first method is applicable if the drive
30 does not perform auto-detection well enough and there is a single 30 does not perform auto-detection well enough and there is a single
31 "sensible" mode for the device. An example is a DAT drive that is 31 "sensible" mode for the device. An example is a DAT drive that is
32 used only in variable block mode (I don't know if this is sensible 32 used only in variable block mode (I don't know if this is sensible
33 or not :-). 33 or not :-).
34 34
35 The user can override the parameters defined by the system 35 The user can override the parameters defined by the system
36 manager. The changes persist until the defaults again come into 36 manager. The changes persist until the defaults again come into
37 effect. 37 effect.
38 38
39 3. By default, up to four modes can be defined and selected using the minor 39 3. By default, up to four modes can be defined and selected using the minor
40 number (bits 5 and 6). The number of modes can be changed by changing 40 number (bits 5 and 6). The number of modes can be changed by changing
41 ST_NBR_MODE_BITS in st.h. Mode 0 corresponds to the defaults discussed 41 ST_NBR_MODE_BITS in st.h. Mode 0 corresponds to the defaults discussed
42 above. Additional modes are dormant until they are defined by the 42 above. Additional modes are dormant until they are defined by the
43 system manager (root). When specification of a new mode is started, 43 system manager (root). When specification of a new mode is started,
44 the configuration of mode 0 is used to provide a starting point for 44 the configuration of mode 0 is used to provide a starting point for
45 definition of the new mode. 45 definition of the new mode.
46 46
47 Using the modes allows the system manager to give the users choices 47 Using the modes allows the system manager to give the users choices
48 over some of the buffering parameters not directly accessible to the 48 over some of the buffering parameters not directly accessible to the
49 users (buffered and asynchronous writes). The modes also allow choices 49 users (buffered and asynchronous writes). The modes also allow choices
50 between formats in multi-tape operations (the explicitly overridden 50 between formats in multi-tape operations (the explicitly overridden
51 parameters are reset when a new tape is loaded). 51 parameters are reset when a new tape is loaded).
52 52
53 If more than one mode is used, all modes should contain definitions 53 If more than one mode is used, all modes should contain definitions
54 for the same set of parameters. 54 for the same set of parameters.
55 55
56 Many Unices contain internal tables that associate different modes to 56 Many Unices contain internal tables that associate different modes to
57 supported devices. The Linux SCSI tape driver does not contain such 57 supported devices. The Linux SCSI tape driver does not contain such
58 tables (and will not do that in future). Instead of that, a utility 58 tables (and will not do that in future). Instead of that, a utility
59 program can be made that fetches the inquiry data sent by the device, 59 program can be made that fetches the inquiry data sent by the device,
60 scans its database, and sets up the modes using the ioctls. Another 60 scans its database, and sets up the modes using the ioctls. Another
61 alternative is to make a small script that uses mt to set the defaults 61 alternative is to make a small script that uses mt to set the defaults
62 tailored to the system. 62 tailored to the system.
63 63
64 The driver supports fixed and variable block size (within buffer 64 The driver supports fixed and variable block size (within buffer
65 limits). Both the auto-rewind (minor equals device number) and 65 limits). Both the auto-rewind (minor equals device number) and
66 non-rewind devices (minor is 128 + device number) are implemented. 66 non-rewind devices (minor is 128 + device number) are implemented.
67 67
68 In variable block mode, the byte count in write() determines the size 68 In variable block mode, the byte count in write() determines the size
69 of the physical block on tape. When reading, the drive reads the next 69 of the physical block on tape. When reading, the drive reads the next
70 tape block and returns to the user the data if the read() byte count 70 tape block and returns to the user the data if the read() byte count
71 is at least the block size. Otherwise, error ENOMEM is returned. 71 is at least the block size. Otherwise, error ENOMEM is returned.
72 72
73 In fixed block mode, the data transfer between the drive and the 73 In fixed block mode, the data transfer between the drive and the
74 driver is in multiples of the block size. The write() byte count must 74 driver is in multiples of the block size. The write() byte count must
75 be a multiple of the block size. This is not required when reading but 75 be a multiple of the block size. This is not required when reading but
76 may be advisable for portability. 76 may be advisable for portability.
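
The fixed block mode rule above (the write() byte count must be a multiple
of the block size) can be illustrated with a small helper. This is only a
sketch: the function name is hypothetical, and padding the final block with
zero bytes is an assumption about what an application might want, not
something the driver does.

```python
def pad_to_block_multiple(data: bytes, block_size: int) -> bytes:
    """Return data zero-padded so that its length is a multiple of block_size.

    Hypothetical helper: in fixed block mode the write() byte count must be
    a multiple of the block size, so a short final chunk is padded here.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    remainder = len(data) % block_size
    if remainder == 0:
        return data  # already a whole number of blocks
    return data + b"\0" * (block_size - remainder)
```

An application that must not alter the data length would instead have to
record the real length elsewhere or use variable block mode.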
77 77
78 Support is provided for changing the tape partition and partitioning 78 Support is provided for changing the tape partition and partitioning
79 of the tape with one or two partitions. By default support for 79 of the tape with one or two partitions. By default support for
80 partitioned tape is disabled for each driver and it can be enabled 80 partitioned tape is disabled for each driver and it can be enabled
81 with the ioctl MTSETDRVBUFFER. 81 with the ioctl MTSETDRVBUFFER.
82 82
83 By default the driver writes one filemark when the device is closed after 83 By default the driver writes one filemark when the device is closed after
84 writing and the last operation has been a write. Two filemarks can be 84 writing and the last operation has been a write. Two filemarks can be
85 optionally written. In both cases end of data is signified by 85 optionally written. In both cases end of data is signified by
86 returning zero bytes for two consecutive reads. 86 returning zero bytes for two consecutive reads.
87 87
88 If rewind, offline, bsf, or seek is done and previous tape operation was 88 If rewind, offline, bsf, or seek is done and previous tape operation was
89 write, a filemark is written before moving tape. 89 write, a filemark is written before moving tape.
90 90
91 The compile options are defined in the file linux/drivers/scsi/st_options.h. 91 The compile options are defined in the file linux/drivers/scsi/st_options.h.
92 92
93 4. If the open option O_NONBLOCK is used, open succeeds even if the 93 4. If the open option O_NONBLOCK is used, open succeeds even if the
94 drive is not ready. If O_NONBLOCK is not used, the driver waits for 94 drive is not ready. If O_NONBLOCK is not used, the driver waits for
95 the drive to become ready. If this does not happen in ST_BLOCK_SECONDS 95 the drive to become ready. If this does not happen in ST_BLOCK_SECONDS
96 seconds, open fails with the errno value EIO. With O_NONBLOCK the 96 seconds, open fails with the errno value EIO. With O_NONBLOCK the
97 device can be opened for writing even if there is a write protected 97 device can be opened for writing even if there is a write protected
98 tape in the drive (commands trying to write something return error if 98 tape in the drive (commands trying to write something return error if
99 attempted). 99 attempted).
100 100
101 101
102 MINOR NUMBERS 102 MINOR NUMBERS
103 103
104 The tape driver currently supports 128 drives by default. This number 104 The tape driver currently supports 128 drives by default. This number
105 can be increased by editing st.h and recompiling the driver if 105 can be increased by editing st.h and recompiling the driver if
106 necessary. The upper limit is 2^17 drives if 4 modes for each drive 106 necessary. The upper limit is 2^17 drives if 4 modes for each drive
107 are used. 107 are used.
108 108
109 The minor numbers consist of the following bit fields: 109 The minor numbers consist of the following bit fields:
110 110
111     dev_upper    non-rew     mode    dev-lower 111     dev_upper    non-rew     mode    dev-lower
112     20 -  8         7        6 5      4      0 112     20 -  8         7        6 5      4      0
113 The non-rewind bit is always bit 7 (the uppermost bit in the lowermost 113 The non-rewind bit is always bit 7 (the uppermost bit in the lowermost
114 byte). The bits defining the mode are below the non-rewind bit. The 114 byte). The bits defining the mode are below the non-rewind bit. The
115 remaining bits define the tape device number. This numbering is 115 remaining bits define the tape device number. This numbering is
116 backward compatible with the numbering used when the minor number was 116 backward compatible with the numbering used when the minor number was
117 only 8 bits wide. 117 only 8 bits wide.
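
The bit layout described above can be sketched as follows, assuming the
default of two mode bits (bits 5 and 6). The function and constant names are
illustrative only and do not come from the driver source.

```python
NON_REWIND_BIT = 0x80   # bit 7: set for the non-rewind device
MODE_SHIFT = 5          # the mode occupies bits 5 and 6
MODE_MASK = 0x3
DEV_LOWER_MASK = 0x1F   # bits 0-4: low part of the tape device number


def encode_minor(tape: int, mode: int = 0, non_rewind: bool = False) -> int:
    """Build a minor number from tape number, mode and rewind behaviour."""
    if tape < 0:
        raise ValueError("tape number must be non-negative")
    if not 0 <= mode <= MODE_MASK:
        raise ValueError("mode must be in the range 0..3")
    # The low five bits stay in place; the rest moves above bit 7.
    minor = ((tape >> 5) << 8) | (tape & DEV_LOWER_MASK)
    minor |= mode << MODE_SHIFT
    if non_rewind:
        minor |= NON_REWIND_BIT
    return minor


def decode_minor(minor: int) -> dict:
    """Split a minor number into tape number, mode and non-rewind flag."""
    dev_lower = minor & DEV_LOWER_MASK
    dev_upper = minor >> 8
    return {
        "tape": (dev_upper << 5) | dev_lower,
        "mode": (minor >> MODE_SHIFT) & MODE_MASK,
        "non_rewind": bool(minor & NON_REWIND_BIT),
    }
```

For the first 32 drives in mode 0 this reproduces the old 8-bit scheme: the
auto-rewind minor equals the device number and the non-rewind minor is
128 + device number.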
118 118
119 119
120 SYSFS SUPPORT 120 SYSFS SUPPORT
121 121
122 The driver creates the directory /sys/class/scsi_tape and populates it with 122 The driver creates the directory /sys/class/scsi_tape and populates it with
123 directories corresponding to the existing tape devices. There are autorewind 123 directories corresponding to the existing tape devices. There are autorewind
124 and non-rewind entries for each mode. The names are stxy and nstxy, where x 124 and non-rewind entries for each mode. The names are stxy and nstxy, where x
125 is the tape number and y a character corresponding to the mode (none, l, m, 125 is the tape number and y a character corresponding to the mode (none, l, m,
126 a). For example, the directories for the first tape device are (assuming four 126 a). For example, the directories for the first tape device are (assuming four
127 modes): st0 nst0 st0l nst0l st0m nst0m st0a nst0a. 127 modes): st0 nst0 st0l nst0l st0m nst0m st0a nst0a.
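
The naming scheme above can be reproduced with a short sketch. The function
name is hypothetical, and the four-mode default is assumed.

```python
MODE_SUFFIXES = ("", "l", "m", "a")  # mode 0 has no suffix character


def class_directory_names(tape_number: int) -> list:
    """List the /sys/class/scsi_tape directory names for one tape device."""
    names = []
    for suffix in MODE_SUFFIXES:
        names.append(f"st{tape_number}{suffix}")    # auto-rewind entry
        names.append(f"nst{tape_number}{suffix}")   # non-rewind entry
    return names
```

Calling class_directory_names(0) yields the eight names listed above for the
first tape device, in the same order.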
128 128
129 Each directory contains the entries: default_blksize default_compression 129 Each directory contains the entries: default_blksize default_compression
130 default_density defined dev device driver. The file 'defined' contains 1 130 default_density defined dev device driver. The file 'defined' contains 1
131 if the mode is defined and zero if not defined. The files 'default_*' contain 131 if the mode is defined and zero if not defined. The files 'default_*' contain
132 the defaults set by the user. The value -1 means the default is not set. The 132 the defaults set by the user. The value -1 means the default is not set. The
133 file 'dev' contains the device numbers corresponding to this device. The links 133 file 'dev' contains the device numbers corresponding to this device. The links
134 'device' and 'driver' point to the SCSI device and driver entries. 134 'device' and 'driver' point to the SCSI device and driver entries.
135 135
136 A link named 'tape' is made from the SCSI device directory to the class 136 A link named 'tape' is made from the SCSI device directory to the class
137 directory corresponding to the mode 0 auto-rewind device (e.g., st0). 137 directory corresponding to the mode 0 auto-rewind device (e.g., st0).
138 138
139 139
140 BSD AND SYS V SEMANTICS 140 BSD AND SYS V SEMANTICS
141 141
142 The user can choose between these two behaviours of the tape driver by 142 The user can choose between these two behaviours of the tape driver by
143 defining the value of the symbol ST_SYSV. The semantics differ when a 143 defining the value of the symbol ST_SYSV. The semantics differ when a
144 file being read is closed. The BSD semantics leaves the tape where it 144 file being read is closed. The BSD semantics leaves the tape where it
145 currently is whereas the SYS V semantics moves the tape past the next 145 currently is whereas the SYS V semantics moves the tape past the next
146 filemark unless the filemark has just been crossed. 146 filemark unless the filemark has just been crossed.
147 147
148 The default is BSD semantics. 148 The default is BSD semantics.
149 149
150 150
151 BUFFERING 151 BUFFERING
152 152
153 The driver tries to do transfers directly to/from user space. If this 153 The driver tries to do transfers directly to/from user space. If this
154 is not possible, a driver buffer allocated at run-time is used. If 154 is not possible, a driver buffer allocated at run-time is used. If
155 direct i/o is not possible for the whole transfer, the driver buffer 155 direct i/o is not possible for the whole transfer, the driver buffer
156 is used (i.e., bounce buffers for individual pages are not 156 is used (i.e., bounce buffers for individual pages are not
157 used). Direct i/o can be impossible because of several reasons, e.g.: 157 used). Direct i/o can be impossible because of several reasons, e.g.:
158 - one or more pages are at addresses not reachable by the HBA 158 - one or more pages are at addresses not reachable by the HBA
159 - the number of pages in the transfer exceeds the number of 159 - the number of pages in the transfer exceeds the number of
160 scatter/gather segments permitted by the HBA 160 scatter/gather segments permitted by the HBA
161 - one or more pages can't be locked into memory (should not happen in 161 - one or more pages can't be locked into memory (should not happen in
162 any reasonable situation) 162 any reasonable situation)
163 163
164 The size of the driver buffers is always at least one tape block. In fixed 164 The size of the driver buffers is always at least one tape block. In fixed
165 block mode, the minimum buffer size is defined (in 1024 byte units) by 165 block mode, the minimum buffer size is defined (in 1024 byte units) by
166 ST_FIXED_BUFFER_BLOCKS. With small block size this allows buffering of 166 ST_FIXED_BUFFER_BLOCKS. With small block size this allows buffering of
167 several blocks and using one SCSI read or write to transfer all of the 167 several blocks and using one SCSI read or write to transfer all of the
168 blocks. Buffering of data across write calls in fixed block mode is 168 blocks. Buffering of data across write calls in fixed block mode is
169 allowed if ST_BUFFER_WRITES is non-zero and direct i/o is not used. 169 allowed if ST_BUFFER_WRITES is non-zero and direct i/o is not used.
170 Buffer allocation uses chunks of memory having sizes 2^n * (page 170 Buffer allocation uses chunks of memory having sizes 2^n * (page
171 size). Because of this the actual buffer size may be larger than the 171 size). Because of this the actual buffer size may be larger than the
172 minimum allowable buffer size. 172 minimum allowable buffer size.
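
The effect of the 2^n * (page size) chunk rule on the buffer size can be
sketched as below. This is a simplification that assumes a 4096-byte page and
a buffer made of a single chunk; the function name is illustrative and is not
taken from the driver.

```python
def chunk_size_for(min_bytes: int, page_size: int = 4096) -> int:
    """Return the smallest 2^n * page_size that holds min_bytes."""
    if min_bytes <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    size = page_size
    while size < min_bytes:
        size *= 2  # chunks only come in power-of-two multiples of a page
    return size
```

Under these assumptions a 33 KiB minimum ends up in a 64 KiB chunk, which is
why the actual buffer can be noticeably larger than the requested minimum.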
173 173
174 NOTE that if direct i/o is used, the small writes are not buffered. This may 174 NOTE that if direct i/o is used, the small writes are not buffered. This may
175 cause a surprise when moving from 2.4. There small writes (e.g., tar without 175 cause a surprise when moving from 2.4. There small writes (e.g., tar without
176 -b option) may have had good throughput but this is not true any more with 176 -b option) may have had good throughput but this is not true any more with
177 2.6. Direct i/o can be turned off to solve this problem but a better solution 177 2.6. Direct i/o can be turned off to solve this problem but a better solution
178 is to use bigger write() byte counts (e.g., tar -b 64). 178 is to use bigger write() byte counts (e.g., tar -b 64).
179 179
180 Asynchronous writing. Writing the buffer contents to the tape is 180 Asynchronous writing. Writing the buffer contents to the tape is
181 started and the write call returns immediately. The status is checked 181 started and the write call returns immediately. The status is checked
182 at the next tape operation. Asynchronous writes are not done with 182 at the next tape operation. Asynchronous writes are not done with
183 direct i/o and not in fixed block mode. 183 direct i/o and not in fixed block mode.
184 184
185 Buffered writes and asynchronous writes may in some rare cases cause 185 Buffered writes and asynchronous writes may in some rare cases cause
186 problems in multivolume operations if there is not enough space on the 186 problems in multivolume operations if there is not enough space on the
187 tape after the early-warning mark to flush the driver buffer. 187 tape after the early-warning mark to flush the driver buffer.
188 188
189 Read ahead for fixed block mode (ST_READ_AHEAD). Filling the buffer is 189 Read ahead for fixed block mode (ST_READ_AHEAD). Filling the buffer is
190 attempted even if the user does not want to get all of the data at 190 attempted even if the user does not want to get all of the data at
191 this read command. Should be disabled for those drives that don't like 191 this read command. Should be disabled for those drives that don't like
192 a filemark to truncate a read request or that don't like backspacing. 192 a filemark to truncate a read request or that don't like backspacing.
193 193
194 Scatter/gather buffers (buffers that consist of chunks non-contiguous 194 Scatter/gather buffers (buffers that consist of chunks non-contiguous
195 in the physical memory) are used if contiguous buffers can't be 195 in the physical memory) are used if contiguous buffers can't be
196 allocated. To support all SCSI adapters (including those not 196 allocated. To support all SCSI adapters (including those not
197 supporting scatter/gather), buffer allocation uses the following 197 supporting scatter/gather), buffer allocation uses the following
198 three kinds of chunks: 198 three kinds of chunks:
199 1. The initial segment that is used for all SCSI adapters including 199 1. The initial segment that is used for all SCSI adapters including
200 those not supporting scatter/gather. The size of this buffer will be 200 those not supporting scatter/gather. The size of this buffer will be
201 (PAGE_SIZE << ST_FIRST_ORDER) bytes if the system can give a chunk of 201 (PAGE_SIZE << ST_FIRST_ORDER) bytes if the system can give a chunk of
202 this size (and it is not larger than the buffer size specified by 202 this size (and it is not larger than the buffer size specified by
203 ST_BUFFER_BLOCKS). If this size is not available, the driver halves 203 ST_BUFFER_BLOCKS). If this size is not available, the driver halves
204 the size and tries again until the size of one page. The default 204 the size and tries again until the size of one page. The default
205 settings in st_options.h make the driver try to allocate all of the 205 settings in st_options.h make the driver try to allocate all of the
206 buffer as one chunk. 206 buffer as one chunk.
207 2. The scatter/gather segments to fill the specified buffer size are 207 2. The scatter/gather segments to fill the specified buffer size are
208 allocated so that as many segments as possible are used but the number 208 allocated so that as many segments as possible are used but the number
209 of segments does not exceed ST_FIRST_SG. 209 of segments does not exceed ST_FIRST_SG.
210 3. The remaining segments between ST_MAX_SG (or the module parameter 210 3. The remaining segments between ST_MAX_SG (or the module parameter
211 max_sg_segs) and the number of segments used in phases 1 and 2 211 max_sg_segs) and the number of segments used in phases 1 and 2
212 are used to extend the buffer at run-time if this is necessary. The 212 are used to extend the buffer at run-time if this is necessary. The
213 number of scatter/gather segments allowed for the SCSI adapter is not 213 number of scatter/gather segments allowed for the SCSI adapter is not
214 exceeded if it is smaller than the maximum number of scatter/gather 214 exceeded if it is smaller than the maximum number of scatter/gather
215 segments specified. If the maximum number allowed for the SCSI adapter 215 segments specified. If the maximum number allowed for the SCSI adapter
216 is smaller than the number of segments used in phases 1 and 2, 216 is smaller than the number of segments used in phases 1 and 2,
217 extending the buffer will always fail. 217 extending the buffer will always fail.
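
The halving rule for the initial segment (phase 1 above) can be sketched as follows. This is only a model of the described behaviour, not the driver's code: can_allocate is a hypothetical callback standing in for the kernel allocator, and clamping to the configured buffer size by halving is an assumption about how the "not larger than the buffer size" condition is applied.

```python
def initial_segment_size(page_size, first_order, buffer_size, can_allocate):
    """Return the size in bytes of the initial segment, or None on failure.

    can_allocate(size) is a hypothetical stand-in for the kernel allocator.
    """
    size = page_size << first_order
    # Assumption: a start size above the configured buffer size is halved
    # until it fits (never going below one page).
    while size > buffer_size and size > page_size:
        size //= 2
    # Halve and retry until the allocation succeeds or one page has failed.
    while size >= page_size:
        if can_allocate(size):
            return size
        size //= 2
    return None
```

With 4096-byte pages and an order of 3, the first attempt is 32768 bytes; if only 8192 contiguous bytes are available, the function settles on 8192.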
218 218
219 219
220 EOM BEHAVIOUR WHEN WRITING 220 EOM BEHAVIOUR WHEN WRITING
221 221
222 When the end of medium early warning is encountered, the current write 222 When the end of medium early warning is encountered, the current write
223 is finished and the number of bytes is returned. The next write 223 is finished and the number of bytes is returned. The next write
224 returns -1 and errno is set to ENOSPC. To enable writing a trailer, 224 returns -1 and errno is set to ENOSPC. To enable writing a trailer,
225 the next write is allowed to proceed and, if successful, the number of 225 the next write is allowed to proceed and, if successful, the number of
226 bytes is returned. After this, -1 and the number of bytes are 226 bytes is returned. After this, -1 and the number of bytes are
227 alternately returned until the physical end of medium (or some other 227 alternately returned until the physical end of medium (or some other
228 error) is encountered. 228 error) is encountered.
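
From the application side, the sequence above might be handled roughly like this. It is a sketch under assumptions: write is a hypothetical callable that behaves like os.write() on the tape device (returns the byte count, raises OSError with errno set on failure), and the trailer is assumed to fit in a single write.

```python
import errno


def write_volume(write, blocks, trailer):
    """Write blocks until the early warning, then write the trailer.

    Returns the number of blocks that reached this volume; the remaining
    blocks are left for the caller to put on the next volume.
    """
    for index, block in enumerate(blocks):
        try:
            write(block)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise
            # The write following ENOSPC is allowed to proceed, so it is
            # used for the trailer. The rejected block was not written.
            write(trailer)
            return index
    return len(blocks)
```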
229 229
230 230
231 MODULE PARAMETERS 231 MODULE PARAMETERS
232 232
233 The buffer size, write threshold, and the maximum number of allocated buffers 233 The buffer size, write threshold, and the maximum number of allocated buffers
234 are configurable when the driver is loaded as a module. The keywords are: 234 are configurable when the driver is loaded as a module. The keywords are:
235 235
236 buffer_kbs=xxx the buffer size for fixed block mode is set 236 buffer_kbs=xxx the buffer size for fixed block mode is set
237 to xxx kilobytes 237 to xxx kilobytes
238 write_threshold_kbs=xxx the write threshold in kilobytes set to xxx 238 write_threshold_kbs=xxx the write threshold in kilobytes set to xxx
239 max_sg_segs=xxx the maximum number of scatter/gather 239 max_sg_segs=xxx the maximum number of scatter/gather
240 segments 240 segments
241 try_direct_io=x try direct transfer between user buffer and 241 try_direct_io=x try direct transfer between user buffer and
242 tape drive if this is non-zero 242 tape drive if this is non-zero
243 243
244 Note that if the buffer size is changed but the write threshold is not 244 Note that if the buffer size is changed but the write threshold is not
245 set, the write threshold is set to the new buffer size - 2 kB. 245 set, the write threshold is set to the new buffer size - 2 kB.
246 246
247 247
248 BOOT TIME CONFIGURATION 248 BOOT TIME CONFIGURATION
249 249
250 If the driver is compiled into the kernel, the same parameters can be 250 If the driver is compiled into the kernel, the same parameters can be
251 also set using, e.g., the LILO command line. The preferred syntax is 251 also set using, e.g., the LILO command line. The preferred syntax is
252 is to use the same keyword used when loading as module but prepended 252 to use the same keyword used when loading as module but prepended
253 with 'st.'. For instance, to set the maximum number of scatter/gather 253 with 'st.'. For instance, to set the maximum number of scatter/gather
254 segments, the parameter 'st.max_sg_segs=xx' should be used (xx is the 254 segments, the parameter 'st.max_sg_segs=xx' should be used (xx is the
255 number of scatter/gather segments). 255 number of scatter/gather segments).
256 256
257 For compatibility, the old syntax from early 2.5 and 2.4 kernel 257 For compatibility, the old syntax from early 2.5 and 2.4 kernel
258 versions is supported. The same keywords can be used as when loading 258 versions is supported. The same keywords can be used as when loading
259 the driver as module. If several parameters are set, the keyword-value 259 the driver as module. If several parameters are set, the keyword-value
260 pairs are separated with a comma (no spaces allowed). A colon can be 260 pairs are separated with a comma (no spaces allowed). A colon can be
261 used instead of the equal mark. The definition is prepended by the 261 used instead of the equal mark. The definition is prepended by the
262 string st=. Here is an example: 262 string st=. Here is an example:
263 263
264 st=buffer_kbs:64,write_threshold_kbs:60 264 st=buffer_kbs:64,write_threshold_kbs:60
265 265
266 The following syntax used by the old kernel versions is also supported: 266 The following syntax used by the old kernel versions is also supported:
267 267
268 st=aa[,bb[,dd]] 268 st=aa[,bb[,dd]]
269 269
270 where 270 where
271 aa is the buffer size for fixed block mode in 1024 byte units 271 aa is the buffer size for fixed block mode in 1024 byte units
272 bb is the write threshold in 1024 byte units 272 bb is the write threshold in 1024 byte units
273 dd is the maximum number of scatter/gather segments 273 dd is the maximum number of scatter/gather segments
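
The two st= syntaxes can be illustrated with a small parser. This is not the kernel's parser, only a sketch of the syntax as described: the function name is made up, the positional fields aa, bb and dd are mapped to the module parameter names by their descriptions, and applying the "buffer size - 2 kB" threshold default here is an assumption carried over from the module parameter section.

```python
def parse_st_option(arg):
    """Parse the text following 'st=' into a dict of integer settings."""
    names = ("buffer_kbs", "write_threshold_kbs", "max_sg_segs")
    fields = arg.split(",")
    if fields[0].isdigit():
        # Old positional syntax: aa[,bb[,dd]]
        result = {name: int(value) for name, value in zip(names, fields)}
    else:
        result = {}
        for field in fields:
            # A colon can be used instead of the equal mark.
            key, _, value = field.replace(":", "=", 1).partition("=")
            result[key] = int(value)
    # Assumption: an unset write threshold follows the buffer size minus 2 kB.
    if "buffer_kbs" in result and "write_threshold_kbs" not in result:
        result["write_threshold_kbs"] = result["buffer_kbs"] - 2
    return result
```

For example, parse_st_option("64,60,16") yields a 64 kB buffer, a 60 kB write threshold and 16 scatter/gather segments.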
274 274
275 275
276 IOCTLS 276 IOCTLS
277 277
278 The tape is positioned and the drive parameters are set with ioctls 278 The tape is positioned and the drive parameters are set with ioctls
279 defined in mtio.h. The tape control program 'mt' uses these ioctls. Try 279 defined in mtio.h. The tape control program 'mt' uses these ioctls. Try
280 to find an mt that supports all of the Linux SCSI tape ioctls and 280 to find an mt that supports all of the Linux SCSI tape ioctls and
281 opens the device for writing if the tape contents will be modified 281 opens the device for writing if the tape contents will be modified
282 (look for a package mt-st* from the Linux ftp sites; the GNU mt does 282 (look for a package mt-st* from the Linux ftp sites; the GNU mt does
283 not open for writing for, e.g., erase). 283 not open for writing for, e.g., erase).
284 284
285 The supported ioctls are: 285 The supported ioctls are:
286 286
287 The following use the structure mtop: 287 The following use the structure mtop:
288 288
289 MTFSF Space forward over count filemarks. Tape positioned after filemark. 289 MTFSF Space forward over count filemarks. Tape positioned after filemark.
290 MTFSFM As above but tape positioned before filemark. 290 MTFSFM As above but tape positioned before filemark.
291 MTBSF Space backward over count filemarks. Tape positioned before 291 MTBSF Space backward over count filemarks. Tape positioned before
292 filemark. 292 filemark.
293 MTBSFM As above but tape positioned after filemark. 293 MTBSFM As above but tape positioned after filemark.
294 MTFSR Space forward over count records. 294 MTFSR Space forward over count records.
295 MTBSR Space backward over count records. 295 MTBSR Space backward over count records.
296 MTFSS Space forward over count setmarks. 296 MTFSS Space forward over count setmarks.
297 MTBSS Space backward over count setmarks. 297 MTBSS Space backward over count setmarks.
298 MTWEOF Write count filemarks. 298 MTWEOF Write count filemarks.
299 MTWSM Write count setmarks. 299 MTWSM Write count setmarks.
300 MTREW Rewind tape. 300 MTREW Rewind tape.
301 MTOFFL Set device off line (often rewind plus eject). 301 MTOFFL Set device off line (often rewind plus eject).
302 MTNOP Do nothing except flush the buffers. 302 MTNOP Do nothing except flush the buffers.
303 MTRETEN Re-tension tape. 303 MTRETEN Re-tension tape.
304 MTEOM Space to end of recorded data. 304 MTEOM Space to end of recorded data.
305 MTERASE Erase tape. If the argument is zero, the short erase command 305 MTERASE Erase tape. If the argument is zero, the short erase command
306 is used. The long erase command is used with all other values 306 is used. The long erase command is used with all other values
307 of the argument. 307 of the argument.
308 MTSEEK Seek to tape block count. Uses Tandberg-compatible seek (QFA) 308 MTSEEK Seek to tape block count. Uses Tandberg-compatible seek (QFA)
309 for SCSI-1 drives and SCSI-2 seek for SCSI-2 drives. The file and 309 for SCSI-1 drives and SCSI-2 seek for SCSI-2 drives. The file and
310 block numbers in the status are not valid after a seek. 310 block numbers in the status are not valid after a seek.
311 MTSETBLK Set the drive block size. Setting to zero sets the drive into 311 MTSETBLK Set the drive block size. Setting to zero sets the drive into
312 variable block mode (if applicable). 312 variable block mode (if applicable).
313 MTSETDENSITY Sets the drive density code to arg. See drive 313 MTSETDENSITY Sets the drive density code to arg. See drive
314 documentation for available codes. 314 documentation for available codes.
315 MTLOCK and MTUNLOCK Explicitly lock/unlock the tape drive door. 315 MTLOCK and MTUNLOCK Explicitly lock/unlock the tape drive door.
316 MTLOAD and MTUNLOAD Explicitly load and unload the tape. If the 316 MTLOAD and MTUNLOAD Explicitly load and unload the tape. If the
317 command argument x is between MT_ST_HPLOADER_OFFSET + 1 and 317 command argument x is between MT_ST_HPLOADER_OFFSET + 1 and
318 MT_ST_HPLOADER_OFFSET + 6, the number x is sent to the 318 MT_ST_HPLOADER_OFFSET + 6, the number x is sent to the
319 drive with the command and it selects the tape slot to use of the 319 drive with the command and it selects the tape slot to use of the
320 HP C1553A changer. 320 HP C1553A changer.
321 MTCOMPRESSION Sets compressing or uncompressing drive mode using the 321 MTCOMPRESSION Sets compressing or uncompressing drive mode using the
322 SCSI mode page 15. Note that some drives use other methods for 322 SCSI mode page 15. Note that some drives use other methods for
323 control of compression. Some drives (like the Exabytes) use 323 control of compression. Some drives (like the Exabytes) use
324 density codes for compression control. Some drives use another 324 density codes for compression control. Some drives use another
325 mode page but this page has not been implemented in the 325 mode page but this page has not been implemented in the
326 driver. Some drives without compression capability will accept 326 driver. Some drives without compression capability will accept
327 any compression mode without error. 327 any compression mode without error.
328 MTSETPART Moves the tape to the partition given by the argument at the 328 MTSETPART Moves the tape to the partition given by the argument at the
329 next tape operation. The block at which the tape is positioned 329 next tape operation. The block at which the tape is positioned
330 is the block where the tape was previously positioned in the 330 is the block where the tape was previously positioned in the
331 new active partition unless the next tape operation is 331 new active partition unless the next tape operation is
332 MTSEEK. In this case the tape is moved directly to the block 332 MTSEEK. In this case the tape is moved directly to the block
333 specified by MTSEEK. MTSETPART is inactive unless 333 specified by MTSEEK. MTSETPART is inactive unless
334 MT_ST_CAN_PARTITIONS set. 334 MT_ST_CAN_PARTITIONS set.
335 MTMKPART Formats the tape with one partition (argument zero) or two 335 MTMKPART Formats the tape with one partition (argument zero) or two
336 partitions (the argument gives in megabytes the size of 336 partitions (the argument gives in megabytes the size of
337 partition 1 that is physically the first partition of the 337 partition 1 that is physically the first partition of the
338 tape). The drive has to support partitions with size specified 338 tape). The drive has to support partitions with size specified
339 by the initiator. Inactive unless MT_ST_CAN_PARTITIONS set. 339 by the initiator. Inactive unless MT_ST_CAN_PARTITIONS set.
340 MTSETDRVBUFFER 340 MTSETDRVBUFFER
341 Is used for several purposes. The command is obtained from count 341 Is used for several purposes. The command is obtained from count
342 with mask MT_ST_OPTIONS, the low order bits are used as argument. 342 with mask MT_ST_OPTIONS, the low order bits are used as argument.
343 This command is only allowed for the superuser (root). The 343 This command is only allowed for the superuser (root). The
344 subcommands are: 344 subcommands are:
345 0 345 0
346 The drive buffer option is set to the argument. Zero means 346 The drive buffer option is set to the argument. Zero means
347 no buffering. 347 no buffering.
348 MT_ST_BOOLEANS 348 MT_ST_BOOLEANS
349 Sets the buffering options. The bits are the new states 349 Sets the buffering options. The bits are the new states
350 (enabled/disabled) of the following options (in the 350 (enabled/disabled) of the following options (in the
351 parenthesis is specified whether the option is global or 351 parenthesis is specified whether the option is global or
352 can be specified differently for each mode): 352 can be specified differently for each mode):
353 MT_ST_BUFFER_WRITES write buffering (mode) 353 MT_ST_BUFFER_WRITES write buffering (mode)
354 MT_ST_ASYNC_WRITES asynchronous writes (mode) 354 MT_ST_ASYNC_WRITES asynchronous writes (mode)
355 MT_ST_READ_AHEAD read ahead (mode) 355 MT_ST_READ_AHEAD read ahead (mode)
356 MT_ST_TWO_FM writing of two filemarks (global) 356 MT_ST_TWO_FM writing of two filemarks (global)
357 MT_ST_FAST_EOM using the SCSI spacing to EOD (global) 357 MT_ST_FAST_EOM using the SCSI spacing to EOD (global)
358 MT_ST_AUTO_LOCK automatic locking of the drive door (global) 358 MT_ST_AUTO_LOCK automatic locking of the drive door (global)
359 MT_ST_DEF_WRITES the defaults are meant only for writes (mode) 359 MT_ST_DEF_WRITES the defaults are meant only for writes (mode)
360 MT_ST_CAN_BSR backspacing over more than one record can 360 MT_ST_CAN_BSR backspacing over more than one record can
361 be used for repositioning the tape (global) 361 be used for repositioning the tape (global)
362 MT_ST_NO_BLKLIMS the driver does not ask the block limits 362 MT_ST_NO_BLKLIMS the driver does not ask the block limits
363 from the drive (block size can be changed only to 363 from the drive (block size can be changed only to
364 variable) (global) 364 variable) (global)
365 MT_ST_CAN_PARTITIONS enables support for partitioned 365 MT_ST_CAN_PARTITIONS enables support for partitioned
366 tapes (global) 366 tapes (global)
367 MT_ST_SCSI2LOGICAL the logical block number is used in 367 MT_ST_SCSI2LOGICAL the logical block number is used in
368 the MTSEEK and MTIOCPOS for SCSI-2 drives instead of 368 the MTSEEK and MTIOCPOS for SCSI-2 drives instead of
369 the device dependent address. It is recommended to set 369 the device dependent address. It is recommended to set
370 this flag unless there are tapes using the device 370 this flag unless there are tapes using the device
371 dependent addresses (from the old times) (global) 371 dependent addresses (from the old times) (global)
372 MT_ST_SYSV sets the SYSV semantics (mode) 372 MT_ST_SYSV sets the SYSV semantics (mode)
373 MT_ST_NOWAIT enables immediate mode (i.e., don't wait for 373 MT_ST_NOWAIT enables immediate mode (i.e., don't wait for
374 the command to finish) for some commands (e.g., rewind) 374 the command to finish) for some commands (e.g., rewind)
375 MT_ST_DEBUGGING debugging (global; debugging must be 375 MT_ST_DEBUGGING debugging (global; debugging must be
376 compiled into the driver) 376 compiled into the driver)
377 MT_ST_SETBOOLEANS 377 MT_ST_SETBOOLEANS
378 MT_ST_CLEARBOOLEANS 378 MT_ST_CLEARBOOLEANS
379 Sets or clears the option bits. 379 Sets or clears the option bits.
380 MT_ST_WRITE_THRESHOLD 380 MT_ST_WRITE_THRESHOLD
381 Sets the write threshold for this device to kilobytes 381 Sets the write threshold for this device to kilobytes
382 specified by the lowest bits. 382 specified by the lowest bits.
383 MT_ST_DEF_BLKSIZE 383 MT_ST_DEF_BLKSIZE
384 Defines the default block size set automatically. Value 384 Defines the default block size set automatically. Value
385 0xffffff means that the default is not used any more. 385 0xffffff means that the default is not used any more.
386 MT_ST_DEF_DENSITY 386 MT_ST_DEF_DENSITY
387 MT_ST_DEF_DRVBUFFER 387 MT_ST_DEF_DRVBUFFER
388 Used to set or clear the density (8 bits), and drive buffer 388 Used to set or clear the density (8 bits), and drive buffer
389 state (3 bits). If the value is MT_ST_CLEAR_DEFAULT 389 state (3 bits). If the value is MT_ST_CLEAR_DEFAULT
390 (0xfffff) the default will not be used any more. Otherwise 390 (0xfffff) the default will not be used any more. Otherwise
391 the lowermost bits of the value contain the new value of 391 the lowermost bits of the value contain the new value of
392 the parameter. 392 the parameter.
393 MT_ST_DEF_COMPRESSION 393 MT_ST_DEF_COMPRESSION
394 The compression default will not be used if the value of 394 The compression default will not be used if the value of
395 the lowermost byte is 0xff. Otherwise the lowermost bit 395 the lowermost byte is 0xff. Otherwise the lowermost bit
396 contains the new default. If the bits 8-15 are set to a 396 contains the new default. If the bits 8-15 are set to a
397 non-zero number, and this number is not 0xff, the number is 397 non-zero number, and this number is not 0xff, the number is
398 used as the compression algorithm. The value 398 used as the compression algorithm. The value
399 MT_ST_CLEAR_DEFAULT can be used to clear the compression 399 MT_ST_CLEAR_DEFAULT can be used to clear the compression
400 default. 400 default.
401 MT_ST_SET_TIMEOUT 401 MT_ST_SET_TIMEOUT
402 Set the normal timeout in seconds for this device. The 402 Set the normal timeout in seconds for this device. The
403 default is 900 seconds (15 minutes). The timeout should be 403 default is 900 seconds (15 minutes). The timeout should be
404 long enough for the retries done by the device while 404 long enough for the retries done by the device while
405 reading/writing. 405 reading/writing.
406 MT_ST_SET_LONG_TIMEOUT 406 MT_ST_SET_LONG_TIMEOUT
407 Set the long timeout that is used for operations that are 407 Set the long timeout that is used for operations that are
408 known to take a long time. The default is 14000 seconds 408 known to take a long time. The default is 14000 seconds
409 (3.9 hours). For erase this value is further multiplied by 409 (3.9 hours). For erase this value is further multiplied by
410 eight. 410 eight.
411 MT_ST_SET_CLN 411 MT_ST_SET_CLN
412 Set the cleaning request interpretation parameters using 412 Set the cleaning request interpretation parameters using
413 the lowest 24 bits of the argument. The driver can set the 413 the lowest 24 bits of the argument. The driver can set the
414 generic status bit GMT_CLN if a cleaning request bit pattern 414 generic status bit GMT_CLN if a cleaning request bit pattern
415 is found from the extended sense data. Many drives set one or 415 is found from the extended sense data. Many drives set one or
416 more bits in the extended sense data when the drive needs 416 more bits in the extended sense data when the drive needs
417 cleaning. The bits are device-dependent. The driver is 417 cleaning. The bits are device-dependent. The driver is
418 given the number of the sense data byte (the lowest eight 418 given the number of the sense data byte (the lowest eight
419 bits of the argument; must be >= 18 (values 1 - 17 419 bits of the argument; must be >= 18 (values 1 - 17
420 reserved) and <= the maximum requested sense data size), 420 reserved) and <= the maximum requested sense data size),
421 a mask to select the relevant bits (the bits 9-16), and the 421 a mask to select the relevant bits (the bits 9-16), and the
422 bit pattern (bits 17-23). If the bit pattern is zero, one 422 bit pattern (bits 17-23). If the bit pattern is zero, one
423 or more bits under the mask indicate cleaning request. If 423 or more bits under the mask indicate cleaning request. If
424 the pattern is non-zero, the pattern must match the masked 424 the pattern is non-zero, the pattern must match the masked
425 sense data byte. 425 sense data byte.
426 426
427 (The cleaning bit is set if the additional sense code and 427 (The cleaning bit is set if the additional sense code and
428 qualifier 00h 17h are seen regardless of the setting of 428 qualifier 00h 17h are seen regardless of the setting of
429 MT_ST_SET_CLN.) 429 MT_ST_SET_CLN.)
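
The MT_ST_SET_CLN argument layout and the matching rule can be sketched as below. Several points are assumptions rather than statements from the text: the bit ranges are read as 1-based (byte number in the lowest eight bits, mask in the next eight, pattern in the eight above that), the byte number is used as a zero-based index into the sense data, and both function names are made up for illustration. The subcommand code that the driver expects in the high bits of the ioctl argument is not included.

```python
def pack_cln(sense_byte, mask, pattern):
    """Pack the 24-bit cleaning-request parameters into one integer."""
    if not 18 <= sense_byte <= 0xFF:
        raise ValueError("sense byte number must be >= 18 and fit in 8 bits")
    if not (0 <= mask <= 0xFF and 0 <= pattern <= 0xFF):
        raise ValueError("mask and pattern must each fit in 8 bits")
    return sense_byte | (mask << 8) | (pattern << 16)


def cleaning_requested(arg, sense_data):
    """Apply the rule described above to one extended sense buffer."""
    index = arg & 0xFF
    mask = (arg >> 8) & 0xFF
    pattern = (arg >> 16) & 0xFF
    if index >= len(sense_data):
        return False
    masked = sense_data[index] & mask
    if pattern == 0:
        # Zero pattern: any bit set under the mask is a cleaning request.
        return masked != 0
    # Non-zero pattern: the masked byte has to match it exactly.
    return masked == pattern
```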
430 430
431 The following ioctl uses the structure mtpos: 431 The following ioctl uses the structure mtpos:
432 MTIOCPOS Reads the current position from the drive. Uses 432 MTIOCPOS Reads the current position from the drive. Uses
433 Tandberg-compatible QFA for SCSI-1 drives and the SCSI-2 433 Tandberg-compatible QFA for SCSI-1 drives and the SCSI-2
434 command for the SCSI-2 drives. 434 command for the SCSI-2 drives.
435 435
436 The following ioctl uses the structure mtget to return the status: 436 The following ioctl uses the structure mtget to return the status:
437 MTIOCGET Returns some status information. 437 MTIOCGET Returns some status information.
438 The file number and block number within file are returned. The 438 The file number and block number within file are returned. The
439 block is -1 when it can't be determined (e.g., after MTBSF). 439 block is -1 when it can't be determined (e.g., after MTBSF).
440 The drive type is either MTISSCSI1 or MTISSCSI2. 440 The drive type is either MTISSCSI1 or MTISSCSI2.
441 The number of recovered errors since the previous status call 441 The number of recovered errors since the previous status call
442 is stored in the lower word of the field mt_erreg. 442 is stored in the lower word of the field mt_erreg.
443 The current block size and the density code are stored in the field 443 The current block size and the density code are stored in the field
444 mt_dsreg (shifts for the subfields are MT_ST_BLKSIZE_SHIFT and 444 mt_dsreg (shifts for the subfields are MT_ST_BLKSIZE_SHIFT and
445 MT_ST_DENSITY_SHIFT). 445 MT_ST_DENSITY_SHIFT).
446 The GMT_xxx status bits reflect the drive status. GMT_DR_OPEN 446 The GMT_xxx status bits reflect the drive status. GMT_DR_OPEN
447 is set if there is no tape in the drive. GMT_EOD means either 447 is set if there is no tape in the drive. GMT_EOD means either
448 end of recorded data or end of tape. GMT_EOT means end of tape. 448 end of recorded data or end of tape. GMT_EOT means end of tape.
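
Unpacking the mtget fields described above might look like this. The numeric shift and mask values are the ones I believe linux/mtio.h defines and should be checked against the header in use; the function name and the dict layout are only illustrative.

```python
# Assumed values; verify against linux/mtio.h.
MT_ST_BLKSIZE_SHIFT = 0
MT_ST_BLKSIZE_MASK = 0xFFFFFF
MT_ST_DENSITY_SHIFT = 24
MT_ST_DENSITY_MASK = 0xFF000000


def decode_status(mt_dsreg, mt_erreg):
    """Split mt_dsreg and mt_erreg into the subfields described above."""
    return {
        "block_size": (mt_dsreg & MT_ST_BLKSIZE_MASK) >> MT_ST_BLKSIZE_SHIFT,
        "density": (mt_dsreg & MT_ST_DENSITY_MASK) >> MT_ST_DENSITY_SHIFT,
        # Recovered errors since the previous status call: lower word.
        "recovered_errors": mt_erreg & 0xFFFF,
    }
```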
449 449
450 450
451 MISCELLANEOUS COMPILE OPTIONS 451 MISCELLANEOUS COMPILE OPTIONS
452 452
453 The recovered write errors are considered fatal if ST_RECOVERED_WRITE_FATAL 453 The recovered write errors are considered fatal if ST_RECOVERED_WRITE_FATAL
454 is defined. 454 is defined.
455 455
456 The maximum number of tape devices is determined by the define 456 The maximum number of tape devices is determined by the define
457 ST_MAX_TAPES. If more tapes are detected at driver initialization, the 457 ST_MAX_TAPES. If more tapes are detected at driver initialization, the
458 maximum is adjusted accordingly. 458 maximum is adjusted accordingly.
459 459
460 Immediate return from tape positioning SCSI commands can be enabled by 460 Immediate return from tape positioning SCSI commands can be enabled by
461 defining ST_NOWAIT. If this is defined, the user should take care that 461 defining ST_NOWAIT. If this is defined, the user should take care that
462 the next tape operation is not started before the previous one has 462 the next tape operation is not started before the previous one has
463 finished. The drives and SCSI adapters should handle this condition 463 finished. The drives and SCSI adapters should handle this condition
464 gracefully, but some drive/adapter combinations are known to hang the 464 gracefully, but some drive/adapter combinations are known to hang the
465 SCSI bus in this case. 465 SCSI bus in this case.
466 466
467 The MTEOM command is by default implemented as spacing over 32767 467 The MTEOM command is by default implemented as spacing over 32767
468 filemarks. With this method the file number in the status is 468 filemarks. With this method the file number in the status is
469 correct. The user can request using direct spacing to EOD by setting 469 correct. The user can request using direct spacing to EOD by setting
470 ST_FAST_EOM 1 (or using the MT_ST_OPTIONS ioctl). In this case the file 470 ST_FAST_EOM 1 (or using the MT_ST_OPTIONS ioctl). In this case the file
471 number will be invalid. 471 number will be invalid.
472 472
473 When using read ahead or buffered writes the position within the file 473 When using read ahead or buffered writes the position within the file
474 may not be correct after the file is closed (correct position may 474 may not be correct after the file is closed (correct position may
475 require backspacing over more than one record). The correct position 475 require backspacing over more than one record). The correct position
476 within file can be obtained if ST_IN_FILE_POS is defined at compile 476 within file can be obtained if ST_IN_FILE_POS is defined at compile
477 time or the MT_ST_CAN_BSR bit is set for the drive with an ioctl. 477 time or the MT_ST_CAN_BSR bit is set for the drive with an ioctl.
478 (The driver always backs over a filemark crossed by read ahead if the 478 (The driver always backs over a filemark crossed by read ahead if the
479 user does not request data that far.) 479 user does not request data that far.)
480 480
481 481
482 DEBUGGING HINTS 482 DEBUGGING HINTS
483 483
484 To enable debugging messages, edit st.c and #define DEBUG 1. As seen 484 To enable debugging messages, edit st.c and #define DEBUG 1. As seen
485 above, debugging can be switched off with an ioctl if debugging is 485 above, debugging can be switched off with an ioctl if debugging is
486 compiled into the driver. The debugging output is not voluminous. 486 compiled into the driver. The debugging output is not voluminous.
487 487
488 If the tape seems to hang, I would be very interested to hear where 488 If the tape seems to hang, I would be very interested to hear where
489 the driver is waiting. With the command 'ps -l' you can see the state 489 the driver is waiting. With the command 'ps -l' you can see the state
490 of the process using the tape. If the state is D, the process is 490 of the process using the tape. If the state is D, the process is
491 waiting for something. The field WCHAN tells where the driver is 491 waiting for something. The field WCHAN tells where the driver is
492 waiting. If you have the current System.map in the correct place (in 492 waiting. If you have the current System.map in the correct place (in
493 /boot for the procps I use) or have updated /etc/psdatabase (for kmem 493 /boot for the procps I use) or have updated /etc/psdatabase (for kmem
494 ps), ps writes the function name in the WCHAN field. If not, you have 494 ps), ps writes the function name in the WCHAN field. If not, you have
495 to look up the function from System.map. 495 to look up the function from System.map.
496 496
497 Note also that the timeouts are very long compared to most other 497 Note also that the timeouts are very long compared to most other
498 drivers. This means that the Linux driver may appear hung although the 498 drivers. This means that the Linux driver may appear hung although the
499 real reason is that the tape firmware has got confused. 499 real reason is that the tape firmware has got confused.
500 500
Documentation/sound/alsa/DocBook/writing-an-alsa-driver.tmpl
1 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.1//EN"> 1 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.1//EN">
2 2
3 <book> 3 <book>
4 <?dbhtml filename="index.html"> 4 <?dbhtml filename="index.html">
5 5
6 <!-- ****************************************************** --> 6 <!-- ****************************************************** -->
7 <!-- Header --> 7 <!-- Header -->
8 <!-- ****************************************************** --> 8 <!-- ****************************************************** -->
9 <bookinfo> 9 <bookinfo>
10 <title>Writing an ALSA Driver</title> 10 <title>Writing an ALSA Driver</title>
11 <author> 11 <author>
12 <firstname>Takashi</firstname> 12 <firstname>Takashi</firstname>
13 <surname>Iwai</surname> 13 <surname>Iwai</surname>
14 <affiliation> 14 <affiliation>
15 <address> 15 <address>
16 <email>tiwai@suse.de</email> 16 <email>tiwai@suse.de</email>
17 </address> 17 </address>
18 </affiliation> 18 </affiliation>
19 </author> 19 </author>
20 20
21 <date>November 17, 2005</date> 21 <date>November 17, 2005</date>
22 <edition>0.3.6</edition> 22 <edition>0.3.6</edition>
23 23
24 <abstract> 24 <abstract>
25 <para> 25 <para>
26 This document describes how to write an ALSA (Advanced Linux 26 This document describes how to write an ALSA (Advanced Linux
27 Sound Architecture) driver. 27 Sound Architecture) driver.
28 </para> 28 </para>
29 </abstract> 29 </abstract>
30 30
31 <legalnotice> 31 <legalnotice>
32 <para> 32 <para>
33 Copyright (c) 2002-2005 Takashi Iwai <email>tiwai@suse.de</email> 33 Copyright (c) 2002-2005 Takashi Iwai <email>tiwai@suse.de</email>
34 </para> 34 </para>
35 35
36 <para> 36 <para>
37 This document is free; you can redistribute it and/or modify it 37 This document is free; you can redistribute it and/or modify it
38 under the terms of the GNU General Public License as published by 38 under the terms of the GNU General Public License as published by
39 the Free Software Foundation; either version 2 of the License, or 39 the Free Software Foundation; either version 2 of the License, or
40 (at your option) any later version. 40 (at your option) any later version.
41 </para> 41 </para>
42 42
43 <para> 43 <para>
44 This document is distributed in the hope that it will be useful, 44 This document is distributed in the hope that it will be useful,
45 but <emphasis>WITHOUT ANY WARRANTY</emphasis>; without even the 45 but <emphasis>WITHOUT ANY WARRANTY</emphasis>; without even the
46 implied warranty of <emphasis>MERCHANTABILITY or FITNESS FOR A 46 implied warranty of <emphasis>MERCHANTABILITY or FITNESS FOR A
47 PARTICULAR PURPOSE</emphasis>. See the GNU General Public License 47 PARTICULAR PURPOSE</emphasis>. See the GNU General Public License
48 for more details. 48 for more details.
49 </para> 49 </para>
50 50
51 <para> 51 <para>
52 You should have received a copy of the GNU General Public 52 You should have received a copy of the GNU General Public
53 License along with this program; if not, write to the Free 53 License along with this program; if not, write to the Free
54 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, 54 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
55 MA 02111-1307 USA 55 MA 02111-1307 USA
56 </para> 56 </para>
57 </legalnotice> 57 </legalnotice>
58 58
59 </bookinfo> 59 </bookinfo>
60 60
61 <!-- ****************************************************** --> 61 <!-- ****************************************************** -->
62 <!-- Preface --> 62 <!-- Preface -->
63 <!-- ****************************************************** --> 63 <!-- ****************************************************** -->
64 <preface id="preface"> 64 <preface id="preface">
65 <title>Preface</title> 65 <title>Preface</title>
66 <para> 66 <para>
67 This document describes how to write an 67 This document describes how to write an
68 <ulink url="http://www.alsa-project.org/"><citetitle> 68 <ulink url="http://www.alsa-project.org/"><citetitle>
69 ALSA (Advanced Linux Sound Architecture)</citetitle></ulink> 69 ALSA (Advanced Linux Sound Architecture)</citetitle></ulink>
70 driver. The document focuses mainly on PCI soundcards. 70 driver. The document focuses mainly on PCI soundcards.
71 For other device types, the API might 71 For other device types, the API might
72 be different. However, the ALSA kernel API itself is 72 be different. However, the ALSA kernel API itself is
73 consistent, so this document should still be of some help when 73 consistent, so this document should still be of some help when
74 writing them. 74 writing them.
75 </para> 75 </para>
76 76
77 <para> 77 <para>
78 This document is aimed at readers who already have sufficient 78 This document is aimed at readers who already have sufficient
79 C language skills and a basic knowledge of Linux 79 C language skills and a basic knowledge of Linux
80 kernel programming. This document doesn't explain the general 80 kernel programming. This document doesn't explain the general
81 topics of Linux kernel code and doesn't cover the details of the 81 topics of Linux kernel code and doesn't cover the details of the
82 implementation of each low-level driver. It describes only 82 implementation of each low-level driver. It describes only
83 the standard way to write a PCI sound driver on ALSA. 83 the standard way to write a PCI sound driver on ALSA.
84 </para> 84 </para>
85 85
86 <para> 86 <para>
87 If you are already familiar with the older ALSA ver.0.5.x, you 87 If you are already familiar with the older ALSA ver.0.5.x, you
88 can check drivers such as <filename>es1938.c</filename> or 88 can check drivers such as <filename>es1938.c</filename> or
89 <filename>maestro3.c</filename>, which have almost the same 89 <filename>maestro3.c</filename>, which have almost the same
90 code-base in the ALSA 0.5.x tree, so you can compare the differences. 90 code-base in the ALSA 0.5.x tree, so you can compare the differences.
91 </para> 91 </para>
92 92
93 <para> 93 <para>
94 This document is still a draft version. Any feedback and 94 This document is still a draft version. Any feedback and
95 corrections are welcome! 95 corrections are welcome!
96 </para> 96 </para>
97 </preface> 97 </preface>
98 98
99 99
100 <!-- ****************************************************** --> 100 <!-- ****************************************************** -->
101 <!-- File Tree Structure --> 101 <!-- File Tree Structure -->
102 <!-- ****************************************************** --> 102 <!-- ****************************************************** -->
103 <chapter id="file-tree"> 103 <chapter id="file-tree">
104 <title>File Tree Structure</title> 104 <title>File Tree Structure</title>
105 105
106 <section id="file-tree-general"> 106 <section id="file-tree-general">
107 <title>General</title> 107 <title>General</title>
108 <para> 108 <para>
109 The ALSA drivers are provided in two ways. 109 The ALSA drivers are provided in two ways.
110 </para> 110 </para>
111 111
112 <para> 112 <para>
113 One is the trees provided as a tarball or via cvs from 113 One is the trees provided as a tarball or via cvs from
114 ALSA's ftp site, and the other is the 2.6 (or later) Linux kernel 114 ALSA's ftp site, and the other is the 2.6 (or later) Linux kernel
115 tree. To synchronize both, the ALSA driver tree is split into 115 tree. To synchronize both, the ALSA driver tree is split into
116 two different trees: alsa-kernel and alsa-driver. The former 116 two different trees: alsa-kernel and alsa-driver. The former
117 contains purely the source code for the Linux 2.6 (or later) 117 contains purely the source code for the Linux 2.6 (or later)
118 tree. This tree is designed only for compilation in a 2.6 or 118 tree. This tree is designed only for compilation in a 2.6 or
119 later environment. The latter, alsa-driver, contains many subtle 119 later environment. The latter, alsa-driver, contains many subtle
120 files for compiling the ALSA driver outside of the Linux 120 files for compiling the ALSA driver outside of the Linux
121 kernel tree, such as the configure script, wrapper functions for the older 121 kernel tree, such as the configure script, wrapper functions for the older
122 2.2 and 2.4 kernels to adapt to the latest kernel API, 122 2.2 and 2.4 kernels to adapt to the latest kernel API,
123 and additional drivers which are still in development or in 123 and additional drivers which are still in development or in
124 testing. The drivers in the alsa-driver tree will be moved to 124 testing. The drivers in the alsa-driver tree will be moved to
125 alsa-kernel (and eventually to the 2.6 kernel tree) once they are 125 alsa-kernel (and eventually to the 2.6 kernel tree) once they are
126 finished and confirmed to work fine. 126 finished and confirmed to work fine.
127 </para> 127 </para>
128 128
129 <para> 129 <para>
130 The file tree structure of the ALSA driver is depicted below. Both 130 The file tree structure of the ALSA driver is depicted below. Both
131 alsa-kernel and alsa-driver have almost the same file 131 alsa-kernel and alsa-driver have almost the same file
132 structure, except for the <quote>core</quote> directory, which is 132 structure, except for the <quote>core</quote> directory, which is
133 named <quote>acore</quote> in the alsa-driver tree. 133 named <quote>acore</quote> in the alsa-driver tree.
134 134
135 <example> 135 <example>
136 <title>ALSA File Tree Structure</title> 136 <title>ALSA File Tree Structure</title>
137 <literallayout> 137 <literallayout>
138 sound 138 sound
139 /core 139 /core
140 /oss 140 /oss
141 /seq 141 /seq
142 /oss 142 /oss
143 /instr 143 /instr
144 /ioctl32 144 /ioctl32
145 /include 145 /include
146 /drivers 146 /drivers
147 /mpu401 147 /mpu401
148 /opl3 148 /opl3
149 /i2c 149 /i2c
150 /l3 150 /l3
151 /synth 151 /synth
152 /emux 152 /emux
153 /pci 153 /pci
154 /(cards) 154 /(cards)
155 /isa 155 /isa
156 /(cards) 156 /(cards)
157 /arm 157 /arm
158 /ppc 158 /ppc
159 /sparc 159 /sparc
160 /usb 160 /usb
161 /pcmcia /(cards) 161 /pcmcia /(cards)
162 /oss 162 /oss
163 </literallayout> 163 </literallayout>
164 </example> 164 </example>
165 </para> 165 </para>
166 </section> 166 </section>
167 167
168 <section id="file-tree-core-directory"> 168 <section id="file-tree-core-directory">
169 <title>core directory</title> 169 <title>core directory</title>
170 <para> 170 <para>
171 This directory contains the middle layer, that is, the heart 171 This directory contains the middle layer, that is, the heart
172 of ALSA drivers. In this directory, the native ALSA modules are 172 of ALSA drivers. In this directory, the native ALSA modules are
173 stored. The sub-directories contain different modules and are 173 stored. The sub-directories contain different modules and are
174 dependent upon the kernel config. 174 dependent upon the kernel config.
175 </para> 175 </para>
176 176
177 <section id="file-tree-core-directory-oss"> 177 <section id="file-tree-core-directory-oss">
178 <title>core/oss</title> 178 <title>core/oss</title>
179 179
180 <para> 180 <para>
181 The code for the PCM and mixer OSS emulation modules is stored 181 The code for the PCM and mixer OSS emulation modules is stored
182 in this directory. The rawmidi OSS emulation is included in 182 in this directory. The rawmidi OSS emulation is included in
183 the ALSA rawmidi code since it's quite small. The sequencer 183 the ALSA rawmidi code since it's quite small. The sequencer
184 code is stored in the core/seq/oss directory (see 184 code is stored in the core/seq/oss directory (see
185 <link linkend="file-tree-core-directory-seq-oss"><citetitle> 185 <link linkend="file-tree-core-directory-seq-oss"><citetitle>
186 below</citetitle></link>). 186 below</citetitle></link>).
187 </para> 187 </para>
188 </section> 188 </section>
189 189
190 <section id="file-tree-core-directory-ioctl32"> 190 <section id="file-tree-core-directory-ioctl32">
191 <title>core/ioctl32</title> 191 <title>core/ioctl32</title>
192 192
193 <para> 193 <para>
194 This directory contains the 32bit-ioctl wrappers for 64bit 194 This directory contains the 32bit-ioctl wrappers for 64bit
195 architectures such as x86-64, ppc64 and sparc64. For 32bit 195 architectures such as x86-64, ppc64 and sparc64. For 32bit
196 and alpha architectures, these are not compiled. 196 and alpha architectures, these are not compiled.
197 </para> 197 </para>
198 </section> 198 </section>
199 199
200 <section id="file-tree-core-directory-seq"> 200 <section id="file-tree-core-directory-seq">
201 <title>core/seq</title> 201 <title>core/seq</title>
202 <para> 202 <para>
203 This directory and its sub-directories are for the ALSA 203 This directory and its sub-directories are for the ALSA
204 sequencer. This directory contains the sequencer core and 204 sequencer. This directory contains the sequencer core and
205 primary sequencer modules such as snd-seq-midi, 205 primary sequencer modules such as snd-seq-midi,
206 snd-seq-virmidi, etc. They are compiled only when 206 snd-seq-virmidi, etc. They are compiled only when
207 <constant>CONFIG_SND_SEQUENCER</constant> is set in the kernel 207 <constant>CONFIG_SND_SEQUENCER</constant> is set in the kernel
208 config. 208 config.
209 </para> 209 </para>
210 </section> 210 </section>
211 211
212 <section id="file-tree-core-directory-seq-oss"> 212 <section id="file-tree-core-directory-seq-oss">
213 <title>core/seq/oss</title> 213 <title>core/seq/oss</title>
214 <para> 214 <para>
215 This contains the OSS sequencer emulation code. 215 This contains the OSS sequencer emulation code.
216 </para> 216 </para>
217 </section> 217 </section>
218 218
219 <section id="file-tree-core-directory-deq-instr"> 219 <section id="file-tree-core-directory-deq-instr">
220 <title>core/seq/instr</title> 220 <title>core/seq/instr</title>
221 <para> 221 <para>
222 This directory contains the modules for the sequencer 222 This directory contains the modules for the sequencer
223 instrument layer. 223 instrument layer.
224 </para> 224 </para>
225 </section> 225 </section>
226 </section> 226 </section>
227 227
228 <section id="file-tree-include-directory"> 228 <section id="file-tree-include-directory">
229 <title>include directory</title> 229 <title>include directory</title>
230 <para> 230 <para>
231 This is the place for the public header files of ALSA drivers, 231 This is the place for the public header files of ALSA drivers,
232 which are to be exported to user-space or included by 232 which are to be exported to user-space or included by
233 several files in different directories. Basically, private 233 several files in different directories. Basically, private
234 header files should not be placed in this directory, but you may 234 header files should not be placed in this directory, but you may
235 still find some there for historical reasons :) 235 still find some there for historical reasons :)
236 </para> 236 </para>
237 </section> 237 </section>
238 238
239 <section id="file-tree-drivers-directory"> 239 <section id="file-tree-drivers-directory">
240 <title>drivers directory</title> 240 <title>drivers directory</title>
241 <para> 241 <para>
242 This directory contains code shared among different drivers 242 This directory contains code shared among different drivers
243 on different architectures. It is hence supposed not to be 243 on different architectures. It is hence supposed not to be
244 architecture-specific. 244 architecture-specific.
245 For example, the dummy pcm driver and the serial MIDI 245 For example, the dummy pcm driver and the serial MIDI
246 driver are found in this directory. In the sub-directories, 246 driver are found in this directory. In the sub-directories,
247 there is code for components which are independent of 247 there is code for components which are independent of
248 bus and cpu architectures. 248 bus and cpu architectures.
249 </para> 249 </para>
250 250
251 <section id="file-tree-drivers-directory-mpu401"> 251 <section id="file-tree-drivers-directory-mpu401">
252 <title>drivers/mpu401</title> 252 <title>drivers/mpu401</title>
253 <para> 253 <para>
254 The MPU401 and MPU401-UART modules are stored here. 254 The MPU401 and MPU401-UART modules are stored here.
255 </para> 255 </para>
256 </section> 256 </section>
257 257
258 <section id="file-tree-drivers-directory-opl3"> 258 <section id="file-tree-drivers-directory-opl3">
259 <title>drivers/opl3 and opl4</title> 259 <title>drivers/opl3 and opl4</title>
260 <para> 260 <para>
261 The OPL3 and OPL4 FM-synth stuff is found here. 261 The OPL3 and OPL4 FM-synth stuff is found here.
262 </para> 262 </para>
263 </section> 263 </section>
264 </section> 264 </section>
265 265
266 <section id="file-tree-i2c-directory"> 266 <section id="file-tree-i2c-directory">
267 <title>i2c directory</title> 267 <title>i2c directory</title>
268 <para> 268 <para>
269 This contains the ALSA i2c components. 269 This contains the ALSA i2c components.
270 </para> 270 </para>
271 271
272 <para> 272 <para>
273 Although there is a standard i2c layer on Linux, ALSA has its 273 Although there is a standard i2c layer on Linux, ALSA has its
274 own i2c code for some cards, because the soundcard needs only 274 own i2c code for some cards, because the soundcard needs only
275 simple operations and the standard i2c API is too complicated for 275 simple operations and the standard i2c API is too complicated for
276 such a purpose. 276 such a purpose.
277 </para> 277 </para>
278 278
279 <section id="file-tree-i2c-directory-l3"> 279 <section id="file-tree-i2c-directory-l3">
280 <title>i2c/l3</title> 280 <title>i2c/l3</title>
281 <para> 281 <para>
282 This is a sub-directory for ARM L3 i2c. 282 This is a sub-directory for ARM L3 i2c.
283 </para> 283 </para>
284 </section> 284 </section>
285 </section> 285 </section>
286 286
287 <section id="file-tree-synth-directory"> 287 <section id="file-tree-synth-directory">
288 <title>synth directory</title> 288 <title>synth directory</title>
289 <para> 289 <para>
290 This contains the synth middle-level modules. 290 This contains the synth middle-level modules.
291 </para> 291 </para>
292 292
293 <para> 293 <para>
294 So far, there is only the Emu8000/Emu10k1 synth driver under 294 So far, there is only the Emu8000/Emu10k1 synth driver under
295 the synth/emux sub-directory. 295 the synth/emux sub-directory.
296 </para> 296 </para>
297 </section> 297 </section>
298 298
299 <section id="file-tree-pci-directory"> 299 <section id="file-tree-pci-directory">
300 <title>pci directory</title> 300 <title>pci directory</title>
301 <para> 301 <para>
302 This directory and its sub-directories hold the top-level card modules 302 This directory and its sub-directories hold the top-level card modules
303 for PCI soundcards and the code specific to the PCI bus. 303 for PCI soundcards and the code specific to the PCI bus.
304 </para> 304 </para>
305 305
306 <para> 306 <para>
307 Drivers compiled from a single file are stored directly in the 307 Drivers compiled from a single file are stored directly in the
308 pci directory, while drivers with several source files are 308 pci directory, while drivers with several source files are
309 stored in their own sub-directory (e.g. emu10k1, ice1712). 309 stored in their own sub-directory (e.g. emu10k1, ice1712).
310 </para> 310 </para>
311 </section> 311 </section>
312 312
313 <section id="file-tree-isa-directory"> 313 <section id="file-tree-isa-directory">
314 <title>isa directory</title> 314 <title>isa directory</title>
315 <para> 315 <para>
316 This directory and its sub-directories hold the top-level card modules 316 This directory and its sub-directories hold the top-level card modules
317 for ISA soundcards. 317 for ISA soundcards.
318 </para> 318 </para>
319 </section> 319 </section>
320 320
321 <section id="file-tree-arm-ppc-sparc-directories"> 321 <section id="file-tree-arm-ppc-sparc-directories">
322 <title>arm, ppc, and sparc directories</title> 322 <title>arm, ppc, and sparc directories</title>
323 <para> 323 <para>
324 These are for the top-level card modules which are 324 These are for the top-level card modules which are
325 specific to each given architecture. 325 specific to each given architecture.
326 </para> 326 </para>
327 </section> 327 </section>
328 328
329 <section id="file-tree-usb-directory"> 329 <section id="file-tree-usb-directory">
330 <title>usb directory</title> 330 <title>usb directory</title>
331 <para> 331 <para>
332 This contains the USB-audio driver. In the latest version, the 332 This contains the USB-audio driver. In the latest version, the
333 USB MIDI driver is integrated into the usb-audio driver. 333 USB MIDI driver is integrated into the usb-audio driver.
334 </para> 334 </para>
335 </section> 335 </section>
336 336
337 <section id="file-tree-pcmcia-directory"> 337 <section id="file-tree-pcmcia-directory">
338 <title>pcmcia directory</title> 338 <title>pcmcia directory</title>
339 <para> 339 <para>
340 The PCMCIA drivers, especially PCCard drivers, will go here. CardBus 340 The PCMCIA drivers, especially PCCard drivers, will go here. CardBus
341 drivers will be in the pci directory, because their API is identical 341 drivers will be in the pci directory, because their API is identical
342 to that of standard PCI cards. 342 to that of standard PCI cards.
343 </para> 343 </para>
344 </section> 344 </section>
345 345
346 <section id="file-tree-oss-directory"> 346 <section id="file-tree-oss-directory">
347 <title>oss directory</title> 347 <title>oss directory</title>
348 <para> 348 <para>
349 The OSS/Lite source files are stored here in the Linux 2.6 (or 349 The OSS/Lite source files are stored here in the Linux 2.6 (or
350 later) tree. (In the ALSA driver tarball, it's empty, of course :) 350 later) tree. (In the ALSA driver tarball, it's empty, of course :)
351 </para> 351 </para>
352 </section> 352 </section>
353 </chapter> 353 </chapter>
354 354
355 355
356 <!-- ****************************************************** --> 356 <!-- ****************************************************** -->
357 <!-- Basic Flow for PCI Drivers --> 357 <!-- Basic Flow for PCI Drivers -->
358 <!-- ****************************************************** --> 358 <!-- ****************************************************** -->
359 <chapter id="basic-flow"> 359 <chapter id="basic-flow">
360 <title>Basic Flow for PCI Drivers</title> 360 <title>Basic Flow for PCI Drivers</title>
361 361
362 <section id="basic-flow-outline"> 362 <section id="basic-flow-outline">
363 <title>Outline</title> 363 <title>Outline</title>
364 <para> 364 <para>
365 The minimum flow for a PCI soundcard driver is as follows: 365 The minimum flow for a PCI soundcard driver is as follows:
366 366
367 <itemizedlist> 367 <itemizedlist>
368 <listitem><para>define the PCI ID table (see the section 368 <listitem><para>define the PCI ID table (see the section
369 <link linkend="pci-resource-entries"><citetitle>PCI Entries 369 <link linkend="pci-resource-entries"><citetitle>PCI Entries
370 </citetitle></link>).</para></listitem> 370 </citetitle></link>).</para></listitem>
371 <listitem><para>create <function>probe()</function> callback.</para></listitem> 371 <listitem><para>create <function>probe()</function> callback.</para></listitem>
372 <listitem><para>create <function>remove()</function> callback.</para></listitem> 372 <listitem><para>create <function>remove()</function> callback.</para></listitem>
373 <listitem><para>create pci_driver table which contains the three pointers above.</para></listitem> 373 <listitem><para>create pci_driver table which contains the three pointers above.</para></listitem>
374 <listitem><para>create <function>init()</function> function just calling <function>pci_register_driver()</function> to register the pci_driver table defined above.</para></listitem> 374 <listitem><para>create <function>init()</function> function just calling <function>pci_register_driver()</function> to register the pci_driver table defined above.</para></listitem>
375 <listitem><para>create <function>exit()</function> function to call <function>pci_unregister_driver()</function> function.</para></listitem> 375 <listitem><para>create <function>exit()</function> function to call <function>pci_unregister_driver()</function> function.</para></listitem>
376 </itemizedlist> 376 </itemizedlist>
377 </para> 377 </para>
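<para>
As a rough illustration of how the six steps above fit together, the
following is a user-space model written in plain C. It is not kernel
code: every model_ name, the vendor/device numbers and the two-entry
bus are hypothetical stand-ins invented for this sketch, and the real
types and registration functions come from the kernel PCI headers.
</para>
<informalexample>
<programlisting>
<![CDATA[
```c
#include <stddef.h>

/* Hypothetical user-space stand-ins for struct pci_device_id,
 * struct pci_dev and struct pci_driver; this is NOT the kernel API.
 */
struct model_id {
	unsigned int vendor, device;
};

struct model_dev {
	struct model_id id;
	void *drvdata;
};

struct model_driver {
	const char *name;
	const struct model_id *id_table;	/* ends with a zero entry */
	int (*probe)(struct model_dev *dev, const struct model_id *id);
	void (*remove)(struct model_dev *dev);
};

/* step 1: the ID table (made-up vendor/device numbers) */
static const struct model_id mychip_ids[] = {
	{ 0x1234, 0x5678 },
	{ 0, 0 }
};

static int mychip_bound;	/* number of devices currently bound */

/* step 2: probe() callback */
static int mychip_probe(struct model_dev *dev, const struct model_id *id)
{
	(void)id;
	dev->drvdata = &mychip_bound;	/* any non-NULL per-device data */
	mychip_bound++;
	return 0;
}

/* step 3: remove() callback */
static void mychip_remove(struct model_dev *dev)
{
	dev->drvdata = NULL;
	mychip_bound--;
}

/* step 4: the driver table holding the three pointers */
static struct model_driver mychip_driver = {
	.name = "My Own Chip",
	.id_table = mychip_ids,
	.probe = mychip_probe,
	.remove = mychip_remove,
};

/* A tiny model of the bus: one matching and one foreign device. */
static struct model_dev model_bus[] = {
	{ { 0x1234, 0x5678 }, NULL },
	{ { 0xabcd, 0x0001 }, NULL },
};
#define MODEL_BUS_LEN (sizeof(model_bus) / sizeof(model_bus[0]))

/* Models the bus core: probe every device that matches the ID table. */
static int model_register_driver(struct model_driver *drv)
{
	const struct model_id *id;
	size_t i;

	for (i = 0; i < MODEL_BUS_LEN; i++) {
		for (id = drv->id_table; id->vendor; id++) {
			if (id->vendor == model_bus[i].id.vendor &&
			    id->device == model_bus[i].id.device) {
				drv->probe(&model_bus[i], id);
				break;
			}
		}
	}
	return 0;
}

/* Models the bus core: call remove() for every bound device. */
static void model_unregister_driver(struct model_driver *drv)
{
	size_t i;

	for (i = 0; i < MODEL_BUS_LEN; i++)
		if (model_bus[i].drvdata)
			drv->remove(&model_bus[i]);
}

/* step 5: init() only registers the driver table */
static int mychip_init(void)
{
	return model_register_driver(&mychip_driver);
}

/* step 6: exit() only unregisters it */
static void mychip_exit(void)
{
	model_unregister_driver(&mychip_driver);
}
```
]]>
</programlisting>
</informalexample>
<para>
In this model, calling mychip_init() binds only the first bus entry,
because the second one is not listed in the ID table, and
mychip_exit() releases it again through the remove callback.
</para>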
378 </section> 378 </section>
379 379
380 <section id="basic-flow-example"> 380 <section id="basic-flow-example">
381 <title>Full Code Example</title> 381 <title>Full Code Example</title>
382 <para> 382 <para>
383 The code example is shown below. Some parts are left 383 The code example is shown below. Some parts are left
384 unimplemented for now but will be filled in in the 384 unimplemented for now but will be filled in in the
385 succeeding sections. The numbers in the comment lines of the 385 succeeding sections. The numbers in the comment lines of the
386 <function>snd_mychip_probe()</function> function are 386 <function>snd_mychip_probe()</function> function are
387 markers referred to in the explanations that follow. 387 markers referred to in the explanations that follow.
388 388
389 <example> 389 <example>
390 <title>Basic Flow for PCI Drivers Example</title> 390 <title>Basic Flow for PCI Drivers Example</title>
391 <programlisting> 391 <programlisting>
392 <![CDATA[ 392 <![CDATA[
393 #include <sound/driver.h> 393 #include <sound/driver.h>
394 #include <linux/init.h> 394 #include <linux/init.h>
395 #include <linux/pci.h> 395 #include <linux/pci.h>
396 #include <linux/slab.h> 396 #include <linux/slab.h>
397 #include <sound/core.h> 397 #include <sound/core.h>
398 #include <sound/initval.h> 398 #include <sound/initval.h>
399 399
400 /* module parameters (see "Module Parameters") */ 400 /* module parameters (see "Module Parameters") */
401 static int index[SNDRV_CARDS] = SNDRV_DEFAULT_IDX; 401 static int index[SNDRV_CARDS] = SNDRV_DEFAULT_IDX;
402 static char *id[SNDRV_CARDS] = SNDRV_DEFAULT_STR; 402 static char *id[SNDRV_CARDS] = SNDRV_DEFAULT_STR;
403 static int enable[SNDRV_CARDS] = SNDRV_DEFAULT_ENABLE_PNP; 403 static int enable[SNDRV_CARDS] = SNDRV_DEFAULT_ENABLE_PNP;
404 404
405 /* definition of the chip-specific record */ 405 /* definition of the chip-specific record */
406 struct mychip { 406 struct mychip {
407 struct snd_card *card; 407 struct snd_card *card;
408 // rest of implementation will be in the section 408 // rest of implementation will be in the section
409 // "PCI Resource Managements" 409 // "PCI Resource Managements"
410 }; 410 };
411 411
412 /* chip-specific destructor 412 /* chip-specific destructor
413 * (see "PCI Resource Managements") 413 * (see "PCI Resource Managements")
414 */ 414 */
415 static int snd_mychip_free(struct mychip *chip) 415 static int snd_mychip_free(struct mychip *chip)
416 { 416 {
417 .... // will be implemented later... 417 .... // will be implemented later...
418 } 418 }
419 419
420 /* component-destructor 420 /* component-destructor
421 * (see "Management of Cards and Components") 421 * (see "Management of Cards and Components")
422 */ 422 */
423 static int snd_mychip_dev_free(struct snd_device *device) 423 static int snd_mychip_dev_free(struct snd_device *device)
424 { 424 {
425 return snd_mychip_free(device->device_data); 425 return snd_mychip_free(device->device_data);
426 } 426 }
427 427
428 /* chip-specific constructor 428 /* chip-specific constructor
429 * (see "Management of Cards and Components") 429 * (see "Management of Cards and Components")
430 */ 430 */
431 static int __devinit snd_mychip_create(struct snd_card *card, 431 static int __devinit snd_mychip_create(struct snd_card *card,
432 struct pci_dev *pci, 432 struct pci_dev *pci,
433 struct mychip **rchip) 433 struct mychip **rchip)
434 { 434 {
435 struct mychip *chip; 435 struct mychip *chip;
436 int err; 436 int err;
437 static struct snd_device_ops ops = { 437 static struct snd_device_ops ops = {
438 .dev_free = snd_mychip_dev_free, 438 .dev_free = snd_mychip_dev_free,
439 }; 439 };
440 440
441 *rchip = NULL; 441 *rchip = NULL;
442 442
443 // check PCI availability here 443 // check PCI availability here
444 // (see "PCI Resource Managements") 444 // (see "PCI Resource Managements")
445 .... 445 ....
446 446
447 /* allocate zero-filled chip-specific data */ 447 /* allocate zero-filled chip-specific data */
448 chip = kzalloc(sizeof(*chip), GFP_KERNEL); 448 chip = kzalloc(sizeof(*chip), GFP_KERNEL);
449 if (chip == NULL) 449 if (chip == NULL)
450 return -ENOMEM; 450 return -ENOMEM;
451 451
452 chip->card = card; 452 chip->card = card;
453 453
454 // rest of initialization here; will be implemented 454 // rest of initialization here; will be implemented
455 // later, see "PCI Resource Managements" 455 // later, see "PCI Resource Managements"
456 .... 456 ....
457 457
458 if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, 458 if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL,
459 chip, &ops)) < 0) { 459 chip, &ops)) < 0) {
460 snd_mychip_free(chip); 460 snd_mychip_free(chip);
461 return err; 461 return err;
462 } 462 }
463 463
464 snd_card_set_dev(card, &pci->dev); 464 snd_card_set_dev(card, &pci->dev);
465 465
466 *rchip = chip; 466 *rchip = chip;
467 return 0; 467 return 0;
468 } 468 }
469 469
470 /* constructor -- see "Constructor" sub-section */ 470 /* constructor -- see "Constructor" sub-section */
471 static int __devinit snd_mychip_probe(struct pci_dev *pci, 471 static int __devinit snd_mychip_probe(struct pci_dev *pci,
472 const struct pci_device_id *pci_id) 472 const struct pci_device_id *pci_id)
473 { 473 {
474 static int dev; 474 static int dev;
475 struct snd_card *card; 475 struct snd_card *card;
476 struct mychip *chip; 476 struct mychip *chip;
477 int err; 477 int err;
478 478
479 /* (1) */ 479 /* (1) */
480 if (dev >= SNDRV_CARDS) 480 if (dev >= SNDRV_CARDS)
481 return -ENODEV; 481 return -ENODEV;
482 if (!enable[dev]) { 482 if (!enable[dev]) {
483 dev++; 483 dev++;
484 return -ENOENT; 484 return -ENOENT;
485 } 485 }
486 486
487 /* (2) */ 487 /* (2) */
488 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0); 488 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0);
489 if (card == NULL) 489 if (card == NULL)
490 return -ENOMEM; 490 return -ENOMEM;
491 491
492 /* (3) */ 492 /* (3) */
493 if ((err = snd_mychip_create(card, pci, &chip)) < 0) { 493 if ((err = snd_mychip_create(card, pci, &chip)) < 0) {
494 snd_card_free(card); 494 snd_card_free(card);
495 return err; 495 return err;
496 } 496 }
497 497
498 /* (4) */ 498 /* (4) */
499 strcpy(card->driver, "My Chip"); 499 strcpy(card->driver, "My Chip");
500 strcpy(card->shortname, "My Own Chip 123"); 500 strcpy(card->shortname, "My Own Chip 123");
501 sprintf(card->longname, "%s at 0x%lx irq %i", 501 sprintf(card->longname, "%s at 0x%lx irq %i",
502 card->shortname, chip->ioport, chip->irq); 502 card->shortname, chip->ioport, chip->irq);
503 503
504 /* (5) */ 504 /* (5) */
505 .... // implemented later 505 .... // implemented later
506 506
507 /* (6) */ 507 /* (6) */
508 if ((err = snd_card_register(card)) < 0) { 508 if ((err = snd_card_register(card)) < 0) {
509 snd_card_free(card); 509 snd_card_free(card);
510 return err; 510 return err;
511 } 511 }
512 512
513 /* (7) */ 513 /* (7) */
514 pci_set_drvdata(pci, card); 514 pci_set_drvdata(pci, card);
515 dev++; 515 dev++;
516 return 0; 516 return 0;
517 } 517 }
518 518
519 /* destructor -- see "Destructor" sub-section */ 519 /* destructor -- see "Destructor" sub-section */
520 static void __devexit snd_mychip_remove(struct pci_dev *pci) 520 static void __devexit snd_mychip_remove(struct pci_dev *pci)
521 { 521 {
522 snd_card_free(pci_get_drvdata(pci)); 522 snd_card_free(pci_get_drvdata(pci));
523 pci_set_drvdata(pci, NULL); 523 pci_set_drvdata(pci, NULL);
524 } 524 }
525 ]]> 525 ]]>
526 </programlisting> 526 </programlisting>
527 </example> 527 </example>
528 </para> 528 </para>
529 </section> 529 </section>
530 530
531 <section id="basic-flow-constructor"> 531 <section id="basic-flow-constructor">
532 <title>Constructor</title> 532 <title>Constructor</title>
533 <para> 533 <para>
534 The real constructor of PCI drivers is the probe callback. The 534 The real constructor of PCI drivers is the probe callback. The
535 probe callback and other component-constructors which are called 535 probe callback and other component-constructors which are called
536 from the probe callback should be defined with the 536 from the probe callback should be defined with the
537 <parameter>__devinit</parameter> prefix. You 537 <parameter>__devinit</parameter> prefix. You
538 cannot use the <parameter>__init</parameter> prefix for them, 538 cannot use the <parameter>__init</parameter> prefix for them,
539 because any PCI device could be a hotplug device. 539 because any PCI device could be a hotplug device.
540 </para> 540 </para>
541 541
542 <para> 542 <para>
543 In the probe callback, the following scheme is often used. 543 In the probe callback, the following scheme is often used.
544 </para> 544 </para>
545 545
546 <section id="basic-flow-constructor-device-index"> 546 <section id="basic-flow-constructor-device-index">
547 <title>1) Check and increment the device index.</title> 547 <title>1) Check and increment the device index.</title>
548 <para> 548 <para>
549 <informalexample> 549 <informalexample>
550 <programlisting> 550 <programlisting>
551 <![CDATA[ 551 <![CDATA[
552 static int dev; 552 static int dev;
553 .... 553 ....
554 if (dev >= SNDRV_CARDS) 554 if (dev >= SNDRV_CARDS)
555 return -ENODEV; 555 return -ENODEV;
556 if (!enable[dev]) { 556 if (!enable[dev]) {
557 dev++; 557 dev++;
558 return -ENOENT; 558 return -ENOENT;
559 } 559 }
560 ]]> 560 ]]>
561 </programlisting> 561 </programlisting>
562 </informalexample> 562 </informalexample>
563 563
564 where enable[dev] is the module option. 564 where enable[dev] is the module option.
565 </para> 565 </para>
566 566
567 <para> 567 <para>
568 Each time the probe callback is called, check the 568 Each time the probe callback is called, check the
569 availability of the device. If it is not available, simply increment 569 availability of the device. If it is not available, simply increment
570 the device index and return. dev will also be incremented 570 the device index and return. dev will also be incremented
571 later (<link 571 later (<link
572 linkend="basic-flow-constructor-set-pci"><citetitle>step 572 linkend="basic-flow-constructor-set-pci"><citetitle>step
573 7</citetitle></link>). 573 7</citetitle></link>).
574 </para> 574 </para>
575 </section> 575 </section>
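The index handling in step 1 can be tried outside the kernel. The following is a minimal user-space sketch, not driver code: `SNDRV_CARDS` is given an assumed value of 8, `enable[]` is a mock of the module option, and `mock_probe()` is a hypothetical stand-in that keeps only the index checks and the final increment from step 7.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins so the logic can run in user space. */
#define SNDRV_CARDS 8                          /* assumed number of card slots */
static int enable[SNDRV_CARDS] = { 1, 0, 1 };  /* mock module option; remaining slots are 0 */

static int dev;  /* next slot to try; keeps its value across probe calls */

/* Mock probe callback: only the device-index bookkeeping is modelled. */
static int mock_probe(void)
{
	if (dev >= SNDRV_CARDS)
		return -ENODEV;   /* no slot left */
	if (!enable[dev]) {
		dev++;            /* skip the disabled slot for the next call */
		return -ENOENT;
	}
	/* ... steps 2 to 6 (card creation and registration) would go here ... */
	dev++;                    /* step 7: the slot is now used */
	return 0;
}
```

Because `dev` is static, every call consumes one slot whether or not the probe succeeds, so a disabled slot is never retried.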
576 576
577 <section id="basic-flow-constructor-create-card"> 577 <section id="basic-flow-constructor-create-card">
578 <title>2) Create a card instance</title> 578 <title>2) Create a card instance</title>
579 <para> 579 <para>
580 <informalexample> 580 <informalexample>
581 <programlisting> 581 <programlisting>
582 <![CDATA[ 582 <![CDATA[
583 struct snd_card *card; 583 struct snd_card *card;
584 .... 584 ....
585 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0); 585 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0);
586 ]]> 586 ]]>
587 </programlisting> 587 </programlisting>
588 </informalexample> 588 </informalexample>
589 </para> 589 </para>
590 590
591 <para> 591 <para>
592 The details are explained in the section 592 The details are explained in the section
593 <link linkend="card-management-card-instance"><citetitle> 593 <link linkend="card-management-card-instance"><citetitle>
594 Management of Cards and Components</citetitle></link>. 594 Management of Cards and Components</citetitle></link>.
595 </para> 595 </para>
596 </section> 596 </section>
597 597
598 <section id="basic-flow-constructor-create-main"> 598 <section id="basic-flow-constructor-create-main">
599 <title>3) Create a main component</title> 599 <title>3) Create a main component</title>
600 <para> 600 <para>
601 In this part, the PCI resources are allocated. 601 In this part, the PCI resources are allocated.
602 602
603 <informalexample> 603 <informalexample>
604 <programlisting> 604 <programlisting>
605 <![CDATA[ 605 <![CDATA[
606 struct mychip *chip; 606 struct mychip *chip;
607 .... 607 ....
608 if ((err = snd_mychip_create(card, pci, &chip)) < 0) { 608 if ((err = snd_mychip_create(card, pci, &chip)) < 0) {
609 snd_card_free(card); 609 snd_card_free(card);
610 return err; 610 return err;
611 } 611 }
612 ]]> 612 ]]>
613 </programlisting> 613 </programlisting>
614 </informalexample> 614 </informalexample>
615 615
616 The details are explained in the section <link 616 The details are explained in the section <link
617 linkend="pci-resource"><citetitle>PCI Resource 617 linkend="pci-resource"><citetitle>PCI Resource
618 Managements</citetitle></link>. 618 Managements</citetitle></link>.
619 </para> 619 </para>
620 </section> 620 </section>
621 621
622 <section id="basic-flow-constructor-main-component"> 622 <section id="basic-flow-constructor-main-component">
623 <title>4) Set the driver ID and name strings.</title> 623 <title>4) Set the driver ID and name strings.</title>
624 <para> 624 <para>
625 <informalexample> 625 <informalexample>
626 <programlisting> 626 <programlisting>
627 <![CDATA[ 627 <![CDATA[
628 strcpy(card->driver, "My Chip"); 628 strcpy(card->driver, "My Chip");
629 strcpy(card->shortname, "My Own Chip 123"); 629 strcpy(card->shortname, "My Own Chip 123");
630 sprintf(card->longname, "%s at 0x%lx irq %i", 630 sprintf(card->longname, "%s at 0x%lx irq %i",
631 card->shortname, chip->ioport, chip->irq); 631 card->shortname, chip->ioport, chip->irq);
632 ]]> 632 ]]>
633 </programlisting> 633 </programlisting>
634 </informalexample> 634 </informalexample>
635 635
636 The driver field holds the minimal ID string of the 636 The driver field holds the minimal ID string of the
637 chip. This is referred to by alsa-lib's configurator, so keep it 637 chip. This is referred to by alsa-lib's configurator, so keep it
638 simple but unique. 638 simple but unique.
639 Even the same driver can have different driver IDs to 639 Even the same driver can have different driver IDs to
640 distinguish the functionality of each chip type. 640 distinguish the functionality of each chip type.
641 </para> 641 </para>
642 642
643 <para> 643 <para>
644 The shortname field is a string shown as a more verbose 644 The shortname field is a string shown as a more verbose
645 name. The longname field contains the information which is 645 name. The longname field contains the information which is
646 shown in <filename>/proc/asound/cards</filename>. 646 shown in <filename>/proc/asound/cards</filename>.
647 </para> 647 </para>
648 </section> 648 </section>
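The name strings from step 4 can also be built with bounded writes. The sketch below is a user-space illustration only: `struct mock_card`, `struct mock_chip` and `mock_set_names()` are hypothetical, the buffer sizes are assumptions made for this example, and `snprintf()` is swapped in for the `strcpy()`/`sprintf()` calls shown above so that an over-long name is truncated rather than overflowing the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space stand-ins; the field sizes are assumptions. */
struct mock_card {
	char driver[16];
	char shortname[32];
	char longname[80];
};

struct mock_chip {
	unsigned long ioport;
	int irq;
};

/* Fill the three name fields, never writing past the end of a buffer. */
static void mock_set_names(struct mock_card *card, const struct mock_chip *chip)
{
	snprintf(card->driver, sizeof(card->driver), "%s", "My Chip");
	snprintf(card->shortname, sizeof(card->shortname), "%s", "My Own Chip 123");
	snprintf(card->longname, sizeof(card->longname), "%s at 0x%lx irq %i",
		 card->shortname, chip->ioport, chip->irq);
}
```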
649 649
650 <section id="basic-flow-constructor-create-other"> 650 <section id="basic-flow-constructor-create-other">
651 <title>5) Create other components, such as mixer, MIDI, etc.</title> 651 <title>5) Create other components, such as mixer, MIDI, etc.</title>
652 <para> 652 <para>
653 Here you define the basic components such as 653 Here you define the basic components such as
654 <link linkend="pcm-interface"><citetitle>PCM</citetitle></link>, 654 <link linkend="pcm-interface"><citetitle>PCM</citetitle></link>,
655 mixer (e.g. <link linkend="api-ac97"><citetitle>AC97</citetitle></link>), 655 mixer (e.g. <link linkend="api-ac97"><citetitle>AC97</citetitle></link>),
656 MIDI (e.g. <link linkend="midi-interface"><citetitle>MPU-401</citetitle></link>), 656 MIDI (e.g. <link linkend="midi-interface"><citetitle>MPU-401</citetitle></link>),
657 and other interfaces. 657 and other interfaces.
658 Also, if you want a <link linkend="proc-interface"><citetitle>proc 658 Also, if you want a <link linkend="proc-interface"><citetitle>proc
659 file</citetitle></link>, define it here, too. 659 file</citetitle></link>, define it here, too.
660 </para> 660 </para>
661 </section> 661 </section>
662 662
663 <section id="basic-flow-constructor-register-card"> 663 <section id="basic-flow-constructor-register-card">
664 <title>6) Register the card instance.</title> 664 <title>6) Register the card instance.</title>
665 <para> 665 <para>
666 <informalexample> 666 <informalexample>
667 <programlisting> 667 <programlisting>
668 <![CDATA[ 668 <![CDATA[
669 if ((err = snd_card_register(card)) < 0) { 669 if ((err = snd_card_register(card)) < 0) {
670 snd_card_free(card); 670 snd_card_free(card);
671 return err; 671 return err;
672 } 672 }
673 ]]> 673 ]]>
674 </programlisting> 674 </programlisting>
675 </informalexample> 675 </informalexample>
676 </para> 676 </para>
677 677
678 <para> 678 <para>
679 This will be explained in the section <link 679 This will be explained in the section <link
680 linkend="card-management-registration"><citetitle>Management 680 linkend="card-management-registration"><citetitle>Management
681 of Cards and Components</citetitle></link>, too. 681 of Cards and Components</citetitle></link>, too.
682 </para> 682 </para>
683 </section> 683 </section>
684 684
685 <section id="basic-flow-constructor-set-pci"> 685 <section id="basic-flow-constructor-set-pci">
686 <title>7) Set the PCI driver data and return zero.</title> 686 <title>7) Set the PCI driver data and return zero.</title>
687 <para> 687 <para>
688 <informalexample> 688 <informalexample>
689 <programlisting> 689 <programlisting>
690 <![CDATA[ 690 <![CDATA[
691 pci_set_drvdata(pci, card); 691 pci_set_drvdata(pci, card);
692 dev++; 692 dev++;
693 return 0; 693 return 0;
694 ]]> 694 ]]>
695 </programlisting> 695 </programlisting>
696 </informalexample> 696 </informalexample>
697 697
698 In the above, the card record is stored. This pointer is 698 In the above, the card record is stored. This pointer is
699 referred to in the remove callback and power-management 699 referred to in the remove callback and power-management
700 callbacks, too. 700 callbacks, too.
701 </para> 701 </para>
702 </section> 702 </section>
703 </section> 703 </section>
704 704
705 <section id="basic-flow-destructor"> 705 <section id="basic-flow-destructor">
706 <title>Destructor</title> 706 <title>Destructor</title>
707 <para> 707 <para>
708 The destructor, the remove callback, simply releases the card 708 The destructor, the remove callback, simply releases the card
709 instance. Then the ALSA middle layer will release all the 709 instance. Then the ALSA middle layer will release all the
710 attached components automatically. 710 attached components automatically.
711 </para> 711 </para>
712 712
713 <para> 713 <para>
714 It would typically look like the following: 714 It would typically look like the following:
715 715
716 <informalexample> 716 <informalexample>
717 <programlisting> 717 <programlisting>
718 <![CDATA[ 718 <![CDATA[
719 static void __devexit snd_mychip_remove(struct pci_dev *pci) 719 static void __devexit snd_mychip_remove(struct pci_dev *pci)
720 { 720 {
721 snd_card_free(pci_get_drvdata(pci)); 721 snd_card_free(pci_get_drvdata(pci));
722 pci_set_drvdata(pci, NULL); 722 pci_set_drvdata(pci, NULL);
723 } 723 }
724 ]]> 724 ]]>
725 </programlisting> 725 </programlisting>
726 </informalexample> 726 </informalexample>
727 727
728 The above code assumes that the card pointer is set to the PCI 728 The above code assumes that the card pointer is set to the PCI
729 driver data. 729 driver data.
730 </para> 730 </para>
731 </section> 731 </section>
732 732
733 <section id="basic-flow-header-files"> 733 <section id="basic-flow-header-files">
734 <title>Header Files</title> 734 <title>Header Files</title>
735 <para> 735 <para>
736 For the above example, at least the following include files 736 For the above example, at least the following include files
737 are necessary. 737 are necessary.
738 738
739 <informalexample> 739 <informalexample>
740 <programlisting> 740 <programlisting>
741 <![CDATA[ 741 <![CDATA[
742 #include <sound/driver.h> 742 #include <sound/driver.h>
743 #include <linux/init.h> 743 #include <linux/init.h>
744 #include <linux/pci.h> 744 #include <linux/pci.h>
745 #include <linux/slab.h> 745 #include <linux/slab.h>
746 #include <sound/core.h> 746 #include <sound/core.h>
747 #include <sound/initval.h> 747 #include <sound/initval.h>
748 ]]> 748 ]]>
749 </programlisting> 749 </programlisting>
750 </informalexample> 750 </informalexample>
751 751
752 where the last one is necessary only when module options are 752 where the last one is necessary only when module options are
753 defined in the source file. If the code is split into several 753 defined in the source file. If the code is split into several
754 files, the files without module options don't need it. 754 files, the files without module options don't need it.
755 </para> 755 </para>
756 756
757 <para> 757 <para>
758 In addition to them, you'll need 758 In addition to them, you'll need
759 <filename>&lt;linux/interrupt.h&gt;</filename> for the interrupt 759 <filename>&lt;linux/interrupt.h&gt;</filename> for the interrupt
760 handling, and <filename>&lt;asm/io.h&gt;</filename> for the i/o 760 handling, and <filename>&lt;asm/io.h&gt;</filename> for the i/o
761 access. If you use <function>mdelay()</function> or 761 access. If you use <function>mdelay()</function> or
762 <function>udelay()</function> functions, you'll need to include 762 <function>udelay()</function> functions, you'll need to include
763 <filename>&lt;linux/delay.h&gt;</filename>, too. 763 <filename>&lt;linux/delay.h&gt;</filename>, too.
764 </para> 764 </para>
765 765
766 <para> 766 <para>
767 The ALSA interfaces like PCM or control API are defined in other 767 The ALSA interfaces like PCM or control API are defined in other
768 header files as <filename>&lt;sound/xxx.h&gt;</filename>. 768 header files as <filename>&lt;sound/xxx.h&gt;</filename>.
769 They have to be included after 769 They have to be included after
770 <filename>&lt;sound/core.h&gt;</filename>. 770 <filename>&lt;sound/core.h&gt;</filename>.
771 </para> 771 </para>
772 772
773 </section> 773 </section>
774 </chapter> 774 </chapter>
775 775
776 776
777 <!-- ****************************************************** --> 777 <!-- ****************************************************** -->
778 <!-- Management of Cards and Components --> 778 <!-- Management of Cards and Components -->
779 <!-- ****************************************************** --> 779 <!-- ****************************************************** -->
780 <chapter id="card-management"> 780 <chapter id="card-management">
781 <title>Management of Cards and Components</title> 781 <title>Management of Cards and Components</title>
782 782
783 <section id="card-management-card-instance"> 783 <section id="card-management-card-instance">
784 <title>Card Instance</title> 784 <title>Card Instance</title>
785 <para> 785 <para>
786 For each soundcard, a <quote>card</quote> record must be allocated. 786 For each soundcard, a <quote>card</quote> record must be allocated.
787 </para> 787 </para>
788 788
789 <para> 789 <para>
790 A card record is the headquarters of the soundcard. It manages 790 A card record is the headquarters of the soundcard. It manages
791 the list of all devices (components) on the soundcard, such as 791 the list of all devices (components) on the soundcard, such as
792 PCM, mixers, MIDI, synthesizer, and so on. Also, the card 792 PCM, mixers, MIDI, synthesizer, and so on. Also, the card
793 record holds the ID and the name strings of the card, manages 793 record holds the ID and the name strings of the card, manages
794 the root of proc files, and controls the power-management states 794 the root of proc files, and controls the power-management states
795 and hotplug disconnections. The component list on the card 795 and hotplug disconnections. The component list on the card
796 record is used to manage the proper releases of resources at 796 record is used to manage the proper releases of resources at
797 destruction. 797 destruction.
798 </para> 798 </para>
799 799
800 <para> 800 <para>
801 As mentioned above, to create a card instance, call 801 As mentioned above, to create a card instance, call
802 <function>snd_card_new()</function>. 802 <function>snd_card_new()</function>.
803 803
804 <informalexample> 804 <informalexample>
805 <programlisting> 805 <programlisting>
806 <![CDATA[ 806 <![CDATA[
807 struct snd_card *card; 807 struct snd_card *card;
808 card = snd_card_new(index, id, module, extra_size); 808 card = snd_card_new(index, id, module, extra_size);
809 ]]> 809 ]]>
810 </programlisting> 810 </programlisting>
811 </informalexample> 811 </informalexample>
812 </para> 812 </para>
813 813
814 <para> 814 <para>
815 The function takes four arguments, the card-index number, the 815 The function takes four arguments, the card-index number, the
816 id string, the module pointer (usually 816 id string, the module pointer (usually
817 <constant>THIS_MODULE</constant>), 817 <constant>THIS_MODULE</constant>),
818 and the size of extra-data space. The last argument is used to 818 and the size of extra-data space. The last argument is used to
819 allocate card-&gt;private_data for the 819 allocate card-&gt;private_data for the
820 chip-specific data. Note that this data 820 chip-specific data. Note that this data
821 <emphasis>is</emphasis> allocated by 821 <emphasis>is</emphasis> allocated by
822 <function>snd_card_new()</function>. 822 <function>snd_card_new()</function>.
823 </para> 823 </para>
824 </section> 824 </section>
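The role of the extra-data size can be modelled in user space. This is a sketch under stated assumptions, not the kernel implementation: `struct mock_card` and `mock_card_new()` are hypothetical, and the layout (private data placed directly after the record in one zeroed allocation) is chosen only to illustrate why no second allocation is needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model of a card record; not the kernel's struct snd_card. */
struct mock_card {
	int index;
	void *private_data;  /* points at the extra space, or NULL if none was requested */
};

/* Example chip record that will live in the extra space. */
struct mychip {
	unsigned long port;
	int irq;
};

/* Allocate the record plus extra_size bytes of zeroed space in a single block. */
static struct mock_card *mock_card_new(int index, size_t extra_size)
{
	struct mock_card *card = calloc(1, sizeof(*card) + extra_size);

	if (card == NULL)
		return NULL;
	card->index = index;
	if (extra_size > 0)
		card->private_data = (char *)card + sizeof(*card);
	return card;
}
```

Freeing the card record releases the private data with it, since both come from the same allocation.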
825 825
826 <section id="card-management-component"> 826 <section id="card-management-component">
827 <title>Components</title> 827 <title>Components</title>
828 <para> 828 <para>
829 After the card is created, you can attach the components 829 After the card is created, you can attach the components
830 (devices) to the card instance. In an ALSA driver, a component is 830 (devices) to the card instance. In an ALSA driver, a component is
831 represented as a struct <structname>snd_device</structname> object. 831 represented as a struct <structname>snd_device</structname> object.
832 A component can be a PCM instance, a control interface, a raw 832 A component can be a PCM instance, a control interface, a raw
833 MIDI interface, etc. Each of such instances has one component 833 MIDI interface, etc. Each of such instances has one component
834 entry. 834 entry.
835 </para> 835 </para>
836 836
837 <para> 837 <para>
838 A component can be created via 838 A component can be created via
839 <function>snd_device_new()</function> function. 839 <function>snd_device_new()</function> function.
840 840
841 <informalexample> 841 <informalexample>
842 <programlisting> 842 <programlisting>
843 <![CDATA[ 843 <![CDATA[
844 snd_device_new(card, SNDRV_DEV_XXX, chip, &ops); 844 snd_device_new(card, SNDRV_DEV_XXX, chip, &ops);
845 ]]> 845 ]]>
846 </programlisting> 846 </programlisting>
847 </informalexample> 847 </informalexample>
848 </para> 848 </para>
849 849
850 <para> 850 <para>
851 This takes the card pointer, the device-level 851 This takes the card pointer, the device-level
852 (<constant>SNDRV_DEV_XXX</constant>), the data pointer, and the 852 (<constant>SNDRV_DEV_XXX</constant>), the data pointer, and the
853 callback pointers (<parameter>&amp;ops</parameter>). The 853 callback pointers (<parameter>&amp;ops</parameter>). The
854 device-level defines the type of components and the order of 854 device-level defines the type of components and the order of
855 registration and de-registration. For most components, the 855 registration and de-registration. For most components, the
856 device-level is already defined. For a user-defined component, 856 device-level is already defined. For a user-defined component,
857 you can use <constant>SNDRV_DEV_LOWLEVEL</constant>. 857 you can use <constant>SNDRV_DEV_LOWLEVEL</constant>.
858 </para> 858 </para>
859 859
860 <para> 860 <para>
861 This function itself doesn't allocate the data space. The data 861 This function itself doesn't allocate the data space. The data
862 must be allocated manually beforehand, and its pointer is passed 862 must be allocated manually beforehand, and its pointer is passed
863 as the argument. This pointer is used as the identifier 863 as the argument. This pointer is used as the identifier
864 (<parameter>chip</parameter> in the above example) for the 864 (<parameter>chip</parameter> in the above example) for the
865 instance. 865 instance.
866 </para> 866 </para>
867 867
868 <para> 868 <para>
869 Each ALSA pre-defined component such as ac97 or pcm calls 869 Each ALSA pre-defined component such as ac97 or pcm calls
870 <function>snd_device_new()</function> inside its 870 <function>snd_device_new()</function> inside its
871 constructor. The destructor for each component is defined in the 871 constructor. The destructor for each component is defined in the
872 callback pointers. Hence, you don't need to take care of 872 callback pointers. Hence, you don't need to take care of
873 calling a destructor for such a component. 873 calling a destructor for such a component.
874 </para> 874 </para>
875 875
876 <para> 876 <para>
877 If you would like to create your own component, you need to 877 If you would like to create your own component, you need to
878 set the destructor function to dev_free callback in 878 set the destructor function to dev_free callback in
879 <parameter>ops</parameter>, so that it can be released 879 <parameter>ops</parameter>, so that it can be released
880 automatically via <function>snd_card_free()</function>. The 880 automatically via <function>snd_card_free()</function>. The
881 example will be shown later as an implementation of 881 example will be shown later as an implementation of
882 chip-specific data. 882 chip-specific data.
883 </para> 883 </para>
884 </section> 884 </section>
885 885
886 <section id="card-management-chip-specific"> 886 <section id="card-management-chip-specific">
887 <title>Chip-Specific Data</title> 887 <title>Chip-Specific Data</title>
888 <para> 888 <para>
889 The chip-specific information, e.g. the i/o port address, its 889 The chip-specific information, e.g. the i/o port address, its
890 resource pointer, or the irq number, is stored in the 890 resource pointer, or the irq number, is stored in the
891 chip-specific record. 891 chip-specific record.
892 892
893 <informalexample> 893 <informalexample>
894 <programlisting> 894 <programlisting>
895 <![CDATA[ 895 <![CDATA[
896 struct mychip { 896 struct mychip {
897 .... 897 ....
898 }; 898 };
899 ]]> 899 ]]>
900 </programlisting> 900 </programlisting>
901 </informalexample> 901 </informalexample>
902 </para> 902 </para>
903 903
904 <para> 904 <para>
905 In general, there are two ways to allocate the chip record. 905 In general, there are two ways to allocate the chip record.
906 </para> 906 </para>
907 907
908 <section id="card-management-chip-specific-snd-card-new"> 908 <section id="card-management-chip-specific-snd-card-new">
909 <title>1. Allocating via <function>snd_card_new()</function>.</title> 909 <title>1. Allocating via <function>snd_card_new()</function>.</title>
910 <para> 910 <para>
911 As mentioned above, you can pass the extra-data-length to the 4th argument of <function>snd_card_new()</function>, i.e. 911 As mentioned above, you can pass the extra-data-length to the 4th argument of <function>snd_card_new()</function>, i.e.
912 912
913 <informalexample> 913 <informalexample>
914 <programlisting> 914 <programlisting>
915 <![CDATA[ 915 <![CDATA[
916 card = snd_card_new(index[dev], id[dev], THIS_MODULE, sizeof(struct mychip)); 916 card = snd_card_new(index[dev], id[dev], THIS_MODULE, sizeof(struct mychip));
917 ]]> 917 ]]>
918 </programlisting> 918 </programlisting>
919 </informalexample> 919 </informalexample>
920 920
921 where struct <structname>mychip</structname> is the type of the chip record. 921 where struct <structname>mychip</structname> is the type of the chip record.
922 </para> 922 </para>
923 923
924 <para> 924 <para>
925 In return, the allocated record can be accessed as 925 In return, the allocated record can be accessed as
926 926
927 <informalexample> 927 <informalexample>
928 <programlisting> 928 <programlisting>
929 <![CDATA[ 929 <![CDATA[
930 struct mychip *chip = (struct mychip *)card->private_data; 930 struct mychip *chip = (struct mychip *)card->private_data;
931 ]]> 931 ]]>
932 </programlisting> 932 </programlisting>
933 </informalexample> 933 </informalexample>
934 934
935 With this method, you don't have to allocate twice. 935 With this method, you don't have to allocate twice.
936 The record is released together with the card instance. 936 The record is released together with the card instance.
937 </para> 937 </para>
938 </section> 938 </section>
939 939
940 <section id="card-management-chip-specific-allocate-extra"> 940 <section id="card-management-chip-specific-allocate-extra">
941 <title>2. Allocating an extra device.</title> 941 <title>2. Allocating an extra device.</title>
942 942
943 <para> 943 <para>
944 After allocating a card instance via 944 After allocating a card instance via
945 <function>snd_card_new()</function> (with 945 <function>snd_card_new()</function> (with
946 <constant>0</constant> on the 4th arg), call 946 <constant>0</constant> on the 4th arg), call
947 <function>kzalloc()</function>. 947 <function>kzalloc()</function>.
948 948
949 <informalexample> 949 <informalexample>
950 <programlisting> 950 <programlisting>
951 <![CDATA[ 951 <![CDATA[
952 struct snd_card *card; 952 struct snd_card *card;
953 struct mychip *chip; 953 struct mychip *chip;
954 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0); 954 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 0);
955 ..... 955 .....
956 chip = kzalloc(sizeof(*chip), GFP_KERNEL); 956 chip = kzalloc(sizeof(*chip), GFP_KERNEL);
957 ]]> 957 ]]>
958 </programlisting> 958 </programlisting>
959 </informalexample> 959 </informalexample>
960 </para> 960 </para>
961 961
962 <para> 962 <para>
963 The chip record should have at least a field to hold the card 963 The chip record should have at least a field to hold the card
964 pointer, 964 pointer,
965 965
966 <informalexample> 966 <informalexample>
967 <programlisting> 967 <programlisting>
968 <![CDATA[ 968 <![CDATA[
969 struct mychip { 969 struct mychip {
970 struct snd_card *card; 970 struct snd_card *card;
971 .... 971 ....
972 }; 972 };
973 ]]> 973 ]]>
974 </programlisting> 974 </programlisting>
975 </informalexample> 975 </informalexample>
976 </para> 976 </para>
977 977
978 <para> 978 <para>
979 Then, set the card pointer in the returned chip instance. 979 Then, set the card pointer in the returned chip instance.
980 980
981 <informalexample> 981 <informalexample>
982 <programlisting> 982 <programlisting>
983 <![CDATA[ 983 <![CDATA[
984 chip->card = card; 984 chip->card = card;
985 ]]> 985 ]]>
986 </programlisting> 986 </programlisting>
987 </informalexample> 987 </informalexample>
988 </para> 988 </para>
989 989
990 <para> 990 <para>
991 Next, initialize the fields, and register this chip 991 Next, initialize the fields, and register this chip
992 record as a low-level device with a specified 992 record as a low-level device with a specified
993 <parameter>ops</parameter>, 993 <parameter>ops</parameter>,
994 994
995 <informalexample> 995 <informalexample>
996 <programlisting> 996 <programlisting>
997 <![CDATA[ 997 <![CDATA[
998 static struct snd_device_ops ops = { 998 static struct snd_device_ops ops = {
999 .dev_free = snd_mychip_dev_free, 999 .dev_free = snd_mychip_dev_free,
1000 }; 1000 };
1001 .... 1001 ....
1002 snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops); 1002 snd_device_new(card, SNDRV_DEV_LOWLEVEL, chip, &ops);
1003 ]]> 1003 ]]>
1004 </programlisting> 1004 </programlisting>
1005 </informalexample> 1005 </informalexample>
1006 1006
1007 <function>snd_mychip_dev_free()</function> is the 1007 <function>snd_mychip_dev_free()</function> is the
1008 device-destructor function, which will call the real 1008 device-destructor function, which will call the real
1009 destructor. 1009 destructor.
1010 </para> 1010 </para>
1011 1011
1012 <para> 1012 <para>
1013 <informalexample> 1013 <informalexample>
1014 <programlisting> 1014 <programlisting>
1015 <![CDATA[ 1015 <![CDATA[
1016 static int snd_mychip_dev_free(struct snd_device *device) 1016 static int snd_mychip_dev_free(struct snd_device *device)
1017 { 1017 {
1018 return snd_mychip_free(device->device_data); 1018 return snd_mychip_free(device->device_data);
1019 } 1019 }
1020 ]]> 1020 ]]>
1021 </programlisting> 1021 </programlisting>
1022 </informalexample> 1022 </informalexample>
1023 1023
1024 where <function>snd_mychip_free()</function> is the real destructor. 1024 where <function>snd_mychip_free()</function> is the real destructor.
1025 </para> 1025 </para>
1026 </section> 1026 </section>
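How a dev_free callback lets the card release an attached chip record can be sketched in user space. Everything here is a hypothetical model: the `mock_` types and functions only imitate the roles of the card, the device entry and the ops table described above, and `freed_chips` is a counter added purely so the effect can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of this is kernel code. */
struct mock_device;

struct mock_device_ops {
	int (*dev_free)(struct mock_device *device);
};

struct mock_device {
	void *device_data;                  /* the chip record passed at creation */
	const struct mock_device_ops *ops;
	struct mock_device *next;
};

struct mock_card {
	struct mock_device *devices;        /* singly linked list of attached devices */
};

struct mychip {
	struct mock_card *card;
};

static int freed_chips;                     /* counts calls to the real destructor */

/* Real destructor: releases the chip record itself. */
static int mychip_free(struct mychip *chip)
{
	free(chip);
	freed_chips++;
	return 0;
}

/* Device destructor: unwraps the device entry and calls the real destructor. */
static int mychip_dev_free(struct mock_device *device)
{
	return mychip_free(device->device_data);
}

/* Attach caller-allocated data to the card together with its callbacks. */
static int mock_device_new(struct mock_card *card, void *data,
			   const struct mock_device_ops *ops)
{
	struct mock_device *d = calloc(1, sizeof(*d));

	if (d == NULL)
		return -1;
	d->device_data = data;
	d->ops = ops;
	d->next = card->devices;
	card->devices = d;
	return 0;
}

/* Release every attached device through its dev_free callback. */
static void mock_card_free(struct mock_card *card)
{
	while (card->devices) {
		struct mock_device *d = card->devices;

		card->devices = d->next;
		if (d->ops && d->ops->dev_free)
			d->ops->dev_free(d);
		free(d);
	}
}
```

The driver never calls `mychip_free()` directly on the normal path; releasing the card walks the list and invokes each callback.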
1027 </section> 1027 </section>
1028 1028
1029 <section id="card-management-registration"> 1029 <section id="card-management-registration">
1030 <title>Registration and Release</title> 1030 <title>Registration and Release</title>
1031 <para> 1031 <para>
1032 After all components are assigned, register the card instance 1032 After all components are assigned, register the card instance
1033 by calling <function>snd_card_register()</function>. Access 1033 by calling <function>snd_card_register()</function>. Access
1034 to the device files is enabled at this point. That is, before 1034 to the device files is enabled at this point. That is, before
1035 <function>snd_card_register()</function> is called, the 1035 <function>snd_card_register()</function> is called, the
1036 components are safely inaccessible from the outside. If this 1036 components are safely inaccessible from the outside. If this
1037 call fails, exit the probe function after releasing the card via 1037 call fails, exit the probe function after releasing the card via
1038 <function>snd_card_free()</function>. 1038 <function>snd_card_free()</function>.
1039 </para> 1039 </para>
1040 1040
1041 <para> 1041 <para>
1042 To release the card instance, you can simply call 1042 To release the card instance, you can simply call
1043 <function>snd_card_free()</function>. As already mentioned, all 1043 <function>snd_card_free()</function>. As already mentioned, all
1044 components are released automatically by this call. 1044 components are released automatically by this call.
1045 </para> 1045 </para>
1046 1046
1047 <para> 1047 <para>
1048 As a further note, the destructors (both 1048 As a further note, the destructors (both
1049 <function>snd_mychip_dev_free</function> and 1049 <function>snd_mychip_dev_free</function> and
1050 <function>snd_mychip_free</function>) cannot be defined with 1050 <function>snd_mychip_free</function>) cannot be defined with
1051 the <parameter>__devexit</parameter> prefix, because they may be 1051 the <parameter>__devexit</parameter> prefix, because they may be
1052 called from the constructor, too, on the error path. 1052 called from the constructor, too, on the error path.
1053 </para> 1053 </para>
1054 1054
1055 <para> 1055 <para>
1056 For a device which allows hotplugging, you can use 1056 For a device which allows hotplugging, you can use
1057 <function>snd_card_free_when_closed</function>. This one will 1057 <function>snd_card_free_when_closed</function>. This one will
1058 postpone the destruction until all devices are closed. 1058 postpone the destruction until all devices are closed.
1059 </para> 1059 </para>
1060 1060
1061 </section> 1061 </section>
1062 1062
1063 </chapter> 1063 </chapter>
1064 1064
1065 1065
1066 <!-- ****************************************************** --> 1066 <!-- ****************************************************** -->
1067 <!-- PCI Resource Managements --> 1067 <!-- PCI Resource Managements -->
1068 <!-- ****************************************************** --> 1068 <!-- ****************************************************** -->
1069 <chapter id="pci-resource"> 1069 <chapter id="pci-resource">
1070 <title>PCI Resource Managements</title> 1070 <title>PCI Resource Managements</title>
1071 1071
1072 <section id="pci-resource-example"> 1072 <section id="pci-resource-example">
1073 <title>Full Code Example</title> 1073 <title>Full Code Example</title>
1074 <para> 1074 <para>
1075 In this section, we'll finish the chip-specific constructor, 1075 In this section, we'll finish the chip-specific constructor,
1076 destructor and PCI entries. The example code is shown first, 1076 destructor and PCI entries. The example code is shown first,
1077 below. 1077 below.
1078 1078
1079 <example> 1079 <example>
1080 <title>PCI Resource Managements Example</title> 1080 <title>PCI Resource Managements Example</title>
1081 <programlisting> 1081 <programlisting>
1082 <![CDATA[ 1082 <![CDATA[
1083 struct mychip { 1083 struct mychip {
1084 struct snd_card *card; 1084 struct snd_card *card;
1085 struct pci_dev *pci; 1085 struct pci_dev *pci;
1086 1086
1087 unsigned long port; 1087 unsigned long port;
1088 int irq; 1088 int irq;
1089 }; 1089 };
1090 1090
1091 static int snd_mychip_free(struct mychip *chip) 1091 static int snd_mychip_free(struct mychip *chip)
1092 { 1092 {
1093 /* disable hardware here if any */ 1093 /* disable hardware here if any */
1094 .... // (not implemented in this document) 1094 .... // (not implemented in this document)
1095 1095
1096 /* release the irq */ 1096 /* release the irq */
1097 if (chip->irq >= 0) 1097 if (chip->irq >= 0)
1098 free_irq(chip->irq, (void *)chip); 1098 free_irq(chip->irq, (void *)chip);
1099 /* release the i/o ports & memory */ 1099 /* release the i/o ports & memory */
1100 pci_release_regions(chip->pci); 1100 pci_release_regions(chip->pci);
1101 /* disable the PCI entry */ 1101 /* disable the PCI entry */
1102 pci_disable_device(chip->pci); 1102 pci_disable_device(chip->pci);
1103 /* release the data */ 1103 /* release the data */
1104 kfree(chip); 1104 kfree(chip);
1105 return 0; 1105 return 0;
1106 } 1106 }
1107 1107
1108 /* chip-specific constructor */ 1108 /* chip-specific constructor */
1109 static int __devinit snd_mychip_create(struct snd_card *card, 1109 static int __devinit snd_mychip_create(struct snd_card *card,
1110 struct pci_dev *pci, 1110 struct pci_dev *pci,
1111 struct mychip **rchip) 1111 struct mychip **rchip)
1112 { 1112 {
1113 struct mychip *chip; 1113 struct mychip *chip;
1114 int err; 1114 int err;
1115 static struct snd_device_ops ops = { 1115 static struct snd_device_ops ops = {
1116 .dev_free = snd_mychip_dev_free, 1116 .dev_free = snd_mychip_dev_free,
1117 }; 1117 };
1118 1118
1119 *rchip = NULL; 1119 *rchip = NULL;
1120 1120
1121 /* initialize the PCI entry */ 1121 /* initialize the PCI entry */
1122 if ((err = pci_enable_device(pci)) < 0) 1122 if ((err = pci_enable_device(pci)) < 0)
1123 return err; 1123 return err;
1124 /* check PCI availability (28bit DMA) */ 1124 /* check PCI availability (28bit DMA) */
1125 if (pci_set_dma_mask(pci, DMA_28BIT_MASK) < 0 || 1125 if (pci_set_dma_mask(pci, DMA_28BIT_MASK) < 0 ||
1126 pci_set_consistent_dma_mask(pci, DMA_28BIT_MASK) < 0) { 1126 pci_set_consistent_dma_mask(pci, DMA_28BIT_MASK) < 0) {
1127 printk(KERN_ERR "error to set 28bit mask DMA\n"); 1127 printk(KERN_ERR "error to set 28bit mask DMA\n");
1128 pci_disable_device(pci); 1128 pci_disable_device(pci);
1129 return -ENXIO; 1129 return -ENXIO;
1130 } 1130 }
1131 1131
1132 chip = kzalloc(sizeof(*chip), GFP_KERNEL); 1132 chip = kzalloc(sizeof(*chip), GFP_KERNEL);
1133 if (chip == NULL) { 1133 if (chip == NULL) {
1134 pci_disable_device(pci); 1134 pci_disable_device(pci);
1135 return -ENOMEM; 1135 return -ENOMEM;
1136 } 1136 }
1137 1137
1138 /* initialize the stuff */ 1138 /* initialize the stuff */
1139 chip->card = card; 1139 chip->card = card;
1140 chip->pci = pci; 1140 chip->pci = pci;
1141 chip->irq = -1; 1141 chip->irq = -1;
1142 1142
1143 /* (1) PCI resource allocation */ 1143 /* (1) PCI resource allocation */
1144 if ((err = pci_request_regions(pci, "My Chip")) < 0) { 1144 if ((err = pci_request_regions(pci, "My Chip")) < 0) {
1145 kfree(chip); 1145 kfree(chip);
1146 pci_disable_device(pci); 1146 pci_disable_device(pci);
1147 return err; 1147 return err;
1148 } 1148 }
1149 chip->port = pci_resource_start(pci, 0); 1149 chip->port = pci_resource_start(pci, 0);
1150 if (request_irq(pci->irq, snd_mychip_interrupt, 1150 if (request_irq(pci->irq, snd_mychip_interrupt,
1151 IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) { 1151 IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) {
1152 printk(KERN_ERR "cannot grab irq %d\n", pci->irq); 1152 printk(KERN_ERR "cannot grab irq %d\n", pci->irq);
1153 snd_mychip_free(chip); 1153 snd_mychip_free(chip);
1154 return -EBUSY; 1154 return -EBUSY;
1155 } 1155 }
1156 chip->irq = pci->irq; 1156 chip->irq = pci->irq;
1157 1157
1158 /* (2) initialization of the chip hardware */ 1158 /* (2) initialization of the chip hardware */
1159 .... // (not implemented in this document) 1159 .... // (not implemented in this document)
1160 1160
1161 if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL, 1161 if ((err = snd_device_new(card, SNDRV_DEV_LOWLEVEL,
1162 chip, &ops)) < 0) { 1162 chip, &ops)) < 0) {
1163 snd_mychip_free(chip); 1163 snd_mychip_free(chip);
1164 return err; 1164 return err;
1165 } 1165 }
1166 1166
1167 snd_card_set_dev(card, &pci->dev); 1167 snd_card_set_dev(card, &pci->dev);
1168 1168
1169 *rchip = chip; 1169 *rchip = chip;
1170 return 0; 1170 return 0;
1171 } 1171 }
1172 1172
1173 /* PCI IDs */ 1173 /* PCI IDs */
1174 static struct pci_device_id snd_mychip_ids[] = { 1174 static struct pci_device_id snd_mychip_ids[] = {
1175 { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR, 1175 { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR,
1176 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, }, 1176 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, },
1177 .... 1177 ....
1178 { 0, } 1178 { 0, }
1179 }; 1179 };
1180 MODULE_DEVICE_TABLE(pci, snd_mychip_ids); 1180 MODULE_DEVICE_TABLE(pci, snd_mychip_ids);
1181 1181
1182 /* pci_driver definition */ 1182 /* pci_driver definition */
1183 static struct pci_driver driver = { 1183 static struct pci_driver driver = {
1184 .name = "My Own Chip", 1184 .name = "My Own Chip",
1185 .id_table = snd_mychip_ids, 1185 .id_table = snd_mychip_ids,
1186 .probe = snd_mychip_probe, 1186 .probe = snd_mychip_probe,
1187 .remove = __devexit_p(snd_mychip_remove), 1187 .remove = __devexit_p(snd_mychip_remove),
1188 }; 1188 };
1189 1189
1190 /* initialization of the module */ 1190 /* initialization of the module */
1191 static int __init alsa_card_mychip_init(void) 1191 static int __init alsa_card_mychip_init(void)
1192 { 1192 {
1193 return pci_register_driver(&driver); 1193 return pci_register_driver(&driver);
1194 } 1194 }
1195 1195
1196 /* clean up the module */ 1196 /* clean up the module */
1197 static void __exit alsa_card_mychip_exit(void) 1197 static void __exit alsa_card_mychip_exit(void)
1198 { 1198 {
1199 pci_unregister_driver(&driver); 1199 pci_unregister_driver(&driver);
1200 } 1200 }
1201 1201
1202 module_init(alsa_card_mychip_init) 1202 module_init(alsa_card_mychip_init)
1203 module_exit(alsa_card_mychip_exit) 1203 module_exit(alsa_card_mychip_exit)
1204 1204
1205 EXPORT_NO_SYMBOLS; /* for old kernels only */ 1205 EXPORT_NO_SYMBOLS; /* for old kernels only */
1206 ]]> 1206 ]]>
1207 </programlisting> 1207 </programlisting>
1208 </example> 1208 </example>
1209 </para> 1209 </para>
1210 </section> 1210 </section>
1211 1211
1212 <section id="pci-resource-some-haftas"> 1212 <section id="pci-resource-some-haftas">
1213 <title>Some Hafta's</title> 1213 <title>Some Hafta's</title>
1214 <para> 1214 <para>
1215 The allocation of PCI resources is done in the 1215 The allocation of PCI resources is done in the
1216 <function>probe()</function> function, and usually an extra 1216 <function>probe()</function> function, and usually an extra
1217 <function>xxx_create()</function> function is written for this 1217 <function>xxx_create()</function> function is written for this
1218 purpose. 1218 purpose.
1219 </para> 1219 </para>
1220 1220
1221 <para> 1221 <para>
1222 In the case of PCI devices, you first have to call the 1222 In the case of PCI devices, you first have to call the
1223 <function>pci_enable_device()</function> function before 1223 <function>pci_enable_device()</function> function before
1224 allocating resources. Also, you need to set the proper PCI DMA 1224 allocating resources. Also, you need to set the proper PCI DMA
1225 mask to limit the accessed i/o range. In some cases, you might 1225 mask to limit the accessed i/o range. In some cases, you might
1226 need to call the <function>pci_set_master()</function> function, 1226 need to call the <function>pci_set_master()</function> function,
1227 too. 1227 too.
1228 </para> 1228 </para>
1229 1229
1230 <para> 1230 <para>
1231 Assuming a 28bit mask, the code to be added would look like this: 1231 Assuming a 28bit mask, the code to be added would look like this:
1232 1232
1233 <informalexample> 1233 <informalexample>
1234 <programlisting> 1234 <programlisting>
1235 <![CDATA[ 1235 <![CDATA[
1236 if ((err = pci_enable_device(pci)) < 0) 1236 if ((err = pci_enable_device(pci)) < 0)
1237 return err; 1237 return err;
1238 if (pci_set_dma_mask(pci, DMA_28BIT_MASK) < 0 || 1238 if (pci_set_dma_mask(pci, DMA_28BIT_MASK) < 0 ||
1239 pci_set_consistent_dma_mask(pci, DMA_28BIT_MASK) < 0) { 1239 pci_set_consistent_dma_mask(pci, DMA_28BIT_MASK) < 0) {
1240 printk(KERN_ERR "error to set 28bit mask DMA\n"); 1240 printk(KERN_ERR "error to set 28bit mask DMA\n");
1241 pci_disable_device(pci); 1241 pci_disable_device(pci);
1242 return -ENXIO; 1242 return -ENXIO;
1243 } 1243 }
1244 1244
1245 ]]> 1245 ]]>
1246 </programlisting> 1246 </programlisting>
1247 </informalexample> 1247 </informalexample>
1248 </para> 1248 </para>
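<para>
To make the effect of a 28bit mask concrete, the following user-space sketch checks whether a buffer lies entirely within the range such a mask allows. It is only an illustration under stated assumptions: <constant>MY_DMA_28BIT_MASK</constant> and <function>fits_dma_mask()</function> are hypothetical names invented for this example, not kernel APIs, and the constant simply has the low 28 bits set.
</para>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 28bit DMA mask: the low 28 bits are set. */
#define MY_DMA_28BIT_MASK ((UINT64_C(1) << 28) - 1)

/*
 * Return 1 if every byte of the buffer [addr, addr + len) is addressable
 * by a device limited to the given mask, 0 otherwise.
 */
static int fits_dma_mask(uint64_t addr, uint64_t len, uint64_t mask)
{
	assert(len > 0);
	if (addr > mask)
		return 0;
	/* compare against the remaining room so addr + len cannot overflow */
	return (len - 1) <= (mask - addr);
}
```

<para>
With this mask, a 4096-byte buffer starting at 0x0ffff000 still fits, while anything reaching 0x10000000 or above does not.
</para>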
1249 </section> 1249 </section>
1250 1250
1251 <section id="pci-resource-resource-allocation"> 1251 <section id="pci-resource-resource-allocation">
1252 <title>Resource Allocation</title> 1252 <title>Resource Allocation</title>
1253 <para> 1253 <para>
1254 The allocation of I/O ports and irqs is done via standard kernel 1254 The allocation of I/O ports and irqs is done via standard kernel
1255 functions. Unlike ALSA ver. 0.5.x, there are no helpers for 1255 functions. Unlike ALSA ver. 0.5.x, there are no helpers for
1256 that. These resources must be released in the destructor 1256 that. These resources must be released in the destructor
1257 function (see below). Also, on ALSA 0.9.x, you don't need to 1257 function (see below). Also, on ALSA 0.9.x, you don't need to
1258 allocate (pseudo-)DMA for PCI as you did on ALSA 0.5.x. 1258 allocate (pseudo-)DMA for PCI as you did on ALSA 0.5.x.
1259 </para> 1259 </para>
1260 1260
1261 <para> 1261 <para>
1262 Now assume that this PCI device has an I/O port with 8 bytes 1262 Now assume that this PCI device has an I/O port with 8 bytes
1263 and an interrupt. Then struct <structname>mychip</structname> will have the 1263 and an interrupt. Then struct <structname>mychip</structname> will have the
1264 following fields: 1264 following fields:
1265 1265
1266 <informalexample> 1266 <informalexample>
1267 <programlisting> 1267 <programlisting>
1268 <![CDATA[ 1268 <![CDATA[
1269 struct mychip { 1269 struct mychip {
1270 struct snd_card *card; 1270 struct snd_card *card;
1271 1271
1272 unsigned long port; 1272 unsigned long port;
1273 int irq; 1273 int irq;
1274 }; 1274 };
1275 ]]> 1275 ]]>
1276 </programlisting> 1276 </programlisting>
1277 </informalexample> 1277 </informalexample>
1278 </para> 1278 </para>
1279 1279
1280 <para> 1280 <para>
1281 For an i/o port (and also a memory region), you need to have 1281 For an i/o port (and also a memory region), you need to have
1282 the resource pointer for the standard resource management. For 1282 the resource pointer for the standard resource management. For
1283 an irq, you have to keep only the irq number (integer). But you 1283 an irq, you have to keep only the irq number (integer). But you
1284 need to initialize this number to -1 before the actual allocation, 1284 need to initialize this number to -1 before the actual allocation,
1285 since irq 0 is valid. The port address and its resource pointer 1285 since irq 0 is valid. The port address and its resource pointer
1286 are initialized to zero (NULL) automatically by 1286 are initialized to zero (NULL) automatically by
1287 <function>kzalloc()</function>, so you 1287 <function>kzalloc()</function>, so you
1288 don't have to take care of resetting them. 1288 don't have to take care of resetting them.
1289 </para> 1289 </para>
1290 1290
1291 <para> 1291 <para>
1292 The allocation of an i/o port is done like this: 1292 The allocation of an i/o port is done like this:
1293 1293
1294 <informalexample> 1294 <informalexample>
1295 <programlisting> 1295 <programlisting>
1296 <![CDATA[ 1296 <![CDATA[
1297 if ((err = pci_request_regions(pci, "My Chip")) < 0) { 1297 if ((err = pci_request_regions(pci, "My Chip")) < 0) {
1298 kfree(chip); 1298 kfree(chip);
1299 pci_disable_device(pci); 1299 pci_disable_device(pci);
1300 return err; 1300 return err;
1301 } 1301 }
1302 chip->port = pci_resource_start(pci, 0); 1302 chip->port = pci_resource_start(pci, 0);
1303 ]]> 1303 ]]>
1304 </programlisting> 1304 </programlisting>
1305 </informalexample> 1305 </informalexample>
1306 </para> 1306 </para>
1307 1307
1308 <para> 1308 <para>
1309 <!-- obsolete --> 1309 <!-- obsolete -->
1310 It will reserve the i/o port region of 8 bytes of the given 1310 It will reserve the i/o port region of 8 bytes of the given
1311 PCI device. The returned value, chip-&gt;res_port, is allocated 1311 PCI device. The returned value, chip-&gt;res_port, is allocated
1312 via <function>kmalloc()</function> by 1312 via <function>kmalloc()</function> by
1313 <function>request_region()</function>. The pointer must be 1313 <function>request_region()</function>. The pointer must be
1314 released via <function>kfree()</function>, but there is some 1314 released via <function>kfree()</function>, but there is some
1315 problem regarding this. This issue will be explained more below. 1315 problem regarding this. This issue will be explained more below.
1316 </para> 1316 </para>
1317 1317
1318 <para> 1318 <para>
1319 The allocation of an interrupt source is done like this: 1319 The allocation of an interrupt source is done like this:
1320 1320
1321 <informalexample> 1321 <informalexample>
1322 <programlisting> 1322 <programlisting>
1323 <![CDATA[ 1323 <![CDATA[
1324 if (request_irq(pci->irq, snd_mychip_interrupt, 1324 if (request_irq(pci->irq, snd_mychip_interrupt,
1325 IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) { 1325 IRQF_DISABLED|IRQF_SHARED, "My Chip", chip)) {
1326 printk(KERN_ERR "cannot grab irq %d\n", pci->irq); 1326 printk(KERN_ERR "cannot grab irq %d\n", pci->irq);
1327 snd_mychip_free(chip); 1327 snd_mychip_free(chip);
1328 return -EBUSY; 1328 return -EBUSY;
1329 } 1329 }
1330 chip->irq = pci->irq; 1330 chip->irq = pci->irq;
1331 ]]> 1331 ]]>
1332 </programlisting> 1332 </programlisting>
1333 </informalexample> 1333 </informalexample>
1334 1334
1335 where <function>snd_mychip_interrupt()</function> is the 1335 where <function>snd_mychip_interrupt()</function> is the
1336 interrupt handler defined <link 1336 interrupt handler defined <link
1337 linkend="pcm-interface-interrupt-handler"><citetitle>later</citetitle></link>. 1337 linkend="pcm-interface-interrupt-handler"><citetitle>later</citetitle></link>.
1338 Note that chip-&gt;irq should be set 1338 Note that chip-&gt;irq should be set
1339 only after <function>request_irq()</function> has succeeded. 1339 only after <function>request_irq()</function> has succeeded.
1340 </para> 1340 </para>
1341 1341
1342 <para> 1342 <para>
1343 On the PCI bus, the interrupts can be shared. Thus, 1343 On the PCI bus, the interrupts can be shared. Thus,
1344 <constant>IRQF_SHARED</constant> is given as the interrupt flag of 1344 <constant>IRQF_SHARED</constant> is given as the interrupt flag of
1345 <function>request_irq()</function>. 1345 <function>request_irq()</function>.
1346 </para> 1346 </para>
1347 1347
1348 <para> 1348 <para>
1349 The last argument of <function>request_irq()</function> is the 1349 The last argument of <function>request_irq()</function> is the
1350 data pointer passed to the interrupt handler. Usually, the 1350 data pointer passed to the interrupt handler. Usually, the
1351 chip-specific record is used for that, but you can use what you 1351 chip-specific record is used for that, but you can use what you
1352 like, too. 1352 like, too.
1353 </para> 1353 </para>
1354 1354
1355 <para> 1355 <para>
1356 I won't go into the details of the interrupt handler at this 1356 I won't go into the details of the interrupt handler at this
1357 point, but at least its appearance can be explained now. The 1357 point, but at least its appearance can be explained now. The
1358 interrupt handler usually looks like the following: 1358 interrupt handler usually looks like the following:
1359 1359
1360 <informalexample> 1360 <informalexample>
1361 <programlisting> 1361 <programlisting>
1362 <![CDATA[ 1362 <![CDATA[
1363 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id, 1363 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id,
1364 struct pt_regs *regs) 1364 struct pt_regs *regs)
1365 { 1365 {
1366 struct mychip *chip = dev_id; 1366 struct mychip *chip = dev_id;
1367 .... 1367 ....
1368 return IRQ_HANDLED; 1368 return IRQ_HANDLED;
1369 } 1369 }
1370 ]]> 1370 ]]>
1371 </programlisting> 1371 </programlisting>
1372 </informalexample> 1372 </informalexample>
1373 </para> 1373 </para>
1374 1374
1375 <para> 1375 <para>
1376 Now let's write the corresponding destructor for the resources 1376 Now let's write the corresponding destructor for the resources
1377 above. The role of the destructor is simple: disable the hardware 1377 above. The role of the destructor is simple: disable the hardware
1378 (if already activated) and release the resources. So far, we 1378 (if already activated) and release the resources. So far, we
1379 have no hardware part, so the disabling is not written here. 1379 have no hardware part, so the disabling is not written here.
1380 </para> 1380 </para>
1381 1381
1382 <para> 1382 <para>
1383 For releasing the resources, the <quote>check-and-release</quote> 1383 For releasing the resources, the <quote>check-and-release</quote>
1384 method is the safer way. For the interrupt, do it like this: 1384 method is the safer way. For the interrupt, do it like this:
1385 1385
1386 <informalexample> 1386 <informalexample>
1387 <programlisting> 1387 <programlisting>
1388 <![CDATA[ 1388 <![CDATA[
1389 if (chip->irq >= 0) 1389 if (chip->irq >= 0)
1390 free_irq(chip->irq, (void *)chip); 1390 free_irq(chip->irq, (void *)chip);
1391 ]]> 1391 ]]>
1392 </programlisting> 1392 </programlisting>
1393 </informalexample> 1393 </informalexample>
1394 1394
1395 Since the irq number can start from 0, you should initialize 1395 Since the irq number can start from 0, you should initialize
1396 chip-&gt;irq with a negative value (e.g. -1), so that you can 1396 chip-&gt;irq with a negative value (e.g. -1), so that you can
1397 check the validity of the irq number as above. 1397 check the validity of the irq number as above.
1398 </para> 1398 </para>
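<para>
The check-and-release idea can be tried out in user space. The sketch below is not kernel code: <structname>fake_chip</structname> and the <function>fake_*()</function> helpers are hypothetical stand-ins invented for this illustration. It only shows why a negative sentinel makes the destructor safe to call whether or not the irq was ever acquired.
</para>

```c
#include <assert.h>

/* Hypothetical user-space stand-ins; none of these are kernel APIs. */
struct fake_chip {
	int irq;           /* -1 means "no irq acquired" */
	int irqs_released; /* counts calls to fake_free_irq() */
};

static void fake_chip_init(struct fake_chip *chip)
{
	chip->irq = -1;    /* 0 cannot be the sentinel: irq 0 is valid */
	chip->irqs_released = 0;
}

/* Models free_irq(): must only be called for an acquired irq. */
static void fake_free_irq(struct fake_chip *chip)
{
	assert(chip->irq >= 0);
	chip->irqs_released++;
	chip->irq = -1;
}

/* Destructor: check first, then release. */
static void fake_chip_free(struct fake_chip *chip)
{
	if (chip->irq >= 0)
		fake_free_irq(chip);
}
```

<para>
Calling <function>fake_chip_free()</function> right after <function>fake_chip_init()</function> releases nothing, and calling it twice after acquiring irq 0 releases the irq exactly once.
</para>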
1399 1399
1400 <para> 1400 <para>
1401 If you requested I/O ports or memory regions via 1401 If you requested I/O ports or memory regions via
1402 <function>pci_request_region()</function> or 1402 <function>pci_request_region()</function> or
1403 <function>pci_request_regions()</function> as in this example, 1403 <function>pci_request_regions()</function> as in this example,
1404 release the resource(s) using the corresponding function, 1404 release the resource(s) using the corresponding function,
1405 <function>pci_release_region()</function> or 1405 <function>pci_release_region()</function> or
1406 <function>pci_release_regions()</function>. 1406 <function>pci_release_regions()</function>.
1407 1407
1408 <informalexample> 1408 <informalexample>
1409 <programlisting> 1409 <programlisting>
1410 <![CDATA[ 1410 <![CDATA[
1411 pci_release_regions(chip->pci); 1411 pci_release_regions(chip->pci);
1412 ]]> 1412 ]]>
1413 </programlisting> 1413 </programlisting>
1414 </informalexample> 1414 </informalexample>
1415 </para> 1415 </para>
1416 1416
1417 <para> 1417 <para>
1418 If you requested a region manually via <function>request_region()</function> 1418 If you requested a region manually via <function>request_region()</function>
1419 or <function>request_mem_region()</function>, you can release it via 1419 or <function>request_mem_region()</function>, you can release it via
1420 <function>release_resource()</function>. Suppose that you keep 1420 <function>release_resource()</function>. Suppose that you keep
1421 the resource pointer returned from <function>request_region()</function> 1421 the resource pointer returned from <function>request_region()</function>
1422 in chip-&gt;res_port; the release procedure then looks like this: 1422 in chip-&gt;res_port; the release procedure then looks like this:
1423 1423
1424 <informalexample> 1424 <informalexample>
1425 <programlisting> 1425 <programlisting>
1426 <![CDATA[ 1426 <![CDATA[
1427 release_and_free_resource(chip->res_port); 1427 release_and_free_resource(chip->res_port);
1428 ]]> 1428 ]]>
1429 </programlisting> 1429 </programlisting>
1430 </informalexample> 1430 </informalexample>
1431 </para> 1431 </para>
1432 1432
1433 <para> 1433 <para>
1434 Don't forget to call <function>pci_disable_device()</function> 1434 Don't forget to call <function>pci_disable_device()</function>
1435 before the destructor finishes. 1435 before the destructor finishes.
1436 </para> 1436 </para>
1437 1437
1438 <para> 1438 <para>
1439 And finally, release the chip-specific record. 1439 And finally, release the chip-specific record.
1440 1440
1441 <informalexample> 1441 <informalexample>
1442 <programlisting> 1442 <programlisting>
1443 <![CDATA[ 1443 <![CDATA[
1444 kfree(chip); 1444 kfree(chip);
1445 ]]> 1445 ]]>
1446 </programlisting> 1446 </programlisting>
1447 </informalexample> 1447 </informalexample>
1448 </para> 1448 </para>
1449 1449
1450 <para> 1450 <para>
1451 Again, remember that you cannot 1451 Again, remember that you cannot
1452 use the <parameter>__devexit</parameter> prefix for this destructor. 1452 use the <parameter>__devexit</parameter> prefix for this destructor.
1453 </para> 1453 </para>
1454 1454
1455 <para> 1455 <para>
1456 We didn't implement the hardware-disabling part in the above. 1456 We didn't implement the hardware-disabling part in the above.
1457 If you need to do this, please note that the destructor may be 1457 If you need to do this, please note that the destructor may be
1458 called even before the initialization of the chip is completed. 1458 called even before the initialization of the chip is completed.
1459 It would be better to have a flag to skip the hardware-disabling 1459 It would be better to have a flag to skip the hardware-disabling
1460 if the hardware was not initialized yet. 1460 if the hardware was not initialized yet.
1461 </para> 1461 </para>
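<para>
One possible shape for such a flag is sketched below as plain user-space C. The names <structname>demo_chip</structname>, <structfield>hw_initialized</structfield> and the <function>demo_*()</function> functions are assumptions made up for this illustration; a real driver would place its own register accesses where the comments indicate.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; the names are illustrative only. */
struct demo_chip {
	int hw_initialized; /* set only after hardware setup has completed */
	int hw_stop_calls;  /* counts how often the hardware was stopped */
};

static void demo_hw_start(struct demo_chip *chip)
{
	/* ... a real driver would program the registers here ... */
	chip->hw_initialized = 1;
}

static void demo_hw_stop(struct demo_chip *chip)
{
	assert(chip->hw_initialized);
	/* ... a real driver would quiesce the hardware here ... */
	chip->hw_stop_calls++;
	chip->hw_initialized = 0;
}

/* Destructor: may run on an error path before demo_hw_start(). */
static int demo_chip_free(struct demo_chip *chip)
{
	if (chip == NULL)
		return 0;
	if (chip->hw_initialized)
		demo_hw_stop(chip);
	return 0;
}
```

<para>
The destructor touches the hardware only when the flag says it was brought up, so an early failure in the constructor can call it without side effects.
</para>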
1462 1462
1463 <para> 1463 <para>
1464 When the chip-data is assigned to the card using 1464 When the chip-data is assigned to the card using
1465 <function>snd_device_new()</function> with 1465 <function>snd_device_new()</function> with
1466 <constant>SNDRV_DEV_LOWLEVEL</constant>, its destructor is 1466 <constant>SNDRV_DEV_LOWLEVEL</constant>, its destructor is
1467 called last. That is, it is assured that all other 1467 called last. That is, it is assured that all other
1468 components like PCMs and controls have already been released. 1468 components like PCMs and controls have already been released.
1469 You don't have to stop PCMs, etc. explicitly, but just 1469 You don't have to stop PCMs, etc. explicitly, but just
1470 stop the hardware at the low level. 1470 stop the hardware at the low level.
1471 </para> 1471 </para>
1472 1472
1473 <para> 1473 <para>
1474 The management of a memory-mapped region is almost the same as 1474 The management of a memory-mapped region is almost the same as
1475 the management of an i/o port. You'll need two fields like 1475 the management of an i/o port. You'll need two fields like
1476 the following: 1476 the following:
1477 1477
1478 <informalexample> 1478 <informalexample>
1479 <programlisting> 1479 <programlisting>
1480 <![CDATA[ 1480 <![CDATA[
1481 struct mychip { 1481 struct mychip {
1482 .... 1482 ....
1483 unsigned long iobase_phys; 1483 unsigned long iobase_phys;
1484 void __iomem *iobase_virt; 1484 void __iomem *iobase_virt;
1485 }; 1485 };
1486 ]]> 1486 ]]>
1487 </programlisting> 1487 </programlisting>
1488 </informalexample> 1488 </informalexample>
1489 1489
1490 and the allocation would look like this: 1490 and the allocation would look like this:
1491 1491
1492 <informalexample> 1492 <informalexample>
1493 <programlisting> 1493 <programlisting>
1494 <![CDATA[ 1494 <![CDATA[
1495 if ((err = pci_request_regions(pci, "My Chip")) < 0) { 1495 if ((err = pci_request_regions(pci, "My Chip")) < 0) {
1496 kfree(chip); 1496 kfree(chip);
1497 return err; 1497 return err;
1498 } 1498 }
1499 chip->iobase_phys = pci_resource_start(pci, 0); 1499 chip->iobase_phys = pci_resource_start(pci, 0);
1500 chip->iobase_virt = ioremap_nocache(chip->iobase_phys, 1500 chip->iobase_virt = ioremap_nocache(chip->iobase_phys,
1501 pci_resource_len(pci, 0)); 1501 pci_resource_len(pci, 0));
1502 ]]> 1502 ]]>
1503 </programlisting> 1503 </programlisting>
1504 </informalexample> 1504 </informalexample>
1505 1505
1506 and the corresponding destructor would be: 1506 and the corresponding destructor would be:
1507 1507
1508 <informalexample> 1508 <informalexample>
1509 <programlisting> 1509 <programlisting>
1510 <![CDATA[ 1510 <![CDATA[
1511 static int snd_mychip_free(struct mychip *chip) 1511 static int snd_mychip_free(struct mychip *chip)
1512 { 1512 {
1513 .... 1513 ....
1514 if (chip->iobase_virt) 1514 if (chip->iobase_virt)
1515 iounmap(chip->iobase_virt); 1515 iounmap(chip->iobase_virt);
1516 .... 1516 ....
1517 pci_release_regions(chip->pci); 1517 pci_release_regions(chip->pci);
1518 .... 1518 ....
1519 } 1519 }
1520 ]]> 1520 ]]>
1521 </programlisting> 1521 </programlisting>
1522 </informalexample> 1522 </informalexample>
1523 </para> 1523 </para>
1524 1524
1525 </section> 1525 </section>
1526 1526
1527 <section id="pci-resource-device-struct"> 1527 <section id="pci-resource-device-struct">
1528 <title>Registration of Device Struct</title> 1528 <title>Registration of Device Struct</title>
1529 <para> 1529 <para>
1530 At some point, typically after calling <function>snd_device_new()</function>, 1530 At some point, typically after calling <function>snd_device_new()</function>,
1531 you need to register the struct <structname>device</structname> of the chip 1531 you need to register the struct <structname>device</structname> of the chip
1532 you're handling for udev and co. ALSA provides a macro for compatibility with 1532 you're handling for udev and co. ALSA provides a macro for compatibility with
1533 older kernels. Simply call it as follows: 1533 older kernels. Simply call it as follows:
1534 <informalexample> 1534 <informalexample>
1535 <programlisting> 1535 <programlisting>
1536 <![CDATA[ 1536 <![CDATA[
1537 snd_card_set_dev(card, &pci->dev); 1537 snd_card_set_dev(card, &pci->dev);
1538 ]]> 1538 ]]>
1539 </programlisting> 1539 </programlisting>
1540 </informalexample> 1540 </informalexample>
1541 so that it stores the PCI device pointer in the card. This will be 1541 so that it stores the PCI device pointer in the card. This will be
1542 referred to by ALSA core functions later when the devices are registered. 1542 referred to by ALSA core functions later when the devices are registered.
1543 </para> 1543 </para>
1544 <para> 1544 <para>
1545 In the case of non-PCI, pass the proper device struct pointer of the bus 1545 In the case of non-PCI, pass the proper device struct pointer of the bus
1546 instead. (In the case of legacy ISA without PnP, you don't have to do 1546 instead. (In the case of legacy ISA without PnP, you don't have to do
1547 anything.) 1547 anything.)
1548 </para> 1548 </para>
1549 </section> 1549 </section>
1550 1550
1551 <section id="pci-resource-entries"> 1551 <section id="pci-resource-entries">
1552 <title>PCI Entries</title> 1552 <title>PCI Entries</title>
1553 <para> 1553 <para>
1554 So far, so good. Let's finish the remaining PCI 1554 So far, so good. Let's finish the remaining PCI
1555 stuff. First, we need a 1555 stuff. First, we need a
1556 <structname>pci_device_id</structname> table for this 1556 <structname>pci_device_id</structname> table for this
1557 chipset. It's a table of PCI vendor/device ID numbers and some 1557 chipset. It's a table of PCI vendor/device ID numbers and some
1558 masks. 1558 masks.
1559 </para> 1559 </para>
1560 1560
1561 <para> 1561 <para>
1562 For example, 1562 For example,
1563 1563
1564 <informalexample> 1564 <informalexample>
1565 <programlisting> 1565 <programlisting>
1566 <![CDATA[ 1566 <![CDATA[
1567 static struct pci_device_id snd_mychip_ids[] = { 1567 static struct pci_device_id snd_mychip_ids[] = {
1568 { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR, 1568 { PCI_VENDOR_ID_FOO, PCI_DEVICE_ID_BAR,
1569 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, }, 1569 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, },
1570 .... 1570 ....
1571 { 0, } 1571 { 0, }
1572 }; 1572 };
1573 MODULE_DEVICE_TABLE(pci, snd_mychip_ids); 1573 MODULE_DEVICE_TABLE(pci, snd_mychip_ids);
1574 ]]> 1574 ]]>
1575 </programlisting> 1575 </programlisting>
1576 </informalexample> 1576 </informalexample>
1577 </para> 1577 </para>
1578 1578
1579 <para> 1579 <para>
1580 The first and second fields of the 1580 The first and second fields of the
1581 <structname>pci_device_id</structname> struct are the vendor and 1581 <structname>pci_device_id</structname> struct are the vendor and
1582 device IDs. If you have no special need to filter the matching 1582 device IDs. If you have no special need to filter the matching
1583 devices, you can leave the remaining fields as above. The last 1583 devices, you can leave the remaining fields as above. The last
1584 field of the <structname>pci_device_id</structname> struct is 1584 field of the <structname>pci_device_id</structname> struct is
1585 private data for this entry. You can specify any value here, for 1585 private data for this entry. You can specify any value here, for
1586 example, to select different operations for each 1586 example, to select different operations for each
1587 device ID. Such an example is found in the intel8x0 driver. 1587 device ID. Such an example is found in the intel8x0 driver.
1588 </para> 1588 </para>
1589 1589
1590 <para> 1590 <para>
1591 The last entry of this list is the terminator. You must 1591 The last entry of this list is the terminator. You must
1592 specify this all-zero entry. 1592 specify this all-zero entry.
1593 </para> 1593 </para>
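<para>
Why the terminator matters becomes clear once you picture how such a table is walked. The following user-space sketch is loosely modelled on that idea and is not the kernel's implementation: <structname>demo_id</structname>, <constant>DEMO_ANY_ID</constant> and <function>demo_match()</function> are hypothetical names, and the loop simply stops at the first entry whose vendor field is zero.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard value, modelled on the idea of PCI_ANY_ID. */
#define DEMO_ANY_ID 0xffffffffu

/* Hypothetical, simplified ID entry; not the kernel's struct. */
struct demo_id {
	uint32_t vendor, device;       /* vendor == 0 marks the terminator */
	uint32_t subvendor, subdevice;
	unsigned long driver_data;     /* private per-entry value */
};

/* A table field matches if it is the wildcard or equals the device's value. */
static int demo_field_ok(uint32_t want, uint32_t have)
{
	return want == DEMO_ANY_ID || want == have;
}

/* Walk the table until the all-zero terminator; return the first match. */
static const struct demo_id *demo_match(const struct demo_id *ids,
					uint32_t vendor, uint32_t device,
					uint32_t subvendor, uint32_t subdevice)
{
	assert(ids != NULL);
	for (; ids->vendor != 0; ids++) {
		if (demo_field_ok(ids->vendor, vendor) &&
		    demo_field_ok(ids->device, device) &&
		    demo_field_ok(ids->subvendor, subvendor) &&
		    demo_field_ok(ids->subdevice, subdevice))
			return ids;
	}
	return NULL; /* reached the terminator without a match */
}
```

<para>
Without the all-zero entry the loop would have no stopping condition and would read past the end of the array.
</para>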
1594 1594
1595 <para> 1595 <para>
1596 Then, prepare the <structname>pci_driver</structname> record: 1596 Then, prepare the <structname>pci_driver</structname> record:
1597 1597
1598 <informalexample> 1598 <informalexample>
1599 <programlisting> 1599 <programlisting>
1600 <![CDATA[ 1600 <![CDATA[
1601 static struct pci_driver driver = { 1601 static struct pci_driver driver = {
1602 .name = "My Own Chip", 1602 .name = "My Own Chip",
1603 .id_table = snd_mychip_ids, 1603 .id_table = snd_mychip_ids,
1604 .probe = snd_mychip_probe, 1604 .probe = snd_mychip_probe,
1605 .remove = __devexit_p(snd_mychip_remove), 1605 .remove = __devexit_p(snd_mychip_remove),
1606 }; 1606 };
1607 ]]> 1607 ]]>
1608 </programlisting> 1608 </programlisting>
1609 </informalexample> 1609 </informalexample>
1610 </para> 1610 </para>
1611 1611
1612 <para> 1612 <para>
1613 The <structfield>probe</structfield> and 1613 The <structfield>probe</structfield> and
1614 <structfield>remove</structfield> functions are what we already 1614 <structfield>remove</structfield> functions are what we already
1615 defined in 1615 defined in
1616 the previous sections. The <structfield>remove</structfield> should 1616 the previous sections. The <structfield>remove</structfield> should
1617 be defined with the 1617 be defined with the
1618 <function>__devexit_p()</function> macro, so that it's not 1618 <function>__devexit_p()</function> macro, so that it's not
1619 defined for the built-in (and non-hot-pluggable) case. The 1619 defined for the built-in (and non-hot-pluggable) case. The
1620 <structfield>name</structfield> 1620 <structfield>name</structfield>
1621 field is the name string of this device. Note that you must not 1621 field is the name string of this device. Note that you must not
1622 use a slash <quote>/</quote> in this string. 1622 use a slash <quote>/</quote> in this string.
1623 </para> 1623 </para>
1624 1624
1625 <para> 1625 <para>
1626 And finally, the module entries: 1626 And finally, the module entries:
1627 1627
1628 <informalexample> 1628 <informalexample>
1629 <programlisting> 1629 <programlisting>
1630 <![CDATA[ 1630 <![CDATA[
1631 static int __init alsa_card_mychip_init(void) 1631 static int __init alsa_card_mychip_init(void)
1632 { 1632 {
1633 return pci_register_driver(&driver); 1633 return pci_register_driver(&driver);
1634 } 1634 }
1635 1635
1636 static void __exit alsa_card_mychip_exit(void) 1636 static void __exit alsa_card_mychip_exit(void)
1637 { 1637 {
1638 pci_unregister_driver(&driver); 1638 pci_unregister_driver(&driver);
1639 } 1639 }
1640 1640
1641 module_init(alsa_card_mychip_init) 1641 module_init(alsa_card_mychip_init)
1642 module_exit(alsa_card_mychip_exit) 1642 module_exit(alsa_card_mychip_exit)
1643 ]]> 1643 ]]>
1644 </programlisting> 1644 </programlisting>
1645 </informalexample> 1645 </informalexample>
1646 </para> 1646 </para>
1647 1647
1648 <para> 1648 <para>
1649 Note that these module entries are tagged with 1649 Note that these module entries are tagged with
1650 <parameter>__init</parameter> and 1650 <parameter>__init</parameter> and
1651 <parameter>__exit</parameter> prefixes, not 1651 <parameter>__exit</parameter> prefixes, not
1652 <parameter>__devinit</parameter> nor 1652 <parameter>__devinit</parameter> nor
1653 <parameter>__devexit</parameter>. 1653 <parameter>__devexit</parameter>.
1654 </para> 1654 </para>
1655 1655
1656 <para> 1656 <para>
One more thing: if your module exports no symbols,
you need to declare this explicitly on 2.2 or 2.4 kernels
(on 2.6 kernels it is not necessary).
1660 1660
1661 <informalexample> 1661 <informalexample>
1662 <programlisting> 1662 <programlisting>
1663 <![CDATA[ 1663 <![CDATA[
1664 EXPORT_NO_SYMBOLS; 1664 EXPORT_NO_SYMBOLS;
1665 ]]> 1665 ]]>
1666 </programlisting> 1666 </programlisting>
1667 </informalexample> 1667 </informalexample>
1668 1668
1669 That's all! 1669 That's all!
1670 </para> 1670 </para>
1671 </section> 1671 </section>
1672 </chapter> 1672 </chapter>
1673 1673
1674 1674
1675 <!-- ****************************************************** --> 1675 <!-- ****************************************************** -->
1676 <!-- PCM Interface --> 1676 <!-- PCM Interface -->
1677 <!-- ****************************************************** --> 1677 <!-- ****************************************************** -->
1678 <chapter id="pcm-interface"> 1678 <chapter id="pcm-interface">
1679 <title>PCM Interface</title> 1679 <title>PCM Interface</title>
1680 1680
1681 <section id="pcm-interface-general"> 1681 <section id="pcm-interface-general">
1682 <title>General</title> 1682 <title>General</title>
1683 <para> 1683 <para>
1684 The PCM middle layer of ALSA is quite powerful and it is only 1684 The PCM middle layer of ALSA is quite powerful and it is only
1685 necessary for each driver to implement the low-level functions 1685 necessary for each driver to implement the low-level functions
1686 to access its hardware. 1686 to access its hardware.
1687 </para> 1687 </para>
1688 1688
1689 <para> 1689 <para>
To access the PCM layer, you need to include
<filename>&lt;sound/pcm.h&gt;</filename> first. In addition,
<filename>&lt;sound/pcm_params.h&gt;</filename> might be needed
if you use functions related to hw_params.
1694 </para> 1694 </para>
1695 1695
1696 <para> 1696 <para>
Each card device can have up to four pcm instances. A pcm
instance corresponds to a pcm device file. The limit on the
number of instances comes only from the available bit size of
Linux device numbers. Once 64bit device numbers are
used, more pcm instances will be available.
1702 </para> 1702 </para>
1703 1703
1704 <para> 1704 <para>
A pcm instance consists of pcm playback and capture streams,
and each pcm stream consists of one or more pcm substreams. Some
soundcards support multiple playback functions. For example,
emu10k1 has a PCM playback of 32 stereo substreams. In this case, at
each open, a free substream is (usually) automatically chosen
and opened. Meanwhile, when only one substream exists and it is
already open, a subsequent open will either block
or fail with <constant>EAGAIN</constant>, depending on the
file open mode. You don't have to handle these details in your
driver; the PCM middle layer takes care of them.
1715 </para> 1715 </para>
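<para>
As a rough illustration of the behaviour described above, the
following user-space sketch models how a free substream might be
picked at open time. It is not ALSA code: the names
model_stream, model_open_nonblock() and model_close() are
hypothetical, and only the non-blocking case is modelled (a
blocking open would wait instead of failing with
<constant>EAGAIN</constant>).
</para>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_SUBSTREAMS 32	/* hypothetical upper bound for this model */

/* Hypothetical stand-in for one pcm stream (playback or capture). */
struct model_stream {
	size_t substream_count;			/* substreams offered by the chip */
	bool opened[MODEL_MAX_SUBSTREAMS];	/* true while a substream is in use */
};

/*
 * Pick the first free substream and mark it as opened.
 * Returns the substream number, or -EAGAIN when every substream
 * is already in use.
 */
int model_open_nonblock(struct model_stream *s)
{
	size_t i;

	assert(s->substream_count <= MODEL_MAX_SUBSTREAMS);
	for (i = 0; i < s->substream_count; i++) {
		if (!s->opened[i]) {
			s->opened[i] = true;
			return (int)i;
		}
	}
	return -EAGAIN;
}

/* Release a substream so that a later open can reuse it. */
void model_close(struct model_stream *s, int number)
{
	assert(number >= 0 && (size_t)number < s->substream_count);
	s->opened[number] = false;
}
```

<para>
With two substreams, the first two opens in this model return 0
and 1, a third fails with <constant>EAGAIN</constant>, and
closing substream 0 makes it available again.
</para>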
1716 </section> 1716 </section>
1717 1717
1718 <section id="pcm-interface-example"> 1718 <section id="pcm-interface-example">
1719 <title>Full Code Example</title> 1719 <title>Full Code Example</title>
1720 <para> 1720 <para>
The example code below does not include any hardware access
routines; it shows only the skeleton of how to build up the PCM
interfaces.
1724 1724
1725 <example> 1725 <example>
1726 <title>PCM Example Code</title> 1726 <title>PCM Example Code</title>
1727 <programlisting> 1727 <programlisting>
1728 <![CDATA[ 1728 <![CDATA[
1729 #include <sound/pcm.h> 1729 #include <sound/pcm.h>
1730 .... 1730 ....
1731 1731
1732 /* hardware definition */ 1732 /* hardware definition */
1733 static struct snd_pcm_hardware snd_mychip_playback_hw = { 1733 static struct snd_pcm_hardware snd_mychip_playback_hw = {
1734 .info = (SNDRV_PCM_INFO_MMAP | 1734 .info = (SNDRV_PCM_INFO_MMAP |
1735 SNDRV_PCM_INFO_INTERLEAVED | 1735 SNDRV_PCM_INFO_INTERLEAVED |
1736 SNDRV_PCM_INFO_BLOCK_TRANSFER | 1736 SNDRV_PCM_INFO_BLOCK_TRANSFER |
1737 SNDRV_PCM_INFO_MMAP_VALID), 1737 SNDRV_PCM_INFO_MMAP_VALID),
1738 .formats = SNDRV_PCM_FMTBIT_S16_LE, 1738 .formats = SNDRV_PCM_FMTBIT_S16_LE,
1739 .rates = SNDRV_PCM_RATE_8000_48000, 1739 .rates = SNDRV_PCM_RATE_8000_48000,
1740 .rate_min = 8000, 1740 .rate_min = 8000,
1741 .rate_max = 48000, 1741 .rate_max = 48000,
1742 .channels_min = 2, 1742 .channels_min = 2,
1743 .channels_max = 2, 1743 .channels_max = 2,
1744 .buffer_bytes_max = 32768, 1744 .buffer_bytes_max = 32768,
1745 .period_bytes_min = 4096, 1745 .period_bytes_min = 4096,
1746 .period_bytes_max = 32768, 1746 .period_bytes_max = 32768,
1747 .periods_min = 1, 1747 .periods_min = 1,
1748 .periods_max = 1024, 1748 .periods_max = 1024,
1749 }; 1749 };
1750 1750
1751 /* hardware definition */ 1751 /* hardware definition */
1752 static struct snd_pcm_hardware snd_mychip_capture_hw = { 1752 static struct snd_pcm_hardware snd_mychip_capture_hw = {
1753 .info = (SNDRV_PCM_INFO_MMAP | 1753 .info = (SNDRV_PCM_INFO_MMAP |
1754 SNDRV_PCM_INFO_INTERLEAVED | 1754 SNDRV_PCM_INFO_INTERLEAVED |
1755 SNDRV_PCM_INFO_BLOCK_TRANSFER | 1755 SNDRV_PCM_INFO_BLOCK_TRANSFER |
1756 SNDRV_PCM_INFO_MMAP_VALID), 1756 SNDRV_PCM_INFO_MMAP_VALID),
1757 .formats = SNDRV_PCM_FMTBIT_S16_LE, 1757 .formats = SNDRV_PCM_FMTBIT_S16_LE,
1758 .rates = SNDRV_PCM_RATE_8000_48000, 1758 .rates = SNDRV_PCM_RATE_8000_48000,
1759 .rate_min = 8000, 1759 .rate_min = 8000,
1760 .rate_max = 48000, 1760 .rate_max = 48000,
1761 .channels_min = 2, 1761 .channels_min = 2,
1762 .channels_max = 2, 1762 .channels_max = 2,
1763 .buffer_bytes_max = 32768, 1763 .buffer_bytes_max = 32768,
1764 .period_bytes_min = 4096, 1764 .period_bytes_min = 4096,
1765 .period_bytes_max = 32768, 1765 .period_bytes_max = 32768,
1766 .periods_min = 1, 1766 .periods_min = 1,
1767 .periods_max = 1024, 1767 .periods_max = 1024,
1768 }; 1768 };
1769 1769
1770 /* open callback */ 1770 /* open callback */
1771 static int snd_mychip_playback_open(struct snd_pcm_substream *substream) 1771 static int snd_mychip_playback_open(struct snd_pcm_substream *substream)
1772 { 1772 {
1773 struct mychip *chip = snd_pcm_substream_chip(substream); 1773 struct mychip *chip = snd_pcm_substream_chip(substream);
1774 struct snd_pcm_runtime *runtime = substream->runtime; 1774 struct snd_pcm_runtime *runtime = substream->runtime;
1775 1775
1776 runtime->hw = snd_mychip_playback_hw; 1776 runtime->hw = snd_mychip_playback_hw;
1777 // more hardware-initialization will be done here 1777 // more hardware-initialization will be done here
1778 return 0; 1778 return 0;
1779 } 1779 }
1780 1780
/* close callback */
static int snd_mychip_playback_close(struct snd_pcm_substream *substream)
{
        struct mychip *chip = snd_pcm_substream_chip(substream);
        /* the hardware-specific code goes here */
        return 0;
}
1789 1789
1790 /* open callback */ 1790 /* open callback */
1791 static int snd_mychip_capture_open(struct snd_pcm_substream *substream) 1791 static int snd_mychip_capture_open(struct snd_pcm_substream *substream)
1792 { 1792 {
1793 struct mychip *chip = snd_pcm_substream_chip(substream); 1793 struct mychip *chip = snd_pcm_substream_chip(substream);
1794 struct snd_pcm_runtime *runtime = substream->runtime; 1794 struct snd_pcm_runtime *runtime = substream->runtime;
1795 1795
1796 runtime->hw = snd_mychip_capture_hw; 1796 runtime->hw = snd_mychip_capture_hw;
1797 // more hardware-initialization will be done here 1797 // more hardware-initialization will be done here
1798 return 0; 1798 return 0;
1799 } 1799 }
1800 1800
/* close callback */
static int snd_mychip_capture_close(struct snd_pcm_substream *substream)
{
        struct mychip *chip = snd_pcm_substream_chip(substream);
        /* the hardware-specific code goes here */
        return 0;
}
1809 1809
1810 /* hw_params callback */ 1810 /* hw_params callback */
1811 static int snd_mychip_pcm_hw_params(struct snd_pcm_substream *substream, 1811 static int snd_mychip_pcm_hw_params(struct snd_pcm_substream *substream,
1812 struct snd_pcm_hw_params *hw_params) 1812 struct snd_pcm_hw_params *hw_params)
1813 { 1813 {
1814 return snd_pcm_lib_malloc_pages(substream, 1814 return snd_pcm_lib_malloc_pages(substream,
1815 params_buffer_bytes(hw_params)); 1815 params_buffer_bytes(hw_params));
1816 } 1816 }
1817 1817
1818 /* hw_free callback */ 1818 /* hw_free callback */
1819 static int snd_mychip_pcm_hw_free(struct snd_pcm_substream *substream) 1819 static int snd_mychip_pcm_hw_free(struct snd_pcm_substream *substream)
1820 { 1820 {
1821 return snd_pcm_lib_free_pages(substream); 1821 return snd_pcm_lib_free_pages(substream);
1822 } 1822 }
1823 1823
1824 /* prepare callback */ 1824 /* prepare callback */
1825 static int snd_mychip_pcm_prepare(struct snd_pcm_substream *substream) 1825 static int snd_mychip_pcm_prepare(struct snd_pcm_substream *substream)
1826 { 1826 {
1827 struct mychip *chip = snd_pcm_substream_chip(substream); 1827 struct mychip *chip = snd_pcm_substream_chip(substream);
1828 struct snd_pcm_runtime *runtime = substream->runtime; 1828 struct snd_pcm_runtime *runtime = substream->runtime;
1829 1829
1830 /* set up the hardware with the current configuration 1830 /* set up the hardware with the current configuration
1831 * for example... 1831 * for example...
1832 */ 1832 */
1833 mychip_set_sample_format(chip, runtime->format); 1833 mychip_set_sample_format(chip, runtime->format);
1834 mychip_set_sample_rate(chip, runtime->rate); 1834 mychip_set_sample_rate(chip, runtime->rate);
1835 mychip_set_channels(chip, runtime->channels); 1835 mychip_set_channels(chip, runtime->channels);
1836 mychip_set_dma_setup(chip, runtime->dma_addr, 1836 mychip_set_dma_setup(chip, runtime->dma_addr,
1837 chip->buffer_size, 1837 chip->buffer_size,
1838 chip->period_size); 1838 chip->period_size);
1839 return 0; 1839 return 0;
1840 } 1840 }
1841 1841
/* trigger callback */
static int snd_mychip_pcm_trigger(struct snd_pcm_substream *substream,
                                  int cmd)
{
        switch (cmd) {
        case SNDRV_PCM_TRIGGER_START:
                /* do something to start the PCM engine */
                break;
        case SNDRV_PCM_TRIGGER_STOP:
                /* do something to stop the PCM engine */
                break;
        default:
                return -EINVAL;
        }
        /* the handled commands succeeded */
        return 0;
}
1857 1857
1858 /* pointer callback */ 1858 /* pointer callback */
1859 static snd_pcm_uframes_t 1859 static snd_pcm_uframes_t
1860 snd_mychip_pcm_pointer(struct snd_pcm_substream *substream) 1860 snd_mychip_pcm_pointer(struct snd_pcm_substream *substream)
1861 { 1861 {
1862 struct mychip *chip = snd_pcm_substream_chip(substream); 1862 struct mychip *chip = snd_pcm_substream_chip(substream);
1863 unsigned int current_ptr; 1863 unsigned int current_ptr;
1864 1864
1865 /* get the current hardware pointer */ 1865 /* get the current hardware pointer */
1866 current_ptr = mychip_get_hw_pointer(chip); 1866 current_ptr = mychip_get_hw_pointer(chip);
1867 return current_ptr; 1867 return current_ptr;
1868 } 1868 }
1869 1869
1870 /* operators */ 1870 /* operators */
1871 static struct snd_pcm_ops snd_mychip_playback_ops = { 1871 static struct snd_pcm_ops snd_mychip_playback_ops = {
1872 .open = snd_mychip_playback_open, 1872 .open = snd_mychip_playback_open,
1873 .close = snd_mychip_playback_close, 1873 .close = snd_mychip_playback_close,
1874 .ioctl = snd_pcm_lib_ioctl, 1874 .ioctl = snd_pcm_lib_ioctl,
1875 .hw_params = snd_mychip_pcm_hw_params, 1875 .hw_params = snd_mychip_pcm_hw_params,
1876 .hw_free = snd_mychip_pcm_hw_free, 1876 .hw_free = snd_mychip_pcm_hw_free,
1877 .prepare = snd_mychip_pcm_prepare, 1877 .prepare = snd_mychip_pcm_prepare,
1878 .trigger = snd_mychip_pcm_trigger, 1878 .trigger = snd_mychip_pcm_trigger,
1879 .pointer = snd_mychip_pcm_pointer, 1879 .pointer = snd_mychip_pcm_pointer,
1880 }; 1880 };
1881 1881
1882 /* operators */ 1882 /* operators */
1883 static struct snd_pcm_ops snd_mychip_capture_ops = { 1883 static struct snd_pcm_ops snd_mychip_capture_ops = {
1884 .open = snd_mychip_capture_open, 1884 .open = snd_mychip_capture_open,
1885 .close = snd_mychip_capture_close, 1885 .close = snd_mychip_capture_close,
1886 .ioctl = snd_pcm_lib_ioctl, 1886 .ioctl = snd_pcm_lib_ioctl,
1887 .hw_params = snd_mychip_pcm_hw_params, 1887 .hw_params = snd_mychip_pcm_hw_params,
1888 .hw_free = snd_mychip_pcm_hw_free, 1888 .hw_free = snd_mychip_pcm_hw_free,
1889 .prepare = snd_mychip_pcm_prepare, 1889 .prepare = snd_mychip_pcm_prepare,
1890 .trigger = snd_mychip_pcm_trigger, 1890 .trigger = snd_mychip_pcm_trigger,
1891 .pointer = snd_mychip_pcm_pointer, 1891 .pointer = snd_mychip_pcm_pointer,
1892 }; 1892 };
1893 1893
1894 /* 1894 /*
1895 * definitions of capture are omitted here... 1895 * definitions of capture are omitted here...
1896 */ 1896 */
1897 1897
1898 /* create a pcm device */ 1898 /* create a pcm device */
1899 static int __devinit snd_mychip_new_pcm(struct mychip *chip) 1899 static int __devinit snd_mychip_new_pcm(struct mychip *chip)
1900 { 1900 {
1901 struct snd_pcm *pcm; 1901 struct snd_pcm *pcm;
1902 int err; 1902 int err;
1903 1903
1904 if ((err = snd_pcm_new(chip->card, "My Chip", 0, 1, 1, 1904 if ((err = snd_pcm_new(chip->card, "My Chip", 0, 1, 1,
1905 &pcm)) < 0) 1905 &pcm)) < 0)
1906 return err; 1906 return err;
1907 pcm->private_data = chip; 1907 pcm->private_data = chip;
1908 strcpy(pcm->name, "My Chip"); 1908 strcpy(pcm->name, "My Chip");
1909 chip->pcm = pcm; 1909 chip->pcm = pcm;
1910 /* set operators */ 1910 /* set operators */
1911 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_PLAYBACK, 1911 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_PLAYBACK,
1912 &snd_mychip_playback_ops); 1912 &snd_mychip_playback_ops);
1913 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE, 1913 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE,
1914 &snd_mychip_capture_ops); 1914 &snd_mychip_capture_ops);
1915 /* pre-allocation of buffers */ 1915 /* pre-allocation of buffers */
1916 /* NOTE: this may fail */ 1916 /* NOTE: this may fail */
1917 snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV, 1917 snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV,
1918 snd_dma_pci_data(chip->pci), 1918 snd_dma_pci_data(chip->pci),
1919 64*1024, 64*1024); 1919 64*1024, 64*1024);
1920 return 0; 1920 return 0;
1921 } 1921 }
1922 ]]> 1922 ]]>
1923 </programlisting> 1923 </programlisting>
1924 </example> 1924 </example>
1925 </para> 1925 </para>
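<para>
To get a feeling for the byte limits in the hardware definition
above, the following user-space sketch converts them into frames
and milliseconds. The helper names (frame_bytes(),
bytes_to_frames_model(), frames_to_ms()) are hypothetical and are
not part of the ALSA API; the numbers are the ones from
snd_mychip_playback_hw (S16_LE, 2 channels, up to 48000 Hz).
</para>

```c
#include <assert.h>

/* Bytes occupied by one frame, i.e. one sample for every channel. */
unsigned int frame_bytes(unsigned int channels, unsigned int sample_bits)
{
	assert(sample_bits % 8 == 0);
	return channels * (sample_bits / 8);
}

/* Number of whole frames that fit into a byte count. */
unsigned int bytes_to_frames_model(unsigned int bytes,
				   unsigned int channels,
				   unsigned int sample_bits)
{
	return bytes / frame_bytes(channels, sample_bits);
}

/* Playing time of a frame count in milliseconds, rounded down. */
unsigned int frames_to_ms(unsigned int frames, unsigned int rate)
{
	assert(rate > 0);
	return (unsigned int)((unsigned long long)frames * 1000 / rate);
}
```

<para>
With these helpers, buffer_bytes_max = 32768 corresponds to 8192
frames (about 170 ms at 48000 Hz), and period_bytes_min = 4096
corresponds to 1024 frames (about 21 ms), so at most eight
minimum-sized periods fit into the largest buffer.
</para>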
1926 </section> 1926 </section>
1927 1927
1928 <section id="pcm-interface-constructor"> 1928 <section id="pcm-interface-constructor">
1929 <title>Constructor</title> 1929 <title>Constructor</title>
1930 <para> 1930 <para>
A pcm instance is allocated by the <function>snd_pcm_new()</function>
function. It is better to create a dedicated constructor for the pcm,
namely:
1934 1934
1935 <informalexample> 1935 <informalexample>
1936 <programlisting> 1936 <programlisting>
1937 <![CDATA[ 1937 <![CDATA[
1938 static int __devinit snd_mychip_new_pcm(struct mychip *chip) 1938 static int __devinit snd_mychip_new_pcm(struct mychip *chip)
1939 { 1939 {
1940 struct snd_pcm *pcm; 1940 struct snd_pcm *pcm;
1941 int err; 1941 int err;
1942 1942
1943 if ((err = snd_pcm_new(chip->card, "My Chip", 0, 1, 1, 1943 if ((err = snd_pcm_new(chip->card, "My Chip", 0, 1, 1,
1944 &pcm)) < 0) 1944 &pcm)) < 0)
1945 return err; 1945 return err;
1946 pcm->private_data = chip; 1946 pcm->private_data = chip;
1947 strcpy(pcm->name, "My Chip"); 1947 strcpy(pcm->name, "My Chip");
1948 chip->pcm = pcm; 1948 chip->pcm = pcm;
1949 .... 1949 ....
1950 return 0; 1950 return 0;
1951 } 1951 }
1952 ]]> 1952 ]]>
1953 </programlisting> 1953 </programlisting>
1954 </informalexample> 1954 </informalexample>
1955 </para> 1955 </para>
1956 1956
1957 <para> 1957 <para>
The <function>snd_pcm_new()</function> function takes six
arguments. The first argument is the card pointer to which this
pcm is assigned, and the second is the ID string. The last
argument receives the pointer to the new pcm instance
(&amp;pcm in the above).
1961 </para> 1961 </para>
1962 1962
1963 <para> 1963 <para>
The third argument (<parameter>index</parameter>, 0 in the
above) is the index of this new pcm. It begins from zero. If
you create more than one pcm instance, specify a
different number in this argument for each. For example,
<parameter>index</parameter> = 1 for the second PCM device.
1969 </para> 1969 </para>
1970 1970
1971 <para> 1971 <para>
The fourth and fifth arguments are the number of substreams
for playback and capture, respectively. In the above example,
1 is given for both. When no playback or no capture is available,
pass 0 to the corresponding argument.
1976 </para> 1976 </para>
1977 1977
1978 <para> 1978 <para>
If a chip supports multiple playbacks or captures, you can
specify larger numbers, but they must be handled properly in
the open/close and other callbacks. When you need to know which
substream you are referring to, you can obtain it from the
struct <structname>snd_pcm_substream</structname> data passed to each callback
as follows:
1985 1985
1986 <informalexample> 1986 <informalexample>
1987 <programlisting> 1987 <programlisting>
1988 <![CDATA[ 1988 <![CDATA[
1989 struct snd_pcm_substream *substream; 1989 struct snd_pcm_substream *substream;
1990 int index = substream->number; 1990 int index = substream->number;
1991 ]]> 1991 ]]>
1992 </programlisting> 1992 </programlisting>
1993 </informalexample> 1993 </informalexample>
1994 </para> 1994 </para>
1995 1995
1996 <para> 1996 <para>
1997 After the pcm is created, you need to set operators for each 1997 After the pcm is created, you need to set operators for each
1998 pcm stream. 1998 pcm stream.
1999 1999
2000 <informalexample> 2000 <informalexample>
2001 <programlisting> 2001 <programlisting>
2002 <![CDATA[ 2002 <![CDATA[
2003 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_PLAYBACK, 2003 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_PLAYBACK,
2004 &snd_mychip_playback_ops); 2004 &snd_mychip_playback_ops);
2005 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE, 2005 snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE,
2006 &snd_mychip_capture_ops); 2006 &snd_mychip_capture_ops);
2007 ]]> 2007 ]]>
2008 </programlisting> 2008 </programlisting>
2009 </informalexample> 2009 </informalexample>
2010 </para> 2010 </para>
2011 2011
2012 <para> 2012 <para>
2013 The operators are defined typically like this: 2013 The operators are defined typically like this:
2014 2014
2015 <informalexample> 2015 <informalexample>
2016 <programlisting> 2016 <programlisting>
2017 <![CDATA[ 2017 <![CDATA[
2018 static struct snd_pcm_ops snd_mychip_playback_ops = { 2018 static struct snd_pcm_ops snd_mychip_playback_ops = {
2019 .open = snd_mychip_pcm_open, 2019 .open = snd_mychip_pcm_open,
2020 .close = snd_mychip_pcm_close, 2020 .close = snd_mychip_pcm_close,
2021 .ioctl = snd_pcm_lib_ioctl, 2021 .ioctl = snd_pcm_lib_ioctl,
2022 .hw_params = snd_mychip_pcm_hw_params, 2022 .hw_params = snd_mychip_pcm_hw_params,
2023 .hw_free = snd_mychip_pcm_hw_free, 2023 .hw_free = snd_mychip_pcm_hw_free,
2024 .prepare = snd_mychip_pcm_prepare, 2024 .prepare = snd_mychip_pcm_prepare,
2025 .trigger = snd_mychip_pcm_trigger, 2025 .trigger = snd_mychip_pcm_trigger,
2026 .pointer = snd_mychip_pcm_pointer, 2026 .pointer = snd_mychip_pcm_pointer,
2027 }; 2027 };
2028 ]]> 2028 ]]>
2029 </programlisting> 2029 </programlisting>
2030 </informalexample> 2030 </informalexample>
2031 2031
Each of the callbacks is explained in the subsection
<link linkend="pcm-interface-operators"><citetitle>
Operators</citetitle></link>.
2035 </para> 2035 </para>
2036 2036
2037 <para> 2037 <para>
After setting the operators, you will most likely want to
pre-allocate the buffer. For the pre-allocation, simply call
the following:
2041 2041
2042 <informalexample> 2042 <informalexample>
2043 <programlisting> 2043 <programlisting>
2044 <![CDATA[ 2044 <![CDATA[
2045 snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV, 2045 snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV,
2046 snd_dma_pci_data(chip->pci), 2046 snd_dma_pci_data(chip->pci),
2047 64*1024, 64*1024); 2047 64*1024, 64*1024);
2048 ]]> 2048 ]]>
2049 </programlisting> 2049 </programlisting>
2050 </informalexample> 2050 </informalexample>
2051 2051
It will allocate a buffer of up to 64kB by default. The details of
buffer management are described in the later section <link
linkend="buffer-and-memory"><citetitle>Buffer and Memory
Management</citetitle></link>.
2056 </para> 2056 </para>
2057 2057
2058 <para> 2058 <para>
Additionally, you can set some extra information for this pcm
in pcm-&gt;info_flags.
The available values are defined as
<constant>SNDRV_PCM_INFO_XXX</constant> in
<filename>&lt;sound/asound.h&gt;</filename>, and are also used for
the hardware definition (described later). When your soundchip
supports only half-duplex, specify it like this:
2066 2066
2067 <informalexample> 2067 <informalexample>
2068 <programlisting> 2068 <programlisting>
2069 <![CDATA[ 2069 <![CDATA[
2070 pcm->info_flags = SNDRV_PCM_INFO_HALF_DUPLEX; 2070 pcm->info_flags = SNDRV_PCM_INFO_HALF_DUPLEX;
2071 ]]> 2071 ]]>
2072 </programlisting> 2072 </programlisting>
2073 </informalexample> 2073 </informalexample>
2074 </para> 2074 </para>
2075 </section> 2075 </section>
2076 2076
2077 <section id="pcm-interface-destructor"> 2077 <section id="pcm-interface-destructor">
2078 <title>... And the Destructor?</title> 2078 <title>... And the Destructor?</title>
2079 <para> 2079 <para>
The destructor for a pcm instance is not always
necessary. Since the pcm device will be released by the middle
layer code automatically, you don't have to call a destructor
explicitly.
2084 </para> 2084 </para>
2085 2085
2086 <para> 2086 <para>
A destructor is necessary when you have created some
special records internally and need to release them. In such a
case, set the destructor function to
pcm-&gt;private_free:
2091 2091
2092 <example> 2092 <example>
2093 <title>PCM Instance with a Destructor</title> 2093 <title>PCM Instance with a Destructor</title>
2094 <programlisting> 2094 <programlisting>
2095 <![CDATA[ 2095 <![CDATA[
2096 static void mychip_pcm_free(struct snd_pcm *pcm) 2096 static void mychip_pcm_free(struct snd_pcm *pcm)
2097 { 2097 {
2098 struct mychip *chip = snd_pcm_chip(pcm); 2098 struct mychip *chip = snd_pcm_chip(pcm);
2099 /* free your own data */ 2099 /* free your own data */
2100 kfree(chip->my_private_pcm_data); 2100 kfree(chip->my_private_pcm_data);
2101 // do what you like else 2101 // do what you like else
2102 .... 2102 ....
2103 } 2103 }
2104 2104
2105 static int __devinit snd_mychip_new_pcm(struct mychip *chip) 2105 static int __devinit snd_mychip_new_pcm(struct mychip *chip)
2106 { 2106 {
2107 struct snd_pcm *pcm; 2107 struct snd_pcm *pcm;
2108 .... 2108 ....
2109 /* allocate your own data */ 2109 /* allocate your own data */
2110 chip->my_private_pcm_data = kmalloc(...); 2110 chip->my_private_pcm_data = kmalloc(...);
2111 /* set the destructor */ 2111 /* set the destructor */
2112 pcm->private_data = chip; 2112 pcm->private_data = chip;
2113 pcm->private_free = mychip_pcm_free; 2113 pcm->private_free = mychip_pcm_free;
2114 .... 2114 ....
2115 } 2115 }
2116 ]]> 2116 ]]>
2117 </programlisting> 2117 </programlisting>
2118 </example> 2118 </example>
2119 </para> 2119 </para>
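<para>
The following user-space sketch models how such a destructor
hook might be invoked when the pcm goes away. It is only a model
under stated assumptions: model_pcm, model_chip and
model_pcm_release() are hypothetical stand-ins for the kernel
structures and the middle layer's release path, and malloc()/free()
stand in for kmalloc()/kfree().
</para>

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct snd_pcm. */
struct model_pcm {
	void *private_data;
	void (*private_free)(struct model_pcm *pcm);
};

/* Hypothetical stand-in for the driver's chip record. */
struct model_chip {
	void *my_private_pcm_data;
	int freed;	/* counts destructor calls, for illustration only */
};

/* Destructor: releases what the constructor allocated. */
void model_chip_pcm_free(struct model_pcm *pcm)
{
	struct model_chip *chip = pcm->private_data;

	assert(chip != NULL);
	free(chip->my_private_pcm_data);
	chip->my_private_pcm_data = NULL;
	chip->freed++;
}

/* Constructor: allocates private data and registers the destructor. */
int model_chip_new_pcm(struct model_chip *chip, struct model_pcm *pcm)
{
	chip->my_private_pcm_data = malloc(64);
	if (!chip->my_private_pcm_data)
		return -ENOMEM;
	pcm->private_data = chip;
	pcm->private_free = model_chip_pcm_free;
	return 0;
}

/* Model of the release path: call the hook once, if one was set. */
void model_pcm_release(struct model_pcm *pcm)
{
	if (pcm->private_free)
		pcm->private_free(pcm);
	pcm->private_free = NULL;
}
```

<para>
Note that the sketch checks the allocation result before
registering the destructor, so the hook never runs on a pointer
that was not set up.
</para>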
2120 </section> 2120 </section>
2121 2121
2122 <section id="pcm-interface-runtime"> 2122 <section id="pcm-interface-runtime">
2123 <title>Runtime Pointer - The Chest of PCM Information</title> 2123 <title>Runtime Pointer - The Chest of PCM Information</title>
2124 <para> 2124 <para>
When the PCM substream is opened, a PCM runtime instance is
allocated and assigned to the substream. This pointer is
accessible via <constant>substream-&gt;runtime</constant>.
The runtime pointer holds a variety of information: a copy
of the hw_params and sw_params configurations, the buffer
pointers, mmap records, spinlocks, etc. Almost everything you
need for controlling the PCM can be found there.
2132 </para> 2132 </para>
2133 2133
2134 <para> 2134 <para>
The definition of the runtime instance is found in
<filename>&lt;sound/pcm.h&gt;</filename>. Here is a
copy from the file.
2138 <informalexample> 2138 <informalexample>
2139 <programlisting> 2139 <programlisting>
2140 <![CDATA[ 2140 <![CDATA[
2141 struct _snd_pcm_runtime { 2141 struct _snd_pcm_runtime {
2142 /* -- Status -- */ 2142 /* -- Status -- */
2143 struct snd_pcm_substream *trigger_master; 2143 struct snd_pcm_substream *trigger_master;
2144 snd_timestamp_t trigger_tstamp; /* trigger timestamp */ 2144 snd_timestamp_t trigger_tstamp; /* trigger timestamp */
2145 int overrange; 2145 int overrange;
2146 snd_pcm_uframes_t avail_max; 2146 snd_pcm_uframes_t avail_max;
2147 snd_pcm_uframes_t hw_ptr_base; /* Position at buffer restart */ 2147 snd_pcm_uframes_t hw_ptr_base; /* Position at buffer restart */
2148 snd_pcm_uframes_t hw_ptr_interrupt; /* Position at interrupt time*/ 2148 snd_pcm_uframes_t hw_ptr_interrupt; /* Position at interrupt time*/
2149 2149
2150 /* -- HW params -- */ 2150 /* -- HW params -- */
2151 snd_pcm_access_t access; /* access mode */ 2151 snd_pcm_access_t access; /* access mode */
2152 snd_pcm_format_t format; /* SNDRV_PCM_FORMAT_* */ 2152 snd_pcm_format_t format; /* SNDRV_PCM_FORMAT_* */
2153 snd_pcm_subformat_t subformat; /* subformat */ 2153 snd_pcm_subformat_t subformat; /* subformat */
2154 unsigned int rate; /* rate in Hz */ 2154 unsigned int rate; /* rate in Hz */
2155 unsigned int channels; /* channels */ 2155 unsigned int channels; /* channels */
2156 snd_pcm_uframes_t period_size; /* period size */ 2156 snd_pcm_uframes_t period_size; /* period size */
2157 unsigned int periods; /* periods */ 2157 unsigned int periods; /* periods */
2158 snd_pcm_uframes_t buffer_size; /* buffer size */ 2158 snd_pcm_uframes_t buffer_size; /* buffer size */
2159 unsigned int tick_time; /* tick time */ 2159 unsigned int tick_time; /* tick time */
2160 snd_pcm_uframes_t min_align; /* Min alignment for the format */ 2160 snd_pcm_uframes_t min_align; /* Min alignment for the format */
2161 size_t byte_align; 2161 size_t byte_align;
2162 unsigned int frame_bits; 2162 unsigned int frame_bits;
2163 unsigned int sample_bits; 2163 unsigned int sample_bits;
2164 unsigned int info; 2164 unsigned int info;
2165 unsigned int rate_num; 2165 unsigned int rate_num;
2166 unsigned int rate_den; 2166 unsigned int rate_den;
2167 2167
2168 /* -- SW params -- */ 2168 /* -- SW params -- */
2169 struct timespec tstamp_mode; /* mmap timestamp is updated */ 2169 struct timespec tstamp_mode; /* mmap timestamp is updated */
2170 unsigned int period_step; 2170 unsigned int period_step;
2171 unsigned int sleep_min; /* min ticks to sleep */ 2171 unsigned int sleep_min; /* min ticks to sleep */
2172 snd_pcm_uframes_t xfer_align; /* xfer size need to be a multiple */ 2172 snd_pcm_uframes_t xfer_align; /* xfer size need to be a multiple */
2173 snd_pcm_uframes_t start_threshold; 2173 snd_pcm_uframes_t start_threshold;
2174 snd_pcm_uframes_t stop_threshold; 2174 snd_pcm_uframes_t stop_threshold;
2175 snd_pcm_uframes_t silence_threshold; /* Silence filling happens when 2175 snd_pcm_uframes_t silence_threshold; /* Silence filling happens when
2176 noise is nearest than this */ 2176 noise is nearest than this */
2177 snd_pcm_uframes_t silence_size; /* Silence filling size */ 2177 snd_pcm_uframes_t silence_size; /* Silence filling size */
2178 snd_pcm_uframes_t boundary; /* pointers wrap point */ 2178 snd_pcm_uframes_t boundary; /* pointers wrap point */
2179 2179
2180 snd_pcm_uframes_t silenced_start; 2180 snd_pcm_uframes_t silenced_start;
2181 snd_pcm_uframes_t silenced_size; 2181 snd_pcm_uframes_t silenced_size;
2182 2182
2183 snd_pcm_sync_id_t sync; /* hardware synchronization ID */ 2183 snd_pcm_sync_id_t sync; /* hardware synchronization ID */
2184 2184
2185 /* -- mmap -- */ 2185 /* -- mmap -- */
2186 volatile struct snd_pcm_mmap_status *status; 2186 volatile struct snd_pcm_mmap_status *status;
2187 volatile struct snd_pcm_mmap_control *control; 2187 volatile struct snd_pcm_mmap_control *control;
2188 atomic_t mmap_count; 2188 atomic_t mmap_count;
2189 2189
2190 /* -- locking / scheduling -- */ 2190 /* -- locking / scheduling -- */
2191 spinlock_t lock; 2191 spinlock_t lock;
2192 wait_queue_head_t sleep; 2192 wait_queue_head_t sleep;
2193 struct timer_list tick_timer; 2193 struct timer_list tick_timer;
2194 struct fasync_struct *fasync; 2194 struct fasync_struct *fasync;
2195 2195
2196 /* -- private section -- */ 2196 /* -- private section -- */
2197 void *private_data; 2197 void *private_data;
2198 void (*private_free)(struct snd_pcm_runtime *runtime); 2198 void (*private_free)(struct snd_pcm_runtime *runtime);
2199 2199
2200 /* -- hardware description -- */ 2200 /* -- hardware description -- */
2201 struct snd_pcm_hardware hw; 2201 struct snd_pcm_hardware hw;
2202 struct snd_pcm_hw_constraints hw_constraints; 2202 struct snd_pcm_hw_constraints hw_constraints;
2203 2203
2204 /* -- interrupt callbacks -- */ 2204 /* -- interrupt callbacks -- */
2205 void (*transfer_ack_begin)(struct snd_pcm_substream *substream); 2205 void (*transfer_ack_begin)(struct snd_pcm_substream *substream);
2206 void (*transfer_ack_end)(struct snd_pcm_substream *substream); 2206 void (*transfer_ack_end)(struct snd_pcm_substream *substream);
2207 2207
2208 /* -- timer -- */ 2208 /* -- timer -- */
2209 unsigned int timer_resolution; /* timer resolution */ 2209 unsigned int timer_resolution; /* timer resolution */
2210 2210
2211 /* -- DMA -- */ 2211 /* -- DMA -- */
2212 unsigned char *dma_area; /* DMA area */ 2212 unsigned char *dma_area; /* DMA area */
2213 dma_addr_t dma_addr; /* physical bus address (not accessible from main CPU) */ 2213 dma_addr_t dma_addr; /* physical bus address (not accessible from main CPU) */
2214 size_t dma_bytes; /* size of DMA area */ 2214 size_t dma_bytes; /* size of DMA area */
2215 2215
2216 struct snd_dma_buffer *dma_buffer_p; /* allocated buffer */ 2216 struct snd_dma_buffer *dma_buffer_p; /* allocated buffer */
2217 2217
2218 #if defined(CONFIG_SND_PCM_OSS) || defined(CONFIG_SND_PCM_OSS_MODULE) 2218 #if defined(CONFIG_SND_PCM_OSS) || defined(CONFIG_SND_PCM_OSS_MODULE)
2219 /* -- OSS things -- */ 2219 /* -- OSS things -- */
2220 struct snd_pcm_oss_runtime oss; 2220 struct snd_pcm_oss_runtime oss;
2221 #endif 2221 #endif
2222 }; 2222 };
2223 ]]> 2223 ]]>
2224 </programlisting> 2224 </programlisting>
2225 </informalexample> 2225 </informalexample>
2226 </para> 2226 </para>
2227 2227
2228 <para> 2228 <para>
2229 For the operators (callbacks) of each sound driver, most of 2229 For the operators (callbacks) of each sound driver, most of
2230 these records are supposed to be read-only. Only the PCM 2230 these records are supposed to be read-only. Only the PCM
2231 middle layer changes or updates this information. The exceptions are 2231 middle layer changes or updates this information. The exceptions are
2232 the hardware description (hw), interrupt callbacks 2232 the hardware description (hw), interrupt callbacks
2233 (transfer_ack_xxx), DMA buffer information, and the private 2233 (transfer_ack_xxx), DMA buffer information, and the private
2234 data. Besides, if you use the standard buffer allocation 2234 data. Besides, if you use the standard buffer allocation
2235 method via <function>snd_pcm_lib_malloc_pages()</function>, 2235 method via <function>snd_pcm_lib_malloc_pages()</function>,
2236 you don't need to set the DMA buffer information by yourself. 2236 you don't need to set the DMA buffer information by yourself.
2237 </para> 2237 </para>
2238 2238
2239 <para> 2239 <para>
2240 In the sections below, important records are explained. 2240 In the sections below, important records are explained.
2241 </para> 2241 </para>
2242 2242
2243 <section id="pcm-interface-runtime-hw"> 2243 <section id="pcm-interface-runtime-hw">
2244 <title>Hardware Description</title> 2244 <title>Hardware Description</title>
2245 <para> 2245 <para>
2246 The hardware descriptor (struct <structname>snd_pcm_hardware</structname>) 2246 The hardware descriptor (struct <structname>snd_pcm_hardware</structname>)
2247 contains the definitions of the fundamental hardware 2247 contains the definitions of the fundamental hardware
2248 configuration. Above all, you'll need to define this in 2248 configuration. Above all, you'll need to define this in
2249 <link linkend="pcm-interface-operators-open-callback"><citetitle> 2249 <link linkend="pcm-interface-operators-open-callback"><citetitle>
2250 the open callback</citetitle></link>. 2250 the open callback</citetitle></link>.
2251 Note that the runtime instance holds a copy of the 2251 Note that the runtime instance holds a copy of the
2252 descriptor, not a pointer to the existing descriptor. That 2252 descriptor, not a pointer to the existing descriptor. That
2253 is, in the open callback, you can modify the copied descriptor 2253 is, in the open callback, you can modify the copied descriptor
2254 (<constant>runtime-&gt;hw</constant>) as you need. For example, if the maximum 2254 (<constant>runtime-&gt;hw</constant>) as you need. For example, if the maximum
2255 number of channels is 1 only on some chip models, you can 2255 number of channels is 1 only on some chip models, you can
2256 still use the same hardware descriptor and change the 2256 still use the same hardware descriptor and change the
2257 channels_max later: 2257 channels_max later:
2258 <informalexample> 2258 <informalexample>
2259 <programlisting> 2259 <programlisting>
2260 <![CDATA[ 2260 <![CDATA[
2261 struct snd_pcm_runtime *runtime = substream->runtime; 2261 struct snd_pcm_runtime *runtime = substream->runtime;
2262 ... 2262 ...
2263 runtime->hw = snd_mychip_playback_hw; /* common definition */ 2263 runtime->hw = snd_mychip_playback_hw; /* common definition */
2264 if (chip->model == VERY_OLD_ONE) 2264 if (chip->model == VERY_OLD_ONE)
2265 runtime->hw.channels_max = 1; 2265 runtime->hw.channels_max = 1;
2266 ]]> 2266 ]]>
2267 </programlisting> 2267 </programlisting>
2268 </informalexample> 2268 </informalexample>
2269 </para> 2269 </para>
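The copy semantics described above can be sketched in plain userspace C. This is only an illustration, not kernel code: the types hw_desc_sketch and runtime_sketch and the function open_sketch() are hypothetical stand-ins for struct snd_pcm_hardware, the runtime instance and an open callback, cut down to the fields needed here.

```c
/* Hypothetical, cut-down stand-ins for the kernel types (assumption). */
struct hw_desc_sketch {
	unsigned int channels_min;
	unsigned int channels_max;
};

struct runtime_sketch {
	struct hw_desc_sketch hw;	/* embedded copy, not a pointer */
};

/* Shared template, playing the role of snd_mychip_playback_hw. */
static const struct hw_desc_sketch template_hw = {
	.channels_min = 1,
	.channels_max = 2,
};

/* What an open callback might do: copy the template, then adjust the copy. */
static void open_sketch(struct runtime_sketch *rt, int very_old_model)
{
	rt->hw = template_hw;		/* struct assignment copies every field */
	if (very_old_model)
		rt->hw.channels_max = 1;	/* changes this runtime only */
}
```

Because the runtime holds its own copy, the adjustment made for one substream does not leak into the shared template or into other runtimes.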
2270 2270
2271 <para> 2271 <para>
2272 Typically, you'll have a hardware descriptor like below: 2272 Typically, you'll have a hardware descriptor like below:
2273 <informalexample> 2273 <informalexample>
2274 <programlisting> 2274 <programlisting>
2275 <![CDATA[ 2275 <![CDATA[
2276 static struct snd_pcm_hardware snd_mychip_playback_hw = { 2276 static struct snd_pcm_hardware snd_mychip_playback_hw = {
2277 .info = (SNDRV_PCM_INFO_MMAP | 2277 .info = (SNDRV_PCM_INFO_MMAP |
2278 SNDRV_PCM_INFO_INTERLEAVED | 2278 SNDRV_PCM_INFO_INTERLEAVED |
2279 SNDRV_PCM_INFO_BLOCK_TRANSFER | 2279 SNDRV_PCM_INFO_BLOCK_TRANSFER |
2280 SNDRV_PCM_INFO_MMAP_VALID), 2280 SNDRV_PCM_INFO_MMAP_VALID),
2281 .formats = SNDRV_PCM_FMTBIT_S16_LE, 2281 .formats = SNDRV_PCM_FMTBIT_S16_LE,
2282 .rates = SNDRV_PCM_RATE_8000_48000, 2282 .rates = SNDRV_PCM_RATE_8000_48000,
2283 .rate_min = 8000, 2283 .rate_min = 8000,
2284 .rate_max = 48000, 2284 .rate_max = 48000,
2285 .channels_min = 2, 2285 .channels_min = 2,
2286 .channels_max = 2, 2286 .channels_max = 2,
2287 .buffer_bytes_max = 32768, 2287 .buffer_bytes_max = 32768,
2288 .period_bytes_min = 4096, 2288 .period_bytes_min = 4096,
2289 .period_bytes_max = 32768, 2289 .period_bytes_max = 32768,
2290 .periods_min = 1, 2290 .periods_min = 1,
2291 .periods_max = 1024, 2291 .periods_max = 1024,
2292 }; 2292 };
2293 ]]> 2293 ]]>
2294 </programlisting> 2294 </programlisting>
2295 </informalexample> 2295 </informalexample>
2296 </para> 2296 </para>
2297 2297
2298 <para> 2298 <para>
2299 <itemizedlist> 2299 <itemizedlist>
2300 <listitem><para> 2300 <listitem><para>
2301 The <structfield>info</structfield> field contains the type and 2301 The <structfield>info</structfield> field contains the type and
2302 capabilities of this pcm. The bit flags are defined in 2302 capabilities of this pcm. The bit flags are defined in
2303 <filename>&lt;sound/asound.h&gt;</filename> as 2303 <filename>&lt;sound/asound.h&gt;</filename> as
2304 <constant>SNDRV_PCM_INFO_XXX</constant>. Here, at least, you 2304 <constant>SNDRV_PCM_INFO_XXX</constant>. Here, at least, you
2305 have to specify whether the mmap is supported and which 2305 have to specify whether the mmap is supported and which
2306 interleaved format is supported. 2306 interleaved format is supported.
2307 When the mmap is supported, add 2307 When the mmap is supported, add
2308 <constant>SNDRV_PCM_INFO_MMAP</constant> flag here. When the 2308 <constant>SNDRV_PCM_INFO_MMAP</constant> flag here. When the
2309 hardware supports the interleaved or the non-interleaved 2309 hardware supports the interleaved or the non-interleaved
2310 format, <constant>SNDRV_PCM_INFO_INTERLEAVED</constant> or 2310 format, <constant>SNDRV_PCM_INFO_INTERLEAVED</constant> or
2311 <constant>SNDRV_PCM_INFO_NONINTERLEAVED</constant> flag must 2311 <constant>SNDRV_PCM_INFO_NONINTERLEAVED</constant> flag must
2312 be set, respectively. If both are supported, you can set both, 2312 be set, respectively. If both are supported, you can set both,
2313 too. 2313 too.
2314 </para> 2314 </para>
2315 2315
2316 <para> 2316 <para>
2317 In the above example, <constant>MMAP_VALID</constant> and 2317 In the above example, <constant>MMAP_VALID</constant> and
2318 <constant>BLOCK_TRANSFER</constant> are specified for OSS mmap 2318 <constant>BLOCK_TRANSFER</constant> are specified for OSS mmap
2319 mode. Usually both are set. Of course, 2319 mode. Usually both are set. Of course,
2320 <constant>MMAP_VALID</constant> is set only if the mmap is 2320 <constant>MMAP_VALID</constant> is set only if the mmap is
2321 really supported. 2321 really supported.
2322 </para> 2322 </para>
2323 2323
2324 <para> 2324 <para>
2325 The other possible flags are 2325 The other possible flags are
2326 <constant>SNDRV_PCM_INFO_PAUSE</constant> and 2326 <constant>SNDRV_PCM_INFO_PAUSE</constant> and
2327 <constant>SNDRV_PCM_INFO_RESUME</constant>. The 2327 <constant>SNDRV_PCM_INFO_RESUME</constant>. The
2328 <constant>PAUSE</constant> bit means that the pcm supports the 2328 <constant>PAUSE</constant> bit means that the pcm supports the
2329 <quote>pause</quote> operation, while the 2329 <quote>pause</quote> operation, while the
2330 <constant>RESUME</constant> bit means that the pcm supports 2330 <constant>RESUME</constant> bit means that the pcm supports
2331 the full <quote>suspend/resume</quote> operation. 2331 the full <quote>suspend/resume</quote> operation.
2332 If <constant>PAUSE</constant> flag is set, 2332 If <constant>PAUSE</constant> flag is set,
2333 the <structfield>trigger</structfield> callback below 2333 the <structfield>trigger</structfield> callback below
2334 must handle the corresponding (pause push/release) commands. 2334 must handle the corresponding (pause push/release) commands.
2335 The suspend/resume trigger commands can be defined even without 2335 The suspend/resume trigger commands can be defined even without
2336 <constant>RESUME</constant> flag. See <link 2336 <constant>RESUME</constant> flag. See <link
2337 linkend="power-management"><citetitle> 2337 linkend="power-management"><citetitle>
2338 Power Management</citetitle></link> section for details. 2338 Power Management</citetitle></link> section for details.
2339 </para> 2339 </para>
2340 2340
2341 <para> 2341 <para>
2342 When the PCM substreams can be synchronized (typically, 2342 When the PCM substreams can be synchronized (typically,
2343 synchronized start/stop of a playback and a capture stream), 2343 synchronized start/stop of a playback and a capture stream),
2344 you can give <constant>SNDRV_PCM_INFO_SYNC_START</constant>, 2344 you can give <constant>SNDRV_PCM_INFO_SYNC_START</constant>,
2345 too. In this case, you'll need to check the linked-list of 2345 too. In this case, you'll need to check the linked-list of
2346 PCM substreams in the trigger callback. This will be 2346 PCM substreams in the trigger callback. This will be
2347 described in the later section. 2347 described in the later section.
2348 </para> 2348 </para>
2349 </listitem> 2349 </listitem>
2350 2350
2351 <listitem> 2351 <listitem>
2352 <para> 2352 <para>
2353 <structfield>formats</structfield> field contains the bit-flags 2353 <structfield>formats</structfield> field contains the bit-flags
2354 of supported formats (<constant>SNDRV_PCM_FMTBIT_XXX</constant>). 2354 of supported formats (<constant>SNDRV_PCM_FMTBIT_XXX</constant>).
2355 If the hardware supports more than one format, give all or'ed 2355 If the hardware supports more than one format, give all or'ed
2356 bits. In the example above, the signed 16bit little-endian 2356 bits. In the example above, the signed 16bit little-endian
2357 format is specified. 2357 format is specified.
2358 </para> 2358 </para>
2359 </listitem> 2359 </listitem>
2360 2360
2361 <listitem> 2361 <listitem>
2362 <para> 2362 <para>
2363 <structfield>rates</structfield> field contains the bit-flags of 2363 <structfield>rates</structfield> field contains the bit-flags of
2364 supported rates (<constant>SNDRV_PCM_RATE_XXX</constant>). 2364 supported rates (<constant>SNDRV_PCM_RATE_XXX</constant>).
2365 When the chip supports continuous rates, pass 2365 When the chip supports continuous rates, pass
2366 <constant>CONTINUOUS</constant> bit additionally. 2366 <constant>CONTINUOUS</constant> bit additionally.
2367 The pre-defined rate bits are provided only for typical 2367 The pre-defined rate bits are provided only for typical
2368 rates. If your chip supports unconventional rates, you need to add 2368 rates. If your chip supports unconventional rates, you need to add
2369 <constant>KNOT</constant> bit and set up the hardware 2369 <constant>KNOT</constant> bit and set up the hardware
2370 constraint manually (explained later). 2370 constraint manually (explained later).
2371 </para> 2371 </para>
2372 </listitem> 2372 </listitem>
2373 2373
2374 <listitem> 2374 <listitem>
2375 <para> 2375 <para>
2376 <structfield>rate_min</structfield> and 2376 <structfield>rate_min</structfield> and
2377 <structfield>rate_max</structfield> define the minimal and 2377 <structfield>rate_max</structfield> define the minimal and
2378 maximal sample rate. This should correspond somehow to 2378 maximal sample rate. This should correspond somehow to
2379 <structfield>rates</structfield> bits. 2379 <structfield>rates</structfield> bits.
2380 </para> 2380 </para>
2381 </listitem> 2381 </listitem>
2382 2382
2383 <listitem> 2383 <listitem>
2384 <para> 2384 <para>
2385 <structfield>channels_min</structfield> and 2385 <structfield>channels_min</structfield> and
2386 <structfield>channels_max</structfield> 2386 <structfield>channels_max</structfield>
2387 define, as you might expect, the minimal and maximal 2387 define, as you might expect, the minimal and maximal
2388 number of channels. 2388 number of channels.
2389 </para> 2389 </para>
2390 </listitem> 2390 </listitem>
2391 2391
2392 <listitem> 2392 <listitem>
2393 <para> 2393 <para>
2394 <structfield>buffer_bytes_max</structfield> defines the 2394 <structfield>buffer_bytes_max</structfield> defines the
2395 maximal buffer size in bytes. There is no 2395 maximal buffer size in bytes. There is no
2396 <structfield>buffer_bytes_min</structfield> field, since 2396 <structfield>buffer_bytes_min</structfield> field, since
2397 it can be calculated from the minimal period size and the 2397 it can be calculated from the minimal period size and the
2398 minimal number of periods. 2398 minimal number of periods.
2399 Meanwhile, <structfield>period_bytes_min</structfield> and 2399 Meanwhile, <structfield>period_bytes_min</structfield> and
2400 <structfield>period_bytes_max</structfield> define the minimal and maximal size of the period in bytes. 2400 <structfield>period_bytes_max</structfield> define the minimal and maximal size of the period in bytes.
2401 <structfield>periods_max</structfield> and 2401 <structfield>periods_max</structfield> and
2402 <structfield>periods_min</structfield> define the maximal and 2402 <structfield>periods_min</structfield> define the maximal and
2403 minimal number of periods in the buffer. 2403 minimal number of periods in the buffer.
2404 </para> 2404 </para>
2405 2405
2406 <para> 2406 <para>
2407 The <quote>period</quote> is a term that corresponds to 2407 The <quote>period</quote> is a term that corresponds to
2408 a fragment in the OSS world. The period defines the size at 2408 a fragment in the OSS world. The period defines the size at
2409 which the PCM interrupt is generated. This size strongly 2409 which the PCM interrupt is generated. This size strongly
2410 depends on the hardware. 2410 depends on the hardware.
2411 Generally, a smaller period size will give you more 2411 Generally, a smaller period size will give you more
2412 interrupts, that is, finer control. 2412 interrupts, that is, finer control.
2413 In the case of capture, this size defines the input latency. 2413 In the case of capture, this size defines the input latency.
2414 On the other hand, the whole buffer size defines the 2414 On the other hand, the whole buffer size defines the
2415 output latency for the playback direction. 2415 output latency for the playback direction.
2416 </para> 2416 </para>
2417 </listitem> 2417 </listitem>
2418 2418
2419 <listitem> 2419 <listitem>
2420 <para> 2420 <para>
2421 There is also a field <structfield>fifo_size</structfield>. 2421 There is also a field <structfield>fifo_size</structfield>.
2422 This specifies the size of the hardware FIFO, but it's not 2422 This specifies the size of the hardware FIFO, but it's not
2423 used currently in the driver nor in the alsa-lib. So, you 2423 used currently in the driver nor in the alsa-lib. So, you
2424 can ignore this field. 2424 can ignore this field.
2425 </para> 2425 </para>
2426 </listitem> 2426 </listitem>
2427 </itemizedlist> 2427 </itemizedlist>
2428 </para> 2428 </para>
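The relation between the period and buffer limits can be made concrete with a small userspace sketch. The struct hw_limits_sketch and both helper functions are hypothetical names introduced for illustration only; they mirror the descriptor fields discussed above but are not part of the ALSA API.

```c
#include <stddef.h>

/* Hypothetical mirror of the two limits used below (assumption). */
struct hw_limits_sketch {
	size_t period_bytes_min;
	unsigned int periods_min;
};

/* Smallest buffer implied by the limits; this is why the descriptor
 * needs no buffer_bytes_min field. */
static size_t min_buffer_bytes(const struct hw_limits_sketch *hw)
{
	return hw->period_bytes_min * hw->periods_min;
}

/* Time between two period interrupts in microseconds (rounded down).
 * The caller must pass a non-zero rate and a sample format of at
 * least 8 bits per frame. */
static unsigned long period_usecs(size_t period_bytes, unsigned int channels,
				  unsigned int sample_bits, unsigned int rate)
{
	size_t frame_bytes = (size_t)channels * sample_bits / 8;
	size_t frames = period_bytes / frame_bytes;

	return (unsigned long)((unsigned long long)frames * 1000000ULL / rate);
}
```

With the example descriptor above (period_bytes_min = 4096, periods_min = 1), the minimal buffer is 4096 bytes. For S16_LE stereo at 48000 Hz a 4096-byte period holds 1024 frames, so interrupts arrive roughly every 21 ms.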
2429 </section> 2429 </section>
2430 2430
2431 <section id="pcm-interface-runtime-config"> 2431 <section id="pcm-interface-runtime-config">
2432 <title>PCM Configurations</title> 2432 <title>PCM Configurations</title>
2433 <para> 2433 <para>
2434 OK, let's go back to the PCM runtime records. 2434 OK, let's go back to the PCM runtime records.
2435 The most frequently referenced records in the runtime instance are 2435 The most frequently referenced records in the runtime instance are
2436 the PCM configurations. 2436 the PCM configurations.
2437 The PCM configurations are stored in the runtime instance 2437 The PCM configurations are stored in the runtime instance
2438 after the application sends <type>hw_params</type> data via 2438 after the application sends <type>hw_params</type> data via
2439 alsa-lib. There are many fields copied from hw_params and 2439 alsa-lib. There are many fields copied from hw_params and
2440 sw_params structs. For example, 2440 sw_params structs. For example,
2441 <structfield>format</structfield> holds the format type 2441 <structfield>format</structfield> holds the format type
2442 chosen by the application. This field contains the enum value 2442 chosen by the application. This field contains the enum value
2443 <constant>SNDRV_PCM_FORMAT_XXX</constant>. 2443 <constant>SNDRV_PCM_FORMAT_XXX</constant>.
2444 </para> 2444 </para>
2445 2445
2446 <para> 2446 <para>
2447 Note that the configured buffer and period 2447 Note that the configured buffer and period
2448 sizes are stored in <quote>frames</quote> in the runtime. 2448 sizes are stored in <quote>frames</quote> in the runtime.
2449 In the ALSA world, 1 frame = channels * sample size. 2449 In the ALSA world, 1 frame = channels * sample size.
2450 For conversion between frames and bytes, you can use the 2450 For conversion between frames and bytes, you can use the
2451 helper functions, <function>frames_to_bytes()</function> and 2451 helper functions, <function>frames_to_bytes()</function> and
2452 <function>bytes_to_frames()</function>. 2452 <function>bytes_to_frames()</function>.
2453 <informalexample> 2453 <informalexample>
2454 <programlisting> 2454 <programlisting>
2455 <![CDATA[ 2455 <![CDATA[
2456 period_bytes = frames_to_bytes(runtime, runtime->period_size); 2456 period_bytes = frames_to_bytes(runtime, runtime->period_size);
2457 ]]> 2457 ]]>
2458 </programlisting> 2458 </programlisting>
2459 </informalexample> 2459 </informalexample>
2460 </para> 2460 </para>
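The arithmetic behind the two conversion helpers can be sketched in userspace as follows. The struct rt_fmt_sketch and the *_sketch functions are hypothetical stand-ins; they assume the conversion is driven by the frame_bits record of the runtime (bits per frame, that is, channels * sample_bits).

```c
/* Hypothetical stand-in for the runtime fields involved (assumption). */
struct rt_fmt_sketch {
	unsigned int channels;
	unsigned int sample_bits;
	unsigned int frame_bits;	/* channels * sample_bits */
};

/* Fill in frame_bits from the channel count and sample width. */
static void rt_fmt_init(struct rt_fmt_sketch *rt, unsigned int channels,
			unsigned int sample_bits)
{
	rt->channels = channels;
	rt->sample_bits = sample_bits;
	rt->frame_bits = channels * sample_bits;
}

/* frames -> bytes: one frame occupies frame_bits / 8 bytes */
static long frames_to_bytes_sketch(const struct rt_fmt_sketch *rt, long frames)
{
	return frames * rt->frame_bits / 8;
}

/* bytes -> frames: the inverse conversion */
static long bytes_to_frames_sketch(const struct rt_fmt_sketch *rt, long bytes)
{
	return bytes * 8 / rt->frame_bits;
}
```

For example, with 2 channels of 16-bit samples a frame is 4 bytes, so a period of 1024 frames corresponds to 4096 bytes.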
2461 2461
2462 <para> 2462 <para>
2463 Also, many software parameters (sw_params) are 2463 Also, many software parameters (sw_params) are
2464 stored in frames, too. Please check the type of the field. 2464 stored in frames, too. Please check the type of the field.
2465 <type>snd_pcm_uframes_t</type> is for the frames as unsigned 2465 <type>snd_pcm_uframes_t</type> is for the frames as unsigned
2466 integer while <type>snd_pcm_sframes_t</type> is for the frames 2466 integer while <type>snd_pcm_sframes_t</type> is for the frames
2467 as signed integer. 2467 as signed integer.
2468 </para> 2468 </para>
2469 </section> 2469 </section>
2470 2470
2471 <section id="pcm-interface-runtime-dma"> 2471 <section id="pcm-interface-runtime-dma">
2472 <title>DMA Buffer Information</title> 2472 <title>DMA Buffer Information</title>
2473 <para> 2473 <para>
2474 The DMA buffer is defined by the following four fields, 2474 The DMA buffer is defined by the following four fields,
2475 <structfield>dma_area</structfield>, 2475 <structfield>dma_area</structfield>,
2476 <structfield>dma_addr</structfield>, 2476 <structfield>dma_addr</structfield>,
2477 <structfield>dma_bytes</structfield> and 2477 <structfield>dma_bytes</structfield> and
2478 <structfield>dma_private</structfield>. 2478 <structfield>dma_private</structfield>.
2479 The <structfield>dma_area</structfield> holds the buffer 2479 The <structfield>dma_area</structfield> holds the buffer
2480 pointer (the logical address). You can call 2480 pointer (the logical address). You can call
2481 <function>memcpy</function> from/to 2481 <function>memcpy</function> from/to
2482 this pointer. Meanwhile, <structfield>dma_addr</structfield> 2482 this pointer. Meanwhile, <structfield>dma_addr</structfield>
2483 holds the physical address of the buffer. This field is 2483 holds the physical address of the buffer. This field is
2484 specified only when the buffer is a linear buffer. 2484 specified only when the buffer is a linear buffer.
2485 <structfield>dma_bytes</structfield> holds the size of buffer 2485 <structfield>dma_bytes</structfield> holds the size of buffer
2486 in bytes. <structfield>dma_private</structfield> is used for 2486 in bytes. <structfield>dma_private</structfield> is used for
2487 the ALSA DMA allocator. 2487 the ALSA DMA allocator.
2488 </para> 2488 </para>
2489 2489
2490 <para> 2490 <para>
2491 If you use a standard ALSA function, 2491 If you use a standard ALSA function,
2492 <function>snd_pcm_lib_malloc_pages()</function>, for 2492 <function>snd_pcm_lib_malloc_pages()</function>, for
2493 allocating the buffer, these fields are set by the ALSA middle 2493 allocating the buffer, these fields are set by the ALSA middle
2494 layer, and you should <emphasis>not</emphasis> change them by 2494 layer, and you should <emphasis>not</emphasis> change them by
2495 yourself. You can read them but not write them. 2495 yourself. You can read them but not write them.
2496 On the other hand, if you want to allocate the buffer by 2496 On the other hand, if you want to allocate the buffer by
2497 yourself, you'll need to manage it in hw_params callback. 2497 yourself, you'll need to manage it in hw_params callback.
2498 At least, <structfield>dma_bytes</structfield> is mandatory. 2498 At least, <structfield>dma_bytes</structfield> is mandatory.
2499 <structfield>dma_area</structfield> is necessary when the 2499 <structfield>dma_area</structfield> is necessary when the
2500 buffer is mmapped. If your driver doesn't support mmap, this 2500 buffer is mmapped. If your driver doesn't support mmap, this
2501 field is not necessary. <structfield>dma_addr</structfield> 2501 field is not necessary. <structfield>dma_addr</structfield>
2502 is also not mandatory. You can use 2502 is also not mandatory. You can use
2503 <structfield>dma_private</structfield> as you like, too. 2503 <structfield>dma_private</structfield> as you like, too.
2504 </para> 2504 </para>
2505 </section> 2505 </section>
2506 2506
2507 <section id="pcm-interface-runtime-status"> 2507 <section id="pcm-interface-runtime-status">
2508 <title>Running Status</title> 2508 <title>Running Status</title>
2509 <para> 2509 <para>
2510 The running status can be referenced via <constant>runtime-&gt;status</constant>. 2510 The running status can be referenced via <constant>runtime-&gt;status</constant>.
2511 This is a pointer to a struct <structname>snd_pcm_mmap_status</structname> 2511 This is a pointer to a struct <structname>snd_pcm_mmap_status</structname>
2512 record. For example, you can get the current DMA hardware 2512 record. For example, you can get the current DMA hardware
2513 pointer via <constant>runtime-&gt;status-&gt;hw_ptr</constant>. 2513 pointer via <constant>runtime-&gt;status-&gt;hw_ptr</constant>.
2514 </para> 2514 </para>
2515 2515
2516 <para> 2516 <para>
2517 The DMA application pointer can be referenced via 2517 The DMA application pointer can be referenced via
2518 <constant>runtime-&gt;control</constant>, which points to a 2518 <constant>runtime-&gt;control</constant>, which points to a
2519 struct <structname>snd_pcm_mmap_control</structname> record. 2519 struct <structname>snd_pcm_mmap_control</structname> record.
2520 However, accessing this value directly is not recommended. 2520 However, accessing this value directly is not recommended.
2521 </para> 2521 </para>
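As a hedged illustration of how a hardware pointer value might be interpreted: the sketch below assumes that hw_ptr counts frames and keeps increasing until it wraps at the boundary record, so the position inside the ring buffer is obtained by taking it modulo the buffer size. The helper name hw_offset_frames() is hypothetical and not an ALSA function.

```c
/* Hypothetical helper: offset of the hardware pointer inside the ring
 * buffer, in frames. Assumes buffer_size is non-zero and given in
 * frames, like runtime->buffer_size. */
static unsigned long hw_offset_frames(unsigned long hw_ptr,
				      unsigned long buffer_size)
{
	return hw_ptr % buffer_size;
}
```

With a buffer of 8192 frames, a hardware pointer of 10000 corresponds to frame 1808 inside the buffer.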
2522 </section> 2522 </section>
2523 2523
2524 <section id="pcm-interface-runtime-private"> 2524 <section id="pcm-interface-runtime-private">
2525 <title>Private Data</title> 2525 <title>Private Data</title>
2526 <para> 2526 <para>
2527 You can allocate a record for the substream and store it in 2527 You can allocate a record for the substream and store it in
2528 <constant>runtime-&gt;private_data</constant>. Usually, this 2528 <constant>runtime-&gt;private_data</constant>. Usually, this
2529 is done in 2529 is done in
2530 <link linkend="pcm-interface-operators-open-callback"><citetitle> 2530 <link linkend="pcm-interface-operators-open-callback"><citetitle>
2531 the open callback</citetitle></link>. 2531 the open callback</citetitle></link>.
2532 Don't confuse this with <constant>pcm-&gt;private_data</constant>. 2532 Don't confuse this with <constant>pcm-&gt;private_data</constant>.
2533 The <constant>pcm-&gt;private_data</constant> usually points to the 2533 The <constant>pcm-&gt;private_data</constant> usually points to the
2534 chip instance assigned statically at the creation of PCM, while the 2534 chip instance assigned statically at the creation of PCM, while the
2535 <constant>runtime-&gt;private_data</constant> points to dynamic 2535 <constant>runtime-&gt;private_data</constant> points to dynamic
2536 data created at the PCM open callback. 2536 data created at the PCM open callback.
2537 2537
2538 <informalexample> 2538 <informalexample>
2539 <programlisting> 2539 <programlisting>
2540 <![CDATA[ 2540 <![CDATA[
2541 static int snd_xxx_open(struct snd_pcm_substream *substream) 2541 static int snd_xxx_open(struct snd_pcm_substream *substream)
2542 { 2542 {
2543 struct my_pcm_data *data; 2543 struct my_pcm_data *data;
2544 .... 2544 ....
2545 data = kmalloc(sizeof(*data), GFP_KERNEL); 2545 data = kmalloc(sizeof(*data), GFP_KERNEL);
2546 substream->runtime->private_data = data; 2546 substream->runtime->private_data = data;
2547 .... 2547 ....
2548 } 2548 }
2549 ]]> 2549 ]]>
2550 </programlisting> 2550 </programlisting>
2551 </informalexample> 2551 </informalexample>
2552 </para> 2552 </para>
2553 2553
2554 <para> 2554 <para>
2555 The allocated object must be released in 2555 The allocated object must be released in
2556 <link linkend="pcm-interface-operators-open-callback"><citetitle> 2556 <link linkend="pcm-interface-operators-open-callback"><citetitle>
2557 the close callback</citetitle></link>. 2557 the close callback</citetitle></link>.
2558 </para> 2558 </para>
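The pairing of allocation in open and release in close can be sketched with a userspace mock. Everything here is a stand-in chosen for illustration: runtime_mock and substream_mock replace the kernel structures, calloc() and free() replace the kernel allocator calls, and the function names are hypothetical. The sketch also checks the allocation result, which the short example above leaves out.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-substream record and mock kernel types (assumption). */
struct my_pcm_data { int dummy; };
struct runtime_mock { void *private_data; };
struct substream_mock { struct runtime_mock *runtime; };

/* open: allocate the record and hang it on the runtime */
static int open_sketch(struct substream_mock *substream)
{
	struct my_pcm_data *data = calloc(1, sizeof(*data));

	if (!data)
		return -ENOMEM;		/* report allocation failure */
	substream->runtime->private_data = data;
	return 0;
}

/* close: release what open allocated */
static int close_sketch(struct substream_mock *substream)
{
	free(substream->runtime->private_data);
	substream->runtime->private_data = NULL;
	return 0;
}
```

Keeping the allocation and the release in the matching open and close callbacks means the lifetime of the record follows the lifetime of the opened substream.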
2559 </section> 2559 </section>
2560 2560
2561 <section id="pcm-interface-runtime-intr"> 2561 <section id="pcm-interface-runtime-intr">
2562 <title>Interrupt Callbacks</title> 2562 <title>Interrupt Callbacks</title>
2563 <para> 2563 <para>
2564 The fields <structfield>transfer_ack_begin</structfield> and 2564 The fields <structfield>transfer_ack_begin</structfield> and
2565 <structfield>transfer_ack_end</structfield> are called at 2565 <structfield>transfer_ack_end</structfield> are called at
2566 the beginning and the end of 2566 the beginning and the end of
2567 <function>snd_pcm_period_elapsed()</function>, respectively. 2567 <function>snd_pcm_period_elapsed()</function>, respectively.
2568 </para> 2568 </para>
2569 </section> 2569 </section>
2570 2570
2571 </section> 2571 </section>
2572 2572
2573 <section id="pcm-interface-operators"> 2573 <section id="pcm-interface-operators">
2574 <title>Operators</title> 2574 <title>Operators</title>
2575 <para> 2575 <para>
2576 OK, now let me explain the details of each pcm callback 2576 OK, now let me explain the details of each pcm callback
2577 (<parameter>ops</parameter>). In general, every callback must 2577 (<parameter>ops</parameter>). In general, every callback must
2578 return 0 if successful, or a negative error 2578 return 0 if successful, or a negative error
2579 number such as <constant>-EINVAL</constant> on any 2579 number such as <constant>-EINVAL</constant> on any
2580 error. 2580 error.
2581 </para> 2581 </para>
2582 2582
2583 <para> 2583 <para>
2584 Each callback takes at least a 2584 Each callback takes at least a
2585 <structname>snd_pcm_substream</structname> pointer as an argument. To retrieve the 2585 <structname>snd_pcm_substream</structname> pointer as an argument. To retrieve the
2586 chip record from the given substream instance, you can use the 2586 chip record from the given substream instance, you can use the
2587 following macro. 2587 following macro.
2588 2588
2589 <informalexample> 2589 <informalexample>
2590 <programlisting> 2590 <programlisting>
2591 <![CDATA[ 2591 <![CDATA[
2592 int xxx(struct snd_pcm_substream *substream) { 2592 int xxx(struct snd_pcm_substream *substream) {
2593 struct mychip *chip = snd_pcm_substream_chip(substream); 2593 struct mychip *chip = snd_pcm_substream_chip(substream);
2594 .... 2594 ....
2595 } 2595 }
2596 ]]> 2596 ]]>
2597 </programlisting> 2597 </programlisting>
2598 </informalexample> 2598 </informalexample>
2599 2599
2600 The macro reads <constant>substream-&gt;private_data</constant>, 2600 The macro reads <constant>substream-&gt;private_data</constant>,
2601 which is a copy of <constant>pcm-&gt;private_data</constant>. 2601 which is a copy of <constant>pcm-&gt;private_data</constant>.
2602 You can override the former if you need to assign different data 2602 You can override the former if you need to assign different data
2603 records per PCM substream. For example, the cmi8330 driver assigns 2603 records per PCM substream. For example, the cmi8330 driver assigns
2604 different private_data for playback and capture directions, 2604 different private_data for playback and capture directions,
2605 because it uses two different codecs (SB- and AD-compatible) for 2605 because it uses two different codecs (SB- and AD-compatible) for
2606 different directions. 2606 different directions.
2607 </para> 2607 </para>
2608 2608
2609 <section id="pcm-interface-operators-open-callback"> 2609 <section id="pcm-interface-operators-open-callback">
2610 <title>open callback</title> 2610 <title>open callback</title>
2611 <para> 2611 <para>
2612 <informalexample> 2612 <informalexample>
2613 <programlisting> 2613 <programlisting>
2614 <![CDATA[ 2614 <![CDATA[
2615 static int snd_xxx_open(struct snd_pcm_substream *substream); 2615 static int snd_xxx_open(struct snd_pcm_substream *substream);
2616 ]]> 2616 ]]>
2617 </programlisting> 2617 </programlisting>
2618 </informalexample> 2618 </informalexample>
2619 2619
2620 This is called when a pcm substream is opened. 2620 This is called when a pcm substream is opened.
2621 </para> 2621 </para>
2622 2622
2623 <para> 2623 <para>
2624 At a minimum, you have to initialize the runtime-&gt;hw 2624 At a minimum, you have to initialize the runtime-&gt;hw
2625 record here. Typically, this is done like this: 2625 record here. Typically, this is done like this:
2626 2626
2627 <informalexample> 2627 <informalexample>
2628 <programlisting> 2628 <programlisting>
2629 <![CDATA[ 2629 <![CDATA[
2630 static int snd_xxx_open(struct snd_pcm_substream *substream) 2630 static int snd_xxx_open(struct snd_pcm_substream *substream)
2631 { 2631 {
2632 struct mychip *chip = snd_pcm_substream_chip(substream); 2632 struct mychip *chip = snd_pcm_substream_chip(substream);
2633 struct snd_pcm_runtime *runtime = substream->runtime; 2633 struct snd_pcm_runtime *runtime = substream->runtime;
2634 2634
2635 runtime->hw = snd_mychip_playback_hw; 2635 runtime->hw = snd_mychip_playback_hw;
2636 return 0; 2636 return 0;
2637 } 2637 }
2638 ]]> 2638 ]]>
2639 </programlisting> 2639 </programlisting>
2640 </informalexample> 2640 </informalexample>
2641 2641
2642 where <parameter>snd_mychip_playback_hw</parameter> is the 2642 where <parameter>snd_mychip_playback_hw</parameter> is the
2643 pre-defined hardware description. 2643 pre-defined hardware description.
2644 </para> 2644 </para>
2645 2645
2646 <para> 2646 <para>
2647 You can allocate private data in this callback, as described 2647 You can allocate private data in this callback, as described
2648 in the <link linkend="pcm-interface-runtime-private"><citetitle> 2648 in the <link linkend="pcm-interface-runtime-private"><citetitle>
2649 Private Data</citetitle></link> section. 2649 Private Data</citetitle></link> section.
2650 </para> 2650 </para>
2651 2651
2652 <para> 2652 <para>
2653 If the hardware configuration needs more constraints, set the 2653 If the hardware configuration needs more constraints, set the
2654 hardware constraints here, too. 2654 hardware constraints here, too.
2655 See <link linkend="pcm-interface-constraints"><citetitle> 2655 See <link linkend="pcm-interface-constraints"><citetitle>
2656 Constraints</citetitle></link> for more details. 2656 Constraints</citetitle></link> for more details.
2657 </para> 2657 </para>
2658 </section> 2658 </section>
2659 2659
2660 <section id="pcm-interface-operators-close-callback"> 2660 <section id="pcm-interface-operators-close-callback">
2661 <title>close callback</title> 2661 <title>close callback</title>
2662 <para> 2662 <para>
2663 <informalexample> 2663 <informalexample>
2664 <programlisting> 2664 <programlisting>
2665 <![CDATA[ 2665 <![CDATA[
2666 static int snd_xxx_close(struct snd_pcm_substream *substream); 2666 static int snd_xxx_close(struct snd_pcm_substream *substream);
2667 ]]> 2667 ]]>
2668 </programlisting> 2668 </programlisting>
2669 </informalexample> 2669 </informalexample>
2670 2670
2671 Obviously, this is called when a pcm substream is closed. 2671 Obviously, this is called when a pcm substream is closed.
2672 </para> 2672 </para>
2673 2673
2674 <para> 2674 <para>
2675 Any private instance for a pcm substream allocated in the 2675 Any private instance for a pcm substream allocated in the
2676 open callback must be released here. 2676 open callback must be released here.
2677 2677
2678 <informalexample> 2678 <informalexample>
2679 <programlisting> 2679 <programlisting>
2680 <![CDATA[ 2680 <![CDATA[
2681 static int snd_xxx_close(struct snd_pcm_substream *substream) 2681 static int snd_xxx_close(struct snd_pcm_substream *substream)
2682 { 2682 {
2683 .... 2683 ....
2684 kfree(substream->runtime->private_data); 2684 kfree(substream->runtime->private_data);
2685 .... 2685 ....
2686 } 2686 }
2687 ]]> 2687 ]]>
2688 </programlisting> 2688 </programlisting>
2689 </informalexample> 2689 </informalexample>
2690 </para> 2690 </para>
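<para>
The pairing of the open and close callbacks can be tried out in
user space. The following is only a minimal sketch: the
<structname>mock_runtime</structname> and
<structname>mock_substream</structname> types and the
<function>mock_open()</function> and
<function>mock_close()</function> functions are hypothetical
stand-ins, not ALSA definitions, and
<function>calloc()</function>/<function>free()</function> take the
place of the kernel allocator. It also checks the allocation
result, which the abbreviated examples above leave out.
</para>

<informalexample>
<programlisting>
<![CDATA[
```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct mock_runtime {
	void *private_data;
};

struct mock_substream {
	struct mock_runtime *runtime;
};

struct my_pcm_data {
	int hw_channel;	/* illustrative per-substream state */
};

/* open: allocate the per-substream record and check for failure */
static int mock_open(struct mock_substream *substream)
{
	struct my_pcm_data *data = calloc(1, sizeof(*data));

	if (!data)
		return -ENOMEM;
	substream->runtime->private_data = data;
	return 0;
}

/* close: release what open allocated and clear the stale pointer */
static int mock_close(struct mock_substream *substream)
{
	free(substream->runtime->private_data);
	substream->runtime->private_data = NULL;
	return 0;
}
```
]]>
</programlisting>
</informalexample>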
2691 </section> 2691 </section>
2692 2692
2693 <section id="pcm-interface-operators-ioctl-callback"> 2693 <section id="pcm-interface-operators-ioctl-callback">
2694 <title>ioctl callback</title> 2694 <title>ioctl callback</title>
2695 <para> 2695 <para>
2696 This is used for any special handling of pcm ioctls. 2696 This is used for any special handling of pcm ioctls.
2697 Usually, however, you can pass the generic ioctl callback, 2697 Usually, however, you can pass the generic ioctl callback,
2698 <function>snd_pcm_lib_ioctl</function>. 2698 <function>snd_pcm_lib_ioctl</function>.
2699 </para> 2699 </para>
2700 </section> 2700 </section>
2701 2701
2702 <section id="pcm-interface-operators-hw-params-callback"> 2702 <section id="pcm-interface-operators-hw-params-callback">
2703 <title>hw_params callback</title> 2703 <title>hw_params callback</title>
2704 <para> 2704 <para>
2705 <informalexample> 2705 <informalexample>
2706 <programlisting> 2706 <programlisting>
2707 <![CDATA[ 2707 <![CDATA[
2708 static int snd_xxx_hw_params(struct snd_pcm_substream *substream, 2708 static int snd_xxx_hw_params(struct snd_pcm_substream *substream,
2709 struct snd_pcm_hw_params *hw_params); 2709 struct snd_pcm_hw_params *hw_params);
2710 ]]> 2710 ]]>
2711 </programlisting> 2711 </programlisting>
2712 </informalexample> 2712 </informalexample>
2713 2713
2714 This and <structfield>hw_free</structfield> callbacks exist 2714 This and <structfield>hw_free</structfield> callbacks exist
2715 only on ALSA 0.9.x. 2715 only on ALSA 0.9.x.
2716 </para> 2716 </para>
2717 2717
2718 <para> 2718 <para>
2719 This is called when the hardware parameters 2719 This is called when the hardware parameters
2720 (<structfield>hw_params</structfield>) are set 2720 (<structfield>hw_params</structfield>) are set
2721 up by the application, 2721 up by the application,
2722 that is, once the buffer size, the period size, the 2722 that is, once the buffer size, the period size, the
2723 format, etc. are defined for the pcm substream. 2723 format, etc. are defined for the pcm substream.
2724 </para> 2724 </para>
2725 2725
2726 <para> 2726 <para>
2727 Much of the hardware set-up should be done in this callback, 2727 Much of the hardware set-up should be done in this callback,
2728 including the allocation of buffers. 2728 including the allocation of buffers.
2729 </para> 2729 </para>
2730 2730
2731 <para> 2731 <para>
2732 Parameters to be initialized are retrieved with the 2732 Parameters to be initialized are retrieved with the
2733 <function>params_xxx()</function> macros. To allocate a 2733 <function>params_xxx()</function> macros. To allocate a
2734 buffer, you can call a helper function: 2734 buffer, you can call a helper function:
2735 2735
2736 <informalexample> 2736 <informalexample>
2737 <programlisting> 2737 <programlisting>
2738 <![CDATA[ 2738 <![CDATA[
2739 snd_pcm_lib_malloc_pages(substream, params_buffer_bytes(hw_params)); 2739 snd_pcm_lib_malloc_pages(substream, params_buffer_bytes(hw_params));
2740 ]]> 2740 ]]>
2741 </programlisting> 2741 </programlisting>
2742 </informalexample> 2742 </informalexample>
2743 2743
2744 <function>snd_pcm_lib_malloc_pages()</function> is available 2744 <function>snd_pcm_lib_malloc_pages()</function> is available
2745 only when the DMA buffers have been pre-allocated. 2745 only when the DMA buffers have been pre-allocated.
2746 See the section <link 2746 See the section <link
2747 linkend="buffer-and-memory-buffer-types"><citetitle> 2747 linkend="buffer-and-memory-buffer-types"><citetitle>
2748 Buffer Types</citetitle></link> for more details. 2748 Buffer Types</citetitle></link> for more details.
2749 </para> 2749 </para>
2750 2750
2751 <para> 2751 <para>
2752 Note that this callback and the <structfield>prepare</structfield> callback 2752 Note that this callback and the <structfield>prepare</structfield> callback
2753 may be called multiple times per initialization. 2753 may be called multiple times per initialization.
2754 For example, the OSS emulation may 2754 For example, the OSS emulation may
2755 call these callbacks at each change via its ioctl. 2755 call these callbacks at each change via its ioctl.
2756 </para> 2756 </para>
2757 2757
2758 <para> 2758 <para>
2759 Thus, you need to take care not to allocate the same buffers 2759 Thus, you need to take care not to allocate the same buffers
2760 many times, which would lead to memory leaks! Calling the 2760 many times, which would lead to memory leaks! Calling the
2761 helper function above many times is OK. It releases the 2761 helper function above many times is OK. It releases the
2762 previous buffer automatically if one was already allocated. 2762 previous buffer automatically if one was already allocated.
2763 </para> 2763 </para>
2764 2764
2765 <para> 2765 <para>
2766 Another note is that this callback is non-atomic 2766 Another note is that this callback is non-atomic
2767 (schedulable). This is important, because the 2767 (schedulable). This is important, because the
2768 <structfield>trigger</structfield> callback 2768 <structfield>trigger</structfield> callback
2769 is atomic (non-schedulable). That is, mutexes and other 2769 is atomic (non-schedulable). That is, mutexes and other
2770 schedule-related functions are not available in the 2770 schedule-related functions are not available in the
2771 <structfield>trigger</structfield> callback. 2771 <structfield>trigger</structfield> callback.
2772 Please see the subsection 2772 Please see the subsection
2773 <link linkend="pcm-interface-atomicity"><citetitle> 2773 <link linkend="pcm-interface-atomicity"><citetitle>
2774 Atomicity</citetitle></link> for details. 2774 Atomicity</citetitle></link> for details.
2775 </para> 2775 </para>
2776 </section> 2776 </section>
2777 2777
2778 <section id="pcm-interface-operators-hw-free-callback"> 2778 <section id="pcm-interface-operators-hw-free-callback">
2779 <title>hw_free callback</title> 2779 <title>hw_free callback</title>
2780 <para> 2780 <para>
2781 <informalexample> 2781 <informalexample>
2782 <programlisting> 2782 <programlisting>
2783 <![CDATA[ 2783 <![CDATA[
2784 static int snd_xxx_hw_free(struct snd_pcm_substream *substream); 2784 static int snd_xxx_hw_free(struct snd_pcm_substream *substream);
2785 ]]> 2785 ]]>
2786 </programlisting> 2786 </programlisting>
2787 </informalexample> 2787 </informalexample>
2788 </para> 2788 </para>
2789 2789
2790 <para> 2790 <para>
2791 This is called to release the resources allocated via 2791 This is called to release the resources allocated via
2792 <structfield>hw_params</structfield>. For example, releasing the 2792 <structfield>hw_params</structfield>. For example, releasing the
2793 buffer allocated via 2793 buffer allocated via
2794 <function>snd_pcm_lib_malloc_pages()</function> is done by 2794 <function>snd_pcm_lib_malloc_pages()</function> is done by
2795 calling the following: 2795 calling the following:
2796 2796
2797 <informalexample> 2797 <informalexample>
2798 <programlisting> 2798 <programlisting>
2799 <![CDATA[ 2799 <![CDATA[
2800 snd_pcm_lib_free_pages(substream); 2800 snd_pcm_lib_free_pages(substream);
2801 ]]> 2801 ]]>
2802 </programlisting> 2802 </programlisting>
2803 </informalexample> 2803 </informalexample>
2804 </para> 2804 </para>
2805 2805
2806 <para> 2806 <para>
2807 This function is always called before the close callback is called. 2807 This function is always called before the close callback is called.
2808 Also, the callback may be called multiple times. 2808 Also, the callback may be called multiple times.
2809 Keep track of whether the resource was already released. 2809 Keep track of whether the resource was already released.
2810 </para> 2810 </para>
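<para>
Since both <structfield>hw_params</structfield> and
<structfield>hw_free</structfield> may run more than once, a driver
that manages its own buffer has to make both operations repeatable.
The following user-space sketch illustrates one way to do that
bookkeeping. The <structname>mock_stream</structname> type and the
<function>mock_hw_params()</function> and
<function>mock_hw_free()</function> functions are hypothetical
names for illustration only, and
<function>malloc()</function>/<function>free()</function> stand in
for the real buffer allocator.
</para>

<informalexample>
<programlisting>
<![CDATA[
```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's per-stream state. */
struct mock_stream {
	void *buf;		/* NULL while no buffer is held */
	size_t buf_bytes;
};

/* hw_params model: may run several times; never leaks the old buffer */
static int mock_hw_params(struct mock_stream *s, size_t bytes)
{
	void *p;

	if (s->buf && s->buf_bytes == bytes)
		return 0;	/* same size as before, keep the buffer */
	p = malloc(bytes);
	if (!p)
		return -ENOMEM;
	free(s->buf);		/* free(NULL) is a no-op */
	s->buf = p;
	s->buf_bytes = bytes;
	return 0;
}

/* hw_free model: safe to call repeatedly */
static int mock_hw_free(struct mock_stream *s)
{
	free(s->buf);
	s->buf = NULL;
	s->buf_bytes = 0;
	return 0;
}
```
]]>
</programlisting>
</informalexample>

<para>
Clearing the pointer after the release is what makes a second
<function>mock_hw_free()</function> call harmless.
</para>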
2811 </section> 2811 </section>
2812 2812
2813 <section id="pcm-interface-operators-prepare-callback"> 2813 <section id="pcm-interface-operators-prepare-callback">
2814 <title>prepare callback</title> 2814 <title>prepare callback</title>
2815 <para> 2815 <para>
2816 <informalexample> 2816 <informalexample>
2817 <programlisting> 2817 <programlisting>
2818 <![CDATA[ 2818 <![CDATA[
2819 static int snd_xxx_prepare(struct snd_pcm_substream *substream); 2819 static int snd_xxx_prepare(struct snd_pcm_substream *substream);
2820 ]]> 2820 ]]>
2821 </programlisting> 2821 </programlisting>
2822 </informalexample> 2822 </informalexample>
2823 </para> 2823 </para>
2824 2824
2825 <para> 2825 <para>
2826 This callback is called when the pcm is 2826 This callback is called when the pcm is
2827 <quote>prepared</quote>. You can set the format type, sample 2827 <quote>prepared</quote>. You can set the format type, sample
2828 rate, etc. here. The difference from 2828 rate, etc. here. The difference from
2829 <structfield>hw_params</structfield> is that the 2829 <structfield>hw_params</structfield> is that the
2830 <structfield>prepare</structfield> callback is called each 2830 <structfield>prepare</structfield> callback is called each
2831 time 2831 time
2832 <function>snd_pcm_prepare()</function> is called, e.g. when 2832 <function>snd_pcm_prepare()</function> is called, e.g. when
2833 recovering after underruns. 2833 recovering after underruns.
2834 </para> 2834 </para>
2835 2835
2836 <para> 2836 <para>
2837 Note that this callback is non-atomic in recent versions. 2837 Note that this callback is non-atomic in recent versions.
2838 You can use schedule-related functions safely in this callback now. 2838 You can use schedule-related functions safely in this callback now.
2839 </para> 2839 </para>
2840 2840
2841 <para> 2841 <para>
2842 In this and the following callbacks, you can refer to the 2842 In this and the following callbacks, you can refer to the
2843 values via the runtime record, 2843 values via the runtime record,
2844 substream-&gt;runtime. 2844 substream-&gt;runtime.
2845 For example, to get the current 2845 For example, to get the current
2846 rate, format or channels, access 2846 rate, format or channels, access
2847 runtime-&gt;rate, 2847 runtime-&gt;rate,
2848 runtime-&gt;format or 2848 runtime-&gt;format or
2849 runtime-&gt;channels, respectively. 2849 runtime-&gt;channels, respectively.
2850 The physical address of the allocated buffer is set to 2850 The physical address of the allocated buffer is set to
2851 runtime-&gt;dma_addr (its virtual address is in runtime-&gt;dma_area). The buffer and period sizes are 2851 runtime-&gt;dma_addr (its virtual address is in runtime-&gt;dma_area). The buffer and period sizes are
2852 in runtime-&gt;buffer_size and runtime-&gt;period_size, 2852 in runtime-&gt;buffer_size and runtime-&gt;period_size,
2853 respectively. 2853 respectively.
2854 </para> 2854 </para>
2855 2855
2856 <para> 2856 <para>
2857 Be aware that this callback, too, may be called many times 2857 Be aware that this callback, too, may be called many times
2858 for each setup. 2858 for each setup.
2859 </para> 2859 </para>
2860 </section> 2860 </section>
2861 2861
2862 <section id="pcm-interface-operators-trigger-callback"> 2862 <section id="pcm-interface-operators-trigger-callback">
2863 <title>trigger callback</title> 2863 <title>trigger callback</title>
2864 <para> 2864 <para>
2865 <informalexample> 2865 <informalexample>
2866 <programlisting> 2866 <programlisting>
2867 <![CDATA[ 2867 <![CDATA[
2868 static int snd_xxx_trigger(struct snd_pcm_substream *substream, int cmd); 2868 static int snd_xxx_trigger(struct snd_pcm_substream *substream, int cmd);
2869 ]]> 2869 ]]>
2870 </programlisting> 2870 </programlisting>
2871 </informalexample> 2871 </informalexample>
2872 2872
2873 This is called when the pcm is started, stopped or paused. 2873 This is called when the pcm is started, stopped or paused.
2874 </para> 2874 </para>
2875 2875
2876 <para> 2876 <para>
2877 The action is specified in the second argument as one of the 2877 The action is specified in the second argument as one of the
2878 <constant>SNDRV_PCM_TRIGGER_XXX</constant> constants defined in 2878 <constant>SNDRV_PCM_TRIGGER_XXX</constant> constants defined in
2879 <filename>&lt;sound/pcm.h&gt;</filename>. At least the 2879 <filename>&lt;sound/pcm.h&gt;</filename>. At least the
2880 <constant>START</constant> and <constant>STOP</constant> 2880 <constant>START</constant> and <constant>STOP</constant>
2881 commands must be handled in this callback. 2881 commands must be handled in this callback.
2882 2882
2883 <informalexample> 2883 <informalexample>
2884 <programlisting> 2884 <programlisting>
2885 <![CDATA[ 2885 <![CDATA[
2886 switch (cmd) { 2886 switch (cmd) {
2887 case SNDRV_PCM_TRIGGER_START: 2887 case SNDRV_PCM_TRIGGER_START:
2888 // do something to start the PCM engine 2888 // do something to start the PCM engine
2889 break; 2889 break;
2890 case SNDRV_PCM_TRIGGER_STOP: 2890 case SNDRV_PCM_TRIGGER_STOP:
2891 // do something to stop the PCM engine 2891 // do something to stop the PCM engine
2892 break; 2892 break;
2893 default: 2893 default:
2894 return -EINVAL; 2894 return -EINVAL;
2895 } 2895 }
2896 ]]> 2896 ]]>
2897 </programlisting> 2897 </programlisting>
2898 </informalexample> 2898 </informalexample>
2899 </para> 2899 </para>
2900 2900
2901 <para> 2901 <para>
2902 When the pcm supports the pause operation (given in the info 2902 When the pcm supports the pause operation (given in the info
2903 field of the hardware table), the <constant>PAUSE_PUSH</constant> 2903 field of the hardware table), the <constant>PAUSE_PUSH</constant>
2904 and <constant>PAUSE_RELEASE</constant> commands must be 2904 and <constant>PAUSE_RELEASE</constant> commands must be
2905 handled here, too. The former is the command to pause the pcm, 2905 handled here, too. The former is the command to pause the pcm,
2906 and the latter to restart the pcm again. 2906 and the latter to restart the pcm again.
2907 </para> 2907 </para>
2908 2908
2909 <para> 2909 <para>
2910 When the pcm supports the suspend/resume operation, 2910 When the pcm supports the suspend/resume operation,
2911 regardless of full or partial suspend/resume support, 2911 regardless of full or partial suspend/resume support,
2912 <constant>SUSPEND</constant> and <constant>RESUME</constant> 2912 <constant>SUSPEND</constant> and <constant>RESUME</constant>
2913 commands must be handled, too. 2913 commands must be handled, too.
2914 These commands are issued when the power-management status is 2914 These commands are issued when the power-management status is
2915 changed. The <constant>SUSPEND</constant> and 2915 changed. The <constant>SUSPEND</constant> and
2916 <constant>RESUME</constant> commands 2916 <constant>RESUME</constant> commands
2917 suspend and resume the pcm substream, and usually they 2917 suspend and resume the pcm substream, and usually they
2918 are identical to the <constant>STOP</constant> and 2918 are identical to the <constant>STOP</constant> and
2919 <constant>START</constant> commands, respectively. 2919 <constant>START</constant> commands, respectively.
2920 See <link linkend="power-management"><citetitle> 2920 See <link linkend="power-management"><citetitle>
2921 Power Management</citetitle></link> section for details. 2921 Power Management</citetitle></link> section for details.
2922 </para> 2922 </para>
2923 2923
2924 <para> 2924 <para>
2925 As mentioned, this callback is atomic. You cannot call 2925 As mentioned, this callback is atomic. You cannot call
2926 functions that may sleep. 2926 functions that may sleep.
2927 The trigger callback should be as minimal as possible, 2927 The trigger callback should be as minimal as possible,
2928 just really triggering the DMA. Everything else should be 2928 just really triggering the DMA. Everything else should be
2929 initialized properly in the hw_params and prepare callbacks 2929 initialized properly in the hw_params and prepare callbacks
2930 beforehand. 2930 beforehand.
2931 </para> 2931 </para>
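<para>
The command handling described in this section can be put together
as in the following user-space sketch. The
<constant>MOCK_TRIGGER_XXX</constant> codes, the
<structname>mock_chip</structname> type and the
<function>mock_trigger()</function> function are hypothetical
placeholders; a real driver uses the
<constant>SNDRV_PCM_TRIGGER_XXX</constant> constants and programs
its own hardware. Treating pause like stop is a simplifying
assumption made here, which real hardware may not share.
</para>

<informalexample>
<programlisting>
<![CDATA[
```c
#include <errno.h>

/* Hypothetical command codes; a real driver uses the
 * SNDRV_PCM_TRIGGER_XXX constants from <sound/pcm.h>. */
enum mock_trigger_cmd {
	MOCK_TRIGGER_STOP,
	MOCK_TRIGGER_START,
	MOCK_TRIGGER_PAUSE_PUSH,
	MOCK_TRIGGER_PAUSE_RELEASE,
	MOCK_TRIGGER_SUSPEND,
	MOCK_TRIGGER_RESUME,
};

struct mock_chip {
	int dma_running;	/* stands in for a DMA enable bit */
};

/* Only flips the "DMA enable" state; no sleeping, no allocation. */
static int mock_trigger(struct mock_chip *chip, int cmd)
{
	switch (cmd) {
	case MOCK_TRIGGER_START:
	case MOCK_TRIGGER_PAUSE_RELEASE:
	case MOCK_TRIGGER_RESUME:
		chip->dma_running = 1;
		return 0;
	case MOCK_TRIGGER_STOP:
	case MOCK_TRIGGER_PAUSE_PUSH:
	case MOCK_TRIGGER_SUSPEND:
		chip->dma_running = 0;
		return 0;
	default:
		return -EINVAL;	/* unknown command */
	}
}
```
]]>
</programlisting>
</informalexample>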
2932 </section> 2932 </section>
2933 2933
2934 <section id="pcm-interface-operators-pointer-callback"> 2934 <section id="pcm-interface-operators-pointer-callback">
2935 <title>pointer callback</title> 2935 <title>pointer callback</title>
2936 <para> 2936 <para>
2937 <informalexample> 2937 <informalexample>
2938 <programlisting> 2938 <programlisting>
2939 <![CDATA[ 2939 <![CDATA[
2940 static snd_pcm_uframes_t snd_xxx_pointer(struct snd_pcm_substream *substream); 2940 static snd_pcm_uframes_t snd_xxx_pointer(struct snd_pcm_substream *substream);
2941 ]]> 2941 ]]>
2942 </programlisting> 2942 </programlisting>
2943 </informalexample> 2943 </informalexample>
2944 2944
2945 This callback is called when the PCM middle layer inquires 2945 This callback is called when the PCM middle layer inquires
2946 the current hardware position on the buffer. The position must 2946 the current hardware position on the buffer. The position must
2947 be returned in frames (which was in bytes on ALSA 0.5.x), 2947 be returned in frames (which was in bytes on ALSA 0.5.x),
2948 ranging from 0 to buffer_size - 1. 2948 ranging from 0 to buffer_size - 1.
2949 </para> 2949 </para>
2950 2950
2951 <para> 2951 <para>
2952 This is usually called from the buffer-update routine in the 2952 This is usually called from the buffer-update routine in the
2953 pcm middle layer, which is invoked when 2953 pcm middle layer, which is invoked when
2954 <function>snd_pcm_period_elapsed()</function> is called in the 2954 <function>snd_pcm_period_elapsed()</function> is called in the
2955 interrupt routine. Then the pcm middle layer updates the 2955 interrupt routine. Then the pcm middle layer updates the
2956 position and calculates the available space, and wakes up the 2956 position and calculates the available space, and wakes up the
2957 sleeping poll threads, etc. 2957 sleeping poll threads, etc.
2958 </para> 2958 </para>
2959 2959
2960 <para> 2960 <para>
2961 This callback is also atomic. 2961 This callback is also atomic.
2962 </para> 2962 </para>
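<para>
Many chips report the current position as a byte offset, so the
pointer callback has to convert it to frames and keep the result
within 0 to buffer_size - 1. The following user-space sketch shows
the arithmetic only. The <structname>mock_runtime</structname>
type and the <function>mock_pointer()</function> function are
hypothetical; in a real driver the values come from
substream-&gt;runtime, and the kernel provides the
<function>bytes_to_frames()</function> helper for the conversion.
</para>

<informalexample>
<programlisting>
<![CDATA[
```c
#include <stddef.h>

/* Hypothetical subset of the runtime record. */
struct mock_runtime {
	unsigned int channels;
	unsigned int sample_bits;	/* bits per sample, e.g. 16 */
	unsigned long buffer_size;	/* in frames */
};

/* Convert a hardware byte offset into a frame position in
 * the range 0 .. buffer_size - 1. */
static unsigned long mock_pointer(const struct mock_runtime *rt,
				  size_t hw_byte_pos)
{
	size_t frame_bytes = rt->channels * rt->sample_bits / 8;
	unsigned long frames = hw_byte_pos / frame_bytes;

	return frames % rt->buffer_size;
}
```
]]>
</programlisting>
</informalexample>

<para>
With two 16-bit channels a frame is 4 bytes, so a byte offset of
400 corresponds to frame 100.
</para>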
2963 </section> 2963 </section>
2964 2964
2965 <section id="pcm-interface-operators-copy-silence"> 2965 <section id="pcm-interface-operators-copy-silence">
2966 <title>copy and silence callbacks</title> 2966 <title>copy and silence callbacks</title>
2967 <para> 2967 <para>
2968 These callbacks are not mandatory, and can be omitted in 2968 These callbacks are not mandatory, and can be omitted in
2969 most cases. These callbacks are used when the hardware buffer 2969 most cases. These callbacks are used when the hardware buffer
2970 cannot be in the normal memory space. Some chips have their 2970 cannot be in the normal memory space. Some chips have their
2971 own buffer on the hardware which is not mappable. In such a 2971 own buffer on the hardware which is not mappable. In such a
2972 case, you have to transfer the data manually from the memory 2972 case, you have to transfer the data manually from the memory
2973 buffer to the hardware buffer. Or, if the buffer is 2973 buffer to the hardware buffer. Or, if the buffer is
2974 non-contiguous on both physical and virtual memory spaces, 2974 non-contiguous on both physical and virtual memory spaces,
2975 these callbacks must be defined, too. 2975 these callbacks must be defined, too.
2976 </para> 2976 </para>
2977 2977
2978 <para> 2978 <para>
2979 If these two callbacks are defined, copy and set-silence 2979 If these two callbacks are defined, copy and set-silence
2980 operations are done by them. The details are described in 2980 operations are done by them. The details are described in
2981 the later section <link 2981 the later section <link
2982 linkend="buffer-and-memory"><citetitle>Buffer and Memory 2982 linkend="buffer-and-memory"><citetitle>Buffer and Memory
2983 Management</citetitle></link>. 2983 Management</citetitle></link>.
2984 </para> 2984 </para>
2985 </section> 2985 </section>
2986 2986
2987 <section id="pcm-interface-operators-ack"> 2987 <section id="pcm-interface-operators-ack">
2988 <title>ack callback</title> 2988 <title>ack callback</title>
2989 <para> 2989 <para>
2990 This callback is also not mandatory. This callback is called 2990 This callback is also not mandatory. This callback is called
2991 when the appl_ptr is updated in read or write operations. 2991 when the appl_ptr is updated in read or write operations.
2992 Some drivers like emu10k1-fx and cs46xx need to track the 2992 Some drivers like emu10k1-fx and cs46xx need to track the
2993 current appl_ptr for the internal buffer, and this callback 2993 current appl_ptr for the internal buffer, and this callback
2994 is useful only for such a purpose. 2994 is useful only for such a purpose.
2995 </para> 2995 </para>
2996 <para> 2996 <para>
2997 This callback is atomic. 2997 This callback is atomic.
2998 </para> 2998 </para>
2999 </section> 2999 </section>
3000 3000
3001 <section id="pcm-interface-operators-page-callback"> 3001 <section id="pcm-interface-operators-page-callback">
3002 <title>page callback</title> 3002 <title>page callback</title>
3003 3003
3004 <para> 3004 <para>
3005 This callback is also not mandatory. This callback is used 3005 This callback is also not mandatory. This callback is used
3006 mainly for the non-contiguous buffer. The mmap calls this 3006 mainly for the non-contiguous buffer. The mmap calls this
3007 callback to get the page address. Some examples will be 3007 callback to get the page address. Some examples will be
3008 explained in the later section <link 3008 explained in the later section <link
3009 linkend="buffer-and-memory"><citetitle>Buffer and Memory 3009 linkend="buffer-and-memory"><citetitle>Buffer and Memory
3010 Management</citetitle></link>, too. 3010 Management</citetitle></link>, too.
3011 </para> 3011 </para>
3012 </section> 3012 </section>
3013 </section> 3013 </section>
3014 3014
3015 <section id="pcm-interface-interrupt-handler"> 3015 <section id="pcm-interface-interrupt-handler">
3016 <title>Interrupt Handler</title> 3016 <title>Interrupt Handler</title>
3017 <para> 3017 <para>
3018 The remaining piece of the pcm code is the PCM interrupt handler. The 3018 The remaining piece of the pcm code is the PCM interrupt handler. The
3019 role of the PCM interrupt handler in the sound driver is to update 3019 role of the PCM interrupt handler in the sound driver is to update
3020 the buffer position and to tell the PCM middle layer when the 3020 the buffer position and to tell the PCM middle layer when the
3021 buffer position crosses the prescribed period boundary. To 3021 buffer position crosses the prescribed period boundary. To
3022 signal this, call the <function>snd_pcm_period_elapsed()</function> 3022 signal this, call the <function>snd_pcm_period_elapsed()</function>
3023 function. 3023 function.
3024 </para> 3024 </para>
3025 3025
3026 <para> 3026 <para>
3027 Sound chips generate interrupts in several different ways. 3027 Sound chips generate interrupts in several different ways.
3028 </para> 3028 </para>
3029 3029
3030 <section id="pcm-interface-interrupt-handler-boundary"> 3030 <section id="pcm-interface-interrupt-handler-boundary">
3031 <title>Interrupts at the period (fragment) boundary</title> 3031 <title>Interrupts at the period (fragment) boundary</title>
3032 <para> 3032 <para>
3033 This is the most frequently found type: the hardware 3033 This is the most frequently found type: the hardware
3034 generates an interrupt at each period boundary. 3034 generates an interrupt at each period boundary.
3035 In this case, you can call 3035 In this case, you can call
3036 <function>snd_pcm_period_elapsed()</function> at each 3036 <function>snd_pcm_period_elapsed()</function> at each
3037 interrupt. 3037 interrupt.
3038 </para> 3038 </para>
3039 3039
3040 <para> 3040 <para>
3041 <function>snd_pcm_period_elapsed()</function> takes the 3041 <function>snd_pcm_period_elapsed()</function> takes the
3042 substream pointer as its argument. Thus, you need to keep the 3042 substream pointer as its argument. Thus, you need to keep the
3043 substream pointer accessible from the chip instance. For 3043 substream pointer accessible from the chip instance. For
3044 example, define a substream field in the chip record to hold the 3044 example, define a substream field in the chip record to hold the
3045 current running substream pointer, and set the pointer value 3045 current running substream pointer, and set the pointer value
3046 in the open callback (and reset it in the close callback). 3046 in the open callback (and reset it in the close callback).
3047 </para> 3047 </para>
3048 3048
3049 <para> 3049 <para>
3050 If you acquire a spinlock in the interrupt handler, and the 3050 If you acquire a spinlock in the interrupt handler, and the
3051 lock is used in other pcm callbacks, too, then you have to 3051 lock is used in other pcm callbacks, too, then you have to
3052 release the lock before calling 3052 release the lock before calling
3053 <function>snd_pcm_period_elapsed()</function>, because 3053 <function>snd_pcm_period_elapsed()</function>, because
3054 <function>snd_pcm_period_elapsed()</function> calls other pcm 3054 <function>snd_pcm_period_elapsed()</function> calls other pcm
3055 callbacks inside. 3055 callbacks inside.
3056 </para> 3056 </para>
3057 3057
3058 <para> 3058 <para>
3059 Typical code would look like this: 3059 Typical code would look like this:
3060 3060
3061 <example> 3061 <example>
3062 <title>Interrupt Handler Case #1</title> 3062 <title>Interrupt Handler Case #1</title>
3063 <programlisting> 3063 <programlisting>
3064 <![CDATA[ 3064 <![CDATA[
3065 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id, 3065 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id,
3066 struct pt_regs *regs) 3066 struct pt_regs *regs)
3067 { 3067 {
3068 struct mychip *chip = dev_id; 3068 struct mychip *chip = dev_id;
3069 spin_lock(&chip->lock); 3069 spin_lock(&chip->lock);
3070 .... 3070 ....
3071 if (pcm_irq_invoked(chip)) { 3071 if (pcm_irq_invoked(chip)) {
3072 /* call updater, unlock before it */ 3072 /* call updater, unlock before it */
3073 spin_unlock(&chip->lock); 3073 spin_unlock(&chip->lock);
3074 snd_pcm_period_elapsed(chip->substream); 3074 snd_pcm_period_elapsed(chip->substream);
3075 spin_lock(&chip->lock); 3075 spin_lock(&chip->lock);
3076 // acknowledge the interrupt if necessary 3076 // acknowledge the interrupt if necessary
3077 } 3077 }
3078 .... 3078 ....
3079 spin_unlock(&chip->lock); 3079 spin_unlock(&chip->lock);
3080 return IRQ_HANDLED; 3080 return IRQ_HANDLED;
3081 } 3081 }
3082 ]]> 3082 ]]>
3083 </programlisting> 3083 </programlisting>
3084 </example> 3084 </example>
3085 </para> 3085 </para>
3086 </section> 3086 </section>
3087 3087
3088 <section id="pcm-interface-interrupt-handler-timer"> 3088 <section id="pcm-interface-interrupt-handler-timer">
3089 <title>High-frequency timer interrupts</title> 3089 <title>High-frequency timer interrupts</title>
3090 <para> 3090 <para>
3091 This is the case when the hardware doesn't generate interrupts 3091 This is the case when the hardware doesn't generate interrupts
3092 at the period boundary but issues timer interrupts at a fixed 3092 at the period boundary but issues timer interrupts at a fixed
3093 timer rate (e.g. es1968 or ymfpci drivers). 3093 timer rate (e.g. es1968 or ymfpci drivers).
3094 In this case, you need to check the current hardware 3094 In this case, you need to check the current hardware
3095 position and accumulate the processed sample length at each 3095 position and accumulate the processed sample length at each
3096 interrupt. When the accumulated size reaches or exceeds the period 3096 interrupt. When the accumulated size reaches or exceeds the period
3097 size, call 3097 size, call
3098 <function>snd_pcm_period_elapsed()</function> and reset the 3098 <function>snd_pcm_period_elapsed()</function> and reset the
3099 accumulator. 3099 accumulator.
3100 </para> 3100 </para>
3101 3101
3102 <para> 3102 <para>
3103 Typical code would look like the following. 3103 Typical code would look like the following.
3104 3104
3105 <example> 3105 <example>
3106 <title>Interrupt Handler Case #2</title> 3106 <title>Interrupt Handler Case #2</title>
3107 <programlisting> 3107 <programlisting>
3108 <![CDATA[ 3108 <![CDATA[
3109 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id, 3109 static irqreturn_t snd_mychip_interrupt(int irq, void *dev_id,
3110 struct pt_regs *regs) 3110 struct pt_regs *regs)
3111 { 3111 {
3112 struct mychip *chip = dev_id; 3112 struct mychip *chip = dev_id;
3113 spin_lock(&chip->lock); 3113 spin_lock(&chip->lock);
3114 .... 3114 ....
3115 if (pcm_irq_invoked(chip)) { 3115 if (pcm_irq_invoked(chip)) {
3116 unsigned int last_ptr, size; 3116 unsigned int last_ptr, size;
3117 /* get the current hardware pointer (in frames) */ 3117 /* get the current hardware pointer (in frames) */
3118 last_ptr = get_hw_ptr(chip); 3118 last_ptr = get_hw_ptr(chip);
3119 /* calculate the processed frames since the 3119 /* calculate the processed frames since the
3120 * last update 3120 * last update
3121 */ 3121 */
3122 if (last_ptr < chip->last_ptr) 3122 if (last_ptr < chip->last_ptr)
3123 size = runtime->buffer_size + last_ptr 3123 size = runtime->buffer_size + last_ptr
3124 - chip->last_ptr; 3124 - chip->last_ptr;
3125 else 3125 else
3126 size = last_ptr - chip->last_ptr; 3126 size = last_ptr - chip->last_ptr;
3127 /* remember the last updated point */ 3127 /* remember the last updated point */
3128 chip->last_ptr = last_ptr; 3128 chip->last_ptr = last_ptr;
3129 /* accumulate the size */ 3129 /* accumulate the size */
3130 chip->size += size; 3130 chip->size += size;
3131 /* over the period boundary? */ 3131 /* over the period boundary? */
3132 if (chip->size >= runtime->period_size) { 3132 if (chip->size >= runtime->period_size) {
3133 /* reset the accumulator */ 3133 /* reset the accumulator */
3134 chip->size %= runtime->period_size; 3134 chip->size %= runtime->period_size;
3135 /* call updater */ 3135 /* call updater */
3136 spin_unlock(&chip->lock); 3136 spin_unlock(&chip->lock);
3137 snd_pcm_period_elapsed(chip->substream); 3137 snd_pcm_period_elapsed(chip->substream);
3138 spin_lock(&chip->lock); 3138 spin_lock(&chip->lock);
3139 } 3139 }
3140 // acknowledge the interrupt if necessary 3140 // acknowledge the interrupt if necessary
3141 } 3141 }
3142 .... 3142 ....
3143 spin_unlock(&chip->lock); 3143 spin_unlock(&chip->lock);
3144 return IRQ_HANDLED; 3144 return IRQ_HANDLED;
3145 } 3145 }
3146 ]]> 3146 ]]>
3147 </programlisting> 3147 </programlisting>
3148 </example> 3148 </example>
3149 </para> 3149 </para>
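<para>
The accumulator arithmetic in the handler above can be tried outside
the kernel. The following is a minimal user-space sketch, not driver
code: <structname>period_tracker</structname> and
<function>tracker_update()</function> are hypothetical names standing
in for the <structfield>last_ptr</structfield> and
<structfield>size</structfield> fields of the chip record, and the
buffer and period sizes are passed in explicitly instead of being read
from the runtime record.
</para>
<informalexample>
<programlisting>
<![CDATA[
```c
#include <assert.h>

/* Hypothetical stand-in for the bookkeeping fields of the chip record. */
struct period_tracker {
	unsigned int last_ptr;	/* hardware position at the previous tick (frames) */
	unsigned int size;	/* frames accumulated since the last notification */
};

/*
 * Feed one timer tick into the tracker.  Returns 1 when at least one
 * period boundary has been crossed, i.e. when the driver would call
 * snd_pcm_period_elapsed() once, and 0 otherwise.
 */
static int tracker_update(struct period_tracker *t, unsigned int hw_ptr,
			  unsigned int buffer_size, unsigned int period_size)
{
	unsigned int delta;

	assert(period_size > 0 && buffer_size >= period_size);
	assert(hw_ptr < buffer_size);

	/* frames processed since the last tick, handling buffer wrap-around */
	if (hw_ptr < t->last_ptr)
		delta = buffer_size + hw_ptr - t->last_ptr;
	else
		delta = hw_ptr - t->last_ptr;

	t->last_ptr = hw_ptr;
	t->size += delta;

	if (t->size >= period_size) {
		/* keep only the remainder beyond the last full period */
		t->size %= period_size;
		return 1;
	}
	return 0;
}
```
]]>
</programlisting>
</informalexample>
<para>
With a buffer of 1024 frames and a period of 256 frames, ticks at
positions 100, 300 and 50 return 0, 1 and 1. The last tick wraps
around the buffer and covers three periods, yet it is still reported
only once.
</para>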
3150 </section> 3150 </section>
3151 3151
3152 <section id="pcm-interface-interrupt-handler-both"> 3152 <section id="pcm-interface-interrupt-handler-both">
3153 <title>On calling <function>snd_pcm_period_elapsed()</function></title> 3153 <title>On calling <function>snd_pcm_period_elapsed()</function></title>
3154 <para> 3154 <para>
3155 In both cases, even if more than one period has elapsed, you 3155 In both cases, even if more than one period has elapsed, you
3156 don't have to call 3156 don't have to call
3157 <function>snd_pcm_period_elapsed()</function> many times. Call 3157 <function>snd_pcm_period_elapsed()</function> many times. Call
3158 it only once; the pcm layer will check the current hardware 3158 it only once; the pcm layer will check the current hardware
3159 pointer and update to the latest status. 3159 pointer and update to the latest status.
3160 </para> 3160 </para>
3161 </section> 3161 </section>
3162 </section> 3162 </section>
3163 3163
3164 <section id="pcm-interface-atomicity"> 3164 <section id="pcm-interface-atomicity">
3165 <title>Atomicity</title> 3165 <title>Atomicity</title>
3166 <para> 3166 <para>
3167 One of the most important (and thus difficult to debug) problems 3167 One of the most important (and thus difficult to debug) problems
3168 in kernel programming is the race condition. 3168 in kernel programming is the race condition.
3169 In the Linux kernel, it is usually solved via spinlocks or 3169 In the Linux kernel, it is usually solved via spinlocks or
3170 semaphores. In general, if the race condition may 3170 semaphores. In general, if the race condition may
3171 happen in the interrupt handler, it's handled as atomic, and you 3171 happen in the interrupt handler, it's handled as atomic, and you
3172 have to use a spinlock to protect the critical section. If it 3172 have to use a spinlock to protect the critical section. If it
3173 never happens in the interrupt and it may take a relatively long 3173 never happens in the interrupt and it may take a relatively long
3174 time, you should use a semaphore. 3174 time, you should use a semaphore.
3175 </para> 3175 </para>
3176 3176
3177 <para> 3177 <para>
3178 As already seen, some pcm callbacks are atomic and some are 3178 As already seen, some pcm callbacks are atomic and some are
3179 not. For example, <parameter>hw_params</parameter> callback is 3179 not. For example, <parameter>hw_params</parameter> callback is
3180 non-atomic, while <parameter>trigger</parameter> callback is 3180 non-atomic, while <parameter>trigger</parameter> callback is
3181 atomic. This means, the latter is called already in a spinlock 3181 atomic. This means, the latter is called already in a spinlock
3182 held by the PCM middle layer. Please take this atomicity into 3182 held by the PCM middle layer. Please take this atomicity into
3183 account when you use a spinlock or a semaphore in the callbacks. 3183 account when you use a spinlock or a semaphore in the callbacks.
3184 </para> 3184 </para>
3185 3185
3186 <para> 3186 <para>
3187 In the atomic callbacks, you cannot use functions which may call 3187 In the atomic callbacks, you cannot use functions which may call
3188 <function>schedule</function> or go to 3188 <function>schedule</function> or go to
3189 <function>sleep</function>. Semaphores and mutexes can sleep, 3189 <function>sleep</function>. Semaphores and mutexes can sleep,
3190 and hence they cannot be used inside the atomic callbacks 3190 and hence they cannot be used inside the atomic callbacks
3191 (e.g. <parameter>trigger</parameter> callback). 3191 (e.g. <parameter>trigger</parameter> callback).
3192 If you need a delay in such a callback, please use 3192 If you need a delay in such a callback, please use
3193 <function>udelay()</function> or <function>mdelay()</function>. 3193 <function>udelay()</function> or <function>mdelay()</function>.
3194 </para> 3194 </para>
3195 3195
3196 <para> 3196 <para>
3197 All three atomic callbacks (trigger, pointer, and ack) are 3197 All three atomic callbacks (trigger, pointer, and ack) are
3198 called with local interrupts disabled. 3198 called with local interrupts disabled.
3199 </para> 3199 </para>
3200 3200
3201 </section> 3201 </section>
3202 <section id="pcm-interface-constraints"> 3202 <section id="pcm-interface-constraints">
3203 <title>Constraints</title> 3203 <title>Constraints</title>
3204 <para> 3204 <para>
3205 If your chip supports unconventional sample rates, or only a 3205 If your chip supports unconventional sample rates, or only a
3206 limited set of rates, you need to set a constraint for the 3206 limited set of rates, you need to set a constraint for the
3207 condition. 3207 condition.
3208 </para> 3208 </para>
3209 3209
3210 <para> 3210 <para>
3211 For example, in order to restrict the sample rates to some 3211 For example, in order to restrict the sample rates to some
3212 supported values, use 3212 supported values, use
3213 <function>snd_pcm_hw_constraint_list()</function>. 3213 <function>snd_pcm_hw_constraint_list()</function>.
3214 You need to call this function in the open callback. 3214 You need to call this function in the open callback.
3215 3215
3216 <example> 3216 <example>
3217 <title>Example of Hardware Constraints</title> 3217 <title>Example of Hardware Constraints</title>
3218 <programlisting> 3218 <programlisting>
3219 <![CDATA[ 3219 <![CDATA[
3220 static unsigned int rates[] = 3220 static unsigned int rates[] =
3221 {4000, 10000, 22050, 44100}; 3221 {4000, 10000, 22050, 44100};
3222 static struct snd_pcm_hw_constraint_list constraints_rates = { 3222 static struct snd_pcm_hw_constraint_list constraints_rates = {
3223 .count = ARRAY_SIZE(rates), 3223 .count = ARRAY_SIZE(rates),
3224 .list = rates, 3224 .list = rates,
3225 .mask = 0, 3225 .mask = 0,
3226 }; 3226 };
3227 3227
3228 static int snd_mychip_pcm_open(struct snd_pcm_substream *substream) 3228 static int snd_mychip_pcm_open(struct snd_pcm_substream *substream)
3229 { 3229 {
3230 int err; 3230 int err;
3231 .... 3231 ....
3232 err = snd_pcm_hw_constraint_list(substream->runtime, 0, 3232 err = snd_pcm_hw_constraint_list(substream->runtime, 0,
3233 SNDRV_PCM_HW_PARAM_RATE, 3233 SNDRV_PCM_HW_PARAM_RATE,
3234 &constraints_rates); 3234 &constraints_rates);
3235 if (err < 0) 3235 if (err < 0)
3236 return err; 3236 return err;
3237 .... 3237 ....
3238 } 3238 }
3239 ]]> 3239 ]]>
3240 </programlisting> 3240 </programlisting>
3241 </example> 3241 </example>
3242 </para> 3242 </para>
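<para>
To get a feeling for what a list constraint does to the rate
parameter, the following user-space sketch models it in a simplified
way: a range of rates is narrowed to the listed values that fall
inside it, and the request fails if none does. This is only an
illustration under that assumption, not the PCM middle layer code;
<structname>rate_range</structname> and
<function>rate_range_apply_list()</function> are hypothetical names.
</para>
<informalexample>
<programlisting>
<![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an interval of allowed rates (inclusive). */
struct rate_range {
	unsigned int min;
	unsigned int max;
};

/*
 * Narrow *r to the listed values that fall inside it.
 * Returns 0 on success, or -1 if no listed value fits
 * (the range is then left untouched).
 */
static int rate_range_apply_list(struct rate_range *r,
				 const unsigned int *list, size_t count)
{
	unsigned int new_min = 0, new_max = 0;
	int found = 0;
	size_t i;

	assert(r != NULL && list != NULL);

	for (i = 0; i < count; i++) {
		if (list[i] < r->min || list[i] > r->max)
			continue;	/* value lies outside the current range */
		if (!found || list[i] < new_min)
			new_min = list[i];
		if (!found || list[i] > new_max)
			new_max = list[i];
		found = 1;
	}
	if (!found)
		return -1;

	r->min = new_min;
	r->max = new_max;
	return 0;
}
```
]]>
</programlisting>
</informalexample>
<para>
With the rate list from the example above, a range of 8000 to 48000
is narrowed to 10000 to 44100, while a range of 45000 to 48000 is
rejected because no listed rate lies inside it.
</para>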
3243 3243
3244 <para> 3244 <para>
3245 There are many different constraints. 3245 There are many different constraints.
3246 Look in <filename>sound/pcm.h</filename> for a complete list. 3246 Look in <filename>sound/pcm.h</filename> for a complete list.
3247 You can even define your own constraint rules. 3247 You can even define your own constraint rules.
3248 For example, let's suppose my_chip can manage a substream of 1 channel 3248 For example, let's suppose my_chip can manage a substream of 1 channel
3249 if and only if the format is S16_LE, otherwise it supports any format 3249 if and only if the format is S16_LE, otherwise it supports any format
3250 specified in the <structname>snd_pcm_hardware</structname> structure (or in any 3250 specified in the <structname>snd_pcm_hardware</structname> structure (or in any
3251 other constraint_list). You can build a rule like this: 3251 other constraint_list). You can build a rule like this:
3252 3252
3253 <example> 3253 <example>
3254 <title>Example of Hardware Constraints for Channels</title> 3254 <title>Example of Hardware Constraints for Channels</title>
3255 <programlisting> 3255 <programlisting>
3256 <![CDATA[ 3256 <![CDATA[
3257 static int hw_rule_format_by_channels(struct snd_pcm_hw_params *params, 3257 static int hw_rule_format_by_channels(struct snd_pcm_hw_params *params,
3258 struct snd_pcm_hw_rule *rule) 3258 struct snd_pcm_hw_rule *rule)
3259 { 3259 {
3260 struct snd_interval *c = hw_param_interval(params, 3260 struct snd_interval *c = hw_param_interval(params,
3261 SNDRV_PCM_HW_PARAM_CHANNELS); 3261 SNDRV_PCM_HW_PARAM_CHANNELS);
3262 struct snd_mask *f = hw_param_mask(params, SNDRV_PCM_HW_PARAM_FORMAT); 3262 struct snd_mask *f = hw_param_mask(params, SNDRV_PCM_HW_PARAM_FORMAT);
3263 struct snd_mask fmt; 3263 struct snd_mask fmt;
3264 3264
3265 snd_mask_any(&fmt); /* Init the struct */ 3265 snd_mask_any(&fmt); /* Init the struct */
3266 if (c->min < 2) { 3266 if (c->min < 2) {
3267 fmt.bits[0] &= SNDRV_PCM_FMTBIT_S16_LE; 3267 fmt.bits[0] &= SNDRV_PCM_FMTBIT_S16_LE;
3268 return snd_mask_refine(f, &fmt); 3268 return snd_mask_refine(f, &fmt);
3269 } 3269 }
3270 return 0; 3270 return 0;
3271 } 3271 }
3272 ]]> 3272 ]]>
3273 </programlisting> 3273 </programlisting>
3274 </example> 3274 </example>
3275 </para> 3275 </para>
3276 3276
3277 <para> 3277 <para>
3278 Then you need to call this function to add your rule: 3278 Then you need to call this function to add your rule:
3279 3279
3280 <informalexample> 3280 <informalexample>
3281 <programlisting> 3281 <programlisting>
3282 <![CDATA[ 3282 <![CDATA[
3283 snd_pcm_hw_rule_add(substream->runtime, 0, SNDRV_PCM_HW_PARAM_CHANNELS, 3283 snd_pcm_hw_rule_add(substream->runtime, 0, SNDRV_PCM_HW_PARAM_CHANNELS,
3284 hw_rule_channels_by_format, 0, SNDRV_PCM_HW_PARAM_FORMAT, 3284 hw_rule_channels_by_format, 0, SNDRV_PCM_HW_PARAM_FORMAT,
3285 -1); 3285 -1);
3286 ]]> 3286 ]]>
3287 </programlisting> 3287 </programlisting>
3288 </informalexample> 3288 </informalexample>
3289 </para> 3289 </para>
3290 3290
3291 <para> 3291 <para>
3292 The rule function is called when an application sets the number of 3292 The rule function is called when an application sets the number of
3293 channels. But an application can set the format before the number of 3293 channels. But an application can set the format before the number of
3294 channels. Thus you also need to define the inverse rule: 3294 channels. Thus you also need to define the inverse rule:
3295 3295
3296 <example> 3296 <example>
3297 <title>Example of Hardware Constraints for Channels</title> 3297 <title>Example of Hardware Constraints for Channels</title>
3298 <programlisting> 3298 <programlisting>
3299 <![CDATA[ 3299 <![CDATA[
3300 static int hw_rule_channels_by_format(struct snd_pcm_hw_params *params, 3300 static int hw_rule_channels_by_format(struct snd_pcm_hw_params *params,
3301 struct snd_pcm_hw_rule *rule) 3301 struct snd_pcm_hw_rule *rule)
3302 { 3302 {
3303 struct snd_interval *c = hw_param_interval(params, 3303 struct snd_interval *c = hw_param_interval(params,
3304 SNDRV_PCM_HW_PARAM_CHANNELS); 3304 SNDRV_PCM_HW_PARAM_CHANNELS);
3305 struct snd_mask *f = hw_param_mask(params, SNDRV_PCM_HW_PARAM_FORMAT); 3305 struct snd_mask *f = hw_param_mask(params, SNDRV_PCM_HW_PARAM_FORMAT);
3306 struct snd_interval ch; 3306 struct snd_interval ch;
3307 3307
3308 snd_interval_any(&ch); 3308 snd_interval_any(&ch);
3309 if (f->bits[0] == SNDRV_PCM_FMTBIT_S16_LE) { 3309 if (f->bits[0] == SNDRV_PCM_FMTBIT_S16_LE) {
3310 ch.min = ch.max = 1; 3310 ch.min = ch.max = 1;
3311 ch.integer = 1; 3311 ch.integer = 1;
3312 return snd_interval_refine(c, &ch); 3312 return snd_interval_refine(c, &ch);
3313 } 3313 }
3314 return 0; 3314 return 0;
3315 } 3315 }
3316 ]]> 3316 ]]>
3317 </programlisting> 3317 </programlisting>
3318 </example> 3318 </example>
3319 </para> 3319 </para>
3320 3320
3321 <para> 3321 <para>
3322 ...and in the open callback: 3322 ...and in the open callback:
3323 <informalexample> 3323 <informalexample>
3324 <programlisting> 3324 <programlisting>
3325 <![CDATA[ 3325 <![CDATA[
3326 snd_pcm_hw_rule_add(substream->runtime, 0, SNDRV_PCM_HW_PARAM_FORMAT, 3326 snd_pcm_hw_rule_add(substream->runtime, 0, SNDRV_PCM_HW_PARAM_FORMAT,
3327 hw_rule_format_by_channels, 0, SNDRV_PCM_HW_PARAM_CHANNELS, 3327 hw_rule_format_by_channels, 0, SNDRV_PCM_HW_PARAM_CHANNELS,
3328 -1); 3328 -1);
3329 ]]> 3329 ]]>
3330 </programlisting> 3330 </programlisting>
3331 </informalexample> 3331 </informalexample>
3332 </para> 3332 </para>
3333 3333
3334 <para> 3334 <para>
3335 I won't explain more details here, rather I 3335 I won't explain more details here, rather I
3336 would like to say, <quote>Luke, use the source.</quote> 3336 would like to say, <quote>Luke, use the source.</quote>
3337 </para> 3337 </para>
3338 </section> 3338 </section>
3339 3339
3340 </chapter> 3340 </chapter>
3341 3341
3342 3342
3343 <!-- ****************************************************** --> 3343 <!-- ****************************************************** -->
3344 <!-- Control Interface --> 3344 <!-- Control Interface -->
3345 <!-- ****************************************************** --> 3345 <!-- ****************************************************** -->
3346 <chapter id="control-interface"> 3346 <chapter id="control-interface">
3347 <title>Control Interface</title> 3347 <title>Control Interface</title>
3348 3348
3349 <section id="control-interface-general"> 3349 <section id="control-interface-general">
3350 <title>General</title> 3350 <title>General</title>
3351 <para> 3351 <para>
3352 The control interface is used widely for many switches, 3352 The control interface is used widely for many switches,
3353 sliders, etc. which are accessed from the user-space. Its most 3353 sliders, etc. which are accessed from the user-space. Its most
3354 important use is the mixer interface. In other words, on ALSA 3354 important use is the mixer interface. In other words, on ALSA
3355 0.9.x, all the mixer stuff is implemented on the control kernel 3355 0.9.x, all the mixer stuff is implemented on the control kernel
3356 API (while there was an independent mixer kernel API on 0.5.x). 3356 API (while there was an independent mixer kernel API on 0.5.x).
3357 </para> 3357 </para>
3358 3358
3359 <para> 3359 <para>
3360 ALSA has a well-defined AC97 control module. If your chip 3360 ALSA has a well-defined AC97 control module. If your chip
3361 supports only the AC97 and nothing else, you can skip this 3361 supports only the AC97 and nothing else, you can skip this
3362 section. 3362 section.
3363 </para> 3363 </para>
3364 3364
3365 <para> 3365 <para>
3366 The control API is defined in 3366 The control API is defined in
3367 <filename>&lt;sound/control.h&gt;</filename>. 3367 <filename>&lt;sound/control.h&gt;</filename>.
3368 Include this file if you add your own controls. 3368 Include this file if you add your own controls.
3369 </para> 3369 </para>
3370 </section> 3370 </section>
3371 3371
3372 <section id="control-interface-definition"> 3372 <section id="control-interface-definition">
3373 <title>Definition of Controls</title> 3373 <title>Definition of Controls</title>
3374 <para> 3374 <para>
3375 For creating a new control, you need to define the three 3375 For creating a new control, you need to define the three
3376 callbacks: <structfield>info</structfield>, 3376 callbacks: <structfield>info</structfield>,
3377 <structfield>get</structfield> and 3377 <structfield>get</structfield> and
3378 <structfield>put</structfield>. Then, define a 3378 <structfield>put</structfield>. Then, define a
3379 struct <structname>snd_kcontrol_new</structname> record, such as: 3379 struct <structname>snd_kcontrol_new</structname> record, such as:
3380 3380
3381 <example> 3381 <example>
3382 <title>Definition of a Control</title> 3382 <title>Definition of a Control</title>
3383 <programlisting> 3383 <programlisting>
3384 <![CDATA[ 3384 <![CDATA[
3385 static struct snd_kcontrol_new my_control __devinitdata = { 3385 static struct snd_kcontrol_new my_control __devinitdata = {
3386 .iface = SNDRV_CTL_ELEM_IFACE_MIXER, 3386 .iface = SNDRV_CTL_ELEM_IFACE_MIXER,
3387 .name = "PCM Playback Switch", 3387 .name = "PCM Playback Switch",
3388 .index = 0, 3388 .index = 0,
3389 .access = SNDRV_CTL_ELEM_ACCESS_READWRITE, 3389 .access = SNDRV_CTL_ELEM_ACCESS_READWRITE,
3390 .private_value = 0xffff, 3390 .private_value = 0xffff,
3391 .info = my_control_info, 3391 .info = my_control_info,
3392 .get = my_control_get, 3392 .get = my_control_get,
3393 .put = my_control_put 3393 .put = my_control_put
3394 }; 3394 };
3395 ]]> 3395 ]]>
3396 </programlisting> 3396 </programlisting>
3397 </example> 3397 </example>
3398 </para> 3398 </para>
3399 3399
3400 <para> 3400 <para>
3401 Most likely the control is created via 3401 Most likely the control is created via
3402 <function>snd_ctl_new1()</function>, and in such a case, you can 3402 <function>snd_ctl_new1()</function>, and in such a case, you can
3403 add <parameter>__devinitdata</parameter> prefix to the 3403 add <parameter>__devinitdata</parameter> prefix to the
3404 definition like above. 3404 definition like above.
3405 </para> 3405 </para>
3406 3406
3407 <para> 3407 <para>
3408 The <structfield>iface</structfield> field specifies the type of 3408 The <structfield>iface</structfield> field specifies the type of
3409 the control, <constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>, which 3409 the control, <constant>SNDRV_CTL_ELEM_IFACE_XXX</constant>, which
3410 is usually <constant>MIXER</constant>. 3410 is usually <constant>MIXER</constant>.
3411 Use <constant>CARD</constant> for global controls that are not 3411 Use <constant>CARD</constant> for global controls that are not
3412 logically part of the mixer. 3412 logically part of the mixer.
3413 If the control is closely associated with some specific device on 3413 If the control is closely associated with some specific device on
3414 the sound card, use <constant>HWDEP</constant>, 3414 the sound card, use <constant>HWDEP</constant>,
3415 <constant>PCM</constant>, <constant>RAWMIDI</constant>, 3415 <constant>PCM</constant>, <constant>RAWMIDI</constant>,
3416 <constant>TIMER</constant>, or <constant>SEQUENCER</constant>, and 3416 <constant>TIMER</constant>, or <constant>SEQUENCER</constant>, and
3417 specify the device number with the 3417 specify the device number with the
3418 <structfield>device</structfield> and 3418 <structfield>device</structfield> and
3419 <structfield>subdevice</structfield> fields. 3419 <structfield>subdevice</structfield> fields.
3420 </para> 3420 </para>
3421 3421
3422 <para> 3422 <para>
3423 The <structfield>name</structfield> is the name identifier 3423 The <structfield>name</structfield> is the name identifier
3424 string. On ALSA 0.9.x, the control name is very important, 3424 string. On ALSA 0.9.x, the control name is very important,
3425 because its role is classified from its name. There are 3425 because its role is classified from its name. There are
3426 pre-defined standard control names. The details are described in 3426 pre-defined standard control names. The details are described in
3427 the subsection 3427 the subsection
3428 <link linkend="control-interface-control-names"><citetitle> 3428 <link linkend="control-interface-control-names"><citetitle>
3429 Control Names</citetitle></link>. 3429 Control Names</citetitle></link>.
3430 </para> 3430 </para>
3431 3431
3432 <para> 3432 <para>
3433 The <structfield>index</structfield> field holds the index number 3433 The <structfield>index</structfield> field holds the index number
3434 of this control. If there are several different controls with 3434 of this control. If there are several different controls with
3435 the same name, they can be distinguished by the index 3435 the same name, they can be distinguished by the index
3436 number. This is the case when 3436 number. This is the case when
3437 several codecs exist on the card. If the index is zero, you can 3437 several codecs exist on the card. If the index is zero, you can
3438 omit the definition above. 3438 omit the definition above.
3439 </para> 3439 </para>
3440 3440
3441 <para> 3441 <para>
3442 The <structfield>access</structfield> field contains the access 3442 The <structfield>access</structfield> field contains the access
3443 type of this control. Give the combination of bit masks, 3443 type of this control. Give the combination of bit masks,
3444 <constant>SNDRV_CTL_ELEM_ACCESS_XXX</constant>, there. 3444 <constant>SNDRV_CTL_ELEM_ACCESS_XXX</constant>, there.
3445 The details are explained in the subsection 3445 The details are explained in the subsection
3446 <link linkend="control-interface-access-flags"><citetitle> 3446 <link linkend="control-interface-access-flags"><citetitle>
3447 Access Flags</citetitle></link>. 3447 Access Flags</citetitle></link>.
3448 </para> 3448 </para>
3449 3449
3450 <para> 3450 <para>
3451 The <structfield>private_value</structfield> field contains 3451 The <structfield>private_value</structfield> field contains
3452 an arbitrary long integer value for this record. When using 3452 an arbitrary long integer value for this record. When using
3453 generic <structfield>info</structfield>, 3453 generic <structfield>info</structfield>,
3454 <structfield>get</structfield> and 3454 <structfield>get</structfield> and
3455 <structfield>put</structfield> callbacks, you can pass a value 3455 <structfield>put</structfield> callbacks, you can pass a value
3456 through this field. If several small numbers are necessary, you can 3456 through this field. If several small numbers are necessary, you can
3457 combine them bitwise. It's also possible to store a pointer 3457 combine them bitwise. It's also possible to store a pointer
3458 (cast to unsigned long) to some record in this field. 3458 (cast to unsigned long) to some record in this field.
3459 </para> 3459 </para>
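<para>
As an illustration of combining several small numbers bitwise, the
sketch below packs a register index, a shift, a mask and an invert
flag into one unsigned long, as could be stored in
<structfield>private_value</structfield>. The bit layout and all
<function>my_ctl_*</function> names are assumptions made for this
example; choose whatever layout suits your chip.
</para>
<informalexample>
<programlisting>
<![CDATA[
```c
#include <assert.h>

/*
 * Hypothetical layout:
 *   bits  0-7   register index
 *   bits  8-11  shift
 *   bits 12-19  mask
 *   bit  20     invert flag
 * A macro is used so that the value can appear in a static initializer.
 */
#define MY_CTL_VALUE(reg, shift, mask, invert) \
	((unsigned long)(reg) | ((unsigned long)(shift) << 8) | \
	 ((unsigned long)(mask) << 12) | ((unsigned long)(invert) << 20))

/* Accessors that unpack the individual fields again. */
static unsigned int my_ctl_reg(unsigned long pv)
{
	return (unsigned int)(pv & 0xff);
}

static unsigned int my_ctl_shift(unsigned long pv)
{
	return (unsigned int)((pv >> 8) & 0x0f);
}

static unsigned int my_ctl_mask(unsigned long pv)
{
	return (unsigned int)((pv >> 12) & 0xff);
}

static unsigned int my_ctl_invert(unsigned long pv)
{
	return (unsigned int)((pv >> 20) & 0x01);
}
```
]]>
</programlisting>
</informalexample>
<para>
A generic callback can then recover the fields from the value it is
given, so that one set of callbacks can serve many controls.
</para>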
3460 3460
3461 <para> 3461 <para>
3462 The other three are 3462 The other three are
3463 <link linkend="control-interface-callbacks"><citetitle> 3463 <link linkend="control-interface-callbacks"><citetitle>
3464 callback functions</citetitle></link>. 3464 callback functions</citetitle></link>.
3465 </para> 3465 </para>
3466 </section> 3466 </section>
3467 3467
3468 <section id="control-interface-control-names"> 3468 <section id="control-interface-control-names">
3469 <title>Control Names</title> 3469 <title>Control Names</title>
3470 <para> 3470 <para>
3471 There are some standards for defining the control names. A 3471 There are some standards for defining the control names. A
3472 control name usually consists of three parts, in the form 3472 control name usually consists of three parts, in the form
3473 <quote>SOURCE DIRECTION FUNCTION</quote>. 3473 <quote>SOURCE DIRECTION FUNCTION</quote>.
3474 </para> 3474 </para>
3475 3475
3476 <para> 3476 <para>
3477 The first, <constant>SOURCE</constant>, specifies the source 3477 The first, <constant>SOURCE</constant>, specifies the source
3478 of the control, and is a string such as <quote>Master</quote>, 3478 of the control, and is a string such as <quote>Master</quote>,
3479 <quote>PCM</quote>, <quote>CD</quote> or 3479 <quote>PCM</quote>, <quote>CD</quote> or
3480 <quote>Line</quote>. There are many pre-defined sources. 3480 <quote>Line</quote>. There are many pre-defined sources.
3481 </para> 3481 </para>
3482 3482
3483 <para> 3483 <para>
3484 The second, <constant>DIRECTION</constant>, is one of the 3484 The second, <constant>DIRECTION</constant>, is one of the
3485 following strings according to the direction of the control: 3485 following strings according to the direction of the control:
3486 <quote>Playback</quote>, <quote>Capture</quote>, <quote>Bypass 3486 <quote>Playback</quote>, <quote>Capture</quote>, <quote>Bypass
3487 Playback</quote> and <quote>Bypass Capture</quote>. Or, it can 3487 Playback</quote> and <quote>Bypass Capture</quote>. Or, it can
3488 be omitted, meaning both playback and capture directions. 3488 be omitted, meaning both playback and capture directions.
3489 </para> 3489 </para>
3490 3490
3491 <para> 3491 <para>
3492 The third, <constant>FUNCTION</constant>, is one of the 3492 The third, <constant>FUNCTION</constant>, is one of the
3493 following strings according to the function of the control: 3493 following strings according to the function of the control:
3494 <quote>Switch</quote>, <quote>Volume</quote> and 3494 <quote>Switch</quote>, <quote>Volume</quote> and
3495 <quote>Route</quote>. 3495 <quote>Route</quote>.
3496 </para> 3496 </para>
3497 3497
3498 <para> 3498 <para>
3499 Examples of control names are thus <quote>Master Capture 3499 Examples of control names are thus <quote>Master Capture
3500 Switch</quote> or <quote>PCM Playback Volume</quote>. 3500 Switch</quote> or <quote>PCM Playback Volume</quote>.
3501 </para> 3501 </para>
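<para>
The naming scheme can be expressed as a small helper. This is a
user-space sketch for illustration only;
<function>make_control_name()</function> is a hypothetical function,
and in a driver you would normally just write the name as a string
literal in the control definition.
</para>
<informalexample>
<programlisting>
<![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Build "SOURCE DIRECTION FUNCTION" into buf.  The direction may be
 * NULL or empty, in which case it is omitted and the result is
 * "SOURCE FUNCTION".  Returns the value of snprintf().
 */
static int make_control_name(char *buf, size_t len, const char *source,
			     const char *direction, const char *function)
{
	assert(buf != NULL && source != NULL && function != NULL);

	if (direction != NULL && direction[0] != '\0')
		return snprintf(buf, len, "%s %s %s",
				source, direction, function);
	return snprintf(buf, len, "%s %s", source, function);
}
```
]]>
</programlisting>
</informalexample>
<para>
For example, the parts <quote>PCM</quote>, <quote>Playback</quote>
and <quote>Volume</quote> give <quote>PCM Playback Volume</quote>,
and <quote>Master</quote> with no direction and
<quote>Switch</quote> gives <quote>Master Switch</quote>.
</para>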
3502 3502
3503 <para> 3503 <para>
3504 There are some exceptions: 3504 There are some exceptions:
3505 </para> 3505 </para>
3506 3506
3507 <section id="control-interface-control-names-global"> 3507 <section id="control-interface-control-names-global">
3508 <title>Global capture and playback</title> 3508 <title>Global capture and playback</title>
3509 <para> 3509 <para>
3510 <quote>Capture Source</quote>, <quote>Capture Switch</quote> 3510 <quote>Capture Source</quote>, <quote>Capture Switch</quote>
3511 and <quote>Capture Volume</quote> are used for the global 3511 and <quote>Capture Volume</quote> are used for the global
3512 capture (input) source, switch and volume. Similarly, 3512 capture (input) source, switch and volume. Similarly,
3513 <quote>Playback Switch</quote> and <quote>Playback 3513 <quote>Playback Switch</quote> and <quote>Playback
3514 Volume</quote> are used for the global output gain switch and 3514 Volume</quote> are used for the global output gain switch and
3515 volume. 3515 volume.
3516 </para> 3516 </para>
3517 </section> 3517 </section>
3518 3518
3519 <section id="control-interface-control-names-tone"> 3519 <section id="control-interface-control-names-tone">
3520 <title>Tone-controls</title> 3520 <title>Tone-controls</title>
3521 <para> 3521 <para>
3522 Tone-control switches and volumes are specified like 3522 Tone-control switches and volumes are specified like
3523 <quote>Tone Control - XXX</quote>, e.g. <quote>Tone Control - 3523 <quote>Tone Control - XXX</quote>, e.g. <quote>Tone Control -
3524 Switch</quote>, <quote>Tone Control - Bass</quote>, 3524 Switch</quote>, <quote>Tone Control - Bass</quote>,
3525 <quote>Tone Control - Center</quote>. 3525 <quote>Tone Control - Center</quote>.
3526 </para> 3526 </para>
3527 </section> 3527 </section>
3528 3528
3529 <section id="control-interface-control-names-3d"> 3529 <section id="control-interface-control-names-3d">
3530 <title>3D controls</title> 3530 <title>3D controls</title>
3531 <para> 3531 <para>
3532 3D-control switches and volumes are specified like <quote>3D 3532 3D-control switches and volumes are specified like <quote>3D
3533 Control - XXX</quote>, e.g. <quote>3D Control - 3533 Control - XXX</quote>, e.g. <quote>3D Control -
3534 Switch</quote>, <quote>3D Control - Center</quote>, <quote>3D 3534 Switch</quote>, <quote>3D Control - Center</quote>, <quote>3D
3535 Control - Space</quote>. 3535 Control - Space</quote>.
3536 </para> 3536 </para>
3537 </section> 3537 </section>
3538 3538
3539 <section id="control-interface-control-names-mic"> 3539 <section id="control-interface-control-names-mic">
3540 <title>Mic boost</title> 3540 <title>Mic boost</title>
3541 <para> 3541 <para>
3542 Mic-boost switch is set as <quote>Mic Boost</quote> or 3542 Mic-boost switch is set as <quote>Mic Boost</quote> or
3543 <quote>Mic Boost (6dB)</quote>. 3543 <quote>Mic Boost (6dB)</quote>.
3544 </para> 3544 </para>
3545 3545
3546 <para> 3546 <para>
3547 More precise information can be found in 3547 More precise information can be found in
3548 <filename>Documentation/sound/alsa/ControlNames.txt</filename>. 3548 <filename>Documentation/sound/alsa/ControlNames.txt</filename>.
3549 </para> 3549 </para>
3550 </section> 3550 </section>
3551 </section> 3551 </section>
3552 3552
3553 <section id="control-interface-access-flags"> 3553 <section id="control-interface-access-flags">
3554 <title>Access Flags</title> 3554 <title>Access Flags</title>
3555 3555
3556 <para> 3556 <para>
3557 The access flag is a bit-mask that specifies the access type 3557 The access flag is a bit-mask that specifies the access type
3558 of the given control. The default access type is 3558 of the given control. The default access type is
3559 <constant>SNDRV_CTL_ELEM_ACCESS_READWRITE</constant>, 3559 <constant>SNDRV_CTL_ELEM_ACCESS_READWRITE</constant>,
3560 which means both read and write are allowed on this control. 3560 which means both read and write are allowed on this control.
3561 When the access flag is omitted (i.e. = 0), it is 3561 When the access flag is omitted (i.e. = 0), it is
3562 regarded as <constant>READWRITE</constant> access by default. 3562 regarded as <constant>READWRITE</constant> access by default.
3563 </para> 3563 </para>
3564 3564
3565 <para> 3565 <para>
3566 When the control is read-only, pass 3566 When the control is read-only, pass
3567 <constant>SNDRV_CTL_ELEM_ACCESS_READ</constant> instead. 3567 <constant>SNDRV_CTL_ELEM_ACCESS_READ</constant> instead.
3568 In this case, you don't have to define 3568 In this case, you don't have to define
3569 <structfield>put</structfield> callback. 3569 <structfield>put</structfield> callback.
3570 Similarly, when the control is write-only (although it's a rare 3570 Similarly, when the control is write-only (although it's a rare
3571 case), you can use <constant>WRITE</constant> flag instead, and 3571 case), you can use <constant>WRITE</constant> flag instead, and
3572 you don't need <structfield>get</structfield> callback. 3572 you don't need <structfield>get</structfield> callback.
3573 </para> 3573 </para>
3574 3574
3575 <para> 3575 <para>
3576 If the control value changes frequently (e.g. the VU meter), 3576 If the control value changes frequently (e.g. the VU meter),
3577 <constant>VOLATILE</constant> flag should be given. This means 3577 <constant>VOLATILE</constant> flag should be given. This means
3578 that the control may be changed without 3578 that the control may be changed without
3579 <link linkend="control-interface-change-notification"><citetitle> 3579 <link linkend="control-interface-change-notification"><citetitle>
3580 notification</citetitle></link>. Applications should poll such 3580 notification</citetitle></link>. Applications should poll such
3581 a control constantly. 3581 a control constantly.
3582 </para> 3582 </para>
3583 3583
3584 <para> 3584 <para>
3585 When the control is inactive, set 3585 When the control is inactive, set
3586 <constant>INACTIVE</constant> flag, too. 3586 <constant>INACTIVE</constant> flag, too.
3587 There are <constant>LOCK</constant> and 3587 There are <constant>LOCK</constant> and
3588 <constant>OWNER</constant> flags for changing the write 3588 <constant>OWNER</constant> flags for changing the write
3589 permissions. 3589 permissions.
3590 </para> 3590 </para>
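<para>
The rules above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the <constant>MY_ACCESS_*</constant> values and the <function>my_*</function> helpers are hypothetical stand-ins invented for this sketch, not the ALSA definitions. It shows that an omitted access field is treated as read-write, and which callbacks a given access type needs.
</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values for this sketch only; a real driver uses the
 * SNDRV_CTL_ELEM_ACCESS_* constants from the ALSA headers. */
#define MY_ACCESS_READ      (1u << 0)
#define MY_ACCESS_WRITE     (1u << 1)
#define MY_ACCESS_READWRITE (MY_ACCESS_READ | MY_ACCESS_WRITE)
#define MY_ACCESS_VOLATILE  (1u << 2)

/* An omitted access field (0) is regarded as READWRITE. */
unsigned int my_effective_access(unsigned int access)
{
	return access ? access : MY_ACCESS_READWRITE;
}

/* A readable control needs a get callback. */
bool my_needs_get(unsigned int access)
{
	return (my_effective_access(access) & MY_ACCESS_READ) != 0;
}

/* A writable control needs a put callback. */
bool my_needs_put(unsigned int access)
{
	return (my_effective_access(access) & MY_ACCESS_WRITE) != 0;
}

/* Abort if a control definition lacks a callback that its access type needs. */
void my_check_callbacks(unsigned int access, bool has_get, bool has_put)
{
	assert(!my_needs_get(access) || has_get);
	assert(!my_needs_put(access) || has_put);
}
```

<para>
For instance, a VU meter in this sketch would be declared with <constant>MY_ACCESS_READ | MY_ACCESS_VOLATILE</constant> and would need only a get callback.
</para>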
3591 3591
3592 </section> 3592 </section>
3593 3593
3594 <section id="control-interface-callbacks"> 3594 <section id="control-interface-callbacks">
3595 <title>Callbacks</title> 3595 <title>Callbacks</title>
3596 3596
3597 <section id="control-interface-callbacks-info"> 3597 <section id="control-interface-callbacks-info">
3598 <title>info callback</title> 3598 <title>info callback</title>
3599 <para> 3599 <para>
3600 The <structfield>info</structfield> callback is used to get 3600 The <structfield>info</structfield> callback is used to get
3601 the detailed information of this control. This must store the 3601 the detailed information of this control. This must store the
3602 values of the given struct <structname>snd_ctl_elem_info</structname> 3602 values of the given struct <structname>snd_ctl_elem_info</structname>
3603 object. For example, the callback for a boolean control with a 3603 object. For example, the callback for a boolean control with a
3604 single element looks like this: 3604 single element looks like this:
3605 3605
3606 <example> 3606 <example>
3607 <title>Example of info callback</title> 3607 <title>Example of info callback</title>
3608 <programlisting> 3608 <programlisting>
3609 <![CDATA[ 3609 <![CDATA[
3610 static int snd_myctl_info(struct snd_kcontrol *kcontrol, 3610 static int snd_myctl_info(struct snd_kcontrol *kcontrol,
3611 struct snd_ctl_elem_info *uinfo) 3611 struct snd_ctl_elem_info *uinfo)
3612 { 3612 {
3613 uinfo->type = SNDRV_CTL_ELEM_TYPE_BOOLEAN; 3613 uinfo->type = SNDRV_CTL_ELEM_TYPE_BOOLEAN;
3614 uinfo->count = 1; 3614 uinfo->count = 1;
3615 uinfo->value.integer.min = 0; 3615 uinfo->value.integer.min = 0;
3616 uinfo->value.integer.max = 1; 3616 uinfo->value.integer.max = 1;
3617 return 0; 3617 return 0;
3618 } 3618 }
3619 ]]> 3619 ]]>
3620 </programlisting> 3620 </programlisting>
3621 </example> 3621 </example>
3622 </para> 3622 </para>
3623 3623
3624 <para> 3624 <para>
3625 The <structfield>type</structfield> field specifies the type 3625 The <structfield>type</structfield> field specifies the type
3626 of the control. There are <constant>BOOLEAN</constant>, 3626 of the control. There are <constant>BOOLEAN</constant>,
3627 <constant>INTEGER</constant>, <constant>ENUMERATED</constant>, 3627 <constant>INTEGER</constant>, <constant>ENUMERATED</constant>,
3628 <constant>BYTES</constant>, <constant>IEC958</constant> and 3628 <constant>BYTES</constant>, <constant>IEC958</constant> and
3629 <constant>INTEGER64</constant>. The 3629 <constant>INTEGER64</constant>. The
3630 <structfield>count</structfield> field specifies the 3630 <structfield>count</structfield> field specifies the
3631 number of elements in this control. For example, a stereo 3631 number of elements in this control. For example, a stereo
3632 volume would have count = 2. The 3632 volume would have count = 2. The
3633 <structfield>value</structfield> field is a union, and 3633 <structfield>value</structfield> field is a union, and
3634 the values stored depend on the type. The boolean and 3634 the values stored depend on the type. The boolean and
3635 integer types are identical. 3635 integer types are identical.
3636 </para> 3636 </para>
3637 3637
3638 <para> 3638 <para>
3639 The enumerated type is a bit different from others. You'll 3639 The enumerated type is a bit different from others. You'll
3640 need to set the string for the currently given item index. 3640 need to set the string for the currently given item index.
3641 3641
3642 <informalexample> 3642 <informalexample>
3643 <programlisting> 3643 <programlisting>
3644 <![CDATA[ 3644 <![CDATA[
3645 static int snd_myctl_info(struct snd_kcontrol *kcontrol, 3645 static int snd_myctl_info(struct snd_kcontrol *kcontrol,
3646 struct snd_ctl_elem_info *uinfo) 3646 struct snd_ctl_elem_info *uinfo)
3647 { 3647 {
3648 static char *texts[4] = { 3648 static char *texts[4] = {
3649 "First", "Second", "Third", "Fourth" 3649 "First", "Second", "Third", "Fourth"
3650 }; 3650 };
3651 uinfo->type = SNDRV_CTL_ELEM_TYPE_ENUMERATED; 3651 uinfo->type = SNDRV_CTL_ELEM_TYPE_ENUMERATED;
3652 uinfo->count = 1; 3652 uinfo->count = 1;
3653 uinfo->value.enumerated.items = 4; 3653 uinfo->value.enumerated.items = 4;
3654 if (uinfo->value.enumerated.item > 3) 3654 if (uinfo->value.enumerated.item > 3)
3655 uinfo->value.enumerated.item = 3; 3655 uinfo->value.enumerated.item = 3;
3656 strcpy(uinfo->value.enumerated.name, 3656 strcpy(uinfo->value.enumerated.name,
3657 texts[uinfo->value.enumerated.item]); 3657 texts[uinfo->value.enumerated.item]);
3658 return 0; 3658 return 0;
3659 } 3659 }
3660 ]]> 3660 ]]>
3661 </programlisting> 3661 </programlisting>
3662 </informalexample> 3662 </informalexample>
3663 </para> 3663 </para>
3664 </section> 3664 </section>
3665 3665
3666 <section id="control-interface-callbacks-get"> 3666 <section id="control-interface-callbacks-get">
3667 <title>get callback</title> 3667 <title>get callback</title>
3668 3668
3669 <para> 3669 <para>
3670 This callback is used to read the current value of the 3670 This callback is used to read the current value of the
3671 control and return it to user space. 3671 control and return it to user space.
3672 </para> 3672 </para>
3673 3673
3674 <para> 3674 <para>
3675 For example, 3675 For example,
3676 3676
3677 <example> 3677 <example>
3678 <title>Example of get callback</title> 3678 <title>Example of get callback</title>
3679 <programlisting> 3679 <programlisting>
3680 <![CDATA[ 3680 <![CDATA[
3681 static int snd_myctl_get(struct snd_kcontrol *kcontrol, 3681 static int snd_myctl_get(struct snd_kcontrol *kcontrol,
3682 struct snd_ctl_elem_value *ucontrol) 3682 struct snd_ctl_elem_value *ucontrol)
3683 { 3683 {
3684 struct mychip *chip = snd_kcontrol_chip(kcontrol); 3684 struct mychip *chip = snd_kcontrol_chip(kcontrol);
3685 ucontrol->value.integer.value[0] = get_some_value(chip); 3685 ucontrol->value.integer.value[0] = get_some_value(chip);
3686 return 0; 3686 return 0;
3687 } 3687 }
3688 ]]> 3688 ]]>
3689 </programlisting> 3689 </programlisting>
3690 </example> 3690 </example>
3691 </para> 3691 </para>
3692 3692
3693 <para> 3693 <para>
3694 Here, the chip instance is retrieved via the 3694 Here, the chip instance is retrieved via the
3695 <function>snd_kcontrol_chip()</function> macro. This macro 3695 <function>snd_kcontrol_chip()</function> macro. This macro
3696 just accesses kcontrol-&gt;private_data. The 3696 just accesses kcontrol-&gt;private_data. The
3697 kcontrol-&gt;private_data field is 3697 kcontrol-&gt;private_data field is
3698 given as the argument of <function>snd_ctl_new1()</function> 3698 given as the argument of <function>snd_ctl_new1()</function>
3699 (see the later subsection 3699 (see the later subsection
3700 <link linkend="control-interface-constructor"><citetitle>Constructor</citetitle></link>). 3700 <link linkend="control-interface-constructor"><citetitle>Constructor</citetitle></link>).
3701 </para> 3701 </para>
3702 3702
3703 <para> 3703 <para>
3704 The <structfield>value</structfield> field depends on 3704 The <structfield>value</structfield> field depends on
3705 the type of control as well as on the info callback. The sb driver, 3705 the type of control as well as on the info callback. The sb driver,
3706 for instance, uses the <structfield>private_value</structfield> field of the control to store the register offset, 3706 for instance, uses the <structfield>private_value</structfield> field of the control to store the register offset,
3707 the bit-shift and the bit-mask. The 3707 the bit-shift and the bit-mask. The
3708 <structfield>private_value</structfield> is set like 3708 <structfield>private_value</structfield> is set like
3709 <informalexample> 3709 <informalexample>
3710 <programlisting> 3710 <programlisting>
3711 <![CDATA[ 3711 <![CDATA[
3712 .private_value = reg | (shift << 16) | (mask << 24) 3712 .private_value = reg | (shift << 16) | (mask << 24)
3713 ]]> 3713 ]]>
3714 </programlisting> 3714 </programlisting>
3715 </informalexample> 3715 </informalexample>
3716 and is retrieved in callbacks like 3716 and is retrieved in callbacks like
3717 <informalexample> 3717 <informalexample>
3718 <programlisting> 3718 <programlisting>
3719 <![CDATA[ 3719 <![CDATA[
3720 static int snd_sbmixer_get_single(struct snd_kcontrol *kcontrol, 3720 static int snd_sbmixer_get_single(struct snd_kcontrol *kcontrol,
3721 struct snd_ctl_elem_value *ucontrol) 3721 struct snd_ctl_elem_value *ucontrol)
3722 { 3722 {
3723 int reg = kcontrol->private_value & 0xff; 3723 int reg = kcontrol->private_value & 0xff;
3724 int shift = (kcontrol->private_value >> 16) & 0xff; 3724 int shift = (kcontrol->private_value >> 16) & 0xff;
3725 int mask = (kcontrol->private_value >> 24) & 0xff; 3725 int mask = (kcontrol->private_value >> 24) & 0xff;
3726 .... 3726 ....
3727 } 3727 }
3728 ]]> 3728 ]]>
3729 </programlisting> 3729 </programlisting>
3730 </informalexample> 3730 </informalexample>
3731 </para> 3731 </para>
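<para>
The packing scheme can be tried out in user space. The following is a minimal sketch, assuming the same bit layout as the snippet above (register offset in bits 0-7, shift in bits 16-23, mask in bits 24-31). The <function>my_*</function> names and struct <structname>my_single</structname> are hypothetical and exist only for this illustration.
</para>

```c
#include <assert.h>

/* Decoded form of a packed private_value (hypothetical helper type). */
struct my_single {
	int reg;
	int shift;
	int mask;
};

/* Pack register offset, bit-shift and bit-mask into one value:
 * reg in bits 0-7, shift in bits 16-23, mask in bits 24-31. */
unsigned long my_pack_private_value(unsigned int reg, unsigned int shift,
				    unsigned int mask)
{
	assert(reg <= 0xff && shift <= 0xff && mask <= 0xff);
	return (unsigned long)reg |
	       ((unsigned long)shift << 16) |
	       ((unsigned long)mask << 24);
}

/* Reverse of my_pack_private_value(), as a callback would do it. */
struct my_single my_unpack_private_value(unsigned long pv)
{
	struct my_single s;

	s.reg = pv & 0xff;
	s.shift = (pv >> 16) & 0xff;
	s.mask = (pv >> 24) & 0xff;
	return s;
}

/* Extract the control's field from a raw register value. */
unsigned int my_read_field(unsigned long pv, unsigned int regval)
{
	struct my_single s = my_unpack_private_value(pv);

	return (regval >> s.shift) & (unsigned int)s.mask;
}
```

<para>
Keeping these parameters in one integer lets a single pair of callbacks serve many controls that differ only in register, shift and mask.
</para>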
3732 3732
3733 <para> 3733 <para>
3734 In the <structfield>get</structfield> callback, you have to fill all the elements if the 3734 In the <structfield>get</structfield> callback, you have to fill all the elements if the
3735 control has more than one element, 3735 control has more than one element,
3736 i.e. <structfield>count</structfield> &gt; 1. 3736 i.e. <structfield>count</structfield> &gt; 1.
3737 In the example above, we filled only one element 3737 In the example above, we filled only one element
3738 (<structfield>value.integer.value[0]</structfield>) since 3738 (<structfield>value.integer.value[0]</structfield>) since
3739 <structfield>count</structfield> = 1 is assumed. 3739 <structfield>count</structfield> = 1 is assumed.
3740 </para> 3740 </para>
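<para>
For a control with two elements, the fill loop might look like the following user-space sketch. The structs here are simplified, hypothetical stand-ins for struct <structname>snd_ctl_elem_value</structname> and the chip record. Only the pattern of filling every element is the point.
</para>

```c
#include <assert.h>
#include <stddef.h>

#define MY_CHANNELS 2	/* corresponds to count = 2 in the info callback */

/* Simplified stand-in for the value array handed to a get callback. */
struct my_elem_value {
	long value[MY_CHANNELS];
};

/* Hypothetical chip record caching the current stereo volume. */
struct my_chip {
	long volume[MY_CHANNELS];
};

/* Fill every element of the control, not just value[0]. */
int my_stereo_get(const struct my_chip *chip, struct my_elem_value *ucontrol)
{
	int i;

	assert(chip != NULL && ucontrol != NULL);
	for (i = 0; i < MY_CHANNELS; i++)
		ucontrol->value[i] = chip->volume[i];
	return 0;
}
```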
3741 </section> 3741 </section>
3742 3742
3743 <section id="control-interface-callbacks-put"> 3743 <section id="control-interface-callbacks-put">
3744 <title>put callback</title> 3744 <title>put callback</title>
3745 3745
3746 <para> 3746 <para>
3747 This callback is used to write a value from the user-space. 3747 This callback is used to write a value from the user-space.
3748 </para> 3748 </para>
3749 3749
3750 <para> 3750 <para>
3751 For example, 3751 For example,
3752 3752
3753 <example> 3753 <example>
3754 <title>Example of put callback</title> 3754 <title>Example of put callback</title>
3755 <programlisting> 3755 <programlisting>
3756 <![CDATA[ 3756 <![CDATA[
3757 static int snd_myctl_put(struct snd_kcontrol *kcontrol, 3757 static int snd_myctl_put(struct snd_kcontrol *kcontrol,
3758 struct snd_ctl_elem_value *ucontrol) 3758 struct snd_ctl_elem_value *ucontrol)
3759 { 3759 {
3760 struct mychip *chip = snd_kcontrol_chip(kcontrol); 3760 struct mychip *chip = snd_kcontrol_chip(kcontrol);
3761 int changed = 0; 3761 int changed = 0;
3762 if (chip->current_value != 3762 if (chip->current_value !=
3763 ucontrol->value.integer.value[0]) { 3763 ucontrol->value.integer.value[0]) {
3764 change_current_value(chip, 3764 change_current_value(chip,
3765 ucontrol->value.integer.value[0]); 3765 ucontrol->value.integer.value[0]);
3766 changed = 1; 3766 changed = 1;
3767 } 3767 }
3768 return changed; 3768 return changed;
3769 } 3769 }
3770 ]]> 3770 ]]>
3771 </programlisting> 3771 </programlisting>
3772 </example> 3772 </example>
3773 3773
3774 As seen above, you have to return 1 if the value is 3774 As seen above, you have to return 1 if the value is
3775 changed. If the value is not changed, return 0 instead. 3775 changed. If the value is not changed, return 0 instead.
3776 If any fatal error happens, return a negative error code as 3776 If any fatal error happens, return a negative error code as
3777 usual. 3777 usual.
3778 </para> 3778 </para>
3779 3779
3780 <para> 3780 <para>
3781 Like the <structfield>get</structfield> callback, 3781 Like the <structfield>get</structfield> callback,
3782 when the control has more than one element, 3782 when the control has more than one element,
3783 all elements must be evaluated in this callback, too. 3783 all elements must be evaluated in this callback, too.
3784 </para> 3784 </para>
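<para>
The return-value convention for a multi-element control can be sketched in user space as follows. The structs, the <constant>MY_VOL_MAX</constant> limit and the range check are assumptions made for this illustration, not part of the driver API. The function returns 1 if any element changed, 0 if nothing changed, and a negative error code for an invalid value.
</para>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MY_CHANNELS 2	/* corresponds to count = 2 in the info callback */
#define MY_VOL_MAX  31	/* hypothetical upper limit of the volume */

/* Simplified stand-in for the value array handed to a put callback. */
struct my_elem_value {
	long value[MY_CHANNELS];
};

/* Hypothetical chip record caching the current stereo volume. */
struct my_chip {
	long volume[MY_CHANNELS];
};

int my_stereo_put(struct my_chip *chip, const struct my_elem_value *ucontrol)
{
	int i, changed = 0;

	assert(chip != NULL && ucontrol != NULL);

	/* Validate every element first so a bad value changes nothing. */
	for (i = 0; i < MY_CHANNELS; i++)
		if (ucontrol->value[i] < 0 || ucontrol->value[i] > MY_VOL_MAX)
			return -EINVAL;

	/* Evaluate all elements; report a change if any of them differs. */
	for (i = 0; i < MY_CHANNELS; i++) {
		if (chip->volume[i] != ucontrol->value[i]) {
			chip->volume[i] = ucontrol->value[i];
			changed = 1;
		}
	}
	return changed;
}
```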
3785 </section> 3785 </section>
3786 3786
3787 <section id="control-interface-callbacks-all"> 3787 <section id="control-interface-callbacks-all">
3788 <title>Callbacks are not atomic</title> 3788 <title>Callbacks are not atomic</title>
3789 <para> 3789 <para>
3790 All these three callbacks are basically not atomic. 3790 All these three callbacks are basically not atomic.
3791 </para> 3791 </para>
3792 </section> 3792 </section>
3793 </section> 3793 </section>
3794 3794
3795 <section id="control-interface-constructor"> 3795 <section id="control-interface-constructor">
3796 <title>Constructor</title> 3796 <title>Constructor</title>
3797 <para> 3797 <para>
3798 When everything is ready, finally we can create a new 3798 When everything is ready, finally we can create a new
3799 control. For creating a control, there are two functions to be 3799 control. For creating a control, there are two functions to be
3800 called, <function>snd_ctl_new1()</function> and 3800 called, <function>snd_ctl_new1()</function> and
3801 <function>snd_ctl_add()</function>. 3801 <function>snd_ctl_add()</function>.
3802 </para> 3802 </para>
3803 3803
3804 <para> 3804 <para>
3805 In the simplest case, you can do it like this: 3805 In the simplest case, you can do it like this:
3806 3806
3807 <informalexample> 3807 <informalexample>
3808 <programlisting> 3808 <programlisting>
3809 <![CDATA[ 3809 <![CDATA[
3810 if ((err = snd_ctl_add(card, snd_ctl_new1(&my_control, chip))) < 0) 3810 if ((err = snd_ctl_add(card, snd_ctl_new1(&my_control, chip))) < 0)
3811 return err; 3811 return err;
3812 ]]> 3812 ]]>
3813 </programlisting> 3813 </programlisting>
3814 </informalexample> 3814 </informalexample>
3815 3815
3816 where <parameter>my_control</parameter> is the 3816 where <parameter>my_control</parameter> is the
3817 struct <structname>snd_kcontrol_new</structname> object defined above, and chip 3817 struct <structname>snd_kcontrol_new</structname> object defined above, and chip
3818 is the object pointer to be passed to 3818 is the object pointer to be passed to
3819 kcontrol-&gt;private_data 3819 kcontrol-&gt;private_data
3820 which can be referred to in callbacks. 3820 which can be referred to in callbacks.
3821 </para> 3821 </para>
3822 3822
3823 <para> 3823 <para>
3824 <function>snd_ctl_new1()</function> allocates a new 3824 <function>snd_ctl_new1()</function> allocates a new
3825 <structname>snd_kcontrol</structname> instance (that's why the definition 3825 <structname>snd_kcontrol</structname> instance (that's why the definition
3826 of <parameter>my_control</parameter> can be with 3826 of <parameter>my_control</parameter> can be with
3827 <parameter>__devinitdata</parameter> 3827 <parameter>__devinitdata</parameter>
3828 prefix), and <function>snd_ctl_add</function> assigns the given 3828 prefix), and <function>snd_ctl_add</function> assigns the given
3829 control component to the card. 3829 control component to the card.
3830 </para> 3830 </para>
3831 </section> 3831 </section>
3832 3832
3833 <section id="control-interface-change-notification"> 3833 <section id="control-interface-change-notification">
3834 <title>Change Notification</title> 3834 <title>Change Notification</title>
3835 <para> 3835 <para>
3836 If you need to change and update a control in the interrupt 3836 If you need to change and update a control in the interrupt
3837 routine, you can call <function>snd_ctl_notify()</function>. For 3837 routine, you can call <function>snd_ctl_notify()</function>. For
3838 example, 3838 example,
3839 3839
3840 <informalexample> 3840 <informalexample>
3841 <programlisting> 3841 <programlisting>
3842 <![CDATA[ 3842 <![CDATA[
3843 snd_ctl_notify(card, SNDRV_CTL_EVENT_MASK_VALUE, id_pointer); 3843 snd_ctl_notify(card, SNDRV_CTL_EVENT_MASK_VALUE, id_pointer);
3844 ]]> 3844 ]]>
3845 </programlisting> 3845 </programlisting>
3846 </informalexample> 3846 </informalexample>
3847 3847
3848 This function takes the card pointer, the event-mask, and the 3848 This function takes the card pointer, the event-mask, and the
3849 control id pointer for the notification. The event-mask 3849 control id pointer for the notification. The event-mask
3850 specifies the types of notification, for example, in the above 3850 specifies the types of notification, for example, in the above
3851 example, the change of control values is notified. 3851 example, the change of control values is notified.
3852 The id pointer is a pointer to the struct <structname>snd_ctl_elem_id</structname> 3852 The id pointer is a pointer to the struct <structname>snd_ctl_elem_id</structname>
3853 to be notified. 3853 to be notified.
3854 You can find some examples in <filename>es1938.c</filename> or 3854 You can find some examples in <filename>es1938.c</filename> or
3855 <filename>es1968.c</filename> for hardware volume interrupts. 3855 <filename>es1968.c</filename> for hardware volume interrupts.
3856 </para> 3856 </para>
3857 </section> 3857 </section>
3858 3858
3859 </chapter> 3859 </chapter>
3860 3860
3861 3861
3862 <!-- ****************************************************** --> 3862 <!-- ****************************************************** -->
3863 <!-- API for AC97 Codec --> 3863 <!-- API for AC97 Codec -->
3864 <!-- ****************************************************** --> 3864 <!-- ****************************************************** -->
3865 <chapter id="api-ac97"> 3865 <chapter id="api-ac97">
3866 <title>API for AC97 Codec</title> 3866 <title>API for AC97 Codec</title>
3867 3867
3868 <section> 3868 <section>
3869 <title>General</title> 3869 <title>General</title>
3870 <para> 3870 <para>
3871 The ALSA AC97 codec layer is a well-defined one, and you don't 3871 The ALSA AC97 codec layer is a well-defined one, and you don't
3872 have to write much code to control it. Only low-level control 3872 have to write much code to control it. Only low-level control
3873 routines are necessary. The AC97 codec API is defined in 3873 routines are necessary. The AC97 codec API is defined in
3874 <filename>&lt;sound/ac97_codec.h&gt;</filename>. 3874 <filename>&lt;sound/ac97_codec.h&gt;</filename>.
3875 </para> 3875 </para>
3876 </section> 3876 </section>
3877 3877
3878 <section id="api-ac97-example"> 3878 <section id="api-ac97-example">
3879 <title>Full Code Example</title> 3879 <title>Full Code Example</title>
3880 <para> 3880 <para>
3881 <example> 3881 <example>
3882 <title>Example of AC97 Interface</title> 3882 <title>Example of AC97 Interface</title>
3883 <programlisting> 3883 <programlisting>
3884 <![CDATA[ 3884 <![CDATA[
3885 struct mychip { 3885 struct mychip {
3886 .... 3886 ....
3887 struct snd_ac97 *ac97; 3887 struct snd_ac97 *ac97;
3888 .... 3888 ....
3889 }; 3889 };
3890 3890
3891 static unsigned short snd_mychip_ac97_read(struct snd_ac97 *ac97, 3891 static unsigned short snd_mychip_ac97_read(struct snd_ac97 *ac97,
3892 unsigned short reg) 3892 unsigned short reg)
3893 { 3893 {
3894 struct mychip *chip = ac97->private_data; 3894 struct mychip *chip = ac97->private_data;
3895 .... 3895 ....
3896 // read a register value here from the codec 3896 // read a register value here from the codec
3897 return the_register_value; 3897 return the_register_value;
3898 } 3898 }
3899 3899
3900 static void snd_mychip_ac97_write(struct snd_ac97 *ac97, 3900 static void snd_mychip_ac97_write(struct snd_ac97 *ac97,
3901 unsigned short reg, unsigned short val) 3901 unsigned short reg, unsigned short val)
3902 { 3902 {
3903 struct mychip *chip = ac97->private_data; 3903 struct mychip *chip = ac97->private_data;
3904 .... 3904 ....
3905 // write the given register value to the codec 3905 // write the given register value to the codec
3906 } 3906 }
3907 3907
3908 static int snd_mychip_ac97(struct mychip *chip) 3908 static int snd_mychip_ac97(struct mychip *chip)
3909 { 3909 {
3910 struct snd_ac97_bus *bus; 3910 struct snd_ac97_bus *bus;
3911 struct snd_ac97_template ac97; 3911 struct snd_ac97_template ac97;
3912 int err; 3912 int err;
3913 static struct snd_ac97_bus_ops ops = { 3913 static struct snd_ac97_bus_ops ops = {
3914 .write = snd_mychip_ac97_write, 3914 .write = snd_mychip_ac97_write,
3915 .read = snd_mychip_ac97_read, 3915 .read = snd_mychip_ac97_read,
3916 }; 3916 };
3917 3917
3918 if ((err = snd_ac97_bus(chip->card, 0, &ops, NULL, &bus)) < 0) 3918 if ((err = snd_ac97_bus(chip->card, 0, &ops, NULL, &bus)) < 0)
3919 return err; 3919 return err;
3920 memset(&ac97, 0, sizeof(ac97)); 3920 memset(&ac97, 0, sizeof(ac97));
3921 ac97.private_data = chip; 3921 ac97.private_data = chip;
3922 return snd_ac97_mixer(bus, &ac97, &chip->ac97); 3922 return snd_ac97_mixer(bus, &ac97, &chip->ac97);
3923 } 3923 }
3924 3924
3925 ]]> 3925 ]]>
3926 </programlisting> 3926 </programlisting>
3927 </example> 3927 </example>
3928 </para> 3928 </para>
3929 </section> 3929 </section>
3930 3930
3931 <section id="api-ac97-constructor"> 3931 <section id="api-ac97-constructor">
3932 <title>Constructor</title> 3932 <title>Constructor</title>
3933 <para> 3933 <para>
3934 To create an ac97 instance, first call <function>snd_ac97_bus</function> 3934 To create an ac97 instance, first call <function>snd_ac97_bus</function>
3935 with a struct <structname>snd_ac97_bus_ops</structname> record containing the callback functions. 3935 with a struct <structname>snd_ac97_bus_ops</structname> record containing the callback functions.
3936 3936
3937 <informalexample> 3937 <informalexample>
3938 <programlisting> 3938 <programlisting>
3939 <![CDATA[ 3939 <![CDATA[
3940 struct snd_ac97_bus *bus; 3940 struct snd_ac97_bus *bus;
3941 static struct snd_ac97_bus_ops ops = { 3941 static struct snd_ac97_bus_ops ops = {
3942 .write = snd_mychip_ac97_write, 3942 .write = snd_mychip_ac97_write,
3943 .read = snd_mychip_ac97_read, 3943 .read = snd_mychip_ac97_read,
3944 }; 3944 };
3945 3945
3946 snd_ac97_bus(card, 0, &ops, NULL, &bus); 3946 snd_ac97_bus(card, 0, &ops, NULL, &bus);
3947 ]]> 3947 ]]>
3948 </programlisting> 3948 </programlisting>
3949 </informalexample> 3949 </informalexample>
3950 3950
3951 The bus record is shared among all ac97 instances belonging to it. 3951 The bus record is shared among all ac97 instances belonging to it.
3952 </para> 3952 </para>
3953 3953
3954 <para> 3954 <para>
3955 Then call <function>snd_ac97_mixer()</function> with a 3955 Then call <function>snd_ac97_mixer()</function> with a
3956 struct <structname>snd_ac97_template</structname> 3956 struct <structname>snd_ac97_template</structname>
3957 record together with the bus pointer created above. 3957 record together with the bus pointer created above.
3958 3958
3959 <informalexample> 3959 <informalexample>
3960 <programlisting> 3960 <programlisting>
3961 <![CDATA[ 3961 <![CDATA[
3962 struct snd_ac97_template ac97; 3962 struct snd_ac97_template ac97;
3963 int err; 3963 int err;
3964 3964
3965 memset(&ac97, 0, sizeof(ac97)); 3965 memset(&ac97, 0, sizeof(ac97));
3966 ac97.private_data = chip; 3966 ac97.private_data = chip;
3967 snd_ac97_mixer(bus, &ac97, &chip->ac97); 3967 snd_ac97_mixer(bus, &ac97, &chip->ac97);
3968 ]]> 3968 ]]>
3969 </programlisting> 3969 </programlisting>
3970 </informalexample> 3970 </informalexample>
3971 3971
3972 where chip-&gt;ac97 is a pointer to a newly created 3972 where chip-&gt;ac97 is a pointer to a newly created
3973 struct <structname>snd_ac97</structname> instance. 3973 struct <structname>snd_ac97</structname> instance.
3974 In this case, the chip pointer is set as the private data, so that 3974 In this case, the chip pointer is set as the private data, so that
3975 the read/write callback functions can refer to this chip instance. 3975 the read/write callback functions can refer to this chip instance.
3976 This instance is not necessarily stored in the chip 3976 This instance is not necessarily stored in the chip
3977 record. When you need to change the register values from the 3977 record. When you need to change the register values from the
3978 driver, or need the suspend/resume of ac97 codecs, keep this 3978 driver, or need the suspend/resume of ac97 codecs, keep this
3979 pointer to pass to the corresponding functions. 3979 pointer to pass to the corresponding functions.
3980 </para> 3980 </para>
3981 </section> 3981 </section>
3982 3982
3983 <section id="api-ac97-callbacks"> 3983 <section id="api-ac97-callbacks">
3984 <title>Callbacks</title> 3984 <title>Callbacks</title>
3985 <para> 3985 <para>
3986 The standard callbacks are <structfield>read</structfield> and 3986 The standard callbacks are <structfield>read</structfield> and
3987 <structfield>write</structfield>. They 3987 <structfield>write</structfield>. They
3988 correspond to the functions for read and write accesses in the 3988 correspond to the functions for read and write accesses in the
3989 low-level hardware code. 3989 low-level hardware code.
3990 </para> 3990 </para>
3991 3991
3992 <para> 3992 <para>
3993 The <structfield>read</structfield> callback returns the 3993 The <structfield>read</structfield> callback returns the
3994 register value specified in the argument. 3994 register value specified in the argument.
3995 3995
3996 <informalexample> 3996 <informalexample>
3997 <programlisting> 3997 <programlisting>
3998 <![CDATA[ 3998 <![CDATA[
3999 static unsigned short snd_mychip_ac97_read(struct snd_ac97 *ac97, 3999 static unsigned short snd_mychip_ac97_read(struct snd_ac97 *ac97,
4000 unsigned short reg) 4000 unsigned short reg)
4001 { 4001 {
4002 struct mychip *chip = ac97->private_data; 4002 struct mychip *chip = ac97->private_data;
4003 .... 4003 ....
4004 return the_register_value; 4004 return the_register_value;
4005 } 4005 }
4006 ]]> 4006 ]]>
4007 </programlisting> 4007 </programlisting>
4008 </informalexample> 4008 </informalexample>
4009 4009
4010 Here, the chip can be cast from ac97-&gt;private_data. 4010 Here, the chip can be cast from ac97-&gt;private_data.
4011 </para> 4011 </para>
4012 4012
4013 <para> 4013 <para>
4014 Meanwhile, the <structfield>write</structfield> callback is 4014 Meanwhile, the <structfield>write</structfield> callback is
4015 used to set the register value. 4015 used to set the register value.
4016 4016
4017 <informalexample> 4017 <informalexample>
4018 <programlisting> 4018 <programlisting>
4019 <![CDATA[ 4019 <![CDATA[
4020 static void snd_mychip_ac97_write(struct snd_ac97 *ac97, 4020 static void snd_mychip_ac97_write(struct snd_ac97 *ac97,
4021 unsigned short reg, unsigned short val) 4021 unsigned short reg, unsigned short val)
4022 ]]> 4022 ]]>
4023 </programlisting> 4023 </programlisting>
4024 </informalexample> 4024 </informalexample>
4025 </para> 4025 </para>
4026 4026
4027 <para> 4027 <para>
4028 These callbacks are non-atomic like the callbacks of control API. 4028 These callbacks are non-atomic like the callbacks of control API.
4029 </para> 4029 </para>
4030 4030
4031 <para> 4031 <para>
4032 There are also other callbacks: 4032 There are also other callbacks:
4033 <structfield>reset</structfield>, 4033 <structfield>reset</structfield>,
4034 <structfield>wait</structfield> and 4034 <structfield>wait</structfield> and
4035 <structfield>init</structfield>. 4035 <structfield>init</structfield>.
4036 </para> 4036 </para>
4037 4037
4038 <para> 4038 <para>
4039 The <structfield>reset</structfield> callback is used to reset 4039 The <structfield>reset</structfield> callback is used to reset
4040 the codec. If the chip requires a special way of reset, you can 4040 the codec. If the chip requires a special way of reset, you can
4041 define this callback. 4041 define this callback.
4042 </para> 4042 </para>
4043 4043
4044 <para> 4044 <para>
4045 The <structfield>wait</structfield> callback is used to 4045 The <structfield>wait</structfield> callback is used to
4046 wait during the standard initialization of the codec. If the 4046 wait during the standard initialization of the codec. If the
4047 chip requires extra wait time, define this callback. 4047 chip requires extra wait time, define this callback.
4048 </para> 4048 </para>
4049 4049
4050 <para> 4050 <para>
4051 The <structfield>init</structfield> callback is used for 4051 The <structfield>init</structfield> callback is used for
4052 additional initialization of the codec. 4052 additional initialization of the codec.
4053 </para> 4053 </para>
4054 </section> 4054 </section>
4055 4055
4056 <section id="api-ac97-updating-registers"> 4056 <section id="api-ac97-updating-registers">
4057 <title>Updating Registers in The Driver</title> 4057 <title>Updating Registers in The Driver</title>
4058 <para> 4058 <para>
4059 If you need to access the codec from the driver, you can 4059 If you need to access the codec from the driver, you can
4060 call the following functions: 4060 call the following functions:
4061 <function>snd_ac97_write()</function>, 4061 <function>snd_ac97_write()</function>,
4062 <function>snd_ac97_read()</function>, 4062 <function>snd_ac97_read()</function>,
4063 <function>snd_ac97_update()</function> and 4063 <function>snd_ac97_update()</function> and
4064 <function>snd_ac97_update_bits()</function>. 4064 <function>snd_ac97_update_bits()</function>.
4065 </para> 4065 </para>
4066 4066
4067 <para> 4067 <para>
4068 Both <function>snd_ac97_write()</function> and 4068 Both <function>snd_ac97_write()</function> and
4069 <function>snd_ac97_update()</function> functions are used to 4069 <function>snd_ac97_update()</function> functions are used to
4070 set a value to the given register 4070 set a value to the given register
4071 (<constant>AC97_XXX</constant>). The difference between them is 4071 (<constant>AC97_XXX</constant>). The difference between them is
4072 that <function>snd_ac97_update()</function> doesn't write a 4072 that <function>snd_ac97_update()</function> doesn't write a
4073 value if the given value has already been set, while 4073 value if the given value has already been set, while
4074 <function>snd_ac97_write()</function> always rewrites the 4074 <function>snd_ac97_write()</function> always rewrites the
4075 value. 4075 value.
4076 4076
4077 <informalexample> 4077 <informalexample>
4078 <programlisting> 4078 <programlisting>
4079 <![CDATA[ 4079 <![CDATA[
4080 snd_ac97_write(ac97, AC97_MASTER, 0x8080); 4080 snd_ac97_write(ac97, AC97_MASTER, 0x8080);
4081 snd_ac97_update(ac97, AC97_MASTER, 0x8080); 4081 snd_ac97_update(ac97, AC97_MASTER, 0x8080);
4082 ]]> 4082 ]]>
4083 </programlisting> 4083 </programlisting>
4084 </informalexample> 4084 </informalexample>
4085 </para> 4085 </para>
4086 4086
4087 <para> 4087 <para>
4088 <function>snd_ac97_read()</function> is used to read the value 4088 <function>snd_ac97_read()</function> is used to read the value
4089 of the given register. For example, 4089 of the given register. For example,
4090 4090
4091 <informalexample> 4091 <informalexample>
4092 <programlisting> 4092 <programlisting>
4093 <![CDATA[ 4093 <![CDATA[
4094 value = snd_ac97_read(ac97, AC97_MASTER); 4094 value = snd_ac97_read(ac97, AC97_MASTER);
4095 ]]> 4095 ]]>
4096 </programlisting> 4096 </programlisting>
4097 </informalexample> 4097 </informalexample>
4098 </para> 4098 </para>
4099 4099
4100 <para> 4100 <para>
4101 <function>snd_ac97_update_bits()</function> is used to update 4101 <function>snd_ac97_update_bits()</function> is used to update
4102 some bits of the given register. 4102 some bits of the given register.
4103 4103
4104 <informalexample> 4104 <informalexample>
4105 <programlisting> 4105 <programlisting>
4106 <![CDATA[ 4106 <![CDATA[
4107 snd_ac97_update_bits(ac97, reg, mask, value); 4107 snd_ac97_update_bits(ac97, reg, mask, value);
4108 ]]> 4108 ]]>
4109 </programlisting> 4109 </programlisting>
4110 </informalexample> 4110 </informalexample>
4111 </para> 4111 </para>
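<para>
The difference between the write, update and update_bits calls described above can be sketched in plain user-space C. This is only an illustration under stated assumptions: the <structname>mock_ac97</structname> structure, the <function>mock_ac97_*()</function> functions and the <constant>MOCK_*</constant> constants are hypothetical stand-ins, not the ALSA implementation. The sketch assumes that the update variants compare against a cached copy of the register and that update_bits only changes the bits selected by the mask.
</para>

```c
#include <stdint.h>

#define MOCK_AC97_NUM_REGS 0x80  /* hypothetical size of the mock register file */
#define MOCK_AC97_MASTER   0x02  /* hypothetical register index used for the demo */

/* Hypothetical stand-in for a codec: a register cache plus a write counter. */
struct mock_ac97 {
	uint16_t regs[MOCK_AC97_NUM_REGS];
	unsigned int hw_writes;  /* how often the "hardware" was actually written */
};

/* Always writes, even if the register already holds the value. */
void mock_ac97_write(struct mock_ac97 *ac97, unsigned short reg, uint16_t value)
{
	ac97->regs[reg] = value;
	ac97->hw_writes++;
}

/* Returns the cached register value. */
uint16_t mock_ac97_read(const struct mock_ac97 *ac97, unsigned short reg)
{
	return ac97->regs[reg];
}

/* Writes only when the value differs; returns 1 if changed, 0 otherwise. */
int mock_ac97_update(struct mock_ac97 *ac97, unsigned short reg, uint16_t value)
{
	if (ac97->regs[reg] == value)
		return 0;
	mock_ac97_write(ac97, reg, value);
	return 1;
}

/* Replaces only the bits selected by mask, leaving the others untouched. */
int mock_ac97_update_bits(struct mock_ac97 *ac97, unsigned short reg,
			  uint16_t mask, uint16_t value)
{
	uint16_t old = ac97->regs[reg];
	uint16_t merged = (uint16_t)((old & ~mask) | (value & mask));

	return mock_ac97_update(ac97, reg, merged);
}
```

<para>
In this sketch, writing 0x8080 and then updating with the same value leaves the write counter at 1, while clearing the top bit through the update_bits variant with mask 0x8000 changes the register to 0x0080 and counts as a second write.
</para>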
4112 4112
4113 <para> 4113 <para>
4114 Also, there is a function to change the sample rate (of a 4114 Also, there is a function to change the sample rate (of a
4115 certain register such as 4115 certain register such as
4116 <constant>AC97_PCM_FRONT_DAC_RATE</constant>) when VRA or 4116 <constant>AC97_PCM_FRONT_DAC_RATE</constant>) when VRA or
4117 DRA is supported by the codec: 4117 DRA is supported by the codec:
4118 <function>snd_ac97_set_rate()</function>. 4118 <function>snd_ac97_set_rate()</function>.
4119 4119
4120 <informalexample> 4120 <informalexample>
4121 <programlisting> 4121 <programlisting>
4122 <![CDATA[ 4122 <![CDATA[
4123 snd_ac97_set_rate(ac97, AC97_PCM_FRONT_DAC_RATE, 44100); 4123 snd_ac97_set_rate(ac97, AC97_PCM_FRONT_DAC_RATE, 44100);
4124 ]]> 4124 ]]>
4125 </programlisting> 4125 </programlisting>
4126 </informalexample> 4126 </informalexample>
4127 </para> 4127 </para>
4128 4128
4129 <para> 4129 <para>
4130 The following registers are available for setting the rate: 4130 The following registers are available for setting the rate:
4131 <constant>AC97_PCM_MIC_ADC_RATE</constant>, 4131 <constant>AC97_PCM_MIC_ADC_RATE</constant>,
4132 <constant>AC97_PCM_FRONT_DAC_RATE</constant>, 4132 <constant>AC97_PCM_FRONT_DAC_RATE</constant>,
4133 <constant>AC97_PCM_LR_ADC_RATE</constant>, 4133 <constant>AC97_PCM_LR_ADC_RATE</constant>,
4134 <constant>AC97_SPDIF</constant>. When the 4134 <constant>AC97_SPDIF</constant>. When the
4135 <constant>AC97_SPDIF</constant> is specified, the register is 4135 <constant>AC97_SPDIF</constant> is specified, the register is
4136 not really changed but the corresponding IEC958 status bits will 4136 not really changed but the corresponding IEC958 status bits will
4137 be updated. 4137 be updated.
4138 </para> 4138 </para>
4139 </section> 4139 </section>
4140 4140
4141 <section id="api-ac97-clock-adjustment"> 4141 <section id="api-ac97-clock-adjustment">
4142 <title>Clock Adjustment</title> 4142 <title>Clock Adjustment</title>
4143 <para> 4143 <para>
4144 On some chips, the codec clock isn't 48000 but is derived from a 4144 On some chips, the codec clock isn't 48000 but is derived from a
4145 PCI clock (to save a quartz crystal!). In this case, change the field 4145 PCI clock (to save a quartz crystal!). In this case, change the field
4146 bus-&gt;clock to the corresponding 4146 bus-&gt;clock to the corresponding
4147 value. For example, intel8x0 4147 value. For example, intel8x0
4148 and es1968 drivers have the auto-measurement function of the 4148 and es1968 drivers have the auto-measurement function of the
4149 clock. 4149 clock.
4150 </para> 4150 </para>
4151 </section> 4151 </section>
4152 4152
4153 <section id="api-ac97-proc-files"> 4153 <section id="api-ac97-proc-files">
4154 <title>Proc Files</title> 4154 <title>Proc Files</title>
4155 <para> 4155 <para>
4156 The ALSA AC97 interface will create a proc file such as 4156 The ALSA AC97 interface will create a proc file such as
4157 <filename>/proc/asound/card0/codec97#0/ac97#0-0</filename> and 4157 <filename>/proc/asound/card0/codec97#0/ac97#0-0</filename> and
4158 <filename>ac97#0-0+regs</filename>. You can refer to these files to 4158 <filename>ac97#0-0+regs</filename>. You can refer to these files to
4159 see the current status and registers of the codec. 4159 see the current status and registers of the codec.
4160 </para> 4160 </para>
4161 </section> 4161 </section>
4162 4162
4163 <section id="api-ac97-multiple-codecs"> 4163 <section id="api-ac97-multiple-codecs">
4164 <title>Multiple Codecs</title> 4164 <title>Multiple Codecs</title>
4165 <para> 4165 <para>
4166 When there are several codecs on the same card, you need to 4166 When there are several codecs on the same card, you need to
4167 call <function>snd_ac97_mixer()</function> multiple times with 4167 call <function>snd_ac97_mixer()</function> multiple times with
4168 ac97.num=1 or greater. The <structfield>num</structfield> field 4168 ac97.num=1 or greater. The <structfield>num</structfield> field
4169 specifies the codec 4169 specifies the codec
4170 number. 4170 number.
4171 </para> 4171 </para>
4172 4172
4173 <para> 4173 <para>
4174 If you have set up multiple codecs, you need to either write 4174 If you have set up multiple codecs, you need to either write
4175 different callbacks for each codec or check 4175 different callbacks for each codec or check
4176 ac97-&gt;num in the 4176 ac97-&gt;num in the
4177 callback routines. 4177 callback routines.
4178 </para> 4178 </para>
4179 </section> 4179 </section>
4180 4180
4181 </chapter> 4181 </chapter>
4182 4182
4183 4183
4184 <!-- ****************************************************** --> 4184 <!-- ****************************************************** -->
4185 <!-- MIDI (MPU401-UART) Interface --> 4185 <!-- MIDI (MPU401-UART) Interface -->
4186 <!-- ****************************************************** --> 4186 <!-- ****************************************************** -->
4187 <chapter id="midi-interface"> 4187 <chapter id="midi-interface">
4188 <title>MIDI (MPU401-UART) Interface</title> 4188 <title>MIDI (MPU401-UART) Interface</title>
4189 4189
4190 <section id="midi-interface-general"> 4190 <section id="midi-interface-general">
4191 <title>General</title> 4191 <title>General</title>
4192 <para> 4192 <para>
4193 Many soundcards have built-in MIDI (MPU401-UART) 4193 Many soundcards have built-in MIDI (MPU401-UART)
4194 interfaces. When the soundcard supports the standard MPU401-UART 4194 interfaces. When the soundcard supports the standard MPU401-UART
4195 interface, most likely you can use the ALSA MPU401-UART API. The 4195 interface, most likely you can use the ALSA MPU401-UART API. The
4196 MPU401-UART API is defined in 4196 MPU401-UART API is defined in
4197 <filename>&lt;sound/mpu401.h&gt;</filename>. 4197 <filename>&lt;sound/mpu401.h&gt;</filename>.
4198 </para> 4198 </para>
4199 4199
4200 <para> 4200 <para>
4201 Some soundchips have a similar but slightly different 4201 Some soundchips have a similar but slightly different
4202 implementation of the mpu401 interface. For example, emu10k1 has its own 4202 implementation of the mpu401 interface. For example, emu10k1 has its own
4203 mpu401 routines. 4203 mpu401 routines.
4204 </para> 4204 </para>
4205 </section> 4205 </section>
4206 4206
4207 <section id="midi-interface-constructor"> 4207 <section id="midi-interface-constructor">
4208 <title>Constructor</title> 4208 <title>Constructor</title>
4209 <para> 4209 <para>
4210 For creating a rawmidi object, call 4210 For creating a rawmidi object, call
4211 <function>snd_mpu401_uart_new()</function>. 4211 <function>snd_mpu401_uart_new()</function>.
4212 4212
4213 <informalexample> 4213 <informalexample>
4214 <programlisting> 4214 <programlisting>
4215 <![CDATA[ 4215 <![CDATA[
4216 struct snd_rawmidi *rmidi; 4216 struct snd_rawmidi *rmidi;
4217 snd_mpu401_uart_new(card, 0, MPU401_HW_MPU401, port, info_flags, 4217 snd_mpu401_uart_new(card, 0, MPU401_HW_MPU401, port, info_flags,
4218 irq, irq_flags, &rmidi); 4218 irq, irq_flags, &rmidi);
4219 ]]> 4219 ]]>
4220 </programlisting> 4220 </programlisting>
4221 </informalexample> 4221 </informalexample>
4222 </para> 4222 </para>
4223 4223
4224 <para> 4224 <para>
4225 The first argument is the card pointer, and the second is the 4225 The first argument is the card pointer, and the second is the
4226 index of this component. You can create up to 8 rawmidi 4226 index of this component. You can create up to 8 rawmidi
4227 devices. 4227 devices.
4228 </para> 4228 </para>
4229 4229
4230 <para> 4230 <para>
4231 The third argument is the type of the hardware, 4231 The third argument is the type of the hardware,
4232 <constant>MPU401_HW_XXX</constant>. If it's not a special one, 4232 <constant>MPU401_HW_XXX</constant>. If it's not a special one,
4233 you can use <constant>MPU401_HW_MPU401</constant>. 4233 you can use <constant>MPU401_HW_MPU401</constant>.
4234 </para> 4234 </para>
4235 4235
4236 <para> 4236 <para>
4237 The 4th argument is the i/o port address. Many 4237 The 4th argument is the i/o port address. Many
4238 backward-compatible MPU401 devices have an i/o port such as 0x330. Or, it 4238 backward-compatible MPU401 devices have an i/o port such as 0x330. Or, it
4239 might be a part of its own PCI i/o region. It depends on the 4239 might be a part of its own PCI i/o region. It depends on the
4240 chip design. 4240 chip design.
4241 </para> 4241 </para>
4242 4242
4243 <para> 4243 <para>
4244 The 5th argument is bitflags for additional information. 4244 The 5th argument is bitflags for additional information.
4245 When the i/o port address above is a part of the PCI i/o 4245 When the i/o port address above is a part of the PCI i/o
4246 region, the MPU401 i/o port might have been already allocated 4246 region, the MPU401 i/o port might have been already allocated
4247 (reserved) by the driver itself. In such a case, pass the bit flag 4247 (reserved) by the driver itself. In such a case, pass the bit flag
4248 <constant>MPU401_INFO_INTEGRATED</constant> 4248 <constant>MPU401_INFO_INTEGRATED</constant>
4249 so that 4249 so that
4250 the mpu401-uart layer does not try to allocate the i/o ports by itself. 4250 the mpu401-uart layer does not try to allocate the i/o ports by itself.
4251 </para> 4251 </para>
4252 4252
4253 <para> 4253 <para>
4254 When the controller supports only the input or output MIDI stream, 4254 When the controller supports only the input or output MIDI stream,
4255 pass <constant>MPU401_INFO_INPUT</constant> or 4255 pass <constant>MPU401_INFO_INPUT</constant> or
4256 <constant>MPU401_INFO_OUTPUT</constant> bitflag, respectively. 4256 <constant>MPU401_INFO_OUTPUT</constant> bitflag, respectively.
4257 Then the rawmidi instance is created as a single stream. 4257 Then the rawmidi instance is created as a single stream.
4258 </para> 4258 </para>
4259 4259
4260 <para> 4260 <para>
4261 <constant>MPU401_INFO_MMIO</constant> bitflag is used to change 4261 <constant>MPU401_INFO_MMIO</constant> bitflag is used to change
4262 the access method to MMIO (via readb and writeb) instead of 4262 the access method to MMIO (via readb and writeb) instead of
4263 inb and outb. In this case, you have to pass the iomapped address 4263 inb and outb. In this case, you have to pass the iomapped address
4264 to <function>snd_mpu401_uart_new()</function>. 4264 to <function>snd_mpu401_uart_new()</function>.
4265 </para> 4265 </para>
4266 4266
4267 <para> 4267 <para>
4268 When <constant>MPU401_INFO_TX_IRQ</constant> is set, the output 4268 When <constant>MPU401_INFO_TX_IRQ</constant> is set, the output
4269 stream isn't checked in the default interrupt handler. The driver 4269 stream isn't checked in the default interrupt handler. The driver
4270 needs to call <function>snd_mpu401_uart_interrupt_tx()</function> 4270 needs to call <function>snd_mpu401_uart_interrupt_tx()</function>
4271 by itself to start processing the output stream in irq handler. 4271 by itself to start processing the output stream in irq handler.
4272 </para> 4272 </para>
4273 4273
4274 <para> 4274 <para>
4275 Usually, the port address corresponds to the command port and 4275 Usually, the port address corresponds to the command port and
4276 port + 1 corresponds to the data port. If not, you may change 4276 port + 1 corresponds to the data port. If not, you may change
4277 the <structfield>cport</structfield> field of 4277 the <structfield>cport</structfield> field of
4278 struct <structname>snd_mpu401</structname> manually 4278 struct <structname>snd_mpu401</structname> manually
4279 afterward. However, the <structname>snd_mpu401</structname> pointer is not 4279 afterward. However, the <structname>snd_mpu401</structname> pointer is not
4280 returned explicitly by 4280 returned explicitly by
4281 <function>snd_mpu401_uart_new()</function>. You need to cast 4281 <function>snd_mpu401_uart_new()</function>. You need to cast
4282 rmidi-&gt;private_data to 4282 rmidi-&gt;private_data to
4283 <structname>snd_mpu401</structname> explicitly, 4283 <structname>snd_mpu401</structname> explicitly,
4284 4284
4285 <informalexample> 4285 <informalexample>
4286 <programlisting> 4286 <programlisting>
4287 <![CDATA[ 4287 <![CDATA[
4288 struct snd_mpu401 *mpu; 4288 struct snd_mpu401 *mpu;
4289 mpu = rmidi->private_data; 4289 mpu = rmidi->private_data;
4290 ]]> 4290 ]]>
4291 </programlisting> 4291 </programlisting>
4292 </informalexample> 4292 </informalexample>
4293 4293
4294 and reset the cport as you like: 4294 and reset the cport as you like:
4295 4295
4296 <informalexample> 4296 <informalexample>
4297 <programlisting> 4297 <programlisting>
4298 <![CDATA[ 4298 <![CDATA[
4299 mpu->cport = my_own_control_port; 4299 mpu->cport = my_own_control_port;
4300 ]]> 4300 ]]>
4301 </programlisting> 4301 </programlisting>
4302 </informalexample> 4302 </informalexample>
4303 </para> 4303 </para>
4304 4304
4305 <para> 4305 <para>
4306 The 6th argument specifies the irq number for UART. If the irq 4306 The 6th argument specifies the irq number for UART. If the irq
4307 is already allocated, pass 0 to the 7th argument 4307 is already allocated, pass 0 to the 7th argument
4308 (<parameter>irq_flags</parameter>). Otherwise, pass the flags 4308 (<parameter>irq_flags</parameter>). Otherwise, pass the flags
4309 for irq allocation 4309 for irq allocation
4310 (<constant>SA_XXX</constant> bits) to it, and the irq will be 4310 (<constant>SA_XXX</constant> bits) to it, and the irq will be
4311 reserved by the mpu401-uart layer. If the card doesn't generate 4311 reserved by the mpu401-uart layer. If the card doesn't generate
4312 UART interrupts, pass -1 as the irq number. Then a timer 4312 UART interrupts, pass -1 as the irq number. Then a timer
4313 interrupt will be invoked for polling. 4313 interrupt will be invoked for polling.
4314 </para> 4314 </para>
4315 </section> 4315 </section>
4316 4316
4317 <section id="midi-interface-interrupt-handler"> 4317 <section id="midi-interface-interrupt-handler">
4318 <title>Interrupt Handler</title> 4318 <title>Interrupt Handler</title>
4319 <para> 4319 <para>
4320 When the interrupt is allocated in 4320 When the interrupt is allocated in
4321 <function>snd_mpu401_uart_new()</function>, the private 4321 <function>snd_mpu401_uart_new()</function>, the private
4322 interrupt handler is used, hence you don't have to do anything 4322 interrupt handler is used, hence you don't have to do anything
4323 other than create the mpu401 device. Otherwise, you have to call 4323 other than create the mpu401 device. Otherwise, you have to call
4324 <function>snd_mpu401_uart_interrupt()</function> explicitly when 4324 <function>snd_mpu401_uart_interrupt()</function> explicitly when
4325 a UART interrupt is invoked and checked in your own interrupt 4325 a UART interrupt is invoked and checked in your own interrupt
4326 handler. 4326 handler.
4327 </para> 4327 </para>
4328 4328
4329 <para> 4329 <para>
4330 In this case, you need to pass the private_data of the 4330 In this case, you need to pass the private_data of the
4331 returned rawmidi object from 4331 returned rawmidi object from
4332 <function>snd_mpu401_uart_new()</function> as the second 4332 <function>snd_mpu401_uart_new()</function> as the second
4333 argument of <function>snd_mpu401_uart_interrupt()</function>. 4333 argument of <function>snd_mpu401_uart_interrupt()</function>.
4334 4334
4335 <informalexample> 4335 <informalexample>
4336 <programlisting> 4336 <programlisting>
4337 <![CDATA[ 4337 <![CDATA[
4338 snd_mpu401_uart_interrupt(irq, rmidi->private_data, regs); 4338 snd_mpu401_uart_interrupt(irq, rmidi->private_data, regs);
4339 ]]> 4339 ]]>
4340 </programlisting> 4340 </programlisting>
4341 </informalexample> 4341 </informalexample>
4342 </para> 4342 </para>
4343 </section> 4343 </section>
4344 4344
4345 </chapter> 4345 </chapter>
4346 4346
4347 4347
4348 <!-- ****************************************************** --> 4348 <!-- ****************************************************** -->
4349 <!-- RawMIDI Interface --> 4349 <!-- RawMIDI Interface -->
4350 <!-- ****************************************************** --> 4350 <!-- ****************************************************** -->
4351 <chapter id="rawmidi-interface"> 4351 <chapter id="rawmidi-interface">
4352 <title>RawMIDI Interface</title> 4352 <title>RawMIDI Interface</title>
4353 4353
4354 <section id="rawmidi-interface-overview"> 4354 <section id="rawmidi-interface-overview">
4355 <title>Overview</title> 4355 <title>Overview</title>
4356 4356
4357 <para> 4357 <para>
4358 The raw MIDI interface is used for hardware MIDI ports that can 4358 The raw MIDI interface is used for hardware MIDI ports that can
4359 be accessed as a byte stream. It is not used for synthesizer 4359 be accessed as a byte stream. It is not used for synthesizer
4360 chips that do not directly understand MIDI. 4360 chips that do not directly understand MIDI.
4361 </para> 4361 </para>
4362 4362
4363 <para> 4363 <para>
4364 ALSA handles file and buffer management. All you have to do is 4364 ALSA handles file and buffer management. All you have to do is
4365 to write some code to move data between the buffer and the 4365 to write some code to move data between the buffer and the
4366 hardware. 4366 hardware.
4367 </para> 4367 </para>
4368 4368
4369 <para> 4369 <para>
4370 The rawmidi API is defined in 4370 The rawmidi API is defined in
4371 <filename>&lt;sound/rawmidi.h&gt;</filename>. 4371 <filename>&lt;sound/rawmidi.h&gt;</filename>.
4372 </para> 4372 </para>
4373 </section> 4373 </section>
4374 4374
4375 <section id="rawmidi-interface-constructor"> 4375 <section id="rawmidi-interface-constructor">
4376 <title>Constructor</title> 4376 <title>Constructor</title>
4377 4377
4378 <para> 4378 <para>
4379 To create a rawmidi device, call the 4379 To create a rawmidi device, call the
4380 <function>snd_rawmidi_new</function> function: 4380 <function>snd_rawmidi_new</function> function:
4381 <informalexample> 4381 <informalexample>
4382 <programlisting> 4382 <programlisting>
4383 <![CDATA[ 4383 <![CDATA[
4384 struct snd_rawmidi *rmidi; 4384 struct snd_rawmidi *rmidi;
4385 err = snd_rawmidi_new(chip->card, "MyMIDI", 0, outs, ins, &rmidi); 4385 err = snd_rawmidi_new(chip->card, "MyMIDI", 0, outs, ins, &rmidi);
4386 if (err < 0) 4386 if (err < 0)
4387 return err; 4387 return err;
4388 rmidi->private_data = chip; 4388 rmidi->private_data = chip;
4389 strcpy(rmidi->name, "My MIDI"); 4389 strcpy(rmidi->name, "My MIDI");
4390 rmidi->info_flags = SNDRV_RAWMIDI_INFO_OUTPUT | 4390 rmidi->info_flags = SNDRV_RAWMIDI_INFO_OUTPUT |
4391 SNDRV_RAWMIDI_INFO_INPUT | 4391 SNDRV_RAWMIDI_INFO_INPUT |
4392 SNDRV_RAWMIDI_INFO_DUPLEX; 4392 SNDRV_RAWMIDI_INFO_DUPLEX;
4393 ]]> 4393 ]]>
4394 </programlisting> 4394 </programlisting>
4395 </informalexample> 4395 </informalexample>
4396 </para> 4396 </para>
4397 4397
4398 <para> 4398 <para>
4399 The first argument is the card pointer, the second argument is 4399 The first argument is the card pointer, the second argument is
4400 the ID string. 4400 the ID string.
4401 </para> 4401 </para>
4402 4402
4403 <para> 4403 <para>
4404 The third argument is the index of this component. You can 4404 The third argument is the index of this component. You can
4405 create up to 8 rawmidi devices. 4405 create up to 8 rawmidi devices.
4406 </para> 4406 </para>
4407 4407
4408 <para> 4408 <para>
4409 The fourth and fifth arguments are the number of output and 4409 The fourth and fifth arguments are the number of output and
4410 input substreams, respectively, of this device. (A substream is 4410 input substreams, respectively, of this device. (A substream is
4411 the equivalent of a MIDI port.) 4411 the equivalent of a MIDI port.)
4412 </para> 4412 </para>
4413 4413
4414 <para> 4414 <para>
4415 Set the <structfield>info_flags</structfield> field to specify 4415 Set the <structfield>info_flags</structfield> field to specify
4416 the capabilities of the device. 4416 the capabilities of the device.
4417 Set <constant>SNDRV_RAWMIDI_INFO_OUTPUT</constant> if there is 4417 Set <constant>SNDRV_RAWMIDI_INFO_OUTPUT</constant> if there is
4418 at least one output port, 4418 at least one output port,
4419 <constant>SNDRV_RAWMIDI_INFO_INPUT</constant> if there is at 4419 <constant>SNDRV_RAWMIDI_INFO_INPUT</constant> if there is at
4420 least one input port, 4420 least one input port,
4421 and <constant>SNDRV_RAWMIDI_INFO_DUPLEX</constant> if the device 4421 and <constant>SNDRV_RAWMIDI_INFO_DUPLEX</constant> if the device
4422 can handle output and input at the same time. 4422 can handle output and input at the same time.
4423 </para> 4423 </para>
4424 4424
4425 <para> 4425 <para>
4426 After the rawmidi device is created, you need to set the 4426 After the rawmidi device is created, you need to set the
4427 operators (callbacks) for each substream. There are helper 4427 operators (callbacks) for each substream. There are helper
4428 functions to set the operators for all substreams of a device: 4428 functions to set the operators for all substreams of a device:
4429 <informalexample> 4429 <informalexample>
4430 <programlisting> 4430 <programlisting>
4431 <![CDATA[ 4431 <![CDATA[
4432 snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_OUTPUT, &snd_mymidi_output_ops); 4432 snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_OUTPUT, &snd_mymidi_output_ops);
4433 snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_INPUT, &snd_mymidi_input_ops); 4433 snd_rawmidi_set_ops(rmidi, SNDRV_RAWMIDI_STREAM_INPUT, &snd_mymidi_input_ops);
4434 ]]> 4434 ]]>
4435 </programlisting> 4435 </programlisting>
4436 </informalexample> 4436 </informalexample>
4437 </para> 4437 </para>
4438 4438
4439 <para> 4439 <para>
4440 The operators are usually defined like this: 4440 The operators are usually defined like this:
4441 <informalexample> 4441 <informalexample>
4442 <programlisting> 4442 <programlisting>
4443 <![CDATA[ 4443 <![CDATA[
4444 static struct snd_rawmidi_ops snd_mymidi_output_ops = { 4444 static struct snd_rawmidi_ops snd_mymidi_output_ops = {
4445 .open = snd_mymidi_output_open, 4445 .open = snd_mymidi_output_open,
4446 .close = snd_mymidi_output_close, 4446 .close = snd_mymidi_output_close,
4447 .trigger = snd_mymidi_output_trigger, 4447 .trigger = snd_mymidi_output_trigger,
4448 }; 4448 };
4449 ]]> 4449 ]]>
4450 </programlisting> 4450 </programlisting>
4451 </informalexample> 4451 </informalexample>
4452 These callbacks are explained in the <link 4452 These callbacks are explained in the <link
4453 linkend="rawmidi-interface-callbacks"><citetitle>Callbacks</citetitle></link> 4453 linkend="rawmidi-interface-callbacks"><citetitle>Callbacks</citetitle></link>
4454 section. 4454 section.
4455 </para> 4455 </para>
4456 4456
4457 <para> 4457 <para>
4458 If there is more than one substream, you should give each one a 4458 If there is more than one substream, you should give each one a
4459 unique name: 4459 unique name:
4460 <informalexample> 4460 <informalexample>
4461 <programlisting> 4461 <programlisting>
4462 <![CDATA[ 4462 <![CDATA[
4463 struct list_head *list; 4463 struct list_head *list;
4464 struct snd_rawmidi_substream *substream; 4464 struct snd_rawmidi_substream *substream;
4465 list_for_each(list, &rmidi->streams[SNDRV_RAWMIDI_STREAM_OUTPUT].substreams) { 4465 list_for_each(list, &rmidi->streams[SNDRV_RAWMIDI_STREAM_OUTPUT].substreams) {
4466 substream = list_entry(list, struct snd_rawmidi_substream, list); 4466 substream = list_entry(list, struct snd_rawmidi_substream, list);
4467 sprintf(substream->name, "My MIDI Port %d", substream->number + 1); 4467 sprintf(substream->name, "My MIDI Port %d", substream->number + 1);
4468 } 4468 }
4469 /* same for SNDRV_RAWMIDI_STREAM_INPUT */ 4469 /* same for SNDRV_RAWMIDI_STREAM_INPUT */
4470 ]]> 4470 ]]>
4471 </programlisting> 4471 </programlisting>
4472 </informalexample> 4472 </informalexample>
4473 </para> 4473 </para>
4474 </section> 4474 </section>
4475 4475
4476 <section id="rawmidi-interface-callbacks"> 4476 <section id="rawmidi-interface-callbacks">
4477 <title>Callbacks</title> 4477 <title>Callbacks</title>
4478 4478
4479 <para> 4479 <para>
4480 In all callbacks, the private data that you've set for the 4480 In all callbacks, the private data that you've set for the
4481 rawmidi device can be accessed as 4481 rawmidi device can be accessed as
4482 substream-&gt;rmidi-&gt;private_data. 4482 substream-&gt;rmidi-&gt;private_data.
4483 <!-- <code> isn't available before DocBook 4.3 --> 4483 <!-- <code> isn't available before DocBook 4.3 -->
4484 </para> 4484 </para>
4485 4485
4486 <para> 4486 <para>
4487 If there is more than one port, your callbacks can determine the 4487 If there is more than one port, your callbacks can determine the
4488 port index from the struct snd_rawmidi_substream data passed to each 4488 port index from the struct snd_rawmidi_substream data passed to each
4489 callback: 4489 callback:
4490 <informalexample> 4490 <informalexample>
4491 <programlisting> 4491 <programlisting>
4492 <![CDATA[ 4492 <![CDATA[
4493 struct snd_rawmidi_substream *substream; 4493 struct snd_rawmidi_substream *substream;
4494 int index = substream->number; 4494 int index = substream->number;
4495 ]]> 4495 ]]>
4496 </programlisting> 4496 </programlisting>
4497 </informalexample> 4497 </informalexample>
4498 </para> 4498 </para>
4499 4499
4500 <section id="rawmidi-interface-op-open"> 4500 <section id="rawmidi-interface-op-open">
4501 <title><function>open</function> callback</title> 4501 <title><function>open</function> callback</title>
4502 4502
4503 <informalexample> 4503 <informalexample>
4504 <programlisting> 4504 <programlisting>
4505 <![CDATA[ 4505 <![CDATA[
4506 static int snd_xxx_open(struct snd_rawmidi_substream *substream); 4506 static int snd_xxx_open(struct snd_rawmidi_substream *substream);
4507 ]]> 4507 ]]>
4508 </programlisting> 4508 </programlisting>
4509 </informalexample> 4509 </informalexample>
4510 4510
4511 <para> 4511 <para>
4512 This is called when a substream is opened. 4512 This is called when a substream is opened.
4513 You can initialize the hardware here, but you should not yet 4513 You can initialize the hardware here, but you should not yet
4514 start transmitting/receiving data. 4514 start transmitting/receiving data.
4515 </para> 4515 </para>
4516 </section> 4516 </section>
4517 4517
4518 <section id="rawmidi-interface-op-close"> 4518 <section id="rawmidi-interface-op-close">
4519 <title><function>close</function> callback</title> 4519 <title><function>close</function> callback</title>
4520 4520
4521 <informalexample> 4521 <informalexample>
4522 <programlisting> 4522 <programlisting>
4523 <![CDATA[ 4523 <![CDATA[
4524 static int snd_xxx_close(struct snd_rawmidi_substream *substream); 4524 static int snd_xxx_close(struct snd_rawmidi_substream *substream);
4525 ]]> 4525 ]]>
4526 </programlisting> 4526 </programlisting>
4527 </informalexample> 4527 </informalexample>
4528 4528
4529 <para> 4529 <para>
4530 This is called when a substream is closed. 4530 This is called when a substream is closed.
4531 </para> 4531 </para>
4532 4532
4533 <para> 4533 <para>
4534 The <function>open</function> and <function>close</function> 4534 The <function>open</function> and <function>close</function>
4535 callbacks of a rawmidi device are serialized with a mutex, 4535 callbacks of a rawmidi device are serialized with a mutex,
4536 and can sleep. 4536 and can sleep.
4537 </para> 4537 </para>
4538 </section> 4538 </section>

      <section id="rawmidi-interface-op-trigger-out">
        <title><function>trigger</function> callback for output substreams</title>

        <informalexample>
          <programlisting>
<![CDATA[
  static void snd_xxx_output_trigger(struct snd_rawmidi_substream *substream, int up);
]]>
          </programlisting>
        </informalexample>

        <para>
          This is called with a nonzero <parameter>up</parameter>
          parameter when there is some data in the substream buffer that
          must be transmitted.
        </para>

        <para>
          To read data from the buffer, call
          <function>snd_rawmidi_transmit_peek</function>. It will
          return the number of bytes that have been read; this will be
          less than the number of bytes requested when there is no more
          data in the buffer.
          After the data has been transmitted successfully, call
          <function>snd_rawmidi_transmit_ack</function> to remove the
          data from the substream buffer:
          <informalexample>
            <programlisting>
<![CDATA[
  unsigned char data;
  while (snd_rawmidi_transmit_peek(substream, &data, 1) == 1) {
          if (snd_mychip_try_to_transmit(data))
                  snd_rawmidi_transmit_ack(substream, 1);
          else
                  break; /* hardware FIFO full */
  }
]]>
            </programlisting>
          </informalexample>
        </para>

        <para>
          If you know beforehand that the hardware will accept data, you
          can use the <function>snd_rawmidi_transmit</function> function
          which reads some data and removes it from the buffer at once:
          <informalexample>
            <programlisting>
<![CDATA[
  while (snd_mychip_transmit_possible()) {
          unsigned char data;
          if (snd_rawmidi_transmit(substream, &data, 1) != 1)
                  break; /* no more data */
          snd_mychip_transmit(data);
  }
]]>
            </programlisting>
          </informalexample>
        </para>

        <para>
          If you know beforehand how many bytes you can accept, you can
          use a buffer size greater than one with the
          <function>snd_rawmidi_transmit*</function> functions.
        </para>

        <para>
          The <function>trigger</function> callback must not sleep. If
          the hardware FIFO is full before the substream buffer has been
          emptied, you have to continue transmitting data later, either
          in an interrupt handler, or with a timer if the hardware
          doesn't have a MIDI transmit interrupt.
        </para>
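        <para>
          The hand-over between the <function>trigger</function> callback
          and a transmit interrupt can be illustrated with a small
          user-space model. This is only a sketch: every
          <function>model_*</function> name and the FIFO depth are
          assumptions made for illustration, not ALSA or driver APIs, and
          a real driver would use the
          <function>snd_rawmidi_transmit*</function> functions and proper
          locking instead.
        </para>

```c
#include <stddef.h>

/* Hypothetical user-space model; none of these names are ALSA APIs. */
#define MODEL_HW_FIFO_SIZE 4

struct model_chip {
        unsigned char pending[16]; /* stands in for the substream buffer */
        size_t head;               /* next byte to transmit */
        size_t count;              /* bytes still queued */
        size_t fifo_used;          /* bytes sitting in the hardware FIFO */
        int tx_running;            /* nonzero while output is triggered */
};

/* Move queued bytes into the FIFO until it is full or the queue is empty. */
static size_t model_fill_fifo(struct model_chip *chip)
{
        size_t moved = 0;

        while (chip->count > 0 && chip->fifo_used < MODEL_HW_FIFO_SIZE) {
                chip->head = (chip->head + 1) % sizeof(chip->pending);
                chip->count--;
                chip->fifo_used++;
                moved++;
        }
        if (chip->count == 0)
                chip->tx_running = 0; /* nothing left to resume later */
        return moved;
}

/* Model of the trigger callback: never waits, only starts or stops. */
static void model_output_trigger(struct model_chip *chip, int up)
{
        if (up) {
                chip->tx_running = 1;
                model_fill_fifo(chip);
        } else {
                chip->tx_running = 0; /* abort transmission */
        }
}

/* Model of the transmit interrupt: the FIFO has drained, so refill it. */
static size_t model_tx_interrupt(struct model_chip *chip)
{
        chip->fifo_used = 0;
        if (!chip->tx_running)
                return 0;
        return model_fill_fifo(chip);
}
```

        <para>
          In this model the trigger callback writes as much as the FIFO
          takes and returns immediately; each later interrupt continues
          where it stopped until the queue is empty.
        </para>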

        <para>
          The <function>trigger</function> callback is called with a
          zero <parameter>up</parameter> parameter when the transmission
          of data should be aborted.
        </para>
      </section>

      <section id="rawmidi-interface-op-trigger-in">
        <title><function>trigger</function> callback for input substreams</title>

        <informalexample>
          <programlisting>
<![CDATA[
  static void snd_xxx_input_trigger(struct snd_rawmidi_substream *substream, int up);
]]>
          </programlisting>
        </informalexample>

        <para>
          This is called with a nonzero <parameter>up</parameter>
          parameter to enable receiving data, or with a zero
          <parameter>up</parameter> parameter to disable receiving data.
        </para>

        <para>
          The <function>trigger</function> callback must not sleep; the
          actual reading of data from the device is usually done in an
          interrupt handler.
        </para>

        <para>
          When data reception is enabled, your interrupt handler should
          call <function>snd_rawmidi_receive</function> for all received
          data:
          <informalexample>
            <programlisting>
<![CDATA[
  void snd_mychip_midi_interrupt(...)
  {
          while (mychip_midi_available()) {
                  unsigned char data;
                  data = mychip_midi_read();
                  snd_rawmidi_receive(substream, &data, 1);
          }
  }
]]>
            </programlisting>
          </informalexample>
        </para>
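        <para>
          One possible arrangement, sketched here as a user-space model,
          is for the input <function>trigger</function> callback to only
          flip a flag, and for the interrupt handler to forward bytes
          while that flag is set. The <function>model_*</function> names
          and the fixed-size array are assumptions for illustration; a
          real handler would call
          <function>snd_rawmidi_receive</function> instead of storing
          bytes itself.
        </para>

```c
#include <stddef.h>

/* Hypothetical model of an input substream; not an ALSA structure. */
struct model_midi_in {
        int rx_enabled;            /* set and cleared by the trigger model */
        unsigned char received[8]; /* stands in for the rawmidi input buffer */
        size_t count;
};

/* Model of the input trigger callback: no waiting, just record the state. */
static void model_input_trigger(struct model_midi_in *in, int up)
{
        in->rx_enabled = (up != 0);
}

/*
 * Model of the per-byte work of the interrupt handler.
 * Returns 1 if the byte was forwarded, 0 if it was discarded.
 */
static int model_rx_interrupt(struct model_midi_in *in, unsigned char byte)
{
        if (!in->rx_enabled || in->count >= sizeof(in->received))
                return 0;
        in->received[in->count++] = byte;
        return 1;
}
```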
      </section>

      <section id="rawmidi-interface-op-drain">
        <title><function>drain</function> callback</title>

        <informalexample>
          <programlisting>
<![CDATA[
  static void snd_xxx_drain(struct snd_rawmidi_substream *substream);
]]>
          </programlisting>
        </informalexample>

        <para>
          This is only used with output substreams. This function should wait
          until all data read from the substream buffer has been transmitted.
          This ensures that the device can be closed and the driver unloaded
          without losing data.
        </para>

        <para>
          This callback is optional. If you do not set
          <structfield>drain</structfield> in the struct snd_rawmidi_ops
          structure, ALSA will simply wait for 50&nbsp;milliseconds
          instead.
        </para>
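        <para>
          As a rough illustration of what <quote>wait until all data has
          been transmitted</quote> can mean, the following user-space
          model polls a FIFO level with an upper bound on the number of
          polls. The <function>model_*</function> names and the
          one-byte-per-poll behaviour are assumptions; a real
          <function>drain</function> callback would sleep between checks
          of an actual hardware status register.
        </para>

```c
/* Hypothetical model of a UART-like transmitter; not an ALSA structure. */
struct model_uart {
        unsigned int fifo_level; /* bytes still waiting in the hardware FIFO */
};

/* Pretend the hardware sends one byte per poll interval. */
static void model_hw_tick(struct model_uart *uart)
{
        if (uart->fifo_level > 0)
                uart->fifo_level--;
}

/*
 * Model of a drain callback: wait until the FIFO is empty, but give up
 * after max_polls so that a stuck device cannot block forever.
 * Returns the number of polls that were needed.
 */
static unsigned int model_drain(struct model_uart *uart, unsigned int max_polls)
{
        unsigned int polls = 0;

        while (uart->fifo_level > 0 && polls < max_polls) {
                model_hw_tick(uart); /* a real driver would sleep briefly here */
                polls++;
        }
        return polls;
}
```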
      </section>
    </section>

  </chapter>


<!-- ****************************************************** -->
<!-- Miscellaneous Devices  -->
<!-- ****************************************************** -->
  <chapter id="misc-devices">
    <title>Miscellaneous Devices</title>

    <section id="misc-devices-opl3">
      <title>FM OPL3</title>
      <para>
        The FM OPL3 is still used on many chips (mainly for backward
        compatibility). ALSA has a nice OPL3 FM control layer, too. The
        OPL3 API is defined in
        <filename>&lt;sound/opl3.h&gt;</filename>.
      </para>

      <para>
        FM registers can be directly accessed through the direct-FM API,
        defined in <filename>&lt;sound/asound_fm.h&gt;</filename>. In
        ALSA native mode, FM registers are accessed through the
        Hardware-Dependent Device direct-FM extension API, whereas in
        OSS compatible mode, FM registers can be accessed with the OSS
        direct-FM compatible API on the <filename>/dev/dmfmX</filename> device.
      </para>

      <para>
        To create the OPL3 component, you have two functions to
        call. The first one is a constructor for the <type>opl3_t</type>
        instance.

        <informalexample>
          <programlisting>
<![CDATA[
  struct snd_opl3 *opl3;
  snd_opl3_create(card, lport, rport, OPL3_HW_OPL3_XXX,
                  integrated, &opl3);
]]>
          </programlisting>
        </informalexample>
      </para>

      <para>
        The first argument is the card pointer, the second one is the
        left port address, and the third is the right port address. In
        most cases, the right port is placed at the left port + 2.
      </para>

      <para>
        The fourth argument is the hardware type.
      </para>

      <para>
        When the left and right ports have already been allocated by
        the card driver, pass non-zero to the fifth argument
        (<parameter>integrated</parameter>). Otherwise, the opl3 module will
        allocate the specified ports by itself.
      </para>

      <para>
        When accessing the hardware requires a special method
        instead of standard I/O access, you can create the opl3 instance
        separately with <function>snd_opl3_new()</function>.

        <informalexample>
          <programlisting>
<![CDATA[
  struct snd_opl3 *opl3;
  snd_opl3_new(card, OPL3_HW_OPL3_XXX, &opl3);
]]>
          </programlisting>
        </informalexample>
      </para>

      <para>
        Then set <structfield>command</structfield>,
        <structfield>private_data</structfield> and
        <structfield>private_free</structfield> for the private
        access function, the private data and the destructor.
        The l_port and r_port fields need not be set. Only the
        command must be set properly. You can retrieve the data
        from the opl3-&gt;private_data field.
      </para>
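      <para>
        The role of the private access function can be sketched with a
        user-space model. Everything below is hypothetical: the
        <type>model_opl3</type> structure only imitates the three fields
        named above, and the convention that bit 8 of the command selects
        a second register bank is an assumption of this model. Consult
        <filename>&lt;sound/opl3.h&gt;</filename> for the real callback
        prototype.
      </para>

```c
/* Hypothetical stand-in for the opl3 instance; not the ALSA structure. */
struct model_opl3 {
        void (*command)(struct model_opl3 *opl3, unsigned short cmd,
                        unsigned char val);
        void *private_data;
};

/* Private data of the model: two banks of shadow registers. */
struct model_regs {
        unsigned char bank[2][256];
};

/*
 * Model of a private access function. It fetches its state through
 * private_data and "writes" the value to the selected register.
 * Assumption: bit 8 of cmd selects the second bank.
 */
static void model_command(struct model_opl3 *opl3, unsigned short cmd,
                          unsigned char val)
{
        struct model_regs *regs = opl3->private_data;

        regs->bank[(cmd >> 8) & 1][cmd & 0xff] = val;
}
```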

      <para>
        After creating the opl3 instance via <function>snd_opl3_new()</function>,
        call <function>snd_opl3_init()</function> to initialize the chip to the
        proper state. Note that <function>snd_opl3_create()</function> always
        calls it internally.
      </para>

      <para>
        If the opl3 instance is created successfully, then create a
        hwdep device for this opl3.

        <informalexample>
          <programlisting>
<![CDATA[
  struct snd_hwdep *opl3hwdep;
  snd_opl3_hwdep_new(opl3, 0, 1, &opl3hwdep);
]]>
          </programlisting>
        </informalexample>
      </para>

      <para>
        The first argument is the <type>opl3_t</type> instance you
        created, and the second is the index number, usually 0.
      </para>

      <para>
        The third argument is the index-offset for the sequencer
        client assigned to the OPL3 port. When there is an MPU401-UART,
        give 1 here (the UART always takes 0).
      </para>
    </section>

    <section id="misc-devices-hardware-dependent">
      <title>Hardware-Dependent Devices</title>
      <para>
        Some chips need access from user space for special
        controls or for loading microcode. In such a case, you can
        create a hwdep (hardware-dependent) device. The hwdep API is
        defined in <filename>&lt;sound/hwdep.h&gt;</filename>. You can
        find examples in the opl3 driver or
        <filename>isa/sb/sb16_csp.c</filename>.
      </para>

      <para>
        Creation of the <type>hwdep</type> instance is done via
        <function>snd_hwdep_new()</function>.

        <informalexample>
          <programlisting>
<![CDATA[
  struct snd_hwdep *hw;
  snd_hwdep_new(card, "My HWDEP", 0, &hw);
]]>
          </programlisting>
        </informalexample>

        where the third argument is the index number.
      </para>

      <para>
        You can then assign any pointer value to
        <parameter>private_data</parameter>.
        If you assign private data, you should define a
        destructor, too. The destructor function is set in the
        <structfield>private_free</structfield> field.

        <informalexample>
          <programlisting>
<![CDATA[
  struct mydata *p = kmalloc(sizeof(*p), GFP_KERNEL);
  if (!p)
          return -ENOMEM; /* assuming the caller returns an error code */
  hw->private_data = p;
  hw->private_free = mydata_free;
]]>
          </programlisting>
        </informalexample>

        and the implementation of the destructor would be:

        <informalexample>
          <programlisting>
<![CDATA[
  static void mydata_free(struct snd_hwdep *hw)
  {
          struct mydata *p = hw->private_data;
          kfree(p);
  }
]]>
          </programlisting>
        </informalexample>
      </para>

      <para>
        Arbitrary file operations can be defined for this
        instance. The file operations are defined in the
        <parameter>ops</parameter> table. For example, assume that
        this chip needs an ioctl.

        <informalexample>
          <programlisting>
<![CDATA[
  hw->ops.open = mydata_open;
  hw->ops.ioctl = mydata_ioctl;
  hw->ops.release = mydata_release;
]]>
          </programlisting>
        </informalexample>

        Then implement the callback functions as you like.
      </para>
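      <para>
        What such an ioctl callback typically does can be sketched in
        user space as a command dispatcher that reaches its state through
        <structfield>private_data</structfield>. The command numbers,
        the <type>model_*</type> types and the simplified argument list
        are all assumptions for illustration; the real callback prototype
        is declared in <filename>&lt;sound/hwdep.h&gt;</filename> and
        would copy arguments from and to user space.
      </para>

```c
#include <errno.h>

/* Hypothetical command numbers; a real driver defines its own ioctls. */
#define MODEL_IOCTL_GET_VERSION 1
#define MODEL_IOCTL_SET_GAIN    2

/* Hypothetical stand-ins for the hwdep instance and its private data. */
struct model_hwdep {
        void *private_data;
};

struct model_data {
        int version;
        int gain;
};

/* Model of an ioctl callback: dispatch on cmd, validate, then act. */
static int model_ioctl(struct model_hwdep *hw, unsigned int cmd, int *arg)
{
        struct model_data *p = hw->private_data;

        switch (cmd) {
        case MODEL_IOCTL_GET_VERSION:
                *arg = p->version;
                return 0;
        case MODEL_IOCTL_SET_GAIN:
                if (*arg < 0 || *arg > 100)
                        return -EINVAL; /* reject out-of-range values */
                p->gain = *arg;
                return 0;
        default:
                return -ENOTTY; /* unknown command */
        }
}
```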
    </section>

    <section id="misc-devices-IEC958">
      <title>IEC958 (S/PDIF)</title>
      <para>
        Usually the controls for IEC958 devices are implemented via
        the control interface. There is a macro to compose a name string for
        IEC958 controls, <function>SNDRV_CTL_NAME_IEC958()</function>
        defined in <filename>&lt;include/asound.h&gt;</filename>.
      </para>

      <para>
        There are some standard controls for IEC958 status bits. These
        controls use the type <type>SNDRV_CTL_ELEM_TYPE_IEC958</type>,
        and the element size is fixed as a 4-byte array
        (value.iec958.status[x]). For the <structfield>info</structfield>
        callback, you don't specify
        the value field for this type (the count field must be set,
        though).
      </para>

      <para>
        <quote>IEC958 Playback Con Mask</quote> is used to return the
        bit-mask for the IEC958 status bits of consumer mode. Similarly,
        <quote>IEC958 Playback Pro Mask</quote> returns the bitmask for
        professional mode. They are read-only controls, and are defined
        as MIXER controls (iface =
        <constant>SNDRV_CTL_ELEM_IFACE_MIXER</constant>).
      </para>
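      <para>
        The division of labour between the
        <structfield>info</structfield> callback and a read-only mask
        control can be sketched as a user-space model. The structures
        below only imitate the fields mentioned in the text, the type
        constant is a placeholder rather than the real header value, and
        the mask bytes are arbitrary; which bits a chip actually supports
        depends on the hardware.
      </para>

```c
#include <string.h>

/* Placeholder constant; the real value comes from the ALSA headers. */
#define MODEL_ELEM_TYPE_IEC958 1

/* Hypothetical stand-ins for the element info and value structures. */
struct model_elem_info {
        int type;
        unsigned int count;
};

struct model_elem_value {
        unsigned char status[4]; /* 4-byte array, following the text above */
};

/* Model of the info callback: set type and count, leave value alone. */
static int model_spdif_info(struct model_elem_info *uinfo)
{
        uinfo->type = MODEL_ELEM_TYPE_IEC958;
        uinfo->count = 1;
        return 0;
}

/* Model of the get callback of a "Con Mask" control (read-only). */
static int model_spdif_cmask_get(struct model_elem_value *uvalue)
{
        memset(uvalue->status, 0, sizeof(uvalue->status));
        uvalue->status[0] = 0xff; /* arbitrary: all bits of byte 0 usable */
        uvalue->status[1] = 0xff; /* arbitrary: all bits of byte 1 usable */
        return 0;
}
```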

      <para>
        Meanwhile, the <quote>IEC958 Playback Default</quote> control is
        defined for getting and setting the current default IEC958
        bits. Note that this one is usually defined as a PCM control
        (iface = <constant>SNDRV_CTL_ELEM_IFACE_PCM</constant>),
        although in some places it's defined as a MIXER control.
      </para>

      <para>
        In addition, you can define the control switches to
        enable/disable or to set the raw bit mode. The implementation
        will depend on the chip, but the control should be named as
        <quote>IEC958 xxx</quote>, preferably using the
        <function>SNDRV_CTL_NAME_IEC958()</function> macro.
      </para>

      <para>
        You can find several examples in, for instance,
        <filename>pci/emu10k1</filename>,
        <filename>pci/ice1712</filename>, or
        <filename>pci/cmipci.c</filename>.
      </para>
    </section>

  </chapter>


<!-- ****************************************************** -->
<!-- Buffer and Memory Management  -->
<!-- ****************************************************** -->
  <chapter id="buffer-and-memory">
    <title>Buffer and Memory Management</title>

    <section id="buffer-and-memory-buffer-types">
      <title>Buffer Types</title>
      <para>
        ALSA provides several different buffer allocation functions
        depending on the bus and the architecture. All these have a
        consistent API. The allocation of physically-contiguous pages is
        done via the
        <function>snd_malloc_xxx_pages()</function> function, where xxx
        is the bus type.
      </para>

      <para>
        The allocation of pages with fallback is done via
        <function>snd_malloc_xxx_pages_fallback()</function>. This
        function tries to allocate the specified pages, but if the pages
        are not available, it reduces the requested size until
        enough space is found.
      </para>
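      <para>
        The fallback idea can be illustrated with a user-space sketch
        that halves the request until an allocation succeeds. This is
        not the ALSA implementation: the halving step, the
        <parameter>min_size</parameter> cut-off, the injected allocator
        and all <function>model_*</function> names are assumptions chosen
        to keep the example self-contained.
      </para>

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocator hook so the fallback logic can be exercised deterministically. */
typedef void *(*model_alloc_fn)(size_t size);

/*
 * Try to allocate size bytes; on failure halve the request and retry
 * until it drops below min_size. On success *res_size holds the size
 * that was actually obtained; on failure it is set to 0.
 */
static void *model_malloc_fallback(model_alloc_fn alloc, size_t size,
                                   size_t min_size, size_t *res_size)
{
        while (size > 0 && size >= min_size) {
                void *buf = alloc(size);

                if (buf) {
                        *res_size = size;
                        return buf;
                }
                size /= 2;
        }
        *res_size = 0;
        return NULL;
}

/* Example allocator that refuses anything larger than 16 KiB. */
static void *model_limited_alloc(size_t size)
{
        return size > 16384 ? NULL : malloc(size);
}
```

      <para>
        The caller has to check the returned size, since it may be
        smaller than what was requested.
      </para>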

      <para>
        To release the space, call the
        <function>snd_free_xxx_pages()</function> function.
      </para>

      <para>
        Usually, ALSA drivers try to allocate and reserve
        a large contiguous physical space
        at the time the module is loaded for later use.
        This is called <quote>pre-allocation</quote>.
        As mentioned earlier, you can call the following function when
        constructing the pcm instance (in the case of the PCI bus).

        <informalexample>
          <programlisting>
<![CDATA[
  snd_pcm_lib_preallocate_pages_for_all(pcm, SNDRV_DMA_TYPE_DEV,
                                        snd_dma_pci_data(pci), size, max);
]]>
          </programlisting>
        </informalexample>

        where <parameter>size</parameter> is the byte size to be
        pre-allocated and <parameter>max</parameter> is the maximum
        size that can be set via the <filename>prealloc</filename> proc file.
        The allocator will try to get as large an area as possible
        within the given size.
      </para>

      <para>
        The second argument (type) and the third argument (device pointer)
        are dependent on the bus.
        In the case of the ISA bus, pass <function>snd_dma_isa_data()</function>
        as the third argument with the <constant>SNDRV_DMA_TYPE_DEV</constant> type.
        A continuous buffer unrelated to the bus can be pre-allocated
        with the <constant>SNDRV_DMA_TYPE_CONTINUOUS</constant> type and the
        <function>snd_dma_continuous_data(GFP_KERNEL)</function> device pointer,
        where <constant>GFP_KERNEL</constant> is the kernel allocation flag to
        use. For the SBUS, <constant>SNDRV_DMA_TYPE_SBUS</constant> and
        <function>snd_dma_sbus_data(sbus_dev)</function> are used instead.
        For the PCI scatter-gather buffers, use
        <constant>SNDRV_DMA_TYPE_DEV_SG</constant> with
        <function>snd_dma_pci_data(pci)</function>
        (see the section
        <link linkend="buffer-and-memory-non-contiguous"><citetitle>Non-Contiguous Buffers</citetitle></link>).
      </para>

      <para>
        Once the buffer is pre-allocated, you can use the
        allocator in the <structfield>hw_params</structfield> callback:

        <informalexample>
          <programlisting>
<![CDATA[
  snd_pcm_lib_malloc_pages(substream, size);
]]>
          </programlisting>
        </informalexample>

        Note that you have to pre-allocate to use this function.
      </para>
    </section>
5033 5033
5034 <section id="buffer-and-memory-external-hardware"> 5034 <section id="buffer-and-memory-external-hardware">
5035 <title>External Hardware Buffers</title> 5035 <title>External Hardware Buffers</title>
5036 <para> 5036 <para>
5037 Some chips have their own hardware buffers and the DMA 5037 Some chips have their own hardware buffers and the DMA
5038 transfer from the host memory is not available. In such a case, 5038 transfer from the host memory is not available. In such a case,
5039 you need to either 1) copy/set the audio data directly to the 5039 you need to either 1) copy/set the audio data directly to the
5040 external hardware buffer, or 2) make an intermediate buffer and 5040 external hardware buffer, or 2) make an intermediate buffer and
5041 copy/set the data from it to the external hardware buffer in 5041 copy/set the data from it to the external hardware buffer in
5042 interrupts (or in tasklets, preferably). 5042 interrupts (or in tasklets, preferably).
5043 </para> 5043 </para>
5044 5044
5045 <para> 5045 <para>
5046 The first case works fine if the external hardware buffer is large 5046 The first case works fine if the external hardware buffer is large
5047 enough. This method doesn't need any extra buffers and thus is 5047 enough. This method doesn't need any extra buffers and thus is
5048 more efficient. You need to define the 5048 more efficient. You need to define the
5049 <structfield>copy</structfield> and 5049 <structfield>copy</structfield> and
5050 <structfield>silence</structfield> callbacks for 5050 <structfield>silence</structfield> callbacks for
5051 the data transfer. However, there is a drawback: it cannot 5051 the data transfer. However, there is a drawback: it cannot
5052 be mmapped. The examples are GUS's GF1 PCM or emu8000's 5052 be mmapped. The examples are GUS's GF1 PCM or emu8000's
5053 wavetable PCM. 5053 wavetable PCM.
5054 </para> 5054 </para>
5055 5055
5056 <para> 5056 <para>
5057 The second case allows the mmap of the buffer, although you have 5057 The second case allows the mmap of the buffer, although you have
5058 to handle an interrupt or a tasklet for transferring the data 5058 to handle an interrupt or a tasklet for transferring the data
5059 from the intermediate buffer to the hardware buffer. You can find an 5059 from the intermediate buffer to the hardware buffer. You can find an
5060 example in the vxpocket driver. 5060 example in the vxpocket driver.
5061 </para> 5061 </para>
5062 5062
5063 <para> 5063 <para>
5064 Another case is that the chip uses a PCI memory-map 5064 Another case is that the chip uses a PCI memory-map
5065 region for the buffer instead of the host memory. In this case, 5065 region for the buffer instead of the host memory. In this case,
5066 mmap is available only on certain architectures such as Intel. In 5066 mmap is available only on certain architectures such as Intel. In
5067 non-mmap mode, the data cannot be transferred in the normal 5067 non-mmap mode, the data cannot be transferred in the normal
5068 way. Thus you need to define <structfield>copy</structfield> and 5068 way. Thus you need to define <structfield>copy</structfield> and
5069 <structfield>silence</structfield> callbacks, just 5069 <structfield>silence</structfield> callbacks, just
5070 as in the cases above. Examples are found in 5070 as in the cases above. Examples are found in
5071 <filename>rme32.c</filename> and <filename>rme96.c</filename>. 5071 <filename>rme32.c</filename> and <filename>rme96.c</filename>.
5072 </para> 5072 </para>
5073 5073
5074 <para> 5074 <para>
5075 The implementation of <structfield>copy</structfield> and 5075 The implementation of <structfield>copy</structfield> and
5076 <structfield>silence</structfield> callbacks depends upon 5076 <structfield>silence</structfield> callbacks depends upon
5077 whether the hardware supports interleaved or non-interleaved 5077 whether the hardware supports interleaved or non-interleaved
5078 samples. The <structfield>copy</structfield> callback is 5078 samples. The <structfield>copy</structfield> callback is
5079 defined as below, and differs 5079 defined as below, and differs
5080 slightly depending on whether the direction is playback or 5080 slightly depending on whether the direction is playback or
5081 capture: 5081 capture:
5082 5082
5083 <informalexample> 5083 <informalexample>
5084 <programlisting> 5084 <programlisting>
5085 <![CDATA[ 5085 <![CDATA[
5086 static int playback_copy(struct snd_pcm_substream *substream, int channel, 5086 static int playback_copy(struct snd_pcm_substream *substream, int channel,
5087 snd_pcm_uframes_t pos, void *src, snd_pcm_uframes_t count); 5087 snd_pcm_uframes_t pos, void *src, snd_pcm_uframes_t count);
5088 static int capture_copy(struct snd_pcm_substream *substream, int channel, 5088 static int capture_copy(struct snd_pcm_substream *substream, int channel,
5089 snd_pcm_uframes_t pos, void *dst, snd_pcm_uframes_t count); 5089 snd_pcm_uframes_t pos, void *dst, snd_pcm_uframes_t count);
5090 ]]> 5090 ]]>
5091 </programlisting> 5091 </programlisting>
5092 </informalexample> 5092 </informalexample>
5093 </para> 5093 </para>
5094 5094
5095 <para> 5095 <para>
5096 In the case of interleaved samples, the second argument 5096 In the case of interleaved samples, the second argument
5097 (<parameter>channel</parameter>) is not used. The third argument 5097 (<parameter>channel</parameter>) is not used. The third argument
5098 (<parameter>pos</parameter>) specifies the 5098 (<parameter>pos</parameter>) specifies the
5099 current position offset in frames. 5099 current position offset in frames.
5100 </para> 5100 </para>
5101 5101
5102 <para> 5102 <para>
5103 The meaning of the fourth argument is different between 5103 The meaning of the fourth argument is different between
5104 playback and capture. For playback, it holds the source data 5104 playback and capture. For playback, it holds the source data
5105 pointer, and for capture, it's the destination data pointer. 5105 pointer, and for capture, it's the destination data pointer.
5106 </para> 5106 </para>
5107 5107
5108 <para> 5108 <para>
5109 The last argument is the number of frames to be copied. 5109 The last argument is the number of frames to be copied.
5110 </para> 5110 </para>
5111 5111
5112 <para> 5112 <para>
5113 What you have to do in this callback is again different 5113 What you have to do in this callback is again different
5114 between playback and capture directions. In the case of 5114 between playback and capture directions. In the case of
5115 playback, you do: copy the given amount of data 5115 playback, you do: copy the given amount of data
5116 (<parameter>count</parameter>) at the specified pointer 5116 (<parameter>count</parameter>) at the specified pointer
5117 (<parameter>src</parameter>) to the specified offset 5117 (<parameter>src</parameter>) to the specified offset
5118 (<parameter>pos</parameter>) on the hardware buffer. When 5118 (<parameter>pos</parameter>) on the hardware buffer. When
5119 coded in a memcpy-like way, the copy would look like: 5119 coded in a memcpy-like way, the copy would look like:
5120 5120
5121 <informalexample> 5121 <informalexample>
5122 <programlisting> 5122 <programlisting>
5123 <![CDATA[ 5123 <![CDATA[
5124 my_memcpy(my_buffer + frames_to_bytes(runtime, pos), src, 5124 my_memcpy(my_buffer + frames_to_bytes(runtime, pos), src,
5125 frames_to_bytes(runtime, count)); 5125 frames_to_bytes(runtime, count));
5126 ]]> 5126 ]]>
5127 </programlisting> 5127 </programlisting>
5128 </informalexample> 5128 </informalexample>
5129 </para> 5129 </para>
5130 5130
5131 <para> 5131 <para>
5132 For the capture direction, you do: copy the given amount of 5132 For the capture direction, you do: copy the given amount of
5133 data (<parameter>count</parameter>) at the specified offset 5133 data (<parameter>count</parameter>) at the specified offset
5134 (<parameter>pos</parameter>) on the hardware buffer to the 5134 (<parameter>pos</parameter>) on the hardware buffer to the
5135 specified pointer (<parameter>dst</parameter>). 5135 specified pointer (<parameter>dst</parameter>).
5136 5136
5137 <informalexample> 5137 <informalexample>
5138 <programlisting> 5138 <programlisting>
5139 <![CDATA[ 5139 <![CDATA[
5140 my_memcpy(dst, my_buffer + frames_to_bytes(runtime, pos), 5140 my_memcpy(dst, my_buffer + frames_to_bytes(runtime, pos),
5141 frames_to_bytes(runtime, count)); 5141 frames_to_bytes(runtime, count));
5142 ]]> 5142 ]]>
5143 </programlisting> 5143 </programlisting>
5144 </informalexample> 5144 </informalexample>
5145 5145
5146 Note that both the position and the data amount are given 5146 Note that both the position and the data amount are given
5147 in frames. 5147 in frames.
5148 </para> 5148 </para>
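<para>
The two interleaved copy directions described above can be modelled in
ordinary user-space C. This is only a sketch under stated assumptions, not
driver code: <structname>model_runtime</structname> and the
<function>model_*</function> helpers are hypothetical names introduced here,
plain <function>memcpy()</function> stands in for
<function>my_memcpy()</function>, and the frame size is assumed to be the
channel count times the bytes per sample.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the runtime fields that frames_to_bytes() uses. */
struct model_runtime {
	unsigned int channels;      /* e.g. 2 for stereo */
	unsigned int sample_bytes;  /* e.g. 2 for 16-bit samples */
};

/* Bytes occupied by the given number of interleaved frames. */
static size_t model_frames_to_bytes(const struct model_runtime *rt,
				    size_t frames)
{
	return frames * rt->channels * rt->sample_bytes;
}

/* Playback direction: src -> hardware buffer at frame offset pos. */
static void model_playback_copy(const struct model_runtime *rt,
				unsigned char *hw_buffer, size_t pos,
				const void *src, size_t count)
{
	memcpy(hw_buffer + model_frames_to_bytes(rt, pos), src,
	       model_frames_to_bytes(rt, count));
}

/* Capture direction: hardware buffer at frame offset pos -> dst. */
static void model_capture_copy(const struct model_runtime *rt,
			       const unsigned char *hw_buffer, size_t pos,
			       void *dst, size_t count)
{
	memcpy(dst, hw_buffer + model_frames_to_bytes(rt, pos),
	       model_frames_to_bytes(rt, count));
}
```

<para>
With two channels of 16-bit samples a frame is four bytes, so copying 2 frames
at <parameter>pos</parameter> 3 touches bytes 12 through 19 of the buffer. Only
the pointer roles differ between the two directions.
</para>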
5149 5149
5150 <para> 5150 <para>
5151 In the case of non-interleaved samples, the implementation 5151 In the case of non-interleaved samples, the implementation
5152 will be a bit more complicated. 5152 will be a bit more complicated.
5153 </para> 5153 </para>
5154 5154
5155 <para> 5155 <para>
5156 You need to check the channel argument, and if it's -1, copy 5156 You need to check the channel argument, and if it's -1, copy
5157 all the channels. Otherwise, you have to copy only the 5157 all the channels. Otherwise, you have to copy only the
5158 specified channel. Please check 5158 specified channel. Please check
5159 <filename>isa/gus/gus_pcm.c</filename> as an example. 5159 <filename>isa/gus/gus_pcm.c</filename> as an example.
5160 </para> 5160 </para>
5161 5161
5162 <para> 5162 <para>
5163 The <structfield>silence</structfield> callback is also 5163 The <structfield>silence</structfield> callback is also
5164 implemented in a similar way. 5164 implemented in a similar way.
5165 5165
5166 <informalexample> 5166 <informalexample>
5167 <programlisting> 5167 <programlisting>
5168 <![CDATA[ 5168 <![CDATA[
5169 static int silence(struct snd_pcm_substream *substream, int channel, 5169 static int silence(struct snd_pcm_substream *substream, int channel,
5170 snd_pcm_uframes_t pos, snd_pcm_uframes_t count); 5170 snd_pcm_uframes_t pos, snd_pcm_uframes_t count);
5171 ]]> 5171 ]]>
5172 </programlisting> 5172 </programlisting>
5173 </informalexample> 5173 </informalexample>
5174 </para> 5174 </para>
5175 5175
5176 <para> 5176 <para>
5177 The meanings of the arguments are the same as for the 5177 The meanings of the arguments are the same as for the
5178 <structfield>copy</structfield> 5178 <structfield>copy</structfield>
5179 callback, although there is no <parameter>src/dst</parameter> 5179 callback, although there is no <parameter>src/dst</parameter>
5180 argument. In the case of interleaved samples, the channel 5180 argument. In the case of interleaved samples, the channel
5181 argument has no meaning, just as in the 5181 argument has no meaning, just as in the
5182 <structfield>copy</structfield> callback. 5182 <structfield>copy</structfield> callback.
5183 </para> 5183 </para>
5184 5184
5185 <para> 5185 <para>
5186 The role of <structfield>silence</structfield> callback is to 5186 The role of <structfield>silence</structfield> callback is to
5187 set the given amount 5187 set the given amount
5188 (<parameter>count</parameter>) of silence data at the 5188 (<parameter>count</parameter>) of silence data at the
5189 specified offset (<parameter>pos</parameter>) on the hardware 5189 specified offset (<parameter>pos</parameter>) on the hardware
5190 buffer. Suppose that the data format is signed (that is, the 5190 buffer. Suppose that the data format is signed (that is, the
5191 silent-data is 0); then the implementation using a memset-like 5191 silent-data is 0); then the implementation using a memset-like
5192 function would look like: 5192 function would look like:
5193 5193
5194 <informalexample> 5194 <informalexample>
5195 <programlisting> 5195 <programlisting>
5196 <![CDATA[ 5196 <![CDATA[
5197 my_memset(my_buffer + frames_to_bytes(runtime, pos), 0, 5197 my_memset(my_buffer + frames_to_bytes(runtime, pos), 0,
5198 frames_to_bytes(runtime, count)); 5198 frames_to_bytes(runtime, count));
5199 ]]> 5199 ]]>
5200 </programlisting> 5200 </programlisting>
5201 </informalexample> 5201 </informalexample>
5202 </para> 5202 </para>
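<para>
The silence fill for interleaved samples can be modelled in user space in the
same spirit. This is a sketch, not driver code:
<function>model_silence()</function> and its parameters are hypothetical, and
plain <function>memset()</function> stands in for the memset-like helper. A
byte-wise fill is assumed to be valid, which holds only for formats whose
silent sample is one repeated byte, such as 0 for signed formats or 0x80 for
unsigned 8-bit.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill `count` frames with silence, starting at frame offset `pos`.
 * frame_bytes is the size of one interleaved frame (hypothetical parameter);
 * silence_byte is the repeated byte that represents silence for the format.
 */
static void model_silence(unsigned char *hw_buffer, size_t frame_bytes,
			  size_t pos, size_t count,
			  unsigned char silence_byte)
{
	memset(hw_buffer + pos * frame_bytes, silence_byte,
	       count * frame_bytes);
}
```

<para>
Formats whose silent sample is not a single repeated byte (unsigned 16-bit,
for instance) would need a per-sample fill instead.
</para>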
5203 5203
5204 <para> 5204 <para>
5205 In the case of non-interleaved samples, again, the 5205 In the case of non-interleaved samples, again, the
5206 implementation becomes a bit more complicated. See, for example, 5206 implementation becomes a bit more complicated. See, for example,
5207 <filename>isa/gus/gus_pcm.c</filename>. 5207 <filename>isa/gus/gus_pcm.c</filename>.
5208 </para> 5208 </para>
5209 </section> 5209 </section>
5210 5210
5211 <section id="buffer-and-memory-non-contiguous"> 5211 <section id="buffer-and-memory-non-contiguous">
5212 <title>Non-Contiguous Buffers</title> 5212 <title>Non-Contiguous Buffers</title>
5213 <para> 5213 <para>
5214 If your hardware supports a page table like emu10k1 or 5214 If your hardware supports a page table like emu10k1 or
5215 buffer descriptors like via82xx, you can use scatter-gather 5215 buffer descriptors like via82xx, you can use scatter-gather
5216 (SG) DMA. ALSA provides an interface for handling SG-buffers. 5216 (SG) DMA. ALSA provides an interface for handling SG-buffers.
5217 The API is provided in <filename>&lt;sound/pcm.h&gt;</filename>. 5217 The API is provided in <filename>&lt;sound/pcm.h&gt;</filename>.
5218 </para> 5218 </para>
5219 5219
5220 <para> 5220 <para>
5221 For creating the SG-buffer handler, call 5221 For creating the SG-buffer handler, call
5222 <function>snd_pcm_lib_preallocate_pages()</function> or 5222 <function>snd_pcm_lib_preallocate_pages()</function> or
5223 <function>snd_pcm_lib_preallocate_pages_for_all()</function> 5223 <function>snd_pcm_lib_preallocate_pages_for_all()</function>
5224 with <constant>SNDRV_DMA_TYPE_DEV_SG</constant> 5224 with <constant>SNDRV_DMA_TYPE_DEV_SG</constant>
5225 in the PCM constructor like other PCI pre-allocators. 5225 in the PCM constructor like other PCI pre-allocators.
5226 You need to pass the <function>snd_dma_pci_data(pci)</function>, 5226 You need to pass the <function>snd_dma_pci_data(pci)</function>,
5227 where pci is the struct <structname>pci_dev</structname> pointer 5227 where pci is the struct <structname>pci_dev</structname> pointer
5228 of the chip as well. 5228 of the chip as well.
5229 The <type>struct snd_sg_buf</type> instance is created as 5229 The <type>struct snd_sg_buf</type> instance is created as
5230 substream-&gt;dma_private. You can cast 5230 substream-&gt;dma_private. You can cast
5231 the pointer like: 5231 the pointer like:
5232 5232
5233 <informalexample> 5233 <informalexample>
5234 <programlisting> 5234 <programlisting>
5235 <![CDATA[ 5235 <![CDATA[
5236 struct snd_sg_buf *sgbuf = (struct snd_sg_buf *)substream->dma_private; 5236 struct snd_sg_buf *sgbuf = (struct snd_sg_buf *)substream->dma_private;
5237 ]]> 5237 ]]>
5238 </programlisting> 5238 </programlisting>
5239 </informalexample> 5239 </informalexample>
5240 </para> 5240 </para>
5241 5241
5242 <para> 5242 <para>
5243 Then call <function>snd_pcm_lib_malloc_pages()</function> 5243 Then call <function>snd_pcm_lib_malloc_pages()</function>
5244 in the <structfield>hw_params</structfield> callback, 5244 in the <structfield>hw_params</structfield> callback,
5245 just as in the case of a normal PCI buffer. 5245 just as in the case of a normal PCI buffer.
5246 The SG-buffer handler will allocate non-contiguous kernel 5246 The SG-buffer handler will allocate non-contiguous kernel
5247 pages of the given size and map them onto virtually contiguous 5247 pages of the given size and map them onto virtually contiguous
5248 memory. The virtual pointer is stored in runtime-&gt;dma_area. 5248 memory. The virtual pointer is stored in runtime-&gt;dma_area.
5249 The physical address (runtime-&gt;dma_addr) is set to zero, 5249 The physical address (runtime-&gt;dma_addr) is set to zero,
5250 because the buffer is physically non-contiguous. 5250 because the buffer is physically non-contiguous.
5251 The physical address table is set up in sgbuf-&gt;table. 5251 The physical address table is set up in sgbuf-&gt;table.
5252 You can get the physical address at a certain offset via 5252 You can get the physical address at a certain offset via
5253 <function>snd_pcm_sgbuf_get_addr()</function>. 5253 <function>snd_pcm_sgbuf_get_addr()</function>.
5254 </para> 5254 </para>
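<para>
The idea behind looking up a physical address at a given buffer offset can be
illustrated with a small user-space model. This is not the ALSA implementation:
<structname>model_sg_page</structname>,
<function>model_sgbuf_get_addr()</function> and the 4 KiB page size are
assumptions made for the sketch. The table holds one address per page, and the
offset is split into a page index and an offset within that page.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12                      /* assumed 4 KiB pages */
#define MODEL_PAGE_SIZE  ((size_t)1 << MODEL_PAGE_SHIFT)

/* Hypothetical table entry: the physical address of one page. */
struct model_sg_page {
	uint64_t addr;
};

/* Physical address that backs byte `offset` of the virtual buffer. */
static uint64_t model_sgbuf_get_addr(const struct model_sg_page *table,
				     size_t offset)
{
	return table[offset >> MODEL_PAGE_SHIFT].addr +
	       (offset & (MODEL_PAGE_SIZE - 1));
}
```

<para>
Because consecutive table entries need not be adjacent in physical memory, two
offsets one byte apart can map to unrelated addresses when they straddle a page
boundary.
</para>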
5255 5255
5256 <para> 5256 <para>
5257 When an SG-handler is used, you need to set 5257 When an SG-handler is used, you need to set
5258 <function>snd_pcm_sgbuf_ops_page</function> as 5258 <function>snd_pcm_sgbuf_ops_page</function> as
5259 the <structfield>page</structfield> callback. 5259 the <structfield>page</structfield> callback.
5260 (See <link linkend="pcm-interface-operators-page-callback"> 5260 (See <link linkend="pcm-interface-operators-page-callback">
5261 <citetitle>page callback section</citetitle></link>.) 5261 <citetitle>page callback section</citetitle></link>.)
5262 </para> 5262 </para>
5263 5263
5264 <para> 5264 <para>
5265 For releasing the data, call 5265 For releasing the data, call
5266 <function>snd_pcm_lib_free_pages()</function> in the 5266 <function>snd_pcm_lib_free_pages()</function> in the
5267 <structfield>hw_free</structfield> callback as usual. 5267 <structfield>hw_free</structfield> callback as usual.
5268 </para> 5268 </para>
5269 </section> 5269 </section>
5270 5270
5271 <section id="buffer-and-memory-vmalloced"> 5271 <section id="buffer-and-memory-vmalloced">
5272 <title>Vmalloc'ed Buffers</title> 5272 <title>Vmalloc'ed Buffers</title>
5273 <para> 5273 <para>
5274 It's possible to use a buffer allocated via 5274 It's possible to use a buffer allocated via
5275 <function>vmalloc</function>, for example, for an intermediate 5275 <function>vmalloc</function>, for example, for an intermediate
5276 buffer. Since the allocated pages are not contiguous, you need 5276 buffer. Since the allocated pages are not contiguous, you need
5277 to set the <structfield>page</structfield> callback to obtain 5277 to set the <structfield>page</structfield> callback to obtain
5278 the physical address at every offset. 5278 the physical address at every offset.
5279 </para> 5279 </para>
5280 5280
5281 <para> 5281 <para>
5282 The implementation of <structfield>page</structfield> callback 5282 The implementation of <structfield>page</structfield> callback
5283 would be like this: 5283 would be like this:
5284 5284
5285 <informalexample> 5285 <informalexample>
5286 <programlisting> 5286 <programlisting>
5287 <![CDATA[ 5287 <![CDATA[
5288 #include <linux/vmalloc.h> 5288 #include <linux/vmalloc.h>
5289 5289
5290 /* get the physical page pointer on the given offset */ 5290 /* get the physical page pointer on the given offset */
5291 static struct page *mychip_page(struct snd_pcm_substream *substream, 5291 static struct page *mychip_page(struct snd_pcm_substream *substream,
5292 unsigned long offset) 5292 unsigned long offset)
5293 { 5293 {
5294 void *pageptr = substream->runtime->dma_area + offset; 5294 void *pageptr = substream->runtime->dma_area + offset;
5295 return vmalloc_to_page(pageptr); 5295 return vmalloc_to_page(pageptr);
5296 } 5296 }
5297 ]]> 5297 ]]>
5298 </programlisting> 5298 </programlisting>
5299 </informalexample> 5299 </informalexample>
5300 </para> 5300 </para>
5301 </section> 5301 </section>
5302 5302
5303 </chapter> 5303 </chapter>
5304 5304
5305 5305
5306 <!-- ****************************************************** --> 5306 <!-- ****************************************************** -->
5307 <!-- Proc Interface --> 5307 <!-- Proc Interface -->
5308 <!-- ****************************************************** --> 5308 <!-- ****************************************************** -->
5309 <chapter id="proc-interface"> 5309 <chapter id="proc-interface">
5310 <title>Proc Interface</title> 5310 <title>Proc Interface</title>
5311 <para> 5311 <para>
5312 ALSA provides an easy interface for procfs. The proc files are 5312 ALSA provides an easy interface for procfs. The proc files are
5313 very useful for debugging. I recommend you set up proc files if 5313 very useful for debugging. I recommend you set up proc files if
5314 you write a driver and want to get a running status or register 5314 you write a driver and want to get a running status or register
5315 dumps. The API is found in 5315 dumps. The API is found in
5316 <filename>&lt;sound/info.h&gt;</filename>. 5316 <filename>&lt;sound/info.h&gt;</filename>.
5317 </para> 5317 </para>
5318 5318
5319 <para> 5319 <para>
5320 For creating a proc file, call 5320 For creating a proc file, call
5321 <function>snd_card_proc_new()</function>. 5321 <function>snd_card_proc_new()</function>.
5322 5322
5323 <informalexample> 5323 <informalexample>
5324 <programlisting> 5324 <programlisting>
5325 <![CDATA[ 5325 <![CDATA[
5326 struct snd_info_entry *entry; 5326 struct snd_info_entry *entry;
5327 int err = snd_card_proc_new(card, "my-file", &entry); 5327 int err = snd_card_proc_new(card, "my-file", &entry);
5328 ]]> 5328 ]]>
5329 </programlisting> 5329 </programlisting>
5330 </informalexample> 5330 </informalexample>
5331 5331
5332 where the second argument specifies the proc-file name to be 5332 where the second argument specifies the proc-file name to be
5333 created. The above example will create a file 5333 created. The above example will create a file
5334 <filename>my-file</filename> under the card directory, 5334 <filename>my-file</filename> under the card directory,
5335 e.g. <filename>/proc/asound/card0/my-file</filename>. 5335 e.g. <filename>/proc/asound/card0/my-file</filename>.
5336 </para> 5336 </para>
5337 5337
5338 <para> 5338 <para>
5339 Like other components, the proc entry created via 5339 Like other components, the proc entry created via
5340 <function>snd_card_proc_new()</function> will be registered and 5340 <function>snd_card_proc_new()</function> will be registered and
5341 released automatically in the card registration and release 5341 released automatically in the card registration and release
5342 functions. 5342 functions.
5343 </para> 5343 </para>
5344 5344
5345 <para> 5345 <para>
5346 When the creation is successful, the function stores a new 5346 When the creation is successful, the function stores a new
5347 instance at the pointer given in the third argument. 5347 instance at the pointer given in the third argument.
5348 It is initialized as a read-only text proc file. To use 5348 It is initialized as a read-only text proc file. To use
5349 this proc file as a read-only text file as it is, set the read 5349 this proc file as a read-only text file as it is, set the read
5350 callback with private data via 5350 callback with private data via
5351 <function>snd_info_set_text_ops()</function>. 5351 <function>snd_info_set_text_ops()</function>.
5352 5352
5353 <informalexample> 5353 <informalexample>
5354 <programlisting> 5354 <programlisting>
5355 <![CDATA[ 5355 <![CDATA[
5356 snd_info_set_text_ops(entry, chip, my_proc_read); 5356 snd_info_set_text_ops(entry, chip, my_proc_read);
5357 ]]> 5357 ]]>
5358 </programlisting> 5358 </programlisting>
5359 </informalexample> 5359 </informalexample>
5360 5360
5361 where the second argument (<parameter>chip</parameter>) is the 5361 where the second argument (<parameter>chip</parameter>) is the
5362 private data to be used in the callbacks. 5362 private data to be used in the callbacks.
5363 The third argument 5363 The third argument
5364 (<parameter>my_proc_read</parameter>) is the callback function, which 5364 (<parameter>my_proc_read</parameter>) is the callback function, which
5365 is defined like 5365 is defined like
5366 5366
5367 <informalexample> 5367 <informalexample>
5368 <programlisting> 5368 <programlisting>
5369 <![CDATA[ 5369 <![CDATA[
5370 static void my_proc_read(struct snd_info_entry *entry, 5370 static void my_proc_read(struct snd_info_entry *entry,
5371 struct snd_info_buffer *buffer); 5371 struct snd_info_buffer *buffer);
5372 ]]> 5372 ]]>
5373 </programlisting> 5373 </programlisting>
5374 </informalexample> 5374 </informalexample>
5375 5375
5376 </para> 5376 </para>
5377 5377
5378 <para> 5378 <para>
5379 In the read callback, use <function>snd_iprintf()</function> for 5379 In the read callback, use <function>snd_iprintf()</function> for
5380 output strings, which works just like normal 5380 output strings, which works just like normal
5381 <function>printf()</function>. For example, 5381 <function>printf()</function>. For example,
5382 5382
5383 <informalexample> 5383 <informalexample>
5384 <programlisting> 5384 <programlisting>
5385 <![CDATA[ 5385 <![CDATA[
5386 static void my_proc_read(struct snd_info_entry *entry, 5386 static void my_proc_read(struct snd_info_entry *entry,
5387 struct snd_info_buffer *buffer) 5387 struct snd_info_buffer *buffer)
5388 { 5388 {
5389 struct my_chip *chip = entry->private_data; 5389 struct my_chip *chip = entry->private_data;
5390 5390
5391 snd_iprintf(buffer, "This is my chip!\n"); 5391 snd_iprintf(buffer, "This is my chip!\n");
5392 snd_iprintf(buffer, "Port = %ld\n", chip->port); 5392 snd_iprintf(buffer, "Port = %ld\n", chip->port);
5393 } 5393 }
5394 ]]> 5394 ]]>
5395 </programlisting> 5395 </programlisting>
5396 </informalexample> 5396 </informalexample>
5397 </para> 5397 </para>
5398 5398
5399 <para> 5399 <para>
5400 The file permission can be changed afterwards. By default, it's 5400 The file permission can be changed afterwards. By default, it's
5401 set as read-only for all users. If you want to add write 5401 set as read-only for all users. If you want to add write
5402 permission for the user (root by default), do as follows: 5402 permission for the user (root by default), do as follows:
5403 5403
5404 <informalexample> 5404 <informalexample>
5405 <programlisting> 5405 <programlisting>
5406 <![CDATA[ 5406 <![CDATA[
5407 entry->mode = S_IFREG | S_IRUGO | S_IWUSR; 5407 entry->mode = S_IFREG | S_IRUGO | S_IWUSR;
5408 ]]> 5408 ]]>
5409 </programlisting> 5409 </programlisting>
5410 </informalexample> 5410 </informalexample>
5411 5411
5412 and set the write callback: 5412 and set the write callback:
5413 5413
5414 <informalexample> 5414 <informalexample>
5415 <programlisting> 5415 <programlisting>
5416 <![CDATA[ 5416 <![CDATA[
5417 entry->c.text.write = my_proc_write; 5417 entry->c.text.write = my_proc_write;
5418 ]]> 5418 ]]>
5419 </programlisting> 5419 </programlisting>
5420 </informalexample> 5420 </informalexample>
5421 </para> 5421 </para>
5422 5422
5423 <para> 5423 <para>
5424 For the write callback, you can use 5424 For the write callback, you can use
5425 <function>snd_info_get_line()</function> to get a text line, and 5425 <function>snd_info_get_line()</function> to get a text line, and
5426 <function>snd_info_get_str()</function> to retrieve a string from 5426 <function>snd_info_get_str()</function> to retrieve a string from
5427 the line. Some examples are found in 5427 the line. Some examples are found in
5428 <filename>core/oss/mixer_oss.c</filename> and 5428 <filename>core/oss/mixer_oss.c</filename> and
5429 <filename>core/oss/pcm_oss.c</filename>. 5429 <filename>core/oss/pcm_oss.c</filename>.
5430 </para> 5430 </para>
5431 5431
5432 <para> 5432 <para>
5433 For a raw-data proc-file, set the attributes like the following: 5433 For a raw-data proc-file, set the attributes like the following:
5434 5434
5435 <informalexample> 5435 <informalexample>
5436 <programlisting> 5436 <programlisting>
5437 <![CDATA[ 5437 <![CDATA[
5438 static struct snd_info_entry_ops my_file_io_ops = { 5438 static struct snd_info_entry_ops my_file_io_ops = {
5439 .read = my_file_io_read, 5439 .read = my_file_io_read,
5440 }; 5440 };
5441 5441
5442 entry->content = SNDRV_INFO_CONTENT_DATA; 5442 entry->content = SNDRV_INFO_CONTENT_DATA;
5443 entry->private_data = chip; 5443 entry->private_data = chip;
5444 entry->c.ops = &my_file_io_ops; 5444 entry->c.ops = &my_file_io_ops;
5445 entry->size = 4096; 5445 entry->size = 4096;
5446 entry->mode = S_IFREG | S_IRUGO; 5446 entry->mode = S_IFREG | S_IRUGO;
5447 ]]> 5447 ]]>
5448 </programlisting> 5448 </programlisting>
5449 </informalexample> 5449 </informalexample>
5450 </para> 5450 </para>
5451 5451
5452 <para> 5452 <para>
5453 The callback is much more complicated than the text-file 5453 The callback is much more complicated than the text-file
5454 version. You need to use low-level I/O functions such as 5454 version. You need to use low-level I/O functions such as
5455 <function>copy_from/to_user()</function> to transfer the 5455 <function>copy_from/to_user()</function> to transfer the
5456 data. 5456 data.
5457 5457
5458 <informalexample> 5458 <informalexample>
5459 <programlisting> 5459 <programlisting>
5460 <![CDATA[ 5460 <![CDATA[
5461 static long my_file_io_read(struct snd_info_entry *entry, 5461 static long my_file_io_read(struct snd_info_entry *entry,
5462 void *file_private_data, 5462 void *file_private_data,
5463 struct file *file, 5463 struct file *file,
5464 char *buf, 5464 char *buf,
5465 unsigned long count, 5465 unsigned long count,
5466 unsigned long pos) 5466 unsigned long pos)
5467 { 5467 {
5468 long size = count; 5468 long size = count;
5469 if (pos + size > local_max_size) 5469 if (pos + size > local_max_size)
5470 size = local_max_size - pos; 5470 size = local_max_size - pos;
5471 if (copy_to_user(buf, local_data + pos, size)) 5471 if (copy_to_user(buf, local_data + pos, size))
5472 return -EFAULT; 5472 return -EFAULT;
5473 return size; 5473 return size;
5474 } 5474 }
5475 ]]> 5475 ]]>
5476 </programlisting> 5476 </programlisting>
5477 </informalexample> 5477 </informalexample>
5478 </para> 5478 </para>
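<para>
The bounds handling in a raw read callback deserves care when the read position
is at or past the end of the data. The following user-space sketch shows one
way to clamp the request. It is an illustration under assumptions, not the
driver callback: <function>model_io_read()</function> and its parameters are
hypothetical, and <function>memcpy()</function> stands in for
<function>copy_to_user()</function>.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy up to `count` bytes of `data` (of length data_size), starting at
 * byte offset `pos`, into `buf`.  Returns the number of bytes copied,
 * or 0 when pos is at or beyond the end of the data.
 */
static long model_io_read(const char *data, size_t data_size,
			  char *buf, size_t count, size_t pos)
{
	if (pos >= data_size)
		return 0;                    /* nothing left to read */
	if (count > data_size - pos)
		count = data_size - pos;     /* clamp to the remaining bytes */
	memcpy(buf, data + pos, count);      /* kernel code: copy_to_user() */
	return (long)count;
}
```

<para>
Checking the position before subtracting keeps the remaining-bytes computation
from going negative or wrapping around.
</para>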
5479 5479
5480 </chapter> 5480 </chapter>
5481 5481
5482 5482
5483 <!-- ****************************************************** --> 5483 <!-- ****************************************************** -->
5484 <!-- Power Management --> 5484 <!-- Power Management -->
5485 <!-- ****************************************************** --> 5485 <!-- ****************************************************** -->
5486 <chapter id="power-management"> 5486 <chapter id="power-management">
5487 <title>Power Management</title> 5487 <title>Power Management</title>
5488 <para> 5488 <para>
5489 If the chip is supposed to work with with suspend/resume 5489 If the chip is supposed to work with suspend/resume
5490 functions, you need to add power-management code to the 5490 functions, you need to add power-management code to the
5491 driver. The additional code for power management should be 5491 driver. The additional code for power management should be
5492 <function>ifdef</function>'ed with 5492 <function>ifdef</function>'ed with
5493 <constant>CONFIG_PM</constant>. 5493 <constant>CONFIG_PM</constant>.
5494 </para> 5494 </para>
5495 5495
5496 <para> 5496 <para>
5497 If the driver supports the suspend/resume 5497 If the driver supports the suspend/resume
5498 <emphasis>fully</emphasis>, that is, the device can be 5498 <emphasis>fully</emphasis>, that is, the device can be
5499 properly resumed to its state when suspend was called, 5499 properly resumed to its state when suspend was called,
5500 you can set the <constant>SNDRV_PCM_INFO_RESUME</constant> flag 5500 you can set the <constant>SNDRV_PCM_INFO_RESUME</constant> flag
5501 in the pcm info field. Usually, this is possible when the 5501 in the pcm info field. Usually, this is possible when the
5502 registers of the chip can be safely saved to and restored from 5502 registers of the chip can be safely saved to and restored from
5503 RAM. If this is set, the trigger callback is called with 5503 RAM. If this is set, the trigger callback is called with
5504 <constant>SNDRV_PCM_TRIGGER_RESUME</constant> after the resume 5504 <constant>SNDRV_PCM_TRIGGER_RESUME</constant> after the resume
5505 callback is finished. 5505 callback is finished.
5506 </para> 5506 </para>
5507 5507
5508 <para> 5508 <para>
5509 Even if the driver doesn't support PM fully and only 5509 Even if the driver doesn't support PM fully and only
5510 partial suspend/resume is possible, it's still worth 5510 partial suspend/resume is possible, it's still worth
5511 implementing suspend/resume callbacks. In such a case, applications 5511 implementing suspend/resume callbacks. In such a case, applications
5512 would reset the status by calling 5512 would reset the status by calling
5513 <function>snd_pcm_prepare()</function> and restart the stream 5513 <function>snd_pcm_prepare()</function> and restart the stream
5514 appropriately. Hence, you can define suspend/resume callbacks 5514 appropriately. Hence, you can define suspend/resume callbacks
5515 below but don't set <constant>SNDRV_PCM_INFO_RESUME</constant> 5515 below but don't set <constant>SNDRV_PCM_INFO_RESUME</constant>
5516 info flag to the PCM. 5516 info flag to the PCM.
5517 </para> 5517 </para>
5518 5518
5519 <para> 5519 <para>
5520 Note that the trigger with SUSPEND can always be called when 5520 Note that the trigger with SUSPEND can always be called when
5521 <function>snd_pcm_suspend_all()</function> is called, 5521 <function>snd_pcm_suspend_all()</function> is called,
5522 regardless of the <constant>SNDRV_PCM_INFO_RESUME</constant> flag. 5522 regardless of the <constant>SNDRV_PCM_INFO_RESUME</constant> flag.
5523 The <constant>RESUME</constant> flag affects only the behavior 5523 The <constant>RESUME</constant> flag affects only the behavior
5524 of <function>snd_pcm_resume()</function>. 5524 of <function>snd_pcm_resume()</function>.
5525 (Thus, in theory, 5525 (Thus, in theory,
5526 <constant>SNDRV_PCM_TRIGGER_RESUME</constant> doesn't need 5526 <constant>SNDRV_PCM_TRIGGER_RESUME</constant> doesn't need
5527 to be handled in the trigger callback when no 5527 to be handled in the trigger callback when no
5528 <constant>SNDRV_PCM_INFO_RESUME</constant> flag is set. But 5528 <constant>SNDRV_PCM_INFO_RESUME</constant> flag is set. But
5529 it's better to keep it for compatibility reasons.) 5529 it's better to keep it for compatibility reasons.)
5530 </para> 5530 </para>
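<para>
As a rough illustration of the point above, the sketch below shows a
trigger callback that accepts the suspend and resume commands alongside
start and stop. It is a simplified user-space model, not actual driver
code: the <constant>MY_TRIGGER_*</constant> values and the reduced
mychip record are assumptions made for this example, standing in for
the real <constant>SNDRV_PCM_TRIGGER_*</constant> constants and the PCM
substream argument. Treating suspend like stop and resume like start is
a common choice, but your hardware may need more.

<informalexample>
<programlisting>
<![CDATA[
```c
/* Stand-in values for illustration only; a real driver uses the
 * SNDRV_PCM_TRIGGER_* constants provided by the ALSA headers. */
enum {
	MY_TRIGGER_STOP,
	MY_TRIGGER_START,
	MY_TRIGGER_SUSPEND,
	MY_TRIGGER_RESUME,
};

/* Hypothetical chip record: only tracks whether the DMA engine runs. */
struct mychip {
	int running;
};

/* Sketch of a trigger callback. SUSPEND is handled like STOP and
 * RESUME like START; any other command is rejected. */
static int mychip_trigger(struct mychip *chip, int cmd)
{
	switch (cmd) {
	case MY_TRIGGER_START:
	case MY_TRIGGER_RESUME:
		chip->running = 1;	/* (re)start the DMA engine */
		return 0;
	case MY_TRIGGER_STOP:
	case MY_TRIGGER_SUSPEND:
		chip->running = 0;	/* stop the DMA engine */
		return 0;
	default:
		return -1;		/* a real driver would return -EINVAL */
	}
}
```
]]>
</programlisting>
</informalexample>

Keeping the resume case in the switch costs nothing when the
<constant>SNDRV_PCM_INFO_RESUME</constant> flag is not set, which is
why it is usually left in place.
</para>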
5531 <para> 5531 <para>
5532 In earlier versions of the ALSA drivers, a common 5532 In earlier versions of the ALSA drivers, a common
5533 power-management layer was provided, but it has been removed. 5533 power-management layer was provided, but it has been removed.
5534 The driver needs to define the suspend/resume hooks according to 5534 The driver needs to define the suspend/resume hooks according to
5535 the bus the device is assigned to. In the case of a PCI driver, the 5535 the bus the device is assigned to. In the case of a PCI driver, the
5536 callbacks look like this: 5536 callbacks look like this:
5537 5537
5538 <informalexample> 5538 <informalexample>
5539 <programlisting> 5539 <programlisting>
5540 <![CDATA[ 5540 <![CDATA[
5541 #ifdef CONFIG_PM 5541 #ifdef CONFIG_PM
5542 static int snd_my_suspend(struct pci_dev *pci, pm_message_t state) 5542 static int snd_my_suspend(struct pci_dev *pci, pm_message_t state)
5543 { 5543 {
5544 .... /* do things for suspend */ 5544 .... /* do things for suspend */
5545 return 0; 5545 return 0;
5546 } 5546 }
5547 static int snd_my_resume(struct pci_dev *pci) 5547 static int snd_my_resume(struct pci_dev *pci)
5548 { 5548 {
5549 .... /* do things for resume */ 5549 .... /* do things for resume */
5550 return 0; 5550 return 0;
5551 } 5551 }
5552 #endif 5552 #endif
5553 ]]> 5553 ]]>
5554 </programlisting> 5554 </programlisting>
5555 </informalexample> 5555 </informalexample>
5556 </para> 5556 </para>
5557 5557
5558 <para> 5558 <para>
5559 The scheme of the real suspend job is as follows. 5559 The scheme of the real suspend job is as follows.
5560 5560
5561 <orderedlist> 5561 <orderedlist>
5562 <listitem><para>Retrieve the card and the chip data.</para></listitem> 5562 <listitem><para>Retrieve the card and the chip data.</para></listitem>
5563 <listitem><para>Call <function>snd_power_change_state()</function> with 5563 <listitem><para>Call <function>snd_power_change_state()</function> with
5564 <constant>SNDRV_CTL_POWER_D3hot</constant> to change the 5564 <constant>SNDRV_CTL_POWER_D3hot</constant> to change the
5565 power status.</para></listitem> 5565 power status.</para></listitem>
5566 <listitem><para>Call <function>snd_pcm_suspend_all()</function> to suspend the running PCM streams.</para></listitem> 5566 <listitem><para>Call <function>snd_pcm_suspend_all()</function> to suspend the running PCM streams.</para></listitem>
5567 <listitem><para>If AC97 codecs are used, call 5567 <listitem><para>If AC97 codecs are used, call
5568 <function>snd_ac97_suspend()</function> for each codec.</para></listitem> 5568 <function>snd_ac97_suspend()</function> for each codec.</para></listitem>
5569 <listitem><para>Save the register values if necessary.</para></listitem> 5569 <listitem><para>Save the register values if necessary.</para></listitem>
5570 <listitem><para>Stop the hardware if necessary.</para></listitem> 5570 <listitem><para>Stop the hardware if necessary.</para></listitem>
5571 <listitem><para>Disable the PCI device by calling 5571 <listitem><para>Disable the PCI device by calling
5572 <function>pci_disable_device()</function>. Finally, call 5572 <function>pci_disable_device()</function>. Finally, call
5573 <function>pci_save_state()</function>.</para></listitem> 5573 <function>pci_save_state()</function>.</para></listitem>
5574 </orderedlist> 5574 </orderedlist>
5575 </para> 5575 </para>
5576 5576
5577 <para> 5577 <para>
5578 Typical code would look like: 5578 Typical code would look like:
5579 5579
5580 <informalexample> 5580 <informalexample>
5581 <programlisting> 5581 <programlisting>
5582 <![CDATA[ 5582 <![CDATA[
5583 static int mychip_suspend(struct pci_dev *pci, pm_message_t state) 5583 static int mychip_suspend(struct pci_dev *pci, pm_message_t state)
5584 { 5584 {
5585 /* (1) */ 5585 /* (1) */
5586 struct snd_card *card = pci_get_drvdata(pci); 5586 struct snd_card *card = pci_get_drvdata(pci);
5587 struct mychip *chip = card->private_data; 5587 struct mychip *chip = card->private_data;
5588 /* (2) */ 5588 /* (2) */
5589 snd_power_change_state(card, SNDRV_CTL_POWER_D3hot); 5589 snd_power_change_state(card, SNDRV_CTL_POWER_D3hot);
5590 /* (3) */ 5590 /* (3) */
5591 snd_pcm_suspend_all(chip->pcm); 5591 snd_pcm_suspend_all(chip->pcm);
5592 /* (4) */ 5592 /* (4) */
5593 snd_ac97_suspend(chip->ac97); 5593 snd_ac97_suspend(chip->ac97);
5594 /* (5) */ 5594 /* (5) */
5595 snd_mychip_save_registers(chip); 5595 snd_mychip_save_registers(chip);
5596 /* (6) */ 5596 /* (6) */
5597 snd_mychip_stop_hardware(chip); 5597 snd_mychip_stop_hardware(chip);
5598 /* (7) */ 5598 /* (7) */
5599 pci_disable_device(pci); 5599 pci_disable_device(pci);
5600 pci_save_state(pci); 5600 pci_save_state(pci);
5601 return 0; 5601 return 0;
5602 } 5602 }
5603 ]]> 5603 ]]>
5604 </programlisting> 5604 </programlisting>
5605 </informalexample> 5605 </informalexample>
5606 </para> 5606 </para>
5607 5607
5608 <para> 5608 <para>
5609 The scheme of the real resume job is as follows. 5609 The scheme of the real resume job is as follows.
5610 5610
5611 <orderedlist> 5611 <orderedlist>
5612 <listitem><para>Retrieve the card and the chip data.</para></listitem> 5612 <listitem><para>Retrieve the card and the chip data.</para></listitem>
5613 <listitem><para>Set up PCI. First, call <function>pci_restore_state()</function>. 5613 <listitem><para>Set up PCI. First, call <function>pci_restore_state()</function>.
5614 Then enable the pci device again by calling <function>pci_enable_device()</function>. 5614 Then enable the pci device again by calling <function>pci_enable_device()</function>.
5615 Call <function>pci_set_master()</function> if necessary, too.</para></listitem> 5615 Call <function>pci_set_master()</function> if necessary, too.</para></listitem>
5616 <listitem><para>Re-initialize the chip.</para></listitem> 5616 <listitem><para>Re-initialize the chip.</para></listitem>
5617 <listitem><para>Restore the saved registers if necessary.</para></listitem> 5617 <listitem><para>Restore the saved registers if necessary.</para></listitem>
5618 <listitem><para>Resume the mixer, e.g. calling 5618 <listitem><para>Resume the mixer, e.g. calling
5619 <function>snd_ac97_resume()</function>.</para></listitem> 5619 <function>snd_ac97_resume()</function>.</para></listitem>
5620 <listitem><para>Restart the hardware (if any).</para></listitem> 5620 <listitem><para>Restart the hardware (if any).</para></listitem>
5621 <listitem><para>Call <function>snd_power_change_state()</function> with 5621 <listitem><para>Call <function>snd_power_change_state()</function> with
5622 <constant>SNDRV_CTL_POWER_D0</constant> to notify the processes.</para></listitem> 5622 <constant>SNDRV_CTL_POWER_D0</constant> to notify the processes.</para></listitem>
5623 </orderedlist> 5623 </orderedlist>
5624 </para> 5624 </para>
5625 5625
5626 <para> 5626 <para>
5627 Typical code would look like: 5627 Typical code would look like:
5628 5628
5629 <informalexample> 5629 <informalexample>
5630 <programlisting> 5630 <programlisting>
5631 <![CDATA[ 5631 <![CDATA[
5632 static int mychip_resume(struct pci_dev *pci) 5632 static int mychip_resume(struct pci_dev *pci)
5633 { 5633 {
5634 /* (1) */ 5634 /* (1) */
5635 struct snd_card *card = pci_get_drvdata(pci); 5635 struct snd_card *card = pci_get_drvdata(pci);
5636 struct mychip *chip = card->private_data; 5636 struct mychip *chip = card->private_data;
5637 /* (2) */ 5637 /* (2) */
5638 pci_restore_state(pci); 5638 pci_restore_state(pci);
5639 pci_enable_device(pci); 5639 pci_enable_device(pci);
5640 pci_set_master(pci); 5640 pci_set_master(pci);
5641 /* (3) */ 5641 /* (3) */
5642 snd_mychip_reinit_chip(chip); 5642 snd_mychip_reinit_chip(chip);
5643 /* (4) */ 5643 /* (4) */
5644 snd_mychip_restore_registers(chip); 5644 snd_mychip_restore_registers(chip);
5645 /* (5) */ 5645 /* (5) */
5646 snd_ac97_resume(chip->ac97); 5646 snd_ac97_resume(chip->ac97);
5647 /* (6) */ 5647 /* (6) */
5648 snd_mychip_restart_chip(chip); 5648 snd_mychip_restart_chip(chip);
5649 /* (7) */ 5649 /* (7) */
5650 snd_power_change_state(card, SNDRV_CTL_POWER_D0); 5650 snd_power_change_state(card, SNDRV_CTL_POWER_D0);
5651 return 0; 5651 return 0;
5652 } 5652 }
5653 ]]> 5653 ]]>
5654 </programlisting> 5654 </programlisting>
5655 </informalexample> 5655 </informalexample>
5656 </para> 5656 </para>
5657 5657
5658 <para> 5658 <para>
5659 As shown above, it's better to save registers after 5659 As shown above, it's better to save registers after
5660 suspending the PCM operations via 5660 suspending the PCM operations via
5661 <function>snd_pcm_suspend_all()</function> or 5661 <function>snd_pcm_suspend_all()</function> or
5662 <function>snd_pcm_suspend()</function>. It means that the PCM 5662 <function>snd_pcm_suspend()</function>. It means that the PCM
5663 streams are already stopped when the register snapshot is 5663 streams are already stopped when the register snapshot is
5664 taken. But remember that you don't have to restart the PCM 5664 taken. But remember that you don't have to restart the PCM
5665 stream in the resume callback. It'll be restarted via a 5665 stream in the resume callback. It'll be restarted via a
5666 trigger call with <constant>SNDRV_PCM_TRIGGER_RESUME</constant> 5666 trigger call with <constant>SNDRV_PCM_TRIGGER_RESUME</constant>
5667 when necessary. 5667 when necessary.
5668 </para> 5668 </para>
5669 5669
5670 <para> 5670 <para>
5671 OK, we have all callbacks now. Let's set them up. In the 5671 OK, we have all callbacks now. Let's set them up. In the
5672 initialization of the card, make sure that you can get the chip 5672 initialization of the card, make sure that you can get the chip
5673 data from the card instance, typically via 5673 data from the card instance, typically via
5674 <structfield>private_data</structfield> field, in case you 5674 <structfield>private_data</structfield> field, in case you
5675 created the chip data individually. 5675 created the chip data individually.
5676 5676
5677 <informalexample> 5677 <informalexample>
5678 <programlisting> 5678 <programlisting>
5679 <![CDATA[ 5679 <![CDATA[
5680 static int __devinit snd_mychip_probe(struct pci_dev *pci, 5680 static int __devinit snd_mychip_probe(struct pci_dev *pci,
5681 const struct pci_device_id *pci_id) 5681 const struct pci_device_id *pci_id)
5682 { 5682 {
5683 .... 5683 ....
5684 struct snd_card *card; 5684 struct snd_card *card;
5685 struct mychip *chip; 5685 struct mychip *chip;
5686 .... 5686 ....
5687 card = snd_card_new(index[dev], id[dev], THIS_MODULE, NULL); 5687 card = snd_card_new(index[dev], id[dev], THIS_MODULE, NULL);
5688 .... 5688 ....
5689 chip = kzalloc(sizeof(*chip), GFP_KERNEL); 5689 chip = kzalloc(sizeof(*chip), GFP_KERNEL);
5690 .... 5690 ....
5691 card->private_data = chip; 5691 card->private_data = chip;
5692 .... 5692 ....
5693 } 5693 }
5694 ]]> 5694 ]]>
5695 </programlisting> 5695 </programlisting>
5696 </informalexample> 5696 </informalexample>
5697 5697
5698 If you created the chip data with 5698 If you created the chip data with
5699 <function>snd_card_new()</function>, it's accessible 5699 <function>snd_card_new()</function>, it's accessible
5700 via the <structfield>private_data</structfield> field anyway. 5700 via the <structfield>private_data</structfield> field anyway.
5701 5701
5702 <informalexample> 5702 <informalexample>
5703 <programlisting> 5703 <programlisting>
5704 <![CDATA[ 5704 <![CDATA[
5705 static int __devinit snd_mychip_probe(struct pci_dev *pci, 5705 static int __devinit snd_mychip_probe(struct pci_dev *pci,
5706 const struct pci_device_id *pci_id) 5706 const struct pci_device_id *pci_id)
5707 { 5707 {
5708 .... 5708 ....
5709 struct snd_card *card; 5709 struct snd_card *card;
5710 struct mychip *chip; 5710 struct mychip *chip;
5711 .... 5711 ....
5712 card = snd_card_new(index[dev], id[dev], THIS_MODULE, 5712 card = snd_card_new(index[dev], id[dev], THIS_MODULE,
5713 sizeof(struct mychip)); 5713 sizeof(struct mychip));
5714 .... 5714 ....
5715 chip = card->private_data; 5715 chip = card->private_data;
5716 .... 5716 ....
5717 } 5717 }
5718 ]]> 5718 ]]>
5719 </programlisting> 5719 </programlisting>
5720 </informalexample> 5720 </informalexample>
5721 5721
5722 </para> 5722 </para>
5723 5723
5724 <para> 5724 <para>
5725 If you need space for saving the registers, allocate the 5725 If you need space for saving the registers, allocate the
5726 buffer for it here, too, since it would be fatal 5726 buffer for it here, too, since it would be fatal
5727 if memory could not be allocated in the suspend phase. 5727 if memory could not be allocated in the suspend phase.
5728 The allocated buffer should be released in the corresponding 5728 The allocated buffer should be released in the corresponding
5729 destructor. 5729 destructor.
5730 </para> 5730 </para>
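<para>
The allocation advice above can be sketched as follows. This is a
user-space model under stated assumptions, not code from a real driver:
the register count, the helper names and the in-memory
<structfield>regs</structfield> array are invented for the example, and
<function>calloc()</function> and <function>free()</function> stand in
for the kernel allocators. The point it shows is only the ordering:
allocate when the chip is created, copy in the suspend path, release in
the destructor.

<informalexample>
<programlisting>
<![CDATA[
```c
#include <stdlib.h>
#include <string.h>

#define MYCHIP_NUM_REGS 16	/* hypothetical register count */

/* Hypothetical chip record; regs[] stands in for the hardware. */
struct mychip {
	unsigned int regs[MYCHIP_NUM_REGS];
	unsigned int *saved_regs;	/* suspend-time snapshot buffer */
};

/* Called from the probe path: allocate the snapshot buffer up front
 * so that the suspend path never has to allocate memory. */
static int mychip_create(struct mychip *chip)
{
	/* calloc() stands in for a kernel allocation with GFP_KERNEL */
	chip->saved_regs = calloc(MYCHIP_NUM_REGS, sizeof(*chip->saved_regs));
	if (!chip->saved_regs)
		return -1;	/* a real driver would return -ENOMEM */
	return 0;
}

/* Called from the suspend path: only copies, so it cannot fail. */
static void mychip_save_registers(struct mychip *chip)
{
	memcpy(chip->saved_regs, chip->regs, sizeof(chip->regs));
}

/* Called from the resume path: write the snapshot back. */
static void mychip_restore_registers(struct mychip *chip)
{
	memcpy(chip->regs, chip->saved_regs, sizeof(chip->regs));
}

/* Destructor: release the buffer allocated in mychip_create(). */
static void mychip_free(struct mychip *chip)
{
	free(chip->saved_regs);
	chip->saved_regs = NULL;
}
```
]]>
</programlisting>
</informalexample>

Because the only step that can fail happens at creation time, an
allocation failure is reported through the normal probe error path
rather than in the middle of a suspend.
</para>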
5731 5731
5732 <para> 5732 <para>
5733 Next, set the suspend/resume callbacks in the pci_driver. 5733 Next, set the suspend/resume callbacks in the pci_driver.
5734 5734
5735 <informalexample> 5735 <informalexample>
5736 <programlisting> 5736 <programlisting>
5737 <![CDATA[ 5737 <![CDATA[
5738 static struct pci_driver driver = { 5738 static struct pci_driver driver = {
5739 .name = "My Chip", 5739 .name = "My Chip",
5740 .id_table = snd_my_ids, 5740 .id_table = snd_my_ids,
5741 .probe = snd_my_probe, 5741 .probe = snd_my_probe,
5742 .remove = __devexit_p(snd_my_remove), 5742 .remove = __devexit_p(snd_my_remove),
5743 #ifdef CONFIG_PM 5743 #ifdef CONFIG_PM
5744 .suspend = snd_my_suspend, 5744 .suspend = snd_my_suspend,
5745 .resume = snd_my_resume, 5745 .resume = snd_my_resume,
5746 #endif 5746 #endif
5747 }; 5747 };
5748 ]]> 5748 ]]>
5749 </programlisting> 5749 </programlisting>
5750 </informalexample> 5750 </informalexample>
5751 </para> 5751 </para>
5752 5752
5753 </chapter> 5753 </chapter>
5754 5754
5755 5755
5756 <!-- ****************************************************** --> 5756 <!-- ****************************************************** -->
5757 <!-- Module Parameters --> 5757 <!-- Module Parameters -->
5758 <!-- ****************************************************** --> 5758 <!-- ****************************************************** -->
5759 <chapter id="module-parameters"> 5759 <chapter id="module-parameters">
5760 <title>Module Parameters</title> 5760 <title>Module Parameters</title>
5761 <para> 5761 <para>
5762 There are standard module options for ALSA. At least, each 5762 There are standard module options for ALSA. At least, each
5763 module should have <parameter>index</parameter>, 5763 module should have <parameter>index</parameter>,
5764 <parameter>id</parameter> and <parameter>enable</parameter> 5764 <parameter>id</parameter> and <parameter>enable</parameter>
5765 options. 5765 options.
5766 </para> 5766 </para>
5767 5767
5768 <para> 5768 <para>
5769 If the module supports multiple cards (usually up to 5769 If the module supports multiple cards (usually up to
5770 8 = <constant>SNDRV_CARDS</constant> cards), they should be 5770 8 = <constant>SNDRV_CARDS</constant> cards), they should be
5771 arrays. The default initial values are defined already as 5771 arrays. The default initial values are defined already as
5772 constants for ease of programming: 5772 constants for ease of programming:
5773 5773
5774 <informalexample> 5774 <informalexample>
5775 <programlisting> 5775 <programlisting>
5776 <![CDATA[ 5776 <![CDATA[
5777 static int index[SNDRV_CARDS] = SNDRV_DEFAULT_IDX; 5777 static int index[SNDRV_CARDS] = SNDRV_DEFAULT_IDX;
5778 static char *id[SNDRV_CARDS] = SNDRV_DEFAULT_STR; 5778 static char *id[SNDRV_CARDS] = SNDRV_DEFAULT_STR;
5779 static int enable[SNDRV_CARDS] = SNDRV_DEFAULT_ENABLE_PNP; 5779 static int enable[SNDRV_CARDS] = SNDRV_DEFAULT_ENABLE_PNP;
5780 ]]> 5780 ]]>
5781 </programlisting> 5781 </programlisting>
5782 </informalexample> 5782 </informalexample>
5783 </para> 5783 </para>
5784 5784
5785 <para> 5785 <para>
5786 If the module supports only a single card, they could be single 5786 If the module supports only a single card, they could be single
5787 variables instead. The <parameter>enable</parameter> option is not 5787 variables instead. The <parameter>enable</parameter> option is not
5788 always necessary in this case, but it wouldn't hurt to have a 5788 always necessary in this case, but it wouldn't hurt to have a
5789 dummy option for compatibility. 5789 dummy option for compatibility.
5790 </para> 5790 </para>
5791 5791
5792 <para> 5792 <para>
5793 The module parameters must be declared with the standard 5793 The module parameters must be declared with the standard
5794 <function>module_param()</function>, 5794 <function>module_param()</function>,
5795 <function>module_param_array()</function> and 5795 <function>module_param_array()</function> and
5796 <function>MODULE_PARM_DESC()</function> macros. 5796 <function>MODULE_PARM_DESC()</function> macros.
5797 </para> 5797 </para>
5798 5798
5799 <para> 5799 <para>
5800 Typical code would look like this: 5800 Typical code would look like this:
5801 5801
5802 <informalexample> 5802 <informalexample>
5803 <programlisting> 5803 <programlisting>
5804 <![CDATA[ 5804 <![CDATA[
5805 #define CARD_NAME "My Chip" 5805 #define CARD_NAME "My Chip"
5806 5806
5807 module_param_array(index, int, NULL, 0444); 5807 module_param_array(index, int, NULL, 0444);
5808 MODULE_PARM_DESC(index, "Index value for " CARD_NAME " soundcard."); 5808 MODULE_PARM_DESC(index, "Index value for " CARD_NAME " soundcard.");
5809 module_param_array(id, charp, NULL, 0444); 5809 module_param_array(id, charp, NULL, 0444);
5810 MODULE_PARM_DESC(id, "ID string for " CARD_NAME " soundcard."); 5810 MODULE_PARM_DESC(id, "ID string for " CARD_NAME " soundcard.");
5811 module_param_array(enable, bool, NULL, 0444); 5811 module_param_array(enable, bool, NULL, 0444);
5812 MODULE_PARM_DESC(enable, "Enable " CARD_NAME " soundcard."); 5812 MODULE_PARM_DESC(enable, "Enable " CARD_NAME " soundcard.");
5813 ]]> 5813 ]]>
5814 </programlisting> 5814 </programlisting>
5815 </informalexample> 5815 </informalexample>
5816 </para> 5816 </para>
5817 5817
5818 <para> 5818 <para>
5819 Also, don't forget to define the module description, classes, 5819 Also, don't forget to define the module description, classes,
5820 license and devices. In particular, recent modprobe requires the 5820 license and devices. In particular, recent modprobe requires the
5821 module license to be defined as GPL, etc.; otherwise the system is 5821 module license to be defined as GPL, etc.; otherwise the system is
5822 shown as <quote>tainted</quote>. 5822 shown as <quote>tainted</quote>.
5823 5823
5824 <informalexample> 5824 <informalexample>
5825 <programlisting> 5825 <programlisting>
5826 <![CDATA[ 5826 <![CDATA[
5827 MODULE_DESCRIPTION("My Chip"); 5827 MODULE_DESCRIPTION("My Chip");
5828 MODULE_LICENSE("GPL"); 5828 MODULE_LICENSE("GPL");
5829 MODULE_SUPPORTED_DEVICE("{{Vendor,My Chip Name}}"); 5829 MODULE_SUPPORTED_DEVICE("{{Vendor,My Chip Name}}");
5830 ]]> 5830 ]]>
5831 </programlisting> 5831 </programlisting>
5832 </informalexample> 5832 </informalexample>
5833 </para> 5833 </para>
5834 5834
5835 </chapter> 5835 </chapter>
5836 5836
5837 5837
5838 <!-- ****************************************************** --> 5838 <!-- ****************************************************** -->
5839 <!-- How To Put Your Driver --> 5839 <!-- How To Put Your Driver -->
5840 <!-- ****************************************************** --> 5840 <!-- ****************************************************** -->
5841 <chapter id="how-to-put-your-driver"> 5841 <chapter id="how-to-put-your-driver">
5842 <title>How To Put Your Driver Into ALSA Tree</title> 5842 <title>How To Put Your Driver Into ALSA Tree</title>
5843 <section> 5843 <section>
5844 <title>General</title> 5844 <title>General</title>
5845 <para> 5845 <para>
5846 So far, you've learned how to write the driver code. 5846 So far, you've learned how to write the driver code.
5847 And you might have a question now: how do I put my own 5847 And you might have a question now: how do I put my own
5848 driver into the ALSA driver tree? 5848 driver into the ALSA driver tree?
5849 Here (finally :) the standard procedure is described briefly. 5849 Here (finally :) the standard procedure is described briefly.
5850 </para> 5850 </para>
5851 5851
5852 <para> 5852 <para>
5853 Suppose that you'll create a new PCI driver for the card 5853 Suppose that you'll create a new PCI driver for the card
5854 <quote>xyz</quote>. The card module name would be 5854 <quote>xyz</quote>. The card module name would be
5855 snd-xyz. The new driver is usually put into the alsa-driver 5855 snd-xyz. The new driver is usually put into the alsa-driver
5856 tree, in the <filename>alsa-driver/pci</filename> directory in 5856 tree, in the <filename>alsa-driver/pci</filename> directory in
5857 the case of PCI cards. 5857 the case of PCI cards.
5858 Then the driver is evaluated, audited and tested 5858 Then the driver is evaluated, audited and tested
5859 by developers and users. After a certain time, the driver 5859 by developers and users. After a certain time, the driver
5860 will go to the alsa-kernel tree (to the corresponding directory, 5860 will go to the alsa-kernel tree (to the corresponding directory,
5861 such as <filename>alsa-kernel/pci</filename>) and eventually 5861 such as <filename>alsa-kernel/pci</filename>) and eventually
5862 be integrated into the Linux 2.6 tree (the directory would be 5862 be integrated into the Linux 2.6 tree (the directory would be
5863 <filename>linux/sound/pci</filename>). 5863 <filename>linux/sound/pci</filename>).
5864 </para> 5864 </para>
5865 5865
5866 <para> 5866 <para>
5867 In the following sections, the driver code is assumed 5867 In the following sections, the driver code is assumed
5868 to be put into the alsa-driver tree. Two cases are covered: 5868 to be put into the alsa-driver tree. Two cases are covered:
5869 a driver consisting of a single source file and one consisting 5869 a driver consisting of a single source file and one consisting
5870 of several source files. 5870 of several source files.
5871 </para> 5871 </para>
5872 </section> 5872 </section>
5873 5873
5874 <section> 5874 <section>
5875 <title>Driver with A Single Source File</title> 5875 <title>Driver with A Single Source File</title>
5876 <para> 5876 <para>
5877 <orderedlist> 5877 <orderedlist>
5878 <listitem> 5878 <listitem>
5879 <para> 5879 <para>
5880 Modify alsa-driver/pci/Makefile 5880 Modify alsa-driver/pci/Makefile
5881 </para> 5881 </para>
5882 5882
5883 <para> 5883 <para>
5884 Suppose you have a file xyz.c. Add the following 5884 Suppose you have a file xyz.c. Add the following
5885 two lines 5885 two lines
5886 <informalexample> 5886 <informalexample>
5887 <programlisting> 5887 <programlisting>
5888 <![CDATA[ 5888 <![CDATA[
5889 snd-xyz-objs := xyz.o 5889 snd-xyz-objs := xyz.o
5890 obj-$(CONFIG_SND_XYZ) += snd-xyz.o 5890 obj-$(CONFIG_SND_XYZ) += snd-xyz.o
5891 ]]> 5891 ]]>
5892 </programlisting> 5892 </programlisting>
5893 </informalexample> 5893 </informalexample>
5894 </para> 5894 </para>
5895 </listitem> 5895 </listitem>
5896 5896
5897 <listitem> 5897 <listitem>
5898 <para> 5898 <para>
5899 Create the Kconfig entry 5899 Create the Kconfig entry
5900 </para> 5900 </para>
5901 5901
5902 <para> 5902 <para>
5903 Add a new Kconfig entry for your xyz driver. 5903 Add a new Kconfig entry for your xyz driver.
5904 <informalexample> 5904 <informalexample>
5905 <programlisting> 5905 <programlisting>
5906 <![CDATA[ 5906 <![CDATA[
5907 config SND_XYZ 5907 config SND_XYZ
5908 tristate "Foobar XYZ" 5908 tristate "Foobar XYZ"
5909 depends on SND 5909 depends on SND
5910 select SND_PCM 5910 select SND_PCM
5911 help 5911 help
5912 Say Y here to include support for Foobar XYZ soundcard. 5912 Say Y here to include support for Foobar XYZ soundcard.
5913 5913
5914 To compile this driver as a module, choose M here: the module 5914 To compile this driver as a module, choose M here: the module
5915 will be called snd-xyz. 5915 will be called snd-xyz.
5916 ]]> 5916 ]]>
5917 </programlisting> 5917 </programlisting>
5918 </informalexample> 5918 </informalexample>
5919 5919
5920 The line <quote>select SND_PCM</quote> specifies that the driver xyz supports 5920 The line <quote>select SND_PCM</quote> specifies that the driver xyz supports
5921 PCM. In addition to SND_PCM, the following components are 5921 PCM. In addition to SND_PCM, the following components are
5922 supported for the select command: 5922 supported for the select command:
5923 SND_RAWMIDI, SND_TIMER, SND_HWDEP, SND_MPU401_UART, 5923 SND_RAWMIDI, SND_TIMER, SND_HWDEP, SND_MPU401_UART,
5924 SND_OPL3_LIB, SND_OPL4_LIB, SND_VX_LIB, SND_AC97_CODEC. 5924 SND_OPL3_LIB, SND_OPL4_LIB, SND_VX_LIB, SND_AC97_CODEC.
5925 Add the select command for each supported component. 5925 Add the select command for each supported component.
5926 </para> 5926 </para>
5927 5927
5928 <para> 5928 <para>
5929 Note that some selections imply the lowlevel selections. 5929 Note that some selections imply the lowlevel selections.
5930 For example, PCM includes TIMER, MPU401_UART includes RAWMIDI, 5930 For example, PCM includes TIMER, MPU401_UART includes RAWMIDI,
5931 AC97_CODEC includes PCM, and OPL3_LIB includes HWDEP. 5931 AC97_CODEC includes PCM, and OPL3_LIB includes HWDEP.
5932 You don't need to give the lowlevel selections again. 5932 You don't need to give the lowlevel selections again.
5933 </para> 5933 </para>
5934 5934
5935 <para> 5935 <para>
5936 For the details of Kconfig script, refer to the kbuild 5936 For the details of Kconfig script, refer to the kbuild
5937 documentation. 5937 documentation.
5938 </para> 5938 </para>
5939 5939
5940 </listitem> 5940 </listitem>
5941 5941
5942 <listitem> 5942 <listitem>
5943 <para> 5943 <para>
5944 Run the cvscompile script to re-generate the configure script and 5944 Run the cvscompile script to re-generate the configure script and
5945 rebuild everything. 5945 rebuild everything.
5946 </para> 5946 </para>
5947 </listitem> 5947 </listitem>
5948 </orderedlist> 5948 </orderedlist>
5949 </para> 5949 </para>
5950 </section> 5950 </section>
5951 5951
5952 <section> 5952 <section>
5953 <title>Drivers with Several Source Files</title> 5953 <title>Drivers with Several Source Files</title>
5954 <para> 5954 <para>
5955 Suppose that the driver snd-xyz has several source files. 5955 Suppose that the driver snd-xyz has several source files.
5956 They are located in the new subdirectory, 5956 They are located in the new subdirectory,
5957 pci/xyz. 5957 pci/xyz.
5958 5958
5959 <orderedlist> 5959 <orderedlist>
5960 <listitem> 5960 <listitem>
5961 <para> 5961 <para>
5962 Add a new directory (<filename>xyz</filename>) in 5962 Add a new directory (<filename>xyz</filename>) in
5963 <filename>alsa-driver/pci/Makefile</filename> like below 5963 <filename>alsa-driver/pci/Makefile</filename> like below
5964 5964
5965 <informalexample> 5965 <informalexample>
5966 <programlisting> 5966 <programlisting>
5967 <![CDATA[ 5967 <![CDATA[
5968 obj-$(CONFIG_SND) += xyz/ 5968 obj-$(CONFIG_SND) += xyz/
5969 ]]> 5969 ]]>
5970 </programlisting> 5970 </programlisting>
5971 </informalexample> 5971 </informalexample>
5972 </para> 5972 </para>
5973 </listitem> 5973 </listitem>
5974 5974
5975 <listitem> 5975 <listitem>
5976 <para> 5976 <para>
5977 Under the directory <filename>xyz</filename>, create a Makefile 5977 Under the directory <filename>xyz</filename>, create a Makefile
5978 5978
5979 <example> 5979 <example>
5980 <title>Sample Makefile for a driver xyz</title> 5980 <title>Sample Makefile for a driver xyz</title>
5981 <programlisting> 5981 <programlisting>
5982 <![CDATA[ 5982 <![CDATA[
5983 ifndef SND_TOPDIR 5983 ifndef SND_TOPDIR
5984 SND_TOPDIR=../.. 5984 SND_TOPDIR=../..
5985 endif 5985 endif
5986 5986
5987 include $(SND_TOPDIR)/toplevel.config 5987 include $(SND_TOPDIR)/toplevel.config
5988 include $(SND_TOPDIR)/Makefile.conf 5988 include $(SND_TOPDIR)/Makefile.conf
5989 5989
5990 snd-xyz-objs := xyz.o abc.o def.o 5990 snd-xyz-objs := xyz.o abc.o def.o
5991 5991
5992 obj-$(CONFIG_SND_XYZ) += snd-xyz.o 5992 obj-$(CONFIG_SND_XYZ) += snd-xyz.o
5993 5993
5994 include $(SND_TOPDIR)/Rules.make 5994 include $(SND_TOPDIR)/Rules.make
5995 ]]> 5995 ]]>
5996 </programlisting> 5996 </programlisting>
5997 </example> 5997 </example>
5998 </para> 5998 </para>
5999 </listitem> 5999 </listitem>
6000 6000
6001 <listitem> 6001 <listitem>
6002 <para> 6002 <para>
6003 Create the Kconfig entry 6003 Create the Kconfig entry
6004 </para> 6004 </para>
6005 6005
6006 <para> 6006 <para>
6007 This procedure is the same as in the last section. 6007 This procedure is the same as in the last section.
6008 </para> 6008 </para>
6009 </listitem> 6009 </listitem>
6010 6010
6011 <listitem> 6011 <listitem>
6012 <para> 6012 <para>
6013 Run the cvscompile script to re-generate the configure script and 6013 Run the cvscompile script to re-generate the configure script and
6014 rebuild everything. 6014 rebuild everything.
6015 </para> 6015 </para>
6016 </listitem> 6016 </listitem>
6017 </orderedlist> 6017 </orderedlist>
6018 </para> 6018 </para>
6019 </section> 6019 </section>
6020 6020
6021 </chapter> 6021 </chapter>
6022 6022
6023 <!-- ****************************************************** --> 6023 <!-- ****************************************************** -->
6024 <!-- Useful Functions --> 6024 <!-- Useful Functions -->
6025 <!-- ****************************************************** --> 6025 <!-- ****************************************************** -->
6026 <chapter id="useful-functions"> 6026 <chapter id="useful-functions">
6027 <title>Useful Functions</title> 6027 <title>Useful Functions</title>
6028 6028
6029 <section id="useful-functions-snd-printk"> 6029 <section id="useful-functions-snd-printk">
6030 <title><function>snd_printk()</function> and friends</title> 6030 <title><function>snd_printk()</function> and friends</title>
6031 <para> 6031 <para>
6032 ALSA provides a verbose version of the 6032 ALSA provides a verbose version of the
6033 <function>printk()</function> function. If the kernel config option 6033 <function>printk()</function> function. If the kernel config option
6034 <constant>CONFIG_SND_VERBOSE_PRINTK</constant> is set, this 6034 <constant>CONFIG_SND_VERBOSE_PRINTK</constant> is set, this
6035 function prints the given message together with the file name 6035 function prints the given message together with the file name
6036 and the line of the caller. The <constant>KERN_XXX</constant> 6036 and the line of the caller. The <constant>KERN_XXX</constant>
6037 prefix is processed just as 6037 prefix is processed just as
6038 the original <function>printk()</function> does, so it's 6038 the original <function>printk()</function> does, so it's
6039 recommended to add this prefix, e.g. 6039 recommended to add this prefix, e.g.
6040 6040
6041 <informalexample> 6041 <informalexample>
6042 <programlisting> 6042 <programlisting>
6043 <![CDATA[ 6043 <![CDATA[
6044 snd_printk(KERN_ERR "Oh my, sorry, it's extremely bad!\n"); 6044 snd_printk(KERN_ERR "Oh my, sorry, it's extremely bad!\n");
6045 ]]> 6045 ]]>
6046 </programlisting> 6046 </programlisting>
6047 </informalexample> 6047 </informalexample>
6048 </para> 6048 </para>
6049 6049
6050 <para> 6050 <para>
6051 There are also <function>printk()</function>'s for 6051 There are also <function>printk()</function>'s for
6052 debugging. <function>snd_printd()</function> can be used for 6052 debugging. <function>snd_printd()</function> can be used for
6053 general debugging purposes. If 6053 general debugging purposes. If
6054 <constant>CONFIG_SND_DEBUG</constant> is set, this function is 6054 <constant>CONFIG_SND_DEBUG</constant> is set, this function is
6055 compiled, and works just like 6055 compiled, and works just like
6056 <function>snd_printk()</function>. If the ALSA is compiled 6056 <function>snd_printk()</function>. If the ALSA is compiled
6057 without the debugging flag, it's ignored. 6057 without the debugging flag, it's ignored.
6058 </para> 6058 </para>
6059 6059
6060 <para> 6060 <para>
6061 <function>snd_printdd()</function> is compiled in only when 6061 <function>snd_printdd()</function> is compiled in only when
6062 <constant>CONFIG_SND_DEBUG_DETECT</constant> is set. Please note 6062 <constant>CONFIG_SND_DEBUG_DETECT</constant> is set. Please note
6063 that <constant>DEBUG_DETECT</constant> is not set as default 6063 that <constant>DEBUG_DETECT</constant> is not set as default
6064 even if you configure the alsa-driver with 6064 even if you configure the alsa-driver with
6065 <option>--with-debug=full</option> option. You need to give 6065 <option>--with-debug=full</option> option. You need to give
6066 explicitly <option>--with-debug=detect</option> option instead. 6066 explicitly <option>--with-debug=detect</option> option instead.
6067 </para> 6067 </para>
6068 </section> 6068 </section>
6069 6069
6070 <section id="useful-functions-snd-assert"> 6070 <section id="useful-functions-snd-assert">
6071 <title><function>snd_assert()</function></title> 6071 <title><function>snd_assert()</function></title>
6072 <para> 6072 <para>
6073 <function>snd_assert()</function> macro is similar with the 6073 <function>snd_assert()</function> macro is similar with the
6074 normal <function>assert()</function> macro. For example, 6074 normal <function>assert()</function> macro. For example,
6075 6075
6076 <informalexample> 6076 <informalexample>
6077 <programlisting> 6077 <programlisting>
6078 <![CDATA[ 6078 <![CDATA[
6079 snd_assert(pointer != NULL, return -EINVAL); 6079 snd_assert(pointer != NULL, return -EINVAL);
6080 ]]> 6080 ]]>
6081 </programlisting> 6081 </programlisting>
6082 </informalexample> 6082 </informalexample>
6083 </para> 6083 </para>
6084 6084
6085 <para> 6085 <para>
6086 The first argument is the expression to evaluate, and the 6086 The first argument is the expression to evaluate, and the
6087 second argument is the action if it fails. When 6087 second argument is the action if it fails. When
6088 <constant>CONFIG_SND_DEBUG</constant>, is set, it will show an 6088 <constant>CONFIG_SND_DEBUG</constant>, is set, it will show an
6089 error message such as <computeroutput>BUG? (xxx)</computeroutput> 6089 error message such as <computeroutput>BUG? (xxx)</computeroutput>
6090 together with stack trace. 6090 together with stack trace.
6091 </para> 6091 </para>
6092 <para> 6092 <para>
6093 When no debug flag is set, this macro is ignored. 6093 When no debug flag is set, this macro is ignored.
6094 </para> 6094 </para>
6095 </section> 6095 </section>
6096 6096
6097 <section id="useful-functions-snd-bug"> 6097 <section id="useful-functions-snd-bug">
6098 <title><function>snd_BUG()</function></title> 6098 <title><function>snd_BUG()</function></title>
6099 <para> 6099 <para>
6100 It shows <computeroutput>BUG?</computeroutput> message and 6100 It shows <computeroutput>BUG?</computeroutput> message and
6101 stack trace as well as <function>snd_assert</function> at the point. 6101 stack trace as well as <function>snd_assert</function> at the point.
6102 It's useful to show that a fatal error happens there. 6102 It's useful to show that a fatal error happens there.
6103 </para> 6103 </para>
6104 <para> 6104 <para>
6105 When no debug flag is set, this macro is ignored. 6105 When no debug flag is set, this macro is ignored.
6106 </para> 6106 </para>
6107 </section> 6107 </section>
6108 </chapter> 6108 </chapter>
6109 6109
6110 6110
6111 <!-- ****************************************************** --> 6111 <!-- ****************************************************** -->
6112 <!-- Acknowledgments --> 6112 <!-- Acknowledgments -->
6113 <!-- ****************************************************** --> 6113 <!-- ****************************************************** -->
6114 <chapter id="acknowledments"> 6114 <chapter id="acknowledments">
6115 <title>Acknowledgments</title> 6115 <title>Acknowledgments</title>
6116 <para> 6116 <para>
6117 I would like to thank Phil Kerr for his help for improvement and 6117 I would like to thank Phil Kerr for his help for improvement and
6118 corrections of this document. 6118 corrections of this document.
6119 </para> 6119 </para>
6120 <para> 6120 <para>
6121 Kevin Conder reformatted the original plain-text to the 6121 Kevin Conder reformatted the original plain-text to the
6122 DocBook format. 6122 DocBook format.
6123 </para> 6123 </para>
6124 <para> 6124 <para>
6125 Giuliano Pochini corrected typos and contributed the example codes 6125 Giuliano Pochini corrected typos and contributed the example codes
6126 in the hardware constraints section. 6126 in the hardware constraints section.
6127 </para> 6127 </para>
6128 </chapter> 6128 </chapter>
6129 6129
6130 6130
6131 </book> 6131 </book>
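
The snd_assert() section in the hunk above describes a two-argument macro: an expression to check and an action to run when the check fails, with the whole check compiled out when no debug flag is set. A minimal userspace sketch of that pattern follows. It is not the ALSA implementation: `demo_assert`, `DEMO_DEBUG` and `use_pointer` are hypothetical names, `fprintf()` stands in for the kernel's printk machinery, and no stack trace is printed.

```c
#include <errno.h>
#include <stdio.h>

/* DEMO_DEBUG stands in for CONFIG_SND_DEBUG (assumption for this sketch). */
#define DEMO_DEBUG 1

#if DEMO_DEBUG
/* On failure: report the failed expression with file and line, then run the action. */
#define demo_assert(expr, action) do {                          \
        if (!(expr)) {                                          \
            fprintf(stderr, "BUG? (%s) at %s:%d\n",             \
                    #expr, __FILE__, __LINE__);                 \
            action;                                             \
        }                                                       \
    } while (0)
#else
/* Without the debug flag the check disappears entirely, as the text says. */
#define demo_assert(expr, action) do { } while (0)
#endif

/* Hypothetical caller mirroring the example in the text. */
static int use_pointer(const int *pointer)
{
    demo_assert(pointer != NULL, return -EINVAL);
    return *pointer;
}
```

Because the check vanishes when `DEMO_DEBUG` is 0, `use_pointer(NULL)` would then dereference a null pointer. A macro of this kind is therefore a debugging aid and not a substitute for real error handling.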
6132 6132
Documentation/sound/oss/AWE32
1 Installing and using Creative AWE midi sound under Linux. 1 Installing and using Creative AWE midi sound under Linux.
2 2
3 This documentation is devoted to the Creative Sound Blaster AWE32, AWE64 and 3 This documentation is devoted to the Creative Sound Blaster AWE32, AWE64 and
4 SB32. 4 SB32.
5 5
6 1) Make sure you have an ORIGINAL Creative SB32, AWE32 or AWE64 card. This 6 1) Make sure you have an ORIGINAL Creative SB32, AWE32 or AWE64 card. This
7 is important, because the driver works only with real Creative cards. 7 is important, because the driver works only with real Creative cards.
8 8
9 2) The first thing you need to do is re-compile your kernel with support for 9 2) The first thing you need to do is re-compile your kernel with support for
10 your sound card. Run your favourite tool to configure the kernel and when 10 your sound card. Run your favourite tool to configure the kernel and when
11 you get to the "Sound" menu you should enable support for the following: 11 you get to the "Sound" menu you should enable support for the following:
12 12
13 Sound card support, 13 Sound card support,
14 OSS sound modules, 14 OSS sound modules,
15 100% Sound Blaster compatibles (SB16/32/64, ESS, Jazz16) support, 15 100% Sound Blaster compatibles (SB16/32/64, ESS, Jazz16) support,
16 AWE32 synth 16 AWE32 synth
17 17
18 If your card is "Plug and Play" you will also need to enable these two 18 If your card is "Plug and Play" you will also need to enable these two
19 options, found under the "Plug and Play configuration" menu: 19 options, found under the "Plug and Play configuration" menu:
20 20
21 Plug and Play support 21 Plug and Play support
22 ISA Plug and Play support 22 ISA Plug and Play support
23 23
24 Now compile and install the kernel in normal fashion. If you don't know 24 Now compile and install the kernel in normal fashion. If you don't know
25 how to do this you can find instructions for this in the README file 25 how to do this you can find instructions for this in the README file
26 located in the root directory of the kernel source. 26 located in the root directory of the kernel source.
27 27
28 3) Before you can start playing midi files you will have to load a sound 28 3) Before you can start playing midi files you will have to load a sound
29 bank file. The utility needed for doing this is called "sfxload", and it 29 bank file. The utility needed for doing this is called "sfxload", and it
30 is one of the utilities found in a package called "awesfx". If this 30 is one of the utilities found in a package called "awesfx". If this
31 package is not available in your distribution you can download the AWE 31 package is not available in your distribution you can download the AWE
32 snapshot from Creative Labs Open Source website: 32 snapshot from Creative Labs Open Source website:
33 33
34 http://www.opensource.creative.com/snapshot.html 34 http://www.opensource.creative.com/snapshot.html
35 35
36 Once you have unpacked the AWE snapshot you will see a "awesfx" 36 Once you have unpacked the AWE snapshot you will see a "awesfx"
37 directory. Follow the instructions in awesfx/docs/INSTALL to install the 37 directory. Follow the instructions in awesfx/docs/INSTALL to install the
38 utilities in this package. After doing this, sfxload should be installed 38 utilities in this package. After doing this, sfxload should be installed
39 as: 39 as:
40 40
41 /usr/local/bin/sfxload 41 /usr/local/bin/sfxload
42 42
43 To enable AWE general midi synthesis you should also get the sound bank 43 To enable AWE general midi synthesis you should also get the sound bank
44 file for general midi from: 44 file for general midi from:
45 45
46 http://members.xoom.com/yar/synthgm.sbk.gz 46 http://members.xoom.com/yar/synthgm.sbk.gz
47 47
48 Copy it to a directory of your choice, and unpack it there. 48 Copy it to a directory of your choice, and unpack it there.
49 49
50 4) Edit /etc/modprobe.conf, and insert the following lines at the end of the 50 4) Edit /etc/modprobe.conf, and insert the following lines at the end of the
51 file: 51 file:
52 52
53 alias sound-slot-0 sb 53 alias sound-slot-0 sb
54 alias sound-service-0-1 awe_wave 54 alias sound-service-0-1 awe_wave
55 install awe_wave /sbin/modprobe --first-time -i awe_wave && /usr/local/bin/sfxload PATH_TO_SOUND_BANK_FILE 55 install awe_wave /sbin/modprobe --first-time -i awe_wave && /usr/local/bin/sfxload PATH_TO_SOUND_BANK_FILE
56 56
57 You will of course have to change "PATH_TO_SOUND_BANK_FILE" to the full 57 You will of course have to change "PATH_TO_SOUND_BANK_FILE" to the full
58 path of of the sound bank file. That will enable the Sound Blaster and AWE 58 path of the sound bank file. That will enable the Sound Blaster and AWE
59 wave synthesis. To play midi files you should get one of these programs if 59 wave synthesis. To play midi files you should get one of these programs if
60 you don't already have them: 60 you don't already have them:
61 61
62 Playmidi: http://playmidi.openprojects.net 62 Playmidi: http://playmidi.openprojects.net
63 63
64 AWEMidi Player (drvmidi) Included in the previously mentioned AWE 64 AWEMidi Player (drvmidi) Included in the previously mentioned AWE
65 snapshot. 65 snapshot.
66 66
67 You will probably have to pass the "-e" switch to playmidi to have it use 67 You will probably have to pass the "-e" switch to playmidi to have it use
68 your midi device. drvmidi should work without switches. 68 your midi device. drvmidi should work without switches.
69 69
70 If something goes wrong please e-mail me. All comments and suggestions are 70 If something goes wrong please e-mail me. All comments and suggestions are
71 welcome. 71 welcome.
72 72
73 Yaroslav Rosomakho (alons55@dialup.ptt.ru) 73 Yaroslav Rosomakho (alons55@dialup.ptt.ru)
74 http://www.yar.opennet.ru 74 http://www.yar.opennet.ru
75 75
76 Last Updated: Feb 3 2001 76 Last Updated: Feb 3 2001
77 77
Documentation/sound/oss/solo1
1 Recording 1 Recording
2 --------- 2 ---------
3 3
4 Recording does not work on the author's card, but there 4 Recording does not work on the author's card, but there
5 is at least one report of it working on later silicon. 5 is at least one report of it working on later silicon.
6 The chip behaves differently than described in the data sheet, 6 The chip behaves differently than described in the data sheet,
7 likely due to a chip bug. Working around this would require 7 likely due to a chip bug. Working around this would require
8 the help of ESS (for example by publishing an errata sheet), 8 the help of ESS (for example by publishing an errata sheet),
9 but ESS has not done so so far. 9 but ESS has not done so far.
10 10
11 Also, the chip only supports 24 bit addresses for recording, 11 Also, the chip only supports 24 bit addresses for recording,
12 which means it cannot work on some Alpha mainboards. 12 which means it cannot work on some Alpha mainboards.
13 13
14 14
15 /proc/sound, /dev/sndstat 15 /proc/sound, /dev/sndstat
16 ------------------------- 16 -------------------------
17 17
18 /proc/sound and /dev/sndstat is not supported by the 18 /proc/sound and /dev/sndstat is not supported by the
19 driver. To find out whether the driver succeeded loading, 19 driver. To find out whether the driver succeeded loading,
20 check the kernel log (dmesg). 20 check the kernel log (dmesg).
21 21
22 22
23 ALaw/uLaw sample formats 23 ALaw/uLaw sample formats
24 ------------------------ 24 ------------------------
25 25
26 This driver does not support the ALaw/uLaw sample formats. 26 This driver does not support the ALaw/uLaw sample formats.
27 ALaw is the default mode when opening a sound device 27 ALaw is the default mode when opening a sound device
28 using OSS/Free. The reason for the lack of support is 28 using OSS/Free. The reason for the lack of support is
29 that the hardware does not support these formats, and adding 29 that the hardware does not support these formats, and adding
30 conversion routines to the kernel would lead to very ugly 30 conversion routines to the kernel would lead to very ugly
31 code in the presence of the mmap interface to the driver. 31 code in the presence of the mmap interface to the driver.
32 And since xquake uses mmap, mmap is considered important :-) 32 And since xquake uses mmap, mmap is considered important :-)
33 and no sane application uses ALaw/uLaw these days anyway. 33 and no sane application uses ALaw/uLaw these days anyway.
34 In short, playing a Sun .au file as follows: 34 In short, playing a Sun .au file as follows:
35 35
36 cat my_file.au > /dev/dsp 36 cat my_file.au > /dev/dsp
37 37
38 does not work. Instead, you may use the play script from 38 does not work. Instead, you may use the play script from
39 Chris Bagwell's sox-12.14 package (or later, available from the URL 39 Chris Bagwell's sox-12.14 package (or later, available from the URL
40 below) to play many different audio file formats. 40 below) to play many different audio file formats.
41 The script automatically determines the audio format 41 The script automatically determines the audio format
42 and does do audio conversions if necessary. 42 and does do audio conversions if necessary.
43 http://home.sprynet.com/sprynet/cbagwell/projects.html 43 http://home.sprynet.com/sprynet/cbagwell/projects.html
44 44
45 45
46 Blocking vs. nonblocking IO 46 Blocking vs. nonblocking IO
47 --------------------------- 47 ---------------------------
48 48
49 Unlike OSS/Free this driver honours the O_NONBLOCK file flag 49 Unlike OSS/Free this driver honours the O_NONBLOCK file flag
50 not only during open, but also during read and write. 50 not only during open, but also during read and write.
51 This is an effort to make the sound driver interface more 51 This is an effort to make the sound driver interface more
52 regular. Timidity has problems with this; a patch 52 regular. Timidity has problems with this; a patch
53 is available from http://www.ife.ee.ethz.ch/~sailer/linux/pciaudio.html. 53 is available from http://www.ife.ee.ethz.ch/~sailer/linux/pciaudio.html.
54 (Timidity patched will also run on OSS/Free). 54 (Timidity patched will also run on OSS/Free).
55 55
56 56
57 MIDI UART 57 MIDI UART
58 --------- 58 ---------
59 59
60 The driver supports a simple MIDI UART interface, with 60 The driver supports a simple MIDI UART interface, with
61 no ioctl's supported. 61 no ioctl's supported.
62 62
63 63
64 MIDI synthesizer 64 MIDI synthesizer
65 ---------------- 65 ----------------
66 66
67 The card has an OPL compatible FM synthesizer. 67 The card has an OPL compatible FM synthesizer.
68 68
69 Thomas Sailer 69 Thomas Sailer
70 t.sailer@alumni.ethz.ch 70 t.sailer@alumni.ethz.ch
71 71
Documentation/sound/oss/ultrasound
1 modprobe sound 1 modprobe sound
2 insmod ad1848 2 insmod ad1848
3 insmod gus io=* irq=* dma=* ... 3 insmod gus io=* irq=* dma=* ...
4 4
5 This loads the driver for the Gravis Ultrasound family of sound cards. 5 This loads the driver for the Gravis Ultrasound family of sound cards.
6 6
7 The gus module takes the following arguments 7 The gus module takes the following arguments
8 8
9 io I/O address of the Ultrasound card (eg. io=0x220) 9 io I/O address of the Ultrasound card (eg. io=0x220)
10 irq IRQ of the Sound Blaster card 10 irq IRQ of the Sound Blaster card
11 dma DMA channel for the Sound Blaster 11 dma DMA channel for the Sound Blaster
12 dma16 2nd DMA channel, only needed for full duplex operation 12 dma16 2nd DMA channel, only needed for full duplex operation
13 type 1 for PnP card 13 type 1 for PnP card
14 gus16 1 for using 16 bit sampling daughter board 14 gus16 1 for using 16 bit sampling daughter board
15 no_wave_dma Set to disable DMA usage for wavetable (see note) 15 no_wave_dma Set to disable DMA usage for wavetable (see note)
16 db16 ??? 16 db16 ???
17 17
18 18
19 no_wave_dma option 19 no_wave_dma option
20 20
21 This option defaults to a value of 0, which allows the Ultrasound wavetable 21 This option defaults to a value of 0, which allows the Ultrasound wavetable
22 DSP to use DMA for for playback and downloading samples. This is the same 22 DSP to use DMA for playback and downloading samples. This is the same
23 as the old behaviour. If set to 1, no DMA is needed for downloading samples, 23 as the old behaviour. If set to 1, no DMA is needed for downloading samples,
24 and allows owners of a GUS MAX to make use of simultaneous digital audio 24 and allows owners of a GUS MAX to make use of simultaneous digital audio
25 (/dev/dsp), MIDI, and wavetable playback. 25 (/dev/dsp), MIDI, and wavetable playback.
26 26
27 27
28 If you have problems in recording with GUS MAX, you could try to use 28 If you have problems in recording with GUS MAX, you could try to use
29 just one 8 bit DMA channel. Recording will not work with one DMA 29 just one 8 bit DMA channel. Recording will not work with one DMA
30 channel if it's a 16 bit one. 30 channel if it's a 16 bit one.
31 31
Documentation/sound/oss/vwsnd
1 vwsnd - Sound driver for the Silicon Graphics 320 and 540 Visual 1 vwsnd - Sound driver for the Silicon Graphics 320 and 540 Visual
2 Workstations' onboard audio. 2 Workstations' onboard audio.
3 3
4 Copyright 1999 Silicon Graphics, Inc. All rights reserved. 4 Copyright 1999 Silicon Graphics, Inc. All rights reserved.
5 5
6 6
7 At the time of this writing, March 1999, there are two models of 7 At the time of this writing, March 1999, there are two models of
8 Visual Workstation, the 320 and the 540. This document only describes 8 Visual Workstation, the 320 and the 540. This document only describes
9 those models. Future Visual Workstation models may have different 9 those models. Future Visual Workstation models may have different
10 sound capabilities, and this driver will probably not work on those 10 sound capabilities, and this driver will probably not work on those
11 boxes. 11 boxes.
12 12
13 The Visual Workstation has an Analog Devices AD1843 "SoundComm" audio 13 The Visual Workstation has an Analog Devices AD1843 "SoundComm" audio
14 codec chip. The AD1843 is accessed through the Cobalt I/O ASIC, also 14 codec chip. The AD1843 is accessed through the Cobalt I/O ASIC, also
15 known as Lithium. This driver programs both both chips. 15 known as Lithium. This driver programs both chips.
16 16
17 ============================================================================== 17 ==============================================================================
18 QUICK CONFIGURATION 18 QUICK CONFIGURATION
19 19
20 # insmod soundcore 20 # insmod soundcore
21 # insmod vwsnd 21 # insmod vwsnd
22 22
23 ============================================================================== 23 ==============================================================================
24 I/O CONNECTIONS 24 I/O CONNECTIONS
25 25
26 On the Visual Workstation, only three of the AD1843 inputs are hooked 26 On the Visual Workstation, only three of the AD1843 inputs are hooked
27 up. The analog line in jacks are connected to the AD1843's AUX1 27 up. The analog line in jacks are connected to the AD1843's AUX1
28 input. The CD audio lines are connected to the AD1843's AUX2 input. 28 input. The CD audio lines are connected to the AD1843's AUX2 input.
29 The microphone jack is connected to the AD1843's MIC input. The mic 29 The microphone jack is connected to the AD1843's MIC input. The mic
30 jack is mono, but the signal is delivered to both the left and right 30 jack is mono, but the signal is delivered to both the left and right
31 MIC inputs. You can record in stereo from the mic input, but you will 31 MIC inputs. You can record in stereo from the mic input, but you will
32 get the same signal on both channels (within the limits of A/D 32 get the same signal on both channels (within the limits of A/D
33 accuracy). Full scale on the Line input is +/- 2.0 V. Full scale on 33 accuracy). Full scale on the Line input is +/- 2.0 V. Full scale on
34 the MIC input is 20 dB less, or +/- 0.2 V. 34 the MIC input is 20 dB less, or +/- 0.2 V.
35 35
36 The AD1843's LOUT1 outputs are connected to the Line Out jacks. The 36 The AD1843's LOUT1 outputs are connected to the Line Out jacks. The
37 AD1843's HPOUT outputs are connected to the speaker/headphone jack. 37 AD1843's HPOUT outputs are connected to the speaker/headphone jack.
38 LOUT2 is not connected. Line out's maximum level is +/- 2.0 V peak to 38 LOUT2 is not connected. Line out's maximum level is +/- 2.0 V peak to
39 peak. The speaker/headphone out's maximum is +/- 4.0 V peak to peak. 39 peak. The speaker/headphone out's maximum is +/- 4.0 V peak to peak.
40 40
41 The AD1843's PCM input channel and one of its output channels (DAC1) 41 The AD1843's PCM input channel and one of its output channels (DAC1)
42 are connected to Lithium. The other output channel (DAC2) is not 42 are connected to Lithium. The other output channel (DAC2) is not
43 connected. 43 connected.
44 44
45 ============================================================================== 45 ==============================================================================
46 CAPABILITIES 46 CAPABILITIES
47 47
48 The AD1843 has PCM input and output (Pulse Code Modulation, also known 48 The AD1843 has PCM input and output (Pulse Code Modulation, also known
49 as wavetable). PCM input and output can be mono or stereo in any of 49 as wavetable). PCM input and output can be mono or stereo in any of
50 four formats. The formats are 16 bit signed and 8 bit unsigned, 50 four formats. The formats are 16 bit signed and 8 bit unsigned,
51 u-Law, and A-Law format. Any sample rate from 4 KHz to 49 KHz is 51 u-Law, and A-Law format. Any sample rate from 4 KHz to 49 KHz is
52 available, in 1 Hz increments. 52 available, in 1 Hz increments.
53 53
54 The AD1843 includes an analog mixer that can mix all three input 54 The AD1843 includes an analog mixer that can mix all three input
55 signals (line, mic and CD) into the analog outputs. The mixer has a 55 signals (line, mic and CD) into the analog outputs. The mixer has a
56 separate gain control and mute switch for each input. 56 separate gain control and mute switch for each input.
57 57
58 There are two outputs, line out and speaker/headphone out. They 58 There are two outputs, line out and speaker/headphone out. They
59 always produce the same signal, and the speaker always has 3 dB more 59 always produce the same signal, and the speaker always has 3 dB more
60 gain than the line out. The speaker/headphone output can be muted, 60 gain than the line out. The speaker/headphone output can be muted,
61 but this driver does not export that function. 61 but this driver does not export that function.
62 62
63 The hardware can sync audio to the video clock, but this driver does 63 The hardware can sync audio to the video clock, but this driver does
64 not have a way to specify syncing to video. 64 not have a way to specify syncing to video.
65 65
66 ============================================================================== 66 ==============================================================================
67 PROGRAMMING 67 PROGRAMMING
68 68
69 This section explains the API supported by the driver. Also see the 69 This section explains the API supported by the driver. Also see the
70 Open Sound Programming Guide at http://www.opensound.com/pguide/ . 70 Open Sound Programming Guide at http://www.opensound.com/pguide/ .
71 This section assumes familiarity with that document. 71 This section assumes familiarity with that document.
72 72
73 The driver has two interfaces, an I/O interface and a mixer interface. 73 The driver has two interfaces, an I/O interface and a mixer interface.
74 There is no MIDI or sequencer capability. 74 There is no MIDI or sequencer capability.
75 75
76 ============================================================================== 76 ==============================================================================
77 PROGRAMMING PCM I/O 77 PROGRAMMING PCM I/O
78 78
79 The I/O interface is usually accessed as /dev/audio or /dev/dsp. 79 The I/O interface is usually accessed as /dev/audio or /dev/dsp.
80 Using the standard Open Sound System (OSS) ioctl calls, the sample 80 Using the standard Open Sound System (OSS) ioctl calls, the sample
81 rate, number of channels, and sample format may be set within the 81 rate, number of channels, and sample format may be set within the
82 limitations described above. The driver supports triggering. It also 82 limitations described above. The driver supports triggering. It also
83 supports getting the input and output pointers with one-sample 83 supports getting the input and output pointers with one-sample
84 accuracy. 84 accuracy.
85 85
86 The SNDCTL_DSP_GETCAP ioctl returns these capabilities. 86 The SNDCTL_DSP_GETCAP ioctl returns these capabilities.
87 87
88 DSP_CAP_DUPLEX - driver supports full duplex. 88 DSP_CAP_DUPLEX - driver supports full duplex.
89 89
90 DSP_CAP_TRIGGER - driver supports triggering. 90 DSP_CAP_TRIGGER - driver supports triggering.
91 91
92 DSP_CAP_REALTIME - values returned by SNDCTL_DSP_GETIPTR 92 DSP_CAP_REALTIME - values returned by SNDCTL_DSP_GETIPTR
93 and SNDCTL_DSP_GETOPTR are accurate to a few samples. 93 and SNDCTL_DSP_GETOPTR are accurate to a few samples.
94 94
95 Memory mapping (mmap) is not implemented. 95 Memory mapping (mmap) is not implemented.
96 96
97 The driver permits subdivided fragment sizes from 64 to 4096 bytes. 97 The driver permits subdivided fragment sizes from 64 to 4096 bytes.
98 The number of fragments can be anything from 3 fragments to however 98 The number of fragments can be anything from 3 fragments to however
99 many fragments fit into 124 kilobytes. It is up to the user to 99 many fragments fit into 124 kilobytes. It is up to the user to
100 determine how few/small fragments can be used without introducing 100 determine how few/small fragments can be used without introducing
101 glitches with a given workload. Linux is not realtime, so we can't 101 glitches with a given workload. Linux is not realtime, so we can't
102 promise anything. (sigh...) 102 promise anything. (sigh...)
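
The fragment limits in the paragraph above can be checked before calling the OSS `SNDCTL_DSP_SETFRAGMENT` ioctl, whose argument packs the maximum fragment count into the upper 16 bits and log2 of the fragment size into the lower 16 bits. The sketch below is a hedged illustration: `vwsnd_fragment_arg` and the `VWSND_*` constants are hypothetical names, not driver symbols, and "124 kilobytes" is assumed to mean 124 * 1024 bytes.

```c
/* Limits quoted from the text; the 1024-byte kilobyte is an assumption. */
enum {
    VWSND_MIN_FRAG   = 64,
    VWSND_MAX_FRAG   = 4096,
    VWSND_MIN_NFRAGS = 3,
    VWSND_MAX_BUF    = 124 * 1024
};

/*
 * Build the SNDCTL_DSP_SETFRAGMENT argument for the requested layout.
 * Returns the encoded value, or -1 if the request is outside the
 * limits stated in the text or the size is not a power of two.
 */
static long vwsnd_fragment_arg(unsigned frag_bytes, unsigned nfrags)
{
    unsigned shift = 0;

    if (frag_bytes < VWSND_MIN_FRAG || frag_bytes > VWSND_MAX_FRAG)
        return -1;
    if (frag_bytes & (frag_bytes - 1))      /* must be a power of two */
        return -1;
    if (nfrags < VWSND_MIN_NFRAGS)
        return -1;
    if ((unsigned long)frag_bytes * nfrags > VWSND_MAX_BUF)
        return -1;

    while ((1u << shift) < frag_bytes)      /* shift = log2(frag_bytes) */
        shift++;

    return ((long)nfrags << 16) | (long)shift;
}
```

For example, eight fragments of 1024 bytes encode as 0x0008000a. The result would be passed by pointer to `ioctl(fd, SNDCTL_DSP_SETFRAGMENT, &arg)`; the driver may still adjust what it actually grants.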
103 103
104 When this driver is switched into or out of mu-Law or A-Law mode on 104 When this driver is switched into or out of mu-Law or A-Law mode on
105 output, it may produce an audible click. This is unavoidable. To 105 output, it may produce an audible click. This is unavoidable. To
106 prevent clicking, use signed 16-bit mode instead, and convert from 106 prevent clicking, use signed 16-bit mode instead, and convert from
107 mu-Law or A-Law format in software. 107 mu-Law or A-Law format in software.
108 108
109 ============================================================================== 109 ==============================================================================
110 PROGRAMMING THE MIXER INTERFACE 110 PROGRAMMING THE MIXER INTERFACE
111 111
112 The mixer interface is usually accessed as /dev/mixer. It is accessed 112 The mixer interface is usually accessed as /dev/mixer. It is accessed
113 through ioctls. The mixer allows the application to control gain or 113 through ioctls. The mixer allows the application to control gain or
114 mute several audio signal paths, and also allows selection of the 114 mute several audio signal paths, and also allows selection of the
115 recording source. 115 recording source.
116 116
117 Each of the constants described here can be read using the 117 Each of the constants described here can be read using the
118 MIXER_READ(SOUND_MIXER_xxx) ioctl. Those that are not read-only can 118 MIXER_READ(SOUND_MIXER_xxx) ioctl. Those that are not read-only can
119 also be written using the MIXER_WRITE(SOUND_MIXER_xxx) ioctl. In most 119 also be written using the MIXER_WRITE(SOUND_MIXER_xxx) ioctl. In most
120 cases, <sys/soundcard.h> defines constants SOUND_MIXER_READ_xxx and 120 cases, <sys/soundcard.h> defines constants SOUND_MIXER_READ_xxx and
121 SOUND_MIXER_WRITE_xxx which work just as well. 121 SOUND_MIXER_WRITE_xxx which work just as well.
122 122
123 SOUND_MIXER_CAPS Read-only 123 SOUND_MIXER_CAPS Read-only
124 124
125 This is a mask of optional driver capabilities that are implemented. 125 This is a mask of optional driver capabilities that are implemented.
126 This driver's only capability is SOUND_CAP_EXCL_INPUT, which means 126 This driver's only capability is SOUND_CAP_EXCL_INPUT, which means
127 that only one recording source can be active at a time. 127 that only one recording source can be active at a time.
128 128
129 SOUND_MIXER_DEVMASK Read-only 129 SOUND_MIXER_DEVMASK Read-only
130 130
131 This is a mask of the sound channels. This driver's channels are PCM, 131 This is a mask of the sound channels. This driver's channels are PCM,
132 LINE, MIC, CD, and RECLEV. 132 LINE, MIC, CD, and RECLEV.
133 133
134 SOUND_MIXER_STEREODEVS Read-only 134 SOUND_MIXER_STEREODEVS Read-only
135 135
136 This is a mask of which sound channels are capable of stereo. All 136 This is a mask of which sound channels are capable of stereo. All
137 channels are capable of stereo. (But see caveat on MIC input in I/O 137 channels are capable of stereo. (But see caveat on MIC input in I/O
138 CONNECTIONS section above). 138 CONNECTIONS section above).
139 139
140 SOUND_MIXER_OUTMASK Read-only 140 SOUND_MIXER_OUTMASK Read-only
141 141
142 This is a mask of channels that route inputs through to outputs. 142 This is a mask of channels that route inputs through to outputs.
143 Those are LINE, MIC, and CD. 143 Those are LINE, MIC, and CD.
144 144
145 SOUND_MIXER_RECMASK Read-only 145 SOUND_MIXER_RECMASK Read-only
146 146
147 This is a mask of channels that can be recording sources. Those are 147 This is a mask of channels that can be recording sources. Those are
148 PCM, LINE, MIC, CD. 148 PCM, LINE, MIC, CD.
149 149
150 SOUND_MIXER_PCM Default: 0x5757 (0 dB) 150 SOUND_MIXER_PCM Default: 0x5757 (0 dB)
151 151
152 This is the gain control for PCM output. The left and right channel 152 This is the gain control for PCM output. The left and right channel
153 gain are controlled independently. This gain control has 64 levels, 153 gain are controlled independently. This gain control has 64 levels,
154 which range from -82.5 dB to +12.0 dB in 1.5 dB steps. Those 64 154 which range from -82.5 dB to +12.0 dB in 1.5 dB steps. Those 64
155 levels are mapped onto 100 levels at the ioctl, see below. 155 levels are mapped onto 100 levels at the ioctl, see below.
156 156
157 SOUND_MIXER_LINE Default: 0x4a4a (0 dB) 157 SOUND_MIXER_LINE Default: 0x4a4a (0 dB)
158 158
159 This is the gain control for mixing the Line In source into the 159 This is the gain control for mixing the Line In source into the
160 outputs. The left and right channel gain are controlled 160 outputs. The left and right channel gain are controlled
161 independently. This gain control has 32 levels, which range from 161 independently. This gain control has 32 levels, which range from
162 -34.5 dB to +12.0 dB in 1.5 dB steps. Those 32 levels are mapped onto 162 -34.5 dB to +12.0 dB in 1.5 dB steps. Those 32 levels are mapped onto
163 100 levels at the ioctl, see below. 163 100 levels at the ioctl, see below.
164 164
165 SOUND_MIXER_MIC Default: 0x4a4a (0 dB) 165 SOUND_MIXER_MIC Default: 0x4a4a (0 dB)
166 166
167 This is the gain control for mixing the MIC source into the outputs. 167 This is the gain control for mixing the MIC source into the outputs.
168 The left and right channel gain are controlled independently. This 168 The left and right channel gain are controlled independently. This
169 gain control has 32 levels, which range from -34.5 dB to +12.0 dB in 169 gain control has 32 levels, which range from -34.5 dB to +12.0 dB in
170 1.5 dB steps. Those 32 levels are mapped onto 100 levels at the 170 1.5 dB steps. Those 32 levels are mapped onto 100 levels at the
171 ioctl, see below. 171 ioctl, see below.
172 172
173 SOUND_MIXER_CD Default: 0x4a4a (0 dB) 173 SOUND_MIXER_CD Default: 0x4a4a (0 dB)
174 174
175 This is the gain control for mixing the CD audio source into the 175 This is the gain control for mixing the CD audio source into the
176 outputs. The left and right channel gain are controlled 176 outputs. The left and right channel gain are controlled
177 independently. This gain control has 32 levels, which range from 177 independently. This gain control has 32 levels, which range from
178 -34.5 dB to +12.0 dB in 1.5 dB steps. Those 32 levels are mapped onto 178 -34.5 dB to +12.0 dB in 1.5 dB steps. Those 32 levels are mapped onto
179 100 levels at the ioctl, see below. 179 100 levels at the ioctl, see below.
180 180
181 SOUND_MIXER_RECLEV Default: 0 (0 dB) 181 SOUND_MIXER_RECLEV Default: 0 (0 dB)
182 182
183 This is the gain control for PCM input (RECording LEVel). The left 183 This is the gain control for PCM input (RECording LEVel). The left
184 and right channel gain are controlled independently. This gain 184 and right channel gain are controlled independently. This gain
185 control has 16 levels, which range from 0 dB to +22.5 dB in 1.5 dB 185 control has 16 levels, which range from 0 dB to +22.5 dB in 1.5 dB
186 steps. Those 16 levels are mapped onto 100 levels at the ioctl, see 186 steps. Those 16 levels are mapped onto 100 levels at the ioctl, see
187 below. 187 below.
188 188
189 SOUND_MIXER_RECSRC Default: SOUND_MASK_LINE 189 SOUND_MIXER_RECSRC Default: SOUND_MASK_LINE
190 190
191 This is a mask of currently selected PCM input sources (RECording 191 This is a mask of currently selected PCM input sources (RECording
192 SouRCes). Because the AD1843 can only have a single recording source 192 SouRCes). Because the AD1843 can only have a single recording source
193 at a time, only one bit at a time can be set in this mask. The 193 at a time, only one bit at a time can be set in this mask. The
194 allowable values are SOUND_MASK_PCM, SOUND_MASK_LINE, SOUND_MASK_MIC, 194 allowable values are SOUND_MASK_PCM, SOUND_MASK_LINE, SOUND_MASK_MIC,
195 or SOUND_MASK_CD. Selecting SOUND_MASK_PCM sets up internal 195 or SOUND_MASK_CD. Selecting SOUND_MASK_PCM sets up internal
196 resampling which is useful for loopback testing and for hardware 196 resampling which is useful for loopback testing and for hardware
197 sample rate conversion. But software sample rate conversion is 197 sample rate conversion. But software sample rate conversion is
198 probably faster, so I don't know how useful that is. 198 probably faster, so I don't know how useful that is.
199 199
200 SOUND_MIXER_OUTSRC Default: SOUND_MASK_LINE|SOUND_MASK_MIC|SOUND_MASK_CD 200 SOUND_MIXER_OUTSRC Default: SOUND_MASK_LINE|SOUND_MASK_MIC|SOUND_MASK_CD
201 201
202 This is a mask of sources that are currently passed through to the 202 This is a mask of sources that are currently passed through to the
203 outputs. Those sources whose bits are not set are muted. 203 outputs. Those sources whose bits are not set are muted.
204 204
205 ============================================================================== 205 ==============================================================================
206 GAIN CONTROL 206 GAIN CONTROL
207 207
208 There are five gain controls listed above. Each has 16, 32, or 64 208 There are five gain controls listed above. Each has 16, 32, or 64
209 steps. Each control has 1.5 dB of gain per step. Each control is 209 steps. Each control has 1.5 dB of gain per step. Each control is
210 stereo. 210 stereo.
211 211
212 The OSS defines the argument to a channel gain ioctl as having two 212 The OSS defines the argument to a channel gain ioctl as having two
213 components, left and right, each of which ranges from 0 to 100. The 213 components, left and right, each of which ranges from 0 to 100. The
214 two components are packed into the same word, with the left side gain 214 two components are packed into the same word, with the left side gain
215 in the least significant byte, and the right side gain in the second 215 in the least significant byte, and the right side gain in the second
216 least significant byte. In C, we would say this. 216 least significant byte. In C, we would say this.
217 217
218 #include <assert.h> 218 #include <assert.h>
219 219
220 ... 220 ...
221 221
222 assert(leftgain >= 0 && leftgain <= 100); 222 assert(leftgain >= 0 && leftgain <= 100);
223 assert(rightgain >= 0 && rightgain <= 100); 223 assert(rightgain >= 0 && rightgain <= 100);
224 arg = leftgain | rightgain << 8; 224 arg = leftgain | rightgain << 8;
225 225
226 So each OSS gain control has 101 steps. But the hardware has 16, 32, 226 So each OSS gain control has 101 steps. But the hardware has 16, 32,
227 or 64 steps. The hardware steps are spread across the 101 OSS steps 227 or 64 steps. The hardware steps are spread across the 101 OSS steps
228 nearly evenly. The conversion formulas are like this, given N equals 228 nearly evenly. The conversion formulas are like this, given N equals
229 16, 32, or 64. 229 16, 32, or 64.
230 230
231 int round = N/2 - 1; 231 int round = N/2 - 1;
232 OSS_gain_steps = (hw_gain_steps * 100 + round) / (N - 1); 232 OSS_gain_steps = (hw_gain_steps * 100 + round) / (N - 1);
233 hw_gain_steps = (OSS_gain_steps * (N - 1) + round) / 100; 233 hw_gain_steps = (OSS_gain_steps * (N - 1) + round) / 100;
234 234
235 Here is a snippet of C code that will return the left and right gain 235 Here is a snippet of C code that will return the left and right gain
236 of any channel in dB. Pass it one of the predefined gain_desc_t 236 of any channel in dB. Pass it one of the predefined gain_desc_t
237 structures to access any of the five channels' gains. 237 structures to access any of the five channels' gains.
238 238
239 typedef struct gain_desc { 239 typedef struct gain_desc {
240 float min_gain; 240 float min_gain;
241 float gain_step; 241 float gain_step;
242 int nbits; 242 int nbits;
243 int chan; 243 int chan;
244 } gain_desc_t; 244 } gain_desc_t;
245 245
246 const gain_desc_t gain_pcm = { -82.5, 1.5, 6, SOUND_MIXER_PCM }; 246 const gain_desc_t gain_pcm = { -82.5, 1.5, 6, SOUND_MIXER_PCM };
247 const gain_desc_t gain_line = { -34.5, 1.5, 5, SOUND_MIXER_LINE }; 247 const gain_desc_t gain_line = { -34.5, 1.5, 5, SOUND_MIXER_LINE };
248 const gain_desc_t gain_mic = { -34.5, 1.5, 5, SOUND_MIXER_MIC }; 248 const gain_desc_t gain_mic = { -34.5, 1.5, 5, SOUND_MIXER_MIC };
249 const gain_desc_t gain_cd = { -34.5, 1.5, 5, SOUND_MIXER_CD }; 249 const gain_desc_t gain_cd = { -34.5, 1.5, 5, SOUND_MIXER_CD };
250 const gain_desc_t gain_reclev = { 0.0, 1.5, 4, SOUND_MIXER_RECLEV }; 250 const gain_desc_t gain_reclev = { 0.0, 1.5, 4, SOUND_MIXER_RECLEV };
251 251
252 int get_gain_dB(int fd, const gain_desc_t *gp, 252 int get_gain_dB(int fd, const gain_desc_t *gp,
253 float *left, float *right) 253 float *left, float *right)
254 { 254 {
255 int word; 255 int word;
256 int lg, rg; 256 int lg, rg;
257 int mask = (1 << gp->nbits) - 1; 257 int mask = (1 << gp->nbits) - 1;
258 258
259 if (ioctl(fd, MIXER_READ(gp->chan), &word) != 0) 259 if (ioctl(fd, MIXER_READ(gp->chan), &word) != 0)
260 return -1; /* fail */ 260 return -1; /* fail */
261 lg = word & 0xFF; 261 lg = word & 0xFF;
262 rg = word >> 8 & 0xFF; 262 rg = word >> 8 & 0xFF;
263 lg = (lg * mask + mask / 2) / 100; 263 lg = (lg * mask + mask / 2) / 100;
264 rg = (rg * mask + mask / 2) / 100; 264 rg = (rg * mask + mask / 2) / 100;
265 *left = gp->min_gain + gp->gain_step * lg; 265 *left = gp->min_gain + gp->gain_step * lg;
266 *right = gp->min_gain + gp->gain_step * rg; 266 *right = gp->min_gain + gp->gain_step * rg;
267 return 0; 267 return 0;
268 } 268 }
269 269
270 And here is the corresponding routine to set a channel's gain in dB. 270 And here is the corresponding routine to set a channel's gain in dB.
271 271
272 int set_gain_dB(int fd, const gain_desc_t *gp, float left, float right) 272 int set_gain_dB(int fd, const gain_desc_t *gp, float left, float right)
273 { 273 {
274 float max_gain = 274 float max_gain =
275 gp->min_gain + (1 << gp->nbits) * gp->gain_step; 275 gp->min_gain + (1 << gp->nbits) * gp->gain_step;
276 float round = gp->gain_step / 2; 276 float round = gp->gain_step / 2;
277 int mask = (1 << gp->nbits) - 1; 277 int mask = (1 << gp->nbits) - 1;
278 int word; 278 int word;
279 int lg, rg; 279 int lg, rg;
280 280
281 if (left < gp->min_gain || right < gp->min_gain) 281 if (left < gp->min_gain || right < gp->min_gain)
282 return EINVAL; 282 return EINVAL;
283 lg = (left - gp->min_gain + round) / gp->gain_step; 283 lg = (left - gp->min_gain + round) / gp->gain_step;
284 rg = (right - gp->min_gain + round) / gp->gain_step; 284 rg = (right - gp->min_gain + round) / gp->gain_step;
285 if (lg >= (1 << gp->nbits) || rg >= (1 << gp->nbits)) 285 if (lg >= (1 << gp->nbits) || rg >= (1 << gp->nbits))
286 return EINVAL; 286 return EINVAL;
287 lg = (100 * lg + mask / 2) / mask; 287 lg = (100 * lg + mask / 2) / mask;
288 rg = (100 * rg + mask / 2) / mask; 288 rg = (100 * rg + mask / 2) / mask;
289 word = lg | rg << 8; 289 word = lg | rg << 8;
290 290
291 return ioctl(fd, MIXER_WRITE(gp->chan), &word); 291 return ioctl(fd, MIXER_WRITE(gp->chan), &word);
292 } 292 }
293 293
294 294
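The conversion formulas from the GAIN CONTROL section can be tried out in isolation. The following is a minimal sketch, not part of the driver: the helper names hw_to_oss(), oss_to_hw() and pack_gain() are made up for illustration, and only the arithmetic is taken from the text above. With N = 32, hardware step 23 (0 dB for LINE, MIC and CD) maps to OSS value 74, which is the 0x4a in the documented default of 0x4a4a.

```c
#include <assert.h>

/* Hypothetical helpers; only the formulas come from the text above. */

/* Map a hardware step (0..n-1) onto the OSS 0..100 scale. */
static int hw_to_oss(int hw, int n)
{
	int round = n / 2 - 1;

	return (hw * 100 + round) / (n - 1);
}

/* Map an OSS value (0..100) back to a hardware step (0..n-1). */
static int oss_to_hw(int oss, int n)
{
	int round = n / 2 - 1;

	return (oss * (n - 1) + round) / 100;
}

/* Pack left and right OSS gains into one ioctl argument word. */
static int pack_gain(int leftgain, int rightgain)
{
	assert(leftgain >= 0 && leftgain <= 100);
	assert(rightgain >= 0 && rightgain <= 100);
	return leftgain | rightgain << 8;
}
```

Converting a hardware step to an OSS value and back returns the same hardware step for N equal to 16, 32 and 64, so a value read with MIXER_READ can be written back without drifting.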
Documentation/spi/pxa2xx
1 PXA2xx SPI on SSP driver HOWTO 1 PXA2xx SPI on SSP driver HOWTO
2 =================================================== 2 ===================================================
3 This is a mini howto on the pxa2xx_spi driver. The driver turns a PXA2xx 3 This is a mini howto on the pxa2xx_spi driver. The driver turns a PXA2xx
4 synchronous serial port into a SPI master controller 4 synchronous serial port into a SPI master controller
5 (see Documentation/spi/spi_summary). The driver has the following features 5 (see Documentation/spi/spi_summary). The driver has the following features
6 6
7 - Support for any PXA2xx SSP 7 - Support for any PXA2xx SSP
8 - SSP PIO and SSP DMA data transfers. 8 - SSP PIO and SSP DMA data transfers.
9 - External and Internal (SSPFRM) chip selects. 9 - External and Internal (SSPFRM) chip selects.
10 - Per slave device (chip) configuration. 10 - Per slave device (chip) configuration.
11 - Full suspend, freeze, resume support. 11 - Full suspend, freeze, resume support.
12 12
13 The driver is built around a "spi_message" fifo serviced by a workqueue and a 13 The driver is built around a "spi_message" fifo serviced by a workqueue and a
14 tasklet. The workqueue, "pump_messages", drives the message fifo and the tasklet 14 tasklet. The workqueue, "pump_messages", drives the message fifo and the tasklet
15 (pump_transfer) is responsible for queuing SPI transactions and setting up and 15 (pump_transfer) is responsible for queuing SPI transactions and setting up and
16 launching the dma/interrupt driven transfers. 16 launching the dma/interrupt driven transfers.
17 17
18 Declaring PXA2xx Master Controllers 18 Declaring PXA2xx Master Controllers
19 ----------------------------------- 19 -----------------------------------
20 Typically a SPI master is defined in the arch/.../mach-*/board-*.c as a 20 Typically a SPI master is defined in the arch/.../mach-*/board-*.c as a
21 "platform device". The master configuration is passed to the driver via a table 21 "platform device". The master configuration is passed to the driver via a table
22 found in include/asm-arm/arch-pxa/pxa2xx_spi.h: 22 found in include/asm-arm/arch-pxa/pxa2xx_spi.h:
23 23
24 struct pxa2xx_spi_master { 24 struct pxa2xx_spi_master {
25 enum pxa_ssp_type ssp_type; 25 enum pxa_ssp_type ssp_type;
26 u32 clock_enable; 26 u32 clock_enable;
27 u16 num_chipselect; 27 u16 num_chipselect;
28 u8 enable_dma; 28 u8 enable_dma;
29 }; 29 };
30 30
31 The "pxa2xx_spi_master.ssp_type" field must have a value between 1 and 3 and 31 The "pxa2xx_spi_master.ssp_type" field must have a value between 1 and 3 and
32 informs the driver which features a particular SSP supports. 32 informs the driver which features a particular SSP supports.
33 33
34 The "pxa2xx_spi_master.clock_enable" field is used to enable/disable the 34 The "pxa2xx_spi_master.clock_enable" field is used to enable/disable the
35 corresponding SSP peripheral block in the "Clock Enable Register" (CKEN). See 35 corresponding SSP peripheral block in the "Clock Enable Register" (CKEN). See
36 the "PXA2xx Developer Manual" section "Clocks and Power Management". 36 the "PXA2xx Developer Manual" section "Clocks and Power Management".
37 37
38 The "pxa2xx_spi_master.num_chipselect" field is used to determine the number of 38 The "pxa2xx_spi_master.num_chipselect" field is used to determine the number of
39 slave devices (chips) attached to this SPI master. 39 slave devices (chips) attached to this SPI master.
40 40
41 The "pxa2xx_spi_master.enable_dma" field informs the driver that SSP DMA should 41 The "pxa2xx_spi_master.enable_dma" field informs the driver that SSP DMA should
42 be used. This causes the driver to acquire two DMA channels: rx_channel and 42 be used. This causes the driver to acquire two DMA channels: rx_channel and
43 tx_channel. The rx_channel has a higher DMA service priority than the tx_channel. 43 tx_channel. The rx_channel has a higher DMA service priority than the tx_channel.
44 See the "PXA2xx Developer Manual" section "DMA Controller". 44 See the "PXA2xx Developer Manual" section "DMA Controller".
45 45
46 NSSP MASTER SAMPLE 46 NSSP MASTER SAMPLE
47 ------------------ 47 ------------------
48 Below is a sample configuration using the PXA255 NSSP. 48 Below is a sample configuration using the PXA255 NSSP.
49 49
50 static struct resource pxa_spi_nssp_resources[] = { 50 static struct resource pxa_spi_nssp_resources[] = {
51 [0] = { 51 [0] = {
52 .start = __PREG(SSCR0_P(2)), /* Start address of NSSP */ 52 .start = __PREG(SSCR0_P(2)), /* Start address of NSSP */
53 .end = __PREG(SSCR0_P(2)) + 0x2c, /* Range of registers */ 53 .end = __PREG(SSCR0_P(2)) + 0x2c, /* Range of registers */
54 .flags = IORESOURCE_MEM, 54 .flags = IORESOURCE_MEM,
55 }, 55 },
56 [1] = { 56 [1] = {
57 .start = IRQ_NSSP, /* NSSP IRQ */ 57 .start = IRQ_NSSP, /* NSSP IRQ */
58 .end = IRQ_NSSP, 58 .end = IRQ_NSSP,
59 .flags = IORESOURCE_IRQ, 59 .flags = IORESOURCE_IRQ,
60 }, 60 },
61 }; 61 };
62 62
63 static struct pxa2xx_spi_master pxa_nssp_master_info = { 63 static struct pxa2xx_spi_master pxa_nssp_master_info = {
64 .ssp_type = PXA25x_NSSP, /* Type of SSP */ 64 .ssp_type = PXA25x_NSSP, /* Type of SSP */
65 .clock_enable = CKEN9_NSSP, /* NSSP Peripheral clock */ 65 .clock_enable = CKEN9_NSSP, /* NSSP Peripheral clock */
66 .num_chipselect = 1, /* Matches the number of chips attached to NSSP */ 66 .num_chipselect = 1, /* Matches the number of chips attached to NSSP */
67 .enable_dma = 1, /* Enables NSSP DMA */ 67 .enable_dma = 1, /* Enables NSSP DMA */
68 }; 68 };
69 69
70 static struct platform_device pxa_spi_nssp = { 70 static struct platform_device pxa_spi_nssp = {
71 .name = "pxa2xx-spi", /* MUST BE THIS VALUE, so device match driver */ 71 .name = "pxa2xx-spi", /* MUST BE THIS VALUE, so device match driver */
72 .id = 2, /* Bus number, MUST MATCH SSP number 1..n */ 72 .id = 2, /* Bus number, MUST MATCH SSP number 1..n */
73 .resource = pxa_spi_nssp_resources, 73 .resource = pxa_spi_nssp_resources,
74 .num_resources = ARRAY_SIZE(pxa_spi_nssp_resources), 74 .num_resources = ARRAY_SIZE(pxa_spi_nssp_resources),
75 .dev = { 75 .dev = {
76 .platform_data = &pxa_nssp_master_info, /* Passed to driver */ 76 .platform_data = &pxa_nssp_master_info, /* Passed to driver */
77 }, 77 },
78 }; 78 };
79 79
80 static struct platform_device *devices[] __initdata = { 80 static struct platform_device *devices[] __initdata = {
81 &pxa_spi_nssp, 81 &pxa_spi_nssp,
82 }; 82 };
83 83
84 static void __init board_init(void) 84 static void __init board_init(void)
85 { 85 {
86 (void)platform_add_devices(devices, ARRAY_SIZE(devices)); 86 (void)platform_add_devices(devices, ARRAY_SIZE(devices));
87 } 87 }
88 88
89 Declaring Slave Devices 89 Declaring Slave Devices
90 ----------------------- 90 -----------------------
91 Typically each SPI slave (chip) is defined in the arch/.../mach-*/board-*.c 91 Typically each SPI slave (chip) is defined in the arch/.../mach-*/board-*.c
92 using the "spi_board_info" structure found in "linux/spi/spi.h". See 92 using the "spi_board_info" structure found in "linux/spi/spi.h". See
93 "Documentation/spi/spi_summary" for additional information. 93 "Documentation/spi/spi_summary" for additional information.
94 94
95 Each slave device attached to the PXA must provide slave specific configuration 95 Each slave device attached to the PXA must provide slave specific configuration
96 information via the structure "pxa2xx_spi_chip" found in 96 information via the structure "pxa2xx_spi_chip" found in
97 "include/asm-arm/arch-pxa/pxa2xx_spi.h". The pxa2xx_spi master controller driver 97 "include/asm-arm/arch-pxa/pxa2xx_spi.h". The pxa2xx_spi master controller driver
98 will use the configuration whenever the driver communicates with the slave 98 will use the configuration whenever the driver communicates with the slave
99 device. 99 device.
100 100
101 struct pxa2xx_spi_chip { 101 struct pxa2xx_spi_chip {
102 u8 tx_threshold; 102 u8 tx_threshold;
103 u8 rx_threshold; 103 u8 rx_threshold;
104 u8 dma_burst_size; 104 u8 dma_burst_size;
105 u32 timeout_microsecs; 105 u32 timeout_microsecs;
106 u8 enable_loopback; 106 u8 enable_loopback;
107 void (*cs_control)(u32 command); 107 void (*cs_control)(u32 command);
108 }; 108 };
109 109
110 The "pxa2xx_spi_chip.tx_threshold" and "pxa2xx_spi_chip.rx_threshold" fields are 110 The "pxa2xx_spi_chip.tx_threshold" and "pxa2xx_spi_chip.rx_threshold" fields are
111 used to configure the SSP hardware fifo. These fields are critical to the 111 used to configure the SSP hardware fifo. These fields are critical to the
112 performance of pxa2xx_spi driver and misconfiguration will result in rx 112 performance of pxa2xx_spi driver and misconfiguration will result in rx
113 fifo overruns (especially in PIO mode transfers). Good default values are 113 fifo overruns (especially in PIO mode transfers). Good default values are
114 114
115 .tx_threshold = 12, 115 .tx_threshold = 12,
116 .rx_threshold = 4, 116 .rx_threshold = 4,
117 117
118 The "pxa2xx_spi_chip.dma_burst_size" field is used to configure PXA2xx DMA 118 The "pxa2xx_spi_chip.dma_burst_size" field is used to configure PXA2xx DMA
119 engine and is related to the "spi_device.bits_per_word" field. Read and understand 119 engine and is related to the "spi_device.bits_per_word" field. Read and understand
120 the PXA2xx "Developer Manual" sections on the DMA controller and SSP Controllers 120 the PXA2xx "Developer Manual" sections on the DMA controller and SSP Controllers
121 to determine the correct value. An SSP configured for byte-wide transfers would 121 to determine the correct value. An SSP configured for byte-wide transfers would
122 use a value of 8. 122 use a value of 8.
123 123
124 The "pxa2xx_spi_chip.timeout_microsecs" field is used to efficiently handle 124 The "pxa2xx_spi_chip.timeout_microsecs" field is used to efficiently handle
125 trailing bytes in the SSP receiver fifo. The correct value for this field is 125 trailing bytes in the SSP receiver fifo. The correct value for this field is
126 dependent on the SPI bus speed ("spi_board_info.max_speed_hz") and the specific 126 dependent on the SPI bus speed ("spi_board_info.max_speed_hz") and the specific
127 slave device. Please note the the PXA2xx SSP 1 does not support trailing byte 127 slave device. Please note that the PXA2xx SSP 1 does not support trailing byte
128 timeouts and must busy-wait any trailing bytes. 128 timeouts and must busy-wait any trailing bytes.
129 129
130 The "pxa2xx_spi_chip.enable_loopback" field is used to place the SSP port 130 The "pxa2xx_spi_chip.enable_loopback" field is used to place the SSP port
131 into internal loopback mode. In this mode the SSP controller internally 131 into internal loopback mode. In this mode the SSP controller internally
132 connects the SSPTX pin the the SSPRX pin. This is useful for initial setup 132 connects the SSPTX pin to the SSPRX pin. This is useful for initial setup
133 testing. 133 testing.
134 134
135 The "pxa2xx_spi_chip.cs_control" field is used to point to a board specific 135 The "pxa2xx_spi_chip.cs_control" field is used to point to a board specific
136 function for asserting/deasserting a slave device chip select. If the field is 136 function for asserting/deasserting a slave device chip select. If the field is
137 NULL, the pxa2xx_spi master controller driver assumes that the SSP port is 137 NULL, the pxa2xx_spi master controller driver assumes that the SSP port is
138 configured to use SSPFRM instead. 138 configured to use SSPFRM instead.
139 139
140 NSSP SLAVE SAMPLE 140 NSSP SLAVE SAMPLE
141 ----------------- 141 -----------------
142 The pxa2xx_spi_chip structure is passed to the pxa2xx_spi driver in the 142 The pxa2xx_spi_chip structure is passed to the pxa2xx_spi driver in the
143 "spi_board_info.controller_data" field. Below is a sample configuration using 143 "spi_board_info.controller_data" field. Below is a sample configuration using
144 the PXA255 NSSP. 144 the PXA255 NSSP.
145 145
146 /* Chip Select control for the CS8415A SPI slave device */ 146 /* Chip Select control for the CS8415A SPI slave device */
147 static void cs8415a_cs_control(u32 command) 147 static void cs8415a_cs_control(u32 command)
148 { 148 {
149 if (command & PXA2XX_CS_ASSERT) 149 if (command & PXA2XX_CS_ASSERT)
150 GPCR(2) = GPIO_bit(2); 150 GPCR(2) = GPIO_bit(2);
151 else 151 else
152 GPSR(2) = GPIO_bit(2); 152 GPSR(2) = GPIO_bit(2);
153 } 153 }
154 154
155 /* Chip Select control for the CS8405A SPI slave device */ 155 /* Chip Select control for the CS8405A SPI slave device */
156 static void cs8405a_cs_control(u32 command) 156 static void cs8405a_cs_control(u32 command)
157 { 157 {
158 if (command & PXA2XX_CS_ASSERT) 158 if (command & PXA2XX_CS_ASSERT)
159 GPCR(3) = GPIO_bit(3); 159 GPCR(3) = GPIO_bit(3);
160 else 160 else
161 GPSR(3) = GPIO_bit(3); 161 GPSR(3) = GPIO_bit(3);
162 } 162 }
163 163
164 static struct pxa2xx_spi_chip cs8415a_chip_info = { 164 static struct pxa2xx_spi_chip cs8415a_chip_info = {
165 .tx_threshold = 12, /* SSP hardware FIFO threshold */ 165 .tx_threshold = 12, /* SSP hardware FIFO threshold */
166 .rx_threshold = 4, /* SSP hardware FIFO threshold */ 166 .rx_threshold = 4, /* SSP hardware FIFO threshold */
167 .dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */ 167 .dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */
168 .timeout_microsecs = 64, /* Wait at least 64usec to handle trailing */ 168 .timeout_microsecs = 64, /* Wait at least 64usec to handle trailing */
169 .cs_control = cs8415a_cs_control, /* Use external chip select */ 169 .cs_control = cs8415a_cs_control, /* Use external chip select */
170 }; 170 };
171 171
172 static struct pxa2xx_spi_chip cs8405a_chip_info = { 172 static struct pxa2xx_spi_chip cs8405a_chip_info = {
173 .tx_threshold = 12, /* SSP hardware FIFO threshold */ 173 .tx_threshold = 12, /* SSP hardware FIFO threshold */
174 .rx_threshold = 4, /* SSP hardware FIFO threshold */ 174 .rx_threshold = 4, /* SSP hardware FIFO threshold */
175 .dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */ 175 .dma_burst_size = 8, /* Byte wide transfers used so 8 byte bursts */
176 .timeout_microsecs = 64, /* Wait at least 64usec to handle trailing */ 176 .timeout_microsecs = 64, /* Wait at least 64usec to handle trailing */
177 .cs_control = cs8405a_cs_control, /* Use external chip select */ 177 .cs_control = cs8405a_cs_control, /* Use external chip select */
178 }; 178 };
179 179
180 static struct spi_board_info streetracer_spi_board_info[] __initdata = { 180 static struct spi_board_info streetracer_spi_board_info[] __initdata = {
181 { 181 {
182 .modalias = "cs8415a", /* Name of spi_driver for this device */ 182 .modalias = "cs8415a", /* Name of spi_driver for this device */
183 .max_speed_hz = 3686400, /* Run SSP as fast as possible */ 183 .max_speed_hz = 3686400, /* Run SSP as fast as possible */
184 .bus_num = 2, /* Framework bus number */ 184 .bus_num = 2, /* Framework bus number */
185 .chip_select = 0, /* Framework chip select */ 185 .chip_select = 0, /* Framework chip select */
186 .platform_data = NULL, /* No spi_driver specific config */ 186 .platform_data = NULL, /* No spi_driver specific config */
187 .controller_data = &cs8415a_chip_info, /* Master chip config */ 187 .controller_data = &cs8415a_chip_info, /* Master chip config */
188 .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */ 188 .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */
189 }, 189 },
190 { 190 {
191 .modalias = "cs8405a", /* Name of spi_driver for this device */ 191 .modalias = "cs8405a", /* Name of spi_driver for this device */
192 .max_speed_hz = 3686400, /* Run SSP as fast as possible */ 192 .max_speed_hz = 3686400, /* Run SSP as fast as possible */
193 .bus_num = 2, /* Framework bus number */ 193 .bus_num = 2, /* Framework bus number */
194 .chip_select = 1, /* Framework chip select */ 194 .chip_select = 1, /* Framework chip select */
195 .controller_data = &cs8405a_chip_info, /* Master chip config */ 195 .controller_data = &cs8405a_chip_info, /* Master chip config */
196 .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */ 196 .irq = STREETRACER_APCI_IRQ, /* Slave device interrupt */
197 }, 197 },
198 }; 198 };
199 199
200 static void __init streetracer_init(void) 200 static void __init streetracer_init(void)
201 { 201 {
202 spi_register_board_info(streetracer_spi_board_info, 202 spi_register_board_info(streetracer_spi_board_info,
203 ARRAY_SIZE(streetracer_spi_board_info)); 203 ARRAY_SIZE(streetracer_spi_board_info));
204 } 204 }
205 205
206 206
207 DMA and PIO I/O Support 207 DMA and PIO I/O Support
208 ----------------------- 208 -----------------------
209 The pxa2xx_spi driver supports both DMA and interrupt driven PIO message 209 The pxa2xx_spi driver supports both DMA and interrupt driven PIO message
210 transfers. The driver defaults to PIO mode and DMA transfers must be enabled by 210 transfers. The driver defaults to PIO mode and DMA transfers must be enabled by
211 setting the "enable_dma" flag in the "pxa2xx_spi_master" structure and and 211 setting the "enable_dma" flag in the "pxa2xx_spi_master" structure and
212 ensuring that the "pxa2xx_spi_chip.dma_burst_size" field is non-zero. The DMA 212 ensuring that the "pxa2xx_spi_chip.dma_burst_size" field is non-zero. The DMA
213 mode supports both coherent and stream based DMA mappings. 213 mode supports both coherent and stream based DMA mappings.
214 214
215 The following logic is used to determine the type of I/O to be used on 215 The following logic is used to determine the type of I/O to be used on
216 a per "spi_transfer" basis: 216 a per "spi_transfer" basis:
217 217
218 if !enable_dma or dma_burst_size == 0 then 218 if !enable_dma or dma_burst_size == 0 then
219 always use PIO transfers 219 always use PIO transfers
220 220
221 if spi_message.is_dma_mapped and rx_dma_buf != 0 and tx_dma_buf != 0 then 221 if spi_message.is_dma_mapped and rx_dma_buf != 0 and tx_dma_buf != 0 then
222 use coherent DMA mode 222 use coherent DMA mode
223 223
224 if rx_buf and tx_buf are aligned on 8 byte boundary then 224 if rx_buf and tx_buf are aligned on 8 byte boundary then
225 use streaming DMA mode 225 use streaming DMA mode
226 226
227 otherwise 227 otherwise
228 use PIO transfer 228 use PIO transfer
229 229
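The decision rules above can be restated as a small C function. This is only a sketch of the rules as written in this howto, not the driver's own code: struct xfer_inputs, enum xfer_mode and choose_xfer_mode() are hypothetical names, and the structure simply gathers the fields the rules mention.

```c
#include <stdint.h>

/* Hypothetical result type for the sketch. */
enum xfer_mode { XFER_PIO, XFER_DMA_COHERENT, XFER_DMA_STREAMING };

/* Hypothetical, simplified view of the inputs the rules look at. */
struct xfer_inputs {
	int enable_dma;			/* pxa2xx_spi_master.enable_dma */
	unsigned int dma_burst_size;	/* pxa2xx_spi_chip.dma_burst_size */
	int is_dma_mapped;		/* spi_message.is_dma_mapped */
	unsigned long rx_dma_buf;	/* DMA addresses, 0 if not mapped */
	unsigned long tx_dma_buf;
	const void *rx_buf;		/* CPU buffers */
	const void *tx_buf;
};

/* True when the pointer sits on an 8 byte boundary. */
static int aligned8(const void *p)
{
	return ((uintptr_t)p & 7) == 0;
}

/* Apply the four rules in the order they are listed above. */
static enum xfer_mode choose_xfer_mode(const struct xfer_inputs *in)
{
	if (!in->enable_dma || in->dma_burst_size == 0)
		return XFER_PIO;
	if (in->is_dma_mapped && in->rx_dma_buf != 0 && in->tx_dma_buf != 0)
		return XFER_DMA_COHERENT;
	if (aligned8(in->rx_buf) && aligned8(in->tx_buf))
		return XFER_DMA_STREAMING;
	return XFER_PIO;
}
```

The order of the checks matters: a transfer with pre-mapped buffers is handled as coherent DMA before the alignment of rx_buf and tx_buf is considered, and a misaligned buffer falls back to PIO even when DMA is enabled.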
230 THANKS TO 230 THANKS TO
231 --------- 231 ---------
232 232
233 David Brownell and others for mentoring the development of this driver. 233 David Brownell and others for mentoring the development of this driver.
234 234
235 235
Documentation/spi/spi-summary
1 Overview of Linux kernel SPI support 1 Overview of Linux kernel SPI support
2 ==================================== 2 ====================================
3 3
4 02-Dec-2005 4 02-Dec-2005
5 5
6 What is SPI? 6 What is SPI?
7 ------------ 7 ------------
8 The "Serial Peripheral Interface" (SPI) is a synchronous four wire serial 8 The "Serial Peripheral Interface" (SPI) is a synchronous four wire serial
9 link used to connect microcontrollers to sensors, memory, and peripherals. 9 link used to connect microcontrollers to sensors, memory, and peripherals.
10 10
11 The three signal wires hold a clock (SCLK, often on the order of 10 MHz), 11 The three signal wires hold a clock (SCLK, often on the order of 10 MHz),
12 and parallel data lines with "Master Out, Slave In" (MOSI) or "Master In, 12 and parallel data lines with "Master Out, Slave In" (MOSI) or "Master In,
13 Slave Out" (MISO) signals. (Other names are also used.) There are four 13 Slave Out" (MISO) signals. (Other names are also used.) There are four
14 clocking modes through which data is exchanged; mode-0 and mode-3 are most 14 clocking modes through which data is exchanged; mode-0 and mode-3 are most
15 commonly used. Each clock cycle shifts data out and data in; the clock 15 commonly used. Each clock cycle shifts data out and data in; the clock
16 doesn't cycle except when there is data to shift. 16 doesn't cycle except when there is data to shift.
17 17
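The four clocking modes mentioned above are conventionally numbered from two bits, clock polarity (CPOL) and clock phase (CPHA). The sketch below decodes a mode number into those two bits. The EX_ names and ex_decode_mode() are made up for illustration; the bit assignment is the usual convention and is assumed here rather than taken from this text.

```c
/* Hypothetical names; bit values follow the usual CPOL/CPHA convention. */
#define EX_CPHA	0x01	/* data is sampled on the trailing clock edge */
#define EX_CPOL	0x02	/* clock idles high */

struct ex_spi_mode {
	int cpol;
	int cpha;
};

/* Split a mode number (0..3) into its polarity and phase bits. */
static struct ex_spi_mode ex_decode_mode(unsigned int mode)
{
	struct ex_spi_mode m;

	m.cpol = (mode & EX_CPOL) != 0;
	m.cpha = (mode & EX_CPHA) != 0;
	return m;
}
```

Under this numbering mode-0 is CPOL=0, CPHA=0 and mode-3 is CPOL=1, CPHA=1. Both sample data on the rising clock edge, which is one reason chips often accept either.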
18 SPI masters may use a "chip select" line to activate a given SPI slave 18 SPI masters may use a "chip select" line to activate a given SPI slave
19 device, so those three signal wires may be connected to several chips 19 device, so those three signal wires may be connected to several chips
20 in parallel. All SPI slaves support chipselects. Some devices have 20 in parallel. All SPI slaves support chipselects. Some devices have
21 other signals, often including an interrupt to the master. 21 other signals, often including an interrupt to the master.
22 22
23 Unlike serial busses like USB or SMBUS, even low level protocols for 23 Unlike serial busses like USB or SMBUS, even low level protocols for
24 SPI slave functions are usually not interoperable between vendors 24 SPI slave functions are usually not interoperable between vendors
25 (except for cases like SPI memory chips). 25 (except for cases like SPI memory chips).
26 26
27 - SPI may be used for request/response style device protocols, as with 27 - SPI may be used for request/response style device protocols, as with
28 touchscreen sensors and memory chips. 28 touchscreen sensors and memory chips.
29 29
30 - It may also be used to stream data in either direction (half duplex), 30 - It may also be used to stream data in either direction (half duplex),
31 or both of them at the same time (full duplex). 31 or both of them at the same time (full duplex).
32 32
33 - Some devices may use eight bit words. Others may use different word 33 - Some devices may use eight bit words. Others may use different word
34 lengths, such as streams of 12-bit or 20-bit digital samples. 34 lengths, such as streams of 12-bit or 20-bit digital samples.
35 35
36 In the same way, SPI slaves will only rarely support any kind of automatic 36 In the same way, SPI slaves will only rarely support any kind of automatic
37 discovery/enumeration protocol. The tree of slave devices accessible from 37 discovery/enumeration protocol. The tree of slave devices accessible from
38 a given SPI master will normally be set up manually, with configuration 38 a given SPI master will normally be set up manually, with configuration
39 tables. 39 tables.
40 40
41 SPI is only one of the names used by such four-wire protocols, and 41 SPI is only one of the names used by such four-wire protocols, and
42 most controllers have no problem handling "MicroWire" (think of it as 42 most controllers have no problem handling "MicroWire" (think of it as
43 half-duplex SPI, for request/response protocols), SSP ("Synchronous 43 half-duplex SPI, for request/response protocols), SSP ("Synchronous
44 Serial Protocol"), PSP ("Programmable Serial Protocol"), and other 44 Serial Protocol"), PSP ("Programmable Serial Protocol"), and other
45 related protocols. 45 related protocols.
46 46
47 Microcontrollers often support both master and slave sides of the SPI 47 Microcontrollers often support both master and slave sides of the SPI
48 protocol. This document (and Linux) currently only supports the master 48 protocol. This document (and Linux) currently only supports the master
49 side of SPI interactions. 49 side of SPI interactions.
50 50
51 51
52 Who uses it? On what kinds of systems? 52 Who uses it? On what kinds of systems?
53 --------------------------------------- 53 ---------------------------------------
54 Linux developers using SPI are probably writing device drivers for embedded 54 Linux developers using SPI are probably writing device drivers for embedded
55 systems boards. SPI is used to control external chips, and it is also a 55 systems boards. SPI is used to control external chips, and it is also a
56 protocol supported by every MMC or SD memory card. (The older "DataFlash" 56 protocol supported by every MMC or SD memory card. (The older "DataFlash"
57 cards, predating MMC cards but using the same connectors and card shape, 57 cards, predating MMC cards but using the same connectors and card shape,
58 support only SPI.) Some PC hardware uses SPI flash for BIOS code. 58 support only SPI.) Some PC hardware uses SPI flash for BIOS code.
59 59
60 SPI slave chips range from digital/analog converters used for analog 60 SPI slave chips range from digital/analog converters used for analog
61 sensors and codecs, to memory, to peripherals like USB controllers 61 sensors and codecs, to memory, to peripherals like USB controllers
62 or Ethernet adapters; and more. 62 or Ethernet adapters; and more.
63 63
64 Most systems using SPI will integrate a few devices on a mainboard. 64 Most systems using SPI will integrate a few devices on a mainboard.
65 Some provide SPI links on expansion connectors; in cases where no 65 Some provide SPI links on expansion connectors; in cases where no
66 dedicated SPI controller exists, GPIO pins can be used to create a 66 dedicated SPI controller exists, GPIO pins can be used to create a
67 low speed "bitbanging" adapter. Very few systems will "hotplug" an SPI 67 low speed "bitbanging" adapter. Very few systems will "hotplug" an SPI
68 controller; the reasons to use SPI focus on low cost and simple operation, 68 controller; the reasons to use SPI focus on low cost and simple operation,
69 and if dynamic reconfiguration is important, USB will often be a more 69 and if dynamic reconfiguration is important, USB will often be a more
70 appropriate low-pincount peripheral bus. 70 appropriate low-pincount peripheral bus.
71 71
72 Many microcontrollers that can run Linux integrate one or more I/O 72 Many microcontrollers that can run Linux integrate one or more I/O
73 interfaces with SPI modes. Given SPI support, they could use MMC or SD 73 interfaces with SPI modes. Given SPI support, they could use MMC or SD
74 cards without needing a special purpose MMC/SD/SDIO controller. 74 cards without needing a special purpose MMC/SD/SDIO controller.
75 75
76 76
77 How do these driver programming interfaces work? 77 How do these driver programming interfaces work?
78 ------------------------------------------------ 78 ------------------------------------------------
79 The <linux/spi/spi.h> header file includes kerneldoc, as does the 79 The <linux/spi/spi.h> header file includes kerneldoc, as does the
80 main source code, and you should certainly read that. This is just 80 main source code, and you should certainly read that. This is just
81 an overview, so you get the big picture before the details. 81 an overview, so you get the big picture before the details.
82 82
83 SPI requests always go into I/O queues. Requests for a given SPI device 83 SPI requests always go into I/O queues. Requests for a given SPI device
84 are always executed in FIFO order, and complete asynchronously through 84 are always executed in FIFO order, and complete asynchronously through
85 completion callbacks. There are also some simple synchronous wrappers 85 completion callbacks. There are also some simple synchronous wrappers
86 for those calls, including ones for common transaction types like writing 86 for those calls, including ones for common transaction types like writing
87 a command and then reading its response. 87 a command and then reading its response.
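
The queueing behavior described above can be modeled in plain userspace C.
This is a rough sketch, not kernel code: every model_ name is hypothetical,
and the queue is drained synchronously, whereas a real controller driver
completes requests asynchronously.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_request {
	int id;
	void (*complete)(struct model_request *req); /* completion callback */
	struct model_request *next;
};

struct model_queue {
	struct model_request *head;
	struct model_request *tail;
};

/* Records the order in which completions are reported. */
static int model_order[8];
static int model_count;

static void model_record(struct model_request *req)
{
	if (model_count < 8)
		model_order[model_count++] = req->id;
}

/* Append a request; it runs after everything queued before it. */
static void model_submit(struct model_queue *q, struct model_request *req)
{
	req->next = NULL;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
}

/* Drain the queue in FIFO order, reporting each completion through its
 * callback. Returns the number of requests processed.
 */
static int model_run(struct model_queue *q)
{
	int done = 0;

	while (q->head) {
		struct model_request *req = q->head;

		q->head = req->next;
		if (!q->head)
			q->tail = NULL;
		/* a real controller would shift bits on the wire here */
		if (req->complete)
			req->complete(req);
		done++;
	}
	return done;
}
```

Submitting request 1 and then request 2 and calling model_run() reports
the completions in that same order.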
88 88
89 There are two types of SPI driver, here called: 89 There are two types of SPI driver, here called:
90 90
91 Controller drivers ... these are often built in to System-On-Chip 91 Controller drivers ... these are often built in to System-On-Chip
92 processors, and often support both Master and Slave roles. 92 processors, and often support both Master and Slave roles.
93 These drivers touch hardware registers and may use DMA. 93 These drivers touch hardware registers and may use DMA.
94 Or they can be PIO bitbangers, needing just GPIO pins. 94 Or they can be PIO bitbangers, needing just GPIO pins.
95 95
96 Protocol drivers ... these pass messages through the controller 96 Protocol drivers ... these pass messages through the controller
97 driver to communicate with a Slave or Master device on the 97 driver to communicate with a Slave or Master device on the
98 other side of an SPI link. 98 other side of an SPI link.
99 99
100 So for example one protocol driver might talk to the MTD layer to export 100 So for example one protocol driver might talk to the MTD layer to export
101 data to filesystems stored on SPI flash like DataFlash; and others might 101 data to filesystems stored on SPI flash like DataFlash; and others might
102 control audio interfaces, present touchscreen sensors as input interfaces, 102 control audio interfaces, present touchscreen sensors as input interfaces,
103 or monitor temperature and voltage levels during industrial processing. 103 or monitor temperature and voltage levels during industrial processing.
104 And those might all be sharing the same controller driver. 104 And those might all be sharing the same controller driver.
105 105
106 A "struct spi_device" encapsulates the master-side interface between 106 A "struct spi_device" encapsulates the master-side interface between
107 those two types of driver. At this writing, Linux has no slave side 107 those two types of driver. At this writing, Linux has no slave side
108 programming interface. 108 programming interface.
109 109
110 There is a minimal core of SPI programming interfaces, focussing on 110 There is a minimal core of SPI programming interfaces, focussing on
111 using driver model to connect controller and protocol drivers using 111 using driver model to connect controller and protocol drivers using
112 device tables provided by board specific initialization code. SPI 112 device tables provided by board specific initialization code. SPI
113 shows up in sysfs in several locations: 113 shows up in sysfs in several locations:
114 114
115 /sys/devices/.../CTLR/spiB.C ... spi_device on bus "B", 115 /sys/devices/.../CTLR/spiB.C ... spi_device on bus "B",
116 chipselect C, accessed through CTLR. 116 chipselect C, accessed through CTLR.
117 117
118 /sys/devices/.../CTLR/spiB.C/modalias ... identifies the driver 118 /sys/devices/.../CTLR/spiB.C/modalias ... identifies the driver
119 that should be used with this device (for hotplug/coldplug) 119 that should be used with this device (for hotplug/coldplug)
120 120
121 /sys/bus/spi/devices/spiB.C ... symlink to the physical 121 /sys/bus/spi/devices/spiB.C ... symlink to the physical
122 spiB-C device 122 spiB-C device
123 123
124 /sys/bus/spi/drivers/D ... driver for one or more spi*.* devices 124 /sys/bus/spi/drivers/D ... driver for one or more spi*.* devices
125 125
126 /sys/class/spi_master/spiB ... class device for the controller 126 /sys/class/spi_master/spiB ... class device for the controller
127 managing bus "B". All the spiB.* devices share the same 127 managing bus "B". All the spiB.* devices share the same
128 physical SPI bus segment, with SCLK, MOSI, and MISO. 128 physical SPI bus segment, with SCLK, MOSI, and MISO.
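
The "spiB.C" naming used in those paths can be sketched as a small
formatting helper. The function name is hypothetical and the code is a
userspace illustration of the naming pattern only; it is not how the SPI
core builds device names.

```c
#include <stdio.h>

/* Hypothetical helper: format the "spiB.C" name for bus B, chipselect C.
 * Returns what snprintf() returns, i.e. the length of the full name.
 */
static int sketch_spi_dev_name(char *buf, size_t len,
			       unsigned bus, unsigned cs)
{
	return snprintf(buf, len, "spi%u.%u", bus, cs);
}
```

For a device on bus 1 with chipselect 0 this yields "spi1.0".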
129 129
130 130
131 How does board-specific init code declare SPI devices? 131 How does board-specific init code declare SPI devices?
132 ------------------------------------------------------ 132 ------------------------------------------------------
133 Linux needs several kinds of information to properly configure SPI devices. 133 Linux needs several kinds of information to properly configure SPI devices.
134 That information is normally provided by board-specific code, even for 134 That information is normally provided by board-specific code, even for
135 chips that do support some form of automated discovery/enumeration. 135 chips that do support some form of automated discovery/enumeration.
136 136
137 DECLARE CONTROLLERS 137 DECLARE CONTROLLERS
138 138
139 The first kind of information is a list of what SPI controllers exist. 139 The first kind of information is a list of what SPI controllers exist.
140 For System-on-Chip (SOC) based boards, these will usually be platform 140 For System-on-Chip (SOC) based boards, these will usually be platform
141 devices, and the controller may need some platform_data in order to 141 devices, and the controller may need some platform_data in order to
142 operate properly. The "struct platform_device" will include resources 142 operate properly. The "struct platform_device" will include resources
143 like the physical address of the controller's first register and its IRQ. 143 like the physical address of the controller's first register and its IRQ.
144 144
145 Platforms will often abstract the "register SPI controller" operation, 145 Platforms will often abstract the "register SPI controller" operation,
146 maybe coupling it with code to initialize pin configurations, so that 146 maybe coupling it with code to initialize pin configurations, so that
147 the arch/.../mach-*/board-*.c files for several boards can all share the 147 the arch/.../mach-*/board-*.c files for several boards can all share the
148 same basic controller setup code. This is because most SOCs have several 148 same basic controller setup code. This is because most SOCs have several
149 SPI-capable controllers, and only the ones actually usable on a given 149 SPI-capable controllers, and only the ones actually usable on a given
150 board should normally be set up and registered. 150 board should normally be set up and registered.
151 151
152 So for example arch/.../mach-*/board-*.c files might have code like: 152 So for example arch/.../mach-*/board-*.c files might have code like:
153 153
154 #include <asm/arch/spi.h> /* for mysoc_spi_data */ 154 #include <asm/arch/spi.h> /* for mysoc_spi_data */
155 155
156 /* if your mach-* infrastructure doesn't support kernels that can 156 /* if your mach-* infrastructure doesn't support kernels that can
157 * run on multiple boards, pdata wouldn't benefit from "__init". 157 * run on multiple boards, pdata wouldn't benefit from "__init".
158 */ 158 */
159 static struct mysoc_spi_data pdata __initdata = { ... }; 159 static struct mysoc_spi_data pdata __initdata = { ... };
160 160
161 static void __init board_init(void) 161 static void __init board_init(void)
162 { 162 {
163 ... 163 ...
164 /* this board only uses SPI controller #2 */ 164 /* this board only uses SPI controller #2 */
165 mysoc_register_spi(2, &pdata); 165 mysoc_register_spi(2, &pdata);
166 ... 166 ...
167 } 167 }
168 168
169 And SOC-specific utility code might look something like: 169 And SOC-specific utility code might look something like:
170 170
171 #include <asm/arch/spi.h> 171 #include <asm/arch/spi.h>
172 172
173 static struct platform_device spi2 = { ... }; 173 static struct platform_device spi2 = { ... };
174 174
175 void mysoc_register_spi(unsigned n, struct mysoc_spi_data *pdata) 175 void mysoc_register_spi(unsigned n, struct mysoc_spi_data *pdata)
176 { 176 {
177 struct mysoc_spi_data *pdata2; 177 struct mysoc_spi_data *pdata2;
178 178
179 pdata2 = kmalloc(sizeof *pdata2, GFP_KERNEL); 179 pdata2 = kmalloc(sizeof *pdata2, GFP_KERNEL);
180 *pdata2 = *pdata; 180 *pdata2 = *pdata;
181 ... 181 ...
182 if (n == 2) { 182 if (n == 2) {
183 spi2.dev.platform_data = pdata2; 183 spi2.dev.platform_data = pdata2;
184 platform_device_register(&spi2); 184 platform_device_register(&spi2);
185 185
186 /* also: set up pin modes so the spi2 signals are 186 /* also: set up pin modes so the spi2 signals are
187 * visible on the relevant pins ... bootloaders on 187 * visible on the relevant pins ... bootloaders on
188 * production boards may already have done this, but 188 * production boards may already have done this, but
189 * developer boards will often need Linux to do it. 189 * developer boards will often need Linux to do it.
190 */ 190 */
191 } 191 }
192 ... 192 ...
193 } 193 }
194 194
195 Notice how the platform_data for boards may be different, even if the 195 Notice how the platform_data for boards may be different, even if the
196 same SOC controller is used. For example, on one board SPI might use 196 same SOC controller is used. For example, on one board SPI might use
197 an external clock, where another derives the SPI clock from current 197 an external clock, where another derives the SPI clock from current
198 settings of some master clock. 198 settings of some master clock.
199 199
200 200
201 DECLARE SLAVE DEVICES 201 DECLARE SLAVE DEVICES
202 202
203 The second kind of information is a list of what SPI slave devices exist 203 The second kind of information is a list of what SPI slave devices exist
204 on the target board, often with some board-specific data needed for the 204 on the target board, often with some board-specific data needed for the
205 driver to work correctly. 205 driver to work correctly.
206 206
207 Normally your arch/.../mach-*/board-*.c files would provide a small table 207 Normally your arch/.../mach-*/board-*.c files would provide a small table
208 listing the SPI devices on each board. (This would typically be only a 208 listing the SPI devices on each board. (This would typically be only a
209 small handful.) That might look like: 209 small handful.) That might look like:
210 210
211 static struct ads7846_platform_data ads_info = { 211 static struct ads7846_platform_data ads_info = {
212 .vref_delay_usecs = 100, 212 .vref_delay_usecs = 100,
213 .x_plate_ohms = 580, 213 .x_plate_ohms = 580,
214 .y_plate_ohms = 410, 214 .y_plate_ohms = 410,
215 }; 215 };
216 216
217 static struct spi_board_info spi_board_info[] __initdata = { 217 static struct spi_board_info spi_board_info[] __initdata = {
218 { 218 {
219 .modalias = "ads7846", 219 .modalias = "ads7846",
220 .platform_data = &ads_info, 220 .platform_data = &ads_info,
221 .mode = SPI_MODE_0, 221 .mode = SPI_MODE_0,
222 .irq = GPIO_IRQ(31), 222 .irq = GPIO_IRQ(31),
223 .max_speed_hz = 120000 /* max sample rate at 3V */ * 16, 223 .max_speed_hz = 120000 /* max sample rate at 3V */ * 16,
224 .bus_num = 1, 224 .bus_num = 1,
225 .chip_select = 0, 225 .chip_select = 0,
226 }, 226 },
227 }; 227 };
228 228
229 Again, notice how board-specific information is provided; each chip may need 229 Again, notice how board-specific information is provided; each chip may need
230 several types. This example shows generic constraints like the fastest SPI 230 several types. This example shows generic constraints like the fastest SPI
231 clock to allow (a function of board voltage in this case) or how an IRQ pin 231 clock to allow (a function of board voltage in this case) or how an IRQ pin
232 is wired, plus chip-specific constraints like an important delay that's 232 is wired, plus chip-specific constraints like an important delay that's
233 changed by the capacitance at one pin. 233 changed by the capacitance at one pin.
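
The max_speed_hz value in that table is a product of two numbers. As a
worked example, and assuming the factor of 16 means clock cycles per
sample (the table itself doesn't say), the arithmetic looks like this;
the helper name is hypothetical.

```c
/* Hypothetical helper: the SCLK rate needed to sustain a sample rate,
 * assuming a fixed number of clock cycles per sample.
 */
static unsigned long sketch_sclk_hz(unsigned long samples_per_sec,
				    unsigned clocks_per_sample)
{
	return samples_per_sec * clocks_per_sample;
}
```

With 120000 samples per second and 16 clocks per sample, the result is
1920000 Hz, just under 2 MHz.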
234 234
235 (There's also "controller_data", information that may be useful to the 235 (There's also "controller_data", information that may be useful to the
236 controller driver. An example would be peripheral-specific DMA tuning 236 controller driver. An example would be peripheral-specific DMA tuning
237 data or chipselect callbacks. This is stored in spi_device later.) 237 data or chipselect callbacks. This is stored in spi_device later.)
238 238
239 The board_info should provide enough information to let the system work 239 The board_info should provide enough information to let the system work
240 without the chip's driver being loaded. The most troublesome aspect of 240 without the chip's driver being loaded. The most troublesome aspect of
241 that is likely the SPI_CS_HIGH bit in the spi_device.mode field, since 241 that is likely the SPI_CS_HIGH bit in the spi_device.mode field, since
242 sharing a bus with a device that interprets chipselect "backwards" is 242 sharing a bus with a device that interprets chipselect "backwards" is
243 not possible. 243 not possible.
244 244
245 Then your board initialization code would register that table with the SPI 245 Then your board initialization code would register that table with the SPI
246 infrastructure, so that it's available later when the SPI master controller 246 infrastructure, so that it's available later when the SPI master controller
247 driver is registered: 247 driver is registered:
248 248
249 spi_register_board_info(spi_board_info, ARRAY_SIZE(spi_board_info)); 249 spi_register_board_info(spi_board_info, ARRAY_SIZE(spi_board_info));
250 250
251 Like with other static board-specific setup, you won't unregister those. 251 Like with other static board-specific setup, you won't unregister those.
252 252
253 The widely used "card" style computers bundle memory, cpu, and little else 253 The widely used "card" style computers bundle memory, cpu, and little else
254 onto a card that's maybe just thirty square centimeters. On such systems, 254 onto a card that's maybe just thirty square centimeters. On such systems,
255 your arch/.../mach-.../board-*.c file would primarily provide information 255 your arch/.../mach-.../board-*.c file would primarily provide information
256 about the devices on the mainboard into which such a card is plugged. That 256 about the devices on the mainboard into which such a card is plugged. That
257 certainly includes SPI devices hooked up through the card connectors! 257 certainly includes SPI devices hooked up through the card connectors!
258 258
259 259
260 NON-STATIC CONFIGURATIONS 260 NON-STATIC CONFIGURATIONS
261 261
262 Developer boards often play by different rules than product boards, and one 262 Developer boards often play by different rules than product boards, and one
263 example is the potential need to hotplug SPI devices and/or controllers. 263 example is the potential need to hotplug SPI devices and/or controllers.
264 264
265 For those cases you might need to use use spi_busnum_to_master() to look 265 For those cases you might need to use spi_busnum_to_master() to look
266 up the spi bus master, and will likely need spi_new_device() to provide the 266 up the spi bus master, and will likely need spi_new_device() to provide the
267 board info based on the board that was hotplugged. Of course, you'd later 267 board info based on the board that was hotplugged. Of course, you'd later
268 call at least spi_unregister_device() when that board is removed. 268 call at least spi_unregister_device() when that board is removed.
269 269
270 When Linux includes support for MMC/SD/SDIO/DataFlash cards through SPI, those 270 When Linux includes support for MMC/SD/SDIO/DataFlash cards through SPI, those
271 configurations will also be dynamic. Fortunately, those devices all support 271 configurations will also be dynamic. Fortunately, those devices all support
272 basic device identification probes, so that support should hotplug normally. 272 basic device identification probes, so that support should hotplug normally.
273 273
274 274
275 How do I write an "SPI Protocol Driver"? 275 How do I write an "SPI Protocol Driver"?
276 ---------------------------------------- 276 ----------------------------------------
277 All SPI drivers are currently kernel drivers. A userspace driver API 277 All SPI drivers are currently kernel drivers. A userspace driver API
278 would just be another kernel driver, probably offering some lowlevel 278 would just be another kernel driver, probably offering some lowlevel
279 access through aio_read(), aio_write(), and ioctl() calls and using the 279 access through aio_read(), aio_write(), and ioctl() calls and using the
280 standard userspace sysfs mechanisms to bind to a given SPI device. 280 standard userspace sysfs mechanisms to bind to a given SPI device.
281 281
282 SPI protocol drivers somewhat resemble platform device drivers: 282 SPI protocol drivers somewhat resemble platform device drivers:
283 283
284 static struct spi_driver CHIP_driver = { 284 static struct spi_driver CHIP_driver = {
285 .driver = { 285 .driver = {
286 .name = "CHIP", 286 .name = "CHIP",
287 .bus = &spi_bus_type, 287 .bus = &spi_bus_type,
288 .owner = THIS_MODULE, 288 .owner = THIS_MODULE,
289 }, 289 },
290 290
291 .probe = CHIP_probe, 291 .probe = CHIP_probe,
292 .remove = __devexit_p(CHIP_remove), 292 .remove = __devexit_p(CHIP_remove),
293 .suspend = CHIP_suspend, 293 .suspend = CHIP_suspend,
294 .resume = CHIP_resume, 294 .resume = CHIP_resume,
295 }; 295 };
296 296
297 The driver core will automatically attempt to bind this driver to any SPI 297 The driver core will automatically attempt to bind this driver to any SPI
298 device whose board_info gave a modalias of "CHIP". Your probe() code 298 device whose board_info gave a modalias of "CHIP". Your probe() code
299 might look like this unless you're creating a class_device: 299 might look like this unless you're creating a class_device:
300 300
301 static int __devinit CHIP_probe(struct spi_device *spi) 301 static int __devinit CHIP_probe(struct spi_device *spi)
302 { 302 {
303 struct CHIP *chip; 303 struct CHIP *chip;
304 struct CHIP_platform_data *pdata; 304 struct CHIP_platform_data *pdata;
305 305
306 /* assuming the driver requires board-specific data: */ 306 /* assuming the driver requires board-specific data: */
307 pdata = spi->dev.platform_data; 307 pdata = spi->dev.platform_data;
308 if (!pdata) 308 if (!pdata)
309 return -ENODEV; 309 return -ENODEV;
310 310
311 /* get memory for driver's per-chip state */ 311 /* get memory for driver's per-chip state */
312 chip = kzalloc(sizeof *chip, GFP_KERNEL); 312 chip = kzalloc(sizeof *chip, GFP_KERNEL);
313 if (!chip) 313 if (!chip)
314 return -ENOMEM; 314 return -ENOMEM;
315 dev_set_drvdata(&spi->dev, chip); 315 dev_set_drvdata(&spi->dev, chip);
316 316
317 ... etc 317 ... etc
318 return 0; 318 return 0;
319 } 319 }
320 320
321 As soon as it enters probe(), the driver may issue I/O requests to 321 As soon as it enters probe(), the driver may issue I/O requests to
322 the SPI device using "struct spi_message". When remove() returns, 322 the SPI device using "struct spi_message". When remove() returns,
323 the driver guarantees that it won't submit any more such messages. 323 the driver guarantees that it won't submit any more such messages.
324 324
325 - An spi_message is a sequence of of protocol operations, executed 325 - An spi_message is a sequence of protocol operations, executed
326 as one atomic sequence. SPI driver controls include: 326 as one atomic sequence. SPI driver controls include:
327 327
328 + when bidirectional reads and writes start ... by how its 328 + when bidirectional reads and writes start ... by how its
329 sequence of spi_transfer requests is arranged; 329 sequence of spi_transfer requests is arranged;
330 330
331 + optionally defining short delays after transfers ... using 331 + optionally defining short delays after transfers ... using
332 the spi_transfer.delay_usecs setting; 332 the spi_transfer.delay_usecs setting;
333 333
334 + whether the chipselect becomes inactive after a transfer and 334 + whether the chipselect becomes inactive after a transfer and
335 any delay ... by using the spi_transfer.cs_change flag; 335 any delay ... by using the spi_transfer.cs_change flag;
336 336
337 + hinting whether the next message is likely to go to this same 337 + hinting whether the next message is likely to go to this same
338 device ... using the spi_transfer.cs_change flag on the last 338 device ... using the spi_transfer.cs_change flag on the last
339 transfer in that atomic group, and potentially saving costs 339 transfer in that atomic group, and potentially saving costs
340 for chip deselect and select operations. 340 for chip deselect and select operations.
341 341
342 - Follow standard kernel rules, and provide DMA-safe buffers in 342 - Follow standard kernel rules, and provide DMA-safe buffers in
343 your messages. That way controller drivers using DMA aren't forced 343 your messages. That way controller drivers using DMA aren't forced
344 to make extra copies unless the hardware requires it (e.g. working 344 to make extra copies unless the hardware requires it (e.g. working
345 around hardware errata that force the use of bounce buffering). 345 around hardware errata that force the use of bounce buffering).
346 346
347 If standard dma_map_single() handling of these buffers is inappropriate, 347 If standard dma_map_single() handling of these buffers is inappropriate,
348 you can use spi_message.is_dma_mapped to tell the controller driver 348 you can use spi_message.is_dma_mapped to tell the controller driver
349 that you've already provided the relevant DMA addresses. 349 that you've already provided the relevant DMA addresses.
350 350
351 - The basic I/O primitive is spi_async(). Async requests may be 351 - The basic I/O primitive is spi_async(). Async requests may be
352 issued in any context (irq handler, task, etc) and completion 352 issued in any context (irq handler, task, etc) and completion
353 is reported using a callback provided with the message. 353 is reported using a callback provided with the message.
354 After any detected error, the chip is deselected and processing 354 After any detected error, the chip is deselected and processing
355 of that spi_message is aborted. 355 of that spi_message is aborted.
356 356
357 - There are also synchronous wrappers like spi_sync(), and wrappers 357 - There are also synchronous wrappers like spi_sync(), and wrappers
358 like spi_read(), spi_write(), and spi_write_then_read(). These 358 like spi_read(), spi_write(), and spi_write_then_read(). These
359 may be issued only in contexts that may sleep, and they're all 359 may be issued only in contexts that may sleep, and they're all
360 clean (and small, and "optional") layers over spi_async(). 360 clean (and small, and "optional") layers over spi_async().
361 361
362 - The spi_write_then_read() call, and convenience wrappers around 362 - The spi_write_then_read() call, and convenience wrappers around
363 it, should only be used with small amounts of data where the 363 it, should only be used with small amounts of data where the
364 cost of an extra copy may be ignored. It's designed to support 364 cost of an extra copy may be ignored. It's designed to support
365 common RPC-style requests, such as writing an eight bit command 365 common RPC-style requests, such as writing an eight bit command
366 and reading a sixteen bit response -- spi_w8r16() being one of its 366 and reading a sixteen bit response -- spi_w8r16() being one of its
367 wrappers, doing exactly that. 367 wrappers, doing exactly that.
368 368
369 Some drivers may need to modify spi_device characteristics like the 369 Some drivers may need to modify spi_device characteristics like the
370 transfer mode, wordsize, or clock rate. This is done with spi_setup(), 370 transfer mode, wordsize, or clock rate. This is done with spi_setup(),
371 which would normally be called from probe() before the first I/O is 371 which would normally be called from probe() before the first I/O is
372 done to the device. 372 done to the device.
373 373
374 While "spi_device" would be the bottom boundary of the driver, the 374 While "spi_device" would be the bottom boundary of the driver, the
375 upper boundaries might include sysfs (especially for sensor readings), 375 upper boundaries might include sysfs (especially for sensor readings),
376 the input layer, ALSA, networking, MTD, the character device framework, 376 the input layer, ALSA, networking, MTD, the character device framework,
377 or other Linux subsystems. 377 or other Linux subsystems.
378 378
379 Note that there are two types of memory your driver must manage as part 379 Note that there are two types of memory your driver must manage as part
380 of interacting with SPI devices. 380 of interacting with SPI devices.
381 381
382 - I/O buffers use the usual Linux rules, and must be DMA-safe. 382 - I/O buffers use the usual Linux rules, and must be DMA-safe.
383 You'd normally allocate them from the heap or free page pool. 383 You'd normally allocate them from the heap or free page pool.
384 Don't use the stack, or anything that's declared "static". 384 Don't use the stack, or anything that's declared "static".
385 385
386 - The spi_message and spi_transfer metadata used to glue those 386 - The spi_message and spi_transfer metadata used to glue those
387 I/O buffers into a group of protocol transactions. These can 387 I/O buffers into a group of protocol transactions. These can
388 be allocated anywhere it's convenient, including as part of 388 be allocated anywhere it's convenient, including as part of
389 other allocate-once driver data structures. Zero-init these. 389 other allocate-once driver data structures. Zero-init these.
390 390
391 If you like, spi_message_alloc() and spi_message_free() convenience 391 If you like, spi_message_alloc() and spi_message_free() convenience
392 routines are available to allocate and zero-initialize an spi_message 392 routines are available to allocate and zero-initialize an spi_message
393 with several transfers. 393 with several transfers.
394 394
395 395
396 How do I write an "SPI Master Controller Driver"? 396 How do I write an "SPI Master Controller Driver"?
397 ------------------------------------------------- 397 -------------------------------------------------
398 An SPI controller will probably be registered on the platform_bus; write 398 An SPI controller will probably be registered on the platform_bus; write
399 a driver to bind to the device, whichever bus is involved. 399 a driver to bind to the device, whichever bus is involved.
400 400
401 The main task of this type of driver is to provide an "spi_master". 401 The main task of this type of driver is to provide an "spi_master".
402 Use spi_alloc_master() to allocate the master, and class_get_devdata() 402 Use spi_alloc_master() to allocate the master, and class_get_devdata()
403 to get the driver-private data allocated for that device. 403 to get the driver-private data allocated for that device.
404 404
405 struct spi_master *master; 405 struct spi_master *master;
406 struct CONTROLLER *c; 406 struct CONTROLLER *c;
407 407
408 master = spi_alloc_master(dev, sizeof *c); 408 master = spi_alloc_master(dev, sizeof *c);
409 if (!master) 409 if (!master)
410 return -ENODEV; 410 return -ENODEV;
411 411
412 c = class_get_devdata(&master->cdev); 412 c = class_get_devdata(&master->cdev);
413 413
414 The driver will initialize the fields of that spi_master, including the 414 The driver will initialize the fields of that spi_master, including the
415 bus number (maybe the same as the platform device ID) and three methods 415 bus number (maybe the same as the platform device ID) and three methods
416 used to interact with the SPI core and SPI protocol drivers. It will 416 used to interact with the SPI core and SPI protocol drivers. It will
417 also initialize its own internal state. (See below about bus numbering 417 also initialize its own internal state. (See below about bus numbering
418 and those methods.) 418 and those methods.)
419 419
420 After you initialize the spi_master, then use spi_register_master() to 420 After you initialize the spi_master, then use spi_register_master() to
421 publish it to the rest of the system. At that time, device nodes for 421 publish it to the rest of the system. At that time, device nodes for
422 the controller and any predeclared spi devices will be made available, 422 the controller and any predeclared spi devices will be made available,
423 and the driver model core will take care of binding them to drivers. 423 and the driver model core will take care of binding them to drivers.
424 424
425 If you need to remove your SPI controller driver, spi_unregister_master() 425 If you need to remove your SPI controller driver, spi_unregister_master()
426 will reverse the effect of spi_register_master(). 426 will reverse the effect of spi_register_master().
427 427
428 428
429 BUS NUMBERING 429 BUS NUMBERING
430 430
431 Bus numbering is important, since that's how Linux identifies a given 431 Bus numbering is important, since that's how Linux identifies a given
432 SPI bus (shared SCK, MOSI, MISO). Valid bus numbers start at zero. On 432 SPI bus (shared SCK, MOSI, MISO). Valid bus numbers start at zero. On
433 SOC systems, the bus numbers should match the numbers defined by the chip 433 SOC systems, the bus numbers should match the numbers defined by the chip
434 manufacturer. For example, hardware controller SPI2 would be bus number 2, 434 manufacturer. For example, hardware controller SPI2 would be bus number 2,
435 and spi_board_info for devices connected to it would use that number. 435 and spi_board_info for devices connected to it would use that number.
436 436
437 If you don't have such a hardware-assigned bus number, and for some reason 437 If you don't have such a hardware-assigned bus number, and for some reason
438 you can't just assign them, then provide a negative bus number. That will 438 you can't just assign them, then provide a negative bus number. That will
439 then be replaced by a dynamically assigned number. You'd then need to treat 439 then be replaced by a dynamically assigned number. You'd then need to treat
440 this as a non-static configuration (see above). 440 this as a non-static configuration (see above).
441 441
442 442
443 SPI MASTER METHODS 443 SPI MASTER METHODS
444 444
445 master->setup(struct spi_device *spi) 445 master->setup(struct spi_device *spi)
446 This sets up the device clock rate, SPI mode, and word sizes. 446 This sets up the device clock rate, SPI mode, and word sizes.
447 Drivers may change the defaults provided by board_info, and then 447 Drivers may change the defaults provided by board_info, and then
448 call spi_setup(spi) to invoke this routine. It may sleep. 448 call spi_setup(spi) to invoke this routine. It may sleep.
449 449
450 master->transfer(struct spi_device *spi, struct spi_message *message) 450 master->transfer(struct spi_device *spi, struct spi_message *message)
451 This must not sleep. Its responsibility is to arrange that the 451 This must not sleep. Its responsibility is to arrange that the
452 transfer happens and its complete() callback is issued; the two 452 transfer happens and its complete() callback is issued; the two
453 will normally happen later, after other transfers complete. 453 will normally happen later, after other transfers complete.
454 454
455 master->cleanup(struct spi_device *spi) 455 master->cleanup(struct spi_device *spi)
456 Your controller driver may use spi_device.controller_state to hold 456 Your controller driver may use spi_device.controller_state to hold
457 state it dynamically associates with that device. If you do that, 457 state it dynamically associates with that device. If you do that,
458 be sure to provide the cleanup() method to free that state. 458 be sure to provide the cleanup() method to free that state.
459 459
460 460
461 SPI MESSAGE QUEUE 461 SPI MESSAGE QUEUE
462 462
463 The bulk of the driver will be managing the I/O queue fed by transfer(). 463 The bulk of the driver will be managing the I/O queue fed by transfer().
464 464
465 That queue could be purely conceptual. For example, a driver used only 465 That queue could be purely conceptual. For example, a driver used only
466 for low-frequency sensor access might be fine using synchronous PIO. 466 for low-frequency sensor access might be fine using synchronous PIO.
467 467
468 But the queue will probably be very real, using message->queue, PIO, 468 But the queue will probably be very real, using message->queue, PIO,
469 often DMA (especially if the root filesystem is in SPI flash), and 469 often DMA (especially if the root filesystem is in SPI flash), and
470 execution contexts like IRQ handlers, tasklets, or workqueues (such 470 execution contexts like IRQ handlers, tasklets, or workqueues (such
471 as keventd). Your driver can be as fancy, or as simple, as you need. 471 as keventd). Your driver can be as fancy, or as simple, as you need.
472 Such a transfer() method would normally just add the message to a 472 Such a transfer() method would normally just add the message to a
473 queue, and then start some asynchronous transfer engine (unless it's 473 queue, and then start some asynchronous transfer engine (unless it's
474 already running). 474 already running).
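The enqueue-then-kick pattern described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: model_msg, model_ctlr, model_transfer() and model_engine_next() are hypothetical names that are not part of the SPI core, and a real controller driver would protect the queue with a spinlock and run the engine from an IRQ handler, tasklet, or workqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for spi_message and a controller's private state. */
struct model_msg {
	struct model_msg *next;
	int id;
};

struct model_ctlr {
	struct model_msg *head, *tail;	/* pending messages, FIFO order */
	int engine_running;		/* nonzero while the engine is draining */
	int engine_starts;		/* how many times the engine was kicked */
};

/* Models transfer(): only enqueue, then kick the engine if it is idle. */
static int model_transfer(struct model_ctlr *c, struct model_msg *m)
{
	assert(c && m);

	m->next = NULL;
	if (c->tail)
		c->tail->next = m;
	else
		c->head = m;
	c->tail = m;

	if (!c->engine_running) {
		c->engine_running = 1;
		c->engine_starts++;
	}
	return 0;
}

/* Models the asynchronous engine taking the next message off the queue. */
static struct model_msg *model_engine_next(struct model_ctlr *c)
{
	struct model_msg *m;

	assert(c);
	m = c->head;
	if (!m) {
		c->engine_running = 0;	/* queue drained, go idle */
		return NULL;
	}
	c->head = m->next;
	if (!c->head)
		c->tail = NULL;
	return m;
}
```

In this model, queuing a second message while the engine is already running does not start it again; the engine goes idle only after it finds the queue empty.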
475 475
476 476
477 THANKS TO 477 THANKS TO
478 --------- 478 ---------
479 Contributors to Linux-SPI discussions include (in alphabetical order, 479 Contributors to Linux-SPI discussions include (in alphabetical order,
480 by last name): 480 by last name):
481 481
482 David Brownell 482 David Brownell
483 Russell King 483 Russell King
484 Dmitry Pervushin 484 Dmitry Pervushin
485 Stephen Street 485 Stephen Street
486 Mark Underwood 486 Mark Underwood
487 Andrew Victor 487 Andrew Victor
488 Vitaly Wool 488 Vitaly Wool
489 489
490 490
Documentation/unshare.txt
1 1
2 unshare system call: 2 unshare system call:
3 -------------------- 3 --------------------
4 This document describes the new system call, unshare. The document 4 This document describes the new system call, unshare. The document
5 provides an overview of the feature, why it is needed, how it can 5 provides an overview of the feature, why it is needed, how it can
6 be used, its interface specification, design, implementation and 6 be used, its interface specification, design, implementation and
7 how it can be tested. 7 how it can be tested.
8 8
9 Change Log: 9 Change Log:
10 ----------- 10 -----------
11 version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 11 version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006
12 12
13 Contents: 13 Contents:
14 --------- 14 ---------
15 1) Overview 15 1) Overview
16 2) Benefits 16 2) Benefits
17 3) Cost 17 3) Cost
18 4) Requirements 18 4) Requirements
19 5) Functional Specification 19 5) Functional Specification
20 6) High Level Design 20 6) High Level Design
21 7) Low Level Design 21 7) Low Level Design
22 8) Test Specification 22 8) Test Specification
23 9) Future Work 23 9) Future Work
24 24
25 1) Overview 25 1) Overview
26 ----------- 26 -----------
27 Most legacy operating system kernels support an abstraction of threads 27 Most legacy operating system kernels support an abstraction of threads
28 as multiple execution contexts within a process. These kernels provide 28 as multiple execution contexts within a process. These kernels provide
29 special resources and mechanisms to maintain these "threads". The Linux 29 special resources and mechanisms to maintain these "threads". The Linux
30 kernel, in a clever and simple manner, does not make a distinction 30 kernel, in a clever and simple manner, does not make a distinction
31 between processes and "threads". The kernel allows processes to share 31 between processes and "threads". The kernel allows processes to share
32 resources and thus they can achieve legacy "threads" behavior without 32 resources and thus they can achieve legacy "threads" behavior without
33 requiring additional data structures and mechanisms in the kernel. The 33 requiring additional data structures and mechanisms in the kernel. The
34 power of implementing threads in this manner comes not only from 34 power of implementing threads in this manner comes not only from
35 its simplicity but also from allowing application programmers to work 35 its simplicity but also from allowing application programmers to work
36 outside the confinement of all-or-nothing shared resources of legacy 36 outside the confinement of all-or-nothing shared resources of legacy
37 threads. On Linux, at the time of thread creation using the clone system 37 threads. On Linux, at the time of thread creation using the clone system
38 call, applications can selectively choose which resources to share 38 call, applications can selectively choose which resources to share
39 between threads. 39 between threads.
40 40
41 unshare system call adds a primitive to the Linux thread model that 41 unshare system call adds a primitive to the Linux thread model that
42 allows threads to selectively 'unshare' any resources that were being 42 allows threads to selectively 'unshare' any resources that were being
43 shared at the time of their creation. unshare was conceptualized by 43 shared at the time of their creation. unshare was conceptualized by
44 Al Viro in August of 2000, on the Linux-Kernel mailing list, as part 44 Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
45 of the discussion on POSIX threads on Linux. unshare augments the 45 of the discussion on POSIX threads on Linux. unshare augments the
46 usefulness of Linux threads for applications that would like to control 46 usefulness of Linux threads for applications that would like to control
47 shared resources without creating a new process. unshare is a natural 47 shared resources without creating a new process. unshare is a natural
48 addition to the set of available primitives on Linux that implement 48 addition to the set of available primitives on Linux that implement
49 the concept of process/thread as a virtual machine. 49 the concept of process/thread as a virtual machine.
50 50
51 2) Benefits 51 2) Benefits
52 ----------- 52 -----------
53 unshare would be useful to large application frameworks such as PAM 53 unshare would be useful to large application frameworks such as PAM
54 where creating a new process to control sharing/unsharing of process 54 where creating a new process to control sharing/unsharing of process
55 resources is not possible. Since namespaces are shared by default 55 resources is not possible. Since namespaces are shared by default
56 when creating a new process using fork or clone, unshare can benefit 56 when creating a new process using fork or clone, unshare can benefit
57 even non-threaded applications if they have a need to disassociate 57 even non-threaded applications if they have a need to disassociate
58 from default shared namespace. The following lists two use-cases 58 from default shared namespace. The following lists two use-cases
59 where unshare can be used. 59 where unshare can be used.
60 60
61 2.1 Per-security context namespaces 61 2.1 Per-security context namespaces
62 ----------------------------------- 62 -----------------------------------
63 unshare can be used to implement polyinstantiated directories using 63 unshare can be used to implement polyinstantiated directories using
64 the kernel's per-process namespace mechanism. Polyinstantiated directories, 64 the kernel's per-process namespace mechanism. Polyinstantiated directories,
65 such as per-user and/or per-security context instance of /tmp, /var/tmp or 65 such as per-user and/or per-security context instance of /tmp, /var/tmp or
66 per-security context instance of a user's home directory, isolate user 66 per-security context instance of a user's home directory, isolate user
67 processes when working with these directories. Using unshare, a PAM 67 processes when working with these directories. Using unshare, a PAM
68 module can easily set up a private namespace for a user at login. 68 module can easily set up a private namespace for a user at login.
69 Polyinstantiated directories are required for Common Criteria certification 69 Polyinstantiated directories are required for Common Criteria certification
70 with Labeled System Protection Profile; however, with the availability 70 with Labeled System Protection Profile; however, with the availability
71 of the shared-tree feature in the Linux kernel, even regular Linux systems 71 of the shared-tree feature in the Linux kernel, even regular Linux systems
72 can benefit from setting up private namespaces at login and 72 can benefit from setting up private namespaces at login and
73 polyinstantiating /tmp, /var/tmp and other directories deemed 73 polyinstantiating /tmp, /var/tmp and other directories deemed
74 appropriate by system administrators. 74 appropriate by system administrators.
75 75
76 2.2 unsharing of virtual memory and/or open files 76 2.2 unsharing of virtual memory and/or open files
77 ------------------------------------------------- 77 -------------------------------------------------
78 Consider a client/server application where the server is processing 78 Consider a client/server application where the server is processing
79 client requests by creating processes that share resources such as 79 client requests by creating processes that share resources such as
80 virtual memory and open files. Without unshare, the server has to 80 virtual memory and open files. Without unshare, the server has to
81 decide what needs to be shared at the time of creating the process 81 decide what needs to be shared at the time of creating the process
82 which services the request. unshare gives the server the ability to 82 which services the request. unshare gives the server the ability to
83 disassociate parts of the context during the servicing of the 83 disassociate parts of the context during the servicing of the
84 request. For large and complex middleware application frameworks, this 84 request. For large and complex middleware application frameworks, this
85 ability to unshare after the process was created can be very 85 ability to unshare after the process was created can be very
86 useful. 86 useful.
87 87
88 3) Cost 88 3) Cost
89 ------- 89 -------
90 In order to not duplicate code and to handle the fact that unshare 90 In order to not duplicate code and to handle the fact that unshare
91 works on an active task (as opposed to clone/fork working on a newly 91 works on an active task (as opposed to clone/fork working on a newly
92 allocated inactive task) unshare had to make minor reorganizational 92 allocated inactive task) unshare had to make minor reorganizational
93 changes to copy_* functions utilized by clone/fork system call. 93 changes to copy_* functions utilized by clone/fork system call.
94 There is a cost associated with altering existing, well tested and 94 There is a cost associated with altering existing, well tested and
95 stable code to implement a new feature that may not get exercised 95 stable code to implement a new feature that may not get exercised
96 extensively in the beginning. However, with proper design and code 96 extensively in the beginning. However, with proper design and code
97 review of the changes and creation of an unshare test for the LTP 97 review of the changes and creation of an unshare test for the LTP
98 the benefits of this new feature can exceed its cost. 98 the benefits of this new feature can exceed its cost.
99 99
100 4) Requirements 100 4) Requirements
101 --------------- 101 ---------------
102 unshare reverses sharing that was done using the clone(2) system call, 102 unshare reverses sharing that was done using the clone(2) system call,
103 so unshare should have a similar interface to clone(2). That is, 103 so unshare should have a similar interface to clone(2). That is,
104 since flags in clone(int flags, void *stack) specify what should 104 since flags in clone(int flags, void *stack) specify what should
105 be shared, similar flags in unshare(int flags) should specify 105 be shared, similar flags in unshare(int flags) should specify
106 what should be unshared. Unfortunately, this may appear to invert 106 what should be unshared. Unfortunately, this may appear to invert
107 the meaning of the flags from the way they are used in clone(2). 107 the meaning of the flags from the way they are used in clone(2).
108 However, there was no easy solution that was less confusing and that 108 However, there was no easy solution that was less confusing and that
109 allowed incremental context unsharing in future without an ABI change. 109 allowed incremental context unsharing in future without an ABI change.
110 110
111 unshare interface should accommodate possible future addition of 111 unshare interface should accommodate possible future addition of
112 new context flags without requiring a rebuild of old applications. 112 new context flags without requiring a rebuild of old applications.
113 If and when new context flags are added, unshare design should allow 113 If and when new context flags are added, unshare design should allow
114 incremental unsharing of those resources on an as needed basis. 114 incremental unsharing of those resources on an as needed basis.
115 115
116 5) Functional Specification 116 5) Functional Specification
117 --------------------------- 117 ---------------------------
118 NAME 118 NAME
119 unshare - disassociate parts of the process execution context 119 unshare - disassociate parts of the process execution context
120 120
121 SYNOPSIS 121 SYNOPSIS
122 #include <sched.h> 122 #include <sched.h>
123 123
124 int unshare(int flags); 124 int unshare(int flags);
125 125
126 DESCRIPTION 126 DESCRIPTION
127 unshare allows a process to disassociate parts of its execution 127 unshare allows a process to disassociate parts of its execution
128 context that are currently being shared with other processes. Part 128 context that are currently being shared with other processes. Part
129 of execution context, such as the namespace, is shared by default 129 of execution context, such as the namespace, is shared by default
130 when a new process is created using fork(2), while other parts, 130 when a new process is created using fork(2), while other parts,
131 such as the virtual memory, open file descriptors, etc, may be 131 such as the virtual memory, open file descriptors, etc, may be
132 shared by explicit request to share them when creating a process 132 shared by explicit request to share them when creating a process
133 using clone(2). 133 using clone(2).
134 134
135 The main use of unshare is to allow a process to control its 135 The main use of unshare is to allow a process to control its
136 shared execution context without creating a new process. 136 shared execution context without creating a new process.
137 137
138 The flags argument specifies one of the following constants, or 138 The flags argument specifies one of the following constants, or
139 the bitwise OR of several of them. 139 the bitwise OR of several of them.
140 140
141 CLONE_FS 141 CLONE_FS
142 If CLONE_FS is set, file system information of the caller 142 If CLONE_FS is set, file system information of the caller
143 is disassociated from the shared file system information. 143 is disassociated from the shared file system information.
144 144
145 CLONE_FILES 145 CLONE_FILES
146 If CLONE_FILES is set, the file descriptor table of the 146 If CLONE_FILES is set, the file descriptor table of the
147 caller is disassociated from the shared file descriptor 147 caller is disassociated from the shared file descriptor
148 table. 148 table.
149 149
150 CLONE_NEWNS 150 CLONE_NEWNS
151 If CLONE_NEWNS is set, the namespace of the caller is 151 If CLONE_NEWNS is set, the namespace of the caller is
152 disassociated from the shared namespace. 152 disassociated from the shared namespace.
153 153
154 CLONE_VM 154 CLONE_VM
155 If CLONE_VM is set, the virtual memory of the caller is 155 If CLONE_VM is set, the virtual memory of the caller is
156 disassociated from the shared virtual memory. 156 disassociated from the shared virtual memory.
157 157
158 RETURN VALUE 158 RETURN VALUE
159 On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error. 159 On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error.
160 160
161 ERRORS 161 ERRORS
162 EPERM CLONE_NEWNS was specified by a non-root process (process 162 EPERM CLONE_NEWNS was specified by a non-root process (process
163 without CAP_SYS_ADMIN). 163 without CAP_SYS_ADMIN).
164 164
165 ENOMEM Cannot allocate sufficient memory to copy parts of caller's 165 ENOMEM Cannot allocate sufficient memory to copy parts of caller's
166 context that need to be unshared. 166 context that need to be unshared.
167 167
168 EINVAL Invalid flag was specified as an argument. 168 EINVAL Invalid flag was specified as an argument.
169 169
170 CONFORMING TO 170 CONFORMING TO
171 The unshare() call is Linux-specific and should not be used 171 The unshare() call is Linux-specific and should not be used
172 in programs intended to be portable. 172 in programs intended to be portable.
173 173
174 SEE ALSO 174 SEE ALSO
175 clone(2), fork(2) 175 clone(2), fork(2)
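A minimal userspace sketch of the interface specified above. It assumes a C library that declares unshare() and the CLONE_* constants in <sched.h> when _GNU_SOURCE is defined; the helper name unshare_fd_table() is hypothetical. CLONE_FILES is used because, of the flags listed above, it is one that needs no special privilege according to the ERRORS section.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/*
 * Disassociate the caller's file descriptor table from any table it
 * currently shares.  Returns 0 on success, or the errno value that
 * unshare() reported on failure.
 */
static int unshare_fd_table(void)
{
	if (unshare(CLONE_FILES) == -1)
		return errno;
	return 0;
}
```

A caller that was created with clone(CLONE_FILES) could invoke this before opening descriptors it does not want its siblings to see. Sandboxed environments may refuse the call, so a nonzero return should be handled rather than treated as impossible.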
176 176
177 6) High Level Design 177 6) High Level Design
178 -------------------- 178 --------------------
179 Depending on the flags argument, the unshare system call allocates 179 Depending on the flags argument, the unshare system call allocates
180 appropriate process context structures, populates them with values from 180 appropriate process context structures, populates them with values from
181 the current shared version, associates newly duplicated structures 181 the current shared version, associates newly duplicated structures
182 with the current task structure and releases corresponding shared 182 with the current task structure and releases corresponding shared
183 versions. Helper functions of clone (copy_*) could not be used 183 versions. Helper functions of clone (copy_*) could not be used
184 directly by unshare because of the following two reasons. 184 directly by unshare because of the following two reasons.
185 1) clone operates on a newly allocated not-yet-active task 185 1) clone operates on a newly allocated not-yet-active task
186 structure, whereas unshare operates on the current active 186 structure, whereas unshare operates on the current active
187 task. Therefore unshare has to take appropriate task_lock() 187 task. Therefore unshare has to take appropriate task_lock()
188 before associating newly duplicated context structures. 188 before associating newly duplicated context structures.
189 2) unshare has to allocate and duplicate all context structures 189 2) unshare has to allocate and duplicate all context structures
190 that are being unshared, before associating them with the 190 that are being unshared, before associating them with the
191 current task and releasing older shared structures. Failure 191 current task and releasing older shared structures. Failure
192 to do so will create race conditions and/or oops when trying 192 to do so will create race conditions and/or oops when trying
193 to back out due to an error. Consider the case of unsharing 193 to back out due to an error. Consider the case of unsharing
194 both virtual memory and namespace. After successfully unsharing 194 both virtual memory and namespace. After successfully unsharing
195 vm, if the system call encounters an error while allocating 195 vm, if the system call encounters an error while allocating
196 new namespace structure, the error return code will have to 196 new namespace structure, the error return code will have to
197 reverse the unsharing of vm. As part of the reversal the 197 reverse the unsharing of vm. As part of the reversal the
198 system call will have to go back to older, shared, vm 198 system call will have to go back to older, shared, vm
199 structure, which may not exist anymore. 199 structure, which may not exist anymore.
200 200
201 Therefore code from copy_* functions that allocated and duplicated 201 Therefore code from copy_* functions that allocated and duplicated
202 current context structure was moved into new dup_* functions. Now, 202 current context structure was moved into new dup_* functions. Now,
203 copy_* functions call dup_* functions to allocate and duplicate 203 copy_* functions call dup_* functions to allocate and duplicate
204 appropriate context structures and then associate them with the 204 appropriate context structures and then associate them with the
205 task structure that is being constructed. unshare system call on 205 task structure that is being constructed. unshare system call on
206 the other hand performs the following: 206 the other hand performs the following:
207 1) Check flags to force missing, but implied, flags 207 1) Check flags to force missing, but implied, flags
208 2) For each context structure, call the corresponding unshare 208 2) For each context structure, call the corresponding unshare
209 helper function to allocate and duplicate a new context 209 helper function to allocate and duplicate a new context
210 structure, if the appropriate bit is set in the flags argument. 210 structure, if the appropriate bit is set in the flags argument.
211 3) If there is no error in allocation and duplication and there 211 3) If there is no error in allocation and duplication and there
212 are new context structures then lock the current task structure, 212 are new context structures then lock the current task structure,
213 associate new context structures with the current task structure, 213 associate new context structures with the current task structure,
214 and release the lock on the current task structure. 214 and release the lock on the current task structure.
215 4) Appropriately release older, shared, context structures. 215 4) Appropriately release older, shared, context structures.
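The duplicate-everything-first ordering in steps 2 to 4 can be illustrated with a small userspace model. This is a sketch, not the kernel code: model_task, model_fs, model_files, the MODEL_* flags and the model_fail_files test hook are all hypothetical, free() stands in for dropping a reference on a shared structure, and the commit step is where the kernel would hold task_lock().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for two per-task context structures. */
struct model_fs { char cwd[32]; };
struct model_files { int max_fds; };

struct model_task {
	struct model_fs *fs;
	struct model_files *files;
};

#define MODEL_FS	0x1
#define MODEL_FILES	0x2

static int model_fail_files;	/* test hook: make the files duplication fail */

static struct model_fs *model_dup_fs(const struct model_fs *old)
{
	struct model_fs *copy = malloc(sizeof(*copy));

	if (copy)
		memcpy(copy, old, sizeof(*copy));
	return copy;
}

static struct model_files *model_dup_files(const struct model_files *old)
{
	struct model_files *copy;

	if (model_fail_files)
		return NULL;
	copy = malloc(sizeof(*copy));
	if (copy)
		memcpy(copy, old, sizeof(*copy));
	return copy;
}

static int model_unshare(struct model_task *t, int flags)
{
	struct model_fs *new_fs = NULL, *old_fs = NULL;
	struct model_files *new_files = NULL, *old_files = NULL;

	assert(t && t->fs && t->files);

	/* Step 2: duplicate everything requested; the task is not touched yet. */
	if (flags & MODEL_FS) {
		new_fs = model_dup_fs(t->fs);
		if (!new_fs)
			return -ENOMEM;
	}
	if (flags & MODEL_FILES) {
		new_files = model_dup_files(t->files);
		if (!new_files) {
			free(new_fs);	/* back out; the task is unchanged */
			return -ENOMEM;
		}
	}

	/* Step 3: commit (the kernel would hold task_lock() here). */
	if (new_fs) {
		old_fs = t->fs;
		t->fs = new_fs;
	}
	if (new_files) {
		old_files = t->files;
		t->files = new_files;
	}

	/* Step 4: release the older structures (free(NULL) is a no-op). */
	free(old_fs);
	free(old_files);
	return 0;
}
```

Because nothing is committed until every duplication has succeeded, the error path only has to free the new copies and never needs to restore an older structure.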
216 216
217 7) Low Level Design 217 7) Low Level Design
218 ------------------- 218 -------------------
219 Implementation of unshare can be grouped in the following 4 different 219 Implementation of unshare can be grouped in the following 4 different
220 items: 220 items:
221 a) Reorganization of existing copy_* functions 221 a) Reorganization of existing copy_* functions
222 b) unshare system call service function 222 b) unshare system call service function
223 c) unshare helper functions for each different process context 223 c) unshare helper functions for each different process context
224 d) Registration of system call number for different architectures 224 d) Registration of system call number for different architectures
225 225
226 7.1) Reorganization of copy_* functions 226 7.1) Reorganization of copy_* functions
227 Each copy function such as copy_mm, copy_namespace, copy_files, 227 Each copy function such as copy_mm, copy_namespace, copy_files,
228 etc, had roughly two components. The first component allocated 228 etc, had roughly two components. The first component allocated
229 and duplicated the appropriate structure and the second component 229 and duplicated the appropriate structure and the second component
230 linked it to the task structure passed in as an argument to the copy 230 linked it to the task structure passed in as an argument to the copy
231 function. The first component was split into its own function. 231 function. The first component was split into its own function.
232 These dup_* functions allocated and duplicated the appropriate 232 These dup_* functions allocated and duplicated the appropriate
233 context structure. The reorganized copy_* functions invoked 233 context structure. The reorganized copy_* functions invoked
234 their corresponding dup_* functions and then linked the newly 234 their corresponding dup_* functions and then linked the newly
235 duplicated structures to the task structure with which the 235 duplicated structures to the task structure with which the
236 copy function was called. 236 copy function was called.
237 237
238 7.2) unshare system call service function 238 7.2) unshare system call service function
239 * Check flags 239 * Check flags
240 Force implied flags. If CLONE_THREAD is set force CLONE_VM. 240 Force implied flags. If CLONE_THREAD is set force CLONE_VM.
241 If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is 241 If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
242 set and signals are also being shared, force CLONE_THREAD. If 242 set and signals are also being shared, force CLONE_THREAD. If
243 CLONE_NEWNS is set, force CLONE_FS. 243 CLONE_NEWNS is set, force CLONE_FS.
244 * For each context flag, invoke the corresponding unshare_* 244 * For each context flag, invoke the corresponding unshare_*
245 helper routine with flags passed into the system call and a 245 helper routine with flags passed into the system call and a
246 reference to a pointer to the new unshared structure. 246 reference to a pointer to the new unshared structure.
247 * If any new structures are created by unshare_* helper 247 * If any new structures are created by unshare_* helper
248 functions, take the task_lock() on the current task, 248 functions, take the task_lock() on the current task,
249 modify appropriate context pointers, and release the 249 modify appropriate context pointers, and release the
250 task lock. 250 task lock.
251 * For all newly unshared structures, release the corresponding 251 * For all newly unshared structures, release the corresponding
252 older, shared, structures. 252 older, shared, structures.
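The "Check flags" step can be written as a small pure function. This is a userspace sketch, not the kernel's implementation: model_force_implied_flags() and its signals_shared parameter are hypothetical, the latter standing in for the kernel's own test of whether signals are being shared. The text does not say whether the rules cascade; applying each rule to the accumulated flag set is an assumption of this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Return flags with the implied flags forced on, applying the rules in
 * the order they are listed above.  signals_shared is nonzero when the
 * caller's signals are shared with another task.
 */
static int model_force_implied_flags(int flags, int signals_shared)
{
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;
	return flags;
}
```

With these rules a request for CLONE_NEWNS alone also unshares the filesystem information, which is the case the second test in section 8 exercises.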
253 253
254 7.3) unshare_* helper functions 254 7.3) unshare_* helper functions
255 For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, 255 For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
256 and CLONE_THREAD, return -EINVAL since they are not implemented yet. 256 and CLONE_THREAD, return -EINVAL since they are not implemented yet.
257 For others, check the flag value to see if the unsharing is 257 For others, check the flag value to see if the unsharing is
258 required for that structure. If it is, invoke the corresponding 258 required for that structure. If it is, invoke the corresponding
259 dup_* function to allocate and duplicate the structure and return 259 dup_* function to allocate and duplicate the structure and return
260 a pointer to it. 260 a pointer to it.
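The shape of such a helper can be sketched in user-space C. Everything
named sk_ here is hypothetical: struct sk_fs stands in for a per-task
context structure and sk_dup_fs() for the matching dup_* function. Only
the control flow (return -EINVAL when unimplemented, otherwise duplicate
when the flag is set) follows the description above.

```c
#include <errno.h>
#include <stdlib.h>

#define SK_CLONE_FS      0x00000200UL
#define SK_CLONE_SIGHAND 0x00000800UL

/* Hypothetical stand-in for a shared per-task context structure. */
struct sk_fs {
    int users;   /* number of tasks sharing this structure */
    int umask;
};

/* Stand-in for a dup_* function: allocate and copy the structure. */
static struct sk_fs *sk_dup_fs(const struct sk_fs *old)
{
    struct sk_fs *copy = malloc(sizeof(*copy));

    if (copy) {
        *copy = *old;
        copy->users = 1;   /* the copy is private to the caller */
    }
    return copy;
}

/* Unsharing is not implemented for this context: reject the flag. */
static int sk_unshare_sighand(unsigned long flags)
{
    return (flags & SK_CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Implemented case: if the flag asks for unsharing, duplicate the
 * structure and hand the copy back through new_fsp; otherwise leave
 * *new_fsp NULL so the caller knows nothing was created.
 */
static int sk_unshare_fs(unsigned long flags, const struct sk_fs *cur,
                         struct sk_fs **new_fsp)
{
    *new_fsp = NULL;
    if (!(flags & SK_CLONE_FS))
        return 0;
    *new_fsp = sk_dup_fs(cur);
    return *new_fsp ? 0 : -ENOMEM;
}
```

Returning the new structure through a pointer argument lets the caller
collect all new structures first and switch the task's pointers over
later under a single lock, as section 7.2 describes.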
261 261
262 7.4) Appropriately modify architecture specific code to register the 262 7.4) Appropriately modify architecture specific code to register the
263 the new system call. 263 new system call.
264 264
265 8) Test Specification 265 8) Test Specification
266 --------------------- 266 ---------------------
267 The test for unshare should test the following: 267 The test for unshare should test the following:
268 1) Valid flags: Test to check that clone flags for signal and 268 1) Valid flags: Test to check that clone flags for signal and
269 signal handlers, for which unsharing is not implemented 269 signal handlers, for which unsharing is not implemented
270 yet, return -EINVAL. 270 yet, return -EINVAL.
271 2) Missing/implied flags: Test to make sure that unsharing 271 2) Missing/implied flags: Test to make sure that unsharing
272 namespace without specifying unsharing of filesystem correctly 272 namespace without specifying unsharing of filesystem correctly
273 unshares both namespace and filesystem information. 273 unshares both namespace and filesystem information.
274 3) For each of the four (namespace, filesystem, files and vm) 274 3) For each of the four (namespace, filesystem, files and vm)
275 supported unsharing, verify that the system call correctly 275 supported unsharing, verify that the system call correctly
276 unshares the appropriate structure. Verify that unsharing 276 unshares the appropriate structure. Verify that unsharing
277 them individually as well as in combination with each 277 them individually as well as in combination with each
278 other works as expected. 278 other works as expected.
279 4) Concurrent execution: Use shared memory segments and futex on 279 4) Concurrent execution: Use shared memory segments and futex on
280 an address in the shm segment to synchronize execution of 280 an address in the shm segment to synchronize execution of
281 about 10 threads. Have a couple of threads execute execve, 281 about 10 threads. Have a couple of threads execute execve,
282 a couple _exit and the rest unshare with different combinations 282 a couple _exit and the rest unshare with different combinations
283 of flags. Verify that unsharing is performed as expected and 283 of flags. Verify that unsharing is performed as expected and
284 that there are no oops or hangs. 284 that there are no oops or hangs.
285 285
286 9) Future Work 286 9) Future Work
287 -------------- 287 --------------
288 The current implementation of unshare does not allow unsharing of 288 The current implementation of unshare does not allow unsharing of
289 signals and signal handlers. Signals are complex to begin with and 289 signals and signal handlers. Signals are complex to begin with and
290 to unshare signals and/or signal handlers of a currently running 290 to unshare signals and/or signal handlers of a currently running
291 process is even more complex. If in the future there is a specific 291 process is even more complex. If in the future there is a specific
292 need to allow unsharing of signals and/or signal handlers, it can 292 need to allow unsharing of signals and/or signal handlers, it can
293 be incrementally added to unshare without affecting legacy 293 be incrementally added to unshare without affecting legacy
294 applications using unshare. 294 applications using unshare.
295 295
296 296
Documentation/usb/error-codes.txt
1 Revised: 2004-Oct-21 1 Revised: 2004-Oct-21
2 2
3 This is the documentation of (hopefully) all possible error codes (and 3 This is the documentation of (hopefully) all possible error codes (and
4 their interpretation) that can be returned from usbcore. 4 their interpretation) that can be returned from usbcore.
5 5
6 Some of them are returned by the Host Controller Drivers (HCDs), which 6 Some of them are returned by the Host Controller Drivers (HCDs), which
7 device drivers only see through usbcore. As a rule, all the HCDs should 7 device drivers only see through usbcore. As a rule, all the HCDs should
8 behave the same except for transfer speed dependent behaviors and the 8 behave the same except for transfer speed dependent behaviors and the
9 way certain faults are reported. 9 way certain faults are reported.
10 10
11 11
12 ************************************************************************** 12 **************************************************************************
13 * Error codes returned by usb_submit_urb * 13 * Error codes returned by usb_submit_urb *
14 ************************************************************************** 14 **************************************************************************
15 15
16 Non-USB-specific: 16 Non-USB-specific:
17 17
18 0 URB submission went fine 18 0 URB submission went fine
19 19
20 -ENOMEM no memory for allocation of internal structures 20 -ENOMEM no memory for allocation of internal structures
21 21
22 USB-specific: 22 USB-specific:
23 23
24 -ENODEV specified USB-device or bus doesn't exist 24 -ENODEV specified USB-device or bus doesn't exist
25 25
26 -ENOENT specified interface or endpoint does not exist or 26 -ENOENT specified interface or endpoint does not exist or
27 is not enabled 27 is not enabled
28 28
29 -ENXIO host controller driver does not support queuing of this type 29 -ENXIO host controller driver does not support queuing of this type
30 of urb. (treat as a host controller bug.) 30 of urb. (treat as a host controller bug.)
31 31
32 -EINVAL a) Invalid transfer type specified (or not supported) 32 -EINVAL a) Invalid transfer type specified (or not supported)
33 b) Invalid or unsupported periodic transfer interval 33 b) Invalid or unsupported periodic transfer interval
34 c) ISO: attempted to change transfer interval 34 c) ISO: attempted to change transfer interval
35 d) ISO: number_of_packets is < 0 35 d) ISO: number_of_packets is < 0
36 e) various other cases 36 e) various other cases
37 37
38 -EAGAIN a) specified ISO start frame too early 38 -EAGAIN a) specified ISO start frame too early
39 b) (using ISO-ASAP) too much scheduled for the future 39 b) (using ISO-ASAP) too much scheduled for the future
40 wait some time and try again. 40 wait some time and try again.
41 41
42 -EFBIG Host controller driver can't schedule that many ISO frames. 42 -EFBIG Host controller driver can't schedule that many ISO frames.
43 43
44 -EPIPE Specified endpoint is stalled. For non-control endpoints, 44 -EPIPE Specified endpoint is stalled. For non-control endpoints,
45 reset this status with usb_clear_halt(). 45 reset this status with usb_clear_halt().
46 46
47 -EMSGSIZE (a) endpoint maxpacket size is zero; it is not usable 47 -EMSGSIZE (a) endpoint maxpacket size is zero; it is not usable
48 in the current interface altsetting. 48 in the current interface altsetting.
49 (b) ISO packet is larger than the endpoint maxpacket. 49 (b) ISO packet is larger than the endpoint maxpacket.
50 (c) requested data transfer length is invalid: negative 50 (c) requested data transfer length is invalid: negative
51 or too large for the host controller. 51 or too large for the host controller.
52 52
53 -ENOSPC This request would overcommit the usb bandwidth reserved 53 -ENOSPC This request would overcommit the usb bandwidth reserved
54 for periodic transfers (interrupt, isochronous). 54 for periodic transfers (interrupt, isochronous).
55 55
56 -ESHUTDOWN The device or host controller has been disabled due to some 56 -ESHUTDOWN The device or host controller has been disabled due to some
57 problem that could not be worked around. 57 problem that could not be worked around.
58 58
59 -EPERM Submission failed because urb->reject was set. 59 -EPERM Submission failed because urb->reject was set.
60 60
61 -EHOSTUNREACH URB was rejected because the device is suspended. 61 -EHOSTUNREACH URB was rejected because the device is suspended.
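One way a caller might act on the submission codes listed above is to
group them by the reaction they call for. The sketch below is plain C
that only needs <errno.h>; the enum, its names and the grouping itself
are assumptions made for illustration and are not part of usbcore.

```c
#include <errno.h>

/* Hypothetical grouping of usb_submit_urb results by caller reaction. */
enum submit_action {
    SUBMIT_OK,                  /* 0: submission went fine */
    SUBMIT_RETRY_LATER,         /* -EAGAIN: wait some time and try again */
    SUBMIT_CLEAR_HALT,          /* -EPIPE: endpoint stalled */
    SUBMIT_DEVICE_UNAVAILABLE,  /* device gone, disabled or suspended */
    SUBMIT_REQUEST_REJECTED     /* anything else: inspect the request */
};

/* rc is the value returned by the submission call (0 or a negative errno). */
static enum submit_action classify_submit_result(int rc)
{
    switch (-rc) {
    case 0:
        return SUBMIT_OK;
    case EAGAIN:
        return SUBMIT_RETRY_LATER;
    case EPIPE:
        return SUBMIT_CLEAR_HALT;
    case ENODEV:
    case ESHUTDOWN:
    case EHOSTUNREACH:
        return SUBMIT_DEVICE_UNAVAILABLE;
    default:
        /* -ENOMEM, -EINVAL, -EMSGSIZE, -ENOSPC, -EPERM, ... */
        return SUBMIT_REQUEST_REJECTED;
    }
}
```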
62 62
63 63
64 ************************************************************************** 64 **************************************************************************
65 * Error codes returned in urb->status * 65 * Error codes returned in urb->status *
66 * or in iso_frame_desc[n].status (for ISO) * 66 * or in iso_frame_desc[n].status (for ISO) *
67 ************************************************************************** 67 **************************************************************************
68 68
69 USB device drivers may only test urb status values in completion handlers. 69 USB device drivers may only test urb status values in completion handlers.
70 This is because otherwise there would be a race between HCDs updating 70 This is because otherwise there would be a race between HCDs updating
71 these values on one CPU, and device drivers testing them on another CPU. 71 these values on one CPU, and device drivers testing them on another CPU.
72 72
73 A transfer's actual_length may be positive even when an error has been 73 A transfer's actual_length may be positive even when an error has been
74 reported. That's because transfers often involve several packets, so that 74 reported. That's because transfers often involve several packets, so that
75 one or more packets could finish before an error stops further endpoint I/O. 75 one or more packets could finish before an error stops further endpoint I/O.
76 76
77 77
78 0 Transfer completed successfully 78 0 Transfer completed successfully
79 79
80 -ENOENT URB was synchronously unlinked by usb_unlink_urb 80 -ENOENT URB was synchronously unlinked by usb_unlink_urb
81 81
82 -EINPROGRESS URB still pending, no results yet 82 -EINPROGRESS URB still pending, no results yet
83 (That is, if drivers see this it's a bug.) 83 (That is, if drivers see this it's a bug.)
84 84
85 -EPROTO (*, **) a) bitstuff error 85 -EPROTO (*, **) a) bitstuff error
86 b) no response packet received within the 86 b) no response packet received within the
87 prescribed bus turn-around time 87 prescribed bus turn-around time
88 c) unknown USB error 88 c) unknown USB error
89 89
90 -EILSEQ (*, **) a) CRC mismatch 90 -EILSEQ (*, **) a) CRC mismatch
91 b) no response packet received within the 91 b) no response packet received within the
92 prescribed bus turn-around time 92 prescribed bus turn-around time
93 c) unknown USB error 93 c) unknown USB error
94 94
95 Note that often the controller hardware does not 95 Note that often the controller hardware does not
96 distinguish among cases a), b), and c), so a 96 distinguish among cases a), b), and c), so a
97 driver cannot tell whether there was a protocol 97 driver cannot tell whether there was a protocol
98 error, a failure to respond (often caused by 98 error, a failure to respond (often caused by
99 device disconnect), or some other fault. 99 device disconnect), or some other fault.
100 100
101 -ETIME (**) No response packet received within the prescribed 101 -ETIME (**) No response packet received within the prescribed
102 bus turn-around time. This error may instead be 102 bus turn-around time. This error may instead be
103 reported as -EPROTO or -EILSEQ. 103 reported as -EPROTO or -EILSEQ.
104 104
105 -ETIMEDOUT Synchronous USB message functions use this code 105 -ETIMEDOUT Synchronous USB message functions use this code
106 to indicate timeout expired before the transfer 106 to indicate timeout expired before the transfer
107 completed, and no other error was reported by HC. 107 completed, and no other error was reported by HC.
108 108
109 -EPIPE (**) Endpoint stalled. For non-control endpoints, 109 -EPIPE (**) Endpoint stalled. For non-control endpoints,
110 reset this status with usb_clear_halt(). 110 reset this status with usb_clear_halt().
111 111
112 -ECOMM During an IN transfer, the host controller 112 -ECOMM During an IN transfer, the host controller
113 received data from an endpoint faster than it 113 received data from an endpoint faster than it
114 could be written to system memory 114 could be written to system memory
115 115
116 -ENOSR During an OUT transfer, the host controller 116 -ENOSR During an OUT transfer, the host controller
117 could not retrieve data from system memory fast 117 could not retrieve data from system memory fast
118 enough to keep up with the USB data rate 118 enough to keep up with the USB data rate
119 119
120 -EOVERFLOW (*) The amount of data returned by the endpoint was 120 -EOVERFLOW (*) The amount of data returned by the endpoint was
121 greater than either the max packet size of the 121 greater than either the max packet size of the
122 endpoint or the remaining buffer size. "Babble". 122 endpoint or the remaining buffer size. "Babble".
123 123
124 -EREMOTEIO The data read from the endpoint did not fill the 124 -EREMOTEIO The data read from the endpoint did not fill the
125 specified buffer, and URB_SHORT_NOT_OK was set in 125 specified buffer, and URB_SHORT_NOT_OK was set in
126 urb->transfer_flags. 126 urb->transfer_flags.
127 127
128 -ENODEV Device was removed. Often preceded by a burst of 128 -ENODEV Device was removed. Often preceded by a burst of
129 other errors, since the hub driver doesn't detect 129 other errors, since the hub driver doesn't detect
130 device removal events immediately. 130 device removal events immediately.
131 131
132 -EXDEV ISO transfer only partially completed 132 -EXDEV ISO transfer only partially completed
133 look at individual frame status for details 133 look at individual frame status for details
134 134
135 -EINVAL ISO madness, if this happens: Log off and go home 135 -EINVAL ISO madness, if this happens: Log off and go home
136 136
137 -ECONNRESET URB was asynchronously unlinked by usb_unlink_urb 137 -ECONNRESET URB was asynchronously unlinked by usb_unlink_urb
138 138
139 -ESHUTDOWN The device or host controller has been disabled due 139 -ESHUTDOWN The device or host controller has been disabled due
140 to some problem that could not be worked around, 140 to some problem that could not be worked around,
141 such as a physical disconnect. 141 such as a physical disconnect.
142 142
143 143
144 (*) Error codes like -EPROTO, -EILSEQ and -EOVERFLOW normally indicate 144 (*) Error codes like -EPROTO, -EILSEQ and -EOVERFLOW normally indicate
145 hardware problems such as bad devices (including firmware) or cables. 145 hardware problems such as bad devices (including firmware) or cables.
146 146
147 (**) This is also one of several codes that different kinds of host 147 (**) This is also one of several codes that different kinds of host
148 controller use to to indicate a transfer has failed because of device 148 controller use to indicate a transfer has failed because of device
149 disconnect. In the interval before the hub driver starts disconnect 149 disconnect. In the interval before the hub driver starts disconnect
150 processing, devices may receive such fault reports for every request. 150 processing, devices may receive such fault reports for every request.
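The two footnotes can be turned into simple predicates over a status
value. This is a sketch in plain C; the function names are hypothetical,
and the sets of codes are taken directly from the (*) and (**) marks in
the table above.

```c
#include <errno.h>
#include <stdbool.h>

/* Codes marked (*): normally indicate bad devices, firmware or cables. */
static bool status_suggests_hardware_fault(int status)
{
    switch (status) {
    case -EPROTO:
    case -EILSEQ:
    case -EOVERFLOW:
        return true;
    default:
        return false;
    }
}

/* Codes marked (**): may also be reported when the device disconnects. */
static bool status_may_mean_disconnect(int status)
{
    switch (status) {
    case -EPROTO:
    case -EILSEQ:
    case -ETIME:
    case -EPIPE:
        return true;
    default:
        return false;
    }
}
```

A completion handler that sees a (**) code cannot tell from the status
alone whether the device is still attached, so it may be reasonable to
limit retries rather than resubmit indefinitely.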
151 151
152 152
153 153
154 ************************************************************************** 154 **************************************************************************
155 * Error codes returned by usbcore-functions * 155 * Error codes returned by usbcore-functions *
156 * (expect also other submit and transfer status codes) * 156 * (expect also other submit and transfer status codes) *
157 ************************************************************************** 157 **************************************************************************
158 158
159 usb_register(): 159 usb_register():
160 -EINVAL error during registering new driver 160 -EINVAL error during registering new driver
161 161
162 usb_get_*/usb_set_*(): 162 usb_get_*/usb_set_*():
163 usb_control_msg(): 163 usb_control_msg():
164 usb_bulk_msg(): 164 usb_bulk_msg():
165 -ETIMEDOUT Timeout expired before the transfer completed. 165 -ETIMEDOUT Timeout expired before the transfer completed.
166 166
Documentation/usb/hiddev.txt
1 Care and feeding of your Human Interface Devices 1 Care and feeding of your Human Interface Devices
2 2
3 INTRODUCTION 3 INTRODUCTION
4 4
5 In addition to the normal input type HID devices, USB also uses the 5 In addition to the normal input type HID devices, USB also uses the
6 human interface device protocols for things that are not really human 6 human interface device protocols for things that are not really human
7 interfaces, but have similar sorts of communication needs. The two big 7 interfaces, but have similar sorts of communication needs. The two big
8 examples for this are power devices (especially uninterruptible power 8 examples for this are power devices (especially uninterruptible power
9 supplies) and monitor control on higher end monitors. 9 supplies) and monitor control on higher end monitors.
10 10
11 To support these disparate requirements, the Linux USB system provides 11 To support these disparate requirements, the Linux USB system provides
12 HID events to two separate interfaces: 12 HID events to two separate interfaces:
13 * the input subsystem, which converts HID events into normal input 13 * the input subsystem, which converts HID events into normal input
14 device interfaces (such as keyboard, mouse and joystick) and a 14 device interfaces (such as keyboard, mouse and joystick) and a
15 normalised event interface - see Documentation/input/input.txt 15 normalised event interface - see Documentation/input/input.txt
16 * the hiddev interface, which provides fairly raw HID events 16 * the hiddev interface, which provides fairly raw HID events
17 17
18 The data flow for a HID event produced by a device is something like 18 The data flow for a HID event produced by a device is something like
19 the following: 19 the following:
20 20
21 usb.c ---> hid-core.c ----> hid-input.c ----> [keyboard/mouse/joystick/event] 21 usb.c ---> hid-core.c ----> hid-input.c ----> [keyboard/mouse/joystick/event]
22 | 22 |
23 | 23 |
24 --> hiddev.c ----> POWER / MONITOR CONTROL 24 --> hiddev.c ----> POWER / MONITOR CONTROL
25 25
26 In addition, other subsystems (apart from USB) can potentially feed 26 In addition, other subsystems (apart from USB) can potentially feed
27 events into the input subsystem, but these have no effect on the hid 27 events into the input subsystem, but these have no effect on the hid
28 device interface. 28 device interface.
29 29
30 USING THE HID DEVICE INTERFACE 30 USING THE HID DEVICE INTERFACE
31 31
32 The hiddev interface is a char interface using the normal USB major, 32 The hiddev interface is a char interface using the normal USB major,
33 with the minor numbers starting at 96 and finishing at 111. Therefore, 33 with the minor numbers starting at 96 and finishing at 111. Therefore,
34 you need the following commands: 34 you need the following commands:
35 mknod /dev/usb/hiddev0 c 180 96 35 mknod /dev/usb/hiddev0 c 180 96
36 mknod /dev/usb/hiddev1 c 180 97 36 mknod /dev/usb/hiddev1 c 180 97
37 mknod /dev/usb/hiddev2 c 180 98 37 mknod /dev/usb/hiddev2 c 180 98
38 mknod /dev/usb/hiddev3 c 180 99 38 mknod /dev/usb/hiddev3 c 180 99
39 mknod /dev/usb/hiddev4 c 180 100 39 mknod /dev/usb/hiddev4 c 180 100
40 mknod /dev/usb/hiddev5 c 180 101 40 mknod /dev/usb/hiddev5 c 180 101
41 mknod /dev/usb/hiddev6 c 180 102 41 mknod /dev/usb/hiddev6 c 180 102
42 mknod /dev/usb/hiddev7 c 180 103 42 mknod /dev/usb/hiddev7 c 180 103
43 mknod /dev/usb/hiddev8 c 180 104 43 mknod /dev/usb/hiddev8 c 180 104
44 mknod /dev/usb/hiddev9 c 180 105 44 mknod /dev/usb/hiddev9 c 180 105
45 mknod /dev/usb/hiddev10 c 180 106 45 mknod /dev/usb/hiddev10 c 180 106
46 mknod /dev/usb/hiddev11 c 180 107 46 mknod /dev/usb/hiddev11 c 180 107
47 mknod /dev/usb/hiddev12 c 180 108 47 mknod /dev/usb/hiddev12 c 180 108
48 mknod /dev/usb/hiddev13 c 180 109 48 mknod /dev/usb/hiddev13 c 180 109
49 mknod /dev/usb/hiddev14 c 180 110 49 mknod /dev/usb/hiddev14 c 180 110
50 mknod /dev/usb/hiddev15 c 180 111 50 mknod /dev/usb/hiddev15 c 180 111
51 51
52 So you point your hiddev compliant user-space program at the correct 52 So you point your hiddev compliant user-space program at the correct
53 interface for your device, and it all just works. 53 interface for your device, and it all just works.
54 54
55 Assuming that you have a hiddev compliant user-space program, of 55 Assuming that you have a hiddev compliant user-space program, of
56 course. If you need to write one, read on. 56 course. If you need to write one, read on.
57 57
58 58
59 THE HIDDEV API 59 THE HIDDEV API
60 This description should be read in conjunction with the HID 60 This description should be read in conjunction with the HID
61 specification, freely available from http://www.usb.org, and 61 specification, freely available from http://www.usb.org, and
62 conveniently linked from http://www.linux-usb.org. 62 conveniently linked from http://www.linux-usb.org.
63 63
64 The hiddev API uses a read() interface, and a set of ioctl() calls. 64 The hiddev API uses a read() interface, and a set of ioctl() calls.
65 65
66 HID devices exchange data with the host computer using data 66 HID devices exchange data with the host computer using data
67 bundles called "reports". Each report is divided into "fields", 67 bundles called "reports". Each report is divided into "fields",
68 each of which can have one or more "usages". In the hid-core, 68 each of which can have one or more "usages". In the hid-core,
69 each one of these usages has a single signed 32 bit value. 69 each one of these usages has a single signed 32 bit value.
70 70
71 read(): 71 read():
72 This is the event interface. When the HID device's state changes, 72 This is the event interface. When the HID device's state changes,
73 it performs an interrupt transfer containing a report which contains 73 it performs an interrupt transfer containing a report which contains
74 the changed value. The hid-core.c module parses the report, and 74 the changed value. The hid-core.c module parses the report, and
75 returns to hiddev.c the individual usages that have changed within 75 returns to hiddev.c the individual usages that have changed within
76 the report. In its basic mode, the hiddev will make these individual 76 the report. In its basic mode, the hiddev will make these individual
77 usage changes available to the reader using a struct hiddev_event: 77 usage changes available to the reader using a struct hiddev_event:
78 78
79 struct hiddev_event { 79 struct hiddev_event {
80 unsigned hid; 80 unsigned hid;
81 signed int value; 81 signed int value;
82 }; 82 };
83 83
84 containing the HID usage identifier for the status that changed, and 84 containing the HID usage identifier for the status that changed, and
85 the value that it was changed to. Note that the structure is defined 85 the value that it was changed to. Note that the structure is defined
86 within <linux/hiddev.h>, along with some other useful #defines and 86 within <linux/hiddev.h>, along with some other useful #defines and
87 structures. The HID usage identifier is a composite of the HID usage 87 structures. The HID usage identifier is a composite of the HID usage
88 page shifted to the 16 high order bits ORed with the usage code. The 88 page shifted to the 16 high order bits ORed with the usage code. The
89 behavior of the read() function can be modified using the HIDIOCSFLAG 89 behavior of the read() function can be modified using the HIDIOCSFLAG
90 ioctl() described below. 90 ioctl() described below.
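The composite usage identifier described above can be built and taken
apart with a few shifts and masks. The helper names below are
hypothetical and are not provided by <linux/hiddev.h>; the page and code
values in the usage note are arbitrary examples.

```c
#include <stdint.h>

/* Usage page in the 16 high-order bits, ORed with the usage code. */
static uint32_t hid_usage_id(uint16_t page, uint16_t code)
{
    return ((uint32_t)page << 16) | code;
}

/* Recover the usage page from a composite identifier. */
static uint16_t hid_usage_page(uint32_t id)
{
    return (uint16_t)(id >> 16);
}

/* Recover the usage code from a composite identifier. */
static uint16_t hid_usage_code(uint32_t id)
{
    return (uint16_t)(id & 0xFFFF);
}
```

For example, hid_usage_id(0x0084, 0x0030) yields 0x00840030, and applying
hid_usage_page() and hid_usage_code() to the hid field of a hiddev_event
splits it back into 0x0084 and 0x0030.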
91 91
92 92
93 ioctl(): 93 ioctl():
94 This is the control interface. There are a number of controls: 94 This is the control interface. There are a number of controls:
95 95
96 HIDIOCGVERSION - int (read) 96 HIDIOCGVERSION - int (read)
97 Gets the version code out of the hiddev driver. 97 Gets the version code out of the hiddev driver.
98 98
99 HIDIOCAPPLICATION - (none) 99 HIDIOCAPPLICATION - (none)
100 This ioctl call returns the HID application usage associated with the 100 This ioctl call returns the HID application usage associated with the
101 hid device. The third argument to ioctl() specifies which application 101 hid device. The third argument to ioctl() specifies which application
102 index to get. This is useful when the device has more than one 102 index to get. This is useful when the device has more than one
103 application collection. If the index is invalid (greater than or equal to 103 application collection. If the index is invalid (greater than or equal to
104 the number of application collections this device has) the ioctl 104 the number of application collections this device has) the ioctl
105 returns -1. You can find out beforehand how many application 105 returns -1. You can find out beforehand how many application
106 collections the device has from the num_applications field from the 106 collections the device has from the num_applications field from the
107 hiddev_devinfo structure. 107 hiddev_devinfo structure.
108 108
109 HIDIOCGCOLLECTIONINFO - struct hiddev_collection_info (read/write) 109 HIDIOCGCOLLECTIONINFO - struct hiddev_collection_info (read/write)
110 This returns a superset of the information above, providing not only 110 This returns a superset of the information above, providing not only
111 application collections, but all the collections the device has. It 111 application collections, but all the collections the device has. It
112 also returns the level the collection lives in the hierarchy. 112 also returns the level the collection lives in the hierarchy.
113 The user passes in a hiddev_collection_info struct with the index 113 The user passes in a hiddev_collection_info struct with the index
114 field set to the index that should be returned. The ioctl fills in 114 field set to the index that should be returned. The ioctl fills in
115 the other fields. If the index is larger than the last collection 115 the other fields. If the index is larger than the last collection
116 index, the ioctl returns -1 and sets errno to EINVAL. 116 index, the ioctl returns -1 and sets errno to EINVAL.
117 117
118 HIDIOCGDEVINFO - struct hiddev_devinfo (read) 118 HIDIOCGDEVINFO - struct hiddev_devinfo (read)
119 Gets a hiddev_devinfo structure which describes the device. 119 Gets a hiddev_devinfo structure which describes the device.
120 120
121 HIDIOCGSTRING - struct struct hiddev_string_descriptor (read/write) 121 HIDIOCGSTRING - struct hiddev_string_descriptor (read/write)
122 Gets a string descriptor from the device. The caller must fill in the 122 Gets a string descriptor from the device. The caller must fill in the
123 "index" field to indicate which descriptor should be returned. 123 "index" field to indicate which descriptor should be returned.
124 124
125 HIDIOCINITREPORT - (none) 125 HIDIOCINITREPORT - (none)
126 Instructs the kernel to retrieve all input and feature report values 126 Instructs the kernel to retrieve all input and feature report values
127 from the device. At this point, all the usage structures will contain 127 from the device. At this point, all the usage structures will contain
128 current values for the device, and will maintain it as the device 128 current values for the device, and will maintain it as the device
129 changes. Note that the use of this ioctl is unnecessary in general, 129 changes. Note that the use of this ioctl is unnecessary in general,
130 since later kernels automatically initialize the reports from the 130 since later kernels automatically initialize the reports from the
131 device at attach time. 131 device at attach time.
132 132
133 HIDIOCGNAME - string (variable length) 133 HIDIOCGNAME - string (variable length)
134 Gets the device name 134 Gets the device name
135 135
136 HIDIOCGREPORT - struct hiddev_report_info (write) 136 HIDIOCGREPORT - struct hiddev_report_info (write)
137 Instructs the kernel to get a feature or input report from the device, 137 Instructs the kernel to get a feature or input report from the device,
138 in order to selectively update the usage structures (in contrast to 138 in order to selectively update the usage structures (in contrast to
139 INITREPORT). 139 INITREPORT).
140 140
141 HIDIOCSREPORT - struct hiddev_report_info (write) 141 HIDIOCSREPORT - struct hiddev_report_info (write)
142 Instructs the kernel to send a report to the device. This report can 142 Instructs the kernel to send a report to the device. This report can
143 be filled in by the user through HIDIOCSUSAGE calls (below) to fill in 143 be filled in by the user through HIDIOCSUSAGE calls (below) to fill in
144 individual usage values in the report before sending the report in full 144 individual usage values in the report before sending the report in full
145 to the device. 145 to the device.
146 146
147 HIDIOCGREPORTINFO - struct hiddev_report_info (read/write) 147 HIDIOCGREPORTINFO - struct hiddev_report_info (read/write)
148 Fills in a hiddev_report_info structure for the user. The report is 148 Fills in a hiddev_report_info structure for the user. The report is
149 looked up by type (input, output or feature) and id, so these fields 149 looked up by type (input, output or feature) and id, so these fields
150 must be filled in by the user. The ID can be absolute -- the actual 150 must be filled in by the user. The ID can be absolute -- the actual
151 report id as reported by the device -- or relative -- 151 report id as reported by the device -- or relative --
152 HID_REPORT_ID_FIRST for the first report, and (HID_REPORT_ID_NEXT | 152 HID_REPORT_ID_FIRST for the first report, and (HID_REPORT_ID_NEXT |
153 report_id) for the next report after report_id. Without a-priori 153 report_id) for the next report after report_id. Without a-priori
154 information about report ids, the right way to use this ioctl is to 154 information about report ids, the right way to use this ioctl is to
155 use the relative IDs above to enumerate the valid IDs. The ioctl 155 use the relative IDs above to enumerate the valid IDs. The ioctl
156 returns non-zero when there is no more next ID. The real report ID is 156 returns non-zero when there is no more next ID. The real report ID is
157 filled into the returned hiddev_report_info structure. 157 filled into the returned hiddev_report_info structure.
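The enumeration of report IDs with the relative IDs can be sketched as
below. This assumes a Linux system with <linux/hiddev.h> installed and a
file descriptor opened on a hiddev node; count_reports() is a
hypothetical helper name, and error handling is reduced to stopping at
the first non-zero ioctl() result.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/hiddev.h>

/*
 * Count the reports of one type (e.g. HID_REPORT_TYPE_INPUT) by
 * walking the relative report IDs. Returns the number of reports found.
 */
static int count_reports(int fd, unsigned int report_type)
{
    struct hiddev_report_info rinfo;
    int count = 0;

    memset(&rinfo, 0, sizeof(rinfo));
    rinfo.report_type = report_type;
    rinfo.report_id = HID_REPORT_ID_FIRST;

    /* The ioctl returns non-zero once there is no next report. */
    while (ioctl(fd, HIDIOCGREPORTINFO, &rinfo) == 0) {
        count++;
        /* report_id now holds the real ID; ask for the one after it. */
        rinfo.report_id |= HID_REPORT_ID_NEXT;
    }
    return count;
}
```

Inside the loop, rinfo.num_fields tells the caller how many fields the
current report has, which is the range to use for a following
HIDIOCGFIELDINFO call.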
158 158
159 HIDIOCGFIELDINFO - struct hiddev_field_info (read/write) 159 HIDIOCGFIELDINFO - struct hiddev_field_info (read/write)
160 Returns the field information associated with a report in a 160 Returns the field information associated with a report in a
161 hiddev_field_info structure. The user must fill in report_id and 161 hiddev_field_info structure. The user must fill in report_id and
162 report_type in this structure, as above. The field_index should also 162 report_type in this structure, as above. The field_index should also
163 be filled in, which should be a number from 0 to maxfield-1, as 163 be filled in, which should be a number from 0 to maxfield-1, as
164 returned from a previous HIDIOCGREPORTINFO call. 164 returned from a previous HIDIOCGREPORTINFO call.
165 165
166 HIDIOCGUCODE - struct hiddev_usage_ref (read/write) 166 HIDIOCGUCODE - struct hiddev_usage_ref (read/write)
167 Returns the usage_code in a hiddev_usage_ref structure, given that 167 Returns the usage_code in a hiddev_usage_ref structure, given that
168 its report type, report id, field index, and index within the 168 its report type, report id, field index, and index within the
169 field have already been filled into the structure. 169 field have already been filled into the structure.
170 170
171 HIDIOCGUSAGE - struct hiddev_usage_ref (read/write) 171 HIDIOCGUSAGE - struct hiddev_usage_ref (read/write)
172 Returns the value of a usage in a hiddev_usage_ref structure. The 172 Returns the value of a usage in a hiddev_usage_ref structure. The
173 usage to be retrieved can be specified as above, or the user can 173 usage to be retrieved can be specified as above, or the user can
174 choose to fill in the report_type field and specify the report_id as 174 choose to fill in the report_type field and specify the report_id as
175 HID_REPORT_ID_UNKNOWN. In this case, the hiddev_usage_ref will be 175 HID_REPORT_ID_UNKNOWN. In this case, the hiddev_usage_ref will be
176 filled in with the report and field information associated with this 176 filled in with the report and field information associated with this
177 usage if it is found. 177 usage if it is found.
178 178
179 HIDIOCSUSAGE - struct hiddev_usage_ref (write) 179 HIDIOCSUSAGE - struct hiddev_usage_ref (write)
180 Sets the value of a usage in an output report. The user fills in 180 Sets the value of a usage in an output report. The user fills in
181 the hiddev_usage_ref structure as above, but additionally fills in 181 the hiddev_usage_ref structure as above, but additionally fills in
182 the value field. 182 the value field.
183 183
184 HIDIOCGCOLLECTIONINDEX - struct hiddev_usage_ref (write) 184 HIDIOCGCOLLECTIONINDEX - struct hiddev_usage_ref (write)
185 Returns the collection index associated with this usage. This 185 Returns the collection index associated with this usage. This
186 indicates where in the collection hierarchy this usage sits. 186 indicates where in the collection hierarchy this usage sits.
187 187
188 HIDIOCGFLAG - int (read) 188 HIDIOCGFLAG - int (read)
189 HIDIOCSFLAG - int (write) 189 HIDIOCSFLAG - int (write)
190 These operations respectively inspect and replace the mode flags 190 These operations respectively inspect and replace the mode flags
191 that influence the read() call above. The flags are as follows: 191 that influence the read() call above. The flags are as follows:
192 192
193 HIDDEV_FLAG_UREF - read() calls will now return 193 HIDDEV_FLAG_UREF - read() calls will now return
194 struct hiddev_usage_ref instead of struct hiddev_event. 194 struct hiddev_usage_ref instead of struct hiddev_event.
195 This is a larger structure, but in situations where the 195 This is a larger structure, but in situations where the
196 device has more than one usage in its reports with the 196 device has more than one usage in its reports with the
197 same usage code, this mode serves to resolve such 197 same usage code, this mode serves to resolve such
198 ambiguity. 198 ambiguity.
199 199
200 HIDDEV_FLAG_REPORT - This flag can only be used in conjunction 200 HIDDEV_FLAG_REPORT - This flag can only be used in conjunction
201 with HIDDEV_FLAG_UREF. With this flag set, when the device 201 with HIDDEV_FLAG_UREF. With this flag set, when the device
202 sends a report, a struct hiddev_usage_ref will be returned 202 sends a report, a struct hiddev_usage_ref will be returned
203 to read() filled in with the report_type and report_id, but 203 to read() filled in with the report_type and report_id, but
204 with field_index set to FIELD_INDEX_NONE. This serves as 204 with field_index set to FIELD_INDEX_NONE. This serves as
205 additional notification when the device has sent a report. 205 additional notification when the device has sent a report.
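A reader of the hiddev node in HIDDEV_FLAG_UREF mode has to split the bytes returned by read() into hiddev_usage_ref records and recognise the report notifications described above. The following is a minimal Python sketch of that decoding; the field layout and the value of the FIELD_INDEX_NONE marker are assumptions taken from my reading of <linux/hiddev.h>, and decode_usage_refs is a hypothetical helper name.

```python
import struct

# struct hiddev_usage_ref: five __u32 fields followed by one __s32 value
# (assumed layout; verify against your kernel headers).
USAGE_REF = struct.Struct("=IIIIIi")
HID_FIELD_INDEX_NONE = 0xFFFFFFFF  # assumed value of the FIELD_INDEX_NONE marker


def decode_usage_refs(data):
    """Split a read() buffer into dicts, flagging report notifications."""
    records = []
    # Ignore any trailing partial record.
    usable = len(data) - len(data) % USAGE_REF.size
    for rtype, rid, field, uidx, ucode, value in USAGE_REF.iter_unpack(data[:usable]):
        records.append({
            "report_type": rtype,
            "report_id": rid,
            "field_index": field,
            "usage_index": uidx,
            "usage_code": ucode,
            "value": value,
            # True for the extra record emitted under HIDDEV_FLAG_REPORT
            "is_report_marker": field == HID_FIELD_INDEX_NONE,
        })
    return records
```

Records whose is_report_marker entry is true carry only report_type and report_id; the remaining fields are not meaningful for them.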
206 206
Documentation/usb/usb-serial.txt
1 INTRODUCTION 1 INTRODUCTION
2 2
3 The USB serial driver currently supports a number of different USB to 3 The USB serial driver currently supports a number of different USB to
4 serial converter products, as well as some devices that use a serial 4 serial converter products, as well as some devices that use a serial
5 interface from userspace to talk to the device. 5 interface from userspace to talk to the device.
6 6
7 See the individual product section below for specific information about 7 See the individual product section below for specific information about
8 the different devices. 8 the different devices.
9 9
10 10
11 CONFIGURATION 11 CONFIGURATION
12 12
13 Currently the driver can handle up to 256 different serial interfaces at 13 Currently the driver can handle up to 256 different serial interfaces at
14 one time. 14 one time.
15 15
16 The major number that the driver uses is 188 so to use the driver, 16 The major number that the driver uses is 188 so to use the driver,
17 create the following nodes: 17 create the following nodes:
18 mknod /dev/ttyUSB0 c 188 0 18 mknod /dev/ttyUSB0 c 188 0
19 mknod /dev/ttyUSB1 c 188 1 19 mknod /dev/ttyUSB1 c 188 1
20 mknod /dev/ttyUSB2 c 188 2 20 mknod /dev/ttyUSB2 c 188 2
21 mknod /dev/ttyUSB3 c 188 3 21 mknod /dev/ttyUSB3 c 188 3
22 . 22 .
23 . 23 .
24 . 24 .
25 mknod /dev/ttyUSB254 c 188 254 25 mknod /dev/ttyUSB254 c 188 254
26 mknod /dev/ttyUSB255 c 188 255 26 mknod /dev/ttyUSB255 c 188 255
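The node list above follows a simple pattern: minor number N maps to /dev/ttyUSBN under major 188. As a hedged illustration (the function names are hypothetical and nothing here touches the filesystem), the full set of 256 commands can be generated like this:

```python
import os

USB_SERIAL_MAJOR = 188   # major number stated above
MAX_PORTS = 256          # minors 0..255


def ttyusb_nodes(count=MAX_PORTS):
    """Return (path, device_number) pairs for /dev/ttyUSB0 .. ttyUSB<count-1>."""
    return [(f"/dev/ttyUSB{minor}", os.makedev(USB_SERIAL_MAJOR, minor))
            for minor in range(count)]


def mknod_commands(count=MAX_PORTS):
    """Render the equivalent shell commands without creating anything."""
    return [f"mknod {path} c {os.major(dev)} {os.minor(dev)}"
            for path, dev in ttyusb_nodes(count)]
```

To actually create the nodes, a root process could call os.mknod(path, stat.S_IFCHR | mode, dev) for each pair; the permission bits to use are a local policy choice and are not specified by this document.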
27 27
28 When the device is connected and recognized by the driver, the driver 28 When the device is connected and recognized by the driver, the driver
29 will print to the system log which node(s) the device has been bound 29 will print to the system log which node(s) the device has been bound
30 to. 30 to.
31 31
32 32
33 SPECIFIC DEVICES SUPPORTED 33 SPECIFIC DEVICES SUPPORTED
34 34
35 35
36 ConnectTech WhiteHEAT 4 port converter 36 ConnectTech WhiteHEAT 4 port converter
37 37
38 ConnectTech has been very forthcoming with information about their 38 ConnectTech has been very forthcoming with information about their
39 device, including providing a unit to test with. 39 device, including providing a unit to test with.
40 40
41 The driver is officially supported by Connect Tech Inc. 41 The driver is officially supported by Connect Tech Inc.
42 http://www.connecttech.com 42 http://www.connecttech.com
43 43
44 For any questions or problems with this driver, please contact 44 For any questions or problems with this driver, please contact
45 Stuart MacDonald at stuartm@connecttech.com 45 Stuart MacDonald at stuartm@connecttech.com
46 46
47 47
48 HandSpring Visor, Palm USB, and Clié USB driver 48 HandSpring Visor, Palm USB, and Clié USB driver
49 49
50 This driver works with all HandSpring USB, Palm USB, and Sony Clié USB 50 This driver works with all HandSpring USB, Palm USB, and Sony Clié USB
51 devices. 51 devices.
52 52
53 Only when the device tries to connect to the host, will the device show 53 Only when the device tries to connect to the host, will the device show
54 up to the host as a valid USB device. When this happens, the device is 54 up to the host as a valid USB device. When this happens, the device is
55 properly enumerated, assigned a port, and then communication _should_ be 55 properly enumerated, assigned a port, and then communication _should_ be
56 possible. The driver cleans up properly when the device is removed, or 56 possible. The driver cleans up properly when the device is removed, or
57 the connection is canceled on the device. 57 the connection is canceled on the device.
58 58
59 NOTE: 59 NOTE:
60 This means that in order to talk to the device, the sync button must be 60 This means that in order to talk to the device, the sync button must be
61 pressed BEFORE trying to get any program to communicate to the device. 61 pressed BEFORE trying to get any program to communicate to the device.
62 This goes against the current documentation for pilot-xfer and other 62 This goes against the current documentation for pilot-xfer and other
63 packages, but is the only way that it will work due to the hardware 63 packages, but is the only way that it will work due to the hardware
64 in the device. 64 in the device.
65 65
66 When the device is connected, try talking to it on the second port 66 When the device is connected, try talking to it on the second port
67 (this is usually /dev/ttyUSB1 if you do not have any other usb-serial 67 (this is usually /dev/ttyUSB1 if you do not have any other usb-serial
68 devices in the system.) The system log should tell you which port is 68 devices in the system.) The system log should tell you which port is
69 the port to use for the HotSync transfer. The "Generic" port can be used 69 the port to use for the HotSync transfer. The "Generic" port can be used
70 for other device communication, such as a PPP link. 70 for other device communication, such as a PPP link.
71 71
72 For some Sony Clié devices, /dev/ttyUSB0 must be used to talk to the 72 For some Sony Clié devices, /dev/ttyUSB0 must be used to talk to the
73 device. This is true for all OS version 3.5 devices, and most devices 73 device. This is true for all OS version 3.5 devices, and most devices
74 that have had a flash upgrade to a newer version of the OS. See the 74 that have had a flash upgrade to a newer version of the OS. See the
75 kernel system log for information on which is the correct port to use. 75 kernel system log for information on which is the correct port to use.
76 76
77 If after pressing the sync button, nothing shows up in the system log, 77 If after pressing the sync button, nothing shows up in the system log,
78 try resetting the device, first a hot reset, and then a cold reset if 78 try resetting the device, first a hot reset, and then a cold reset if
79 necessary. Some devices need this before they can talk to the USB port 79 necessary. Some devices need this before they can talk to the USB port
80 properly. 80 properly.
81 81
82 Devices that are not compiled into the kernel can be specified with module 82 Devices that are not compiled into the kernel can be specified with module
83 parameters. e.g. modprobe visor vendor=0x54c product=0x66 83 parameters. e.g. modprobe visor vendor=0x54c product=0x66
84 84
85 There is a webpage and mailing lists for this portion of the driver at: 85 There is a webpage and mailing lists for this portion of the driver at:
86 http://usbvisor.sourceforge.net/ 86 http://usbvisor.sourceforge.net/
87 87
88 For any questions or problems with this driver, please contact Greg 88 For any questions or problems with this driver, please contact Greg
89 Kroah-Hartman at greg@kroah.com 89 Kroah-Hartman at greg@kroah.com
90 90
91 91
92 PocketPC PDA Driver 92 PocketPC PDA Driver
93 93
94 This driver can be used to connect to Compaq iPAQ, HP Jornada, Casio EM500 94 This driver can be used to connect to Compaq iPAQ, HP Jornada, Casio EM500
95 and other PDAs running Windows CE 3.0 or PocketPC 2002 using a USB 95 and other PDAs running Windows CE 3.0 or PocketPC 2002 using a USB
96 cable/cradle. 96 cable/cradle.
97 Most devices supported by ActiveSync are supported out of the box. 97 Most devices supported by ActiveSync are supported out of the box.
98 For others, please use module parameters to specify the product and vendor 98 For others, please use module parameters to specify the product and vendor
99 id. e.g. modprobe ipaq vendor=0x3f0 product=0x1125 99 id. e.g. modprobe ipaq vendor=0x3f0 product=0x1125
100 100
101 The driver presents a serial interface (usually on /dev/ttyUSB0) over 101 The driver presents a serial interface (usually on /dev/ttyUSB0) over
102 which one may run ppp and establish a TCP/IP link to the PDA. Once this 102 which one may run ppp and establish a TCP/IP link to the PDA. Once this
103 is done, you can transfer files, backup, download email etc. The most 103 is done, you can transfer files, backup, download email etc. The most
104 significant advantage of using USB is speed - I can get 73 to 113 104 significant advantage of using USB is speed - I can get 73 to 113
105 kbytes/sec for download/upload to my iPAQ. 105 kbytes/sec for download/upload to my iPAQ.
106 106
107 This driver is only one of a set of components required to utilize 107 This driver is only one of a set of components required to utilize
108 the USB connection. Please visit http://synce.sourceforge.net which 108 the USB connection. Please visit http://synce.sourceforge.net which
109 contains the necessary packages and a simple step-by-step howto. 109 contains the necessary packages and a simple step-by-step howto.
110 110
111 Once connected, you can use Win CE programs like ftpView, Pocket Outlook 111 Once connected, you can use Win CE programs like ftpView, Pocket Outlook
112 from the PDA and xcerdisp, synce utilities from the Linux side. 112 from the PDA and xcerdisp, synce utilities from the Linux side.
113 113
114 To use Pocket IE, follow the instructions given at 114 To use Pocket IE, follow the instructions given at
115 http://www.tekguru.co.uk/EM500/usbtonet.htm to achieve the same thing 115 http://www.tekguru.co.uk/EM500/usbtonet.htm to achieve the same thing
116 on Win98. Omit the proxy server part; Linux is quite capable of forwarding 116 on Win98. Omit the proxy server part; Linux is quite capable of forwarding
117 packets unlike Win98. Another modification is required at least for the 117 packets unlike Win98. Another modification is required at least for the
118 iPAQ - disable autosync by going to the Start/Settings/Connections menu 118 iPAQ - disable autosync by going to the Start/Settings/Connections menu
119 and unchecking the "Automatically synchronize ..." box. Go to 119 and unchecking the "Automatically synchronize ..." box. Go to
120 Start/Programs/Connections, connect the cable and select "usbdial" (or 120 Start/Programs/Connections, connect the cable and select "usbdial" (or
121 whatever you named your new USB connection). You should finally wind 121 whatever you named your new USB connection). You should finally wind
122 up with a "Connected to usbdial" window with status shown as connected. 122 up with a "Connected to usbdial" window with status shown as connected.
123 Now start up PIE and browse away. 123 Now start up PIE and browse away.
124 124
125 If it doesn't work for some reason, load both the usbserial and ipaq module 125 If it doesn't work for some reason, load both the usbserial and ipaq module
126 with the module parameter "debug" set to 1 and examine the system log. 126 with the module parameter "debug" set to 1 and examine the system log.
127 You can also try soft-resetting your PDA before attempting a connection. 127 You can also try soft-resetting your PDA before attempting a connection.
128 128
129 Other functionality may be possible depending on your PDA. According to 129 Other functionality may be possible depending on your PDA. According to
130 Wes Cilldhaire <billybobjoehenrybob@hotmail.com>, with the Toshiba E570, 130 Wes Cilldhaire <billybobjoehenrybob@hotmail.com>, with the Toshiba E570,
131 ...if you boot into the bootloader (hold down the power when hitting the 131 ...if you boot into the bootloader (hold down the power when hitting the
132 reset button, continuing to hold onto the power until the bootloader screen 132 reset button, continuing to hold onto the power until the bootloader screen
133 is displayed), then put it in the cradle with the ipaq driver loaded, open 133 is displayed), then put it in the cradle with the ipaq driver loaded, open
134 a terminal on /dev/ttyUSB0, it gives you a "USB Reflash" terminal, which can 134 a terminal on /dev/ttyUSB0, it gives you a "USB Reflash" terminal, which can
135 be used to flash the ROM, as well as the microP code.. so much for needing 135 be used to flash the ROM, as well as the microP code.. so much for needing
136 Toshiba's $350 serial cable for flashing!! :D 136 Toshiba's $350 serial cable for flashing!! :D
137 NOTE: This has NOT been tested. Use at your own risk. 137 NOTE: This has NOT been tested. Use at your own risk.
138 138
139 For any questions or problems with the driver, please contact Ganesh 139 For any questions or problems with the driver, please contact Ganesh
140 Varadarajan <ganesh@veritas.com> 140 Varadarajan <ganesh@veritas.com>
141 141
142 142
143 Keyspan PDA Serial Adapter 143 Keyspan PDA Serial Adapter
144 144
145 Single port DB-9 serial adapter, pushed as a PDA adapter for iMacs (mostly 145 Single port DB-9 serial adapter, pushed as a PDA adapter for iMacs (mostly
146 sold in Macintosh catalogs, comes in a translucent white/green dongle). 146 sold in Macintosh catalogs, comes in a translucent white/green dongle).
147 Fairly simple device. Firmware is homebrew. 147 Fairly simple device. Firmware is homebrew.
148 This driver also works for the Xircom/Entregra single port serial adapter. 148 This driver also works for the Xircom/Entregra single port serial adapter.
149 149
150 Current status: 150 Current status:
151 Things that work: 151 Things that work:
152 basic input/output (tested with 'cu') 152 basic input/output (tested with 'cu')
153 blocking write when serial line can't keep up 153 blocking write when serial line can't keep up
154 changing baud rates (up to 115200) 154 changing baud rates (up to 115200)
155 getting/setting modem control pins (TIOCM{GET,SET,BIS,BIC}) 155 getting/setting modem control pins (TIOCM{GET,SET,BIS,BIC})
156 sending break (although duration looks suspect) 156 sending break (although duration looks suspect)
157 Things that don't: 157 Things that don't:
158 device strings (as logged by kernel) have trailing binary garbage 158 device strings (as logged by kernel) have trailing binary garbage
159 device ID isn't right, might collide with other Keyspan products 159 device ID isn't right, might collide with other Keyspan products
160 changing baud rates ought to flush tx/rx to avoid mangled half characters 160 changing baud rates ought to flush tx/rx to avoid mangled half characters
161 Big Things on the todo list: 161 Big Things on the todo list:
162 parity, 7 vs 8 bits per char, 1 or 2 stop bits 162 parity, 7 vs 8 bits per char, 1 or 2 stop bits
163 HW flow control 163 HW flow control
164 not all of the standard USB descriptors are handled: Get_Status, Set_Feature 164 not all of the standard USB descriptors are handled: Get_Status, Set_Feature
165 O_NONBLOCK, select() 165 O_NONBLOCK, select()
166 166
167 For any questions or problems with this driver, please contact Brian 167 For any questions or problems with this driver, please contact Brian
168 Warner at warner@lothar.com 168 Warner at warner@lothar.com
169 169
170 170
171 Keyspan USA-series Serial Adapters 171 Keyspan USA-series Serial Adapters
172 172
173 Single, Dual and Quad port adapters - driver uses Keyspan supplied 173 Single, Dual and Quad port adapters - driver uses Keyspan supplied
174 firmware and is being developed with their support. 174 firmware and is being developed with their support.
175 175
176 Current status: 176 Current status:
177 The USA-18X, USA-28X, USA-19, USA-19W and USA-49W are supported and 177 The USA-18X, USA-28X, USA-19, USA-19W and USA-49W are supported and
178 have been pretty thoroughly tested at various baud rates with 8-N-1 178 have been pretty thoroughly tested at various baud rates with 8-N-1
179 character settings. Other character lengths and parity setups are 179 character settings. Other character lengths and parity setups are
180 presently untested. 180 presently untested.
181 181
182 The USA-28 isn't yet supported though doing so should be pretty 182 The USA-28 isn't yet supported though doing so should be pretty
183 straightforward. Contact the maintainer if you require this 183 straightforward. Contact the maintainer if you require this
184 functionality. 184 functionality.
185 185
186 More information is available at: 186 More information is available at:
187 http://misc.nu/hugh/keyspan.html 187 http://misc.nu/hugh/keyspan.html
188 188
189 For any questions or problems with this driver, please contact Hugh 189 For any questions or problems with this driver, please contact Hugh
190 Blemings at hugh@misc.nu 190 Blemings at hugh@misc.nu
191 191
192 192
193 FTDI Single Port Serial Driver 193 FTDI Single Port Serial Driver
194 194
195 This is a single port DB-25 serial adapter. More information about this 195 This is a single port DB-25 serial adapter. More information about this
196 device and the Linux driver can be found at: 196 device and the Linux driver can be found at:
197 http://reality.sgi.com/bryder_wellington/ftdi_sio/ 197 http://reality.sgi.com/bryder_wellington/ftdi_sio/
198 198
199 For any questions or problems with this driver, please contact Bill Ryder 199 For any questions or problems with this driver, please contact Bill Ryder
200 at bryder@sgi.com 200 at bryder@sgi.com
201 201
202 202
203 ZyXEL omni.net lcd plus ISDN TA 203 ZyXEL omni.net lcd plus ISDN TA
204 204
205 This is an ISDN TA. Please report both successes and troubles to 205 This is an ISDN TA. Please report both successes and troubles to
206 azummo@towertech.it 206 azummo@towertech.it
207 207
208 208
209 Cypress M8 CY4601 Family Serial Driver 209 Cypress M8 CY4601 Family Serial Driver
210 210
211 This driver was in most part developed by Neil "koyama" Whelchel. It 211 This driver was in most part developed by Neil "koyama" Whelchel. It
212 has been improved since that previous form to support dynamic serial 212 has been improved since that previous form to support dynamic serial
213 line settings and improved line handling. The driver is for the most 213 line settings and improved line handling. The driver is for the most
214 part stable and has been tested on an smp machine. (dual p2) 214 part stable and has been tested on an smp machine. (dual p2)
215 215
216 Chipsets supported under CY4601 family: 216 Chipsets supported under CY4601 family:
217 217
218 CY7C63723, CY7C63742, CY7C63743, CY7C64013 218 CY7C63723, CY7C63742, CY7C63743, CY7C64013
219 219
220 Devices supported: 220 Devices supported:
221 221
222 -DeLorme's USB Earthmate (SiRF Star II lp arch) 222 -DeLorme's USB Earthmate (SiRF Star II lp arch)
223 -Cypress HID->COM RS232 adapter 223 -Cypress HID->COM RS232 adapter
224 224
225 Note: Cypress Semiconductor claims no affiliation with the 225 Note: Cypress Semiconductor claims no affiliation with the
226 the hid->com device. 226 hid->com device.
227 227
228 Most devices using chipsets under the CY4601 family should 228 Most devices using chipsets under the CY4601 family should
229 work with the driver, as long as they stay true to the CY4601 229 work with the driver, as long as they stay true to the CY4601
230 usbserial specification. 230 usbserial specification.
231 231
232 Technical notes: 232 Technical notes:
233 233
234 The Earthmate starts out at 4800 8N1 by default... the driver will 234 The Earthmate starts out at 4800 8N1 by default... the driver will
235 upon start init to this setting. usbserial core provides the rest 235 upon start init to this setting. usbserial core provides the rest
236 of the termios settings, along with some custom termios so that the 236 of the termios settings, along with some custom termios so that the
237 output is in proper format and parsable. 237 output is in proper format and parsable.
238 238
239 The device can be put into sirf mode by issuing NMEA command: 239 The device can be put into sirf mode by issuing NMEA command:
240 $PSRF100,<protocol>,<baud>,<databits>,<stopbits>,<parity>*CHECKSUM 240 $PSRF100,<protocol>,<baud>,<databits>,<stopbits>,<parity>*CHECKSUM
241 $PSRF100,0,9600,8,1,0*0C 241 $PSRF100,0,9600,8,1,0*0C
242 242
243 It should then be sufficient to change the port termios to match this 243 It should then be sufficient to change the port termios to match this
244 to begin communicating. 244 to begin communicating.
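The trailing "0C" in the example command is the usual NMEA checksum: the XOR of every character between "$" and "*", written as two hex digits. A small Python sketch of building such a sentence follows; the function names are hypothetical, and the CR/LF terminator is the general NMEA convention rather than something stated in this document.

```python
from functools import reduce


def nmea_checksum(body):
    """XOR of every character between '$' and '*', as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), body, 0):02X}"


def psrf100(protocol, baud, databits, stopbits, parity):
    """Build a $PSRF100 sentence with its checksum and CR/LF terminator."""
    body = f"PSRF100,{protocol},{baud},{databits},{stopbits},{parity}"
    return f"${body}*{nmea_checksum(body)}\r\n"
```

Calling psrf100(0, 9600, 8, 1, 0) reproduces the example sentence shown above, checksum included.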
245 245
246 As far as I can tell it supports pretty much every sirf command as 246 As far as I can tell it supports pretty much every sirf command as
247 documented online available with firmware 2.31, with some unknown 247 documented online available with firmware 2.31, with some unknown
248 message ids. 248 message ids.
249 249
250 The hid->com adapter can run at a maximum baud of 115200bps. Please note 250 The hid->com adapter can run at a maximum baud of 115200bps. Please note
251 that the device has trouble or is incapable of raising line voltage properly. 251 that the device has trouble or is incapable of raising line voltage properly.
252 It will be fine with null modem links, as long as you do not try to link two 252 It will be fine with null modem links, as long as you do not try to link two
253 together without hacking the adapter to set the line high. 253 together without hacking the adapter to set the line high.
254 254
255 The driver is smp safe. Performance with the driver is rather low when using 255 The driver is smp safe. Performance with the driver is rather low when using
256 it for transferring files. This is being worked on, but I would be willing to 256 it for transferring files. This is being worked on, but I would be willing to
257 accept patches. An urb queue or packet buffer would likely fit the bill here. 257 accept patches. An urb queue or packet buffer would likely fit the bill here.
258 258
259 If you have any questions, problems, patches, feature requests, etc. you can 259 If you have any questions, problems, patches, feature requests, etc. you can
260 contact me here via email: 260 contact me here via email:
261 dignome@gmail.com 261 dignome@gmail.com
262 (your problems/patches can alternately be submitted to usb-devel) 262 (your problems/patches can alternately be submitted to usb-devel)
263 263
264 264
265 Digi AccelePort Driver 265 Digi AccelePort Driver
266 266
267 This driver supports the Digi AccelePort USB 2 and 4 devices, 2 port 267 This driver supports the Digi AccelePort USB 2 and 4 devices, 2 port
268 (plus a parallel port) and 4 port USB serial converters. The driver 268 (plus a parallel port) and 4 port USB serial converters. The driver
269 does NOT yet support the Digi AccelePort USB 8. 269 does NOT yet support the Digi AccelePort USB 8.
270 270
271 This driver works under SMP with the usb-uhci driver. It does not 271 This driver works under SMP with the usb-uhci driver. It does not
272 work under SMP with the uhci driver. 272 work under SMP with the uhci driver.
273 273
274 The driver is generally working, though we still have a few more ioctls 274 The driver is generally working, though we still have a few more ioctls
275 to implement and final testing and debugging to do. The parallel port 275 to implement and final testing and debugging to do. The parallel port
276 on the USB 2 is supported as a serial to parallel converter; in other 276 on the USB 2 is supported as a serial to parallel converter; in other
277 words, it appears as another USB serial port on Linux, even though 277 words, it appears as another USB serial port on Linux, even though
278 physically it is really a parallel port. The Digi Acceleport USB 8 278 physically it is really a parallel port. The Digi Acceleport USB 8
279 is not yet supported. 279 is not yet supported.
280 280
281 Please contact Peter Berger (pberger@brimson.com) or Al Borchers 281 Please contact Peter Berger (pberger@brimson.com) or Al Borchers
282 (alborchers@steinerpoint.com) for questions or problems with this 282 (alborchers@steinerpoint.com) for questions or problems with this
283 driver. 283 driver.
284 284
285 285
286 Belkin USB Serial Adapter F5U103 286 Belkin USB Serial Adapter F5U103
287 287
288 Single port DB-9/PS-2 serial adapter from Belkin with firmware by eTEK Labs. 288 Single port DB-9/PS-2 serial adapter from Belkin with firmware by eTEK Labs.
289 The Peracom single port serial adapter also works with this driver, as 289 The Peracom single port serial adapter also works with this driver, as
290 well as the GoHubs adapter. 290 well as the GoHubs adapter.
291 291
292 Current status: 292 Current status:
293 The following have been tested and work: 293 The following have been tested and work:
294 Baud rate 300-230400 294 Baud rate 300-230400
295 Data bits 5-8 295 Data bits 5-8
296 Stop bits 1-2 296 Stop bits 1-2
297 Parity N,E,O,M,S 297 Parity N,E,O,M,S
298 Handshake None, Software (XON/XOFF), Hardware (CTSRTS,CTSDTR)* 298 Handshake None, Software (XON/XOFF), Hardware (CTSRTS,CTSDTR)*
299 Break Set and clear 299 Break Set and clear
300 Line control Input/Output query and control ** 300 Line control Input/Output query and control **
301 301
302 * Hardware input flow control is only enabled for firmware 302 * Hardware input flow control is only enabled for firmware
303 levels above 2.06. Read source code comments describing Belkin 303 levels above 2.06. Read source code comments describing Belkin
304 firmware errata. Hardware output flow control is working for all 304 firmware errata. Hardware output flow control is working for all
305 firmware versions. 305 firmware versions.
306 ** Queries of inputs (CTS,DSR,CD,RI) show the last 306 ** Queries of inputs (CTS,DSR,CD,RI) show the last
307 reported state. Queries of outputs (DTR,RTS) show the last 307 reported state. Queries of outputs (DTR,RTS) show the last
308 requested state and may not reflect current state as set by 308 requested state and may not reflect current state as set by
309 automatic hardware flow control. 309 automatic hardware flow control.
310 310
311 TO DO List: 311 TO DO List:
312 -- Add true modem control line query capability. Currently tracks the 312 -- Add true modem control line query capability. Currently tracks the
313 states reported by the interrupt and the states requested. 313 states reported by the interrupt and the states requested.
314 -- Add error reporting back to application for UART error conditions. 314 -- Add error reporting back to application for UART error conditions.
315 -- Add support for flush ioctls. 315 -- Add support for flush ioctls.
316 -- Add everything else that is missing :) 316 -- Add everything else that is missing :)
317 317
318 For any questions or problems with this driver, please contact William 318 For any questions or problems with this driver, please contact William
319 Greathouse at wgreathouse@smva.com 319 Greathouse at wgreathouse@smva.com
320 320
321 321
322 Empeg empeg-car Mark I/II Driver 322 Empeg empeg-car Mark I/II Driver
323 323
324 This is an experimental driver to provide connectivity support for the 324 This is an experimental driver to provide connectivity support for the
325 client synchronization tools for an Empeg empeg-car mp3 player. 325 client synchronization tools for an Empeg empeg-car mp3 player.
326 326
327 Tips: 327 Tips:
328 * Don't forget to create the device nodes for ttyUSB{0,1,2,...} 328 * Don't forget to create the device nodes for ttyUSB{0,1,2,...}
329 * modprobe empeg (modprobe is your friend) 329 * modprobe empeg (modprobe is your friend)
330 * emptool --usb /dev/ttyUSB0 (or whatever you named your device node) 330 * emptool --usb /dev/ttyUSB0 (or whatever you named your device node)
331 331
332 For any questions or problems with this driver, please contact Gary 332 For any questions or problems with this driver, please contact Gary
333 Brubaker at xavyer@ix.netcom.com 333 Brubaker at xavyer@ix.netcom.com
334 334
335 335
336 MCT USB Single Port Serial Adapter U232 336 MCT USB Single Port Serial Adapter U232
337 337
338 This driver is for the MCT USB-RS232 Converter (25 pin, Model No. 338 This driver is for the MCT USB-RS232 Converter (25 pin, Model No.
339 U232-P25) from Magic Control Technology Corp. (there is also a 9 pin 339 U232-P25) from Magic Control Technology Corp. (there is also a 9 pin
340 Model No. U232-P9). More information about this device can be found at 340 Model No. U232-P9). More information about this device can be found at
341 the manufacturer's web site: http://www.mct.com.tw. 341 the manufacturer's web site: http://www.mct.com.tw.
342 342
343 The driver is generally working, though it still needs some more testing. 343 The driver is generally working, though it still needs some more testing.
344 It is derived from the Belkin USB Serial Adapter F5U103 driver and its 344 It is derived from the Belkin USB Serial Adapter F5U103 driver and its
345 TODO list is valid for this driver as well. 345 TODO list is valid for this driver as well.
346 346
347 This driver has also been found to work for other products, which have 347 This driver has also been found to work for other products, which have
348 the same Vendor ID but different Product IDs. Sitecom's U232-P25 serial 348 the same Vendor ID but different Product IDs. Sitecom's U232-P25 serial
349 converter uses Product ID 0x230 and Vendor ID 0x711 and works with this 349 converter uses Product ID 0x230 and Vendor ID 0x711 and works with this
350 driver. D-Link's DU-H3SP USB BAY also works with this driver. 350 driver. D-Link's DU-H3SP USB BAY also works with this driver.
351 351
352 For any questions or problems with this driver, please contact Wolfgang 352 For any questions or problems with this driver, please contact Wolfgang
353 Grandegger at wolfgang@ces.ch 353 Grandegger at wolfgang@ces.ch
354 354
355 355
356 Inside Out Networks Edgeport Driver 356 Inside Out Networks Edgeport Driver
357 357
358 This driver supports all devices made by Inside Out Networks, specifically 358 This driver supports all devices made by Inside Out Networks, specifically
359 the following models: 359 the following models:
360 Edgeport/4 360 Edgeport/4
361 Rapidport/4 361 Rapidport/4
362 Edgeport/4t 362 Edgeport/4t
363 Edgeport/2 363 Edgeport/2
364 Edgeport/4i 364 Edgeport/4i
365 Edgeport/2i 365 Edgeport/2i
366 Edgeport/421 366 Edgeport/421
367 Edgeport/21 367 Edgeport/21
368 Edgeport/8 368 Edgeport/8
369 Edgeport/8 Dual 369 Edgeport/8 Dual
370 Edgeport/2D8 370 Edgeport/2D8
371 Edgeport/4D8 371 Edgeport/4D8
372 Edgeport/8i 372 Edgeport/8i
373 Edgeport/2 DIN 373 Edgeport/2 DIN
374 Edgeport/4 DIN 374 Edgeport/4 DIN
375 Edgeport/16 Dual 375 Edgeport/16 Dual
376 376
377 For any questions or problems with this driver, please contact Greg 377 For any questions or problems with this driver, please contact Greg
378 Kroah-Hartman at greg@kroah.com 378 Kroah-Hartman at greg@kroah.com
379 379
380 380
381 REINER SCT cyberJack pinpad/e-com USB chipcard reader 381 REINER SCT cyberJack pinpad/e-com USB chipcard reader
382 382
383 Interface to ISO 7816 compatible contactbased chipcards, e.g. GSM SIMs. 383 Interface to ISO 7816 compatible contactbased chipcards, e.g. GSM SIMs.
384 384
385 Current status: 385 Current status:
386 This is the kernel part of the driver for this USB card reader. 386 This is the kernel part of the driver for this USB card reader.
387 There is also a user part for a CT-API driver available. A site 387 There is also a user part for a CT-API driver available. A site
388 for downloading is TBA. For now, you can request it from the 388 for downloading is TBA. For now, you can request it from the
389 maintainer (linux-usb@sii.li). 389 maintainer (linux-usb@sii.li).
390 390
391 For any questions or problems with this driver, please contact 391 For any questions or problems with this driver, please contact
392 linux-usb@sii.li 392 linux-usb@sii.li
393 393
394 394
395 Prolific PL2303 Driver 395 Prolific PL2303 Driver
396 396
397 This driver supports any device that has the PL2303 chip from Prolific 397 This driver supports any device that has the PL2303 chip from Prolific
398 in it. This includes a number of single port USB to serial 398 in it. This includes a number of single port USB to serial
399 converters and USB GPS devices. Devices from Aten (the UC-232) and 399 converters and USB GPS devices. Devices from Aten (the UC-232) and
400 IO-Data work with this driver, as does the DCU-11 mobile-phone cable. 400 IO-Data work with this driver, as does the DCU-11 mobile-phone cable.
401 401
402 For any questions or problems with this driver, please contact Greg 402 For any questions or problems with this driver, please contact Greg
403 Kroah-Hartman at greg@kroah.com 403 Kroah-Hartman at greg@kroah.com
404 404
405 405
406 KL5KUSB105 chipset / PalmConnect USB single-port adapter 406 KL5KUSB105 chipset / PalmConnect USB single-port adapter
407 407
408 Current status: 408 Current status:
409 The driver was put together by looking at the usb bus transactions 409 The driver was put together by looking at the usb bus transactions
410 done by Palm's driver under Windows, so a lot of functionality is 410 done by Palm's driver under Windows, so a lot of functionality is
411 still missing. Notably, serial ioctls are sometimes faked or not yet 411 still missing. Notably, serial ioctls are sometimes faked or not yet
412 implemented. Support for finding out about DSR and CTS line status is 412 implemented. Support for finding out about DSR and CTS line status is
413 however implemented (though not nicely), so your favorite autopilot(1) 413 however implemented (though not nicely), so your favorite autopilot(1)
414 and pilot-manager -daemon calls will work. Baud rates up to 115200 414 and pilot-manager -daemon calls will work. Baud rates up to 115200
415 are supported, but handshaking (software or hardware) is not, which is 415 are supported, but handshaking (software or hardware) is not, which is
416 why it is wise to cut down on the rate used for large 416 why it is wise to cut down on the rate used for large
417 transfers until this is settled. 417 transfers until this is settled.
418 418
419 Options supported: 419 Options supported:
420 If this driver is compiled as a module you can pass the following 420 If this driver is compiled as a module you can pass the following
421 options to it: 421 options to it:
422 debug - extra verbose debugging info 422 debug - extra verbose debugging info
423 (default: 0; nonzero enables) 423 (default: 0; nonzero enables)
424 use_lowlatency - use low_latency flag to speed up tty layer 424 use_lowlatency - use low_latency flag to speed up tty layer
425 when reading from from the device. 425 when reading from the device.
426 (default: 0; nonzero enables) 426 (default: 0; nonzero enables)
427 427
428 See http://www.uuhaus.de/linux/palmconnect.html for up-to-date 428 See http://www.uuhaus.de/linux/palmconnect.html for up-to-date
429 information on this driver. 429 information on this driver.
430 430
431 AIRcable USB Dongle Bluetooth driver 431 AIRcable USB Dongle Bluetooth driver
432 If the cdc_acm driver is loaded in the system, you will find that 432 If the cdc_acm driver is loaded in the system, you will find that
433 cdc_acm claims the device before AIRcable can. This is easily corrected 433 cdc_acm claims the device before AIRcable can. This is easily corrected
434 by unloading both modules and then loading the aircable module before 434 by unloading both modules and then loading the aircable module before
435 the cdc_acm module. 435 the cdc_acm module.
436 436
437 Generic Serial driver 437 Generic Serial driver
438 438
439 If your device is neither one of the devices listed above nor compatible 439 If your device is neither one of the devices listed above nor compatible
440 with the above models, you can try out the "generic" interface. This 440 with the above models, you can try out the "generic" interface. This
441 interface does not provide any type of control messages sent to the 441 interface does not provide any type of control messages sent to the
442 device, and does not support any kind of device flow control. All that 442 device, and does not support any kind of device flow control. All that
443 is required of your device is that it has at least one bulk in endpoint, 443 is required of your device is that it has at least one bulk in endpoint,
444 or one bulk out endpoint. 444 or one bulk out endpoint.
445 445
446 To enable the generic driver to recognize your device, build the driver 446 To enable the generic driver to recognize your device, build the driver
447 as a module and load it by the following invocation: 447 as a module and load it by the following invocation:
448 insmod usbserial vendor=0x#### product=0x#### 448 insmod usbserial vendor=0x#### product=0x####
449 where the #### is replaced with the hex representation of your device's 449 where the #### is replaced with the hex representation of your device's
450 vendor id and product id. 450 vendor id and product id.
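
As a rough illustration of the invocation above, the following Python sketch
formats a vendor/product pair into the insmod line. The function name and the
sample IDs (0x1234, 0xabcd) are made up for this example and do not refer to
any real device; the only fact relied on is that USB vendor and product IDs
are 16-bit values.

```python
def generic_usbserial_command(vendor_id: int, product_id: int) -> str:
    """Build the insmod line for the generic usbserial interface.

    Hypothetical helper: it only formats the command text and does not
    load any module itself.
    """
    for name, value in (("vendor", vendor_id), ("product", product_id)):
        # USB vendor and product IDs are 16-bit quantities.
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"{name} id must fit in 16 bits, got {value:#x}")
    return (
        f"insmod usbserial vendor=0x{vendor_id:04x} "
        f"product=0x{product_id:04x}"
    )


if __name__ == "__main__":
    # Placeholder IDs; substitute the values reported for your own device.
    print(generic_usbserial_command(0x1234, 0xABCD))
```

Zero-padding to four hex digits matches the 0x#### placeholder form used in
the text.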
451 451
452 This driver has been successfully used to connect to the NetChip USB 452 This driver has been successfully used to connect to the NetChip USB
453 development board, providing a way to develop USB firmware without 453 development board, providing a way to develop USB firmware without
454 having to write a custom driver. 454 having to write a custom driver.
455 455
456 For any questions or problems with this driver, please contact Greg 456 For any questions or problems with this driver, please contact Greg
457 Kroah-Hartman at greg@kroah.com 457 Kroah-Hartman at greg@kroah.com
458 458
459 459
460 CONTACT: 460 CONTACT:
461 461
462 If anyone has any problems using these drivers, with any of the above 462 If anyone has any problems using these drivers, with any of the above
463 specified products, please contact the specific driver's author listed 463 specified products, please contact the specific driver's author listed
464 above, or join the Linux-USB mailing list (information on joining the 464 above, or join the Linux-USB mailing list (information on joining the
465 mailing list, as well as a link to its searchable archive is at 465 mailing list, as well as a link to its searchable archive is at
466 http://www.linux-usb.org/ ) 466 http://www.linux-usb.org/ )
467 467
468 468
469 Greg Kroah-Hartman 469 Greg Kroah-Hartman
470 greg@kroah.com 470 greg@kroah.com
471 471
Documentation/video4linux/README.pvrusb2
1 1
2 $Id$ 2 $Id$
3 Mike Isely <isely@pobox.com> 3 Mike Isely <isely@pobox.com>
4 4
5 pvrusb2 driver 5 pvrusb2 driver
6 6
7 Background: 7 Background:
8 8
9 This driver is intended for the "Hauppauge WinTV PVR USB 2.0", which 9 This driver is intended for the "Hauppauge WinTV PVR USB 2.0", which
10 is a USB 2.0 hosted TV Tuner. This driver is a work in progress. 10 is a USB 2.0 hosted TV Tuner. This driver is a work in progress.
11 Its history started with the reverse-engineering effort by Björn 11 Its history started with the reverse-engineering effort by Björn
12 Danielsson <pvrusb2@dax.nu> whose web page can be found here: 12 Danielsson <pvrusb2@dax.nu> whose web page can be found here:
13 13
14 http://pvrusb2.dax.nu/ 14 http://pvrusb2.dax.nu/
15 15
16 From there Aurelien Alleaume <slts@free.fr> began an effort to 16 From there Aurelien Alleaume <slts@free.fr> began an effort to
17 create a video4linux compatible driver. I began with Aurelien's 17 create a video4linux compatible driver. I began with Aurelien's
18 last known snapshot and evolved the driver to the state it is in 18 last known snapshot and evolved the driver to the state it is in
19 here. 19 here.
20 20
21 More information on this driver can be found at: 21 More information on this driver can be found at:
22 22
23 http://www.isely.net/pvrusb2.html 23 http://www.isely.net/pvrusb2.html
24 24
25 25
26 This driver has a strong separation of layers. They are very 26 This driver has a strong separation of layers. They are very
27 roughly: 27 roughly:
28 28
29 1a. Low level wire-protocol implementation with the device. 29 1a. Low level wire-protocol implementation with the device.
30 30
31 1b. I2C adaptor implementation and corresponding I2C client drivers 31 1b. I2C adaptor implementation and corresponding I2C client drivers
32 implemented elsewhere in V4L. 32 implemented elsewhere in V4L.
33 33
34 1c. High level hardware driver implementation which coordinates all 34 1c. High level hardware driver implementation which coordinates all
35 activities that ensure correct operation of the device. 35 activities that ensure correct operation of the device.
36 36
37 2. A "context" layer which manages instancing of driver, setup, 37 2. A "context" layer which manages instancing of driver, setup,
38 tear-down, arbitration, and interaction with high level 38 tear-down, arbitration, and interaction with high level
39 interfaces appropriately as devices are hotplugged in the 39 interfaces appropriately as devices are hotplugged in the
40 system. 40 system.
41 41
42 3. High level interfaces which glue the driver to various published 42 3. High level interfaces which glue the driver to various published
43 Linux APIs (V4L, sysfs, maybe DVB in the future). 43 Linux APIs (V4L, sysfs, maybe DVB in the future).
44 44
45 The most important shearing layer is between the top 2 layers. A 45 The most important shearing layer is between the top 2 layers. A
46 lot of work went into the driver to ensure that any kind of 46 lot of work went into the driver to ensure that any kind of
47 conceivable API can be laid on top of the core driver. (Yes, the 47 conceivable API can be laid on top of the core driver. (Yes, the
48 driver internally leverages V4L to do its work but that really has 48 driver internally leverages V4L to do its work but that really has
49 nothing to do with the API published by the driver to the outside 49 nothing to do with the API published by the driver to the outside
50 world.) The architecture allows for different APIs to 50 world.) The architecture allows for different APIs to
51 simultaneously access the driver. I have a strong sense of fairness 51 simultaneously access the driver. I have a strong sense of fairness
52 about APIs and also feel that it is a good design principle to keep 52 about APIs and also feel that it is a good design principle to keep
53 implementation and interface isolated from each other. Thus while 53 implementation and interface isolated from each other. Thus while
54 right now the V4L high level interface is the most complete, the 54 right now the V4L high level interface is the most complete, the
55 sysfs high level interface will work equally well for similar 55 sysfs high level interface will work equally well for similar
56 functions, and there's no reason I see right now why it shouldn't be 56 functions, and there's no reason I see right now why it shouldn't be
57 possible to produce a DVB high level interface that can sit right 57 possible to produce a DVB high level interface that can sit right
58 alongside V4L. 58 alongside V4L.
59 59
60 NOTE: Complete documentation on the pvrusb2 driver is contained in 60 NOTE: Complete documentation on the pvrusb2 driver is contained in
61 the html files within the doc directory; these are exactly the same 61 the html files within the doc directory; these are exactly the same
62 as what is on the web site at the time. Browse those files 62 as what is on the web site at the time. Browse those files
63 (especially the FAQ) before asking questions. 63 (especially the FAQ) before asking questions.
64 64
65 65
66 Building 66 Building
67 67
68 Building these modules essentially amounts to just running "make", 68 Building these modules essentially amounts to just running "make",
69 but you need the kernel source tree nearby and you will likely also 69 but you need the kernel source tree nearby and you will likely also
70 want to set a few controlling environment variables first in order 70 want to set a few controlling environment variables first in order
71 to link things up with that source tree. Please see the Makefile 71 to link things up with that source tree. Please see the Makefile
72 here for comments that explain how to do that. 72 here for comments that explain how to do that.
73 73
74 74
75 Source file list / functional overview: 75 Source file list / functional overview:
76 76
77 (Note: The term "module" used below generally refers to loosely 77 (Note: The term "module" used below generally refers to loosely
78 defined functional units within the pvrusb2 driver and bears no 78 defined functional units within the pvrusb2 driver and bears no
79 relation to the Linux kernel's concept of a loadable module.) 79 relation to the Linux kernel's concept of a loadable module.)
80 80
81 pvrusb2-audio.[ch] - This is glue logic that resides between this 81 pvrusb2-audio.[ch] - This is glue logic that resides between this
82 driver and the msp3400.ko I2C client driver (which is found 82 driver and the msp3400.ko I2C client driver (which is found
83 elsewhere in V4L). 83 elsewhere in V4L).
84 84
85 pvrusb2-context.[ch] - This module implements the context for an 85 pvrusb2-context.[ch] - This module implements the context for an
86 instance of the driver. Everything else eventually ties back to 86 instance of the driver. Everything else eventually ties back to
87 or is otherwise instanced within the data structures implemented 87 or is otherwise instanced within the data structures implemented
88 here. Hotplugging is ultimately coordinated here. All high level 88 here. Hotplugging is ultimately coordinated here. All high level
89 interfaces tie into the driver through this module. This module 89 interfaces tie into the driver through this module. This module
90 helps arbitrate each interface's access to the actual driver core, 90 helps arbitrate each interface's access to the actual driver core,
91 and is designed to allow concurrent access through multiple 91 and is designed to allow concurrent access through multiple
92 instances of multiple interfaces (thus you can for example change 92 instances of multiple interfaces (thus you can for example change
93 the tuner's frequency through sysfs while simultaneously streaming 93 the tuner's frequency through sysfs while simultaneously streaming
94 video through V4L out to an instance of mplayer). 94 video through V4L out to an instance of mplayer).
95 95
96 pvrusb2-debug.h - This header defines a printk() wrapper and a mask 96 pvrusb2-debug.h - This header defines a printk() wrapper and a mask
97 of debugging bit definitions for the various kinds of debug 97 of debugging bit definitions for the various kinds of debug
98 messages that can be enabled within the driver. 98 messages that can be enabled within the driver.
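
The idea of a print wrapper gated by a mask of debug bits can be sketched as
follows. This is only an illustration in Python, not the driver's C code: the
flag names, the "pvrusb2: " prefix, and the make_tracer helper are assumptions
invented for the example.

```python
import enum


class Trace(enum.IntFlag):
    """Hypothetical debug categories, one bit each."""
    INFO = 1 << 0
    ERROR = 1 << 1
    I2C = 1 << 2


def make_tracer(enabled_mask, sink):
    """Return a trace(kind, message) function gated by enabled_mask.

    sink is any callable taking one string; it stands in for printk().
    """
    def trace(kind, message):
        # Emit the message only if its category bit is set in the mask.
        if enabled_mask & kind:
            sink(f"pvrusb2: {message}")
            return True
        return False
    return trace


if __name__ == "__main__":
    trace = make_tracer(Trace.ERROR | Trace.I2C, print)
    trace(Trace.INFO, "this category is masked off")
    trace(Trace.ERROR, "this category is enabled")
```

Keeping one bit per message category lets several kinds of messages be
enabled at once by OR-ing their bits into a single mask value.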
99 99
100 pvrusb2-debugifc.[ch] - This module implements a crude command line 100 pvrusb2-debugifc.[ch] - This module implements a crude command line
101 oriented debug interface into the driver. Aside from being part 101 oriented debug interface into the driver. Aside from being part
102 of the process for implementing manual firmware extraction (see 102 of the process for implementing manual firmware extraction (see
103 the pvrusb2 web site mentioned earlier), probably I'm the only one 103 the pvrusb2 web site mentioned earlier), probably I'm the only one
104 who has ever used this. It is mainly a debugging aid. 104 who has ever used this. It is mainly a debugging aid.
105 105
106 pvrusb2-eeprom.[ch] - This is glue logic that resides between this 106 pvrusb2-eeprom.[ch] - This is glue logic that resides between this
107 driver and the tveeprom.ko module, which is itself implemented 107 driver and the tveeprom.ko module, which is itself implemented
108 elsewhere in V4L. 108 elsewhere in V4L.
109 109
110 pvrusb2-encoder.[ch] - This module implements all protocol needed to 110 pvrusb2-encoder.[ch] - This module implements all protocol needed to
111 interact with the Conexant mpeg2 encoder chip within the pvrusb2 111 interact with the Conexant mpeg2 encoder chip within the pvrusb2
112 device. It is a crude echo of corresponding logic in ivtv, 112 device. It is a crude echo of corresponding logic in ivtv,
113 however the design goals (strict isolation) and physical layer 113 however the design goals (strict isolation) and physical layer
114 (proxy through USB instead of PCI) are different enough that this 114 (proxy through USB instead of PCI) are different enough that this
115 implementation had to be completely different. 115 implementation had to be completely different.
116 116
117 pvrusb2-hdw-internal.h - This header defines the core data structure 117 pvrusb2-hdw-internal.h - This header defines the core data structure
118 in the driver used to track ALL internal state related to control 118 in the driver used to track ALL internal state related to control
119 of the hardware. Nobody outside of the core hardware-handling 119 of the hardware. Nobody outside of the core hardware-handling
120 modules should have any business using this header. All external 120 modules should have any business using this header. All external
121 access to the driver should be through one of the high level 121 access to the driver should be through one of the high level
122 interfaces (e.g. V4L, sysfs, etc), and in fact even those high 122 interfaces (e.g. V4L, sysfs, etc), and in fact even those high
123 level interfaces are restricted to the API defined in 123 level interfaces are restricted to the API defined in
124 pvrusb2-hdw.h and NOT this header. 124 pvrusb2-hdw.h and NOT this header.
125 125
126 pvrusb2-hdw.h - This header defines the full internal API for 126 pvrusb2-hdw.h - This header defines the full internal API for
127 controlling the hardware. High level interfaces (e.g. V4L, sysfs) 127 controlling the hardware. High level interfaces (e.g. V4L, sysfs)
128 will work through here. 128 will work through here.
129 129
130 pvrusb2-hdw.c - This module implements all the various bits of logic 130 pvrusb2-hdw.c - This module implements all the various bits of logic
131 that handle overall control of a specific pvrusb2 device. 131 that handle overall control of a specific pvrusb2 device.
132 (Policy, instantiation, and arbitration of pvrusb2 devices fall 132 (Policy, instantiation, and arbitration of pvrusb2 devices fall
133 within the jurisdiction of pvrusb2-context, not here). 133 within the jurisdiction of pvrusb2-context, not here).
134 134
135 pvrusb2-i2c-chips-*.c - These modules implement the glue logic to 135 pvrusb2-i2c-chips-*.c - These modules implement the glue logic to
136 tie together and configure various I2C modules as they attach to 136 tie together and configure various I2C modules as they attach to
137 the I2C bus. There are two versions of this file. The "v4l2" 137 the I2C bus. There are two versions of this file. The "v4l2"
138 version is intended to be used in-tree alongside V4L, where we 138 version is intended to be used in-tree alongside V4L, where we
139 implement just the logic that makes sense for a pure V4L 139 implement just the logic that makes sense for a pure V4L
140 environment. The "all" version is intended for use outside of 140 environment. The "all" version is intended for use outside of
141 V4L, where we might encounter other possibly "challenging" modules 141 V4L, where we might encounter other possibly "challenging" modules
142 from ivtv or older kernel snapshots (or even the support modules 142 from ivtv or older kernel snapshots (or even the support modules
143 in the standalone snapshot). 143 in the standalone snapshot).
144 144
145 pvrusb2-i2c-cmd-v4l1.[ch] - This module implements generic V4L1 145 pvrusb2-i2c-cmd-v4l1.[ch] - This module implements generic V4L1
146 compatible commands to the I2C modules. It is here where state 146 compatible commands to the I2C modules. It is here where state
147 changes inside the pvrusb2 driver are translated into V4L1 147 changes inside the pvrusb2 driver are translated into V4L1
148 commands that are in turn sent to the various I2C modules. 148 commands that are in turn sent to the various I2C modules.
149 149
150 pvrusb2-i2c-cmd-v4l2.[ch] - This module implements generic V4L2 150 pvrusb2-i2c-cmd-v4l2.[ch] - This module implements generic V4L2
151 compatible commands to the I2C modules. It is here where state 151 compatible commands to the I2C modules. It is here where state
152 changes inside the pvrusb2 driver are translated into V4L2 152 changes inside the pvrusb2 driver are translated into V4L2
153 commands that are in turn sent to the various I2C modules. 153 commands that are in turn sent to the various I2C modules.
154 154
155 pvrusb2-i2c-core.[ch] - This module provides an implementation of a 155 pvrusb2-i2c-core.[ch] - This module provides an implementation of a
156 kernel-friendly I2C adaptor driver, through which other external 156 kernel-friendly I2C adaptor driver, through which other external
157 I2C client drivers (e.g. msp3400, tuner, lirc) may connect and 157 I2C client drivers (e.g. msp3400, tuner, lirc) may connect and
158 operate corresponding chips within the the pvrusb2 device. It is 158 operate corresponding chips within the pvrusb2 device. It is
159 through here that other V4L modules can reach into this driver to 159 through here that other V4L modules can reach into this driver to
160 operate specific pieces (and those modules are in turn driven by 160 operate specific pieces (and those modules are in turn driven by
161 glue logic which is coordinated by pvrusb2-hdw, doled out by 161 glue logic which is coordinated by pvrusb2-hdw, doled out by
162 pvrusb2-context, and then ultimately made available to users 162 pvrusb2-context, and then ultimately made available to users
163 through one of the high level interfaces). 163 through one of the high level interfaces).
164 164
165 pvrusb2-io.[ch] - This module implements a very low level ring of 165 pvrusb2-io.[ch] - This module implements a very low level ring of
166 transfer buffers, required in order to stream data from the 166 transfer buffers, required in order to stream data from the
167 device. This module is *very* low level. It only operates the 167 device. This module is *very* low level. It only operates the
168 buffers and makes no attempt to define any policy or mechanism for 168 buffers and makes no attempt to define any policy or mechanism for
169 how such buffers might be used. 169 how such buffers might be used.
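
A ring of transfer buffers that "only operates the buffers" might look
roughly like the Python sketch below. It is a simplified model under stated
assumptions, not the pvrusb2-io implementation: the BufferRing class, its
idle/ready queues, and the fill/drain method names are all hypothetical.

```python
from collections import deque


class BufferRing:
    """Fixed set of reusable buffers cycling between idle and ready."""

    def __init__(self, count, size):
        # All buffers are allocated once up front and start out idle.
        self._idle = deque(bytearray(size) for _ in range(count))
        self._ready = deque()

    def fill(self, data):
        """Producer side: copy data into an idle buffer.

        Returns False when no idle buffer is available. Data longer than
        the buffer is truncated to the buffer size.
        """
        if not self._idle:
            return False
        buf = self._idle.popleft()
        n = min(len(data), len(buf))
        buf[:n] = data[:n]
        self._ready.append((buf, n))
        return True

    def drain(self):
        """Consumer side: return the oldest payload and recycle its buffer.

        Returns None when nothing is ready.
        """
        if not self._ready:
            return None
        buf, n = self._ready.popleft()
        payload = bytes(buf[:n])
        self._idle.append(buf)
        return payload
```

The ring itself decides nothing about when to fill or drain; that policy is
left to whatever layer sits on top of it, in the same spirit as the
separation described above.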
170 170
171 pvrusb2-ioread.[ch] - This module layers on top of pvrusb2-io.[ch] 171 pvrusb2-ioread.[ch] - This module layers on top of pvrusb2-io.[ch]
172 to provide a streaming API usable by a read() system call style of 172 to provide a streaming API usable by a read() system call style of
173 I/O. Right now this is the only layer on top of pvrusb2-io.[ch], 173 I/O. Right now this is the only layer on top of pvrusb2-io.[ch],
174 however the underlying architecture here was intended to allow for 174 however the underlying architecture here was intended to allow for
175 other styles of I/O to be implemented with additional modules, like 175 other styles of I/O to be implemented with additional modules, like
176 mmap()'ed buffers or something even more exotic. 176 mmap()'ed buffers or something even more exotic.
177 177
178 pvrusb2-main.c - This is the top level of the driver. Module level 178 pvrusb2-main.c - This is the top level of the driver. Module level
179 and USB core entry points are here. This is our "main". 179 and USB core entry points are here. This is our "main".
180 180
181 pvrusb2-sysfs.[ch] - This is the high level interface which ties the 181 pvrusb2-sysfs.[ch] - This is the high level interface which ties the
182 pvrusb2 driver into sysfs. Through this interface you can do 182 pvrusb2 driver into sysfs. Through this interface you can do
183 everything with the driver except actually stream data. 183 everything with the driver except actually stream data.
184 184
185 pvrusb2-tuner.[ch] - This is glue logic that resides between this 185 pvrusb2-tuner.[ch] - This is glue logic that resides between this
186 driver and the tuner.ko I2C client driver (which is found 186 driver and the tuner.ko I2C client driver (which is found
187 elsewhere in V4L). 187 elsewhere in V4L).
188 188
189 pvrusb2-util.h - This header defines some common macros used 189 pvrusb2-util.h - This header defines some common macros used
190 throughout the driver. These macros are not really specific to 190 throughout the driver. These macros are not really specific to
191 the driver, but they had to go somewhere. 191 the driver, but they had to go somewhere.
192 192
193 pvrusb2-v4l2.[ch] - This is the high level interface which ties the 193 pvrusb2-v4l2.[ch] - This is the high level interface which ties the
194 pvrusb2 driver into video4linux. It is through here that V4L 194 pvrusb2 driver into video4linux. It is through here that V4L
195 applications can open and operate the driver in the usual V4L 195 applications can open and operate the driver in the usual V4L
196 ways. Note that **ALL** V4L functionality is published only 196 ways. Note that **ALL** V4L functionality is published only
197 through here and nowhere else. 197 through here and nowhere else.
198 198
199 pvrusb2-video-*.[ch] - This is glue logic that resides between this 199 pvrusb2-video-*.[ch] - This is glue logic that resides between this
200 driver and the saa711x.ko I2C client driver (which is found 200 driver and the saa711x.ko I2C client driver (which is found
201 elsewhere in V4L). Note that saa711x.ko used to be known as 201 elsewhere in V4L). Note that saa711x.ko used to be known as
202 saa7115.ko in ivtv. There are two versions of this; one is 202 saa7115.ko in ivtv. There are two versions of this; one is
203 selected depending on the particular saa711[5x].ko that is found. 203 selected depending on the particular saa711[5x].ko that is found.
204 204
205 pvrusb2.h - This header contains compile time tunable parameters 205 pvrusb2.h - This header contains compile time tunable parameters
206 (and at the moment the driver has very little that needs to be 206 (and at the moment the driver has very little that needs to be
207 tuned). 207 tuned).
208 208
209 209
210 -Mike Isely 210 -Mike Isely
211 isely@pobox.com 211 isely@pobox.com
212 212
213 213
Documentation/video4linux/Zoran
1 Frequently Asked Questions: 1 Frequently Asked Questions:
2 =========================== 2 ===========================
3 subject: unified zoran driver (zr360x7, zoran, buz, dc10(+), dc30(+), lml33) 3 subject: unified zoran driver (zr360x7, zoran, buz, dc10(+), dc30(+), lml33)
4 website: http://mjpeg.sourceforge.net/driver-zoran/ 4 website: http://mjpeg.sourceforge.net/driver-zoran/
5 5
6 1. What cards are supported 6 1. What cards are supported
7 1.1 What the TV decoder can do and what not 7 1.1 What the TV decoder can do and what not
8 1.2 What the TV encoder can do and what not 8 1.2 What the TV encoder can do and what not
9 2. How do I get this damn thing to work 9 2. How do I get this damn thing to work
10 3. What mainboard should I use (or why doesn't my card work) 10 3. What mainboard should I use (or why doesn't my card work)
11 4. Programming interface 11 4. Programming interface
12 5. Applications 12 5. Applications
13 6. Concerning buffer sizes, quality, output size etc. 13 6. Concerning buffer sizes, quality, output size etc.
14 7. It hangs/crashes/fails/whatevers! Help! 14 7. It hangs/crashes/fails/whatevers! Help!
15 8. Maintainers/Contacting 15 8. Maintainers/Contacting
16 9. License 16 9. License
17 17
18 =========================== 18 ===========================
19 19
20 1. What cards are supported 20 1. What cards are supported
21 21
22 Iomega Buz, Linux Media Labs LML33/LML33R10, Pinnacle/Miro 22 Iomega Buz, Linux Media Labs LML33/LML33R10, Pinnacle/Miro
23 DC10/DC10+/DC30/DC30+ and related boards (available under various names). 23 DC10/DC10+/DC30/DC30+ and related boards (available under various names).
24 24
25 Iomega Buz: 25 Iomega Buz:
26 * Zoran zr36067 PCI controller 26 * Zoran zr36067 PCI controller
27 * Zoran zr36060 MJPEG codec 27 * Zoran zr36060 MJPEG codec
28 * Philips saa7111 TV decoder 28 * Philips saa7111 TV decoder
29 * Philips saa7185 TV encoder 29 * Philips saa7185 TV encoder
30 Drivers to use: videodev, i2c-core, i2c-algo-bit, 30 Drivers to use: videodev, i2c-core, i2c-algo-bit,
31 videocodec, saa7111, saa7185, zr36060, zr36067 31 videocodec, saa7111, saa7185, zr36060, zr36067
32 Inputs/outputs: Composite and S-video 32 Inputs/outputs: Composite and S-video
33 Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) 33 Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps)
34 Card number: 7 34 Card number: 7
35 35
36 AverMedia 6 Eyes AVS6EYES: 36 AverMedia 6 Eyes AVS6EYES:
37 * Zoran zr36067 PCI controller 37 * Zoran zr36067 PCI controller
38 * Zoran zr36060 MJPEG codec 38 * Zoran zr36060 MJPEG codec
39 * Samsung ks0127 TV decoder 39 * Samsung ks0127 TV decoder
40 * Conexant bt866 TV encoder 40 * Conexant bt866 TV encoder
41 Drivers to use: videodev, i2c-core, i2c-algo-bit, 41 Drivers to use: videodev, i2c-core, i2c-algo-bit,
42 videocodec, ks0127, bt866, zr36060, zr36067 42 videocodec, ks0127, bt866, zr36060, zr36067
43 Inputs/outputs: Six physical inputs. 1-6 are composite, 43 Inputs/outputs: Six physical inputs. 1-6 are composite,
44 1-2, 3-4, 5-6 doubles as S-video, 44 1-2, 3-4, 5-6 doubles as S-video,
45 1-3 triples as component. 45 1-3 triples as component.
46 One composite output. 46 One composite output.
47 Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) 47 Norms: PAL, SECAM (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps)
48 Card number: 8 48 Card number: 8
49 Not autodetected, card=8 is necessary. 49 Not autodetected, card=8 is necessary.
50 50
51 Linux Media Labs LML33: 51 Linux Media Labs LML33:
52 * Zoran zr36067 PCI controller 52 * Zoran zr36067 PCI controller
53 * Zoran zr36060 MJPEG codec 53 * Zoran zr36060 MJPEG codec
54 * Brooktree bt819 TV decoder 54 * Brooktree bt819 TV decoder
55 * Brooktree bt856 TV encoder 55 * Brooktree bt856 TV encoder
56 Drivers to use: videodev, i2c-core, i2c-algo-bit, 56 Drivers to use: videodev, i2c-core, i2c-algo-bit,
57 videocodec, bt819, bt856, zr36060, zr36067 57 videocodec, bt819, bt856, zr36060, zr36067
58 Inputs/outputs: Composite and S-video 58 Inputs/outputs: Composite and S-video
59 Norms: PAL (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) 59 Norms: PAL (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps)
60 Card number: 5 60 Card number: 5
61 61
62 Linux Media Labs LML33R10: 62 Linux Media Labs LML33R10:
63 * Zoran zr36067 PCI controller 63 * Zoran zr36067 PCI controller
64 * Zoran zr36060 MJPEG codec 64 * Zoran zr36060 MJPEG codec
65 * Philips saa7114 TV decoder 65 * Philips saa7114 TV decoder
66 * Analog Devices adv7170 TV encoder 66 * Analog Devices adv7170 TV encoder
67 Drivers to use: videodev, i2c-core, i2c-algo-bit, 67 Drivers to use: videodev, i2c-core, i2c-algo-bit,
68 videocodec, saa7114, adv7170, zr36060, zr36067 68 videocodec, saa7114, adv7170, zr36060, zr36067
69 Inputs/outputs: Composite and S-video 69 Inputs/outputs: Composite and S-video
70 Norms: PAL (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps) 70 Norms: PAL (720x576 @ 25 fps), NTSC (720x480 @ 29.97 fps)
71 Card number: 6 71 Card number: 6
72 72
73 Pinnacle/Miro DC10(new): 73 Pinnacle/Miro DC10(new):
74 * Zoran zr36057 PCI controller 74 * Zoran zr36057 PCI controller
75 * Zoran zr36060 MJPEG codec 75 * Zoran zr36060 MJPEG codec
76 * Philips saa7110a TV decoder 76 * Philips saa7110a TV decoder
77 * Analog Devices adv7176 TV encoder 77 * Analog Devices adv7176 TV encoder
78 Drivers to use: videodev, i2c-core, i2c-algo-bit, 78 Drivers to use: videodev, i2c-core, i2c-algo-bit,
79 videocodec, saa7110, adv7175, zr36060, zr36067 79 videocodec, saa7110, adv7175, zr36060, zr36067
80 Inputs/outputs: Composite, S-video and Internal 80 Inputs/outputs: Composite, S-video and Internal
81 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps) 81 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps)
82 Card number: 1 82 Card number: 1
83 83
84 Pinnacle/Miro DC10+: 84 Pinnacle/Miro DC10+:
85 * Zoran zr36067 PCI controller 85 * Zoran zr36067 PCI controller
86 * Zoran zr36060 MJPEG codec 86 * Zoran zr36060 MJPEG codec
87 * Philips saa7110a TV decoder 87 * Philips saa7110a TV decoder
88 * Analog Devices adv7176 TV encoder 88 * Analog Devices adv7176 TV encoder
89 Drivers to use: videodev, i2c-core, i2c-algo-bit, 89 Drivers to use: videodev, i2c-core, i2c-algo-bit,
90 videocodec, saa7110, adv7175, zr36060, zr36067 90 videocodec, saa7110, adv7175, zr36060, zr36067
91 Inputs/outputs: Composite, S-video and Internal 91 Inputs/outputs: Composite, S-video and Internal
92 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps) 92 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps)
93 Card number: 2 93 Card number: 2
94 94
95 Pinnacle/Miro DC10(old): * 95 Pinnacle/Miro DC10(old): *
96 * Zoran zr36057 PCI controller 96 * Zoran zr36057 PCI controller
97 * Zoran zr36050 MJPEG codec 97 * Zoran zr36050 MJPEG codec
98 * Zoran zr36016 Video Front End or Fuji md0211 Video Front End (clone?) 98 * Zoran zr36016 Video Front End or Fuji md0211 Video Front End (clone?)
99 * Micronas vpx3220a TV decoder 99 * Micronas vpx3220a TV decoder
100 * mse3000 TV encoder or Analog Devices adv7176 TV encoder * 100 * mse3000 TV encoder or Analog Devices adv7176 TV encoder *
101 Drivers to use: videodev, i2c-core, i2c-algo-bit, 101 Drivers to use: videodev, i2c-core, i2c-algo-bit,
102 videocodec, vpx3220, mse3000/adv7175, zr36050, zr36016, zr36067 102 videocodec, vpx3220, mse3000/adv7175, zr36050, zr36016, zr36067
103 Inputs/outputs: Composite, S-video and Internal 103 Inputs/outputs: Composite, S-video and Internal
104 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps) 104 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps)
105 Card number: 0 105 Card number: 0
106 106
107 Pinnacle/Miro DC30: * 107 Pinnacle/Miro DC30: *
108 * Zoran zr36057 PCI controller 108 * Zoran zr36057 PCI controller
109 * Zoran zr36050 MJPEG codec 109 * Zoran zr36050 MJPEG codec
110 * Zoran zr36016 Video Front End 110 * Zoran zr36016 Video Front End
111 * Micronas vpx3225d/vpx3220a/vpx3216b TV decoder 111 * Micronas vpx3225d/vpx3220a/vpx3216b TV decoder
112 * Analog Devices adv7176 TV encoder 112 * Analog Devices adv7176 TV encoder
113 Drivers to use: videodev, i2c-core, i2c-algo-bit, 113 Drivers to use: videodev, i2c-core, i2c-algo-bit,
114 videocodec, vpx3220/vpx3224, adv7175, zr36050, zr36016, zr36067 114 videocodec, vpx3220/vpx3224, adv7175, zr36050, zr36016, zr36067
115 Inputs/outputs: Composite, S-video and Internal 115 Inputs/outputs: Composite, S-video and Internal
116 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps) 116 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps)
117 Card number: 3 117 Card number: 3
118 118
119 Pinnacle/Miro DC30+: * 119 Pinnacle/Miro DC30+: *
120 * Zoran zr36067 PCI controller 120 * Zoran zr36067 PCI controller
121 * Zoran zr36050 MJPEG codec 121 * Zoran zr36050 MJPEG codec
122 * Zoran zr36016 Video Front End 122 * Zoran zr36016 Video Front End
123 * Micronas vpx3225d/vpx3220a/vpx3216b TV decoder 123 * Micronas vpx3225d/vpx3220a/vpx3216b TV decoder
124 * Analog Devices adv7176 TV encoder 124 * Analog Devices adv7176 TV encoder
125 Drivers to use: videodev, i2c-core, i2c-algo-bit, 125 Drivers to use: videodev, i2c-core, i2c-algo-bit,
126 videocodec, vpx3220/vpx3224, adv7175, zr36050, zr36016, zr36067 126 videocodec, vpx3220/vpx3224, adv7175, zr36050, zr36016, zr36067
127 Inputs/outputs: Composite, S-video and Internal 127 Inputs/outputs: Composite, S-video and Internal
128 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps) 128 Norms: PAL, SECAM (768x576 @ 25 fps), NTSC (640x480 @ 29.97 fps)
129 Card number: 4 129 Card number: 4
130 130
131 Note: No module for the mse3000 is available yet 131 Note: No module for the mse3000 is available yet
132 Note: No module for the vpx3224 is available yet 132 Note: No module for the vpx3224 is available yet
133 Note: use encoder=X or decoder=X for non-default i2c chips (see i2c-id.h) 133 Note: use encoder=X or decoder=X for non-default i2c chips (see i2c-id.h)
134 134
135 =========================== 135 ===========================
136 136
137 1.1 What the TV decoder can do and what not 137 1.1 What the TV decoder can do and what not
138 138
139 The best known TV standards are NTSC/PAL/SECAM, but for decoding a frame that 139 The best known TV standards are NTSC/PAL/SECAM, but for decoding a frame that
140 information is not enough. There are several formats of the TV standards. 140 information is not enough. There are several formats of the TV standards.
141 And not every TV decoder is able to handle every format. Also, not every 141 And not every TV decoder is able to handle every format. Also, not every
142 combination is supported by the driver. There are currently 11 different 142 combination is supported by the driver. There are currently 11 different
143 TV broadcast formats all over the world. 143 TV broadcast formats all over the world.
144 144
145 The CCIR defines parameters needed for broadcasting the signal. 145 The CCIR defines parameters needed for broadcasting the signal.
146 The CCIR has defined different standards: A,B,D,E,F,G,D,H,I,K,K1,L,M,N,... 146 The CCIR has defined different standards: A,B,D,E,F,G,D,H,I,K,K1,L,M,N,...
147 The CCIR says not much about about the colorsystem used !!! 147 The CCIR says not much about the colorsystem used !!!
148 And talking about a colorsystem says not too much about how it is broadcast. 148 And talking about a colorsystem says not too much about how it is broadcast.
149 149
150 The CCIR standards A,E,F are not used any more. 150 The CCIR standards A,E,F are not used any more.
151 151
152 When you speak about NTSC, you usually mean the standard: CCIR - M using 152 When you speak about NTSC, you usually mean the standard: CCIR - M using
153 the NTSC colorsystem which is used in the USA, Japan, Mexico, Canada 153 the NTSC colorsystem which is used in the USA, Japan, Mexico, Canada
154 and a few others. 154 and a few others.
155 155
156 When you talk about PAL, you usually mean: CCIR - B/G using the PAL 156 When you talk about PAL, you usually mean: CCIR - B/G using the PAL
157 colorsystem which is used in many countries. 157 colorsystem which is used in many countries.
158 158
159 When you talk about SECAM, you mean: CCIR - L using the SECAM Colorsystem 159 When you talk about SECAM, you mean: CCIR - L using the SECAM Colorsystem
160 which is used in France, and a few others. 160 which is used in France, and a few others.
161 161
162 The other version of SECAM, CCIR - D/K, is used in Bulgaria, China, 162 The other version of SECAM, CCIR - D/K, is used in Bulgaria, China,
163 Slovakia, Hungary, Korea (Rep.), Poland, Romania and others. 163 Slovakia, Hungary, Korea (Rep.), Poland, Romania and others.
164 164
165 The CCIR - H uses the PAL colorsystem (sometimes SECAM) and is used in 165 The CCIR - H uses the PAL colorsystem (sometimes SECAM) and is used in
166 Egypt, Libya, Sri Lanka, Syrian Arab Rep. 166 Egypt, Libya, Sri Lanka, Syrian Arab Rep.
167 167
168 The CCIR - I uses the PAL colorsystem, and is used in Great Britain, Hong Kong, 168 The CCIR - I uses the PAL colorsystem, and is used in Great Britain, Hong Kong,
169 Ireland, Nigeria, South Africa. 169 Ireland, Nigeria, South Africa.
170 170
171 The CCIR - N uses the PAL colorsystem and PAL frame size but the NTSC framerate, 171 The CCIR - N uses the PAL colorsystem and PAL frame size but the NTSC framerate,
172 and is used in Argentina, Uruguay, and a few others. 172 and is used in Argentina, Uruguay, and a few others.
173 173
174 We do not talk about how the audio is broadcast ! 174 We do not talk about how the audio is broadcast !
175 175
176 Some rather good sites about the TV standards are: 176 Some rather good sites about the TV standards are:
177 http://www.sony.jp/ServiceArea/Voltage_map/ 177 http://www.sony.jp/ServiceArea/Voltage_map/
178 http://info.electronicwerkstatt.de/bereiche/fernsehtechnik/frequenzen_und_normen/Fernsehnormen/ 178 http://info.electronicwerkstatt.de/bereiche/fernsehtechnik/frequenzen_und_normen/Fernsehnormen/
179 and http://www.cabl.com/restaurant/channel.html 179 and http://www.cabl.com/restaurant/channel.html
180 180
181 Other weird things around: NTSC 4.43 is a modified NTSC, which is mainly 181 Other weird things around: NTSC 4.43 is a modified NTSC, which is mainly
182 used in PAL VCRs that are able to play back NTSC. PAL 60 seems to be the same 182 used in PAL VCRs that are able to play back NTSC. PAL 60 seems to be the same
183 as NTSC 4.43. The datasheets also talk about NTSC 44; it seems as if it would 183 as NTSC 4.43. The datasheets also talk about NTSC 44; it seems as if it would
184 be the same as NTSC 4.43. 184 be the same as NTSC 4.43.
185 NTSC Comb seems to be a decoder mode where the decoder uses a comb filter 185 NTSC Comb seems to be a decoder mode where the decoder uses a comb filter
186 to split chroma and luma instead of a delay line. 186 to split chroma and luma instead of a delay line.
187 187
188 But I did not definitely find out what NTSC Comb is. 188 But I did not definitely find out what NTSC Comb is.
189 189
190 Philips saa7111 TV decoder 190 Philips saa7111 TV decoder
191 was introduced in 1997, is used in the BUZ and 191 was introduced in 1997, is used in the BUZ and
192 can handle: PAL B/G/H/I, PAL N, PAL M, NTSC M, NTSC N, NTSC 4.43 and SECAM 192 can handle: PAL B/G/H/I, PAL N, PAL M, NTSC M, NTSC N, NTSC 4.43 and SECAM
193 193
194 Philips saa7110a TV decoder 194 Philips saa7110a TV decoder
195 was introduced in 1995, is used in the Pinnacle/Miro DC10(new), DC10+ and 195 was introduced in 1995, is used in the Pinnacle/Miro DC10(new), DC10+ and
196 can handle: PAL B/G, NTSC M and SECAM 196 can handle: PAL B/G, NTSC M and SECAM
197 197
198 Philips saa7114 TV decoder 198 Philips saa7114 TV decoder
199 was introduced in 2000, is used in the LML33R10 and 199 was introduced in 2000, is used in the LML33R10 and
200 can handle: PAL B/G/D/H/I/N, PAL N, PAL M, NTSC M, NTSC 4.43 and SECAM 200 can handle: PAL B/G/D/H/I/N, PAL N, PAL M, NTSC M, NTSC 4.43 and SECAM
201 201
202 Brooktree bt819 TV decoder 202 Brooktree bt819 TV decoder
203 was introduced in 1996, and is used in the LML33 and 203 was introduced in 1996, and is used in the LML33 and
204 can handle: PAL B/D/G/H/I, NTSC M 204 can handle: PAL B/D/G/H/I, NTSC M
205 205
206 Micronas vpx3220a TV decoder 206 Micronas vpx3220a TV decoder
207 was introduced in 1996, is used in the DC30 and DC30+ and 207 was introduced in 1996, is used in the DC30 and DC30+ and
208 can handle: PAL B/G/H/I, PAL N, PAL M, NTSC M, NTSC 44, PAL 60, SECAM,NTSC Comb 208 can handle: PAL B/G/H/I, PAL N, PAL M, NTSC M, NTSC 44, PAL 60, SECAM,NTSC Comb
209 209
210 Samsung ks0127 TV decoder 210 Samsung ks0127 TV decoder
211 is used in the AVS6EYES card and 211 is used in the AVS6EYES card and
212 can handle: NTSC-M/N/44, PAL-M/N/B/G/H/I/D/K/L and SECAM 212 can handle: NTSC-M/N/44, PAL-M/N/B/G/H/I/D/K/L and SECAM
213 213
214 =========================== 214 ===========================
215 215
216 1.2 What the TV encoder can do and what not 216 1.2 What the TV encoder can do and what not
217 217
218 The TV encoders do the "same" as the decoders, but in the other direction. 218 The TV encoders do the "same" as the decoders, but in the other direction.
219 You feed them digital data and they generate a Composite or SVHS signal. 219 You feed them digital data and they generate a Composite or SVHS signal.
220 For information about the colorsystems and TV norm take a look in the 220 For information about the colorsystems and TV norm take a look in the
221 TV decoder section. 221 TV decoder section.
222 222
223 Philips saa7185 TV Encoder 223 Philips saa7185 TV Encoder
224 was introduced in 1996, is used in the BUZ 224 was introduced in 1996, is used in the BUZ
225 can generate: PAL B/G, NTSC M 225 can generate: PAL B/G, NTSC M
226 226
227 Brooktree bt856 TV Encoder 227 Brooktree bt856 TV Encoder
228 was introduced in 1994, is used in the LML33 228 was introduced in 1994, is used in the LML33
229 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M, PAL-N (Argentina) 229 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M, PAL-N (Argentina)
230 230
231 Analog Devices adv7170 TV Encoder 231 Analog Devices adv7170 TV Encoder
232 was introduced in 2000, is used in the LML33R10 232 was introduced in 2000, is used in the LML33R10
233 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M, PAL 60 233 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M, PAL 60
234 234
235 Analog Devices adv7175 TV Encoder 235 Analog Devices adv7175 TV Encoder
236 was introduced in 1996, is used in the DC10, DC10+, DC10 old, DC30, DC30+ 236 was introduced in 1996, is used in the DC10, DC10+, DC10 old, DC30, DC30+
237 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M 237 can generate: PAL B/D/G/H/I/N, PAL M, NTSC M
238 238
239 ITT mse3000 TV encoder 239 ITT mse3000 TV encoder
240 was introduced in 1991, is used in the DC10 old 240 was introduced in 1991, is used in the DC10 old
241 can generate: PAL , NTSC , SECAM 241 can generate: PAL , NTSC , SECAM
242 242
243 Conexant bt866 TV encoder 243 Conexant bt866 TV encoder
244 is used in AVS6EYES, and 244 is used in AVS6EYES, and
245 can generate: NTSC/PAL, PAL-M, PAL-N 245 can generate: NTSC/PAL, PAL-M, PAL-N
246 246
247 The adv717x should be able to produce PAL N, but you find nothing PAL N 247 The adv717x should be able to produce PAL N, but you find nothing PAL N
248 specific in the registers. It seems that you have to reuse another standard 248 specific in the registers. It seems that you have to reuse another standard
249 to generate PAL N; maybe it would work if you use the PAL M settings. 249 to generate PAL N; maybe it would work if you use the PAL M settings.
250 250
251 ========================== 251 ==========================
252 252
253 2. How do I get this damn thing to work 253 2. How do I get this damn thing to work
254 254
255 Load zr36067.o. If it can't autodetect your card, use the card=X insmod 255 Load zr36067.o. If it can't autodetect your card, use the card=X insmod
256 option with X being the card number as given in the previous section. 256 option with X being the card number as given in the previous section.
257 To have more than one card, use card=X1[,X2[,X3[,X4[..]]]] 257 To have more than one card, use card=X1[,X2[,X3[,X4[..]]]]
258 258
259 To automate this, add the following to your /etc/modprobe.conf: 259 To automate this, add the following to your /etc/modprobe.conf:
260 260
261 options zr36067 card=X1[,X2[,X3[,X4[..]]]] 261 options zr36067 card=X1[,X2[,X3[,X4[..]]]]
262 alias char-major-81-0 zr36067 262 alias char-major-81-0 zr36067
263 263
264 One thing to keep in mind is that this doesn't load zr36067.o itself yet. It 264 One thing to keep in mind is that this doesn't load zr36067.o itself yet. It
265 just automates loading. If you start using xawtv, the device won't load on 265 just automates loading. If you start using xawtv, the device won't load on
266 some systems, since you're trying to load modules as a user, which is not 266 some systems, since you're trying to load modules as a user, which is not
267 allowed ("permission denied"). A quick workaround is to add 'Load "v4l"' to 267 allowed ("permission denied"). A quick workaround is to add 'Load "v4l"' to
268 XF86Config-4 when you use X by default, or to run 'v4l-conf -c <device>' in 268 XF86Config-4 when you use X by default, or to run 'v4l-conf -c <device>' in
269 one of your startup scripts (normally rc.local) if you don't use X. Both 269 one of your startup scripts (normally rc.local) if you don't use X. Both
270 make sure that the modules are loaded on startup, under the root account. 270 make sure that the modules are loaded on startup, under the root account.
271 271
272 =========================== 272 ===========================
273 273
274 3. What mainboard should I use (or why doesn't my card work) 274 3. What mainboard should I use (or why doesn't my card work)
275 275
276 <insert lousy disclaimer here>. In short: good=SiS/Intel, bad=VIA. 276 <insert lousy disclaimer here>. In short: good=SiS/Intel, bad=VIA.
277 277
278 Experience tells us that people with a Buz, on average, have more problems 278 Experience tells us that people with a Buz, on average, have more problems
279 than users with a DC10+/LML33. Also, it tells us that people owning a VIA- 279 than users with a DC10+/LML33. Also, it tells us that people owning a VIA-
280 based mainboard (ktXXX, MVP3) have more problems than users with a mainboard 280 based mainboard (ktXXX, MVP3) have more problems than users with a mainboard
281 based on a different chipset. Here are some notes from Andrew Stevens: 281 based on a different chipset. Here are some notes from Andrew Stevens:
282 -- 282 --
283 Here's my experience of using LML33 and Buz on various motherboards: 283 Here's my experience of using LML33 and Buz on various motherboards:
284 284
285 VIA MVP3 285 VIA MVP3
286 Forget it. Pointless. Doesn't work. 286 Forget it. Pointless. Doesn't work.
287 Intel 430FX (Pentium 200) 287 Intel 430FX (Pentium 200)
288 LML33 perfect, Buz tolerable (3 or 4 frames dropped per movie) 288 LML33 perfect, Buz tolerable (3 or 4 frames dropped per movie)
289 Intel 440BX (early stepping) 289 Intel 440BX (early stepping)
290 LML33 tolerable. Buz starting to get annoying (6-10 frames/hour) 290 LML33 tolerable. Buz starting to get annoying (6-10 frames/hour)
291 Intel 440BX (late stepping) 291 Intel 440BX (late stepping)
292 Buz tolerable, LML33 almost perfect (occasional single frame drops) 292 Buz tolerable, LML33 almost perfect (occasional single frame drops)
293 SiS735 293 SiS735
294 LML33 perfect, Buz tolerable. 294 LML33 perfect, Buz tolerable.
295 VIA KT133(*) 295 VIA KT133(*)
296 LML33 starting to get annoying, Buz poor enough that I gave up. 296 LML33 starting to get annoying, Buz poor enough that I gave up.
297 297
298 Both 440BX boards were dual CPU versions. 298 Both 440BX boards were dual CPU versions.
299 -- 299 --
300 Bernhard Praschinger later added: 300 Bernhard Praschinger later added:
301 -- 301 --
302 AMD 751 302 AMD 751
303 Buz perfect-tolerable 303 Buz perfect-tolerable
304 AMD 760 304 AMD 760
305 Buz perfect-tolerable 305 Buz perfect-tolerable
306 -- 306 --
307 In general, people on the user mailing list won't give you much of a chance 307 In general, people on the user mailing list won't give you much of a chance
308 if you have a VIA-based motherboard. They may be cheap, but sometimes, you'd 308 if you have a VIA-based motherboard. They may be cheap, but sometimes, you'd
309 rather want to spend some more money on better boards. In general, VIA 309 rather want to spend some more money on better boards. In general, VIA
310 mainboards' IDE/PCI performance will also suck badly compared to others. 310 mainboards' IDE/PCI performance will also suck badly compared to others.
311 You'll notice the DC10+/DC30+ aren't mentioned anywhere in the overview. 311 You'll notice the DC10+/DC30+ aren't mentioned anywhere in the overview.
312 Basically, you can assume that if the Buz works, the LML33 will work too. If 312 Basically, you can assume that if the Buz works, the LML33 will work too. If
313 the LML33 works, the DC10+/DC30+ will work too. They're the most tolerant of 313 the LML33 works, the DC10+/DC30+ will work too. They're the most tolerant of
314 different mainboard chipsets of all the supported cards. 314 different mainboard chipsets of all the supported cards.
315 315
316 If you experience timeouts during capture, buy a better mainboard or lower 316 If you experience timeouts during capture, buy a better mainboard or lower
317 the quality/buffersize during capture (see 'Concerning buffer sizes, quality, 317 the quality/buffersize during capture (see 'Concerning buffer sizes, quality,
318 output size etc.'). If it hangs, there's little we can do as of now. Check 318 output size etc.'). If it hangs, there's little we can do as of now. Check
319 your IRQs and make sure the card has its own interrupts. 319 your IRQs and make sure the card has its own interrupts.
320 320
321 =========================== 321 ===========================
322 322
323 4. Programming interface 323 4. Programming interface
324 324
325 This driver conforms to video4linux and video4linux2; both can be used to 325 This driver conforms to video4linux and video4linux2; both can be used to
326 access the driver. Since video4linux didn't provide adequate calls to fully 326 access the driver. Since video4linux didn't provide adequate calls to fully
327 use the cards' features, we've introduced several programming extensions, 327 use the cards' features, we've introduced several programming extensions,
328 which are currently officially accepted in the 2.4.x branch of the kernel. 328 which are currently officially accepted in the 2.4.x branch of the kernel.
329 These extensions are known as the v4l/mjpeg extensions. See zoran.h for 329 These extensions are known as the v4l/mjpeg extensions. See zoran.h for
330 details (structs/ioctls). 330 details (structs/ioctls).
331 331
332 Information - video4linux: 332 Information - video4linux:
333 http://roadrunner.swansea.linux.org.uk/v4lapi.shtml 333 http://roadrunner.swansea.linux.org.uk/v4lapi.shtml
334 Documentation/video4linux/API.html 334 Documentation/video4linux/API.html
335 /usr/include/linux/videodev.h 335 /usr/include/linux/videodev.h
336 336
337 Information - video4linux/mjpeg extensions: 337 Information - video4linux/mjpeg extensions:
338 ./zoran.h 338 ./zoran.h
339 (also see below) 339 (also see below)
340 340
341 Information - video4linux2: 341 Information - video4linux2:
342 http://www.thedirks.org/v4l2/ 342 http://www.thedirks.org/v4l2/
343 /usr/include/linux/videodev2.h 343 /usr/include/linux/videodev2.h
344 http://www.bytesex.org/v4l/ 344 http://www.bytesex.org/v4l/
345 345
346 More information on the video4linux/mjpeg extensions, by Serguei 346 More information on the video4linux/mjpeg extensions, by Serguei
347 Miridonovi and Rainer Johanni: 347 Miridonovi and Rainer Johanni:
348 -- 348 --
349 The ioctls for that interface are as follows: 349 The ioctls for that interface are as follows:
350 350
351 BUZIOC_G_PARAMS 351 BUZIOC_G_PARAMS
352 BUZIOC_S_PARAMS 352 BUZIOC_S_PARAMS
353 353
354 Get and set the parameters of the buz. The user should always do a 354 Get and set the parameters of the buz. The user should always do a
355 BUZIOC_G_PARAMS (with a struct buz_params) to obtain the default 355 BUZIOC_G_PARAMS (with a struct buz_params) to obtain the default
356 settings, change what he likes and then make a BUZIOC_S_PARAMS call. 356 settings, change what he likes and then make a BUZIOC_S_PARAMS call.
357 357
358 BUZIOC_REQBUFS 358 BUZIOC_REQBUFS
359 359
360 Before being able to capture/playback, the user has to request 360 Before being able to capture/playback, the user has to request
361 the buffers he wants to use. Fill the structure 361 the buffers he wants to use. Fill the structure
362 zoran_requestbuffers with the size (recommended: 256*1024) and 362 zoran_requestbuffers with the size (recommended: 256*1024) and
363 the number (recommended 32 up to 256). There are no such restrictions 363 the number (recommended 32 up to 256). There are no such restrictions
364 as for the Video for Linux buffers, you should LEAVE SUFFICIENT 364 as for the Video for Linux buffers, you should LEAVE SUFFICIENT
365 MEMORY for your system however, else strange things will happen .... 365 MEMORY for your system however, else strange things will happen ....
366 On return, the zoran_requestbuffers structure contains number and 366 On return, the zoran_requestbuffers structure contains number and
367 size of the actually allocated buffers. 367 size of the actually allocated buffers.
368 You should use these numbers for doing a mmap of the buffers 368 You should use these numbers for doing a mmap of the buffers
369 into the user space. 369 into the user space.
370 The BUZIOC_REQBUFS ioctl also causes the next mmap to map 370 The BUZIOC_REQBUFS ioctl also causes the next mmap to map
371 the MJPEG buffers instead of the V4L buffers. 371 the MJPEG buffers instead of the V4L buffers.
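To get a feeling for what "LEAVE SUFFICIENT MEMORY" means, the following Python sketch works out the total memory taken by the recommended buffer settings (256*1024 bytes per buffer, 32 up to 256 buffers). It is plain arithmetic on the figures quoted above; the function name is made up for this example, and the driver may allocate fewer or smaller buffers than requested.

```python
def mjpeg_buffer_memory(count, size=256 * 1024):
    """Total bytes requested for 'count' MJPEG buffers of 'size' bytes each."""
    return count * size


if __name__ == "__main__":
    # Lower and upper end of the recommended buffer count.
    for count in (32, 256):
        total = mjpeg_buffer_memory(count)
        print(f"{count} buffers x 256 KiB = {total // (1024 * 1024)} MiB")
```

With the recommended size, 32 buffers need 8 MiB and 256 buffers need 64 MiB, which is worth comparing against the RAM of the capture machine before picking a count.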
372 372
373 BUZIOC_QBUF_CAPT 373 BUZIOC_QBUF_CAPT
374 BUZIOC_QBUF_PLAY 374 BUZIOC_QBUF_PLAY
375 375
376 Queue a buffer for capture or playback. The first call also starts 376 Queue a buffer for capture or playback. The first call also starts
377 streaming capture. When streaming capture is going on, you may 377 streaming capture. When streaming capture is going on, you may
378 only queue further buffers or issue syncs until streaming 378 only queue further buffers or issue syncs until streaming
379 capture is switched off again with an argument of -1 to 379 capture is switched off again with an argument of -1 to
380 a BUZIOC_QBUF_CAPT/BUZIOC_QBUF_PLAY ioctl. 380 a BUZIOC_QBUF_CAPT/BUZIOC_QBUF_PLAY ioctl.
381 381
382 BUZIOC_SYNC 382 BUZIOC_SYNC
383 383
384 Issue this ioctl when all buffers are queued. This ioctl will 384 Issue this ioctl when all buffers are queued. This ioctl will
385 block until the first buffer becomes free for saving its 385 block until the first buffer becomes free for saving its
386 data to disk (after BUZIOC_QBUF_CAPT) or for reuse (after BUZIOC_QBUF_PLAY). 386 data to disk (after BUZIOC_QBUF_CAPT) or for reuse (after BUZIOC_QBUF_PLAY).
387 387
388 BUZIOC_G_STATUS 388 BUZIOC_G_STATUS
389 389
390 Get the status of the input lines (video source connected/norm). 390 Get the status of the input lines (video source connected/norm).
391 391
392 For a programming example, please look at the lavrec.c and lavplay.c code in 392 For a programming example, please look at the lavrec.c and lavplay.c code in
393 the lavtools-1.2p2 package (URL: http://www.cicese.mx/~mirsev/DC10plus/) 393 the lavtools-1.2p2 package (URL: http://www.cicese.mx/~mirsev/DC10plus/)
394 and the 'examples' directory in the original Buz driver distribution. 394 and the 'examples' directory in the original Buz driver distribution.
395 395
396 Additional notes for software developers: 396 Additional notes for software developers:
397 397
398 The driver returns maxwidth and maxheight parameters according to 398 The driver returns maxwidth and maxheight parameters according to
399 the current TV standard (norm). Therefore, the software which 399 the current TV standard (norm). Therefore, the software which
400 communicates with the driver and "asks" for these parameters should 400 communicates with the driver and "asks" for these parameters should
401 first set the correct norm. Well, it seems logically correct: TV 401 first set the correct norm. Well, it seems logically correct: TV
402 standard is "more constant" for current country than geometry 402 standard is "more constant" for current country than geometry
403 settings of a variety of TV capture cards which may work in ITU or 403 settings of a variety of TV capture cards which may work in ITU or
404 square pixel format. Remember that users now can lock the norm to 404 square pixel format. Remember that users now can lock the norm to
405 avoid any ambiguity. 405 avoid any ambiguity.
406 -- 406 --
407 Please note that lavplay/lavrec are also included in the MJPEG-tools 407 Please note that lavplay/lavrec are also included in the MJPEG-tools
408 (http://mjpeg.sf.net/). 408 (http://mjpeg.sf.net/).
409 409
410 =========================== 410 ===========================
411 411
412 5. Applications 412 5. Applications
413 413
414 Applications known to work with this driver: 414 Applications known to work with this driver:
415 415
416 TV viewing: 416 TV viewing:
417 * xawtv 417 * xawtv
418 * kwintv 418 * kwintv
419 * probably any TV application that supports video4linux or video4linux2. 419 * probably any TV application that supports video4linux or video4linux2.
420 420
421 MJPEG capture/playback: 421 MJPEG capture/playback:
422 * mjpegtools/lavtools (or Linux Video Studio) 422 * mjpegtools/lavtools (or Linux Video Studio)
423 * gstreamer 423 * gstreamer
424 * mplayer 424 * mplayer
425 425
426 General raw capture: 426 General raw capture:
427 * xawtv 427 * xawtv
428 * gstreamer 428 * gstreamer
429 * probably any application that supports video4linux or video4linux2 429 * probably any application that supports video4linux or video4linux2
430 430
431 Video editing: 431 Video editing:
432 * Cinelerra 432 * Cinelerra
433 * MainActor 433 * MainActor
434 * mjpegtools (or Linux Video Studio) 434 * mjpegtools (or Linux Video Studio)
435 435
436 =========================== 436 ===========================
437 437
438 6. Concerning buffer sizes, quality, output size etc. 438 6. Concerning buffer sizes, quality, output size etc.
439 439
440 The zr36060 can do 1:2 JPEG compression. This is really the theoretical 440 The zr36060 can do 1:2 JPEG compression. This is really the theoretical
441 maximum that the chipset can reach. The driver can, however, limit compression 441 maximum that the chipset can reach. The driver can, however, limit compression
442 to a maximum (size) of 1:4. The reason for this is that some cards (e.g. Buz) 442 to a maximum (size) of 1:4. The reason for this is that some cards (e.g. Buz)
443 can't handle 1:2 compression without stopping capture after only a few minutes. 443 can't handle 1:2 compression without stopping capture after only a few minutes.
444 With 1:4, it'll mostly work. If you have a Buz, use 'low_bitrate=1' to go into 444 With 1:4, it'll mostly work. If you have a Buz, use 'low_bitrate=1' to go into
445 1:4 max. compression mode. 445 1:4 max. compression mode.
446 446
447 100% JPEG quality is thus 1:2 compression in practice. So for a full PAL frame 447 100% JPEG quality is thus 1:2 compression in practice. So for a full PAL frame
448 (size 720x576), the JPEG fields are stored in YUY2 format, so the size of the 448 (size 720x576), the JPEG fields are stored in YUY2 format, so the size of the
449 fields are 720x288x16/2 bits/field (2 fields/frame) = 207360 bytes/field x 2 = 449 fields are 720x288x16/2 bits/field (2 fields/frame) = 207360 bytes/field x 2 =
450 414720 bytes/frame (add some more bytes for headers and DHT (huffman)/DQT 450 414720 bytes/frame (add some more bytes for headers and DHT (huffman)/DQT
451 (quantization) tables), and you'll get to something like 512kB per frame for 451 (quantization) tables), and you'll get to something like 512kB per frame for
452 1:2 compression. For 1:4 compression, you'd have frames of half this size. 452 1:2 compression. For 1:4 compression, you'd have frames of half this size.
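The frame-size arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch, not driver code: the function name and its parameters are hypothetical, and it counts the compressed image payload only, ignoring the headers and DHT/DQT tables mentioned in the text.

```python
def jpeg_frame_bytes(width=720, field_height=288, bits_per_pixel=16,
                     compression=2, fields_per_frame=2):
    """Estimate the compressed payload of one frame in bytes.

    Hypothetical helper: YUY2 uses 16 bits per pixel, and a compression
    of 2 means 1:2. Headers and Huffman/quantization tables are not counted.
    """
    field_bits = width * field_height * bits_per_pixel // compression
    return field_bits // 8 * fields_per_frame


print(jpeg_frame_bytes())               # full PAL frame at 1:2 compression
print(jpeg_frame_bytes(compression=4))  # same frame at 1:4 compression
```

With the defaults this gives the 414720 bytes/frame quoted above, and half of that for 1:4 compression.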
453 453
454 Some additional explanation by Martin Samuelsson, which also explains the 454 Some additional explanation by Martin Samuelsson, which also explains the
455 importance of buffer sizes: 455 importance of buffer sizes:
456 -- 456 --
457 > Hmm, I do not think it is really that way. With the current (downloaded 457 > Hmm, I do not think it is really that way. With the current (downloaded
458 > at 18:00 Monday) driver I get that output sizes for 10 sec: 458 > at 18:00 Monday) driver I get that output sizes for 10 sec:
459 > -q 50 -b 128 : 24.283.332 Bytes 459 > -q 50 -b 128 : 24.283.332 Bytes
460 > -q 50 -b 256 : 48.442.368 460 > -q 50 -b 256 : 48.442.368
461 > -q 25 -b 128 : 24.655.992 461 > -q 25 -b 128 : 24.655.992
462 > -q 25 -b 256 : 25.859.820 462 > -q 25 -b 256 : 25.859.820
463 463
464 I woke up, and can't go to sleep again. I'll kill some time explaining why 464 I woke up, and can't go to sleep again. I'll kill some time explaining why
465 this doesn't look strange to me. 465 this doesn't look strange to me.
466 466
467 Let's do some math using a width of 704 pixels. I'm not sure whether the Buz 467 Let's do some math using a width of 704 pixels. I'm not sure whether the Buz
468 actually use that number or not, but that's not too important right now. 468 actually use that number or not, but that's not too important right now.
469 469
470 704x288 pixels, one field, is 202752 pixels. Divided by 64 pixels per block; 470 704x288 pixels, one field, is 202752 pixels. Divided by 64 pixels per block;
471 3168 blocks per field. Each pixel consist of two bytes; 128 bytes per block; 471 3168 blocks per field. Each pixel consist of two bytes; 128 bytes per block;
472 1024 bits per block. 100% in the new driver mean 1:2 compression; the maximum 472 1024 bits per block. 100% in the new driver mean 1:2 compression; the maximum
473 output becomes 512 bits per block. Actually 510, but 512 is simpler to use 473 output becomes 512 bits per block. Actually 510, but 512 is simpler to use
474 for calculations. 474 for calculations.
475 475
476 Let's say that we specify d1q50. We thus want 256 bits per block; times 3168 476 Let's say that we specify d1q50. We thus want 256 bits per block; times 3168
477 becomes 811008 bits; 101376 bytes per field. We're talking raw bits and bytes 477 becomes 811008 bits; 101376 bytes per field. We're talking raw bits and bytes
478 here, so we don't need to do any fancy corrections for bits-per-pixel or such 478 here, so we don't need to do any fancy corrections for bits-per-pixel or such
479 things. 101376 bytes per field. 479 things. 101376 bytes per field.
480 480
481 d1 video contains two fields per frame. Those sum up to 202752 bytes per 481 d1 video contains two fields per frame. Those sum up to 202752 bytes per
482 frame, and one of those frames goes into each buffer. 482 frame, and one of those frames goes into each buffer.
483 483
484 But wait a second! -b128 gives 128kB buffers! It's not possible to cram 484 But wait a second! -b128 gives 128kB buffers! It's not possible to cram
485 202752 bytes of JPEG data into 128kB! 485 202752 bytes of JPEG data into 128kB!
486 486
487 This is what the driver notice and automatically compensate for in your 487 This is what the driver notice and automatically compensate for in your
488 examples. Let's do some math using this information: 488 examples. Let's do some math using this information:
489 489
490 128kB is 131072 bytes. In this buffer, we want to store two fields, which 490 128kB is 131072 bytes. In this buffer, we want to store two fields, which
491 leaves 65536 bytes for each field. Using 3168 blocks per field, we get 491 leaves 65536 bytes for each field. Using 3168 blocks per field, we get
492 20.68686868... available bytes per block; 165 bits. We can't allow the 492 20.68686868... available bytes per block; 165 bits. We can't allow the
493 request for 256 bits per block when there's only 165 bits available! The -q50 493 request for 256 bits per block when there's only 165 bits available! The -q50
494 option is silently overridden, and the -b128 option takes precedence, leaving 494 option is silently overridden, and the -b128 option takes precedence, leaving
495 us with the equivalence of -q32. 495 us with the equivalence of -q32.
496 496
497 This gives us a data rate of 165 bits per block, which, times 3168, sums up 497 This gives us a data rate of 165 bits per block, which, times 3168, sums up
498 to 65340 bytes per field, out of the allowed 65536. The current driver has 498 to 65340 bytes per field, out of the allowed 65536. The current driver has
499 another level of rate limiting; it won't accept -q values that fill more than 499 another level of rate limiting; it won't accept -q values that fill more than
500 6/8 of the specified buffers. (I'm not sure why. "Playing it safe" seem to be 500 6/8 of the specified buffers. (I'm not sure why. "Playing it safe" seem to be
501 a safe bet. Personally, I think I would have lowered requested-bits-per-block 501 a safe bet. Personally, I think I would have lowered requested-bits-per-block
502 by one, or something like that.) We can't use 165 bits per block, but have to 502 by one, or something like that.) We can't use 165 bits per block, but have to
503 lower it again, to 6/8 of the available buffer space: We end up with 124 bits 503 lower it again, to 6/8 of the available buffer space: We end up with 124 bits
504 per block, the equivalence of -q24. With 128kB buffers, you can't use greater 504 per block, the equivalence of -q24. With 128kB buffers, you can't use greater
505 than -q24 at -d1. (And PAL, and 704 pixels width...) 505 than -q24 at -d1. (And PAL, and 704 pixels width...)
506 506
507 The third example is limited to -q24 through the same process. The second 507 The third example is limited to -q24 through the same process. The second
508 example, using very similar calculations, is limited to -q48. The only 508 example, using very similar calculations, is limited to -q48. The only
509 example that actually grab at the specified -q value is the last one, which 509 example that actually grab at the specified -q value is the last one, which
510 is clearly visible, looking at the file size. 510 is clearly visible, looking at the file size.
511 -- 511 --
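The buffer-size reasoning in the quoted explanation can be sketched in Python as follows. This is a rough model built only from the numbers given above (704x288 fields, 64-pixel blocks, two bytes per pixel, 512 bits per block at 100%, and the 6/8 buffer limit). The function names and the rounding choices are assumptions made for illustration; the driver's actual code may compute this differently.

```python
def max_quality(buffer_kb, width=704, field_height=288, fields_per_frame=2):
    """Highest -q value that fits in a buffer of buffer_kb kB (hypothetical model)."""
    blocks_per_field = width * field_height // 64   # 8x8 blocks: 3168
    max_bits_per_block = 64 * 2 * 8 // 2            # 1:2 of 1024 raw bits: 512
    bytes_per_field = buffer_kb * 1024 // fields_per_frame
    usable_bytes = bytes_per_field * 6 // 8         # driver fills at most 6/8
    bits_per_block = usable_bytes * 8 // blocks_per_field
    return 100 * bits_per_block // max_bits_per_block


def effective_quality(requested_q, buffer_kb):
    """Quality actually used once the buffer limit is applied (assumed behaviour)."""
    return min(requested_q, max_quality(buffer_kb))


# The four lavrec runs from the quoted mail: (-q, -b)
for q, b in [(50, 128), (50, 256), (25, 128), (25, 256)]:
    print(f"-q {q} -b {b}: effective -q {effective_quality(q, b)}")
```

Under these assumptions the model yields the limits named in the explanation: -q24 for 128kB buffers and -q48 for 256kB buffers, so only the run with -q 25 -b 256 records at the requested quality.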
512 512
513 Conclusion: the quality of the resulting movie depends on buffer size, quality, 513 Conclusion: the quality of the resulting movie depends on buffer size, quality,
514 whether or not you use 'low_bitrate=1' as insmod option for the zr36060.c 514 whether or not you use 'low_bitrate=1' as insmod option for the zr36060.c
515 module to do 1:4 instead of 1:2 compression, etc. 515 module to do 1:4 instead of 1:2 compression, etc.
516 516
517 If you experience timeouts, lowering the quality/buffersize or using 517 If you experience timeouts, lowering the quality/buffersize or using
518 'low_bitrate=1' as insmod option for zr36060.o might actually help, as is 518 'low_bitrate=1' as insmod option for zr36060.o might actually help, as is
519 proven by the Buz. 519 proven by the Buz.
520 520
521 =========================== 521 ===========================
522 522
523 7. It hangs/crashes/fails/whatevers! Help! 523 7. It hangs/crashes/fails/whatevers! Help!
524 524
525 Make sure that the card has its own interrupts (see /proc/interrupts), check 525 Make sure that the card has its own interrupts (see /proc/interrupts), check
526 the output of dmesg at high verbosity (load zr36067.o with debug=2, 526 the output of dmesg at high verbosity (load zr36067.o with debug=2,
527 load all other modules with debug=1). Check that your mainboard is favorable 527 load all other modules with debug=1). Check that your mainboard is favorable
528 (see question 2) and if not, test the card in another computer. Also see the 528 (see question 2) and if not, test the card in another computer. Also see the
529 notes given in question 3 and try lowering quality/buffersize/capturesize 529 notes given in question 3 and try lowering quality/buffersize/capturesize
530 if recording fails after a period of time. 530 if recording fails after a period of time.
531 531
532 If all this doesn't help, give a clear description of the problem including 532 If all this doesn't help, give a clear description of the problem including
533 detailed hardware information (memory+brand, mainboard+chipset+brand, which 533 detailed hardware information (memory+brand, mainboard+chipset+brand, which
534 MJPEG card, processor, other PCI cards that might be of interest), give the 534 MJPEG card, processor, other PCI cards that might be of interest), give the
535 system PnP information (/proc/interrupts, /proc/dma, /proc/devices), and give 535 system PnP information (/proc/interrupts, /proc/dma, /proc/devices), and give
536 the kernel version, driver version, glibc version, gcc version and any other 536 the kernel version, driver version, glibc version, gcc version and any other
537 information that might possibly be of interest. Also provide the dmesg output 537 information that might possibly be of interest. Also provide the dmesg output
538 at high verbosity. See 'Contacting' on how to contact the developers. 538 at high verbosity. See 'Contacting' on how to contact the developers.
539 539
540 =========================== 540 ===========================
541 541
542 8. Maintainers/Contacting 542 8. Maintainers/Contacting
543 543
544 The driver is currently maintained by Laurent Pinchart and Ronald Bultje 544 The driver is currently maintained by Laurent Pinchart and Ronald Bultje
545 (<laurent.pinchart@skynet.be> and <rbultje@ronald.bitfreak.net>). For bug 545 (<laurent.pinchart@skynet.be> and <rbultje@ronald.bitfreak.net>). For bug
546 reports or questions, please contact the mailinglist instead of the developers 546 reports or questions, please contact the mailinglist instead of the developers
547 individually. For user questions (i.e. bug reports or how-to questions), send 547 individually. For user questions (i.e. bug reports or how-to questions), send
548 an email to <mjpeg-users@lists.sf.net>, for developers (i.e. if you want to 548 an email to <mjpeg-users@lists.sf.net>, for developers (i.e. if you want to
549 help programming), send an email to <mjpeg-developer@lists.sf.net>. See 549 help programming), send an email to <mjpeg-developer@lists.sf.net>. See
550 http://www.sf.net/projects/mjpeg/ for subscription information. 550 http://www.sf.net/projects/mjpeg/ for subscription information.
551 551
552 For bug reports, be sure to include all the information as described in 552 For bug reports, be sure to include all the information as described in
553 the section 'It hangs/crashes/fails/whatevers! Help!'. Please make sure 553 the section 'It hangs/crashes/fails/whatevers! Help!'. Please make sure
554 you're using the latest version (http://mjpeg.sf.net/driver-zoran/). 554 you're using the latest version (http://mjpeg.sf.net/driver-zoran/).
555 555
556 Previous maintainers/developers of this driver include Serguei Miridonov 556 Previous maintainers/developers of this driver include Serguei Miridonov
557 <mirsev@cicese.mx>, Wolfgang Scherr <scherr@net4you.net>, Dave Perks 557 <mirsev@cicese.mx>, Wolfgang Scherr <scherr@net4you.net>, Dave Perks
558 <dperks@ibm.net> and Rainer Johanni <Rainer@Johanni.de>. 558 <dperks@ibm.net> and Rainer Johanni <Rainer@Johanni.de>.
559 559
560 =========================== 560 ===========================
561 561
562 9. License 562 9. License
563 563
564 This driver is distributed under the terms of the General Public License. 564 This driver is distributed under the terms of the General Public License.
565 565
566 This program is free software; you can redistribute it and/or modify 566 This program is free software; you can redistribute it and/or modify
567 it under the terms of the GNU General Public License as published by 567 it under the terms of the GNU General Public License as published by
568 the Free Software Foundation; either version 2 of the License, or 568 the Free Software Foundation; either version 2 of the License, or
569 (at your option) any later version. 569 (at your option) any later version.
570 570
571 This program is distributed in the hope that it will be useful, 571 This program is distributed in the hope that it will be useful,
572 but WITHOUT ANY WARRANTY; without even the implied warranty of 572 but WITHOUT ANY WARRANTY; without even the implied warranty of
573 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 573 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
574 GNU General Public License for more details. 574 GNU General Public License for more details.
575 575
576 You should have received a copy of the GNU General Public License 576 You should have received a copy of the GNU General Public License
577 along with this program; if not, write to the Free Software 577 along with this program; if not, write to the Free Software
578 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 578 Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
579 579
580 See http://www.gnu.org/ for more information. 580 See http://www.gnu.org/ for more information.
581 581
Documentation/vm/numa
1 Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> 1 Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
2 2
3 The intent of this file is to have an uptodate, running commentary 3 The intent of this file is to have an uptodate, running commentary
4 from different people about NUMA specific code in the Linux vm. 4 from different people about NUMA specific code in the Linux vm.
5 5
6 What is NUMA? It is an architecture where the memory access times 6 What is NUMA? It is an architecture where the memory access times
7 for different regions of memory from a given processor varies 7 for different regions of memory from a given processor varies
8 according to the "distance" of the memory region from the processor. 8 according to the "distance" of the memory region from the processor.
9 Each region of memory to which access times are the same from any 9 Each region of memory to which access times are the same from any
10 cpu, is called a node. On such architectures, it is beneficial if 10 cpu, is called a node. On such architectures, it is beneficial if
11 the kernel tries to minimize inter node communications. Schemes 11 the kernel tries to minimize inter node communications. Schemes
12 for this range from kernel text and read-only data replication 12 for this range from kernel text and read-only data replication
13 across nodes, and trying to house all the data structures that 13 across nodes, and trying to house all the data structures that
14 key components of the kernel need on memory on that node. 14 key components of the kernel need on memory on that node.
15 15
16 Currently, all the numa support is to provide efficient handling 16 Currently, all the numa support is to provide efficient handling
17 of widely discontiguous physical memory, so architectures which 17 of widely discontiguous physical memory, so architectures which
18 are not NUMA but can have huge holes in the physical address space 18 are not NUMA but can have huge holes in the physical address space
19 can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM. 19 can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
20 20
21 The initial port includes NUMAizing the bootmem allocator code by 21 The initial port includes NUMAizing the bootmem allocator code by
22 encapsulating all the pieces of information into a bootmem_data_t 22 encapsulating all the pieces of information into a bootmem_data_t
23 structure. Node specific calls have been added to the allocator. 23 structure. Node specific calls have been added to the allocator.
24 In theory, any platform which uses the bootmem allocator should 24 In theory, any platform which uses the bootmem allocator should
25 be able to to put the bootmem and mem_map data structures anywhere 25 be able to put the bootmem and mem_map data structures anywhere
26 it deems best. 26 it deems best.
27 27
28 Each node's page allocation data structures have also been encapsulated 28 Each node's page allocation data structures have also been encapsulated
29 into a pg_data_t. The bootmem_data_t is just one part of this. To 29 into a pg_data_t. The bootmem_data_t is just one part of this. To
30 make the code look uniform between NUMA and regular UMA platforms, 30 make the code look uniform between NUMA and regular UMA platforms,
31 UMA platforms have a statically allocated pg_data_t too (contig_page_data). 31 UMA platforms have a statically allocated pg_data_t too (contig_page_data).
32 For the sake of uniformity, the function num_online_nodes() is also defined 32 For the sake of uniformity, the function num_online_nodes() is also defined
33 for all platforms. As we run benchmarks, we might decide to NUMAize 33 for all platforms. As we run benchmarks, we might decide to NUMAize
34 more variables like low_on_memory, nr_free_pages etc into the pg_data_t. 34 more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
35 35
36 The NUMA aware page allocation code currently tries to allocate pages 36 The NUMA aware page allocation code currently tries to allocate pages
37 from different nodes in a round robin manner. This will be changed to 37 from different nodes in a round robin manner. This will be changed to
38 do concentratic circle search, starting from current node, once the 38 do concentratic circle search, starting from current node, once the
39 NUMA port achieves more maturity. The call alloc_pages_node has been 39 NUMA port achieves more maturity. The call alloc_pages_node has been
40 added, so that drivers can make the call and not worry about whether 40 added, so that drivers can make the call and not worry about whether
41 it is running on a NUMA or UMA platform. 41 it is running on a NUMA or UMA platform.
42 42