  Wei Yang <weiyang@linux.vnet.ibm.com>
  Benjamin Herrenschmidt <benh@au1.ibm.com>
  Bjorn Helgaas <bhelgaas@google.com>
  26 Aug 2014
  
  This document describes the hardware requirements for PCI MMIO resource
  sizing and assignment on PowerKVM and how the generic PCI code handles
  these requirements. The first two sections describe the concept of
  Partitionable Endpoints and its implementation on P8 (IODA2). The last
  two sections discuss the considerations for enabling SR-IOV on IODA2.
  
  1. Introduction to Partitionable Endpoints
  
  A Partitionable Endpoint (PE) is a way to group the various resources
  associated with a device or a set of devices to provide isolation between
  partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
  to freeze a device that is causing errors in order to limit the possibility
  of propagation of bad data.
  
  There is thus, in HW, a table of PE states that contains a pair of "frozen"
  state bits (one for MMIO and one for DMA, they get set together but can be
  cleared independently) for each PE.
  
  When a PE is frozen, all stores in any direction are dropped and all loads
  return all 1's. MSIs are also blocked. There's a bit more state that
  captures things like the details of the error that caused the freeze, etc.,
  but that's not critical.
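
  As a minimal sketch only (the structure and function names below are made
  up for illustration, not the actual hardware or kernel data structures),
  the per-PE freeze state can be pictured like this:

      #include <stdbool.h>

      #define NUM_PES 256     /* PEs per PHB on IODA2, see section 2 */

      /*
       * Hypothetical per-PE state: the two freeze bits are set together
       * on error but can be cleared independently.
       */
      struct pe_state {
              bool mmio_frozen;
              bool dma_frozen;
      };

      static struct pe_state pe_table[NUM_PES];

      static void freeze_pe(unsigned int pe)
      {
              pe_table[pe].mmio_frozen = true;
              pe_table[pe].dma_frozen  = true;
      }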
  
  The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
  are matched to their corresponding PEs.
  
  The following section provides a rough description of what we have on P8
  (IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each PHB
  is a completely separate HW entity that replicates the entire logic, so has
  its own set of PEs, etc.
  
  2. Implementation of Partitionable Endpoints on P8 (IODA2)
  
  P8 supports up to 256 Partitionable Endpoints per PHB.
  
    * Inbound
  
      For DMA, MSIs and inbound PCIe error messages, we have a table (in
      memory but accessed in HW by the chip) that provides a direct
      correspondence between a PCIe RID (bus/dev/fn) and a PE number.
      We call this the RTT.
  
      - For DMA we then provide an entire address space for each PE that can
        contain two "windows", depending on the value of PCI address bit 59.
        Each window can be configured to be remapped via a "TCE table" (IOMMU
        translation table), which has various configurable characteristics
        not described here.
  
      - For MSIs, we have two windows in the address space (one at the top of
        the 32-bit space and one much higher) which, via a combination of the
        address and MSI value, will result in one of the 2048 interrupts per
        bridge being triggered.  There's a PE# in the interrupt controller
        descriptor table as well which is compared with the PE# obtained from
        the RTT to "authorize" the device to emit that specific interrupt.
  
      - Error messages just use the RTT.
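
      Purely as an illustration (the array and function below are
      hypothetical, not the real firmware or kernel interface), the RTT can
      be thought of as a flat array indexed by the 16-bit RID:

          #include <stdint.h>

          #define RTT_ENTRIES 65536        /* one entry per possible RID */

          static uint8_t rtt[RTT_ENTRIES]; /* hypothetical RID -> PE# map */

          static uint8_t rid_to_pe(uint8_t bus, uint8_t dev, uint8_t fn)
          {
                  uint16_t rid = (bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7);
                  return rtt[rid];
          }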
  
    * Outbound.  That's where the tricky part is.
  
      Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
      from the CPU address space to the PCI address space.  There is one M32
      window and sixteen M64 windows.  They have different characteristics.
      First what they have in common: they forward a configurable portion of
      the CPU address space to the PCIe bus and must be a naturally aligned
      power of two in size.  The rest is different:
  
      - The M32 window:
  
        * Is limited to 4GB in size.
  
        * Drops the top bits of the address (above the size) and replaces
  	them with a configurable value.  This is typically used to generate
  	32-bit PCIe accesses.  We configure that window at boot from FW and
  	don't touch it from Linux; it's usually set to forward a 2GB
  	portion of address space from the CPU to PCIe
  	0x8000_0000..0xffff_ffff.  (Note: The top 64KB are actually
  	reserved for MSIs but this is not a problem at this point; we just
  	need to ensure Linux doesn't assign anything there, the M32 logic
  	ignores that however and will forward in that space if we try).
  
        * It is divided into 256 segments of equal size.  A table in the chip
  	maps each segment to a PE#.  That allows portions of the MMIO space
  	to be assigned to PEs on a segment granularity.  For a 2GB window,
  	the segment granularity is 2GB/256 = 8MB.
  
      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV).  We basically use the trick of forcing the bridge MMIO windows
      onto a segment alignment/granularity so that the space behind a bridge
      can be assigned to a PE.
  
      Ideally we would like to be able to have individual functions in PEs
      but that would mean using a completely different address allocation
      scheme where individual function BARs can be "grouped" to fit in one or
      more segments.
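
      To make the M32 segment mapping above concrete, here is a minimal
      sketch (hypothetical names; the real lookup is done by the chip and
      programmed by platform code):

          #include <stdint.h>

          #define M32_SEGMENTS    256
          #define M32_WINDOW_SIZE (2ULL << 30)  /* 2GB example window */
          #define M32_SEG_SIZE    (M32_WINDOW_SIZE / M32_SEGMENTS) /* 8MB */

          /* Programmable table in the chip: segment index -> PE# */
          static uint8_t m32_seg_to_pe[M32_SEGMENTS];

          static uint8_t m32_addr_to_pe(uint64_t cpu_addr, uint64_t win_base)
          {
                  uint64_t seg = (cpu_addr - win_base) / M32_SEG_SIZE;
                  return m32_seg_to_pe[seg];
          }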
  
      - The M64 windows:
  
        * Must be at least 256MB in size.
  
        * Do not translate addresses (the address on PCIe is the same as the
  	address on the PowerBus).  There is a way to also set the top 14
  	bits which are not conveyed by PowerBus but we don't use this.
  
        * Can be configured to be segmented.  When not segmented, we can
  	specify the PE# for the entire window.  When segmented, a window
  	has 256 segments; however, there is no table for mapping a segment
  	to a PE#.  The segment number *is* the PE#.
  
        * Support overlaps.  If an address is covered by multiple windows,
  	there's a defined ordering for which window applies.
  
      We have code (fairly new compared to the M32 stuff) that exploits that
      for large BARs in 64-bit space:
  
      We configure an M64 window to cover the entire region of address space
      that has been assigned by FW for the PHB (about 64GB, ignore the space
      for the M32, it comes out of a different "reserve").  We configure it
      as segmented.
  
      Then we do the same thing as with M32, using the bridge alignment
      trick, to match to those giant segments.
  
      Since we cannot remap, we have two additional constraints:
  
      - We do the PE# allocation *after* the 64-bit space has been assigned
        because the addresses we use directly determine the PE#.  We then
        update the M32 PE# for the devices that use both 32-bit and 64-bit
        spaces or assign the remaining PE# to 32-bit only devices.
  
      - We cannot "group" segments in HW, so if a device ends up using more
        than one segment, we end up with more than one PE#.  There is a HW
        mechanism to make the freeze state cascade to "companion" PEs but
        that only works for PCIe error messages (typically used so that if
        you freeze a switch, it freezes all its children).  So we do it in
        SW.  We lose a bit of effectiveness of EEH in that case, but that's
        the best we found.  So when any of the PEs freezes, we freeze the
        other ones for that "domain".  We thus introduce the concept of
        "master PE" which is the one used for DMA, MSIs, etc., and "secondary
        PEs" that are used for the remaining M64 segments.
  
      We would like to investigate using additional M64 windows in "single
      PE" mode to overlay over specific BARs to work around some of that, for
      example for devices with very large BARs, e.g., GPUs.  It would make
      sense, but we haven't done it yet.
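
      To make the segmented M64 case concrete, a minimal sketch (with
      hypothetical names): unlike M32 there is no lookup table, so the
      segment index computed from the address *is* the PE#:

          #include <stdint.h>

          #define M64_SEGMENTS 256

          static uint8_t m64_addr_to_pe(uint64_t pci_addr, uint64_t win_base,
                                        uint64_t win_size)
          {
                  uint64_t seg_size = win_size / M64_SEGMENTS;
                  return (pci_addr - win_base) / seg_size;
          }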
  
  3. Considerations for SR-IOV on PowerKVM
  
    * SR-IOV Background
  
      The PCIe SR-IOV feature allows a single Physical Function (PF) to
      support several Virtual Functions (VFs).  Registers in the PF's SR-IOV
      Capability control the number of VFs and whether they are enabled.
  
      When VFs are enabled, they appear in Configuration Space like normal
      PCI devices, but the BARs in VF config space headers are unusual.  For
      a non-VF device, software uses BARs in the config space header to
      discover the BAR sizes and assign addresses for them.  For VF devices,
      software uses VF BAR registers in the *PF* SR-IOV Capability to
      discover sizes and assign addresses.  The BARs in the VF's config space
      header are read-only zeros.
  
      When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
      base address for all the corresponding VF(n) BARs.  For example, if the
      PF SR-IOV Capability is programmed to enable eight VFs, and it has a
      1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
      This region is divided into eight contiguous 1MB regions, each of which
      is a BAR0 for one of the VFs.  Note that even though the VF BAR
      describes an 8MB region, the alignment requirement is for a single VF,
      i.e., 1MB in this example.
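
      As a minimal sketch of the arithmetic above (the function name is
      hypothetical), VF n's BAR is simply the n-th slice of the region whose
      base is the VF BAR value in the PF's SR-IOV Capability:

          #include <stdint.h>

          /*
           * Example: 8 VFs enabled, 1MB VF BAR0 -> an 8MB region whose
           * n-th 1MB slice is VF n's BAR0.
           */
          static uint64_t vf_bar_addr(uint64_t vf_bar_base,
                                      uint64_t vf_bar_size, unsigned int n)
          {
                  return vf_bar_base + (uint64_t)n * vf_bar_size;
          }

          /* e.g. vf_bar_addr(base, 1 << 20, 3) is VF3's BAR0 address. */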
  
    There are several strategies for isolating VFs in PEs:
  
    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments.  The finest granularity possible is a 256MB
      window with 1MB segments.  VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window.  Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size.  If
      they are different sizes, the entire window has to be small enough that
      the segment size matches the smallest VF BAR, which means larger VF
      BARs span several segments.
  
    - Non-segmented M64 window: A non-segmented M64 window is mapped entirely
      to a single PE, so it could only isolate one VF.
  
    - Single segmented M64 windows: A segmented M64 window could be used just
      like the M32 window, but the segments can't be individually mapped to
      PEs (the segment number is the PE#), so there isn't as much
      flexibility.  A VF with multiple BARs would have to be in a "domain" of
      multiple PEs, which is not as well isolated as a single PE.
  
    - Multiple segmented M64 windows: As usual, each window is split into 256
      equally-sized segments, and the segment number is the PE#.  But if we
      use several M64 windows, they can be set to different base addresses
      and different segment sizes.  If we have VFs that each have a 1MB BAR
      and a 32MB BAR, we could use one M64 window to assign 1MB segments and
      another M64 window to assign 32MB segments.
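
    As a minimal sketch of this last strategy (the window bases and sizes are
    just example numbers), VF n's 1MB BAR lands in segment n of one window
    and its 32MB BAR lands in segment n of the other, so both map to the
    same PE:

        #include <assert.h>
        #include <stdint.h>

        /* In a segmented M64 window the segment index is the PE# */
        static uint64_t seg_of(uint64_t addr, uint64_t base, uint64_t seg_sz)
        {
                return (addr - base) / seg_sz;
        }

        int main(void)
        {
                uint64_t winA = 0x1000000000ULL;           /* 1MB segments  */
                uint64_t winB = 0x2000000000ULL;           /* 32MB segments */
                unsigned int n = 5;                        /* some VF       */

                uint64_t bar0 = winA + n * (1ULL << 20);   /* VF n, 1MB BAR  */
                uint64_t bar1 = winB + n * (32ULL << 20);  /* VF n, 32MB BAR */

                assert(seg_of(bar0, winA, 1ULL << 20)  == n);  /* PE# == n */
                assert(seg_of(bar1, winB, 32ULL << 20) == n);  /* PE# == n */
                return 0;
        }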
  
    Finally, the plan is to use M64 windows for SR-IOV, which will be
    described more in the next two sections.  For a given VF BAR, we need to
    effectively reserve the entire 256 segments (256 * VF BAR size) and
    position the VF BAR to start at the beginning of a free range of
    segments/PEs inside that M64 window.
  
    The goal is of course to be able to give a separate PE for each VF.
  
    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s.  Each M64 window defines one MMIO range and this range is
    divided into 256 segments, with each segment corresponding to one PE.
  
    We decided to leverage these M64 windows to map VFs to individual PEs,
    since SR-IOV VF BARs are all the same size.
  
    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR directly
    to one M64 window, some part of the M64 window will map to another
    device's MMIO range.
  
    IODA supports 256 PEs, so segmented windows contain 256 segments.  If
    total_VFs is less than 256, we have the situation shown in Figure 1.0,
    where segments [total_VFs, 255] of the M64 window may map to some MMIO
    range on other devices:
  
       0      1                     total_VFs - 1
       +------+------+-     -+------+------+
       |      |      |  ...  |      |      |
       +------+------+-     -+------+------+
  
                             VF(n) BAR space
  
       0      1                     total_VFs - 1                255
       +------+------+-     -+------+------+-      -+------+------+
       |      |      |  ...  |      |      |   ...  |      |      |
       +------+------+-     -+------+------+-      -+------+------+
  
                             M64 window
  
  		Figure 1.0 Direct map VF(n) BAR space
  
    Our current solution is to allocate 256 segments even if the VF(n) BAR
    space doesn't need that much, as shown in Figure 1.1:
  
       0      1                     total_VFs - 1                255
       +------+------+-     -+------+------+-      -+------+------+
       |      |      |  ...  |      |      |   ...  |      |      |
       +------+------+-     -+------+------+-      -+------+------+
  
                             VF(n) BAR space + extra
  
       0      1                     total_VFs - 1                255
       +------+------+-     -+------+------+-      -+------+------+
       |      |      |  ...  |      |      |   ...  |      |      |
       +------+------+-     -+------+------+-      -+------+------+
  
  			   M64 window
  
  		Figure 1.1 Map VF(n) BAR space + extra
  
    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices.  Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they only
    respond to segments [0, total_VFs - 1].  There's nothing in hardware that
    responds to segments [total_VFs, 255].
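
    In terms of sizes, a minimal sketch (the numbers below are just an
    example): the space reserved for one VF BAR becomes the full 256-segment
    window rather than just what the enabled VFs need.

        #include <stdint.h>
        #include <stdio.h>

        #define M64_SEGMENTS 256

        int main(void)
        {
                uint64_t vf_bar_size = 1ULL << 20;  /* 1MB VF BAR, example */
                unsigned int total_vfs = 16;        /* example total_VFs   */

                uint64_t needed   = total_vfs * vf_bar_size;    /* 16MB  */
                uint64_t reserved = M64_SEGMENTS * vf_bar_size; /* 256MB */

                printf("VF(n) BAR space %llu MB, reserved %llu MB\n",
                       (unsigned long long)(needed >> 20),
                       (unsigned long long)(reserved >> 20));
                return 0;
        }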
  
  4. Implications for the Generic PCI Code
  
  The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
  aligned to the size of an individual VF BAR.
  
  In IODA2, the MMIO address determines the PE#.  If the address is in an M32
  window, we can set the PE# by updating the table that translates segments
  to PE#s.  Similarly, if the address is in an unsegmented M64 window, we can
  set the PE# for the window.  But if it's in a segmented M64 window, the
  segment number is the PE#.
  
  Therefore, the only way to control the PE# for a VF is to change the base
  of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the exact
  amount of space required for the VF(n) BAR space, the VF BAR value is fixed
  and cannot be changed.
  
  On the other hand, if the PCI core allocates additional space, the VF BAR
  value can be changed as long as the entire VF(n) BAR space remains inside
  the space allocated by the core.
  
  Ideally the segment size will be the same as an individual VF BAR size.
  Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
  are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
  allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.
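
  As a minimal sketch (hypothetical names), shifting the start of the VF(n)
  BAR space by whole segments inside the reserved window is what selects the
  PE numbers:

      #include <stdint.h>

      /*
       * Hypothetical: segment size == VF BAR size.  Placing the VF(n) BAR
       * space at segment 'offset' of the reserved window puts VF0 in
       * PE(offset) and VF n in PE(offset + n); 'offset' must be small
       * enough that offset + numVFs - 1 still fits in the 256 segments.
       */
      static uint64_t place_vf_bar(uint64_t win_base, uint64_t seg_size,
                                   unsigned int offset)
      {
              return win_base + (uint64_t)offset * seg_size;  /* VF BAR value */
      }

      static unsigned int pe_of_vf(unsigned int offset, unsigned int n)
      {
              return offset + n;
      }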
  
  If the segment size is smaller than the VF BAR size, it will take several
  segments to cover a VF BAR, and a VF will be in several PEs.  This is
  possible, but the isolation isn't as good, and it reduces the number of PE#
  choices because instead of consuming only numVFs segments, the VF(n) BAR
  space will consume (numVFs * n) segments, where n is the number of segments
  needed to cover a single VF BAR.  That means there aren't as many available
  segments for adjusting the base of the VF(n) BAR space.