  .. SPDX-License-Identifier: GPL-2.0
  
  ===========================
  Hypercall Op-codes (hcalls)
  ===========================
  
  Overview
  =========
  
  Virtualization on 64-bit Power Book3S Platforms is based on the PAPR
  specification [1]_ which describes the run-time environment for a guest
  operating system and how it should interact with the hypervisor for
  privileged operations. Currently there are two PAPR compliant hypervisors:
  
  - **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
    IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
    supports the full PAPR specification.
  
  - **Qemu/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host.
    It implements only a subset of the PAPR specification, called LoPAPR [2]_.
  
  On the PPC64 arch, a guest kernel running on top of a PAPR hypervisor is called
  a *pSeries guest*. A pseries guest runs in supervisor mode (HV=0) and must
  issue hypercalls to the hypervisor whenever it needs to perform an action
  that is hypervisor privileged [3]_ or for other services managed by the
  hypervisor.
  
  Hence a hypercall (hcall) is essentially a request by the pseries guest
  asking the hypervisor to perform a privileged operation on behalf of the guest.
  The guest issues a hcall with the necessary input operands. After performing
  the privileged operation, the hypervisor returns a status code and output
  operands back to the guest.
  
  HCALL ABI
  =========
  The ABI specification for a hcall between a pseries guest and a PAPR hypervisor
  is covered in section 14.5.3 of ref [2]_. The switch to the hypervisor context
  is done via the instruction **HVSC**, which expects the opcode for the hcall to
  be set in *r3* and any in-arguments for the hcall to be provided in registers
  *r4-r12*. If values have to be passed through a memory buffer, the data stored
  in that buffer should be in big-endian byte order.
  
  Once control returns to the guest after the hypervisor has serviced the
  'HVSC' instruction, the return value of the hcall is available in *r3* and any
  out values are returned in registers *r4-r12*. As with in-arguments, any out
  values stored in a memory buffer will be in big-endian byte order.
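
The byte-order requirement for memory buffers can be illustrated with a small
user-space sketch. The helper names ``demo_put_be64`` and ``demo_get_be64`` are
hypothetical and exist only for this illustration; kernel code would normally
rely on the existing ``cpu_to_be64()`` and ``be64_to_cpu()`` helpers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store a 64-bit value into buf in big-endian byte
 * order, independent of the endianness of the CPU running this code.
 */
static void demo_put_be64(uint8_t *buf, uint64_t val)
{
	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		buf[i] = (uint8_t)(val >> (56 - 8 * i));
}

/*
 * Hypothetical helper: read back a 64-bit big-endian value, for example
 * from a buffer that the hypervisor filled in.
 */
static uint64_t demo_get_be64(const uint8_t *buf)
{
	uint64_t val = 0;

	assert(buf != NULL);
	for (size_t i = 0; i < 8; i++)
		val = (val << 8) | buf[i];
	return val;
}
```

Shifting and masking, rather than casting the buffer pointer, keeps the sketch
correct on both little-endian and big-endian guests.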
  
  Powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**, defined
  in an arch-specific header [4]_, to issue hcalls from the Linux kernel
  running as a pseries guest.
  
  Register Conventions
  ====================
  
  Any hcall should follow the same register convention as described in section
  2.2.1.1 of "64-Bit ELF V2 ABI Specification: Power Architecture" [5]_. The table
  below summarizes these conventions:
  
  +----------+----------+-------------------------------------------+
  | Register |Volatile  |  Purpose                                  |
  | Range    |(Y/N)     |                                           |
  +==========+==========+===========================================+
  |   r0     |    Y     |  Optional-usage                           |
  +----------+----------+-------------------------------------------+
  |   r1     |    N     |  Stack Pointer                            |
  +----------+----------+-------------------------------------------+
  |   r2     |    N     |  TOC                                      |
  +----------+----------+-------------------------------------------+
  |   r3     |    Y     |  hcall opcode/return value                |
  +----------+----------+-------------------------------------------+
  |  r4-r10  |    Y     |  in and out values                        |
  +----------+----------+-------------------------------------------+
  |   r11    |    Y     |  Optional-usage/Environmental pointer     |
  +----------+----------+-------------------------------------------+
  |   r12    |    Y     |  Optional-usage/Function entry address at |
  |          |          |  global entry point                       |
  +----------+----------+-------------------------------------------+
  |   r13    |    N     |  Thread-Pointer                           |
  +----------+----------+-------------------------------------------+
  |  r14-r31 |    N     |  Local Variables                          |
  +----------+----------+-------------------------------------------+
  |    LR    |    Y     |  Link Register                            |
  +----------+----------+-------------------------------------------+
  |   CTR    |    Y     |  Loop Counter                             |
  +----------+----------+-------------------------------------------+
  |   XER    |    Y     |  Fixed-point exception register.          |
  +----------+----------+-------------------------------------------+
  |  CR0-1   |    Y     |  Condition register fields.               |
  +----------+----------+-------------------------------------------+
  |  CR2-4   |    N     |  Condition register fields.               |
  +----------+----------+-------------------------------------------+
  |  CR5-7   |    Y     |  Condition register fields.               |
  +----------+----------+-------------------------------------------+
  |  Others  |    N     |                                           |
  +----------+----------+-------------------------------------------+
  
  DRC & DRC Indexes
  =================
  ::
  
       DR1                                  Guest
       +--+        +------------+         +---------+
       |  | <----> |            |         |  User   |
       +--+  DRC1  |            |   DRC   |  Space  |
                   |    PAPR    |  Index  +---------+
       DR2         | Hypervisor |         |         |
       +--+        |            | <-----> |  Kernel |
       |  | <----> |            |  Hcall  |         |
       +--+  DRC2  +------------+         +---------+
  
  The PAPR hypervisor terms shared hardware resources like PCI devices, NVDIMMs etc.
  available for use by LPARs as Dynamic Resources (DR). When a DR is allocated to
  an LPAR, PHYP creates a data structure called a Dynamic Resource Connector (DRC)
  to manage LPAR access. An LPAR refers to a DRC via an opaque 32-bit number
  called the DRC-Index. The DRC-Index value is provided to the LPAR via the device
  tree, where it is present as an attribute in the device tree node associated
  with the DR.
  
  HCALL Return-values
  ===================
  
  After servicing the hcall, the hypervisor sets the return value in *r3*,
  indicating success or failure of the hcall. In case of a failure, an error code
  indicates the cause of the error. These codes are defined and documented in the
  arch-specific header [4]_.
  
  In some cases a hcall can take a long time and needs to be issued
  multiple times in order to be completely serviced. These hcalls usually
  accept an opaque value *continue-token* within their argument list, and a
  return value of *H_CONTINUE* indicates that the hypervisor has not yet finished
  servicing the hcall.
  
  To make such hcalls, the guest needs to set *continue-token == 0* for the
  initial call and use the hypervisor-returned value of *continue-token*
  for each subsequent hcall until the hypervisor returns a non *H_CONTINUE*
  return value.
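
The continue-token protocol described above can be sketched as a simple retry
loop. This is a user-space illustration only: ``demo_long_hcall`` is a
hypothetical stand-in for a long-running hcall, ``struct demo_hyp`` models the
hypervisor side, and the ``DEMO_H_*`` constants are placeholders for the real
return codes defined in the arch-specific header [4]_.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder status codes; the real ones live in asm/hvcall.h. */
#define DEMO_H_SUCCESS	0L
#define DEMO_H_CONTINUE	18L

/* Hypothetical model of the hypervisor state for one long-running hcall. */
struct demo_hyp {
	unsigned int steps_left;	/* work units still to be done */
	uint64_t expected_token;	/* token the next call must present */
};

/*
 * Hypothetical long-running hcall: performs one unit of work per call and
 * hands back an opaque token while more work remains.
 */
static long demo_long_hcall(struct demo_hyp *hyp, uint64_t token_in,
			    uint64_t *token_out)
{
	/* The guest must echo back exactly the token it was given. */
	assert(token_in == hyp->expected_token);

	if (hyp->steps_left == 0)
		return DEMO_H_SUCCESS;

	hyp->steps_left--;
	hyp->expected_token = hyp->expected_token * 31 + 7;
	*token_out = hyp->expected_token;
	return DEMO_H_CONTINUE;
}

/* Guest side: start with token 0 and re-issue until not H_CONTINUE. */
static long demo_issue_until_done(struct demo_hyp *hyp, unsigned int *calls)
{
	uint64_t token = 0;
	long rc;

	*calls = 0;
	do {
		rc = demo_long_hcall(hyp, token, &token);
		(*calls)++;
	} while (rc == DEMO_H_CONTINUE);

	return rc;
}
```

The guest never interprets the token; it only passes back whatever the previous
call returned, which is why the loop treats it as an opaque 64-bit value.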
  
  HCALL Op-codes
  ==============
  
  Below is a partial list of HCALLs that are supported by PHYP. For the
  corresponding opcode values please look into the arch specific header [4]_:
  
  **H_SCM_READ_METADATA**
  
  | Input: *drcIndex, offset, buffer-address, numBytesToRead*
  | Out: *numBytesRead*
  | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*
  
  Given a DRC Index of an NVDIMM, read N bytes from the metadata area
  associated with it, at a specified offset, and copy them to the provided buffer.
  The metadata area stores configuration information such as label information,
  bad blocks etc. The metadata area is located out-of-band of the NVDIMM storage
  area, hence separate access semantics are provided.
  
  **H_SCM_WRITE_METADATA**
  
  | Input: *drcIndex, offset, data, numBytesToWrite*
  | Out: *None*
  | Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*
  
  Given a DRC Index of an NVDIMM, write N bytes from the provided buffer to the
  metadata area associated with it, at the specified offset.
  
  **H_SCM_BIND_MEM**
  
  | Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
  | *targetLogicalMemoryAddress, continue-token*
  | Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
  | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
  | *H_Too_Big, H_P5, H_Busy*
  
  Given a DRC-Index of an NVDIMM, map a contiguous range of SCM blocks
  *(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
  at *targetLogicalMemoryAddress* within the guest physical address space. If
  *targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the hypervisor
  assigns a target address to the guest. The HCALL can fail if the guest has
  an active PTE entry to the SCM block being bound.
  
  **H_SCM_UNBIND_MEM**
  
  | Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
  | Out: *numScmBlocksUnbound*
  | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
  | *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*
  
  Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks starting
  at *startingScmLogicalMemoryAddress* from the guest physical address space. The
  HCALL can fail if the guest has an active PTE entry to the SCM block being
  unbound.
  
  **H_SCM_QUERY_BLOCK_MEM_BINDING**
  
  | Input: *drcIndex, scmBlockIndex*
  | Out: *Guest-Physical-Address*
  | Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*
  
  Given a DRC-Index and an SCM block index, return the guest physical address to
  which the SCM block is mapped.
  
  **H_SCM_QUERY_LOGICAL_MEM_BINDING**
  
  | Input: *Guest-Physical-Address*
  | Out: *drcIndex, scmBlockIndex*
  | Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*
  
  Given a guest physical address, return the DRC Index and SCM block mapped
  to that address.
  
  **H_SCM_UNBIND_ALL**
  
  | Input: *scmTargetScope, drcIndex*
  | Out: *None*
  | Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
  | *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*
  
  Depending on the target scope, unmap from the LPAR memory either all SCM blocks
  belonging to all NVDIMMs or all SCM blocks belonging to a single NVDIMM
  identified by its drcIndex.
  
  **H_SCM_HEALTH**
  
  | Input: *drcIndex*
  | Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
  | Return Value: *H_Success, H_Parameter, H_Hardware*
  
  Given a DRC Index return the info on predictive failure and overall health of
  the PMEM device. The asserted bits in the health-bitmap indicate one or more states
  (described in the table below) of the PMEM device, and health-bit-valid-bitmap
  indicates which bits in health-bitmap are valid. The bits are reported in
  reverse bit ordering; for example, a value of 0xC400000000000000
  indicates that bits 0, 1, and 5 are valid.
  
  Health Bitmap Flags:
  
  +------+-----------------------------------------------------------------------+
  |  Bit |               Definition                                              |
  +======+=======================================================================+
  |  00  | PMEM device is unable to persist memory contents.                     |
  |      | If the system is powered down, nothing will be saved.                 |
  +------+-----------------------------------------------------------------------+
  |  01  | PMEM device failed to persist memory contents. Either contents were   |
  |      | not saved successfully on power down or were not restored properly on |
  |      | power up.                                                             |
  +------+-----------------------------------------------------------------------+
  |  02  | PMEM device contents are persisted from previous IPL. The data from   |
  |      | the last boot were successfully restored.                             |
  +------+-----------------------------------------------------------------------+
  |  03  | PMEM device contents are not persisted from previous IPL. There was no|
  |      | data to restore from the last boot.                                   |
  +------+-----------------------------------------------------------------------+
  |  04  | PMEM device memory life remaining is critically low                   |
  +------+-----------------------------------------------------------------------+
  |  05  | PMEM device will be garded off next IPL due to failure                |
  +------+-----------------------------------------------------------------------+
  |  06  | PMEM device contents cannot persist due to current platform health    |
  |      | status. A hardware failure may prevent data from being saved or       |
  |      | restored.                                                             |
  +------+-----------------------------------------------------------------------+
  |  07  | PMEM device is unable to persist memory contents in certain conditions|
  +------+-----------------------------------------------------------------------+
  |  08  | PMEM device is encrypted                                              |
  +------+-----------------------------------------------------------------------+
  |  09  | PMEM device has successfully completed a requested erase or secure    |
  |      | erase procedure.                                                      |
  +------+-----------------------------------------------------------------------+
  |10:63 | Reserved / Unused                                                     |
  +------+-----------------------------------------------------------------------+
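
The reverse bit ordering used by the two bitmaps can be made concrete with a
short sketch. The helper names ``demo_health_bit`` and ``demo_health_flag_set``
are hypothetical and chosen only for this illustration; kernel code would
typically use the existing ``PPC_BIT()`` macro for the same mask computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: mask for bit n in reverse (MSB-0) numbering,
 * where bit 0 is the most significant bit of the 64-bit value.
 */
static uint64_t demo_health_bit(unsigned int n)
{
	assert(n < 64);
	return UINT64_C(1) << (63 - n);
}

/*
 * Hypothetical helper: a health flag is treated as asserted only when it
 * is set in the health bitmap and also marked valid in the valid bitmap.
 */
static bool demo_health_flag_set(uint64_t health, uint64_t valid,
				 unsigned int n)
{
	uint64_t mask = demo_health_bit(n);

	return (valid & mask) && (health & mask);
}
```

With these helpers, combining the masks for bits 0, 1 and 5 gives
0xC400000000000000, matching the example value in the text above.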
  
  **H_SCM_PERFORMANCE_STATS**
  
  | Input: *drcIndex, resultBuffer Addr*
  | Out: *None*
  | Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*
  
  Given a DRC Index, collect the performance statistics for the NVDIMM and copy
  them to the resultBuffer.
  
  References
  ==========
  .. [1] "Power Architecture Platform Reference"
         https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
  .. [2] "Linux on Power Architecture Platform Reference"
         https://members.openpowerfoundation.org/document/dl/469
  .. [3] "Definitions and Notation" Book III-Section 14.5.3
         https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
  .. [4] arch/powerpc/include/asm/hvcall.h
  .. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
         https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture