Commit 5776563648f6437ede91c91cbad85862ca682b0b
Committed by Thomas Gleixner
1 parent 1de4fa14ee
Exists in ti-lsk-linux-4.1.y and in 10 other branches
x86, mpx: Add documentation on Intel MPX
This patch adds the Documentation/x86/intel_mpx.txt file with some
information about Intel MPX.

Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151832.7FDB1720@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Showing 1 changed file with 234 additions and 0 deletions
Documentation/x86/intel_mpx.txt
1 | +1. Intel(R) MPX Overview | |
2 | +======================== | |
3 | + | |
4 | +Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability | |
5 | +introduced into Intel Architecture. Intel MPX provides hardware features | |
6 | +that can be used in conjunction with compiler changes to check memory | |
7 | +references whose intended compile-time behavior is subverted at runtime | |
8 | +by a buffer overflow or underflow. | |
9 | + | |
10 | +For more information, please refer to Intel(R) Architecture Instruction | |
11 | +Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection | |
12 | +Extensions. | |
13 | + | |
14 | +Note: Currently no hardware with MPX ISA is available but it is always | |
15 | +possible to use SDE (Intel(R) Software Development Emulator) instead, which | |
16 | +can be downloaded from | |
17 | +http://software.intel.com/en-us/articles/intel-software-development-emulator | |
18 | + | |
19 | + | |
20 | +2. How to take advantage of MPX | |
21 | +=============================== | |
22 | + | |
23 | +For MPX to work, changes are required in the kernel, binutils and compiler. | |
24 | +No source changes are required for applications, just a recompile. | |
25 | + | |
26 | +There are a lot of moving parts that all have to work right. The following | |
27 | +is how we expect the compiler, application and kernel to work together. | |
28 | + | |
29 | +1) Application developer compiles with -fmpx. The compiler will add the | |
30 | + instrumentation as well as some setup code called early after the app | |
31 | + starts. New instruction prefixes are noops for old CPUs. | |
32 | +2) That setup code allocates (virtual) space for the "bounds directory", | |
33 | + points the "bndcfgu" register to the directory and notifies the kernel | |
34 | + (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will be using | |
35 | + MPX. | |
36 | +3) The kernel detects that the CPU has MPX, allows the new prctl() to | |
37 | + succeed, and notes the location of the bounds directory. Userspace is | |
38 | + expected to keep the bounds directory at that location. We note it | |
39 | + instead of reading it each time because the 'xsave' operation needed | |
40 | + to access the bounds directory register is an expensive operation. | |
41 | +4) If the application needs to spill bounds out of the 4 registers, it | |
42 | + issues a bndstx instruction. Since the bounds directory is empty at | |
43 | + this point, a bounds fault (#BR) is raised, the kernel allocates a | |
44 | + bounds table (in the user address space) and makes the relevant entry | |
45 | + in the bounds directory point to the new table. | |
46 | +5) If the application violates the bounds specified in the bounds registers, | |
47 | + a separate kind of #BR is raised which will deliver a signal with | |
48 | + information about the violation in the 'struct siginfo'. | |
49 | +6) Whenever memory is freed, we know that it can no longer contain valid | |
50 | + pointers, and we attempt to free the associated space in the bounds | |
51 | + tables. If an entire table becomes unused, we will attempt to free | |
52 | + the table and remove the entry in the directory. | |
53 | + | |
54 | +To summarize, there are essentially three things interacting here: | |
55 | + | |
56 | +GCC with -fmpx: | |
57 | + * enables annotation of code with MPX instructions and prefixes | |
58 | + * inserts code early in the application to call in to the "gcc runtime" | |
59 | +GCC MPX Runtime: | |
60 | + * Checks for hardware MPX support in cpuid leaf | |
61 | + * allocates virtual space for the bounds directory (malloc() essentially) | |
62 | + * points the hardware BNDCFGU register at the directory | |
63 | + * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to | |
64 | + start managing the bounds directories | |
65 | +Kernel MPX Code: | |
66 | + * Checks for hardware MPX support in cpuid leaf | |
67 | + * Handles #BR exceptions and sends SIGSEGV to the app when it violates | |
68 | + bounds, like during a buffer overflow. | |
69 | + * When bounds are spilled into an unallocated bounds table, the kernel | |
70 | + notices in the #BR exception, allocates the virtual space, then | |
71 | + updates the bounds directory to point to the new table. It keeps | |
72 | + special track of the memory with a VM_MPX flag. | |
73 | + * Frees unused bounds tables at the time that the memory they described | |
74 | + is unmapped. | |
75 | + | |
76 | + | |
77 | +3. How does MPX kernel code work | |
78 | +================================ | |
79 | + | |
80 | +Handling #BR faults caused by MPX | |
81 | +--------------------------------- | |
82 | + | |
83 | +When MPX is enabled, there are 2 new situations that can generate | |
84 | +#BR faults. | |
85 | + * new bounds tables (BT) need to be allocated to save bounds. | |
86 | + * bounds violation caused by MPX instructions. | |
87 | + | |
88 | +We hook the #BR handler to handle these two new situations. | |
89 | + | |
90 | +On-demand kernel allocation of bounds tables | |
91 | +-------------------------------------------- | |
92 | + | |
93 | +MPX only has 4 hardware registers for storing bounds information. If | |
94 | +MPX-enabled code needs more than these 4 registers, it needs to spill | |
95 | +them somewhere. It has two special instructions for this which allow | |
96 | +the bounds to be moved between the bounds registers and some new "bounds | |
97 | +tables". | |
98 | + | |
99 | +MPX adds new causes of #BR exceptions. They are conceptually similar | |
100 | +to a page fault and will be raised by the MPX hardware either on a | |
101 | +bounds violation or when a bounds table is not present. The kernel | |
102 | +handles #BR exceptions for not-present tables by carving the space out | |
103 | +of the process's normal address space and then pointing the | |
104 | +bounds-directory entry at it. | |
105 | + | |
106 | +The tables need to be accessed and controlled by userspace because | |
107 | +the instructions for moving bounds in and out of them are extremely | |
108 | +frequent. They potentially happen every time a register points to | |
109 | +memory. Any direct kernel involvement (like a syscall) to access the | |
110 | +tables would obviously destroy performance. | |
111 | + | |
112 | +Why not do this in userspace? MPX does not strictly require anything in | |
113 | +the kernel. It can theoretically be done completely from userspace. Here | |
114 | +are a few ways this could be done. We don't think any of them are practical | |
115 | +in the real world, but here they are. | |
116 | + | |
117 | +Q: Can virtual space simply be reserved for the bounds tables so that we | |
118 | + never have to allocate them? | |
119 | +A: An MPX-enabled application may create a lot of bounds tables in its | |
120 | + address space to save bounds information. These tables can take | |
121 | + up huge swaths of memory (as much as 80% of the memory on the system) | |
122 | + even if we clean them up aggressively. In the worst-case scenario, the | |
123 | + tables can be 4x the size of the data structure being tracked. IOW, a | |
124 | + 1-page structure can require 4 bounds-table pages. An X-GB virtual | |
125 | + area needs 4*X GB of virtual space, plus 2GB for the bounds directory. | |
126 | + If we were to preallocate them for the 128TB of user virtual address | |
127 | + space, we would need to reserve 512TB+2GB, which is larger than the | |
128 | + entire virtual address space today. This means they cannot be reserved | |
129 | + ahead of time. Also, a single process's pre-populated bounds directory | |
130 | + consumes 2GB of virtual *AND* physical memory. IOW, it's completely | |
131 | + infeasible to prepopulate bounds directories. | |
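The arithmetic in the answer above can be checked with a small sketch. The 4x ratio and the 2GB directory size come from the text; the 8-byte pointer slot and 32-byte bounds-table entry are assumptions taken from Intel's description of the 64-bit layout, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed 64-bit MPX layout (not defined anywhere in this patch). */
#define MPX_PTR_BYTES       8ULL          /* one pointer-sized slot of tracked memory */
#define MPX_BT_ENTRY_BYTES  32ULL         /* one bounds-table entry per pointer slot */
#define MPX_BD_BYTES        (2ULL << 30)  /* 2GB bounds directory, as stated in the text */

/* Worst-case bounds-table bytes needed to cover `bytes` of pointer-filled memory. */
static uint64_t mpx_worst_case_bt_bytes(uint64_t bytes)
{
	return bytes / MPX_PTR_BYTES * MPX_BT_ENTRY_BYTES;
}

/* Virtual space that an up-front reservation would need: tables plus directory. */
static uint64_t mpx_prealloc_bytes(uint64_t bytes)
{
	return mpx_worst_case_bt_bytes(bytes) + MPX_BD_BYTES;
}
```

Calling mpx_prealloc_bytes(128ULL << 40) reproduces the 512TB+2GB figure quoted for a 128TB user address space, and mpx_worst_case_bt_bytes(4096) gives the four bounds-table pages needed for a one-page structure.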
132 | + | |
133 | +Q: Can we preallocate bounds table space at the same time memory is | |
134 | + allocated which might contain pointers that might eventually need | |
135 | + bounds tables? | |
136 | +A: This would work if we could hook the site of each and every memory | |
137 | + allocation syscall. This can be done for small, constrained applications. | |
138 | + But, it isn't practical at a larger scale since a given app has no | |
139 | + way of controlling how all the parts of the app might allocate memory | |
140 | + (think libraries). The kernel is really the only place to intercept | |
141 | + these calls. | |
142 | + | |
143 | +Q: Could a bounds fault be handed to userspace and the tables allocated | |
144 | + there in a signal handler instead of in the kernel? | |
145 | +A: mmap() is not on the list of async-signal-safe functions, and even | |
146 | + if mmap() did work it would still require locking or nasty tricks to | |
147 | + keep track of the allocation state there. | |
148 | + | |
149 | +Having ruled out all of the userspace-only approaches for managing | |
150 | +bounds tables that we could think of, we create them on demand in | |
151 | +the kernel. | |
152 | + | |
153 | +Decoding MPX instructions | |
154 | +------------------------- | |
155 | + | |
156 | +If a #BR is generated due to a bounds violation caused by MPX, we | |
157 | +need to decode the MPX instruction to get the violation address and | |
158 | +set this address in the extended struct siginfo. | |
159 | + | |
160 | +The _sigfault field of struct siginfo is extended as follows: | |
161 | + | |
162 | +    /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ | |
163 | +    struct { | |
164 | +        void __user *_addr; /* faulting insn/memory ref. */ | |
165 | +#ifdef __ARCH_SI_TRAPNO | |
166 | +        int _trapno; /* TRAP # which caused the signal */ | |
167 | +#endif | |
168 | +        short _addr_lsb; /* LSB of the reported address */ | |
169 | +        struct { | |
170 | +            void __user *_lower; | |
171 | +            void __user *_upper; | |
172 | +        } _addr_bnd; | |
173 | +    } _sigfault; | |
174 | + | |
175 | +The '_addr' field refers to the violation address, and the new '_addr_bnd' | |
176 | +field refers to the upper/lower bounds in effect when the #BR was raised. | |
177 | + | |
178 | +Glibc will also be updated to support this new siginfo, so users can | |
179 | +get the violation address and bounds when a bounds violation occurs. | |
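A userspace consumer of the extended siginfo might look like the sketch below. It assumes the C library exposes the new fields as si_lower and si_upper and the si_code value as SEGV_BNDERR; this patch does not define those userspace names, so the code is guarded and simply exits if they are missing. The helper names (fmt_hex, report, bnd_handler, install_bnd_handler) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Async-signal-safe: render v as "0x..." into buf and return the length. */
static size_t fmt_hex(char *buf, uintptr_t v)
{
	static const char digits[] = "0123456789abcdef";
	char tmp[2 * sizeof(v)];
	size_t n = 0, len = 0;

	do {
		tmp[n++] = digits[v & 0xf];
		v >>= 4;
	} while (v);
	buf[len++] = '0';
	buf[len++] = 'x';
	while (n)
		buf[len++] = tmp[--n];
	return len;
}

/* Write "<label>0x<value>\n" to stderr using only write(2), no stdio. */
static void report(const char *label, const void *p)
{
	char buf[2 + 2 * sizeof(uintptr_t) + 1];
	size_t len = fmt_hex(buf, (uintptr_t)p);

	buf[len++] = '\n';
	if (write(STDERR_FILENO, label, strlen(label)) < 0 ||
	    write(STDERR_FILENO, buf, len) < 0)
		return; /* nothing sensible to do about a failed write here */
}

/* SIGSEGV handler: print the violation address and bounds, then exit. */
static void bnd_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
#if defined(SEGV_BNDERR) && defined(si_lower) && defined(si_upper)
	/* Assumed libc names for the _addr_bnd fields described above. */
	if (si->si_code == SEGV_BNDERR) {
		report("violation address: ", si->si_addr);
		report("lower bound:       ", si->si_lower);
		report("upper bound:       ", si->si_upper);
	}
#else
	(void)si;
#endif
	_exit(1);
}

/* Install the handler with SA_SIGINFO so the kernel passes siginfo_t. */
static int install_bnd_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = bnd_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

The handler avoids printf() because it is not async-signal-safe, which is the same constraint the mmap() answer earlier in this section runs into.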
180 | + | |
181 | +Cleanup unused bounds tables | |
182 | +---------------------------- | |
183 | + | |
184 | +When a BNDSTX instruction attempts to save bounds to a bounds directory | |
185 | +entry marked as invalid, a #BR is generated. This is an indication that | |
186 | +no bounds table exists for this entry. In this case the fault handler | |
187 | +will allocate a new bounds table on demand. | |
188 | + | |
189 | +Since the kernel allocated those tables on-demand without userspace | |
190 | +knowledge, it is also responsible for freeing them when the associated | |
191 | +mappings go away. | |
192 | + | |
193 | +Here, the solution for this issue is to hook do_munmap() to check | |
194 | +whether the process is MPX enabled. If so, the bounds tables covering | |
195 | +the virtual address region being unmapped are freed as well. | |
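To see which bounds tables an unmapped region touches, one can map addresses to bounds-directory slots. The sketch below is not the kernel's code: it assumes the 64-bit layout from Intel's architecture description (address bits 47:20 select the directory entry, bits 19:3 select the table entry), and all names are hypothetical.

```c
#include <stdint.h>

/* Assumed 64-bit layout; these constants are not defined by this patch. */
#define MPX_BD_INDEX_SHIFT 20
#define MPX_BD_INDEX_BITS  28
#define MPX_BT_INDEX_SHIFT 3
#define MPX_BT_INDEX_BITS  17

/* Bounds-directory slot that covers a pointer stored at `addr`. */
static uint64_t mpx_bd_index(uint64_t addr)
{
	return (addr >> MPX_BD_INDEX_SHIFT) & ((1ULL << MPX_BD_INDEX_BITS) - 1);
}

/* Entry within that bounds table for a pointer stored at `addr`. */
static uint64_t mpx_bt_index(uint64_t addr)
{
	return (addr >> MPX_BT_INDEX_SHIFT) & ((1ULL << MPX_BT_INDEX_BITS) - 1);
}

/*
 * Directory slots whose tables describe memory in [start, end).
 * Stores the first and last slot and returns how many slots are touched,
 * or 0 for an empty range.
 */
static uint64_t mpx_bd_range(uint64_t start, uint64_t end,
			     uint64_t *first, uint64_t *last)
{
	if (end <= start)
		return 0;
	*first = mpx_bd_index(start);
	*last = mpx_bd_index(end - 1);
	return *last - *first + 1;
}
```

Under this layout each directory slot covers 1MB of address space, so a table at either edge of the range may still describe memory that remains mapped. A munmap() hook would presumably have to treat such partially covered tables differently from fully covered ones.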
196 | + | |
197 | +Adding new prctl commands | |
198 | +------------------------- | |
199 | + | |
200 | +Two new prctl commands are added to enable and disable MPX bounds tables | |
201 | +management in the kernel. | |
202 | + | |
203 | +#define PR_MPX_ENABLE_MANAGEMENT 43 | |
204 | +#define PR_MPX_DISABLE_MANAGEMENT 44 | |
205 | + | |
206 | +The runtime library in userspace is responsible for allocating the bounds | |
207 | +directory, so the kernel has to use the XSAVE instruction to get the base | |
208 | +of the bounds directory from the BNDCFGU register. | |
209 | + | |
210 | +But XSAVE is expected to be very expensive. As a performance optimization, | |
211 | +we get the base of the bounds directory once, while executing the | |
212 | +PR_MPX_ENABLE_MANAGEMENT command, and save it in struct mm_struct for | |
213 | +future use. | |
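From userspace, the two commands could be issued roughly as follows. This is a minimal sketch, not the GCC runtime's code: the wrapper names are hypothetical, the constants are the values listed above (defined locally in case the installed headers lack them), and the remaining prctl() arguments are passed as zero on the assumption that the kernel expects them unused.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values as listed in this document; installed headers may not define them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT  43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT 44
#endif

/*
 * Ask the kernel to start managing bounds tables for this process.
 * Expected to be called after BNDCFGU points at the bounds directory.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mpx_enable_management(void)
{
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Tell the kernel to stop managing bounds tables for this process. */
static int mpx_disable_management(void)
{
	if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

A caller should treat a negative return from mpx_enable_management() as "MPX management unavailable" and carry on without it, since the call is expected to fail on CPUs or kernels without MPX support.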
214 | + | |
215 | + | |
216 | +4. Special rules | |
217 | +================ | |
218 | + | |
219 | +1) If userspace is requesting help from the kernel to do the management | |
220 | +of bounds tables, it may not create or modify entries in the bounds directory. | |
221 | + | |
222 | +Users can certainly allocate bounds tables themselves, forcibly point the | |
223 | +bounds directory at them through the XSAVE instruction, and then set the | |
224 | +valid bit of the bounds entry to make the entry valid. But the kernel will | |
225 | +decline to assist in managing these tables. | |
226 | + | |
227 | +2) Userspace may not take multiple bounds directory entries and point | |
228 | +them at the same bounds table. | |
229 | + | |
230 | +This is allowed architecturally. For more information, see the "Intel(R) | |
231 | +Architecture Instruction Set Extensions Programming Reference" (9.3.4). | |
232 | + | |
233 | +However, if users did this, the kernel might be fooled into unmapping an | |
234 | +in-use bounds table since it does not recognize sharing. | |