  .. SPDX-License-Identifier: GPL-2.0
  
  ==========================
  Page Table Isolation (PTI)
  ==========================
  Overview
  ========
  Page Table Isolation (pti, previously known as KAISER [1]_) is a
  countermeasure against attacks on the shared user/kernel address
  space such as the "Meltdown" approach [2]_.
  
  To mitigate this class of attacks, we create an independent set of
  page tables for use only when running userspace applications.  When
  the kernel is entered via syscalls, interrupts or exceptions, the
  page tables are switched to the full "kernel" copy.  When the system
  switches back to user mode, the user copy is used again.
  
  The userspace page tables contain only a minimal amount of kernel
  data: only what is needed to enter/exit the kernel such as the
  entry/exit functions themselves and the interrupt descriptor table
  (IDT).  There are a few strictly unnecessary things that get mapped
  such as the first C function when entering an interrupt (see
  comments in pti.c).
  
  This approach helps to ensure that side-channel attacks leveraging
  the paging structures do not function when PTI is enabled.  It can be
  enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
  Once enabled at compile-time, it can be disabled at boot with the
  'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
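
  For example, with CONFIG_PAGE_TABLE_ISOLATION=y, PTI can still be
  disabled for a single boot by appending 'pti=off' (or 'nopti') to
  the kernel command line.  Whether PTI is active can be checked at
  runtime via sysfs::

	cat /sys/devices/system/cpu/vulnerabilities/meltdown

  This reports "Mitigation: PTI" when the isolation is in effect.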
  
  Page Table Management
  =====================
  
  When PTI is enabled, the kernel manages two sets of page tables.
  The first set is very similar to the single set which is present in
  kernels without PTI.  This includes a complete mapping of userspace
  that the kernel can use for things like copy_to_user().
  
  Although *complete*, the user portion of the kernel page tables is
  crippled by setting the NX bit in the top level.  This ensures
  that any missed kernel->user CR3 switch will immediately crash
  userspace upon executing its first instruction.
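
  As a sketch of this poisoning (a purely illustrative user-space
  model, not kernel code; only the NX bit position, bit 63 of an
  x86-64 page table entry, is taken from the hardware format)::

	#include <stdint.h>
	#include <stdio.h>

	/* Bit 63 of an x86-64 page table entry is NX (no-execute). */
	#define _PAGE_NX (1ULL << 63)

	/* Hypothetical helper mirroring what PTI does to the user half
	 * of the kernel PGD: the mapping stays intact, but any attempt
	 * to fetch instructions through it faults. */
	static uint64_t pti_poison_entry(uint64_t pgd_entry)
	{
		return pgd_entry | _PAGE_NX;
	}

	int main(void)
	{
		uint64_t entry = 0x1000ULL | 0x067ULL; /* frame + P/RW/US bits */
		uint64_t poisoned = pti_poison_entry(entry);

		printf("NX bit set: %llu\n",
		       (unsigned long long)(poisoned >> 63));
		return 0;
	}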
  
  The userspace page tables map only the kernel data needed to enter
  and exit the kernel.  This data is entirely contained in the 'struct
  cpu_entry_area' structure, which is placed in the fixmap, giving
  each CPU's copy of the area a compile-time-fixed virtual address.
  
  For new userspace mappings, the kernel makes the entries in its
  page tables like normal.  The only difference is when the kernel
  makes entries in the top (PGD) level.  In addition to setting the
  entry in the main kernel PGD, a copy of the entry is made in the
  userspace page tables' PGD.
  
  This sharing at the PGD level also inherently shares all the lower
  layers of the page tables.  This leaves a single, shared set of
  userspace page tables to manage.  One PTE to lock, one set of
  accessed bits, dirty bits, etc...
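
  The mirroring rule can be sketched as follows (an illustrative
  user-space model with hypothetical names, not the kernel's actual
  helpers; on x86-64, userspace occupies the lower half of the PGD)::

	#include <assert.h>
	#include <stdint.h>

	#define PTRS_PER_PGD 512

	/* Model of the order-1 PGD allocation: the kernel copy and the
	 * user copy side by side. */
	struct pgd_pair {
		uint64_t kernel[PTRS_PER_PGD];
		uint64_t user[PTRS_PER_PGD];
	};

	/* Every new top-level userspace entry is written to both
	 * copies, so all lower page table levels end up shared. */
	static void set_pgd_mirrored(struct pgd_pair *p, int idx, uint64_t e)
	{
		p->kernel[idx] = e;
		if (idx < PTRS_PER_PGD / 2)	/* userspace half */
			p->user[idx] = e;
	}

	int main(void)
	{
		struct pgd_pair p = { { 0 }, { 0 } };

		set_pgd_mirrored(&p, 3, 0x1067);	/* userspace: mirrored */
		set_pgd_mirrored(&p, 400, 0x2067);	/* kernel: not mirrored */
		assert(p.user[3] == 0x1067 && p.user[400] == 0);
		return 0;
	}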
  
  Overhead
  ========
  
  Protection against side-channel attacks is important.  But,
  this protection comes at a cost:
  
  1. Increased Memory Use

    a. Each process now needs an order-1 PGD instead of order-0.
       (Consumes an additional 4k per process).
    b. The 'cpu_entry_area' structure must be 2MB in size and 2MB
       aligned so that it can be mapped by setting a single PMD
       entry.  This consumes nearly 2MB of RAM once the kernel
       is decompressed, but no space in the kernel image itself.
  
  2. Runtime Cost

    a. CR3 manipulation to switch between the page table copies
       must be done at interrupt, syscall, and exception entry
       and exit (it can be skipped when the kernel is interrupted,
       though.)  Moves to CR3 are on the order of a hundred
       cycles, and are required at every entry and exit.
    b. A "trampoline" must be used for SYSCALL entry.  This
       trampoline depends on a smaller set of resources than the
       non-PTI SYSCALL entry code, so requires mapping fewer
       things into the userspace page tables.  The downside is
       that stacks must be switched at entry time.
    c. Global pages are disabled for all kernel structures not
       mapped into both kernel and userspace page tables.  This
       feature of the MMU allows different processes to share TLB
       entries mapping the kernel.  Losing the feature means more
       TLB misses after a context switch.  The actual loss of
       performance is very small, however, never exceeding 1%.
    d. Process Context IDentifiers (PCID) is a CPU feature that
       allows us to skip flushing the entire TLB when switching page
       tables by setting a special bit in CR3 when the page tables
       are changed.  This makes switching the page tables (at context
       switch, or kernel entry/exit) cheaper.  But, on systems with
       PCID support, the context switch code must flush both the user
       and kernel entries out of the TLB.  The user PCID TLB flush is
       deferred until the exit to userspace, minimizing the cost.
       See intel.com/sdm for the gory PCID/INVPCID details.
    e. The userspace page tables must be populated for each new
       process.  Even without PTI, the shared kernel mappings
       are created by copying top-level (PGD) entries into each
       new process.  But, with PTI, there are now *two* kernel
       mappings: one in the kernel page tables that maps everything
       and one for the entry/exit structures.  At fork(), we need to
       copy both.
    f. In addition to the fork()-time copying, there must also
       be an update to the userspace PGD any time a set_pgd() is done
       on a PGD used to map userspace.  This ensures that the kernel
       and userspace copies always map the same userspace
       memory.
    g. On systems without PCID support, each CR3 write flushes
       the entire TLB.  That means that each syscall, interrupt
       or exception flushes the TLB.
    h. INVPCID is a TLB-flushing instruction which allows flushing
       of TLB entries for non-current PCIDs.  Some systems support
       PCIDs, but do not support INVPCID.  On these systems, addresses
       can only be flushed from the TLB for the current PCID.  When
       flushing a kernel address, we need to flush all PCIDs, so a
       single kernel address flush will require a TLB-flushing CR3
       write upon the next use of every PCID.
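
  The CR3 encoding behind items d., g. and h. can be sketched as
  follows (the bit positions are from the SDM; the helper itself is an
  illustrative user-space model, not the kernel's implementation)::

	#include <stdint.h>
	#include <stdio.h>

	/* With CR4.PCIDE=1, bits 0-11 of CR3 select the PCID, and bit
	 * 63 of the value written to CR3 suppresses the TLB flush for
	 * that PCID.  Without PCID support, every CR3 write flushes. */
	#define CR3_PCID_MASK	0xFFFULL
	#define CR3_NOFLUSH	(1ULL << 63)

	static uint64_t build_cr3(uint64_t pgd_pa, uint16_t pcid, int noflush)
	{
		uint64_t cr3 = (pgd_pa & ~(CR3_PCID_MASK | CR3_NOFLUSH)) |
			       (pcid & CR3_PCID_MASK);

		if (noflush)	/* TLB entries for this PCID stay valid */
			cr3 |= CR3_NOFLUSH;
		return cr3;
	}

	int main(void)
	{
		/* Kernel entry: switch PGDs without flushing the TLB. */
		printf("%#llx\n",
		       (unsigned long long)build_cr3(0x100000, 1, 1));
		return 0;
	}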
  
  Possible Future Work
  ====================
  1. We can be more careful about not writing to CR3 unless its
     value actually changes.
  2. Allow PTI to be enabled/disabled at runtime in addition to the
     boot-time switching.
  
  Testing
  ========
  
  To test stability of PTI, the following test procedure is recommended,
  ideally doing all of these in parallel:
  
  1. Set CONFIG_DEBUG_ENTRY=y
  2. Run several copies of all of the tools/testing/selftests/x86/ tests
     (excluding MPX and protection_keys) in a loop on multiple CPUs for
     several minutes.  These tests frequently uncover corner cases in the
     kernel entry code.  In general, old kernels might cause these tests
     themselves to crash, but they should never crash the kernel.
  3. Run the 'perf' tool in a mode (top or record) that generates many
     frequent performance monitoring non-maskable interrupts (see "NMI"
     in /proc/interrupts).  This exercises the NMI entry/exit code which
     is known to trigger bugs in code paths that did not expect to be
     interrupted, including nested NMIs.  Using "-c" boosts the rate of
     NMIs, and using two -c with separate counters encourages nested NMIs
     and less deterministic behavior.
     ::
  
  	while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done
  
  4. Launch a KVM virtual machine.
  5. Run 32-bit binaries on systems supporting the SYSCALL instruction.
     This has been a lightly-tested code path and needs extra scrutiny.
  
  Debugging
  =========
  
  Bugs in PTI cause a few different signatures of crashes
  that are worth noting here.
  
   * Failures of the selftests/x86 code.  Usually a bug in one of the
     more obscure corners of entry_64.S
   * Crashes in early boot, especially around CPU bringup.  Bugs
     in the trampoline code or mappings cause these.
   * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
     like screwing up a page table switch.  Also caused by
     incorrectly mapping the IRQ handler entry code.
   * Crashes at the first NMI.  The NMI code is separate from main
     interrupt handlers and can have bugs that do not affect
     normal interrupts.  Also caused by incorrectly mapping NMI
     code.  NMIs that interrupt the entry code must be very
     careful and can be the cause of crashes that show up when
     running perf.
   * Kernel crashes at the first exit to userspace.  entry_64.S
     bugs, or failing to map some of the exit code.
   * Crashes at first interrupt that interrupts userspace. The paths
     in entry_64.S that return to userspace are sometimes separate
     from the ones that return to the kernel.
   * Double faults: overflowing the kernel stack because of page
     faults upon page faults.  Caused by touching non-pti-mapped
     data in the entry code, or forgetting to switch to kernel
     CR3 before calling into C functions which are not pti-mapped.
   * Userspace segfaults early in boot, sometimes manifesting
     as mount(8) failing to mount the rootfs.  These have
     tended to be TLB invalidation issues.  Usually invalidating
     the wrong PCID, or otherwise missing an invalidation.

  .. [1] https://gruss.cc/files/kaiser.pdf
  .. [2] https://meltdownattack.com/meltdown.pdf