Commit 8e0aa6d436f303a37df7ec68758883ade077d123
Committed by Benjamin Herrenschmidt
1 parent e55d7f737d
Exists in master and in 20 other branches
fadump: Add documentation for firmware-assisted dump.
Documentation for firmware-assisted dump. This document is based on the original documentation written for phyp assisted dump by Linas Vepstas and Manish Ahuja, with a few changes to reflect the current implementation.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Showing 1 changed file with 270 additions and 0 deletions
Documentation/powerpc/firmware-assisted-dump.txt
1 | + | |
2 | + Firmware-Assisted Dump | |
3 | + ------------------------ | |
4 | + July 2011 | |
5 | + | |
6 | +The goal of firmware-assisted dump is to enable the dump of | |
7 | +a crashed system, and to do so from a fully-reset system, and | |
8 | +to minimize the total elapsed time until the system is back | |
9 | +in production use. | |
10 | + | |
11 | +- Firmware-assisted dump (fadump) infrastructure is intended to replace | |
12 | + the existing phyp assisted dump. | |
13 | +- Fadump uses the same firmware interfaces and memory reservation model | |
14 | + as phyp assisted dump. | |
15 | +- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore | |
16 | + in the ELF format in the same way as kdump. This helps us reuse the | |
17 | + kdump infrastructure for dump capture and filtering. | |
18 | +- Unlike phyp dump, userspace tools do not need to refer to any sysfs | |
19 | + interface while reading /proc/vmcore. | |
20 | +- Unlike phyp dump, fadump allows the user to release all the memory reserved | |
21 | + for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem. | |
22 | +- Once enabled through a kernel boot parameter, fadump can be | |
23 | + started/stopped through the /sys/kernel/fadump_registered interface (see | |
24 | + the sysfs files section below) and can be easily integrated with kdump | |
25 | + service start/stop init scripts. | |
26 | + | |
27 | +Compared with kdump or other strategies, firmware-assisted | |
28 | +dump offers several strong, practical advantages: | |
29 | + | |
30 | +-- Unlike kdump, the system has been reset, and loaded | |
31 | + with a fresh copy of the kernel. In particular, | |
32 | + PCI and I/O devices have been reinitialized and are | |
33 | + in a clean, consistent state. | |
34 | +-- Once the dump is copied out, the memory that held the dump | |
35 | + is immediately available to the running kernel. Therefore, | |
36 | + unlike kdump, fadump does not need a second reboot to get the | |
37 | + system back to the production configuration. | |
38 | + | |
39 | +The above can only be accomplished by coordination with, | |
40 | +and assistance from, the Power firmware. The procedure is | |
41 | +as follows: | |
42 | + | |
43 | +-- The first kernel registers the sections of memory with the | |
44 | + Power firmware for dump preservation during OS initialization. | |
45 | + These registered sections of memory are reserved by the first | |
46 | + kernel during early boot. | |
47 | + | |
48 | +-- When a system crashes, the Power firmware will save | |
49 | + the low memory (boot memory, sized at the larger of 5% of system | |
50 | + RAM or 256MB) to the previously registered region. It will | |
51 | + also save system registers and hardware PTEs. | |
52 | + | |
53 | + NOTE: The term 'boot memory' means the size of the low memory chunk | |
54 | + that is required for a kernel to boot successfully when | |
55 | + booted with restricted memory. By default, the boot memory | |
56 | + size will be the larger of 5% of system RAM or 256MB. | |
57 | + Alternatively, the user can specify the boot memory size | |
58 | + through the boot parameter 'fadump_reserve_mem=', which will | |
59 | + override the default calculated size. Use this option | |
60 | + if the default boot memory size is not sufficient for the second | |
61 | + kernel to boot successfully. | |
62 | + | |
63 | +-- After the low memory (boot memory) area has been saved, the | |
64 | + firmware will reset PCI and other hardware state. It will | |
65 | + *not* clear the RAM. It will then launch the bootloader, as | |
66 | + normal. | |
67 | + | |
68 | +-- The freshly booted kernel will notice that there is a new | |
69 | + node (ibm,dump-kernel) in the device tree, indicating that | |
70 | + there is crash data available from a previous boot. During | |
71 | + early boot, the OS will reserve the rest of the memory above | |
72 | + the boot memory size, effectively booting with a restricted | |
73 | + memory size. This ensures that the second kernel will not | |
74 | + touch any of the dump memory area. | |
75 | + | |
76 | +-- User-space tools will read /proc/vmcore to obtain the contents | |
77 | + of memory, which holds the previously crashed kernel's dump in | |
78 | + ELF format. The userspace tools may copy this info to disk, or | |
79 | + to network, NAS, SAN, iSCSI, etc. as desired. | |
80 | + | |
81 | +-- Once the userspace tool is done saving the dump, it will echo | |
82 | + '1' to /sys/kernel/fadump_release_mem to release the reserved | |
83 | + memory back to general use, except the memory required for | |
84 | + the next firmware-assisted dump registration. | |
85 | + | |
86 | + e.g. | |
87 | + # echo 1 > /sys/kernel/fadump_release_mem | |
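
The userspace side of the procedure above could be sketched as the following shell function. This is only an illustration and not the actual kdump scripts: the function name save_and_release, its arguments and the use of cp are assumptions made here. Only /proc/vmcore and /sys/kernel/fadump_release_mem come from this document.

```shell
#!/bin/sh
# Hypothetical helper (not part of any kdump package): save the dump
# exported by the second kernel, then hand the reserved memory back.
#   $1 = destination file for the dump
#   $2 = vmcore path       (default: /proc/vmcore)
#   $3 = release_mem path  (default: /sys/kernel/fadump_release_mem)
save_and_release() {
    dest=$1
    vmcore=${2:-/proc/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # Without a readable vmcore there is nothing to save.
    [ -r "$vmcore" ] || { echo "no dump available" >&2; return 1; }

    # Copy the ELF-format dump; give up (keeping the memory reserved)
    # if the copy fails, so the dump is not lost.
    cp "$vmcore" "$dest" || return 1

    # The release file is described as present only while fadump is
    # active in the second kernel; if it is absent, nothing to release.
    [ -w "$release" ] || return 0
    echo 1 > "$release"
}
```

A capture script might call it as `save_and_release /var/crash/vmcore`, where the destination path is again just an example.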
88 | + | |
89 | +Please note that the firmware-assisted dump feature | |
90 | +is only available on Power6 and later systems with recent | |
91 | +firmware versions. | |
92 | + | |
93 | +Implementation details: | |
94 | +----------------------- | |
95 | + | |
96 | +During boot, a check is made to see if firmware supports | |
97 | +this feature on that particular machine. If it does, then | |
98 | +we check to see if an active dump is waiting for us. If so, | |
99 | +everything but the boot memory size of RAM is reserved during | |
100 | +early boot (see Fig. 2). This area is released once the userland | |
101 | +scripts (e.g. kdump scripts) have finished collecting the dump. | |
102 | +If there is dump data, then the | |
103 | +/sys/kernel/fadump_release_mem file is created, and the reserved | |
104 | +memory is held until it is released through that file. | |
105 | + | |
106 | +If there is no waiting dump data, then only the memory required | |
107 | +to hold the CPU state, HPTE region, boot memory dump and elfcore | |
108 | +header is reserved at the top of memory (see Fig. 1). This area | |
109 | +is *not* released: this region will be kept permanently reserved, | |
110 | +so that it can act as a receptacle for a copy of the boot memory | |
111 | +content in addition to CPU state and HPTE region, in case a | |
112 | +crash does occur. | |
113 | + | |
114 | +  o Memory Reservation during first kernel | |
115 | + | |
116 | +  Low memory                                        Top of memory | |
117 | +  0      boot memory size                                       | | |
118 | +  |           |                       |<--Reserved dump area -->| | |
119 | +  V           V                       |   Permanent Reservation V | |
120 | +  +-----------+----------/ /----------+---+----+-----------+----+ | |
121 | +  |           |                       |CPU|HPTE|  DUMP     |ELF | | |
122 | +  +-----------+----------/ /----------+---+----+-----------+----+ | |
123 | +        |                                           ^ | |
124 | +        |                                           | | |
125 | +        \                                           / | |
126 | +         ------------------------------------------- | |
127 | +          Boot memory content gets transferred to | |
128 | +          reserved area by firmware at the time of | |
129 | +          crash | |
130 | +                     Fig. 1 | |
131 | + | |
132 | +  o Memory Reservation during second kernel after crash | |
133 | + | |
134 | +  Low memory                                        Top of memory | |
135 | +  0      boot memory size                                       | | |
136 | +  |           |<------------- Reserved dump area ----------- -->| | |
137 | +  V           V                                                 V | |
138 | +  +-----------+----------/ /----------+---+----+-----------+----+ | |
139 | +  |           |                       |CPU|HPTE|  DUMP     |ELF | | |
140 | +  +-----------+----------/ /----------+---+----+-----------+----+ | |
141 | +        |                                              | | |
142 | +        V                                              V | |
143 | +   Used by second                                 /proc/vmcore | |
144 | +   kernel to boot | |
145 | +                     Fig. 2 | |
146 | + | |
147 | +Currently the dump will be copied from /proc/vmcore to | |
148 | +a new file upon user intervention. The dump data available through | |
149 | +/proc/vmcore will be in ELF format. Hence the existing kdump | |
150 | +infrastructure (kdump scripts) to save the dump works fine with | |
151 | +minor modifications. | |
152 | + | |
153 | +The tools to examine the dump will be the same as the ones | |
154 | +used for kdump. | |
155 | + | |
156 | +How to enable firmware-assisted dump (fadump): | |
157 | +---------------------------------------------- | |
158 | + | |
159 | +1. Set config option CONFIG_FA_DUMP=y and build the kernel. | |
160 | +2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option. | |
161 | +3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel | |
162 | + cmdline option to specify the size of the memory to reserve for boot | |
163 | + memory dump preservation. | |
164 | + | |
165 | +NOTE: If firmware-assisted dump fails to reserve memory then it will | |
166 | + fall back to the existing kdump mechanism if the 'crashkernel=' option | |
167 | + is set on the kernel cmdline. | |
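
When deciding whether to set 'fadump_reserve_mem=', it can help to know roughly what the default would be. The sketch below only restates the rule given earlier in this document (the larger of 5% of system RAM or 256MB) as shell arithmetic. The function name is hypothetical, sizes are in MB, and the kernel's own calculation may round or align differently.

```shell
# Hypothetical helper: default boot memory size in MB for a given
# amount of system RAM in MB, i.e. max(5% of RAM, 256MB).
# Integer arithmetic only; the kernel's own rounding may differ.
default_boot_mem_mb() {
    ram_mb=$1
    five_pct=$(( ram_mb * 5 / 100 ))
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}

# Example use (an approximation: MemTotal in /proc/meminfo is in kB
# and excludes memory the kernel has already reserved):
#   ram_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   default_boot_mem_mb $(( ram_kb / 1024 ))
```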
168 | + | |
169 | +Sysfs/debugfs files: | |
170 | +-------------------- | |
171 | + | |
172 | +The firmware-assisted dump feature uses the sysfs file system to hold | |
173 | +the control files and a debugfs file to display the reserved memory regions. | |
174 | + | |
175 | +Here is the list of files under kernel sysfs: | |
176 | + | |
177 | + /sys/kernel/fadump_enabled | |
178 | + | |
179 | + This is used to display the fadump status. | |
180 | + 0 = fadump is disabled | |
181 | + 1 = fadump is enabled | |
182 | + | |
183 | + This interface can be used by kdump init scripts to identify if | |
184 | + fadump is enabled in the kernel and act accordingly. | |
185 | + | |
186 | + /sys/kernel/fadump_registered | |
187 | + | |
188 | + This is used to display the fadump registration status as well | |
189 | + as to control (start/stop) the fadump registration. | |
190 | + 0 = fadump is not registered. | |
191 | + 1 = fadump is registered and ready to handle system crash. | |
192 | + | |
193 | + To register fadump, echo 1 > /sys/kernel/fadump_registered; to | |
194 | + un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered. | |
195 | + Once fadump is un-registered, a system crash will not | |
196 | + be handled and vmcore will not be captured. This interface can be | |
197 | + easily integrated with kdump service start/stop. | |
198 | + | |
199 | + /sys/kernel/fadump_release_mem | |
200 | + | |
201 | + This file is available only when fadump is active during | |
202 | + the second kernel. It is used to release the reserved memory | |
203 | + regions that are held for saving the crash dump. To release the | |
204 | + reserved memory, echo 1 to it: | |
205 | + | |
206 | + echo 1 > /sys/kernel/fadump_release_mem | |
207 | + | |
208 | + After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region | |
209 | + file will change to reflect the new memory reservations. | |
210 | + | |
211 | + The existing userspace tools (kdump infrastructure) can be easily | |
212 | + enhanced to use this interface to release the memory reserved for | |
213 | + dump and continue without a second reboot. | |
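
As an illustration of how a kdump-style init script might use the sysfs files listed above, the sketch below wraps them in start/stop helpers. The function names and the FADUMP_SYSFS variable are assumptions introduced here (the variable exists only so the helpers can be pointed at a test directory). The file names and the 0/1 values are the ones described in this document.

```shell
# Directory holding the fadump control files; overridable for testing.
FADUMP_SYSFS=${FADUMP_SYSFS:-/sys/kernel}

# Succeeds if the kernel reports fadump as enabled (fadump_enabled = 1).
fadump_is_enabled() {
    [ "$(cat "$FADUMP_SYSFS/fadump_enabled" 2>/dev/null)" = "1" ]
}

# Register fadump so that a system crash will be handled.
fadump_start() {
    fadump_is_enabled || { echo "fadump is not enabled" >&2; return 1; }
    echo 1 > "$FADUMP_SYSFS/fadump_registered"
}

# Un-register fadump; a crash after this point will not be captured.
fadump_stop() {
    fadump_is_enabled || return 0
    echo 0 > "$FADUMP_SYSFS/fadump_registered"
}

# Print "registered" or "not registered" based on fadump_registered.
fadump_status() {
    if [ "$(cat "$FADUMP_SYSFS/fadump_registered" 2>/dev/null)" = "1" ]; then
        echo "registered"
    else
        echo "not registered"
    fi
}
```

An init script could then map its start, stop and status actions onto fadump_start, fadump_stop and fadump_status, and fall back to its usual kdump path when fadump_is_enabled fails.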
214 | + | |
215 | +Here is the list of files under powerpc debugfs: | |
216 | +(Assuming debugfs is mounted on /sys/kernel/debug directory.) | |
217 | + | |
218 | + /sys/kernel/debug/powerpc/fadump_region | |
219 | + | |
220 | + This file shows the reserved memory regions if fadump is | |
221 | + enabled; otherwise this file is empty. The output format | |
222 | + is: | |
223 | + <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> | |
224 | + | |
225 | + e.g. | |
226 | + Contents when fadump is registered during first kernel | |
227 | + | |
228 | + # cat /sys/kernel/debug/powerpc/fadump_region | |
229 | + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 | |
230 | + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 | |
231 | + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 | |
232 | + | |
233 | + Contents when fadump is active during second kernel | |
234 | + | |
235 | + # cat /sys/kernel/debug/powerpc/fadump_region | |
236 | + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 | |
237 | + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 | |
238 | + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 | |
239 | + : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 | |
240 | + | |
241 | +NOTE: Please refer to Documentation/filesystems/debugfs.txt on | |
242 | + how to mount the debugfs filesystem. | |
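
The fadump_region output format shown above is simple enough to post-process in a script. The following sketch reads such lines on standard input and prints each region's name with its reserved and dumped sizes in decimal bytes. The function name and the "(unnamed)" placeholder for the entry without a region name are assumptions made for this example.

```shell
# Hypothetical helper: parse lines of the form
#   <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
# from stdin and print "<region> <reserved-bytes> <dumped-bytes>"
# with both sizes converted from hex to decimal.
fadump_region_summary() {
    while IFS= read -r line; do
        # Ignore anything that does not look like a region line.
        case $line in
            *"] 0x"*" bytes, Dumped: 0x"*) ;;
            *) continue ;;
        esac
        # Region name: text before the first colon, whitespace trimmed.
        name=$(printf '%s\n' "$line" | sed 's/[[:space:]]*:.*//; s/^[[:space:]]*//')
        size=$(printf '%s\n' "$line" | sed 's/.*\] \(0x[0-9a-fA-F]*\) bytes.*/\1/')
        dumped=$(printf '%s\n' "$line" | sed 's/.*Dumped: \(0x[0-9a-fA-F]*\).*/\1/')
        [ -n "$name" ] || name="(unnamed)"
        # Shell arithmetic accepts 0x-prefixed constants.
        echo "$name $(($size)) $(($dumped))"
    done
}

# Example use, assuming debugfs is mounted on /sys/kernel/debug:
#   fadump_region_summary < /sys/kernel/debug/powerpc/fadump_region
```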
243 | + | |
244 | + | |
245 | +TODO: | |
246 | +----- | |
247 | + o Need to come up with a better approach to find out a more | |
248 | + accurate boot memory size that is required for a kernel to | |
249 | + boot successfully when booted with restricted memory. | |
250 | + o The fadump implementation introduces a fadump crash info structure | |
251 | + in the scratch area before the ELF core header. The idea of introducing | |
252 | + this structure is to pass some important crash info data to the second | |
253 | + kernel, which will help the second kernel populate the ELF core header | |
254 | + with correct data before it gets exported through /proc/vmcore. The | |
255 | + current design does not address the possibility of introducing | |
256 | + additional fields (in future) to this structure without affecting | |
257 | + compatibility. Need to come up with a better approach to address this. | |
258 | + The possible approaches are: | |
259 | + 1. Introduce a version field for version tracking, and bump up the version | |
260 | + whenever a new field is added to the structure in future. The version | |
261 | + field can be used to find out what fields are valid for the current | |
262 | + version of the structure. | |
263 | + 2. Reserve an area of predefined size (say PAGE_SIZE) for this | |
264 | + structure and keep the unused area reserved (initialized to zero) | |
265 | + for future field additions. | |
266 | + The advantage of approach 1 over 2 is we don't need to reserve extra space. | |
267 | +--- | |
268 | +Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> | |
269 | +This document is based on the original documentation written for phyp | |
270 | +assisted dump by Linas Vepstas and Manish Ahuja. |