Commit 8e0aa6d436f303a37df7ec68758883ade077d123

Authored by Mahesh Salgaonkar
Committed by Benjamin Herrenschmidt
1 parent e55d7f737d

fadump: Add documentation for firmware-assisted dump.

Documentation for firmware-assisted dump. This document is based on the
original documentation written for phyp assisted dump by Linas Vepstas
and Manish Ahuja, with a few changes to reflect the current implementation.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Showing 1 changed file with 270 additions and 0 deletions

Documentation/powerpc/firmware-assisted-dump.txt
  1 +
  2 + Firmware-Assisted Dump
  3 + ------------------------
  4 + July 2011
  5 +
  6 +The goal of firmware-assisted dump is to enable the dump of
  7 +a crashed system, and to do so from a fully-reset system, and
  8 +to minimize the total elapsed time until the system is back
  9 +in production use.
  10 +
  11 +- Firmware assisted dump (fadump) infrastructure is intended to replace
  12 + the existing phyp assisted dump.
  13 +- Fadump uses the same firmware interfaces and memory reservation model
  14 + as phyp assisted dump.
  15 +- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
  16 + in the ELF format in the same way as kdump. This helps us reuse the
  17 + kdump infrastructure for dump capture and filtering.
  18 +- Unlike phyp dump, userspace tools do not need to refer to any sysfs
  19 + interface while reading /proc/vmcore.
  20 +- Unlike phyp dump, fadump allows the user to release all the memory reserved
  21 + for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem.
  22 +- Once enabled through kernel boot parameter, fadump can be
  23 + started/stopped through /sys/kernel/fadump_registered interface (see
  24 + sysfs files section below) and can be easily integrated with kdump
  25 + service start/stop init scripts.
  26 +
  27 +Compared with kdump or other strategies, firmware-assisted
  28 +dump offers several strong, practical advantages:
  29 +
  30 +-- Unlike kdump, the system has been reset, and loaded
  31 + with a fresh copy of the kernel. In particular,
  32 + PCI and I/O devices have been reinitialized and are
  33 + in a clean, consistent state.
  34 +-- Once the dump is copied out, the memory that held the dump
  35 + is immediately available to the running kernel. Therefore,
  36 + unlike kdump, fadump doesn't need a second reboot to get the
  37 + system back to the production configuration.
  38 +
  39 +The above can only be accomplished by coordination with,
  40 +and assistance from, the Power firmware. The procedure is
  41 +as follows:
  42 +
  43 +-- The first kernel registers the sections of memory with the
  44 + Power firmware for dump preservation during OS initialization.
  45 + These registered sections of memory are reserved by the first
  46 + kernel during early boot.
  47 +
  48 +-- When a system crashes, the Power firmware will save
  49 + the low memory (boot memory, sized at the larger of 5% of system
  50 + RAM or 256MB) to the previously registered region. It will
  51 + also save system registers and hardware PTEs.
  52 +
  53 + NOTE: The term 'boot memory' means the size of the low memory chunk
  54 + that is required for a kernel to boot successfully when
  55 + booted with restricted memory. By default, the boot memory
  56 + size will be the larger of 5% of system RAM or 256MB.
  57 + Alternatively, the user can specify the boot memory size
  58 + through the boot parameter 'fadump_reserve_mem=', which will
  59 + override the default calculated size. Use this option
  60 + if the default boot memory size is not sufficient for the second
  61 + kernel to boot successfully.
  62 +
  63 +-- After the low memory (boot memory) area has been saved, the
  64 + firmware will reset PCI and other hardware state. It will
  65 + *not* clear the RAM. It will then launch the bootloader, as
  66 + normal.
  67 +
  68 +-- The freshly booted kernel will notice that there is a new
  69 + node (ibm,dump-kernel) in the device tree, indicating that
  70 + there is crash data available from a previous boot. During
  71 + early boot, the OS will reserve the rest of the memory above
  72 + the boot memory size, effectively booting with a restricted
  73 + memory size. This ensures that the second kernel will not
  74 + touch any of the dump memory area.
  75 +
  76 +-- User-space tools will read /proc/vmcore to obtain the contents
  77 + of memory, which holds the dump of the previously crashed kernel in
  78 + ELF format. The userspace tools may copy this data to disk, or over
  79 + the network (NAS, SAN, iSCSI, etc.), as desired.
  80 +
  81 +-- Once the userspace tool is done saving the dump, it will echo
  82 + '1' to /sys/kernel/fadump_release_mem to release the reserved
  83 + memory back to general use, except the memory required for
  84 + the next firmware-assisted dump registration.
  85 +
  86 + e.g.
  87 + # echo 1 > /sys/kernel/fadump_release_mem
  88 +
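Reviewer note: the last two steps of the procedure above (save the dump, then release the reserved memory) can be sketched as a small shell helper. This is only an illustration under stated assumptions, not the kdump scripts themselves: the function name `fadump_save_and_release` and its argument order are hypothetical, and a plain `cp` stands in for whatever capture or filtering tool a distribution actually uses. The two default paths are the ones named in this document.

```shell
#!/bin/sh
# Sketch only: save the crash dump, then hand the reserved memory back.
# fadump_save_and_release DEST [VMCORE] [RELEASE_FILE]
# The function name and arguments are hypothetical; the default paths are
# the ones named in the document.
fadump_save_and_release() (
    dest=${1:?usage: fadump_save_and_release DEST [VMCORE] [RELEASE_FILE]}
    vmcore=${2:-/proc/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # The release file exists only when fadump is active in the second kernel.
    [ -e "$release" ] || { echo "no active fadump dump" >&2; exit 1; }

    # Copy the ELF-format dump out before releasing anything.
    cp "$vmcore" "$dest" || exit 1

    # Release the memory that held the dump back to general use.
    echo 1 > "$release"
)
```

The optional second and third arguments exist only so the helper can be tried against scratch files; on a real system they would be left at their defaults.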
  89 +Please note that the firmware-assisted dump feature
  90 +is only available on Power6 and above systems with recent
  91 +firmware versions.
  92 +
  93 +Implementation details:
  94 +----------------------
  95 +
  96 +During boot, a check is made to see if firmware supports
  97 +this feature on that particular machine. If it does, then
  98 +we check to see if an active dump is waiting for us. If so,
  99 +everything but the boot memory size of RAM is reserved during
  100 +early boot (see Fig. 2). This area is released once the userland
  101 +scripts (e.g. kdump scripts) have finished collecting the dump.
  102 +If there is dump data, then the
  103 +/sys/kernel/fadump_release_mem file is created, and the reserved
  104 +memory is held.
  105 +
  106 +If there is no waiting dump data, then only the memory required
  107 +to hold the CPU state, HPTE region, boot memory dump and elfcore
  108 +header is reserved at the top of memory (see Fig. 1). This area
  109 +is *not* released: this region will be kept permanently reserved,
  110 +so that it can act as a receptacle for a copy of the boot memory
  111 +content in addition to CPU state and HPTE region, in case a
  112 +crash does occur.
  113 +
  114 +  o Memory Reservation during first kernel
  115 +
  116 +  Low memory                                        Top of memory
  117 +  0      boot memory size                                       |
  118 +  |           |                       |<--Reserved dump area -->|
  119 +  V           V                       |   Permanent Reservation V
  120 +  +-----------+----------/ /----------+---+----+-----------+----+
  121 +  |           |                       |CPU|HPTE|  DUMP     |ELF |
  122 +  +-----------+----------/ /----------+---+----+-----------+----+
  123 +        |                                           ^
  124 +        |                                           |
  125 +        \                                           /
  126 +         -------------------------------------------
  127 +          Boot memory content gets transferred to
  128 +          reserved area by firmware at the time of
  129 +          crash
  130 +                   Fig. 1
  131 +
  132 +  o Memory Reservation during second kernel after crash
  133 +
  134 +  Low memory                                        Top of memory
  135 +  0      boot memory size                                       |
  136 +  |           |<------------- Reserved dump area -------------->|
  137 +  V           V                                                 V
  138 +  +-----------+----------/ /----------+---+----+-----------+----+
  139 +  |           |                       |CPU|HPTE|  DUMP     |ELF |
  140 +  +-----------+----------/ /----------+---+----+-----------+----+
  141 +        |                                              |
  142 +        V                                              V
  143 +  Used by second                                 /proc/vmcore
  144 +  kernel to boot
  145 +                   Fig. 2
  146 +
  147 +Currently the dump will be copied from /proc/vmcore to
  148 +a new file upon user intervention. The dump data available through
  149 +/proc/vmcore will be in ELF format. Hence the existing kdump
  150 +infrastructure (kdump scripts) to save the dump works fine with
  151 +minor modifications.
  152 +
  153 +The tools to examine the dump will be the same as the ones
  154 +used for kdump.
  155 +
  156 +How to enable firmware-assisted dump (fadump):
  157 +-------------------------------------
  158 +
  159 +1. Set config option CONFIG_FA_DUMP=y and build kernel.
  160 +2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
  161 +3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel cmdline
  162 + option to specify the size of the memory to reserve for boot memory dump
  163 + preservation.
  164 +
  165 +NOTE: If firmware-assisted dump fails to reserve memory then it will
  166 + fall back to the existing kdump mechanism if the 'crashkernel=' option
  167 + is set on the kernel cmdline.
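Reviewer note: when deciding whether 'fadump_reserve_mem=' is needed, it can help to know what the default would be. The sketch below applies only the rule stated in this document (the larger of 5% of system RAM or 256MB). The function name `fadump_default_boot_mem_mb`, the use of megabytes, and the integer rounding are assumptions made for illustration; the kernel's own calculation may align or round the value differently.

```shell
#!/bin/sh
# Hypothetical helper: default boot memory size in MB, following the rule
# "larger of 5% of system RAM or 256MB" stated in the document.
# $1 = total system RAM in MB (integer).
fadump_default_boot_mem_mb() {
    five_pct=$(( $1 * 5 / 100 ))    # integer division, rounds down
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}
```

For example, a 4096MB system yields the 256MB floor, because 5% of 4096MB is below it.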
  168 +
  169 +Sysfs/debugfs files:
  170 +------------
  171 +
  172 +The firmware-assisted dump feature uses the sysfs file system to hold
  173 +the control files and a debugfs file to display the reserved memory regions.
  174 +
  175 +Here is the list of files under kernel sysfs:
  176 +
  177 + /sys/kernel/fadump_enabled
  178 +
  179 + This is used to display the fadump status.
  180 + 0 = fadump is disabled
  181 + 1 = fadump is enabled
  182 +
  183 + This interface can be used by kdump init scripts to identify if
  184 + fadump is enabled in the kernel and act accordingly.
  185 +
  186 + /sys/kernel/fadump_registered
  187 +
  188 + This is used to display the fadump registration status as well
  189 + as to control (start/stop) the fadump registration.
  190 + 0 = fadump is not registered.
  191 + 1 = fadump is registered and ready to handle system crash.
  192 +
  193 + To register fadump, echo 1 > /sys/kernel/fadump_registered; to
  194 + un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
  195 + Once fadump is un-registered, a system crash will not
  196 + be handled and vmcore will not be captured. This interface can be
  197 + easily integrated with kdump service start/stop.
  198 +
  199 + /sys/kernel/fadump_release_mem
  200 +
  201 + This file is available only when fadump is active during
  202 + the second kernel. It is used to release the reserved memory
  203 + regions that are held for saving the crash dump. To release the
  204 + reserved memory, echo 1 to it:
  205 +
  206 + echo 1 > /sys/kernel/fadump_release_mem
  207 +
  208 + After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
  209 + file will change to reflect the new memory reservations.
  210 +
  211 + The existing userspace tools (kdump infrastructure) can be easily
  212 + enhanced to use this interface to release the memory reserved for
  213 + dump and continue without a second reboot.
  214 +
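Reviewer note: the text says the fadump_enabled and fadump_registered files can be wired into kdump service start/stop scripts. A minimal sketch of such hooks follows. The function names and the `FADUMP_SYSFS` variable are hypothetical; `FADUMP_SYSFS` defaults to /sys/kernel and exists only so the functions can be exercised against a scratch directory. The file names and the 0/1 values are taken from the descriptions above.

```shell
#!/bin/sh
# Sketch of start/stop hooks for a kdump-style init script.
# FADUMP_SYSFS is a hypothetical override for testing; it defaults to
# /sys/kernel, where the document says the control files live.

# Succeeds only if fadump_enabled reads 1 (fadump is enabled in the kernel).
fadump_is_enabled() {
    [ "$(cat "${FADUMP_SYSFS:-/sys/kernel}/fadump_enabled" 2>/dev/null)" = "1" ]
}

# Register fadump so that a system crash will be handled.
fadump_start() {
    fadump_is_enabled || return 1
    echo 1 > "${FADUMP_SYSFS:-/sys/kernel}/fadump_registered"
}

# Un-register fadump; a crash after this point will not be captured.
fadump_stop() {
    fadump_is_enabled || return 1
    echo 0 > "${FADUMP_SYSFS:-/sys/kernel}/fadump_registered"
}
```

An init script could call `fadump_start` first and use its existing kdump path when it returns non-zero, which mirrors the fallback behaviour described in the NOTE in the "How to enable" section.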
  215 +Here is the list of files under powerpc debugfs:
  216 +(Assuming debugfs is mounted on /sys/kernel/debug directory.)
  217 +
  218 + /sys/kernel/debug/powerpc/fadump_region
  219 +
  220 + This file shows the reserved memory regions if fadump is
  221 + enabled; otherwise this file is empty. The output format
  222 + is:
  223 + <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
  224 +
  225 + e.g.
  226 + Contents when fadump is registered during first kernel
  227 +
  228 + # cat /sys/kernel/debug/powerpc/fadump_region
  229 + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
  230 + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
  231 + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
  232 +
  233 + Contents when fadump is active during second kernel
  234 +
  235 + # cat /sys/kernel/debug/powerpc/fadump_region
  236 + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
  237 + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
  238 + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
  239 + : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
  240 +
  241 +NOTE: Please refer to Documentation/filesystems/debugfs.txt on
  242 + how to mount the debugfs filesystem.
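Reviewer note: in the sample output above, each reserved size equals end - start + 1 (for the CPU region, 0x6fff001f - 0x6ffb0000 + 1 = 0x40020). The sketch below checks that relationship for one line of fadump_region output. The function name `fadump_region_size_ok` is hypothetical, and the parsing assumes exactly the `<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>` layout quoted in the document.

```shell
#!/bin/sh
# Hypothetical helper: check that one fadump_region line is self-consistent,
# i.e. reserved-size == end - start + 1.
# Returns 0 if consistent, 1 if not, 2 if the line does not match the format.
fadump_region_size_ok() {
    # Extract "<start> <end> <reserved-size>" as hex literals.
    fields=$(printf '%s\n' "$1" | sed -n \
        's/^.*\[\(0x[0-9a-fA-F]*\)-\(0x[0-9a-fA-F]*\)] *\(0x[0-9a-fA-F]*\) bytes, Dumped: *0x[0-9a-fA-F]*.*$/\1 \2 \3/p')
    [ -n "$fields" ] || return 2

    # Split into positional parameters: $1=start, $2=end, $3=reserved size.
    set -- $fields

    # Shell arithmetic accepts 0x-prefixed hex constants.
    [ $(( $2 - $1 + 1 )) -eq $(( $3 )) ]
}
```

With debugfs mounted, each line of /sys/kernel/debug/powerpc/fadump_region could be passed through this check in a `while read` loop.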
  243 +
  244 +
  245 +TODO:
  246 +-----
  247 + o Need to come up with a better approach to find out a more
  248 + accurate boot memory size that is required for a kernel to
  249 + boot successfully when booted with restricted memory.
  250 + o The fadump implementation introduces a fadump crash info structure
  251 + in the scratch area before the ELF core header. The idea of introducing
  252 + this structure is to pass some important crash info data to the second
  253 + kernel, which will help the second kernel populate the ELF core header
  254 + with correct data before it gets exported through /proc/vmcore. The current
  255 + design does not address the possibility of introducing
  256 + additional fields (in future) to this structure without affecting
  257 + compatibility. Need to come up with a better approach to address this.
  258 + The possible approaches are:
  259 + 1. Introduce a version field for version tracking, and bump up the version
  260 + whenever a new field is added to the structure in future. The version
  261 + field can be used to find out what fields are valid for the current
  262 + version of the structure.
  263 + 2. Reserve an area of predefined size (say PAGE_SIZE) for this
  264 + structure and keep the unused area reserved (initialized to zero)
  265 + for future field additions.
  266 + The advantage of approach 1 over 2 is we don't need to reserve extra space.
  267 +---
  268 +Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  269 +This document is based on the original documentation written for phyp
  270 +assisted dump by Linas Vepstas and Manish Ahuja.