Documentation/nmi_watchdog.txt 4.16 KB
  
  [NMI watchdog is available for x86 and x86-64 architectures]
  
  Is your system locking up unpredictably? No keyboard activity, just
  a frustrating complete hard lockup? Do you want to help us debug
  such lockups? If so, then this document is definitely for you.
  
  Many x86/x86-64 systems have a feature that enables us to generate
  'watchdog NMI interrupts'.  (NMI: Non-Maskable Interrupt, which gets
  executed even if the system is otherwise locked up hard.)
  This can be used to debug hard kernel lockups.  By executing periodic
  NMI interrupts, the kernel can monitor whether any CPU has locked up,
  and print out debugging messages if so.
  
  In order to use the NMI watchdog, you need to have APIC support in your
  kernel. For SMP kernels, APIC support gets compiled in automatically. For
  UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local
  APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and
  features -> IO-APIC support on uniprocessors) in your kernel config.
  CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC.
  CONFIG_X86_UP_IOAPIC is for uniprocessor machines with an IO-APIC. [Note: certain
  kernel debugging options, such as Kernel Stack Meter or Kernel Tracer,
  may implicitly disable the NMI watchdog.]
  For x86-64, the needed APIC is always compiled in.
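  
  As a rough illustration of the rules above, the following sketch scans
  kernel configuration text for the options that imply APIC support. The
  function names are hypothetical, and it assumes the usual .config
  layout ("CONFIG_FOO=y" for enabled options, "# CONFIG_FOO is not set"
  for disabled ones); it is not part of the kernel tree.

```python
# Hypothetical helper: decide from .config-style text whether the kernel
# was built with the APIC support the NMI watchdog needs.

APIC_OPTIONS = {
    "CONFIG_SMP",            # SMP kernels get APIC support automatically
    "CONFIG_X86_64",         # x86-64 always compiles the APIC in
    "CONFIG_X86_UP_APIC",    # UP, local APIC only
    "CONFIG_X86_UP_IOAPIC",  # UP with an IO-APIC
}


def enabled_options(config_text):
    """Return the set of CONFIG_* symbols set to 'y' in the given text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as comments and are skipped here.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and value == "y":
            enabled.add(name)
    return enabled


def has_apic_support(config_text):
    """True if at least one option implying APIC support is enabled."""
    return bool(enabled_options(config_text) & APIC_OPTIONS)
```

  The text of the configuration file used to build the running kernel
  can be passed in directly; where that file lives depends on the
  distribution.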
  
  Using local APIC (nmi_watchdog=2) needs the first performance register, so
  you can't use it for other purposes (such as high precision performance
  profiling). However, at least oprofile and the perfctr driver disable the
  local APIC NMI watchdog automatically.
  
  To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot
  parameter.  Eg. the relevant lilo.conf entry:
  
          append="nmi_watchdog=1"
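  
  To confirm which value the kernel was actually booted with, the
  command line can be inspected after boot. The sketch below parses a
  command-line string such as the contents of /proc/cmdline; the
  function name is hypothetical, and treating the last occurrence as the
  effective one is an assumption of this example.

```python
def nmi_watchdog_mode(cmdline):
    """Return the N from 'nmi_watchdog=N' in a kernel command line, or None."""
    mode = None
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "nmi_watchdog" and sep and value.isdigit():
            # Assumption: if the parameter is repeated, the last one counts.
            mode = int(value)
    return mode
```

  For example, nmi_watchdog_mode(open("/proc/cmdline").read()) returns
  None when the parameter was not given at all.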
  
  For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1.
  For UP machines without an IO-APIC use nmi_watchdog=2; this only works
  for some processor types.  If in doubt, boot with nmi_watchdog=1 and
  check the NMI count in /proc/interrupts; if the count is zero then
  reboot with nmi_watchdog=2 and check the NMI count.  If it is still
  zero then report the problem; you probably have a processor that needs
  to be added to the nmi code.
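  
  The check described above can be sketched as follows. The helper names
  are hypothetical, and the parser assumes the common /proc/interrupts
  layout in which the "NMI:" label is followed by one count per CPU and
  possibly a trailing description.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at the first non-numeric field (the description).
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line found


def suggest_next_step(counts, current_mode):
    """Mirror the advice in the text for a given nmi_watchdog=N boot."""
    if sum(counts) > 0:
        return "NMI watchdog is ticking"
    if current_mode == 1:
        return "reboot with nmi_watchdog=2"
    return "report the problem"
```

  On a running system the text would come from
  open("/proc/interrupts").read(). With nmi_watchdog=2 the count may
  grow slowly on an idle machine, so it is worth checking it under some
  load.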
  
  A 'lockup' is the following scenario: if any CPU in the system does not
  execute the periodic local timer interrupt for more than 5 seconds, then
  the NMI handler generates an oops and kills the process. This
  'controlled crash' (and the resulting kernel messages) can be used to
  debug the lockup. Thus whenever the lockup happens, wait 5 seconds and
  the oops will show up automatically. If the kernel produces no messages
  then the system has crashed so hard (eg. hardware-wise) that either it
  cannot even accept NMI interrupts, or the crash has made the kernel
  unable to print messages.
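  
  The detection rule can be modelled in a few lines. This is a
  simplified sketch, not the kernel's implementation: the class name and
  the nmi_hz parameter are made up for the example, and it assumes the
  NMI handler can read a per-CPU count of local timer interrupts.

```python
class LockupDetector:
    """Toy model: flag a CPU whose timer count stops advancing."""

    def __init__(self, nmi_hz, timeout_seconds=5):
        # Number of consecutive NMIs without timer progress that
        # corresponds to roughly timeout_seconds of stall.
        self.threshold = nmi_hz * timeout_seconds
        self.last_timer_count = {}
        self.stalled_nmis = {}

    def on_nmi(self, cpu, timer_count):
        """Call once per watchdog NMI; returns True when a lockup is seen."""
        if self.last_timer_count.get(cpu) != timer_count:
            # The timer interrupt ran since the last NMI: CPU is alive.
            self.last_timer_count[cpu] = timer_count
            self.stalled_nmis[cpu] = 0
            return False
        self.stalled_nmis[cpu] += 1
        return self.stalled_nmis[cpu] >= self.threshold
```

  In this model a True result stands for the point at which the real
  handler would generate the oops.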
  
  Be aware that when using the local APIC, the frequency of NMI interrupts
  it generates depends on the system load. The local APIC NMI watchdog,
  lacking a better source, uses the "cycles unhalted" event. As you may
  guess it doesn't tick when the CPU is in the halted state (which happens
  when the system is idle), but if your system locks up on anything but the
  "hlt" processor instruction, the watchdog will trigger very soon as the
  "cycles unhalted" event will happen every clock tick. If it locks up on
  "hlt", then you are out of luck -- the event will not happen at all and the
  watchdog won't trigger. This is a shortcoming of the local APIC watchdog
  -- unfortunately there is no "clock ticks" event that would work all the
  time. The I/O APIC watchdog is driven externally and has no such shortcoming.
  But its NMI frequency is much higher, resulting in a more significant hit
  to the overall system performance.
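  
  The difference between the two sources can be illustrated with a toy
  model. Everything here is an assumption made for the example: time is
  divided into equal slices, each slice is either "run" or "hlt", the
  local APIC source fires when a counter of unhalted cycles overflows,
  and the I/O APIC source fires once per slice regardless of CPU state.

```python
def local_apic_nmis(states, cycles_per_slice, overflow_after):
    """Count NMIs from a 'cycles unhalted' counter over the given slices."""
    counter = 0
    nmis = 0
    for state in states:
        if state == "hlt":
            continue  # halted cycles are not counted at all
        counter += cycles_per_slice
        while counter >= overflow_after:
            counter -= overflow_after
            nmis += 1
    return nmis


def io_apic_nmis(states):
    """The external source fires once per slice, halted or not."""
    return len(states)
```

  A CPU stuck in "hlt" for every slice yields zero NMIs from the first
  function, while the second still reports one per slice.
  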
  On x86 nmi_watchdog is disabled by default so you have to enable it with
  a boot time parameter.

  It's possible to disable the NMI watchdog at run time by writing "0" to
  /proc/sys/kernel/nmi_watchdog. Writing "1" to the same file will re-enable
  the NMI watchdog. Note that you still need to pass the "nmi_watchdog="
  parameter at boot time.
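  
  A minimal sketch of the run-time switch, for use from a script run as
  root. The function names are hypothetical, the path parameter exists
  only so the helpers can be tried against an ordinary file, and reading
  the current value back from the same file is an assumption of this
  example.

```python
NMI_WATCHDOG_PATH = "/proc/sys/kernel/nmi_watchdog"


def set_nmi_watchdog(enabled, path=NMI_WATCHDOG_PATH):
    """Write "1" to enable or "0" to disable the NMI watchdog."""
    with open(path, "w") as f:
        f.write("1" if enabled else "0")


def nmi_watchdog_enabled(path=NMI_WATCHDOG_PATH):
    """Return False if the file currently reads "0", True otherwise."""
    with open(path) as f:
        return f.read().strip() != "0"
```

  For example, set_nmi_watchdog(False) before a profiling run and
  set_nmi_watchdog(True) afterwards.
  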
  NOTE: In kernels prior to 2.4.2-ac18 the NMI-oopser is enabled unconditionally
  on x86 SMP boxes.
  
  [ feel free to send bug reports, suggestions and patches to
    Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing
    list at <linux-smp@vger.kernel.org> ]