23 May, 2011

2 commits

  • Before the conversion of the NMI watchdog to perf event, the
    watchdog timeout was 5 seconds. Now it is 60 seconds. For my
    particular application, netbooks, 5 seconds was a better
    timeout. With a short timeout, we catch faults earlier and are
    able to send back a panic. With a 60 second timeout, the user is
    unlikely to wait and will instead hit the power button, causing
    us to lose the panic info.

    This change configures the NMI period to watchdog_thresh and
    sets the softlockup_thresh to watchdog_thresh * 2. In addition,
    watchdog_thresh was reduced to 10 seconds as suggested by Ingo
    Molnar.

    Signed-off-by: Mandeep Singh Baines
    Cc: Marcin Slusarz
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1306127423-3347-4-git-send-email-msb@chromium.org
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Mandeep Singh Baines
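
The relationship this commit describes can be sketched in a few lines of C. This is only an illustration, not the kernel code: the helper names nmi_period_secs() and softlockup_thresh_secs() are hypothetical, while the 10 second default and the factor of two are taken from the commit message above.

```c
/* Illustrative sketch only; the helper names are hypothetical, not kernel API. */

/* Default threshold in seconds, as stated in the commit message. */
#define WATCHDOG_THRESH_DEFAULT 10

/* The NMI (hard lockup) period is the threshold itself. */
static int nmi_period_secs(int watchdog_thresh)
{
	return watchdog_thresh;
}

/* The soft lockup threshold is twice the watchdog threshold. */
static int softlockup_thresh_secs(int watchdog_thresh)
{
	return watchdog_thresh * 2;
}
```

With the default threshold, a hard lockup would be checked for every 10 seconds and a soft lockup reported after 20 seconds.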
     
  • This restores the previous behavior of softlockup_thresh.

    Currently, setting watchdog_thresh to zero causes the watchdog
    kthreads to consume a lot of CPU.

    In addition, the logic of proc_dowatchdog_thresh and
    proc_dowatchdog_enabled has been factored into proc_dowatchdog.

    Signed-off-by: Mandeep Singh Baines
    Cc: Marcin Slusarz
    Cc: Don Zickus
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1306127423-3347-3-git-send-email-msb@chromium.org
    Signed-off-by: Ingo Molnar
    LKML-Reference:

    Mandeep Singh Baines
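
A rough sketch of the idea follows. It assumes that the "previous behavior" being restored is for a zero threshold to switch the watchdog off rather than to arm it with a zero period; the commit message does not spell this out. The struct and function names are hypothetical and do not reflect the real proc_dowatchdog() signature.

```c
#include <stdbool.h>

/* Hypothetical state; in this sketch a threshold of 0 means "off". */
struct watchdog_cfg {
	bool enabled;	/* stand-in for the "enabled" proc knob */
	int thresh;	/* stand-in for watchdog_thresh, in seconds */
};

/*
 * One combined decision point for both knobs, loosely mirroring the
 * idea of folding two proc handlers into one. A zero (or negative)
 * threshold must never arm the timers, otherwise the watchdog
 * threads would be woken back to back and burn CPU.
 */
static bool watchdog_should_run(const struct watchdog_cfg *cfg)
{
	return cfg->enabled && cfg->thresh > 0;
}
```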
     

23 Dec, 2010

1 commit

  • The x86 arch has shifted its use of the nmi_watchdog from a
    local implementation to the global one provided by
    kernel/watchdog.c. This shift has caused a whole bunch of
    compile problems under different config options. I attempt to
    simplify things with the patch below.

    In order to simplify things, I had to come to terms with the
    meaning of two terms ARCH_HAS_NMI_WATCHDOG and
    CONFIG_HARDLOCKUP_DETECTOR. Basically they mean the same thing,
    the former on a local level and the latter on a global level.

    With the old x86 nmi watchdog gone, there is no need to rely on
    defining the ARCH_HAS_NMI_WATCHDOG variable because it doesn't
    make sense any more. x86 will now use the global
    implementation.

    The changes below do a few things. First it changes the few
    places that relied on ARCH_HAS_NMI_WATCHDOG to use
    CONFIG_X86_LOCAL_APIC (the former was an alias for the latter
    anyway, so nothing unusual here). Those pieces of code were
    relying more on local apic functionality than the nmi watchdog
    functionality, so the change should make sense.

    Second, I removed the x86 implementation of
    touch_nmi_watchdog(). It isn't needed now; instead x86 will rely
    on kernel/watchdog.c's implementation.

    Third, I removed the #define ARCH_HAS_NMI_WATCHDOG itself from
    x86 and tweaked the include/linux/nmi.h file to tell users to
    look for an externally defined touch_nmi_watchdog in the case of
    ARCH_HAS_NMI_WATCHDOG _or_ CONFIG_HARDLOCKUP_DETECTOR. This
    change removes some of the ugliness in that file.

    Finally, I added a Kconfig dependency for
    CONFIG_HARDLOCKUP_DETECTOR that said you can't have
    ARCH_HAS_NMI_WATCHDOG _and_ CONFIG_HARDLOCKUP_DETECTOR. You can
    only have one nmi_watchdog.

    Tested with
    ARCH=i386: allnoconfig, defconfig, allyesconfig, (various broken
    configs)
    ARCH=x86_64: allnoconfig, defconfig, allyesconfig, (various broken
    configs)

    Hopefully, after this patch I won't get any more compile broken
    emails. :-)

    v3:
    changed a couple of 'linux/nmi.h' -> 'asm/nmi.h' to pick-up correct function
    prototypes when CONFIG_HARDLOCKUP_DETECTOR is not set.

    Signed-off-by: Don Zickus
    Cc: Peter Zijlstra
    Cc: fweisbec@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
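
The header logic described in this commit can be approximated as below. This is a simplified sketch, not the real include/linux/nmi.h: the #error line is only a stand-in for the Kconfig dependency, the counter exists purely so the fallback can be observed, and the fallback body (touching the softlockup watchdog) follows the behavior described in the 30 Sep, 2006 entry further down.

```c
/* Simplified sketch of the described header logic; not the kernel header. */

/* Define at most one of these to model an externally provided watchdog. */
/* #define ARCH_HAS_NMI_WATCHDOG */
/* #define CONFIG_HARDLOCKUP_DETECTOR */

/* Stand-in for the Kconfig rule: only one nmi_watchdog may exist. */
#if defined(ARCH_HAS_NMI_WATCHDOG) && defined(CONFIG_HARDLOCKUP_DETECTOR)
#error "ARCH_HAS_NMI_WATCHDOG and CONFIG_HARDLOCKUP_DETECTOR are exclusive"
#endif

/* Hypothetical counter so the fallback path is observable in a test. */
static int softlockup_touches;

static void touch_softlockup_watchdog(void)
{
	softlockup_touches++;
}

#if defined(ARCH_HAS_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
/* Either the arch or kernel/watchdog.c supplies the real definition. */
extern void touch_nmi_watchdog(void);
#else
/* No NMI watchdog at all: fall back to touching the softlockup one. */
static inline void touch_nmi_watchdog(void)
{
	touch_softlockup_watchdog();
}
#endif
```

With neither symbol defined, callers still get a working touch_nmi_watchdog() without needing any #ifdef of their own.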
     

10 Dec, 2010

1 commit


18 Nov, 2010

2 commits

  • Now that the bulk of the old nmi_watchdog is gone, remove all
    the stub variables and hooks associated with it.

    This touches lots of files mainly because of how the io_apic
    nmi_watchdog was implemented. Now that the io_apic nmi_watchdog
    is forever gone, remove all its fingers.

    Most of this code was not being exercised by virtue of
    nmi_watchdog != NMI_IO_APIC, so there shouldn't be anything too
    risky here.

    Signed-off-by: Don Zickus
    Cc: fweisbec@gmail.com
    Cc: gorcunov@openvz.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     
  • Now that we have a new nmi_watchdog that is more generic and
    sits on top of the perf subsystem, we really do not need the old
    nmi_watchdog any more.

    In addition, the old nmi_watchdog doesn't really work if you are
    using the default clocksource, hpet. The old nmi_watchdog code
    relied on local apic interrupts to determine if the cpu is still
    alive. With hpet as the clocksource, these interrupts don't
    increment any more and the old nmi_watchdog triggers false
    positives.

    This piece removes the old nmi_watchdog code and stubs out any
    variables and function calls. The stubs are the same ones used
    by the new nmi_watchdog code, so it should be well tested.

    Signed-off-by: Don Zickus
    Cc: fweisbec@gmail.com
    Cc: gorcunov@openvz.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     

16 May, 2010

1 commit

  • Combining the softlockup and hardlockup code causes watchdog.c
    to build even without the hardlockup detection support.

    So if an arch, that has the previous and the new nmi watchdog
    implementations cohabiting, wants to know if the generic one
    is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check.
    We need to use CONFIG_HARDLOCKUP_DETECTOR instead.

    Fixes:
    kernel/built-in.o: In function `touch_nmi_watchdog':
    (.text+0x449bc): multiple definition of `touch_nmi_watchdog'
    arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here

    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Don Zickus
    Cc: Cyrill Gorcunov
    LKML-Reference:
    [ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI]
    Signed-off-by: Frederic Weisbecker

    Don Zickus
     

13 May, 2010

1 commit

  • The new nmi_watchdog (which uses the perf event subsystem) is very
    similar in structure to the softlockup detector. Using Ingo's
    suggestion, I combined the two functionalities into one file:
    kernel/watchdog.c.

    Now both the nmi_watchdog (or hardlockup detector) and softlockup
    detector sit on top of the perf event subsystem, which is run every
    60 seconds or so to see if there are any lockups.

    To detect hardlockups, cpus not responding to interrupts, I
    implemented an hrtimer that runs 5 times for every perf event
    overflow event. If that stops counting on a cpu, then the cpu is
    most likely in trouble.

    To detect softlockups, tasks not yielding to the scheduler, I used the
    previous kthread idea that now gets kicked every time the hrtimer fires.
    If the kthread isn't being scheduled, neither is anyone else, and
    the warning is printed to the console.

    I tested this on x86_64 and both the softlockup and hardlockup paths
    work.

    V2:
    - cleaned up the Kconfig and softlockup combination
    - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
    - separated out the softlockup case from perf event subsystem
    - re-arranged the enabling/disabling nmi watchdog from proc space
    - added cpumasks for hardlockup failure cases
    - removed fallback to soft events if no PMU exists for hard events

    V3:
    - comment cleanups
    - drop support for older softlockup code
    - per_cpu cleanups
    - completely remove software clock base hardlockup detector
    - use per_cpu masking on hard/soft lockup detection
    - #ifdef cleanups
    - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR
    - documentation additions

    V4:
    - documentation fixes
    - convert per_cpu to __get_cpu_var
    - powerpc compile fixes

    V5:
    - split apart warn flags for hard and soft lockups

    TODO:
    - figure out how to make an arch-agnostic clock2cycles call
    (if possible) to feed into perf events as a sample period

    [fweisbec: merged conflict patch]

    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Randy Dunlap
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Don Zickus
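
The two detection paths described above can be modelled in plain C. The sketch below is a user-space illustration under stated assumptions, not the code in kernel/watchdog.c: the struct layout and all function names are hypothetical, time is passed in as a plain number of seconds, and the perf and hrtimer machinery is reduced to ordinary function calls.

```c
#include <stdbool.h>

/* Hypothetical per-cpu bookkeeping for this sketch. */
struct cpu_watchdog {
	unsigned long hrtimer_interrupts;	/* bumped on every hrtimer tick */
	unsigned long hrtimer_saved;		/* value seen at the last overflow */
	unsigned long touch_ts;			/* last time the kthread ran */
};

/* hrtimer callback: proves the cpu still services interrupts. */
static void watchdog_timer_tick(struct cpu_watchdog *wd)
{
	wd->hrtimer_interrupts++;
}

/* kthread body: being scheduled at all refreshes the timestamp. */
static void watchdog_kthread_ran(struct cpu_watchdog *wd, unsigned long now)
{
	wd->touch_ts = now;
}

/*
 * Called on each perf event overflow (the NMI). If the hrtimer count
 * has not moved since the previous overflow, interrupts are not being
 * serviced on this cpu: a hard lockup.
 */
static bool is_hardlockup(struct cpu_watchdog *wd)
{
	if (wd->hrtimer_interrupts == wd->hrtimer_saved)
		return true;
	wd->hrtimer_saved = wd->hrtimer_interrupts;
	return false;
}

/* A stale timestamp means nothing got scheduled: a soft lockup. */
static bool is_softlockup(const struct cpu_watchdog *wd,
			  unsigned long now, unsigned long thresh)
{
	return now > wd->touch_ts + thresh;
}
```

In this model the hrtimer would tick several times between two overflow checks, so a single missed tick does not raise an alarm; only a counter that stays completely still does.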
     

25 Feb, 2010

1 commit

  • Mostly copy/paste whitespace damage with a couple of nitpicks by
    the checkpatch script. Fix the struct definition as requested by Ingo too.

    Signed-off-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: gorcunov@gmail.com
    Cc: aris@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    --
    arch/x86/kernel/apic/hw_nmi.c | 14 +++++------
    arch/x86/kernel/traps.c | 6 ++--
    include/linux/nmi.h | 2 -
    kernel/nmi_watchdog.c | 51 ++++++++++++++++++++----------------------
    4 files changed, 36 insertions(+), 37 deletions(-)

    Don Zickus
     

14 Feb, 2010

1 commit

  • The original patch was x86_64 centric. Changed the code to make
    it less so.

    Tested by building and running on a powerpc.

    Signed-off-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: gorcunov@gmail.com
    Cc: aris@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     

08 Feb, 2010

1 commit

  • These are the bits that enable the new nmi_watchdog and safely
    isolate the old nmi_watchdog. Only one or the other can run,
    not both at the same time.

    Signed-off-by: Don Zickus
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: gorcunov@gmail.com
    Cc: aris@redhat.com
    Cc: peterz@infradead.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     

03 Aug, 2009

1 commit

  • As Andrew noted, my previous patch ("debug lockups: Improve lockup
    detection") broke/removed SysRq-L support from architectures that do
    not provide a __trigger_all_cpu_backtrace implementation.

    Restore a fallback path and clean up the SysRq-L machinery a bit:

    - Rename the arch method to arch_trigger_all_cpu_backtrace()

    - Simplify the define

    - Document the method a bit - in the hope of more architectures
    adding support for it.

    [ The patch touches Sparc code for the rename. ]

    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: "David S. Miller"
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

13 Feb, 2007

1 commit

  • During kernel bootup, a new T60 laptop (CoreDuo, 32-bit) hangs about
    10%-20% of the time in acpi_init():

    Calling initcall 0xc055ce1a: topology_init+0x0/0x2f()
    Calling initcall 0xc055d75e: mtrr_init_finialize+0x0/0x2c()
    Calling initcall 0xc05664f3: param_sysfs_init+0x0/0x175()
    Calling initcall 0xc014cb65: pm_sysrq_init+0x0/0x17()
    Calling initcall 0xc0569f99: init_bio+0x0/0xf4()
    Calling initcall 0xc056b865: genhd_device_init+0x0/0x50()
    Calling initcall 0xc056c4bd: fbmem_init+0x0/0x87()
    Calling initcall 0xc056dd74: acpi_init+0x0/0x1ee()

    It's a hard hang that not even an NMI could punch through! Frustratingly,
    adding printks or function tracing to the ACPI code made the hangs go away
    ...

    After some time an additional detail emerged: disabling the NMI watchdog
    made these occasional hangs go away.

    So i spent the better part of today trying to debug this and trying out
    various theories when i finally found the likely reason for the hang: if
    acpi_ns_initialize_devices() executes an _INI AML method and an NMI
    happens to hit that AML execution at the wrong moment, the machine would
    hang. (my theory is that this must be some sort of chipset setup method
    doing stores to chipset mmio registers?)

    Unfortunately given the characteristics of the hang it was sheer
    impossible to figure out which of the numerous AML methods is impacted
    by this problem.

    As a workaround i wrote an interface to disable chipset-based NMIs while
    executing _INI sections - and indeed this fixed the hang. I did a
    boot-loop of 100 separate reboots and none hung - while without the patch
    it would hang every 5-10 attempts. Out of caution i did not touch the
    nmi_watchdog=2 case (it's not related to the chipset anyway and didn't
    hang).

    I implemented this for both x86_64 and i686, tested the i686 laptop both
    with nmi_watchdog=1 [which triggered the hangs] and nmi_watchdog=2, and
    tested an Athlon64 box with the 64-bit kernel as well. Everything builds
    and works with the patch applied.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andi Kleen
    Cc: Andi Kleen
    Cc: Len Brown
    Signed-off-by: Andrew Morton

    Ingo Molnar
     

07 Dec, 2006

1 commit

  • When a spinlock lockup occurs, arrange for the NMI code to emit an all-cpu
    backtrace, so we get to see which CPU is holding the lock, and where.

    Cc: Andi Kleen
    Cc: Ingo Molnar
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Andi Kleen

    Andrew Morton
     

30 Sep, 2006

1 commit

  • touch_nmi_watchdog() calls touch_softlockup_watchdog() on both
    architectures that implement it (i386 and x86_64). On other architectures
    it does nothing at all. touch_nmi_watchdog() should imply
    touch_softlockup_watchdog() on all architectures. Suggested by Andi Kleen.

    [heiko.carstens@de.ibm.com: s390 fix]
    Signed-off-by: Michal Schmidt
    Cc: Andi Kleen
    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Cc: Michal Schmidt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Schmidt
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds