11 Jun, 2009

2 commits

  • This patch introduces three boot options (no_cmci, dont_log_ce
    and ignore_ce) to control handling for corrected errors.

    The "mce=no_cmci" boot option disables the CMCI feature.

    Since CMCI is a new feature, having a boot control to disable
    it will help if the hardware is misbehaving.

    The "mce=dont_log_ce" boot option disables logging for corrected
    errors. All reported corrected errors will be cleared silently.
    This option will be useful if you never care about corrected
    errors.

    The "mce=ignore_ce" boot option disables features for corrected
    errors, i.e. the polling timer and CMCI. Corrected events are
    not cleared and remain in the bank MSRs.

    Disabling these features is usually not recommended, but it
    helps when there is a conflict with the BIOS, hardware
    monitoring applications, etc. that clear corrected events in
    the banks instead of the OS.

    [ And trivial cleanup (space -> tab) for doc is included. ]

    Signed-off-by: Hidetoshi Seto
    Reviewed-by: Andi Kleen
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
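
The three boot options described in the commit above could be modelled roughly as follows. This is a minimal sketch, not the kernel's parser: the struct `mce_ce_opts`, its field names, and the helper functions are hypothetical names chosen for illustration. Only the option strings themselves come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag set; the field names are illustrative only. */
struct mce_ce_opts {
    bool cmci_disabled;  /* mce=no_cmci: do not use CMCI */
    bool dont_log_ce;    /* mce=dont_log_ce: clear corrected errors silently */
    bool ignore_ce;      /* mce=ignore_ce: no polling timer and no CMCI */
};

/* Apply the value of one "mce=<value>" argument; false if unknown. */
bool mce_parse_option(struct mce_ce_opts *o, const char *val)
{
    assert(o && val);
    if (strcmp(val, "no_cmci") == 0)
        o->cmci_disabled = true;
    else if (strcmp(val, "dont_log_ce") == 0)
        o->dont_log_ce = true;
    else if (strcmp(val, "ignore_ce") == 0)
        o->ignore_ce = true;
    else
        return false;
    return true;
}

/* CMCI is used only if neither no_cmci nor ignore_ce was given. */
bool mce_cmci_enabled(const struct mce_ce_opts *o)
{
    return !o->cmci_disabled && !o->ignore_ce;
}

/* With ignore_ce, corrected events are left untouched in the bank MSRs. */
bool mce_should_clear_ce(const struct mce_ce_opts *o)
{
    return !o->ignore_ce;
}
```

Keeping the three settings as separate flags mirrors the text: ignore_ce implies that CMCI is off, but no_cmci alone leaves the polling timer running.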
     
  • This patch:

    - Adds print_mce_head() instead of first flag
    - Makes the header always be printed
    - Stops double printing of corrected errors

    [ This portion originates from Huang Ying's patch ]

    Originally-From: Huang Ying
    Signed-off-by: Hidetoshi Seto
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

04 Jun, 2009

23 commits

  • Make the MCE counters work on 32bit and add poll count in
    arch_irq_stat_cpu.

    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Hidetoshi Seto
     
  • Newer Intel CPUs support a new class of machine checks called recoverable
    action optional.

    Action Optional means that the CPU detected some form of corruption in
    the background and tells the OS about it using a machine check
    exception. The OS can then take appropriate action, like killing the
    process with the corrupted data or logging the event properly to disk.

    This is done by the new generic high level memory failure handler added
    in an earlier patch. The high level handler takes the address of the
    failed memory and takes the appropriate action, like killing the process.

    In this version of the patch the high level handler is stubbed out
    with a weak function to not create a direct dependency on the hwpoison
    branch.

    The high level handler cannot be directly called from the machine check
    exception though, because it has to run in a defined process context to
    be able to sleep when taking VM locks (it is not expected to sleep for a
    long time, just do so in some exceptional cases like lock contention).

    Thus the MCE handler has to queue a work item for process context,
    trigger process context and then call the high level handler from there.

    This patch adds two paths to process context: through a per thread kernel
    exit notify_user() callback or through a high priority work item.
    The first runs when the process exits back to user space, the other when
    it goes to sleep and there is no higher priority process.

    The machine check handler will schedule both, and whoever runs first
    will grab the event. This is done because quick reaction to this
    event is critical to avoid a potentially more fatal machine check
    when the corruption is consumed.

    There is a simple lockless ring buffer to queue the corrupted
    addresses between the exception handler and the process context handler.
    Then in process context it just calls the high level VM code with
    the corrupted PFNs.

    The patch adds the code required to extract the failed address from
    the CPU's machine check registers. It doesn't try to handle all
    possible cases -- the specification has 6 different ways to specify
    a memory address -- but only the linear address.

    Most of the required checking has been already done earlier in the
    mce_severity rule checking engine. Following the Intel
    recommendations Action Optional errors are only enabled for known
    situations (encoded in MCACODs). The errors are ignored otherwise,
    because they are action optional.

    v2: Improve comment, disable preemption while processing ring buffer
    (reported by Ying Huang)

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
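
The lockless ring buffer between the exception handler and the process context handler, as described in the commit above, might look something like the following. This is a sketch under stated assumptions: it uses C11 atomics instead of kernel primitives, the names `pfn_ring`, `ring_add` and `ring_get` and the size 16 are made up for illustration, and it assumes a single producer and one consumer at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 16  /* illustrative size; must be a power of two */

struct pfn_ring {
    atomic_uint head;              /* next slot to read (consumer side) */
    atomic_uint tail;              /* next slot to write (producer side) */
    unsigned long pfn[RING_SIZE];
};

/* Producer: called from the exception handler, never blocks. */
bool ring_add(struct pfn_ring *r, unsigned long pfn)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false;              /* full: the event is dropped */
    r->pfn[tail % RING_SIZE] = pfn;
    /* Publish the slot only after its contents have been written. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: called from process context to fetch one corrupted PFN. */
bool ring_get(struct pfn_ring *r, unsigned long *pfn)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    assert(pfn);
    if (head == tail)
        return false;              /* empty */
    *pfn = r->pfn[head % RING_SIZE];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

The producer never waits: if the ring is full the address is simply dropped, which matches the constraint that the exception handler cannot sleep.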
     
  • Add MCE_VECTOR for the #MC exception.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Rename the mce_notify_user function to mce_notify_irq. The next
    patch will split the wakeup handling of interrupt context
    and of process context and it's better to give it a clearer
    name for this.

    Contains a fix from Ying Huang

    [ Impact: cleanup ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Cc: Huang Ying
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • For some time each panic() called with interrupts disabled
    triggered the !irqs_disabled() WARN_ON in smp_call_function(),
    producing ugly backtraces and confusing users.

    This is a common situation with machine checks for example which
    tend to call panic with interrupts disabled, but will also hit
    in other situations e.g. panic during early boot. In fact it
    means that panic cannot be called in many circumstances, which
    would be bad.

    This all started with the new fancy queued smp_call_function,
    which is then used by the shutdown path to shut down the other
    CPUs.

    On closer examination it turned out that the fancy RCU
    smp_call_function() does lots of things not suitable in a panic
    situation anyways, like allocating memory and relying on complex
    system state.

    I originally tried to patch this over by checking for panic
    there, but it was quite complicated and the original patch
    was also not very popular. This also didn't fix some of the
    underlying complexity problems.

    The new code in post 2.6.29 tries to patch around this by
    checking for oops_in_progress, but that is not enough to make
    this fully safe and I don't think that's a real solution
    because panic has to be reliable.

    So instead use a dedicated vector to reboot. This makes the reboot
    code extremely straightforward, which is definitely a big plus
    in a panic situation where it is important to avoid relying on
    too much kernel state. The new simple code is also safe to be
    called from an interrupts-off region because it is very very simple.

    There can be situations where it is important that panic
    is reliable. For example on a fatal machine check the panic
    is needed to get the system up again and running as quickly
    as possible. So it's important that panic is reliable and
    all functions it calls are simple.

    This is why I came up with this simple vector scheme.
    It's very hard to beat in simplicity. Vectors are not
    particularly precious anymore since all big systems are
    using per CPU vectors.

    Another possibility would have been to use an NMI similar
    to kdump, but there is still the problem that NMIs don't
    work reliably on some systems due to BIOS issues. NMIs
    would have been able to stop CPUs running with interrupts
    off too. For the sake of universal reliability I opted for
    using a non NMI vector for now.

    I put the reboot vector into the highest priority bucket of
    the APIC vectors and moved the 64bit UV_BAU message down
    instead into the next lower priority.

    [ Impact: bug fix, fixes an old regression ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • The MCE severity judgement code is data-driven, so code coverage tools
    such as gcov cannot be used for measuring coverage. Instead a dedicated
    coverage mechanism is implemented. The kernel keeps track of rules
    executed and reports them in debugfs.

    This is useful for increasing coverage of the mce-test testsuite.

    Right now it's unconditionally enabled because it's very little code.

    Signed-off-by: Huang Ying
    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Huang Ying
     
  • The x86 architecture recently added some new machine check status bits:
    S(ignalled) and AR (Action-Required). Signalled allows checking
    whether a specific event caused an exception or was just logged through CMCI.
    AR allows the kernel to decide if an event needs immediate action
    or can be delayed or ignored.

    Implement support for these new status bits. mce_severity() uses
    the new bits to grade the machine check correctly and decide what
    to do. The exception handler uses AR to decide to kill or not.
    The S bit is used to separate events between the poll/CMCI handler
    and the exception handler.

    Classical UC always leads to panic. That was true before anyways
    because the existing CPUs always passed a PCC with it.

    Also corrects the rules whether to kill in user or kernel context
    and how to handle missing RIPV.

    The machine check handler largely uses the mce-severity grading
    engine now instead of making its own decisions. This means the logic
    is centralized in one place. This is useful because it has to be
    evaluated multiple times.

    v2: Some rule fixes; Add AO events
    Fix RIPV, RIPV|EIPV order (Ying Huang)
    Fix UCNA with AR=1 message (Ying Huang)
    Add comment about panicking in m_c_p.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
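
How the S and AR bits from the commit above split the work could be sketched as below. The bit positions follow the x86 machine check status register layout. The enum and function names are hypothetical, and the rule "S and AR both set means immediate action" is a simplification that assumes a CPU which implements these bits. The real grading goes through the severity rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)  /* register holds a valid event */
#define MCI_STATUS_S   (1ULL << 56)  /* event was signalled by an exception */
#define MCI_STATUS_AR  (1ULL << 55)  /* action required */

enum mce_owner { OWNER_NONE, OWNER_POLL, OWNER_EXCEPTION };

/* Which handler should pick up the event found in this bank? */
enum mce_owner mce_event_owner(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return OWNER_NONE;
    return (status & MCI_STATUS_S) ? OWNER_EXCEPTION : OWNER_POLL;
}

/* Simplified: immediate action only for signalled, action-required events. */
bool mce_needs_immediate_action(uint64_t status)
{
    assert(status & MCI_STATUS_VAL);
    return (status & MCI_STATUS_S) && (status & MCI_STATUS_AR);
}
```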
     
  • When multiple MCEs are printed print the "HARDWARE ERROR" header
    and "This is not a software error" footer only once. This
    makes the output much more compact with many CPUs.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Fatal machine checks can be logged to disk after boot, but only if
    the system did a warm reboot. That's unfortunately difficult with the
    default panic behaviour, which waits forever, so the admin has to
    press the power button because modern systems usually lack a reset button.
    This clears the machine checks in the registers and makes
    it impossible to log them.

    This patch changes the default for machine check panic to always
    reboot after 30s. Then the mce can be successfully logged after
    reboot.

    I believe this will improve machine check experience for any
    system running the X server.

    This is dependent on successful boot logging of MCEs. This currently
    only works on Intel systems, on AMD there are quite a lot of systems
    around which leave junk in the machine check registers after boot,
    so it's disabled here. These systems will continue to default
    to endless waiting panic.

    v2: Only force panic timeout when it's shorter (H.Seto)
    v3: Only force timeout when there is no timeout
    (based on comment H.Seto)

    [ Fix changelog - HS ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Assume IP on the stack is valid when either EIPV or RIPV is set.
    This influences whether the machine check exception handler decides
    to return or panic.

    This fixes a test case in the mce-test suite and is more compliant
    with the specification.

    This currently only makes a difference in an artificial testing
    scenario with the mce-test test suite.

    In addition, do not force EIPV to be valid when the exact
    register MSRs are used, and keep trusting the CS value on the stack
    even if the MSR is available.

    [AK: combination of patches from Huang Ying and Hidetoshi Seto, with
    new description by me]
    [add some description, no code changed - HS]

    Signed-off-by: Huang Ying
    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Huang Ying
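
The rule from the commit above, that the IP on the stack is trusted when either EIPV or RIPV is set, can be written down as a small check. The two bit positions are those of the machine check global status register. The struct `mce_ip` and the function name are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP on the stack is valid */
#define MCG_STATUS_EIPV (1ULL << 1)  /* error IP on the stack is valid */

struct mce_ip {
    uint64_t ip;
    uint16_t cs;
};

/* Copy IP and CS from the stack frame only when the CPU marks the IP valid. */
bool mce_take_stack_ip(uint64_t mcgstatus, uint64_t stack_ip,
                       uint16_t stack_cs, struct mce_ip *out)
{
    assert(out);
    if (!(mcgstatus & (MCG_STATUS_RIPV | MCG_STATUS_EIPV))) {
        out->ip = 0;
        out->cs = 0;
        return false;
    }
    out->ip = stack_ip;
    out->cs = stack_cs;  /* CS is taken from the stack in both cases */
    return true;
}
```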
     
  • ... instead of "Machine check". This is for consistency with the Monarch
    panic message.

    Based on a report from Ying Huang.

    v2: But add a descriptive postfix so that the test suite can distinguish.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • On Intel platforms machine check exceptions are always broadcast to
    all CPUs. This patch makes the machine check handler synchronize all
    these machine checks, elect a Monarch to handle the event and collect
    the worst event from all CPUs and then process it first.

    This has some advantages:

    - When there is a truly data corrupting error the system panics as
    quickly as possible. This improves containment of corrupted
    data and makes sure the corrupted data never hits stable storage.

    - The panics are synchronized and do not reenter the panic code
    on multiple CPUs (which currently does not handle this well).

    - All the errors are reported. Currently it often happens that
    another CPU happens to do the panic first, but reports useless
    information (empty machine check) because the real error
    happened on another CPU which came in later.
    This is a big advantage on Nehalem where the 8 threads per CPU
    often lead to the wrong CPU winning the race and dumping
    useless information on a machine check. The problem also occurs
    in a less severe form on older CPUs.

    - The system can detect when no CPUs detected a machine check
    and shut down the system. This can happen when one CPU is so
    badly hung that it cannot process a machine check anymore
    or when some external agent wants to stop the system by
    asserting the machine check pin. This follows Intel hardware
    recommendations.

    - This matches the recommended error model by the CPU designers.

    - The events can be output in true severity order

    - When a panic happens on another CPU it makes sure to
    actually be able to process the stop IPI by enabling interrupts.

    The code is extremely careful to handle timeouts while waiting
    for other CPUs. It can't rely on the normal timing mechanisms
    (jiffies, ktime_get) because of its asynchronous/lockless nature,
    so it uses its own timeouts using ndelay() and a "SPINUNIT".

    The timeout is configurable. By default it waits for up to one
    second for the other CPUs. This can also be disabled.

    From some informal testing AMD systems do not seem to broadcast
    machine checks, so right now it's always disabled by default on
    non Intel CPUs or also on very old Intel systems.

    Includes fixes from Ying Huang
    Fixed a "ecception" in a comment (H.Seto)
    Moved global_nwo reset later based on suggestion from H.Seto
    v2: Avoid duplicate messages

    [ Impact: feature, fixes long standing problems. ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
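
The self-made timeout described in the commit above (a budget that is consumed in SPINUNIT steps instead of reading jiffies or ktime_get) could be sketched like this. The value chosen for `SPINUNIT`, the stub standing in for the kernel's ndelay(), and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPINUNIT 100  /* nanoseconds per spin step; value is an assumption */

/* Stand-in for the kernel's ndelay(); does nothing in this sketch. */
static void ndelay_stub(unsigned long ns)
{
    (void)ns;
}

/* Consume one SPINUNIT of the budget; true once the budget is used up. */
bool spin_timed_out(int64_t *budget_ns)
{
    assert(budget_ns);
    if (*budget_ns < SPINUNIT)
        return true;
    *budget_ns -= SPINUNIT;
    ndelay_stub(SPINUNIT);
    return false;
}

/* Wait until *arrived reaches ncpus, or give up when the budget runs out. */
bool wait_for_cpus(const volatile int *arrived, int ncpus, int64_t timeout_ns)
{
    int64_t budget = timeout_ns;

    while (*arrived < ncpus) {
        if (spin_timed_out(&budget))
            return false;
    }
    return true;
}
```

Because the budget is only decremented locally, the loop needs no shared clock and no locks, which is the property the commit message asks for.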
     
  • In some circumstances multiple CPUs can enter mce_panic() in parallel.
    This gives quite confused output because they will all dump the same
    machine check buffer.

    The other problem is that they would all panic in parallel, but not
    process each other's shutdown IPIs because interrupts are disabled.

    Detect this situation early on in mce_panic(). The first CPU
    entering will do the panic, the others will just wait to be killed.

    For paranoia reasons in case the other CPU dies during the MCE I added
    a 5 second timeout. If it expires each CPU will panic on its own again.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
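
The "first CPU panics, the others wait" scheme from the commit above might be sketched with an atomic counter as follows. This uses C11 atomics rather than kernel primitives, and the names `panic_claim` and `panic_wait_expired` as well as the spin-count budget are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int panic_entered;  /* how many CPUs reached the panic path */

/* Returns true for exactly one caller: the CPU that performs the panic. */
bool panic_claim(void)
{
    return atomic_fetch_add(&panic_entered, 1) == 0;
}

/*
 * Losing CPUs spin here waiting to be stopped by the winner.
 * Returns true if the budget ran out, in which case the caller
 * should fall back to panicking on its own.
 */
bool panic_wait_expired(atomic_bool *stopped, long spins)
{
    assert(stopped);
    while (spins-- > 0) {
        if (atomic_load(stopped))
            return false;
    }
    return true;
}
```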
     
  • Machine checks support waking up the mcelog daemon quickly.

    The original wake up code for this was pretty ugly, relying on
    an idle notifier and a special process flag. The reason it did
    it this way is that the machine check handler is not subject
    to normal interrupt locking rules so it's not safe
    to call wake_up(). Instead it set a process flag
    and then either did the wakeup in the syscall return
    or in the idle notifier.

    This patch adds a new "bootstrapping" method as replacement.

    The idea is that the handler checks if it's in a state where
    it is unsafe to call wake_up(). If it's safe it calls it directly.
    When it's not safe -- that is, it interrupted a critical
    section with interrupts disabled -- it uses a new "self IPI" to trigger
    an IPI to its own CPU. This can be done safely because IPI
    triggers are atomic with some care. The IPI is raised
    once the interrupts are reenabled and can then safely call
    wake_up().

    When APICs are disabled the event is just queued and will be picked up
    eventually by the next polling timer. I think that's a reasonable
    compromise, since it should only happen quite rarely.

    Contains fixes from Ying Huang.

    [ solve conflict on irqinit, make it work on 32bit (entry_arch.h) - HS ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
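
The three-way decision described in the commit above (call wake_up() directly, raise a self IPI, or leave the event for the next poll) can be summarised in a few lines. The interrupt-enable bit is bit 9 of EFLAGS. The struct `irq_ctx`, the enum, and the function name are hypothetical, and testing only that one flag is a simplification of "safe to call wake_up()".

```c
#include <assert.h>
#include <stdbool.h>

#define X86_EFLAGS_IF (1UL << 9)  /* interrupt-enable flag in EFLAGS */

enum notify_path { NOTIFY_DIRECT, NOTIFY_SELF_IPI, NOTIFY_NEXT_POLL };

/* Hypothetical snapshot of the interrupted context. */
struct irq_ctx {
    unsigned long flags;  /* EFLAGS of the interrupted code */
    bool apic_enabled;    /* whether a self IPI can be sent at all */
};

enum notify_path mce_pick_notify_path(const struct irq_ctx *ctx)
{
    assert(ctx);
    if (ctx->flags & X86_EFLAGS_IF)
        return NOTIFY_DIRECT;    /* interrupts were on: wake up right away */
    if (ctx->apic_enabled)
        return NOTIFY_SELF_IPI;  /* fires once interrupts are reenabled */
    return NOTIFY_NEXT_POLL;     /* queued until the polling timer runs */
}
```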
     
  • The exception handler should behave differently if the exception is
    fatal versus one that can be returned from. In the first case it should
    never clear any registers because these need to be preserved
    for logging after the next boot. Otherwise it should clear them
    on each CPU step by step so that other CPUs sharing the same bank don't
    see duplicate events. Otherwise we risk reporting events multiple
    times on any CPUs which have shared machine check banks, which
    is a common problem on Intel Nehalem which has both SMT (two
    CPU threads sharing banks) and shared machine check banks in the uncore.

    Determine early in a special pass if any event requires a panic.
    This uses the mce_severity() function added earlier.

    This is needed for the next patch.

    Also fixes a problem together with an earlier patch
    that corrected events weren't logged on a fatal MCE.

    [ Impact: Feature ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
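
The two-pass idea from the commit above (decide first whether any event is fatal, then clear banks only if the machine survives) could be sketched as below. The status bit positions are those of the machine check bank status register. The helper `bank_is_fatal` is a simplified stand-in for the real severity grading, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)  /* bank holds a valid event */
#define MCI_STATUS_UC  (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */

/* Simplified stand-in for the severity check. */
static bool bank_is_fatal(uint64_t status)
{
    uint64_t fatal = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_PCC;

    return (status & fatal) == fatal;
}

/* Pass 1: look at every bank before touching any of them. */
bool mce_scan_no_way_out(const uint64_t *banks, size_t nbanks)
{
    assert(banks || nbanks == 0);
    for (size_t i = 0; i < nbanks; i++)
        if (bank_is_fatal(banks[i]))
            return true;
    return false;
}

/* Pass 2: clear valid events only when no panic is coming. */
size_t mce_clear_banks(uint64_t *banks, size_t nbanks, bool no_way_out)
{
    size_t cleared = 0;

    if (no_way_out)
        return 0;  /* keep everything for logging after the next boot */
    for (size_t i = 0; i < nbanks; i++) {
        if (banks[i] & MCI_STATUS_VAL) {
            banks[i] = 0;
            cleared++;
        }
    }
    return cleared;
}
```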
     
  • The machine check grading (as in deciding what should be done for a given
    register value) has to be done multiple times soon and it's also getting
    more complicated.
    So it makes sense to consolidate it into a single function. To get smaller
    and more straightforward and possibly more extensible code I opted for
    a new table driven method. The various rules are put into a table
    which is then executed by a very simple interpreter.

    The grading engine is in a new file mce-severity.c. I also added a private
    include file mce-internal.h, because mce.h is already a bit too cluttered.

    This is dead code right now, but will be used in follow-on patches.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
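
A table-driven grader of the kind described in the commit above might look like this in miniature. The rule table here is an illustrative assumption with only four entries, and the severity names are made up. It is not the contents of mce-severity.c. The status bit positions are the architectural ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_VAL (1ULL << 63)
#define MCI_STATUS_UC  (1ULL << 61)
#define MCI_STATUS_PCC (1ULL << 57)

enum severity { SEV_NO, SEV_KEEP, SEV_UC, SEV_PANIC };  /* illustrative grades */

struct rule {
    uint64_t mask;    /* status bits this rule looks at */
    uint64_t result;  /* value those bits must have for a match */
    enum severity sev;
    const char *msg;
};

/* First matching rule wins; the last entry matches everything. */
static const struct rule rules[] = {
    { MCI_STATUS_VAL, 0, SEV_NO, "Invalid" },
    { MCI_STATUS_UC | MCI_STATUS_PCC, MCI_STATUS_UC | MCI_STATUS_PCC,
      SEV_PANIC, "Processor context corrupt" },
    { MCI_STATUS_UC, MCI_STATUS_UC, SEV_UC, "Uncorrected" },
    { 0, 0, SEV_KEEP, "Corrected error" },
};

/* The "interpreter": walk the table and return the first matching grade. */
enum severity grade(uint64_t status, const char **msg)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        const struct rule *r = &rules[i];

        if ((status & r->mask) != r->result)
            continue;
        if (msg)
            *msg = r->msg;
        return r->sev;
    }
    assert(0 && "the catch-all rule must match");
    return SEV_PANIC;
}
```

Adding a new case means adding a table row in the right position rather than another branch in the handler, which is the extensibility argument made in the text.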
     
  • Previously mce_panic used a simple heuristic to avoid printing
    old, so far unreported machine check events on an MCE panic. This worked
    by comparing the TSC value at the start of the machine check handler
    with the event time stamp and only printing newer ones.

    This has a couple of issues, in particular on systems where the TSC
    is not fully synchronized between CPUs it could lose events or print
    old ones.

    It is also problematic with full system synchronization as it is
    added by the next patch.

    Remove the TSC heuristic and instead replace it with a simple heuristic
    to print corrected errors first and after that uncorrected errors
    and finally the worst machine check as determined by the machine
    check handler.

    This simplifies the code because there is no need to pass the
    original TSC value around.

    Contains fixes from Ying Huang

    [ Impact: bug fix, cleanup ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Cc: Ying Huang
    Signed-off-by: H. Peter Anvin

    Andi Kleen
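
The print order from the commit above (corrected errors first, then uncorrected errors, and finally the worst machine check) can be expressed as two passes over the log plus a final entry. The struct `mce_rec` and the function name are hypothetical. The worst event is identified here by its index, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_UC (1ULL << 61)  /* uncorrected error */

struct mce_rec {
    uint64_t status;
    int bank;
};

/*
 * Fill `order` with indices into `log` in the order they should be
 * printed. `worst` is the index of the event picked by the handler.
 * Returns the number of indices written.
 */
size_t panic_print_order(const struct mce_rec *log, size_t n,
                         size_t worst, size_t *order)
{
    size_t k = 0;

    assert(log && order);
    for (size_t i = 0; i < n; i++)      /* pass 1: corrected errors */
        if (i != worst && !(log[i].status & MCI_STATUS_UC))
            order[k++] = i;
    for (size_t i = 0; i < n; i++)      /* pass 2: uncorrected errors */
        if (i != worst && (log[i].status & MCI_STATUS_UC))
            order[k++] = i;
    if (worst < n)                      /* last: the worst event */
        order[k++] = worst;
    return k;
}
```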
     
  • Normally the machine check handler ignores corrected errors and leaves
    them to machine_check_poll(). But when panicking, mcp won't run, so
    log all errors.

    Note: this can still miss some cases until the "early no way out"
    patch later is applied too.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Experience has shown that struct mce, which is used to pass a machine
    check to the user space daemon, currently has a few limitations. Some
    data which is useful to print at panic level is also missing.

    This patch addresses most of them. The same information is also
    printed out together with mce panic.

    struct mce can be painlessly extended in a compatible way, the mcelog
    user space code just ignores additional fields with a warning.

    - It doesn't provide a wall time timestamp. There have been a few
    complaints about that. Fix that by adding a 64bit time_t.

    - It doesn't provide the exact CPU identification. This makes
    it awkward for mcelog to decode the event correctly, especially
    when there are variations in the supported MCE codes on different
    CPU models or when mcelog is running on a different host after a panic.
    Previously the administrator had to specify the correct CPU
    when mcelog ran on a different host, but with more variation
    in machine checks now it's better to auto detect that.
    It's also useful for more detailed analysis of CPU events.
    Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h) instead.

    - Socket ID and initial APIC ID are useful to report because they
    allow identifying the failing CPU in some (not all) cases.
    This is also especially useful for the panic situation.
    This addresses one of the complaints from Thomas Gleixner earlier.

    - The MCG capabilities MSR needs to be reported for some advanced
    error processing in mcelog

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • The old struct mce had a limitation to 256 CPUs. But x86 Linux supports
    more than that now with x2apic. Add a new field extcpu to report the
    extended number.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • This makes it easier for tools that want to extract the mcelog out of
    crash images or memory dumps to adapt to changing struct mce size.
    The length field replaces padding, so it's fully compatible.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Keep a count of the machine check polls (or CMCI events) in
    /proc/interrupts.

    Andi needs this for debugging, but it's also useful in general
    to see what's going on in the kernel.

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     
  • Useful for debugging, but it's also good general policy
    to have a counter for all special interrupts there. This makes it easier
    to diagnose where a CPU is spending its time.

    [ Impact: feature, debugging tool ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: H. Peter Anvin

    Andi Kleen
     

02 Jun, 2009

3 commits


01 Jun, 2009

10 commits


31 May, 2009

2 commits