Eric Lee / smarc-fsl-linux-kernel

17 Jan, 2012

1 commit

e032d8077 mce: fix warning messages about static struct mce_device

When suspending, there was a large list of warnings going something like:

Device 'machinecheck1' does not have a release() function, it is broken and must be fixed

This patch turns the static mce_devices into dynamically allocated
devices and properly frees them when they are removed from the system.
It solves the warning messages on my laptop here.
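The idea behind the fix can be illustrated in userspace: an object that is reference counted needs a release callback that frees it, which only works if the object was allocated dynamically. The following is a minimal sketch under that assumption; every `toy_` name is hypothetical and none of it is the kernel's driver-core API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a reference-counted device. */
struct toy_device {
    int refcount;
    void (*release)(struct toy_device *dev); /* runs when refcount hits 0 */
};

int toy_released_count; /* number of release calls, for demonstration */

static void toy_release(struct toy_device *dev)
{
    toy_released_count++;
    free(dev); /* only valid because the device was heap allocated */
}

/* Allocate the device dynamically instead of using a static object. */
struct toy_device *toy_device_create(void)
{
    struct toy_device *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->refcount = 1;
    dev->release = toy_release;
    return dev;
}

void toy_device_get(struct toy_device *dev)
{
    dev->refcount++;
}

/* Drop one reference; returns 1 if this call released the device. */
int toy_device_put(struct toy_device *dev)
{
    assert(dev && dev->refcount > 0);
    if (--dev->refcount == 0) {
        dev->release(dev);
        return 1;
    }
    return 0;
}
```

A static object has no sensible release function because there is nothing to free, which is the situation the warning complains about.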

Reported-by: "Srivatsa S. Bhat"
Reported-by: Linus Torvalds
Tested-by: Djalal Harouni
Cc: Kay Sievers
Cc: Tony Luck
Cc: Borislav Petkov
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Linus Torvalds

Greg Kroah-Hartman
2012-01-17 09:08:42 +0800

08 Jan, 2012

1 commit

7affca353 Merge branch 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
arm: fix up some samsung merge sysdev conversion problems
firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
Drivers:hv: Fix a bug in vmbus_driver_unregister()
driver core: remove __must_check from device_create_file
debugfs: add missing #ifdef HAS_IOMEM
arm: time.h: remove device.h #include
driver-core: remove sysdev.h usage.
clockevents: remove sysdev.h
arm: convert sysdev_class to a regular subsystem
arm: leds: convert sysdev_class to a regular subsystem
kobject: remove kset_find_obj_hinted()
m86k: gpio - convert sysdev_class to a regular subsystem
mips: txx9_sram - convert sysdev_class to a regular subsystem
mips: 7segled - convert sysdev_class to a regular subsystem
sh: dma - convert sysdev_class to a regular subsystem
sh: intc - convert sysdev_class to a regular subsystem
power: suspend - convert sysdev_class to a regular subsystem
power: qe_ic - convert sysdev_class to a regular subsystem
power: cmm - convert sysdev_class to a regular subsystem
s390: time - convert sysdev_class to a regular subsystem
...

Fix up conflicts with 'struct sysdev' removal from various platform
drivers that got changed:
- arch/arm/mach-exynos/cpu.c
- arch/arm/mach-exynos/irq-eint.c
- arch/arm/mach-s3c64xx/common.c
- arch/arm/mach-s3c64xx/cpu.c
- arch/arm/mach-s5p64x0/cpu.c
- arch/arm/mach-s5pv210/common.c
- arch/arm/plat-samsung/include/plat/cpu.h
- arch/powerpc/kernel/sysfs.c
and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h

Linus Torvalds
2012-01-08 04:03:30 +0800

07 Jan, 2012

1 commit

ff4b8a57f Merge branch 'driver-core-next' into Linux 3.2

This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
and it fixes the build error in the arch/x86/kernel/microcode_core.c
file, that the merge did not catch.

The microcode_core.c patch was provided by Stephen Rothwell
who was invaluable in the merge issues involved
with the large sysdev removal process in the driver-core tree.

Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
2012-01-07 03:42:52 +0800

22 Dec, 2011

1 commit

8a25a2fd1 cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem

This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.

After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.

Cc: Haavard Skinnemoen
Cc: Hans-Christian Egtvedt
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Arnd Bergmann
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Paul Mundt
Cc: "David S. Miller"
Cc: Chris Metcalf
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Borislav Petkov
Cc: Tigran Aivazian
Cc: Len Brown
Cc: Zhang Rui
Cc: Dave Jones
Cc: Peter Zijlstra
Cc: Russell King
Cc: Andrew Morton
Cc: Arjan van de Ven
Cc: "Rafael J. Wysocki"
Cc: "Srivatsa S. Bhat"
Signed-off-by: Kay Sievers
Signed-off-by: Greg Kroah-Hartman

Kay Sievers
2011-12-22 06:29:42 +0800

18 Dec, 2011

1 commit

a228b5892 Merge branch 'mce-inject' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce

Ingo Molnar
2011-12-18 16:18:45 +0800

17 Dec, 2011

1 commit

2c29d9dd5 x86: add IRQ context simulation in module mce-inject

mce-inject provides a mechanism to simulate errors so that test
scripts can check for correct operation of the kernel without
requiring any specialized hardware to create rare events.

The existing code can simulate events in normal process context
and also in NMI context - but not in IRQ context. This patch
fills that gap.

Link: https://lkml.org/lkml/2011/12/7/537
Signed-off-by: Chen Gong
Signed-off-by: Tony Luck

Chen Gong
2011-12-17 03:20:02 +0800

14 Dec, 2011

1 commit

3653ada5d x86, mce: Add wrappers for registering on the decode chain

No functionality change, this is done so that in a follow-on patch all
queued-up MCEs can be decoded after registering on the chain.

Signed-off-by: Borislav Petkov

Borislav Petkov
2011-12-14 19:50:12 +0800

08 Nov, 2011

1 commit

66f5ddf30 x86/mce: Make mce_chrdev_ops 'static const'

Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.

Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
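The stub-plus-registration pattern described above can be sketched in plain C. This is a userspace illustration only: the `toy_` names, the simplified handler signature, and the choice of `-EINVAL` for the unregistered case are assumptions, not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*toy_write_fn)(const char *buf, size_t len);

/* Handler installed by the (hypothetical) injector; NULL until then. */
static toy_write_fn toy_registered_write;

/* Stub stored in the constant table: forwards to the registered handler. */
static long toy_write_stub(const char *buf, size_t len)
{
    if (toy_registered_write)
        return toy_registered_write(buf, len);
    return -EINVAL; /* no injector loaded, so writes are rejected */
}

struct toy_file_ops {
    toy_write_fn write;
};

/* The table never changes after build time, so it can be const. */
const struct toy_file_ops toy_chrdev_ops = { .write = toy_write_stub };

/* Registration helper the injector calls when it is loaded. */
void toy_register_write(toy_write_fn fn)
{
    assert(fn != NULL);
    toy_registered_write = fn;
}

/* Example handler: pretends to consume the whole buffer. */
long toy_inject_write(const char *buf, size_t len)
{
    (void)buf;
    return (long)len;
}
```

Only the single function pointer behind the stub is mutable, so the operations table itself can live in read-only data.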

Reported-by: Arjan van de Ven
Signed-off-by: Tony Luck
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar

Luck, Tony
2011-11-08 23:17:11 +0800

27 Jul, 2011

1 commit

60063497a atomic: use <linux/atomic.h>

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>.

Signed-off-by: Arun Sharma
Reviewed-by: Eric Dumazet
Cc: Ingo Molnar
Cc: David Miller
Cc: Eric Dumazet
Acked-by: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arun Sharma
2011-07-27 07:49:47 +0800

16 Jun, 2011

2 commits

c7cece89f x86, mce: Use mce_sysdev_ prefix to group functions

There are many functions named mce_* so use a new prefix for the subset
of functions related to sysfs support.

And since f3c6ea1b06c71b43f751b36bd99345369fe911af introduces
syscore_ops, use the prefix mce_syscore for some functions related to
power management which were in sysdev_class before.

Before:                 After:
mce_device              mce_sysdev
mce_sysclass            mce_sysdev_class
mce_attrs               mce_sysdev_attrs
mce_dev_initialized     mce_sysdev_initialized
mce_create_device       mce_sysdev_create
mce_remove_device       mce_sysdev_remove

mce_suspend             mce_syscore_suspend
mce_shutdown            mce_syscore_shutdown
mce_resume              mce_syscore_resume

Signed-off-by: Hidetoshi Seto
Acked-by: Tony Luck
Link: http://lkml.kernel.org/r/4DEED81B.8020506@jp.fujitsu.com
Signed-off-by: Borislav Petkov

Hidetoshi Seto
2011-06-16 18:10:16 +0800

2b90e77ea x86, mce: Replace MCM_ with MCI_MISC_

Follow other MCi register defines. Plus define MCI_MISC_ADDR_LSB() and
MCI_MISC_ADDR_MODE().
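As an illustration of what such accessors do, here is a small sketch that decodes the address fields of an MCi_MISC value. The layout used here (low six bits for the least significant valid address bit, the next three bits for the address mode) is my reading of the Intel SDM and should be treated as an assumption; the `toy_` helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Least significant valid bit of the reported address (assumed bits 5:0). */
unsigned toy_misc_addr_lsb(uint64_t misc)
{
    return (unsigned)(misc & 0x3f);
}

/* Address mode of the reported address (assumed bits 8:6). */
unsigned toy_misc_addr_mode(uint64_t misc)
{
    return (unsigned)((misc >> 6) & 7);
}

/* Build a MISC value from the two fields, for experiments and tests. */
uint64_t toy_misc_pack(unsigned lsb, unsigned mode)
{
    assert(lsb <= 0x3f && mode <= 7);
    return (uint64_t)lsb | ((uint64_t)mode << 6);
}

/* Clear the address bits below the reported granularity. */
uint64_t toy_misc_align(uint64_t addr, uint64_t misc)
{
    return addr & ~((1ULL << toy_misc_addr_lsb(misc)) - 1);
}
```

For example, an LSB of 12 means the reported address is only valid at 4 KiB granularity, so the low twelve bits carry no information.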

Signed-off-by: Hidetoshi Seto
Acked-by: Tony Luck
Link: http://lkml.kernel.org/r/4DEED6E8.9090509@jp.fujitsu.com
Signed-off-by: Borislav Petkov

Hidetoshi Seto
2011-06-16 18:10:10 +0800

21 Apr, 2011

1 commit

dffa4b2f6 x86, mce: Drop the default decoding notifier

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding
hint only in the uncorrectable errors case and when no notifier
is registered. Also, limit issuing the "mcelog --ascii" message
in the rare case when we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the
header.

Signed-off-by: Borislav Petkov
Signed-off-by: Prarit Bhargava
Cc: Tony Luck
Cc: Nagananda Chumbalkar
Cc: Russ Anderson
Link: http://lkml.kernel.org/r/20110420102349.GB1361@aftab
Signed-off-by: Ingo Molnar

Borislav Petkov
2011-04-21 17:35:10 +0800

04 Jan, 2011

1 commit

9e76a97ef x86, hwmon: Add core threshold notification to therm_throt.c

This patch adds code to therm_throt.c to notify core thermal threshold
events. These thresholds are supported by the IA32_THERM_INTERRUPT register.
The status/log for the same is monitored using the IA32_THERM_STATUS register.
The necessary #defines are in msr-index.h. A callback is added to mce.h to
further notify the thermal stack about the threshold events.

Signed-off-by: Durgadoss R
LKML-Reference:
Signed-off-by: H. Peter Anvin

R, Durgadoss
2011-01-04 00:30:30 +0800

11 Jun, 2010

2 commits

3c4175886 x86, mce: Fix MSR_IA32_MCI_CTL2 CMCI threshold setup

It is reported that CMCI is not raised when the number of corrected errors
reaches the preset threshold. After inspection, it was found that the
MSR_IA32_MCI_CTL2 threshold field is not set up properly. This patch
fixes it.

Value of MCI_CTL2_CMCI_THRESHOLD_MASK is fixed according to x86_64
Software Developer's Manual too.
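A minimal sketch of programming such a threshold field follows. The field positions (threshold in bits 14:0, enable in bit 30) are my reading of the Software Developer's Manual and are assumptions here, as are the `TOY_`/`toy_` names; this is not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_CTL2_CMCI_EN        (1ULL << 30) /* assumed enable bit */
#define TOY_CTL2_THRESHOLD_MASK 0x7fffULL    /* assumed bits 14:0 */

/*
 * Replace the threshold field of a CTL2 value and enable CMCI,
 * leaving every bit outside the field untouched.
 */
uint64_t toy_ctl2_set_threshold(uint64_t ctl2, uint64_t threshold)
{
    assert(threshold <= TOY_CTL2_THRESHOLD_MASK);
    ctl2 &= ~TOY_CTL2_THRESHOLD_MASK; /* clear the old threshold */
    ctl2 |= threshold;                /* install the new one */
    return ctl2 | TOY_CTL2_CMCI_EN;
}
```

A mask that is wider than the real field would also clear or overwrite neighboring bits, which is why the mask value matters as much as the write itself.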

Reported-by: Shaohui Zheng
Signed-off-by: Huang Ying
LKML-Reference:
Reviewed-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Huang Ying
2010-06-11 12:27:36 +0800

1f9a0bd49 x86, mce: Rename MSR_IA32_MCx_CTL2 value

Rename CMCI_EN to MCI_CTL2_CMCI_EN and CMCI_THRESHOLD_MASK to
MCI_CTL2_CMCI_THRESHOLD_MASK to make naming consistent.

Signed-off-by: Huang Ying
LKML-Reference:
Signed-off-by: H. Peter Anvin

Huang Ying
2010-06-11 12:27:26 +0800

20 May, 2010

1 commit

d334a4911 ACPI, APEI, Generic Hardware Error Source memory error support

Generic Hardware Error Source provides a way to report platform
hardware errors (such as that from chipset). It works in so called
"Firmware First" mode, that is, hardware errors are reported to
firmware firstly, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware link
can be checked by firmware to produce more valuable hardware error
information for Linux.

Now, only the SCI notification type and memory errors are supported. More
notification types and hardware error types will be added later. These
memory errors are reported to user space through /dev/mcelog via
faking a corrected Machine Check, so that the error memory page can be
offlined by /sbin/mcelog if the error count for one page is beyond the
threshold.

On some machines, Machine Check cannot report the physical address for
some corrected memory errors, but GHES can do that. So this simplified
GHES is implemented first.

Signed-off-by: Huang Ying
Signed-off-by: Andi Kleen
Signed-off-by: Len Brown

Huang Ying
2010-05-20 10:41:16 +0800

13 Jan, 2010

1 commit

df39a2e48 x86: mce.h: Fix warning in header checks

Someone isn't reading their build output: Move the definition
out of the exported header.

Signed-off-by: Alan Cox
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Alan Cox
2010-01-13 17:41:22 +0800

10 Nov, 2009

1 commit

a2202aa29 x86: Under BIOS control, restore AP's APIC_LVTTHMR to the BSP value

On platforms where the BIOS handles the thermal monitor interrupt,
APIC_LVTTHMR on each logical CPU is programmed to generate a SMI
and OS must not touch it.

Unfortunately AP bringup sequence using INIT-SIPI-SIPI clears all
the LVT entries except the mask bit. Essentially this results in
all LVT entries including the thermal monitoring interrupt set
to masked (clearing the bios programmed value for APIC_LVTTHMR).

This leads the kernel to take over the thermal monitoring
interrupt on AP's but not on BSP (leaving the bios programmed
value only on BSP).

As a result of this, we have seen system hangs when the thermal
monitoring interrupt is generated.

Fix this by reading the initial value of thermal LVT entry on
BSP and if bios has taken over the control, then program the
same value on all AP's and leave the thermal monitoring
interrupt control on all the logical cpu's to the bios.

Signed-off-by: Yong Wang
Reviewed-by: Suresh Siddha
Cc: Borislav Petkov
Cc: Arjan van de Ven
LKML-Reference:
Signed-off-by: Ingo Molnar
Cc: stable@kernel.org

Yong Wang
2009-11-10 12:57:55 +0800

16 Oct, 2009

1 commit

5e09954a9 x86, mce: Fix up MCE naming nomenclature

Prefix global/setup routines with "mcheck_" thus differentiating
from the internal facilities prefixed with "mce_". Also, prefix
the per cpu calls with mcheck_cpu and rename them to reflect the
MCE setup hierarchy of calls better.

There should be no functionality change resulting from this
patch.

Signed-off-by: Borislav Petkov
Cc: Andi Kleen
LKML-Reference:
Signed-off-by: Ingo Molnar

Borislav Petkov
2009-10-16 20:46:49 +0800

12 Oct, 2009

1 commit

fb2531953 mce, edac: Use an atomic notifier for MCEs decoding

Add an atomic notifier which ensures proper locking when conveying
MCE info to EDAC for decoding. The actual notifier call overrides a
default, negative priority notifier.

Note: make sure we register the default decoder only once since
mcheck_init() runs on each CPU.

Signed-off-by: Borislav Petkov
LKML-Reference:
Signed-off-by: Ingo Molnar

Borislav Petkov
2009-10-12 18:24:45 +0800

02 Oct, 2009

1 commit

f436f8bb7 x86: EDAC: MCE: Fix MCE decoding callback logic

Make decoding of MCEs happen only on AMD hardware by registering a
non-default callback only on CPU families which support it.

While looking at the interaction of decode_mce() with the other MCE
code I also noticed a few other things and made the following
cleanups/fixes:

- Fixed the mce_decode() weak alias - a weak alias is really not
good here, it should be a proper callback. A weak alias will be
overridden if a piece of code is built into the kernel - not
good, obviously.

- The patch initializes the callback on AMD family 10h and 11h.

- Added the more correct fallback printk of:

No support for human readable MCE decoding on this CPU type.
Transcribe the message and run it through 'mcelog --ascii' to decode.

On CPUs that don't have a decoder.

- Made the surrounding code more readable.

Note that the callback allows us to have a default fallback -
without having to check the CPU versions during the printout
itself. When an EDAC module registers itself, it can install the
decode-print function.

(there's no unregister needed as this is core code.)

version -v2 by Borislav Petkov:

- add K8 to the set of supported CPUs

- always build in edac_mce_amd since we use an early_initcall now

- fix checkpatch warnings

Signed-off-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Andi Kleen
LKML-Reference:
Signed-off-by: Ingo Molnar

Ingo Molnar
2009-10-02 21:42:18 +0800

11 Aug, 2009

2 commits

0dcc66851 x86, mce: Support specifying raise mode for software MCE injection

Raise modes include raising as an exception or raising as a poll; the mode
is specified via the mce.inject_flags field.

This can be used to specify the raise mode of UCNA, which is a UC error that
is not raised as an exception. It can also be used to test the filter code
of the poll handler or exception handler, for example by enforcing a poll
raise mode for a fatal MCE.
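One way to picture a flags field that carries both an injection context and a raise mode is the sketch below. The constant names and bit values are hypothetical and chosen for illustration; they are not taken from the kernel header.

```c
#include <assert.h>

/* Hypothetical layout: low two bits select the context, bit 3 the raise mode. */
enum {
    TOY_CTX_MASK    = 3,
    TOY_CTX_RANDOM  = 0, /* whatever context the CPU happens to be in */
    TOY_CTX_PROCESS = 1,
    TOY_CTX_IRQ     = 2,
    TOY_EXCEPTION   = 8  /* set: raise as exception; clear: raise as poll */
};

/* Combine a context and a raise mode into one flags value. */
unsigned toy_make_flags(unsigned ctx, int as_exception)
{
    assert(ctx <= TOY_CTX_MASK);
    return ctx | (as_exception ? TOY_EXCEPTION : 0u);
}

/* Extract the injection context from the flags. */
unsigned toy_flags_ctx(unsigned flags)
{
    return flags & TOY_CTX_MASK;
}

/* Nonzero if the injected event should be raised as an exception. */
int toy_flags_is_exception(unsigned flags)
{
    return (flags & TOY_EXCEPTION) != 0;
}
```

Under this layout, a test that wants a fatal event delivered through the poll path would build the flags with the exception bit clear.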

ChangeLog:

v2:

- Re-base on latest x86-tip.git/mce3

Signed-off-by: Huang Ying
Signed-off-by: H. Peter Anvin

Huang Ying
2009-08-11 04:58:41 +0800

5b7e88edc x86, mce: Support specifying context for software mce injection

The cpu context is specified via the new mce.inject_flags fields.
This allows more realistic machine check testing in different
situations. "RANDOM" context is implemented via NMI broadcasting to
add randomization to testing.

AK: Fix NMI broadcasting check. Fix 32-bit building. Some race
fixes. Move to module. Various changes

ChangeLog:

v3:

- Re-based on latest x86-tip.git/mce4

- Fix 32-bit building

v2:

- Re-base on latest x86-tip.git/mce3

Signed-off-by: Huang Ying
Signed-off-by: Andi Kleen
Signed-off-by: H. Peter Anvin

Huang Ying
2009-08-11 04:58:27 +0800

10 Jul, 2009

2 commits

3ccdccfad x86: mce: Lower maximum number of banks to architecture limit

The Intel x86 architecture right now only supports 32 machine check
banks, more would bump into other MSRs.

So lower the max define to 32.

This only affects a few bitmaps, most data structures are dynamically
sized anyways.

Signed-off-by: Andi Kleen
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-07-10 09:39:47 +0800

5bb38adcb x86: mce: Remove old i386 machine check code

As announced in feature-remove-schedule.txt remove CONFIG_X86_OLD_MCE

This patch only removes code.

The ancient machine check code for very old systems that are not supported
by CONFIG_X86_NEW_MCE is still kept.

Signed-off-by: Andi Kleen
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-07-10 09:39:46 +0800

21 Jun, 2009

1 commit

e48768399 x86, mce: fix typo in comment in asm/mce.h

Fix comment to match the actual declaration.

Signed-off-by: Borislav Petkov
Cc: Andi Kleen
Signed-off-by: H. Peter Anvin

Borislav Petkov
2009-06-21 14:27:16 +0800

17 Jun, 2009

5 commits

58995d2d5 x86, mce: mce.h cleanup

Reorder definitions.

- static inline dummy mcheck_init() for !CONFIG_X86_MCE
- gather defs for exception, threshold handler

Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Hidetoshi Seto
2009-06-17 07:56:10 +0800

8363fc82d x86, mce: remove intel_set_thermal_handler()

and make intel_thermal_interrupt() static.

Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Hidetoshi Seto
2009-06-17 07:56:08 +0800

e8ce2c5ee x86, mce: unify smp_thermal_interrupt, prepare

Give them the same shape.

Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Hidetoshi Seto
2009-06-17 07:56:08 +0800

c69783698 x86, mce: make mce_disabled boolean

The mce_disabled on 32bit is a tristate variable [1,0,-1],
while the 64bit version is boolean [0,1].
This patch makes mce_disabled always boolean, and uses mce_p5_enabled
to indicate the third state instead.

Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Hidetoshi Seto
2009-06-17 07:56:07 +0800

9e55e44e3 x86, mce: unify mce.h

There are 2 headers:
arch/x86/include/asm/mce.h
arch/x86/kernel/cpu/mcheck/mce.h
and in the latter small header:
#include <asm/mce.h>

This patch moves all contents of the latter header into the former,
and fixes all files using the latter to include the former instead.

Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Hidetoshi Seto
2009-06-17 07:56:07 +0800

11 Jun, 2009

1 commit

62fdac591 x86, mce: Add boot options for corrected errors

This patch introduces three boot options (no_cmci, dont_log_ce
and ignore_ce) to control handling for corrected errors.

The "mce=no_cmci" boot option disables the CMCI feature.

Since CMCI is a new feature, having boot controls to disable
it will help if the hardware is misbehaving.

The "mce=dont_log_ce" boot option disables logging for corrected
errors. All reported corrected errors will be cleared silently.
This option will be useful if you never care about corrected
errors.

The "mce=ignore_ce" boot option disables features for corrected
errors, i.e. polling timer and cmci. All corrected events are
not cleared and kept in bank MSRs.

Usually disabling these features is not recommended, but it will
help if there is some conflict with the BIOS or hardware
monitoring applications etc. that clear corrected events in
banks instead of the OS.

[ And trivial cleanup (space -> tab) for doc is included. ]
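A rough sketch of how the three option strings could map onto flags is shown below. The struct, its field names, and the parser are hypothetical userspace illustrations; only the option strings themselves come from the message above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the three corrected-error switches. */
struct toy_mce_cfg {
    int no_cmci;     /* CMCI feature disabled */
    int dont_log_ce; /* corrected errors cleared without logging */
    int ignore_ce;   /* no polling, no CMCI; events stay in bank MSRs */
};

/*
 * Parse the value that follows "mce=" on the command line.
 * Returns 1 if the option was recognized, 0 otherwise.
 */
int toy_parse_mce_option(const char *arg, struct toy_mce_cfg *cfg)
{
    assert(arg != NULL && cfg != NULL);

    if (strcmp(arg, "no_cmci") == 0)
        cfg->no_cmci = 1;
    else if (strcmp(arg, "dont_log_ce") == 0)
        cfg->dont_log_ce = 1;
    else if (strcmp(arg, "ignore_ce") == 0)
        cfg->ignore_ce = 1;
    else
        return 0;
    return 1;
}
```

Each option sets an independent flag, so they can be combined by passing "mce=" more than once.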

Signed-off-by: Hidetoshi Seto
Reviewed-by: Andi Kleen
LKML-Reference:
Signed-off-by: Ingo Molnar

Hidetoshi Seto
2009-06-11 17:42:18 +0800

04 Jun, 2009

8 commits

9b1beaf2b x86, mce: support action-optional machine checks

Newer Intel CPUs support a new class of machine checks called recoverable
action optional.

Action Optional means that the CPU detected some form of corruption in
the background and tells the OS about it using a machine check
exception. The OS can then take appropriate action, like killing the
process with the corrupted data or logging the event properly to disk.

This is done by the new generic high level memory failure handler added
in an earlier patch. The high level handler takes the address with the
failed memory and does the appropriate action, like killing the process.

In this version of the patch the high level handler is stubbed out
with a weak function to not create a direct dependency on the hwpoison
branch.

The high level handler cannot be directly called from the machine check
exception though, because it has to run in a defined process context to
be able to sleep when taking VM locks (it is not expected to sleep for a
long time, just do so in some exceptional cases like lock contention)

Thus the MCE handler has to queue a work item for process context,
trigger process context and then call the high level handler from there.

This patch adds two paths to process context: through a per thread kernel
exit notify_user() callback or through a high priority work item.
The first runs when the process exits back to user space, the other when
it goes to sleep and there is no higher priority process.

The machine check handler will schedule both, and whoever runs first
will grab the event. This is done because quick reaction to this
event is critical to avoid a potential more fatal machine check
when the corruption is consumed.

There is a simple lockless ring buffer to queue the corrupted
addresses between the exception handler and the process context handler.
Then in process context it just calls the high level VM code with
the corrupted PFNs.
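The queueing scheme can be approximated in userspace with a single-producer, single-consumer ring. The sketch below uses C11 atomics in place of kernel memory barriers; the ring size, the `toy_` names, and the return conventions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_RING_SIZE 16 /* hypothetical; one slot always stays empty */

struct toy_ring {
    atomic_uint start;                /* next slot the consumer reads */
    atomic_uint end;                  /* next slot the producer writes */
    unsigned long pfn[TOY_RING_SIZE]; /* queued page frame numbers */
};

/* Producer side (exception handler). Returns 0 on success, -1 if full. */
int toy_ring_add(struct toy_ring *r, unsigned long pfn)
{
    unsigned end = atomic_load_explicit(&r->end, memory_order_relaxed);
    unsigned next = (end + 1) % TOY_RING_SIZE;

    if (next == atomic_load_explicit(&r->start, memory_order_acquire))
        return -1;
    r->pfn[end] = pfn;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->end, next, memory_order_release);
    return 0;
}

/* Consumer side (process context). Returns 1 if a PFN was dequeued. */
int toy_ring_get(struct toy_ring *r, unsigned long *pfn)
{
    unsigned start;

    assert(r != NULL && pfn != NULL);
    start = atomic_load_explicit(&r->start, memory_order_relaxed);
    if (start == atomic_load_explicit(&r->end, memory_order_acquire))
        return 0; /* empty */
    *pfn = r->pfn[start];
    atomic_store_explicit(&r->start, (start + 1) % TOY_RING_SIZE,
                          memory_order_release);
    return 1;
}
```

Because each index is written by exactly one side, no lock is needed; when the ring is full the producer simply drops the entry rather than waiting.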

The code adds the required code to extract the failed address from
the CPU's machine check registers. It doesn't try to handle all
possible cases -- the specification has 6 different ways to specify
memory address -- but only the linear address.

Most of the required checking has been already done earlier in the
mce_severity rule checking engine. Following the Intel
recommendations Action Optional errors are only enabled for known
situations (encoded in MCACODs). The errors are ignored otherwise,
because they are action optional.

v2: Improve comment, disable preemption while processing ring buffer
(reported by Ying Huang)

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:48:59 +0800

9ff36ee96 x86, mce: rename mce_notify_user to mce_notify_irq

Rename the mce_notify_user function to mce_notify_irq. The next
patch will split the wakeup handling of interrupt context
and of process context and it's better to give it a clearer
name for this.

Contains a fix from Ying Huang

[ Impact: cleanup ]

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Cc: Huang Ying
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:48:04 +0800

ed7290d0e x86, mce: implement new status bits

The x86 architecture recently added some new machine check status bits:
S(ignalled) and AR (Action-Required). Signalled allows to check
if a specific event caused an exception or was just logged through CMCI.
AR allows the kernel to decide if an event needs immediate action
or can be delayed or ignored.

Implement support for these new status bits. mce_severity() uses
the new bits to grade the machine check correctly and decide what
to do. The exception handler uses AR to decide to kill or not.
The S bit is used to separate events between the poll/CMCI handler
and the exception handler.
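A much simplified grading of a status value by these bits might look like the sketch below. The bit positions are my reading of the architecture manual and are assumptions, the `toy_` names are hypothetical, and the real severity rules cover considerably more cases than this.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in the bank status register. */
#define TOY_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define TOY_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define TOY_STATUS_S   (1ULL << 56) /* signalled through an exception */
#define TOY_STATUS_AR  (1ULL << 55) /* action required */

enum toy_class {
    TOY_CORRECTED,
    TOY_UCNA,            /* uncorrected, no action, seen by poll/CMCI */
    TOY_ACTION_OPTIONAL,
    TOY_ACTION_REQUIRED,
    TOY_FATAL,
    TOY_CLASS_COUNT
};

/* Grade one status value; the first matching rule wins. */
enum toy_class toy_classify(uint64_t status)
{
    if (!(status & TOY_STATUS_UC))
        return TOY_CORRECTED;
    if (status & TOY_STATUS_PCC)
        return TOY_FATAL;
    if (!(status & TOY_STATUS_S))
        return TOY_UCNA;
    return (status & TOY_STATUS_AR) ? TOY_ACTION_REQUIRED
                                    : TOY_ACTION_OPTIONAL;
}

/* Human-readable name for a class, for log output. */
const char *toy_class_name(enum toy_class c)
{
    static const char *const names[TOY_CLASS_COUNT] = {
        "corrected", "ucna", "action-optional", "action-required", "fatal"
    };

    assert(c >= 0 && c < TOY_CLASS_COUNT);
    return names[c];
}
```

Keeping the rules in one function mirrors the point made in the message: the grading is evaluated in several places, so it helps to have a single source for the decision.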

Classical UC always leads to panic. That was true before anyways
because the existing CPUs always passed a PCC with it.

Also corrects the rules whether to kill in user or kernel context
and how to handle missing RIPV.

The machine check handler largely uses the mce-severity grading
engine now instead of making its own decisions. This means the logic
is centralized in one place. This is useful because it has to be
evaluated multiple times.

v2: Some rule fixes; Add AO events
Fix RIPV, RIPV|EIPV order (Ying Huang)
Fix UCNA with AR=1 message (Ying Huang)
Add comment about panicking in m_c_p.

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:45:34 +0800

8ee08347c x86, mce: extend struct mce user interface with more information.

Experience has shown that struct mce, which is used to pass a machine
check to the user space daemon, currently has a few limitations. Also some
data which is useful to print at panic level is missing.

This patch addresses most of them. The same information is also
printed out together with mce panic.

struct mce can be painlessly extended in a compatible way, the mcelog
user space code just ignores additional fields with a warning.

- It doesn't provide a wall time timestamp. There have been a few
complaints about that. Fix that by adding a 64bit time_t

- It doesn't provide the exact CPU identification. This makes
it awkward for mcelog to decode the event correctly, especially
when there are variations in the supported MCE codes on different
CPU models or when mcelog is running on a different host after a panic.
Previously the administrator had to specify the correct CPU
when mcelog ran on a different host, but with the more variation
in machine checks now it's better to auto detect that.
It's also useful for more detailed analysis of CPU events.
Pass CPUID 1.EAX and the cpu vendor (as encoded in processor.h) instead.

- Socket ID and initial APIC ID are useful to report because they
allow to identify the failing CPU in some (not all) cases.
This is also especially useful for the panic situation.
This addresses one of the complaints from Thomas Gleixner earlier.

- The MCG capabilities MSR needs to be reported for some advanced
error processing in mcelog

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:40:38 +0800

d620c67fb x86, mce: support more than 256 CPUs in struct mce

The old struct mce had a limitation to 256 CPUs. But x86 Linux supports
more than that now with x2apic. Add a new field extcpu to report the
extended number.

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:40:38 +0800

f6fb0ac08 x86, mce: store record length into memory struct mce anchor

This makes it easier for tools who want to extract the mcelog out of
crash images or memory dumps to adapt to changing struct mce size.
The length field replaces padding, so it's fully compatible.

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:40:38 +0800

ca84f6969 x86, mce: add MCE poll count to /proc/interrupts

Keep a count of the machine check polls (or CMCI events) in
/proc/interrupts.

Andi needs this for debugging, but it's also useful in general
to see what's going on in the kernel.

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:40:38 +0800

01ca79f14 x86, mce: add machine check exception count in /proc/interrupts

Useful for debugging, but it's also good general policy
to have a counter for all special interrupts there. This makes it easier
to diagnose where a CPU is spending its time.

[ Impact: feature, debugging tool ]

Signed-off-by: Andi Kleen
Signed-off-by: Hidetoshi Seto
Signed-off-by: H. Peter Anvin

Andi Kleen
2009-06-04 05:40:38 +0800