21 Feb, 2017
1 commit
-
Pull RAS updates from Ingo Molnar:
"The main changes in this cycle were:

- Assign notifier chain priorities for all RAS related handlers to
make the ordering explicit (Borislav Petkov)
- Improve the AMD MCA banks sysfs output (Yazen Ghannam)
- Various cleanups and restructuring of the x86 RAS code (Borislav
Petkov)"

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
x86/ras: Get rid of mce_process_work()
EDAC/mce/amd: Dump TSC value
EDAC/mce/amd: Unexport amd_decode_mce()
x86/ras/amd/inj: Change dependency
x86/ras: Flip the TSC-adding logic
x86/ras/amd: Make sysfs names of banks more user-friendly
x86/ras/therm_throt: Do not log a fake MCE for thermal events
x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
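
The idea of making handler ordering explicit through priorities can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel's notifier API: `struct handler`, `chain_register()`, `chain_call()` and the two trace handlers are hypothetical names, and the rule "higher priority runs first" is the assumption being illustrated.

```c
#include <string.h>

/* Hypothetical user-space model of a priority-ordered handler chain. */
struct handler {
    int priority;              /* assumption: higher value runs earlier */
    int (*fn)(void *data);     /* nonzero return stops the chain */
    struct handler *next;
};

/* Insert so that the list stays sorted by descending priority.
 * Handlers with equal priority keep their registration order. */
static void chain_register(struct handler **head, struct handler *h)
{
    while (*head && (*head)->priority >= h->priority)
        head = &(*head)->next;
    h->next = *head;
    *head = h;
}

/* Call the handlers in list order; return how many were invoked. */
static int chain_call(struct handler *head, void *data)
{
    int called = 0;

    for (; head; head = head->next) {
        called++;
        if (head->fn(data))
            break;
    }
    return called;
}

/* Example handlers: each appends a tag to a caller-provided trace string. */
static int trace_high(void *data) { strcat((char *)data, "H"); return 0; }
static int trace_low(void *data)  { strcat((char *)data, "L"); return 0; }
```

Registering `trace_low` with priority 1 and then `trace_high` with priority 10 still runs `trace_high` first, which is the point of assigning priorities instead of relying on registration order.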
16 Feb, 2017
1 commit
-
Currently, the IPID and Syndrome are printed on the same line as the
Address. There are cases when we can have a valid Syndrome but not a
valid Address.

For example, the MCA_SYND register can be used to hold more detailed
error info that the hardware folks can use. It's not just DRAM ECC
syndromes. There are some error types that aren't related to memory that
may have valid syndromes, like some errors related to links in the Data
Fabric, etc.

In these cases, the IPID and Syndrome are not printed at the same log
level as the rest of the stanza, so users won't see them on the console.

Console:
[Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

Dmesg:
[Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
, Syndrome: 0x000000010b404000, IPID: 0x0001002e00000002
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

Print the IPID first and on a new line. The IPID should always be
printed on SMCA systems. The Syndrome will then be printed with the IPID
and at the same log level when valid:

[Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
[Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000010b404000
[Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1487192182-2474-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
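
The printing rule described in this commit (IPID always, Syndrome only when valid, both on one line) can be sketched in user-space C. This is an illustration and not the driver code: `format_ipid_line()` is a hypothetical helper, and the SyndV bit is assumed to be bit 53 of the status value, which is consistent with the sample status `0xd82000000002080b` above.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: SyndV (syndrome valid) is bit 53 of MCA_STATUS. */
#define STATUS_SYNDV (1ULL << 53)

/*
 * Hypothetical helper: write "IPID: ..." into buf and append
 * ", Syndrome: ..." only when the status marks the syndrome as valid.
 * Returns the number of characters written, as snprintf() does.
 */
static int format_ipid_line(char *buf, size_t len, uint64_t status,
                            uint64_t ipid, uint64_t synd)
{
    int n = snprintf(buf, len, "IPID: 0x%016llx", (unsigned long long)ipid);

    if (n < 0 || (size_t)n >= len)
        return n;

    if (status & STATUS_SYNDV)
        n += snprintf(buf + n, len - (size_t)n, ", Syndrome: 0x%016llx",
                      (unsigned long long)synd);
    return n;
}
```

Because both fields are produced by one formatting step, they end up on the same line and therefore at the same log level, which is what the commit is after.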
28 Jan, 2017
1 commit
-
Users may not be familiar with the concept of deferred errors. There is
no action for users to take on this type of error, so give more context
in the error message to make this more clear.
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485297149-13733-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
24 Jan, 2017
3 commits
-
Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.
Suggested-by: Thomas Gleixner
Signed-off-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Tony Luck
Cc: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
Signed-off-by: Ingo Molnar
-
Dump the TSC value of the time when the MCE got logged.
Signed-off-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de
Signed-off-by: Ingo Molnar
-
It is not used outside of the driver anymore.
Signed-off-by: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de
Signed-off-by: Ingo Molnar
29 Nov, 2016
1 commit
-
MCA_STATUS[43] has been defined as "Poison" or "Reserved" for every bank
since Fam15h except for Fam15h, bank 4 in which case it's defined as
part of the McaStatSubCache bitfield.

Filter out that case.
Reported-by: Dean Liberty
Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479478222-19896-1-git-send-email-Yazen.Ghannam@amd.com
[ Split an almost unparseable ternary conditional, add a comment. ]
Signed-off-by: Borislav Petkov
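
The filter described here can be written as a small predicate. The following is a user-space sketch based only on the commit message, not the driver source: `status_has_poison()` is a hypothetical name, and treating families older than Fam15h as "no Poison bit" is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCA_STATUS[43], the bit the commit message talks about. */
#define STATUS_BIT43 (1ULL << 43)

/*
 * Hypothetical helper: report whether bit 43 should be read as "Poison"
 * for the given family and bank.
 */
static bool status_has_poison(unsigned int family, unsigned int bank,
                              uint64_t status)
{
    /* Assumption: before Fam15h the bit is not interpreted as Poison. */
    if (family < 0x15)
        return false;

    /* Fam15h, bank 4: bit 43 is part of McaStatSubCache, so skip it. */
    if (family == 0x15 && bank == 4)
        return false;

    return (status & STATUS_BIT43) != 0;
}
```

Splitting the checks into separate early returns mirrors the editorial note in the commit about breaking up a hard-to-parse ternary conditional.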
24 Nov, 2016
1 commit
-
tip:ras/core contains the respective Fam17h x86 RAS bits which
amd64_edac is going to use. So merge it into the EDAC branch.
Signed-off-by: Borislav Petkov
21 Nov, 2016
1 commit
-
nb_bus_decoder() is only used for DRAM ECC errors so rename it so that
the name is more generic and descriptive.

Also, call it for DRAM ECC errors on SMCA systems.
[ Boris: rename it to real function name with a verb in it. ]
Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1479423463-8536-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
09 Nov, 2016
3 commits
-
Add accessor functions and hide the smca_names array. Also, add a
sanity-check to bank HWID assignment in get_smca_bank_info().
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/20161104152317.5r276t35df53qk76@pd.tnic
Signed-off-by: Thomas Gleixner
-
Make it differ more from struct smca_bank_name for better readability.
Signed-off-by: Borislav Petkov
Tested-by: Yazen Ghannam
Link: http://lkml.kernel.org/r/20161103125556.15482-3-bp@alien8.de
Signed-off-by: Thomas Gleixner
-
Call it simply smca_hwid and call local variables "hwid". More readable.
Signed-off-by: Borislav Petkov
Tested-by: Yazen Ghannam
Link: http://lkml.kernel.org/r/20161103125556.15482-2-bp@alien8.de
Signed-off-by: Thomas Gleixner
13 Sep, 2016
6 commits
-
Bank 4 is reserved on family 0x17 and shouldn't generate any MCE
records. However, broken hardware and software is not something unheard
of so warn about bank 4 errors. They shouldn't be coming from bank 4
naturally but users can still use mce_amd_inj to simulate errors from it
for testing purposes.

Also, avoid special handling in the injector mce_amd_inj like it is
being done on the older families.

[ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ]
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Reviewed-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
-
The MCA_SYND and MCA_IPID registers contain valuable information and
should be included in MCE output. The MCA_SYND register contains
syndrome and other error information, and the MCA_IPID register will
uniquely identify the MCA bank's type without having to rely on system
software.
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
-
Scalable MCA defines a number of IP types. An MCA bank on an SMCA
system is defined as one of these IP types. A bank's type is uniquely
identified by the combination of the HWID and MCATYPE values read from
its MCA_IPID register.

Add the required tables in order to be able to lookup error descriptions
based on a bank's type and the error's extended error code.

[ bp: Align comments, simplify a bit. ]
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
-
The error descriptions defined for Fam17h can be reused for other SMCA
systems, so their names should reflect this.

Change f17h prefix to smca for error descriptions.
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
-
Add missing SMCA error descriptions to the error descriptions arrays.
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/1472673994-12235-3-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
-
Print SyndV bit status and print the raw value of the MCA_SYND register.
Further decoding of the syndrome from struct mce.synd can be done in
other places where appropriate, e.g. DRAM ECC.

Boris: make the error stanza more compact by putting the error address
and syndrome on the same line:

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117
[Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000
[Hardware Error]: Invalid IP block specified.
[Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Thomas Gleixner
12 May, 2016
1 commit
-
Use X86_FEATURE_SMCA when detecting if SMCA is available instead of
directly using CPUID 0x80000007_EBX.
Signed-off-by: Yazen Ghannam
Signed-off-by: Borislav Petkov
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: linux-edac
Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar
08 Mar, 2016
1 commit
-
For Scalable MCA enabled processors, errors are listed per IP block. And
since it is not required for an IP to map to a particular bank, we need
to use HWID and McaType values from the MCx_IPID register to figure out
which IP a given bank represents.

We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
context is corrupt.

Add logic here to decode errors from all known IP blocks for Fam17h
Model 00-0fh and to print TCC errors.

[ Minor fixups. ]
Signed-off-by: Aravind Gopalakrishnan
Signed-off-by: Borislav Petkov
Cc: Borislav Petkov
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: linux-edac
Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar
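
Identifying a bank's IP type from its IPID register comes down to extracting two bit fields. The sketch below is illustrative only: `struct bank_type` and `decode_ipid()` are hypothetical names, and the field positions (HWID in bits 43:32, McaType in bits 63:48) are an assumption that is not stated in the commit message; check them against the processor documentation before relying on them.

```c
#include <stdint.h>

/* Hypothetical pair identifying the IP type behind an MCA bank. */
struct bank_type {
    uint16_t hwid;     /* hardware ID of the IP block */
    uint16_t mcatype;  /* MCA type within that IP block */
};

/*
 * Split an MCx_IPID value into its HWID and McaType fields.
 * Assumed layout: HWID = bits 43:32, McaType = bits 63:48.
 */
static struct bank_type decode_ipid(uint64_t ipid)
{
    uint32_t high = (uint32_t)(ipid >> 32);
    struct bank_type t;

    t.hwid = (uint16_t)(high & 0xFFF);
    t.mcatype = (uint16_t)((high >> 16) & 0xFFFF);
    return t;
}
```

Under this assumed layout, the IPID value `0x0001002e00000002` that appears in a log sample elsewhere in this history splits into HWID `0x2e` and McaType `1`; a table keyed on that pair would then select the matching error descriptions.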
14 Jul, 2015
1 commit
-
Currently, when decoding an MCE, we display 'CE' for a Deferred error, like
this:

[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00095080813

When the 'UC' bit in the MCx_STATUS register is clear, the error status
is either a Corrected error or Deferred error as determined by the
'Deferred' bit. So do not print 'CE' on a deferred error.

Refer to AMD Error Scope Hierarchy table in a newer BKDG (example:
49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features").
Signed-off-by: Aravind Gopalakrishnan
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Link: http://lkml.kernel.org/r/1436788382-6463-1-git-send-email-aravind.gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov
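
The decision described above is a three-way choice on two status bits. A minimal user-space sketch follows; `severity_label()` is a hypothetical helper, the labels follow the log samples in this history, and the bit positions (UC at bit 61, Deferred at bit 44) are assumptions that agree with the sample status value `0xdc04b00095080813`.

```c
#include <stdint.h>

/* Assumed bit positions in MCx_STATUS. */
#define STATUS_UC       (1ULL << 61)  /* uncorrected error */
#define STATUS_DEFERRED (1ULL << 44)  /* deferred error */

/*
 * Hypothetical helper: choose the label for the UC/CE slot of the
 * status line. A deferred error is neither "UE" nor "CE", so the slot
 * is left as "-" and the separate Deferred field carries the meaning.
 */
static const char *severity_label(uint64_t status)
{
    if (status & STATUS_UC)
        return "UE";
    if (status & STATUS_DEFERRED)
        return "-";
    return "CE";
}
```

With this rule, the deferred-error sample above no longer shows `CE`, while a status with neither bit set is still reported as corrected.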
25 Nov, 2014
1 commit
-
Write out MCx_ADDR into the more humanly readable "MCx Error Address"
and remove double colon in the output.
Cc: Aravind Gopalakrishnan
Signed-off-by: Borislav Petkov
05 Nov, 2014
1 commit
-
Extended error code meanings are tabulated for other banks. Extend that
tradition for MC6 too.
Signed-off-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1415122868-10969-1-git-send-email-aravind.gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov
14 Jul, 2014
1 commit
-
Add decoding logic for new Fam15h model 60h.
Tested using mce_amd_inj module and works fine.
Signed-off-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1405098795-4678-1-git-send-email-Aravind.Gopalakrishnan@amd.com
[ Boris: simplify a bit. ]
Signed-off-by: Borislav Petkov
09 May, 2014
1 commit
-
295d8cda2689 ("EDAC, MCE, AMD: Drop local coreid reporting") removed the
code snippet which used that mask but forgot to drop the mask itself. Do
that now.
Signed-off-by: Borislav Petkov
24 Feb, 2014
1 commit
-
We want to still be able to issue some error information on systems for
which there is no decoding support (think older distro kernels here,
for example). Therefore, we allow module registration but skip the
per-family bank-specific decoders and issue the general information
only, i.e.:

[ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
[ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
[ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

with the hope that it still contains some useful bits.
Suggested-by: Aravind Gopalakrishnan
Tested-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1392659391-2411-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov
08 Jun, 2013
1 commit
-
Add a new error signature for Family 15h, models 30h-3fh. Patch has been
tested on Fam15h using mce_amd_inj facility and has been verified to
work correctly.
Signed-off-by: Aravind Gopalakrishnan
[ cleanup commit message and error string ]
Signed-off-by: Borislav Petkov
23 Jan, 2013
3 commits
-
Initially, those strings describing different parts of an MCE message
were shared with amd64_edac and were therefore exported to modules.
However, all except pp_msgs are used only in one place right now so hide
them and make them static.

No functionality change.
Reported-by: Fengguang Wu
Signed-off-by: Borislav Petkov
-
Add MCE decoding logic for AMD Family 16h processors.
Boris:
- drop unneeded uu_msgs export
- exit early in cat_mc1_mce and save us an indentation level
Signed-off-by: Jacob Shin
Signed-off-by: Borislav Petkov
-
Currently only AMD Family 15h processors have special handling for MC2
errors. Since upcoming Family 16h will also need unique handling, let's
make MC2 handling part of amd_decoder_ops.
Signed-off-by: Jacob Shin
Signed-off-by: Borislav Petkov
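
Moving family-specific handling into an operations table is a common C pattern: each family supplies its own function pointer and the caller dispatches through it. The sketch below only illustrates that pattern. `struct decoder_ops`, `select_ops()` and the three stub decoders are hypothetical and do not reflect the real layout of amd_decoder_ops or any real decoding logic.

```c
/* Hypothetical per-family operations table (not the real amd_decoder_ops). */
struct decoder_ops {
    const char *(*decode_mc2)(void);  /* stub: returns which decoder ran */
};

/* Stub decoders standing in for the real per-family MC2 logic. */
static const char *mc2_default(void) { return "default MC2 decode"; }
static const char *mc2_f15h(void)    { return "Fam15h MC2 decode"; }
static const char *mc2_f16h(void)    { return "Fam16h MC2 decode"; }

static const struct decoder_ops default_ops = { mc2_default };
static const struct decoder_ops f15h_ops    = { mc2_f15h };
static const struct decoder_ops f16h_ops    = { mc2_f16h };

/* Pick the table once, based on the CPU family. */
static const struct decoder_ops *select_ops(unsigned int family)
{
    switch (family) {
    case 0x15:
        return &f15h_ops;
    case 0x16:
        return &f16h_ops;
    default:
        return &default_ops;
    }
}
```

The caller then writes `select_ops(family)->decode_mc2()` and needs no family checks of its own, so supporting another family means adding one table entry.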
28 Nov, 2012
4 commits
-
Dump error status after decoding the error which describes the error
disposition.
Signed-off-by: Borislav Petkov
-
Instead of starting with the error details, report the decoded, readable
error type first.
Signed-off-by: Borislav Petkov
-
It is very useful to have the family/model/stepping with the reported
error so dump it. This saves us asking the bug reporter about it.
Signed-off-by: Borislav Petkov
-
Having the functional unit names in each bank decode is only misleading
as this code supports multiple families and there's no guarantee the
mapping between FUs and MCE banks will stay the same.

And also, knowing the functional unit name doesn't help much since you
end up looking at the respective BKDG anyway.

So drop all FU references and use the MC bank numbers instead.
Signed-off-by: Borislav Petkov
04 Apr, 2012
1 commit
-
MCA details seldom change in between the models of a family so don't
be too conservative and enable decoding on everything starting from
K8 onwards. Minor adjustments can come in later but most importantly,
we have some decoding infrastructure in place for upcoming models by
default.
Signed-off-by: Borislav Petkov
19 Mar, 2012
5 commits
-
... so that checkpatch can chill out.
Signed-off-by: Borislav Petkov
Reviewed-by: Andreas Herrmann
-
... and remove superfluous ErrorCodeExt check.
Signed-off-by: Borislav Petkov
Reviewed-by: Andreas Herrmann
-
Correct their formulation, replace per-family functions with a single,
unified lookup table.
Signed-off-by: Borislav Petkov
Reviewed-by: Andreas Herrmann
-
Sync with latest BKDG error types.
Signed-off-by: Borislav Petkov
Reviewed-by: Andreas Herrmann
-
This MC1 error signature is called differently now, fix it.
Signed-off-by: Borislav Petkov
Reviewed-by: Andreas Herrmann