21 Feb, 2017

1 commit

  • Pull RAS updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Assign notifier chain priorities for all RAS related handlers to
    make the ordering explicit (Borislav Petkov)

    - Improve the AMD MCA banks sysfs output (Yazen Ghannam)

    - Various cleanups and restructuring of the x86 RAS code (Borislav
    Petkov)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
    x86/ras: Get rid of mce_process_work()
    EDAC/mce/amd: Dump TSC value
    EDAC/mce/amd: Unexport amd_decode_mce()
    x86/ras/amd/inj: Change dependency
    x86/ras: Flip the TSC-adding logic
    x86/ras/amd: Make sysfs names of banks more user-friendly
    x86/ras/therm_throt: Do not log a fake MCE for thermal events
    x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

    Linus Torvalds
     

16 Feb, 2017

1 commit

  • Currently, the IPID and Syndrome are printed on the same line as the
    Address. There are cases when we can have a valid Syndrome but not a
    valid Address.

    For example, the MCA_SYND register can be used to hold more detailed
    error info that the hardware folks can use. It's not just DRAM ECC
    syndromes. There are some error types that aren't related to memory that
    may have valid syndromes, like some errors related to links in the Data
    Fabric, etc.

    In these cases, the IPID and Syndrome are not printed at the same log
    level as the rest of the stanza, so users won't see them on the console.

    Console:
    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Dmesg:
    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    , Syndrome: 0x000000010b404000, IPID: 0x0001002e00000002
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Print the IPID first and on a new line. The IPID should always be
    printed on SMCA systems. The Syndrome will then be printed with the IPID
    and at the same log level when valid:

    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000010b404000
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1487192182-2474-1-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
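
The ordering described in this entry can be sketched in userspace C. This is a minimal illustration, not the driver's code: `struct mce_sample`, `append()` and `format_smca_lines()` are hypothetical names, and the AddrV (bit 58) and SyndV (bit 53) positions are assumptions inferred from the sample MC22_STATUS value above.

```c
#include <inttypes.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed MCA_STATUS bit positions (not taken from the driver source). */
#define STATUS_ADDRV (1ULL << 58)
#define STATUS_SYNDV (1ULL << 53)

/* Hypothetical stand-in for the fields of an MCE record used here. */
struct mce_sample {
	uint64_t status, addr, synd, ipid;
};

/* Append formatted text at offset off; becomes a no-op once buf is full. */
static size_t append(char *buf, size_t len, size_t off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (off >= len)
		return off;
	va_start(ap, fmt);
	n = vsnprintf(buf + off, len - off, fmt, ap);
	va_end(ap);
	return n < 0 ? off : off + (size_t)n;
}

/*
 * Emit the address line only when AddrV is set. On SMCA systems always
 * emit the IPID on its own line, followed by the syndrome when SyndV is set.
 */
static size_t format_smca_lines(const struct mce_sample *m, bool smca,
				char *buf, size_t len)
{
	size_t off = 0;

	if (len)
		buf[0] = '\0';
	if (m->status & STATUS_ADDRV)
		off = append(buf, len, off,
			     "[Hardware Error]: Error Addr: 0x%016" PRIx64 "\n",
			     m->addr);
	if (smca) {
		off = append(buf, len, off,
			     "[Hardware Error]: IPID: 0x%016" PRIx64, m->ipid);
		if (m->status & STATUS_SYNDV)
			off = append(buf, len, off,
				     ", Syndrome: 0x%016" PRIx64, m->synd);
		off = append(buf, len, off, "\n");
	}
	return off;
}
```

Because every line starts with its own prefix, no part of the stanza depends on a continuation of a previous line, which is the property the commit message asks for.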
     

28 Jan, 2017

1 commit

  • Users may not be familiar with the concept of deferred errors. There is
    no action for users to take on this type of error, so give more context
    in the error message to make this clearer.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-2-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

24 Jan, 2017

3 commits

  • Assign all notifiers on the MCE decode chain a priority so that they get
    called in the correct order.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
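
To illustrate why explicit priorities make the call order deterministic, here is a minimal userspace model of a priority-sorted notifier chain. All names (`struct notifier`, `chain_register()`, `chain_call()`, the example handlers) are hypothetical and only mimic the general idea; this is not the kernel's notifier API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one handler on a priority-ordered chain. */
struct notifier {
	int (*call)(void *data);
	int priority;			/* higher priority runs first */
	struct notifier *next;
};

/* Insert nb so the list stays sorted by descending priority. */
static void chain_register(struct notifier **head, struct notifier *nb)
{
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* Call every handler in list order; return how many were called. */
static int chain_call(struct notifier *head, void *data)
{
	int calls = 0;

	for (; head; head = head->next, calls++)
		head->call(data);
	return calls;
}

/* Example handlers: each appends one tag character to a trace string. */
static int append_tag(void *data, char tag)
{
	char *trace = data;
	size_t n = strlen(trace);

	trace[n] = tag;
	trace[n + 1] = '\0';
	return 0;
}

static int early_handler(void *data)  { return append_tag(data, 'E'); }
static int decode_handler(void *data) { return append_tag(data, 'D'); }
static int last_handler(void *data)   { return append_tag(data, 'L'); }
```

With this model, handlers run in priority order no matter which one registers first. Handlers that share a priority fall back to registration order, which is the implicit ordering the commit makes explicit.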
     
  • Dump the TSC value of the time when the MCE got logged.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • amd_decode_mce() is not used outside of the driver anymore, so unexport it.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

29 Nov, 2016

1 commit

  • MCA_STATUS[43] has been defined as "Poison" or "Reserved" for every bank
    since Fam15h except for Fam15h, bank 4 in which case it's defined as
    part of the McaStatSubCache bitfield.

    Filter out that case.

    Reported-by: Dean Liberty
    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479478222-19896-1-git-send-email-Yazen.Ghannam@amd.com
    [ Split an almost unparseable ternary conditional, add a comment. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
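
The filter described in this entry can be sketched as follows. The helper names are hypothetical, and the sketch assumes the Poison bit is only meaningful from Fam15h onwards, as the commit message implies.

```c
#include <stdbool.h>
#include <stdint.h>

/* MCA_STATUS[43], as named in the commit message. */
#define STATUS_POISON (1ULL << 43)

/*
 * Bit 43 means "Poison" (or is reserved) on Fam15h and later, except on
 * Fam15h bank 4, where it belongs to the McaStatSubCache bitfield.
 */
static bool poison_bit_is_valid(unsigned int family, unsigned int bank)
{
	if (family < 0x15)
		return false;
	return !(family == 0x15 && bank == 4);
}

/* Report poison only where bit 43 really carries that meaning. */
static bool poison_reported(unsigned int family, unsigned int bank,
			    uint64_t status)
{
	return poison_bit_is_valid(family, bank) && (status & STATUS_POISON);
}
```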
     

24 Nov, 2016

1 commit


21 Nov, 2016

1 commit

  • nb_bus_decoder() is only used for DRAM ECC errors so rename it so that
    the name is more generic and descriptive.

    Also, call it for DRAM ECC errors on SMCA systems.

    [ Boris: rename it to real function name with a verb in it. ]

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1479423463-8536-4-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

09 Nov, 2016

3 commits


13 Sep, 2016

6 commits

  • Bank 4 is reserved on family 0x17 and shouldn't generate any MCE
    records. However, broken hardware and software is not something unheard
    of so warn about bank 4 errors. They shouldn't be coming from bank 4
    naturally but users can still use mce_amd_inj to simulate errors from it
    for testing purposes.

    Also, avoid special handling in the injector mce_amd_inj like it is
    being done on the older families.

    [ bp: Rewrite commit message and merge into one patch. Use boot_cpu_data. ]

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Reviewed-by: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/1473384591-5323-1-git-send-email-Yazen.Ghannam@amd.com
    Link: http://lkml.kernel.org/r/1473384591-5323-2-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
     
  • The MCA_SYND and MCA_IPID registers contain valuable information and
    should be included in MCE output. The MCA_SYND register contains
    syndrome and other error information, and the MCA_IPID register will
    uniquely identify the MCA bank's type without having to rely on system
    software.

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
     
  • Scalable MCA defines a number of IP types. An MCA bank on an SMCA
    system is defined as one of these IP types. A bank's type is uniquely
    identified by the combination of the HWID and MCATYPE values read from
    its MCA_IPID register.

    Add the required tables in order to be able to look up error descriptions
    based on a bank's type and the error's extended error code.

    [ bp: Align comments, simplify a bit. ]

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1472741832-1690-1-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
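
A lookup keyed on (HWID, MCATYPE) plus the extended error code might look like the sketch below. It is illustrative only: the field positions in MCA_IPID (HWID in bits [43:32], McaType in bits [63:48]) are an assumption, the struct and function names are hypothetical, and the description strings are placeholders. The (0x2e, 1) entry is inferred from the sample log elsewhere in this history, where IPID 0x0001002e00000002 is reported as "Power, Interrupts, etc."; the other entry is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed MCA_IPID layout: HWID in bits [43:32], McaType in bits [63:48]. */
static unsigned int ipid_hwid(uint64_t ipid)
{
	return (unsigned int)((ipid >> 32) & 0xfff);
}

static unsigned int ipid_mcatype(uint64_t ipid)
{
	return (unsigned int)((ipid >> 48) & 0xffff);
}

/* Hypothetical table entry: one IP type and its error descriptions. */
struct bank_type {
	unsigned int hwid, mcatype;
	const char *name;
	const char * const *descs;	/* indexed by extended error code */
	size_t num_descs;
};

/* Placeholder strings, not real error descriptions. */
static const char * const placeholder_descs[] = {
	"placeholder description 0",
	"placeholder description 1",
	"placeholder description 2",
};

static const struct bank_type bank_types[] = {
	{ 0x2e, 0, "hypothetical IP type", placeholder_descs, 3 },
	{ 0x2e, 1, "Power, Interrupts, etc.", placeholder_descs, 3 },
};

/* Find the IP type for a bank from its MCA_IPID value, or NULL. */
static const struct bank_type *find_bank_type(uint64_t ipid)
{
	size_t i;

	for (i = 0; i < sizeof(bank_types) / sizeof(bank_types[0]); i++) {
		if (bank_types[i].hwid == ipid_hwid(ipid) &&
		    bank_types[i].mcatype == ipid_mcatype(ipid))
			return &bank_types[i];
	}
	return NULL;
}

/* Description for (bank type, extended error code), or NULL if unknown. */
static const char *describe_error(uint64_t ipid, unsigned int xec)
{
	const struct bank_type *t = find_bank_type(ipid);

	if (!t || xec >= t->num_descs)
		return NULL;
	return t->descs[xec];
}
```

Returning NULL for an unknown type or an out-of-range error code lets the caller fall back to printing only the raw extended error code.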
     
  • The error descriptions defined for Fam17h can be reused for other SMCA
    systems, so their names should reflect this.

    Change f17h prefix to smca for error descriptions.

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1472673994-12235-4-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
     
  • Add missing SMCA error descriptions to the error descriptions arrays.

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1472673994-12235-3-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
     
  • Print SyndV bit status and print the raw value of the MCA_SYND register.
    Further decoding of the syndrome from struct mce.synd can be done in
    other places where appropriate, e.g. DRAM ECC.

    Boris: make the error stanza more compact by putting the error address
    and syndrome on the same line:

    [Hardware Error]: Corrected error, no action required.
    [Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 0x96204100001e0117
    [Hardware Error]: Error Addr: 0x000000007f4c52e3, Syndrome: 0x0000000000000000
    [Hardware Error]: Invalid IP block specified.
    [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Thomas Gleixner

    Yazen Ghannam
     

12 May, 2016

1 commit

  • Use X86_FEATURE_SMCA when detecting if SMCA is available instead of
    directly using CPUID 0x80000007_EBX.

    Signed-off-by: Yazen Ghannam
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1462971509-3856-7-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Yazen Ghannam
     

08 Mar, 2016

1 commit

  • For Scalable MCA enabled processors, errors are listed per IP block. And
    since it is not required for an IP to map to a particular bank, we need
    to use HWID and McaType values from the MCx_IPID register to figure out
    which IP a given bank represents.

    We also have a new bit (TCC) in the MCx_STATUS register to indicate Task
    context is corrupt.

    Add logic here to decode errors from all known IP blocks for Fam17h
    Model 00-0fh and to print TCC errors.

    [ Minor fixups. ]
    Signed-off-by: Aravind Gopalakrishnan
    Signed-off-by: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1457021458-2522-3-git-send-email-Aravind.Gopalakrishnan@amd.com
    Signed-off-by: Ingo Molnar

    Aravind Gopalakrishnan
     

14 Jul, 2015

1 commit

  • Currently, when decoding an MCE, we display 'CE' for a Deferred error, like
    this:

    [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|Deferred|-|UECC]: 0xdc04b00095080813

    When the 'UC' bit in the MCx_STATUS register is clear, the error status
    is either a Corrected error or Deferred error as determined by the
    'Deferred' bit. So do not print 'CE' on a deferred error.

    Refer to AMD Error Scope Hierarchy table in a newer BKDG (example:
    49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features").

    Signed-off-by: Aravind Gopalakrishnan
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1436788382-6463-1-git-send-email-aravind.gopalakrishnan@amd.com
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan
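
The decision described in this entry reduces to a three-way choice. The sketch below is an assumption-laden illustration: `severity_tag()` is a hypothetical helper, and the UC (bit 61) and Deferred (bit 44) positions are inferred from the sample status values in this history rather than quoted from the driver.

```c
#include <stdint.h>

/* Assumed MCx_STATUS bit positions. */
#define STATUS_UC       (1ULL << 61)
#define STATUS_DEFERRED (1ULL << 44)

/*
 * Tag for the CE/UE slot of the status line: "UE" when UC is set;
 * otherwise "-" for a deferred error and "CE" for a corrected one.
 */
static const char *severity_tag(uint64_t status)
{
	if (status & STATUS_UC)
		return "UE";
	if (status & STATUS_DEFERRED)
		return "-";
	return "CE";
}
```

For the status value quoted in this entry, 0xdc04b00095080813, UC is clear and Deferred is set, so under these assumptions the helper yields "-" instead of "CE".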
     

25 Nov, 2014

1 commit


05 Nov, 2014

1 commit


14 Jul, 2014

1 commit


09 May, 2014

1 commit


24 Feb, 2014

1 commit

  • We want to still be able to issue some error information on systems for
    which there is no decoding support (think older distro kernels here,
    for example). Therefore, we allow module registration but skip the
    per-family bank-specific decoders and issue the general information
    only, i.e.:

    [ 46.822828] [Hardware Error]: Error Status: Uncorrected, software containable error.
    [ 46.822846] [Hardware Error]: CPU:0 (15:30:0) MC0_STATUS[-|UE|-|-|-|-|-]: 0xa000000000010f0f
    [ 46.822858] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)

    with the hope that it still contains useful bits.

    Suggested-by: Aravind Gopalakrishnan
    Tested-by: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/1392659391-2411-1-git-send-email-Aravind.Gopalakrishnan@amd.com
    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

08 Jun, 2013

1 commit


23 Jan, 2013

3 commits


28 Nov, 2012

4 commits


04 Apr, 2012

1 commit

  • MCA details seldom change between the models of a family so don't
    be too conservative and enable decoding on everything starting from
    K8 onwards. Minor adjustments can come in later but most importantly,
    we have some decoding infrastructure in place for upcoming models by
    default.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

19 Mar, 2012

5 commits