18 Aug, 2020

1 commit

  • IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
    the stack as part of machine check delivery is valid or not.

    Various drivers copied a code fragment that uses the RIPV bit to
    determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
    or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
    RIPV is set as "FATAL").

    Reverse the tests so that the error is marked fatal when RIPV is not set.

    Reported-by: Gabriele Paoloni
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com

    Tony Luck
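
    The corrected test can be sketched as below. This is a minimal
    userspace illustration, not the drivers' code: the enum and the
    function name uc_severity() are hypothetical stand-ins for the
    kernel's HW_EVENT_ERR_* values, and the sketch assumes the error has
    already been classified as uncorrected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: the RIP pushed on the stack is valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Hypothetical stand-ins for HW_EVENT_ERR_UNCORRECTED / HW_EVENT_ERR_FATAL. */
enum err_severity {
    ERR_UNCORRECTED,
    ERR_FATAL,
};

/* Severity of an uncorrected error: fatal only when RIPV is NOT set. */
static enum err_severity uc_severity(uint64_t mcgstatus)
{
    bool ripv = (mcgstatus & MCG_STATUS_RIPV) != 0;

    return ripv ? ERR_UNCORRECTED : ERR_FATAL;
}
```

    With RIPV set, execution can resume at the saved RIP, so the error is
    reported as uncorrected rather than fatal.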
     

11 Jun, 2020

1 commit


01 Jun, 2020

1 commit


20 May, 2020

1 commit

  • The skx_edac driver wrongly uses the mtr register to retrieve two fields
    close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
    to get the two fields.

    Signed-off-by: Qiuxu Zhuo
    Reported-by: Matthew Riley
    Acked-by: Aristeu Rozanski
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com

    Qiuxu Zhuo
     

28 Apr, 2020

1 commit

  • The device ID for configuration agent PCI device and the offset for
    bus number configuration register can be CPU model specific. So add
    a new structure res_config to make them configurable and pass res_config
    to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Reviewed-by: Borislav Petkov
    Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic

    Qiuxu Zhuo
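
    A per-model configuration table along these lines might look like the
    sketch below. The field names, the enum, the lookup helper find_cfg()
    and all numeric values are illustrative assumptions, not the values
    used by the drivers.

```c
#include <stddef.h>

/* Hypothetical CPU model tags. */
enum cpu_type { CPU_SKX, CPU_I10NM };

/* Model-specific resources, passed to the init code instead of constants. */
struct res_config {
    enum cpu_type type;
    unsigned int decs_did;   /* device ID of the configuration agent PCI device */
    int busno_cfg_offset;    /* offset of the bus number configuration register */
};

/* Placeholder values only; real IDs and offsets come from the hardware docs. */
static const struct res_config cfg_table[] = {
    { .type = CPU_SKX,   .decs_did = 0x1001, .busno_cfg_offset = 0x10 },
    { .type = CPU_I10NM, .decs_did = 0x1002, .busno_cfg_offset = 0x20 },
};

/* Return the configuration for a model, or NULL if the model is unknown. */
static const struct res_config *find_cfg(enum cpu_type type)
{
    for (size_t i = 0; i < sizeof(cfg_table) / sizeof(cfg_table[0]); i++)
        if (cfg_table[i].type == type)
            return &cfg_table[i];
    return NULL;
}
```

    Shared code then reads the device ID and register offset from the
    structure it is handed rather than hard-coding one model's values.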
     

14 Apr, 2020

2 commits

  • When acpi_extlog was added, we were worried that the same error would
    be reported more than once by different subsystems. But in the ensuing
    years I've seen complaints that people could not find an error log
    (because this mechanism suppressed the log they were looking for).

    Rip it all out. People are smart enough to notice the same address from
    different reporting mechanisms.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com

    Tony Luck
     
  • If the handler took any action to log or deal with the error, set a bit
    in mce->kflags so that the default handler on the end of the machine
    check chain can see what has been done.

    Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
    skip over errors already processed by CEC.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com

    Tony Luck
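
    The flag-passing scheme can be illustrated with a small userspace
    model. Everything here is a simplified assumption: the flag names,
    struct mce_rec, and the two-handler chain are hypothetical, and the
    kernel's notifier chain and flag definitions differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "handled by" bits, one per handler on the chain. */
#define HANDLED_CEC  (1u << 0)
#define HANDLED_EDAC (1u << 1)

/* Simplified error record carrying the bookkeeping flags. */
struct mce_rec {
    bool corrected;    /* true for a corrected error */
    uint32_t kflags;   /* which handlers acted on this record */
};

typedef void (*mce_handler)(struct mce_rec *m);

/* CEC model: takes corrected errors and records that it did so. */
static void cec_handler(struct mce_rec *m)
{
    if (m->corrected)
        m->kflags |= HANDLED_CEC;
}

/* EDAC model: skips records CEC already processed, otherwise logs. */
static void edac_handler(struct mce_rec *m)
{
    if (m->kflags & HANDLED_CEC)
        return;
    m->kflags |= HANDLED_EDAC;
}

/* Default handler at the end of the chain: log only if nobody else acted. */
static bool default_handler_should_log(const struct mce_rec *m)
{
    return m->kflags == 0;
}

/* Every handler runs; none stops the chain early. */
static void run_chain(struct mce_rec *m)
{
    static const mce_handler chain[] = { cec_handler, edac_handler };

    for (size_t i = 0; i < sizeof(chain) / sizeof(chain[0]); i++)
        chain[i](m);
}
```

    Because no handler terminates the chain, the last handler always runs
    and decides from the flags alone whether anything is left to report.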
     

11 Dec, 2019

1 commit

  • Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being
    available, which leads the module to be automatically loaded in virtual machines
    as well. That will fail due to the missing PCI devices. In both drivers the first
    function to make use of the PCI devices is skx_get_hi_lo(), which will simply print

    EDAC skx: Can't get tolm/tohm

    for each CPU core, which is noisy. This patch makes it a debug message.

    Signed-off-by: Aristeu Rozanski
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com

    Aristeu Rozanski
     

19 Oct, 2019

2 commits

  • Skylake logs some additional useful information in per-channel
    registers in addition to the architectural status/addr/misc
    logged in the machine check bank.

    Pick up this information and add it to the EDAC log:

    retry_rd_err_log[five 32-bit register values]

    Sorry, no definitions for these registers. OEMs and DIMM vendors
    will be able to use them to isolate which cells in the DIMM are
    causing problems.

    correrrcnt[per rank corrected error counts]

    Note that if additional errors are logged while these registers are
    being read, you may see a jumble of values, some from earlier errors,
    others from later errors (since the registers report the most recent
    logged error). The correrrcnt registers provide error counts per possible
    rank. If these counts only change by one since the previous error logged
    for this channel, then it is safe to assume that the registers logged
    provide a coherent view of one error.

    With this change EDAC logs look like this:

    EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])

    Acked-by: Aristeu Rozanski
    Signed-off-by: Tony Luck

    Tony Luck
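
    The coherence rule described in this commit message can be sketched
    as a comparison of two correrrcnt snapshots. This is an illustration
    only: the function name, the choice of eight ranks (taken from the
    sample log line), and the assumption that each count is a wrapping
    16-bit value are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RANKS 8   /* the sample log line shows eight counts */

/*
 * Return true when exactly one new corrected error was counted across
 * all ranks between two snapshots, i.e. the logged registers most
 * likely describe a single error. Counts are assumed to wrap at 16 bits.
 */
static bool snapshot_is_coherent(const uint16_t *prev, const uint16_t *cur)
{
    assert(prev != NULL && cur != NULL);

    unsigned int total = 0;

    for (size_t i = 0; i < NUM_RANKS; i++)
        total += (uint16_t)(cur[i] - prev[i]);   /* modular delta */

    return total == 1;
}
```

    A total change of two or more means another error arrived while the
    registers were being read, so the logged values may mix several errors.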
     
  • Simplifies the code a little.

    Acked-by: Aristeu Rozanski
    Signed-off-by: Tony Luck

    Tony Luck
     

01 Oct, 2019

1 commit


27 Jun, 2019

1 commit


23 Mar, 2019

1 commit

  • The following Kconfig constellations fail randconfig builds:

    CONFIG_ACPI_NFIT=y
    CONFIG_EDAC_DEBUG=y
    CONFIG_EDAC_SKX=m
    CONFIG_EDAC_I10NM=y

    or

    CONFIG_ACPI_NFIT=y
    CONFIG_EDAC_DEBUG=y
    CONFIG_EDAC_SKX=y
    CONFIG_EDAC_I10NM=m

    with:
    ...
    CC [M] drivers/edac/skx_common.o
    ...
    .../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module'

    That is because if one of the two drivers - skx_edac or i10nm_edac - is
    built-in and the other one is a module, the shared file skx_common.c
    gets linked into a module object by kbuild. Therefore, when linking that
    same file into vmlinux, the '__this_module' symbol used in debugfs isn't
    defined, leading to the above error.

    Fix it by moving all debugfs code out of skx_common.c and into
    skx_base.c and i10nm_base.c. Thus, skx_common.c doesn't refer to the
    '__this_module' symbol anymore.

    Clarify skx_common.c's purpose at the top of the file for future
    reference, while at it.

    [ bp: Make text more readable. ]

    Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
    Reported-by: Arnd Bergmann
    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Cc: James Morse
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk

    Qiuxu Zhuo
     

06 Feb, 2019

1 commit

  • A new error code for systems that use DRAM as an extra level of cache
    looks like:

    000F 0010 1MMM CCCC

    where the MMM and CCCC bits are used for the same purpose as the
    original code. For this new class of errors the ADXL translation will
    provide details of both the DIMM used as cache for the error location
    and the component that is being cached.

    Note: This new error code is first supported in Skylake. Older EDAC
    drivers do not need to be updated.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Cc: Aristeu Rozanski
    Cc: James Morse
    Cc: Mauro Carvalho Chehab
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: https://lkml.kernel.org/r/20190205182109.27828-1-tony.luck@intel.com

    Tony Luck
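
    Matching both bit patterns can be sketched as follows. The original
    memory controller error code is 000F 0000 1MMM CCCC, so the two forms
    differ only in bit 9. The macro names, struct, and function below are
    hypothetical and are not the driver's own definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the fixed bits; drop F (bit 12), MMM (bits 6:4) and CCCC (bits 3:0). */
#define MEM_ERR_MASK  0xef80u
#define MEM_ERR       0x0080u   /* 000F 0000 1MMM CCCC */
#define EXT_MEM_ERR   0x0280u   /* 000F 0010 1MMM CCCC */

struct mem_err {
    bool dram_as_cache;   /* true for the new 0010 variant */
    unsigned int mmm;     /* MMM field */
    unsigned int cccc;    /* CCCC field */
};

/* Decode a 16-bit error code; return false if it is not a memory error. */
static bool decode_mem_err(uint16_t code, struct mem_err *out)
{
    assert(out != NULL);

    uint16_t cls = code & MEM_ERR_MASK;

    if (cls != MEM_ERR && cls != EXT_MEM_ERR)
        return false;

    out->dram_as_cache = (cls == EXT_MEM_ERR);
    out->mmm = (code >> 4) & 0x7;
    out->cccc = code & 0xf;
    return true;
}
```

    For example, the code 0x0091 from a corrected read error decodes as
    the original form with MMM = 1 and CCCC = 1, while 0x0291 decodes as
    the new form with the same field values.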
     

02 Feb, 2019

1 commit

  • Parts of skx_edac can be shared with the Intel 10nm server EDAC driver.

    Carve out the common parts from skx_edac in preparation to support both
    the skx_edac and i10nm_edac drivers.

    Co-developed-by: Tony Luck
    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Cc: James Morse
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com

    Qiuxu Zhuo