18 Aug, 2020
1 commit
-
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.
Reported-by: Gabriele Paoloni
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc:
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
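A minimal user-space sketch of the corrected test may help make the direction of the check concrete. The enum and function names below are stand-ins invented for illustration (the kernel uses the HW_EVENT_ERR_* values named above), and the `uncorrected` flag is an assumed input. RIPV is bit 0 of IA32_MCG_STATUS.

```c
#include <stdbool.h>
#include <stdint.h>

/* RIPV is bit 0 of IA32_MCG_STATUS. */
#define SKETCH_MCG_STATUS_RIPV (1ULL << 0)

/* Stand-ins for the HW_EVENT_ERR_* severities named in the commit message. */
enum sketch_event_type {
    SKETCH_ERR_CORRECTED,
    SKETCH_ERR_UNCORRECTED,
    SKETCH_ERR_FATAL,
};

/*
 * Classify an error. For uncorrected errors, a valid return RIP
 * (RIPV set) means "uncorrected"; a missing one means "fatal".
 */
static enum sketch_event_type
sketch_classify(uint64_t mcgstatus, bool uncorrected)
{
    if (!uncorrected)
        return SKETCH_ERR_CORRECTED;

    return (mcgstatus & SKETCH_MCG_STATUS_RIPV) ? SKETCH_ERR_UNCORRECTED
                                                : SKETCH_ERR_FATAL;
}
```

The reversed version of the check would swap the two arms of the final conditional, which is the bug this commit describes.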
11 Jun, 2020
1 commit
-
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
01 Jun, 2020
1 commit
-
Signed-off-by: Borislav Petkov
20 May, 2020
1 commit
-
The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.
Cc:
Signed-off-by: Qiuxu Zhuo
Reported-by: Matthew Riley
Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
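As a rough illustration of what "get the two fields" from mcmtr could look like, here is a user-space sketch. The bit positions are assumptions made for the example and are not taken from this commit message, so check them against the driver source or the datasheet. The helper macro is modeled on the bit-field extraction style common in EDAC drivers, and all `sketch_` names are hypothetical.

```c
#include <stdint.h>

/*
 * Extract bits lo..hi (inclusive) from v.
 * Only valid for fields narrower than 64 bits.
 */
#define SKETCH_GET_BITFIELD(v, lo, hi) \
    (((uint64_t)(v) >> (lo)) & ((1ULL << ((hi) - (lo) + 1)) - 1))

/* ASSUMED bit positions, for illustration only. */
#define SKETCH_CLOSE_PG_BIT 0
#define SKETCH_BANK_XOR_BIT 9

struct sketch_mem_cfg {
    int close_pg;
    int bank_xor_enable;
};

/* Decode both fields from a raw mcmtr value (not from mtr). */
static struct sketch_mem_cfg sketch_decode_mcmtr(uint32_t mcmtr)
{
    struct sketch_mem_cfg cfg;

    cfg.close_pg = (int)SKETCH_GET_BITFIELD(mcmtr, SKETCH_CLOSE_PG_BIT,
                                            SKETCH_CLOSE_PG_BIT);
    cfg.bank_xor_enable = (int)SKETCH_GET_BITFIELD(mcmtr, SKETCH_BANK_XOR_BIT,
                                                   SKETCH_BANK_XOR_BIT);
    return cfg;
}
```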
28 Apr, 2020
1 commit
-
The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Reviewed-by: Borislav Petkov
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
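The idea of a per-model configuration structure can be sketched as follows. This is a user-space illustration only: the field names, the placeholder device IDs and offsets, and the fake config-space read are all assumptions made for the example, not values from the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-CPU-model parameters that used to be hard-coded. */
struct sketch_res_config {
    unsigned int decs_did;      /* configuration agent PCI device ID */
    int busno_cfg_offset;       /* offset of the bus number config register */
};

/* Placeholder values, NOT real device IDs or register offsets. */
static const struct sketch_res_config sketch_cfg_a = { 0x1111, 0x10 };
static const struct sketch_res_config sketch_cfg_b = { 0x2222, 0x14 };

/*
 * Read the 32-bit little-endian bus number register from a fake
 * config-space image, at the offset chosen by the passed-in config.
 * Returns 0 on success, -1 if the offset is out of range.
 */
static int sketch_read_busno(const struct sketch_res_config *cfg,
                             const uint8_t *cfg_space, size_t len,
                             uint32_t *out)
{
    size_t off;

    if (cfg->busno_cfg_offset < 0)
        return -1;
    off = (size_t)cfg->busno_cfg_offset;
    if (off + 4 > len)
        return -1;

    *out = (uint32_t)cfg_space[off] |
           ((uint32_t)cfg_space[off + 1] << 8) |
           ((uint32_t)cfg_space[off + 2] << 16) |
           ((uint32_t)cfg_space[off + 3] << 24);
    return 0;
}
```

Passing the structure in, rather than branching on the CPU model inside the shared code, keeps the common helper free of model-specific constants.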
14 Apr, 2020
2 commits
-
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).

Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
-
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.

Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
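The flag-passing scheme described in this commit can be modeled in user space roughly as below. The flag names, the `corrected` field, and the handler functions are hypothetical stand-ins; the real notifier chain and the actual kflags bit definitions live in the kernel and are not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "handled by" bits recorded in kflags. */
#define SKETCH_HANDLED_CEC  (1ULL << 0)
#define SKETCH_HANDLED_EDAC (1ULL << 1)

struct sketch_mce {
    bool corrected;     /* assumed input: was this a corrected error? */
    uint64_t kflags;    /* bits set by handlers that acted on the error */
};

/* CEC-like handler: takes corrected errors and records that it did. */
static void sketch_cec_handler(struct sketch_mce *m)
{
    if (m->corrected)
        m->kflags |= SKETCH_HANDLED_CEC;
}

/* EDAC-like handler: skips errors already processed by CEC. */
static void sketch_edac_handler(struct sketch_mce *m)
{
    if (m->kflags & SKETCH_HANDLED_CEC)
        return;
    m->kflags |= SKETCH_HANDLED_EDAC;
}

/* Default handler at the end of the chain: log only if nobody acted. */
static bool sketch_default_handler_must_log(const struct sketch_mce *m)
{
    return m->kflags == 0;
}

/* Every handler runs; none stops the chain early. */
static void sketch_run_chain(struct sketch_mce *m)
{
    sketch_cec_handler(m);
    sketch_edac_handler(m);
}
```

Because no handler stops the chain, later handlers decide for themselves whether to act, based on what earlier ones recorded.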
11 Dec, 2019
1 commit
-
Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being
available, which leads the module to be automatically loaded in virtual machines
as well. That will fail due to the missing PCI devices. In both drivers the first
function to make use of the PCI devices is skx_get_hi_lo(), which will simply print

EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.
Signed-off-by: Aristeu Rozanski
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com
19 Oct, 2019
2 commits
-
Skylake logs some additional useful information in per-channel
registers in addition to the architectural status/addr/misc
logged in the machine check bank.

Pick up this information and add it to the EDAC log:

retry_rd_err_log[five 32-bit register values]

Sorry, no definitions for these registers. OEMs and DIMM vendors
will be able to use them to isolate which cells in the DIMM are
causing problems.

correrrcnt[per rank corrected error counts]

Note that if additional errors are logged while these registers are
being read, you may see a jumble of values, some from earlier errors,
others from later errors (since the registers report the most recent
logged error). The correrrcnt registers provide error counts per possible
rank. If these counts only change by one since the previous error logged
for this channel, then it is safe to assume that the registers logged
provide a coherent view of one error.

With this change EDAC logs look like this:
EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
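The "counts only change by one" rule from the note above could be checked by a log consumer along these lines. This is a sketch under assumptions: eight 16-bit counters per channel (matching the eight values in the sample log line), and counters that wrap modulo 2^16. The function name is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NUM_RANKS 8

/*
 * Compare the correrrcnt values from the previous logged error on a
 * channel with the current ones. If exactly one new corrected error
 * was counted in total, the other logged registers most likely
 * describe that single error.
 */
static bool sketch_snapshot_is_coherent(const uint16_t prev[SKETCH_NUM_RANKS],
                                        const uint16_t cur[SKETCH_NUM_RANKS])
{
    unsigned int total = 0;
    int i;

    for (i = 0; i < SKETCH_NUM_RANKS; i++)
        total += (uint16_t)(cur[i] - prev[i]); /* wrap-safe delta */

    return total == 1;
}
```

A total delta greater than one suggests that more errors arrived while the registers were being read, so the logged values may mix several events.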
Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck
-
Simplifies the code a little.
Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck
01 Oct, 2019
1 commit
-
drivers/edac/skx_common.c: In function ‘skx_mce_output_error’:
drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
478 | char *type, *optype;
| ^~~~
Acked-by: Borislav Petkov
Acked-by: Tony Luck
Signed-off-by: Mauro Carvalho Chehab
27 Jun, 2019
1 commit
-
The source ID register offset for Skylake server is 0xf0, while for
Icelake server is 0xf8. Pass the correct offset to get the source ID.
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
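A tiny sketch of selecting the offset per CPU family, using the two values stated in the message. The enum and function names are hypothetical; in the driver the offset is passed in by the caller rather than looked up like this.

```c
/* Hypothetical CPU family tags for the example. */
enum sketch_cpu {
    SKETCH_SKYLAKE_SERVER,
    SKETCH_ICELAKE_SERVER,
};

/* Source ID register offset: 0xf0 on Skylake server, 0xf8 on Icelake server. */
static int sketch_src_id_offset(enum sketch_cpu cpu)
{
    return cpu == SKETCH_SKYLAKE_SERVER ? 0xf0 : 0xf8;
}
```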
23 Mar, 2019
1 commit
-
The following Kconfig constellations fail randconfig builds:

CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=m
CONFIG_EDAC_I10NM=y

or

CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=m

with:

...
CC [M] drivers/edac/skx_common.o
...
.../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module'

That is because if one of the two drivers - skx_edac or i10nm_edac - is
built-in and the other one is a module, the shared file skx_common.c
gets linked into a module object by kbuild. Therefore, when linking that
same file into vmlinux, the '__this_module' symbol used in debugfs isn't
defined, leading to the above error.

Fix it by moving all debugfs code from skx_common.c to both skx_base.c
and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the
'__this_module' symbol anymore.

Clarify skx_common.c's purpose at the top of the file for future
reference, while at it.

[ bp: Make text more readable. ]
Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Reported-by: Arnd Bergmann
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk
06 Feb, 2019
1 commit
-
A new error code for systems that use DRAM as an extra level of cache
looks like:

000F 0010 1MMM CCCC

where the MMM and CCCC bits are used for the same purpose as the
original code. For this new class of errors the ADXL translation will
provide details of both the DIMM used as cache for the error location
and the component that is being cached.

Note: This new error code is first supported in Skylake. Older EDAC
drivers do not need to be updated.
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: Aristeu Rozanski
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: Qiuxu Zhuo
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190205182109.27828-1-tony.luck@intel.com
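The bit pattern above can be matched with a mask that ignores the F, MMM and CCCC bits. The following user-space sketch derives the mask and value directly from the pattern `000F 0010 1MMM CCCC`. The macro and function names are made up for illustration and are not the driver's own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Pattern: 000F 0010 1MMM CCCC
 * Fixed bits are 15..13, 11..8 and 7; bit 12 (F), bits 6..4 (MMM)
 * and bits 3..0 (CCCC) are "don't care" for the match.
 */
#define SKETCH_EXT_MEM_ERR_MASK  0xef80u
#define SKETCH_EXT_MEM_ERR_VALUE 0x0280u

/* True if the 16-bit error code matches the new pattern. */
static bool sketch_is_ext_mem_err(uint16_t code)
{
    return (code & SKETCH_EXT_MEM_ERR_MASK) == SKETCH_EXT_MEM_ERR_VALUE;
}

/* The three MMM bits (bits 6..4). */
static unsigned int sketch_mmm(uint16_t code)
{
    return (code >> 4) & 0x7u;
}

/* The four CCCC bits (bits 3..0). */
static unsigned int sketch_cccc(uint16_t code)
{
    return code & 0xfu;
}
```

For example, 0x0291 matches with MMM = 1 and CCCC = 1, while 0x0091 (the same fields in the original code format, with bit 9 clear) does not.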
02 Feb, 2019
1 commit
-
Parts of skx_edac can be shared with the Intel 10nm server EDAC driver.
Carve out the common parts from skx_edac in preparation to support both
the skx_edac and i10nm_edac drivers.
Co-developed-by: Tony Luck
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com