Eric Lee / smarc-fsl-linux-kernel

18 Aug, 2020

1 commit

45bc6098a EDAC/{i7core,sb,pnd2,skx}: Fix error event severity

IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.

Reported-by: Gabriele Paoloni
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com

Tony Luck
2020-08-18 21:40:30 +0800
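
The corrected test described in the commit above can be sketched roughly as follows. This is a standalone illustration, not the driver code: `classify_severity()` and its `uncorrected` parameter are hypothetical, and the enum is a stand-in for the kernel's event types. The only facts relied on are that RIPV is bit 0 of IA32_MCG_STATUS and that a clear RIPV should now map to "FATAL".

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: the return RIP pushed on the stack is valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Stand-in for the kernel's event severity enum (illustration only). */
enum hw_event_mc_err_type {
	HW_EVENT_ERR_CORRECTED,
	HW_EVENT_ERR_UNCORRECTED,
	HW_EVENT_ERR_FATAL,
};

/* Hypothetical helper: derive the severity from MCG_STATUS. */
static enum hw_event_mc_err_type
classify_severity(uint64_t mcgstatus, bool uncorrected)
{
	if (!uncorrected)
		return HW_EVENT_ERR_CORRECTED;

	/* RIPV set: the saved return RIP is valid, so not fatal. */
	if (mcgstatus & MCG_STATUS_RIPV)
		return HW_EVENT_ERR_UNCORRECTED;

	/* RIPV clear: no valid return address, mark as fatal. */
	return HW_EVENT_ERR_FATAL;
}
```

The reversed (buggy) form would have returned HW_EVENT_ERR_FATAL in the branch where RIPV is set.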

11 Jun, 2020

1 commit

f77d26a9f Merge branch 'x86/entry' into ras/core

to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.

Thomas Gleixner
2020-06-11 21:17:57 +0800

01 Jun, 2020

1 commit

2a02ca042 Merge branches 'edac-i10nm' and 'edac-misc' into edac-updates-for-5.8

Signed-off-by: Borislav Petkov

Borislav Petkov
2020-06-01 17:39:15 +0800

20 May, 2020

1 commit

103209505 EDAC/skx: Use the mcmtr register to retrieve close_pg/bank_xor_enable

The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.

Signed-off-by: Qiuxu Zhuo
Reported-by: Matthew Riley
Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com

Qiuxu Zhuo
2020-05-20 06:11:29 +0800

28 Apr, 2020

1 commit

ee5340aba EDAC, {skx,i10nm}: Make some configurations CPU model specific

The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Reviewed-by: Borislav Petkov
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic

Qiuxu Zhuo
2020-04-28 00:29:41 +0800
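
A minimal sketch of the idea in the commit above: per-model constants are gathered into one structure and handed to the shared code. Only the name `res_config` comes from the commit message. The field names, the numeric values (deliberate placeholders, not real device IDs or offsets), and `describe_config()` are assumptions made for illustration.

```c
#include <stdint.h>

/* Per-CPU-model configuration; field names are assumed, not confirmed. */
struct res_config {
	uint16_t decs_did;         /* device ID of the configuration agent */
	int      busno_cfg_offset; /* offset of the bus number config register */
};

/* Placeholder values only; real IDs and offsets live in the drivers. */
static const struct res_config model_a_cfg = {
	.decs_did = 0x1111,
	.busno_cfg_offset = 0x10,
};

static const struct res_config model_b_cfg = {
	.decs_did = 0x2222,
	.busno_cfg_offset = 0x20,
};

/* Hypothetical shared routine: it reads the constants from cfg
 * instead of hard-coding them for one CPU model. */
static uint32_t describe_config(const struct res_config *cfg)
{
	return ((uint32_t)cfg->decs_did << 16) | (uint32_t)cfg->busno_cfg_offset;
}
```

Each driver would pass its own structure, so the shared code needs no per-model conditionals.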

14 Apr, 2020

2 commits

7fc0b9b99 EDAC: Drop the EDAC report status checks

When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).

Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.

Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com

Tony Luck
2020-04-14 22:01:01 +0800

23ba710a0 x86/mce: Fix all mce notifiers to update the mce->kflags bitmask

If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.

Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.

Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com

Tony Luck
2020-04-14 21:59:26 +0800
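
The kflags scheme in the commit above might look roughly like this in isolation. It is a userspace sketch: `struct mce_sketch`, the two flag names and their bit positions, and both functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; the kernel's real names and values may differ. */
#define HANDLED_CEC  (1ULL << 0)
#define HANDLED_EDAC (1ULL << 1)

/* Cut-down stand-in for struct mce. */
struct mce_sketch {
	uint64_t addr;
	uint64_t kflags; /* one bit per handler that acted on the error */
};

/* An EDAC-style handler: skip errors CEC already processed,
 * otherwise log the error and record that it was handled. */
static void edac_handler(struct mce_sketch *m)
{
	if (m->kflags & HANDLED_CEC)
		return;

	/* ... decode and log the error here ... */
	m->kflags |= HANDLED_EDAC;
}

/* The last handler on the chain acts only if nobody else did. */
static bool default_handler_should_print(const struct mce_sketch *m)
{
	return m->kflags == 0;
}
```

Because every handler returns normally instead of stopping the chain, the final handler can inspect the accumulated bits.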

11 Dec, 2019

1 commit

854bb4801 EDAC: skx_common: downgrade message importance on missing PCI device

Both skx_edac and i10nm_edac drivers are loaded based on the matching CPU being
available which leads the module to be automatically loaded in virtual machines
as well. That will fail due to the missing PCI devices. In both drivers the first
function to make use of the PCI devices, skx_get_hi_lo(), will simply print

EDAC skx: Can't get tolm/tohm

for each CPU core, which is noisy. This patch makes it a debug message.

Signed-off-by: Aristeu Rozanski
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20191204212325.c4k47p5hrnn3vpb5@redhat.com

Aristeu Rozanski
2019-12-11 06:14:43 +0800

19 Oct, 2019

2 commits

e80634a75 EDAC, skx: Retrieve and print retry_rd_err_log registers

Skylake logs some additional useful information in per-channel
registers in addition to the architectural status/addr/misc
logged in the machine check bank.

Pick up this information and add it to the EDAC log:

retry_rd_err_log[five 32-bit register values]

Sorry, no definitions for these registers. OEMs and DIMM vendors
will be able to use them to isolate which cells in the DIMM are
causing problems.

correrrcnt[per rank corrected error counts]

Note that if additional errors are logged while these registers are
being read, you may see a jumble of values some from earlier errors,
others from later errors (since the registers report the most recent
logged error). The correrrcnt registers provide error counts per possible
rank. If these counts only change by one since the previous error logged
for this channel, then it is safe to assume that the registers logged
provide a coherent view of one error.

With this change EDAC logs look like this:

EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])

Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck

Tony Luck
2019-10-19 06:27:58 +0800
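
The coherence rule from the commit above (counts changed by exactly one since the previous logged error) could be checked along these lines. This is a sketch under assumptions: eight 16-bit counters per channel, as the sample log line suggests, and "change by one" read as a total increase of one across all ranks. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The sample log line shows eight correrrcnt values. */
#define NUM_RANK_COUNTERS 8

/* Return true if exactly one corrected error was counted between the
 * two snapshots, i.e. the logged registers likely describe one error. */
static bool snapshot_is_coherent(const uint16_t prev[NUM_RANK_COUNTERS],
				 const uint16_t cur[NUM_RANK_COUNTERS])
{
	unsigned int total = 0;

	for (size_t i = 0; i < NUM_RANK_COUNTERS; i++) {
		/* 16-bit subtraction tolerates counter wraparound. */
		total += (uint16_t)(cur[i] - prev[i]);
	}

	return total == 1;
}
```

A total above one means further errors arrived while the registers were being read, so the values may mix several errors.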

29b8e84fb EDAC, skx_common: Refactor so that we initialize "dev" in result of adxl decode.

Simplifies the code a little.

Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck

Tony Luck
2019-10-19 06:27:48 +0800

01 Oct, 2019

1 commit

f05390d30 EDAC: skx_common: get rid of unused type var

drivers/edac/skx_common.c: In function ‘skx_mce_output_error’:
drivers/edac/skx_common.c:478:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
  478 |  char *type, *optype;
      |        ^~~~

Acked-by: Borislav Petkov
Acked-by: Tony Luck
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2019-10-01 02:41:54 +0800

27 Jun, 2019

1 commit

1dc78f1ff EDAC, skx, i10nm: Fix source ID register offset

The source ID register offset for Skylake server is 0xf0, while for
Icelake server is 0xf8. Pass the correct offset to get the source ID.

Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck

Qiuxu Zhuo
2019-06-27 01:07:27 +0800
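
As an illustration of the fix above, the offset can be selected per CPU type rather than hard-coded. The two offsets come from the commit message. The enum and function name are hypothetical.

```c
/* Hypothetical CPU type tag used only for this sketch. */
enum cpu_kind {
	CPU_SKYLAKE_SERVER,
	CPU_ICELAKE_SERVER,
};

/* Return the source ID register offset for the given CPU type. */
static int src_id_offset(enum cpu_kind kind)
{
	switch (kind) {
	case CPU_ICELAKE_SERVER:
		return 0xf8;
	case CPU_SKYLAKE_SERVER:
	default:
		return 0xf0;
	}
}
```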

23 Mar, 2019

1 commit

fe783516e EDAC, skx, i10nm: Make skx_common.c a pure library

The following Kconfig constellations fail randconfig builds:

CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=m
CONFIG_EDAC_I10NM=y

or

CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=m

with:
...
CC [M] drivers/edac/skx_common.o
...
.../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module'

That is because if one of the two drivers - skx_edac or i10nm_edac - is
built-in and the other one is a module, the shared file skx_common.c
gets linked into a module object by kbuild. Therefore, when linking that
same file into vmlinux, the '__this_module' symbol used in debugfs isn't
defined, leading to the above error.

Fix it by moving all debugfs code from skx_common.c to both skx_base.c
and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the
'__this_module' symbol anymore.

Clarify skx_common.c's purpose at the top of the file for future
reference, while at it.

[ bp: Make text more readable. ]

Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Reported-by: Arnd Bergmann
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk

Qiuxu Zhuo
2019-03-23 16:43:50 +0800

06 Feb, 2019

1 commit

cbfa482f7 EDAC, skx_common: Add code to recognise new compound error code

A new error code for systems that use DRAM as an extra level of cache
looks like:

000F 0010 1MMM CCCC

where the MMM and CCCC bits are used for the same purpose as the
original code. For this new class of errors the ADXL translation will
provide details of both the DIMM used as cache for the error location
and the component that is being cached.

Note: This new error code is first supported in Skylake. Older EDAC
drivers do not need to be updated.

Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: Aristeu Rozanski
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: Qiuxu Zhuo
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190205182109.27828-1-tony.luck@intel.com

Tony Luck
2019-02-06 18:03:06 +0800
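
The bit pattern in the commit above can be matched with a mask that ignores the F, MMM and CCCC fields. The sketch below assumes the original memory controller code has the form 000F 0000 1MMM CCCC, as given in the Intel SDM. The macro and function names are illustrative and not necessarily those used in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep every bit except F (bit 12), MMM (bits 6:4) and CCCC (bits 3:0). */
#define MCACOD_MASK        0xef80u
#define MCACOD_MEM_ERR     0x0080u /* 000F 0000 1MMM CCCC */
#define MCACOD_EXT_MEM_ERR 0x0280u /* 000F 0010 1MMM CCCC */

/* Original memory controller error code. */
static bool is_mem_err(uint16_t mcacod)
{
	return (mcacod & MCACOD_MASK) == MCACOD_MEM_ERR;
}

/* New code for systems that use DRAM as an extra level of cache. */
static bool is_ext_mem_err(uint16_t mcacod)
{
	return (mcacod & MCACOD_MASK) == MCACOD_EXT_MEM_ERR;
}
```

A handler that wants both classes would accept a code when either predicate is true.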

02 Feb, 2019

1 commit

88a242c98 EDAC, skx_common: Separate common code out from skx_edac

Parts of skx_edac can be shared with the Intel 10nm server EDAC driver.

Carve out the common parts from skx_edac in preparation to support both
skx_edac driver and i10nm_edac drivers.

Co-developed-by: Tony Luck
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc: James Morse
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com

Qiuxu Zhuo
2019-02-02 17:50:59 +0800