18 Aug, 2020
1 commit
-
IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
the stack as part of machine check delivery is valid or not.

Various drivers copied a code fragment that uses the RIPV bit to
determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
RIPV is set as "FATAL").

Reverse the tests so that the error is marked fatal when RIPV is not set.
Reported-by: Gabriele Paoloni
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Cc:
Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com
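The corrected test from the commit above can be sketched in plain C. This is a minimal userspace illustration, not the driver code: the enum is a stand-in for the EDAC event types named in the message, and ue_severity() is a hypothetical helper name. The only hardware fact assumed is that RIPV is bit 0 of IA32_MCG_STATUS.

```c
#include <stdint.h>

/* Bit 0 of IA32_MCG_STATUS: restart IP valid. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Stand-ins for the EDAC event types named in the commit message. */
enum hw_event_type {
	HW_EVENT_ERR_UNCORRECTED,
	HW_EVENT_ERR_FATAL,
};

/* Hypothetical helper: classify an uncorrected error from MCG_STATUS. */
static enum hw_event_type ue_severity(uint64_t mcgstatus)
{
	/* RIPV set: the RIP pushed on the stack is valid. */
	if (mcgstatus & MCG_STATUS_RIPV)
		return HW_EVENT_ERR_UNCORRECTED;

	/* RIPV clear: no valid return address, so mark the error fatal. */
	return HW_EVENT_ERR_FATAL;
}
```

The reversed check described in the message would have returned the two values the other way round.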
15 Aug, 2020
1 commit
-
Pull edac fix from Tony Luck:
"Fix for the ie31200 driver that missed the first pull"

* tag 'edac_updates_for_5.9_pt2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/ie31200: Fallback if host bridge device is already initialized
11 Aug, 2020
1 commit
-
The Intel uncore driver may claim some of the pci ids from ie31200 which
means that the ie31200 edac driver will not initialize them as part of
pci_register_driver().

Let's add a fallback for this case to 'pci_get_device()' to get a
reference on the device such that it can still be configured. This is
similar in approach to other edac drivers.
Signed-off-by: Jason Baron
Cc: Borislav Petkov
Cc: Mauro Carvalho Chehab
Cc: linux-edac
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com
04 Aug, 2020
2 commits
-
Pull EDAC updates from Tony Luck:
"Boris is on vacation and asked me to send you the EDAC changes"

* tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC: Fix reference count leaks
EDAC: Remove edac_get_dimm_by_index()
EDAC/ghes: Scan the system once on driver init
EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt
EDAC/ghes: Setup DIMM label from DMI and use it in error reports
EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations
EDAC/mc: Call edac_inc_ue_error() before panic
EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier
-
Pull x86 RAS updates from Ingo Molnar:
"Boris is on vacation and he asked us to send you the pending RAS bits:

- Print the PPIN field on CPUs that fill them out
- Fix an MCE injection bug
- Simplify a kzalloc in dev_mcelog_init_device()"
* tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce, EDAC/mce_amd: Print PPIN in machine check records
x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
x86/mce/inject: Fix a wrong assignment of i_mce.status
23 Jun, 2020
1 commit
-
Print the Protected Processor Identification Number (PPIN) on processors
which support it.
[ bp: Massage. ]
Signed-off-by: Smita Koralahalli
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200623130059.8870-1-Smita.KoralahalliChannabasappa@amd.com
22 Jun, 2020
1 commit
19 Jun, 2020
1 commit
-
Commit:
da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
added support for F15h, model 0x60 CPUs but in doing so, failed to read
back the SCRCTRL PCI config register on F15h CPUs which are *not* model
0x60. Add that read so that doing

$ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate

can show the previously set DRAM scrub rate.
Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
Reported-by: Anders Andersson
Signed-off-by: Borislav Petkov
Cc: #v4.4..
Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com
17 Jun, 2020
2 commits
-
When kobject_init_and_add() returns an error, it should be handled
because kobject_init_and_add() takes a reference even when it fails. If
this function returns an error, kobject_put() must be called to properly
clean up the memory associated with the object.

Therefore, replace the kfree() call with kobject_put() and add a
missing kobject_put() in the edac_device_register_sysfs_main_kobj()
error path.
[ bp: Massage and merge into a single patch. ]
Fixes: b2ed215a3338 ("Kobject: change drivers/edac to use kobject_init_and_add")
Signed-off-by: Qiushi Wu
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu
Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu
-
Change the hardware scanning and figuring out how many DIMMs a machine
has to a single, one-time thing which happens once on driver init. After
that scanning completes, struct ghes_hw_desc contains a representation
of the hardware which the driver can then use for later initialization.

Then, copy the DIMM information into the respective EDAC core
representation of those.

Get rid of ghes_edac_dimm_fill and use a struct dimm_info array
directly.

This way, hw detection and further driver initialization is nicely
and logically split. Further additions should all be added to
ghes_scan_system() and the hw representation extended as needed.

There should be no functionality change resulting from this patch.
Signed-off-by: Borislav Petkov
16 Jun, 2020
5 commits
-
The struct members list and ghes of struct ghes_edac_pvt are unused,
remove them. On that occasion, rename it to the shorter name struct
ghes_pvt.
Signed-off-by: Robert Richter
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com
-
The ghes driver reports errors with 'unknown label' even if the actual
DIMM label is known, e.g.:

EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)

Fix this by using struct dimm_info's label string in error reports:

EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
DRAM memory)

The labels are initialized by reading the bank and device strings
from DMI. Now, the label information can also be read from sysfs. E.g. a
ThunderX2 system will show the following:

/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
/sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
/sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
/sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
/sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
/sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
/sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
/sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
/sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
/sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
/sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
/sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
/sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
/sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
/sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
/sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0

Since dimm_labels can be rewritten, that label will be used in a later
error report:

# echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
# # some error injection here
# dmesg | grep foobar
[ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
memory)

[ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]
Signed-off-by: Robert Richter
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
-
Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU
stepping specific configurations to {skx,i10nm}_init(), so that the
CPU stepping check can be deleted from i10nm_init().
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com
-
By calling edac_inc_ue_error() before panic, we get a correct UE error
count for core dump analysis.
Signed-off-by: Zhenzhong Duan
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200610065846.3626-2-zhenzhong.duan@gmail.com
-
Avoid giving it MCE_PRIO_LOWEST priority by default.
Signed-off-by: Zhenzhong Duan
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200610065846.3626-1-zhenzhong.duan@gmail.com
14 Jun, 2020
2 commits
-
Pull more Kbuild updates from Masahiro Yamada:
- fix build rules in binderfs sample
- fix build errors when Kbuild recurses to the top Makefile
- convert '---help---' in Kconfig to 'help'
* tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
treewide: replace '---help---' in Kconfig files with 'help'
kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
samples: binderfs: really compile this sample and fix build issues
-
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.
a) 4 spaces + '---help---'
b) 7 spaces + '---help---'
c) 8 spaces + '---help---'
d) 1 space + 1 tab + '---help---'
e) 1 tab + '---help---' (correct indentation)
f) 1 tab + 1 space + '---help---'
g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following command:

$ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada
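To make the effect of the sed expression above concrete, here is a rough C equivalent for a single line. It is only a sketch: convert_help_line() is a hypothetical helper, not something from the kernel tree, and it assumes the caller supplies a large enough output buffer.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite one Kconfig line the way the sed expression does: optional
 * leading whitespace followed by "---help---" becomes one tab plus
 * "help", and anything after the keyword is kept. Other lines are
 * copied unchanged. Returns 1 if the line was converted, 0 otherwise.
 */
static int convert_help_line(const char *in, char *out, size_t outsz)
{
	static const char old_kw[] = "---help---";
	const char *p = in;

	/* Skip whatever mix of spaces and tabs precedes the keyword. */
	while (isspace((unsigned char)*p))
		p++;

	if (strncmp(p, old_kw, sizeof(old_kw) - 1) != 0) {
		snprintf(out, outsz, "%s", in);
		return 0;
	}

	snprintf(out, outsz, "\thelp%s", p + sizeof(old_kw) - 1);
	return 1;
}
```

All seven indentation styles listed in the message collapse to the same output because the leading whitespace is discarded before the tab is written.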
11 Jun, 2020
1 commit
-
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
01 Jun, 2020
1 commit
-
Signed-off-by: Borislav Petkov
29 May, 2020
1 commit
-
The variable ret is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant so remove it.
Signed-off-by: Colin Ian King
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200429154847.287001-1-colin.king@canonical.com
23 May, 2020
1 commit
-
Add support for AMD Renoir (4000-series Ryzen CPUs).
Signed-off-by: Alexander Monakov
Signed-off-by: Borislav Petkov
Acked-by: Yazen Ghannam
Link: https://lkml.kernel.org/r/20200510204842.2603-4-amonakov@ispras.ru
20 May, 2020
1 commit
-
The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.
Cc:
Signed-off-by: Qiuxu Zhuo
Reported-by: Matthew Riley
Acked-by: Aristeu Rozanski
Signed-off-by: Tony Luck
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
28 Apr, 2020
2 commits
-
The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville
servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from
stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville
servers with CPU stepping >=4, the offset for bus number configuration
register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the
steppings use the updated 0xd0 offset.

Fix the issue by using the appropriate offset for bus number
configuration register according to the CPU model number and stepping.
Reported-by: Jerry Chen
Reported-and-tested-by: Jin Wen
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Reviewed-by: Borislav Petkov
Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic
-
The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo
Signed-off-by: Tony Luck
Reviewed-by: Borislav Petkov
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
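The offset rule described in the two commits above can be written down as a small selection function. This is a sketch under stated assumptions: the model tags, struct res_cfg and pick_res_cfg() are hypothetical names chosen for illustration, and only the 0xcc/0xd0 values and the stepping threshold come from the commit messages.

```c
#include <stdint.h>

/* Hypothetical model tags; the real driver matches CPU model numbers. */
enum cpu_model {
	MODEL_ICELAKE_X,
	MODEL_TREMONT_D,	/* Tremont/Jacobsville */
	MODEL_ICELAKE_D,
};

/* Per-model resource configuration, in the spirit of res_config. */
struct res_cfg {
	uint32_t busno_cfg_offset;
};

static struct res_cfg pick_res_cfg(enum cpu_model model, unsigned int stepping)
{
	/* Original offset of the bus number configuration register. */
	struct res_cfg cfg = { .busno_cfg_offset = 0xcc };

	/* Ice Lake-D uses the updated offset on every stepping. */
	if (model == MODEL_ICELAKE_D)
		cfg.busno_cfg_offset = 0xd0;
	/* Ice Lake and Tremont/Jacobsville switch at stepping 4. */
	else if (stepping >= 4)
		cfg.busno_cfg_offset = 0xd0;

	return cfg;
}
```

Passing the result around as a structure, rather than hard-coding the offset, is what lets the init functions stay model-agnostic.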
24 Apr, 2020
1 commit
-
Fix the following gcc warning:
drivers/edac/amd8131_edac.c:47:21: warning: ‘bridge_str’ defined but not
used [-Wunused-const-variable=]
static char * const bridge_str[] = {
^~~~~~~~~~
Reported-by: Hulk Robot
Signed-off-by: Jason Yan
Signed-off-by: Borislav Petkov
Reviewed-by: Robert Richter
Link: https://lkml.kernel.org/r/20200415085006.6732-1-yanaijie@huawei.com
23 Apr, 2020
1 commit
-
Make a couple of symbols static, as reported by sparse.
[ bp: Massage. ]
Reported-by: Hulk Robot
Signed-off-by: Zou Wei
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/1587624744-97240-1-git-send-email-zou_wei@huawei.com
14 Apr, 2020
5 commits
-
When acpi_extlog was added, we were worried that the same error would
be reported more than once by different subsystems. But in the ensuing
years I've seen complaints that people could not find an error log
(because this mechanism suppressed the log they were looking for).

Rip it all out. People are smart enough to notice the same address from
different reporting mechanisms.
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com
-
If the handler took any action to log or deal with the error, set a bit
in mce->kflags so that the default handler on the end of the machine
check chain can see what has been done.

Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
skip over errors already processed by CEC.
Signed-off-by: Tony Luck
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com
-
... because no one should be interested in spurious MCEs anyway. Make
the filtering unconditional and move it to amd_filter_mce().
Signed-off-by: Borislav Petkov
Tested-by: Tony Luck
Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de
-
Fix the following gcc warning:
drivers/edac/xgene_edac.c:1486:7: warning: variable ‘address’ set but
not used [-Wunused-but-set-variable]
u32 address;
^~~~~~~
Remove the unused macro RBERRADDR_RD while at it.
Reported-by: Hulk Robot
Signed-off-by: Jason Yan
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200409093259.20069-1-yanaijie@huawei.com
-
Fix spelling (s/Aramda/Armada/) in a log message and in a comment. While
at it, add a trailing '\n' in messages.
Signed-off-by: Christophe JAILLET
Signed-off-by: Borislav Petkov
Reviewed-by: Jan Luebbe
Link: https://lkml.kernel.org/r/20200413041556.3514-1-christophe.jaillet@wanadoo.fr
31 Mar, 2020
1 commit
-
Pull perf updates from Ingo Molnar:
"The main changes in this cycle were:

Kernel side changes:

- A couple of x86/cpu cleanups and changes were grandfathered in due
  to patch dependencies. These clean up the set of CPU model/family
  matching macros with a consistent namespace and C99 initializer
  style.

- A bunch of updates to various low level PMU drivers:
  * AMD Family 19h L3 uncore PMU
  * Intel Tiger Lake uncore support
  * misc fixes to LBR TOS sampling

- optprobe fixes
- perf/cgroup: optimize cgroup event sched-in processing
- misc cleanups and fixes

Tooling side changes are to:
- perf {annotate,expr,record,report,stat,test}
- perl scripting
- libapi, libperf and libtraceevent
- vendor events on Intel and S390, ARM cs-etm
- Intel PT updates
- Documentation changes and updates to core facilities
- misc cleanups, fixes and other enhancements"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
cpufreq/intel_pstate: Fix wrong macro conversion
x86/cpu: Cleanup the now unused CPU match macros
hwrng: via_rng: Convert to new X86 CPU match macros
crypto: Convert to new CPU match macros
ASoC: Intel: Convert to new X86 CPU match macros
powercap/intel_rapl: Convert to new X86 CPU match macros
PCI: intel-mid: Convert to new X86 CPU match macros
mmc: sdhci-acpi: Convert to new X86 CPU match macros
intel_idle: Convert to new X86 CPU match macros
extcon: axp288: Convert to new X86 CPU match macros
thermal: Convert to new X86 CPU match macros
hwmon: Convert to new X86 CPU match macros
platform/x86: Convert to new CPU match macros
EDAC: Convert to new X86 CPU match macros
cpufreq: Convert to new X86 CPU match macros
ACPI: Convert to new X86 CPU match macros
x86/platform: Convert to new CPU match macros
x86/kernel: Convert to new CPU match macros
x86/kvm: Convert to new CPU match macros
x86/perf/events: Convert to new CPU match macros
...
30 Mar, 2020
1 commit
-
…into edac-updates-for-5.7
Signed-off-by: Borislav Petkov <bp@suse.de>
25 Mar, 2020
2 commits
-
Conflicts:
arch/x86/events/intel/uncore.c
Signed-off-by: Ingo Molnar
-
The new macro set has a consistent namespace and uses C99 initializers
instead of the grufty C89 ones.
Signed-off-by: Thomas Gleixner
Signed-off-by: Borislav Petkov
Reviewed-by: Greg Kroah-Hartman
Acked-by: Tony Luck
Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de
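The difference between the two initializer styles mentioned above is easy to show outside the kernel. The structure and both macros below are simplified, hypothetical stand-ins rather than the real x86 match macros; they only illustrate why named (C99) initializers are less fragile than positional (C89) ones.

```c
#include <stddef.h>

/* Simplified stand-in for a CPU match table entry. */
struct cpu_id {
	unsigned short vendor;
	unsigned short family;
	unsigned short model;
	const void *driver_data;
};

/* C89 style: positional, every field must appear in struct order. */
#define OLD_MATCH(fam, mod, data) { 0, fam, mod, data }

/* C99 style: fields are named; omitted ones are zero-initialized. */
#define NEW_MATCH(fam, mod, data) \
	{ .family = (fam), .model = (mod), .driver_data = (data) }

static const struct cpu_id old_table[] = {
	OLD_MATCH(6, 0x55, NULL),
	{ 0, 0, 0, NULL },		/* terminator */
};

static const struct cpu_id new_table[] = {
	NEW_MATCH(6, 0x55, NULL),
	{ 0 },				/* terminator */
};
```

With the named form, adding or reordering a field in the structure does not silently shift values into the wrong member.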
18 Mar, 2020
1 commit
-
Since snprintf() returns the would-be-output size instead of the actual
output size, the succeeding calls may go beyond the given buffer limit.
Fix it by replacing with scnprintf().
Signed-off-by: Takashi Iwai
Signed-off-by: Borislav Petkov
Reviewed-by: Jan Luebbe
Link: https://lkml.kernel.org/r/20200311071728.4541-1-tiwai@suse.de
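The failure mode described above comes from using the return value as a write offset. The sketch below is a userspace approximation: my_scnprintf() and format_two() are hypothetical helpers written for this illustration (the kernel provides its own scnprintf()), and they assume a non-NULL buffer.

```c
#include <stdarg.h>
#include <stdio.h>

/*
 * Userspace approximation of scnprintf(): returns the number of
 * characters actually stored, not counting the trailing NUL.
 */
static int my_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);
	va_end(args);

	if (n < 0)
		return 0;
	/* vsnprintf() reports the would-be length; clamp it to what fit. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Append two fields into buf; returns the total characters stored. */
static int format_two(char *buf, size_t size, const char *a, const char *b)
{
	int len = 0;

	/* len never exceeds size - 1, so buf + len stays inside buf. */
	len += my_scnprintf(buf + len, size - len, "%s ", a);
	len += my_scnprintf(buf + len, size - len, "%s", b);
	return len;
}
```

Had format_two() accumulated plain snprintf() return values, a truncated first field would push len past the end of the buffer and make size - len wrap around.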
17 Mar, 2020
1 commit
-
On the ZynqMP platform, zynqmp_get_error_info() is used to read out
error information. In this function, the pinf->col parameter is not
used (it is only used by the Zynq platform's zynq_get_error_info()). So
there's no need to print pinf->col on ZynqMP.

In order to differentiate on which platform handle_error() is executed,
use DDR_ECC_INTR_SUPPORT as the check condition to distinguish between
Zynq and ZynqMP platforms.
[ bp: Massage. ]
Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Sherry Sun
Signed-off-by: Borislav Petkov
Reviewed-by: Manish Narani
Link: https://lkml.kernel.org/r/1584365679-27443-1-git-send-email-sherry.sun@nxp.com
27 Feb, 2020
1 commit
-
handle_error() currently calls snprintf() a couple of times in
succession to output the message for a CE/UE, therefore overwriting each
part of the message which was formatted with the previous snprintf()
call. As a result, only the part of the message from the last snprintf()
call will be printed.

The simplest and most effective way to fix this problem is to combine
the whole string into one and pass it to a single snprintf() call.
[ bp: Massage. ]
Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
Signed-off-by: Sherry Sun
Signed-off-by: Borislav Petkov
Reviewed-by: James Morse
Cc: Manish Narani
Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com
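The overwrite problem and its fix can be reproduced in a few lines. The structure, field names and message text below are made up for illustration and are not taken from the driver; only the pattern (several snprintf() calls on the same buffer start versus one combined call) follows the commit message above.

```c
#include <stdio.h>

/* Hypothetical subset of the details reported for a corrected error. */
struct ecc_err {
	unsigned int row;
	unsigned int bank;
	unsigned int data;
};

/* Buggy pattern: each call starts at msg[0] and overwrites the last. */
static void format_overwriting(char *msg, size_t size, const struct ecc_err *e)
{
	snprintf(msg, size, "DDR ECC error type: CE Row %u ", e->row);
	snprintf(msg, size, "Bank %u ", e->bank);
	snprintf(msg, size, "Data 0x%08x", e->data);
}

/* Fixed pattern: one format string, one call, whole message kept. */
static void format_combined(char *msg, size_t size, const struct ecc_err *e)
{
	snprintf(msg, size, "DDR ECC error type: CE Row %u Bank %u Data 0x%08x",
		 e->row, e->bank, e->data);
}
```

After format_overwriting() only the last fragment remains in the buffer, which matches the symptom described in the message.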
20 Feb, 2020
1 commit
-
The driver supports error detection and correction on devices with an
ARM DMC-520 memory controller.
Signed-off-by: Lei Wang
Signed-off-by: Shiping Ji
Signed-off-by: Borislav Petkov
Reviewed-by: James Morse
Link: https://lkml.kernel.org/r/83b48c70-dc06-d0d4-cae9-a2187fca628b@gmail.com
19 Feb, 2020
1 commit
-
This warning is output for every virtual CPU in a guest on an EPYC 2
system because kvm doesn't enable SMCA. Once is enough too.
[ bp: Massage. ]
Signed-off-by: Prarit Bhargava
Signed-off-by: Borislav Petkov
Link: https://lkml.kernel.org/r/20200217134627.19765-1-prarit@redhat.com