28 Jan, 2017
5 commits
-
Match one of the devices in amd64_cpuids[] before loading the module.
This is an additional sanity check against users trying to load
amd64_edac_mod on unsupported systems.Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485537863-2707-9-git-send-email-Yazen.Ghannam@amd.com
[ Get rid of err_ret label, make it a bit more readable this way. ]
Signed-off-by: Borislav Petkov -
Having ECC disabled on a node doesn't necessarily mean that it's
disabled for the entire system. So let's return a non-failing code when
ECC is disabled on a node. This way we can skip initialization for the
node but still continue with the remaining nodes.After probing all instances, make sure we have at least one MC device
allocated.This issue is seen and fix tested on Fam15h and Fam17h MCM systems.
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485537863-2707-8-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
Print the node number when informing that DRAM ECC is disabled so
that we can show which nodes have DRAM ECC disabled. Also, print more
detailed system information as edac_dbg(), so as to not bother general
users.Switch amd64_notice to amd64_info to match the message above it.
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485537863-2707-5-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
We have a few functions that register/unregister an ECC error decoding
routine. These functions are called when we init/remove instances.
However, they are global and so don't need to be registered/unregistered
multiple times.So move them out of the init/remove instance functions and into the
module init/exit routines.Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485297149-13733-4-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
Jump to memory freeing routines when init_one_instance() fails.
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1485297149-13733-3-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
16 Jan, 2017
1 commit
-
We should save the return code from probe_one_instance() so that it can
be returned from the module init function. Otherwise, we'll be returning
the -ENOMEM from above.Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1484322741-41884-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
04 Dec, 2016
1 commit
-
When the call to zalloc_cpumask_var() fails, returning "false" seems
improper. The real value of macro "false" is 0, and 0 means no error.
Return -ENOMEM instead.Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189071
Signed-off-by: Pan Bian
Cc: linux-edac
Link: http://lkml.kernel.org/r/1480831638-5361-1-git-send-email-bianpan201604@163.com
Signed-off-by: Borislav Petkov
01 Dec, 2016
1 commit
-
Prefix the warn and error macros with the respective string so that
callers don't have to say "Error" or "Warning". We save us string length
this way in the actual calls.While at it, shorten the calls in reserve_mc_sibling_devs().
Signed-off-by: Borislav Petkov
Cc: Dan Carpenter
Cc: Yazen Ghannam
30 Nov, 2016
5 commits
-
Add Fam17h to the list of families to autoload amd64_edac_mod.
Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-18-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
How we need to decode UMC errors is different from how we decode bus
errors, so let's define a new function for this. We also need a way to
determine the UMC channel since we're not guaranteed that there is a
fixed relation between channel and MCA bank.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1480359593-80369-1-git-send-email-Yazen.Ghannam@amd.com
[ Fold in decode_synd_reg(), simplify. ]
Signed-off-by: Borislav Petkov -
We need to determine the EDAC capabilities from all UMCs on the node. We
should only check UMCs that are enabled and make sure they all agree.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-15-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
The UMCs on Fam17h are independent memory controllers so we need to
read the capabilities from all UMCs and make sure they agree. Once
we determine what capabilities are available we should save them for
convenience.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1480431116-94683-1-git-send-email-Yazen.Ghannam@amd.com
[ Simplify f17h_determine_edac_ctl_cap(), preinit edac_mode in init_csrows(). ]
Signed-off-by: Borislav Petkov -
Read a few more UMC registers and provide debug output in order to be as
similar as possible to older AMD systems.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1480344621-14966-1-git-send-email-Yazen.Ghannam@amd.com
[ Remove unneeded K8 check and comments, fixup others. ]
Signed-off-by: Borislav Petkov
29 Nov, 2016
3 commits
-
Fam17h has new register offsets and fields for setting up the DRAM
scrubber so add support for this.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-17-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
Fam17h has a different set of registers and bitfields. Most of these
registers are read through SMN (System Management Network) rather
than PCI config space. Also, the derivation of various values is now
different.Update amd64_edac to read the appropriate registers and extract the
correct values for Fam17h.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-12-git-send-email-Yazen.Ghannam@amd.com
[ Save us the indentation level in read_mc_regs(), add defines ]
Signed-off-by: Borislav Petkov -
Fam17h needs PCI device functions 0 and 6 instead of 1 and 2 as on older
systems. Update struct amd64_pvt to hold the new functions and reserve
them if on Fam17h.Also, allocate an array of UMC structs within our newly allocated PVT
struct.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-11-git-send-email-Yazen.Ghannam@amd.com
[ init_one_instance() error handling, shorten lines, unbreak >80 cols lines. ]
Signed-off-by: Borislav Petkov
25 Nov, 2016
2 commits
-
Add a family type and associated ops for Fam17h. Define a struct to hold
all the UMC registers that we need. Make this a part of struct amd64_pvt
in order to maximize code reuse in the rest of the driver.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-10-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
Update the ecc_enabled() function to work on Fam17h. This entails
reading a different set of registers and using the SMN (System
Management Network) rather than PCI devices.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc: x86-ml
Link: http://lkml.kernel.org/r/1479423463-8536-9-git-send-email-Yazen.Ghannam@amd.com
[ Fixup ecc_en assignment and get_umc_base(). ]
Signed-off-by: Borislav Petkov
24 Nov, 2016
1 commit
-
It's not recommended for the OS to try and force-enable ECC checking.
This is considered a firmware task since it includes memory training,
etc, so don't change ECC settings on Fam17h or newer systems and inform
the user.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1479850816-1595-1-git-send-email-Yazen.Ghannam@amd.com
[ Put the "forcing" message in an else branch. ]
Signed-off-by: Borislav Petkov
21 Nov, 2016
3 commits
-
Currently, deferred errors are classified as correctable in EDAC. Add a
new error type for deferred errors so that they are correctly reported
to the user.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1479423463-8536-7-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
We only use __log_bus_error() to log DRAM ECC errors, so let's change
the name to reflect this. We'll also use this function for DRAM ECC
errors on Fam17h, but we'll call it from a different function than
decode_bus_error().Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1479423463-8536-6-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov -
AMD Fam17h will not be using PCI function 2 for EDAC, but will continue
to use function 3. So let's get the name of F3 instead of F2 to support
Fam17h and previous families.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1479423463-8536-5-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
21 Sep, 2016
1 commit
-
Reinstate driver autoloading now that PCI dependency is gone.
Signed-off-by: Yazen Ghannam
Cc: linux-edac
Link: http://lkml.kernel.org/r/1473984445-1726-2-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
08 Aug, 2016
1 commit
-
Fam15hMod60h systems are using the channel decode of Fam15hMod30h which
gives incorrect results. Fam15hMod60h systems should use the generic
channel decode method plus a couple more cases.Signed-off-by: Yazen Ghannam
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1470236355-30039-1-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov
16 Jun, 2016
1 commit
-
It is useless to do it if we're loaded on unsupported hardware so do
that only after we have detected at least 1 supported AMD northbridge.Signed-off-by: Borislav Petkov
18 May, 2016
1 commit
-
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
gitignore: fix wording
mfd: ab8500-debugfs: fix "between" in printk
memstick: trivial fix of spelling mistake on management
cpupowerutils: bench: fix "average"
treewide: Fix typos in printk
IB/mlx4: printk fix
pinctrl: sirf/atlas7: fix printk spelling
serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
w1: comment spelling s/minmum/minimum/
Blackfin: comment spelling s/divsor/divisor/
metag: Fix misspellings in comments.
ia64: Fix misspellings in comments.
hexagon: Fix misspellings in comments.
tools/perf: Fix misspellings in comments.
cris: Fix misspellings in comments.
c6x: Fix misspellings in comments.
blackfin: Fix misspelling of 'register' in comment.
avr32: Fix misspelling of 'definitions' in comment.
treewide: Fix typos in printk
Doc: treewide : Fix typos in DocBook/filesystem.xml
...
10 May, 2016
1 commit
-
- remove homegrown instances counting.
- take F3 PCI device from amd_nb caching instead of F2 which was used with the
PCI core.With those changes, the driver doesn't need to register a PCI driver and
relies on the northbridges caching which we do anyway on AMD.Signed-off-by: Borislav Petkov
Cc: Yazen Ghannam
27 Apr, 2016
1 commit
-
... and don't mislead users into thinking that the driver has loaded
successfully.Signed-off-by: Borislav Petkov
18 Apr, 2016
1 commit
-
This patch fix spelling typos found in printk
within various part of the kernel sources.Signed-off-by: Masanari Iida
Acked-by: Randy Dunlap
Signed-off-by: Jiri Kosina
25 Jan, 2016
1 commit
-
dct_sel_base_off is declared as a u64 but we're only using the lower 32
bits because of a shift wrapping bug. This can possibly truncate the
upper 16 bits of DctSelBaseOffset[47:26], causing us to misdecode the CS
row.Fixes: c8e518d5673d ('amd64_edac: Sanitize f10_get_base_addr_offset')
Signed-off-by: Dan Carpenter
Cc: Aravind Gopalakrishnan
Cc: linux-edac
Cc:
Link: http://lkml.kernel.org/r/20160120095451.GB19898@mwanda
Signed-off-by: Borislav Petkov
04 Nov, 2015
1 commit
-
Pull RAS changes from Ingo Molnar:
"The main system reliability related changes were from x86, but also
some generic RAS changes:- AMD MCE error injection subsystem enhancements. (Aravind
Gopalakrishnan)- Fix MCE and CPU hotplug interaction bug. (Ashok Raj)
- kcrash bootup robustness fix. (Baoquan He)
- kcrash cleanups. (Borislav Petkov)
- x86 microcode driver rework: simplify it by unmodularizing it and
other cleanups. (Borislav Petkov)"* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
x86/mce: Add a Scalable MCA vendor flags bit
MAINTAINERS: Unify the microcode driver section
x86/microcode/intel: Move #ifdef DEBUG inside the function
x86/microcode/amd: Remove maintainers from comments
x86/microcode: Remove modularization leftovers
x86/microcode: Merge the early microcode loader
x86/microcode: Unmodularize the microcode driver
x86/mce: Fix thermal throttling reporting after kexec
kexec/crash: Say which char is the unrecognized
x86/setup/crash: Check memblock_reserve() retval
x86/setup/crash: Cleanup some more
x86/setup/crash: Remove alignment variable
x86/setup: Cleanup crashkernel reservation functions
x86/amd_nb, EDAC: Rename amd_get_node_id()
x86/setup: Do not reserve crashkernel high memory if low reservation failed
x86/microcode/amd: Do not overwrite final patch levels
x86/microcode/amd: Extract current patch level read to a function
x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
...
21 Oct, 2015
1 commit
-
This function doesn't give us the "Node ID" as the function name
suggests. Rather, it receives a PCI device as argument, checks
the available F3 PCI device IDs in the system and returns the
index of the matching Bus/Device IDs.Rename it to amd_pci_dev_to_node_id().
No functional change is introduced.
Suggested-by: Ingo Molnar
Signed-off-by: Aravind Gopalakrishnan
Signed-off-by: Borislav Petkov
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Mauro Carvalho Chehab
Cc: Peter Zijlstra
Cc: Suravee Suthikulpanit
Cc: Thomas Gleixner
Cc: linux-edac
Link: http://lkml.kernel.org/r/1445246268-26285-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar
29 Sep, 2015
1 commit
-
The scrub rate control register has moved to function 2 in PCI config
space and is at a different offset on family 0x15, models 0x60 and
later. The minimum recommended scrub rate has also changed. (Refer to
D18F2x1c9_dct[1:0][DramScrub] in Fam15hM60h BKDG).Adjust set_scrub_rate() and get_scrub_rate() functions to accommodate
this.Tested on F15hM60h, Fam15h, models 00h-0fh and Fam10h systems.
Signed-off-by: Aravind Gopalakrishnan
Cc: linux-edac
Link: http://lkml.kernel.org/r/1443440593-2316-2-git-send-email-Aravind.Gopalakrishnan@amd.com
[ Cleanup conditionals. ]
Signed-off-by: Borislav Petkov
20 May, 2015
1 commit
-
While testing asynchronous PCI probe on this driver I noticed it failed
because the driver checks if any of the PCI devices have been bound to
the driver after registering it, which obviously does not work if
probing is asynchronous.While there are patches and discussions on how the driver should behave
are ongoing, let's enforce synchronous probe for this driver for now.Reviewed-by: Tejun Heo
Signed-off-by: Luis R. Rodriguez
Signed-off-by: Dmitry Torokhov
Signed-off-by: Greg Kroah-Hartman
23 Feb, 2015
2 commits
-
... and do the proper thing using EDAC core facilities.
Cc: Daniel J Blueman
Signed-off-by: Borislav Petkov -
Instead of calling device_create_file() and device_remove_file()
manually, pass the static attribute groups with the new
edac_mc_add_mc_with_groups(). The conditional creation of inject sysfs
files is done by a proper is_visible callback.Signed-off-by: Takashi Iwai
Link: http://lkml.kernel.org/r/1423046938-18111-4-git-send-email-tiwai@suse.de
Signed-off-by: Borislav Petkov
17 Feb, 2015
1 commit
-
When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
the kernel fatally dereferences unallocated structures, see splat below;
this occurs on at least NumaConnect systems.Fix by checking if a memory controller info structure was found.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
IP: [] decode_bus_error+0x2f/0x2b0
PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in:
CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G D 3.19.0 #1
Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b 01/28/2015
task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
RIP: 0010:[] [] decode_bus_error+0x2f/0x2b0
RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
Stack:
0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
Call Trace:
? vprintk_default
? printk
amd_decode_mce
notifier_call_chain
atomic_notifier_call_chain
mce_log
machine_check_poll
mce_timer_fn
? mce_cpu_restart
call_timer_fn.isra.29
run_timer_softirq
__do_softirq
irq_exit
smp_apic_timer_interrupt
apic_timer_interrupt
? down_read_trylock
__do_page_fault
? __schedule
do_page_fault
page_faultSigned-off-by: Daniel J Blueman
Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
Cc: stable@vger.kernel.org
[ Boris: massage commit message ]
Signed-off-by: Borislav Petkov
05 Nov, 2014
1 commit
-
By popular demand, enable amd64_edac on 32-bit too.
Boris:
- update Kconfig text.
- add a warning on load which states that 32-bit configurations are unsupported.Signed-off-by: Tomasz Pala
Link: http://lkml.kernel.org/r/20141102102212.GA7034@polanet.pl
Signed-off-by: Borislav Petkov
30 Oct, 2014
1 commit
-
This patch adds support for ECC error decoding for F15h M60h processor.
Aside from the usual changes, the patch adds support for some new features
in the processor:
- DDR4(unbuffered, registered); LRDIMM DDR3 support
- relevant debug messages have been modified/added to report these
memory types
- new dbam_to_cs mappers
- if (F15h M60h && LRDIMM); we need a 'multiplier' value to find
cs_size. This multiplier value is obtained from the per-dimm
DCSM register. So, change the interface to accept a 'cs_mask_nr'
value to facilitate this calculation
- switch-casing determine_memory_type()
- done to cleanse the function of too many if-else statements
and improve readability
- This is now called early in read_mc_regs() to cache dram_typeMisc cleanup:
- amd64_pci_table[] is condensed by using PCI_VDEVICE macro.Testing details:
Tested the patch by injecting 'ECC' type errors using mce_amd_inj
and error decoding works fine.Signed-off-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1414617483-4941-1-git-send-email-Aravind.Gopalakrishnan@amd.com
[ Boris: determine_memory_type() cleanups ]
Signed-off-by: Borislav Petkov
23 Sep, 2014
1 commit
-
Rationale behind this change:
- F2x1xx addresses were stopped from being mapped explicitly to DCT1
from F15h (OR) onwards. They use _dct[0:1] mechanism to access the
registers. So we should move away from using address ranges to select
DCT for these families.
- On newer processors, the address ranges used to indicate DCT1 (0x140,
0x1a0) have different meanings than what is assumed currently.Changes introduced:
- amd64_read_dct_pci_cfg() now takes in dct value and uses it for
'selecting the dct'
- Update usage of the function. Keep in mind that different families
have specific handling requirements
- Remove [k8|f10]_read_dct_pci_cfg() as they don't do much different
from amd64_read_pci_cfg()
- Move the k8 specific check to amd64_read_pci_cfg
- Remove f15_read_dct_pci_cfg() and move logic to amd64_read_dct_pci_cfg()
- Remove now needless .read_dct_pci_cfgTesting:
- Tested on Fam 10h; Fam15h Models: 00h, 30h; Fam16h using 'EDAC_DEBUG'
and mce_amd_inj
- driver obtains info from F2x registers and caches it in pvt
structures correctly
- ECC decoding works fineSigned-off-by: Aravind Gopalakrishnan
Link: http://lkml.kernel.org/r/1410799058-3149-1-git-send-email-aravind.gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov