11 Jun, 2012
1 commit
-
Add a new tracepoint-based hardware events report method for
reporting Memory Controller events.

Part of the description below is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

We have several subsystems & methods for reporting hardware errors:
1) EDAC ("Error Detection and Correction"). In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely on reading chipset registers directly. A user-level
program decodes into somewhat human readable format.

3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
decodes errors reported via machine check bank registers in AMD
processors to the console log using printk();

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users.

As part of a RAS subsystem, let's encapsulate the memory error hardware
events into a trace facility.

The tracepoint printk will be displayed like:
mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver detail])
Where:
[quant] is the quantity of errors
[error msg] is the driver-specific error message
(e. g. "memory read", "bus error", ...);
[location] is the location in terms of memory controller and
branch/channel/slot, channel/slot or csrow/channel;
[label] is the memory stick label;
[edac_mc detail] describes the address location of the error
and the syndrome;
[driver detail] is driver-specific error message details,
when needed/provided (e. g. "area:DMA", ...)

For example:
mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
Of course, any userspace tools meant to handle errors should not parse
the above data. They should, instead, use the binary fields provided by
the tracepoint, mapping them directly into their Management Information
Base.

NOTE: The original patch was providing an additional mechanism for
MCA-based trace events that also contained MCA error register data.
However, as no agreement was reached so far for the MCA-based trace
events, for now, let's add events only for memory errors.
A later patch is planned to change the tracepoint, for those types
of event.

Cc: Aristeu Rozanski
Cc: Doug Thompson
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Signed-off-by: Mauro Carvalho Chehab
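
The mc_event text layout described in this commit message can be illustrated with a minimal userspace sketch. The struct, its field names and format_mc_event() are hypothetical and do not reproduce the kernel's tracepoint definition; treating "memory stick DIMM_1A" as the label is also an assumption. The sketch only shows how the bracketed fields could be assembled into the example line:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record; the field names are illustrative and do not
 * reproduce the kernel's tracepoint layout. */
struct mc_event_sketch {
	unsigned int count;        /* [quant] */
	const char *severity;      /* "Corrected", "Uncorrected" or "Fatal" */
	const char *msg;           /* [error msg] */
	const char *label;         /* [label] */
	const char *location;      /* [location] */
	const char *detail;        /* [edac_mc detail] */
	const char *driver_detail; /* [driver detail], "" when absent */
};

/* Render the record as text; returns whatever snprintf() returns. */
int format_mc_event(char *buf, size_t len, const struct mc_event_sketch *ev)
{
	const char *sep;

	assert(buf != NULL && ev != NULL);
	/* Only emit the separating space when driver detail is present. */
	sep = strlen(ev->driver_detail) ? " " : "";
	return snprintf(buf, len, "mc_event: %u %s error:%s on %s (%s %s%s%s)",
			ev->count, ev->severity, ev->msg, ev->label,
			ev->location, ev->detail, sep, ev->driver_detail);
}
```

As the message itself notes, tools should consume the binary fields rather than parse this text; the formatter is only meant for human-readable output.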
29 May, 2012
2 commits
-
Now that all drivers got converted to use the new ABI, we can
drop the old one.

Acked-by: Chris Metcalf
Signed-off-by: Mauro Carvalho Chehab
-
Change the EDAC internal representation to work with non-csrow
based memory controllers.

There are lots of those memory controllers nowadays, and more
are coming. So, the EDAC internal representation needs to be
changed, in order to work with those memory controllers, while
preserving backward compatibility with the old ones.

The edac core was written with the idea that memory controllers
are able to directly access csrows.

This is not true for FB-DIMM and RAMBUS memory controllers.
Also, some recent advanced memory controllers don't present a per-csrows
view. Instead, they view memories as DIMMs rather than ranks.

So, change the allocation and error report routines to allow
them to work with all types of architectures.

This will allow the removal of several hacks with FB-DIMM and RAMBUS
memory controllers.

Also, several tests were done on different platforms using different
x86 drivers.

TODO: multi-rank DIMMs are currently represented by multiple DIMM
entries in struct dimm_info. That means that changing a label for one
rank won't change the same label for the other ranks at the same DIMM.
This bug has been present since the beginning of the EDAC, so it is not a big
deal. However, on several drivers, it is possible to fix this issue, but
it should be a per-driver fix, as the csrow => DIMM arrangement may not
be equal for all. So, don't try to fix it here yet.

I tried to make this patch as short as possible, preceding it with
several other patches that simplified the logic here. Yet, as the
internal API changes, all drivers need changes. The changes are
generally bigger in the drivers for FB-DIMMs.

Cc: Aristeu Rozanski
Cc: Doug Thompson
Cc: Borislav Petkov
Cc: Mark Gross
Cc: Jason Uhlenkott
Cc: Tim Small
Cc: Ranganathan Desikan
Cc: "Arvind R."
Cc: Olof Johansson
Cc: Egor Martovetsky
Cc: Chris Metcalf
Cc: Michal Marek
Cc: Jiri Kosina
Cc: Joe Perches
Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Hitoshi Mitake
Cc: Andrew Morton
Cc: "Niklas Söderlund"
Cc: Shaohui Xie
Cc: Josh Boyer
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Mauro Carvalho Chehab
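
One way to picture a representation that covers branch/channel/slot, channel/slot and csrow/channel controllers alike is a list of layer sizes plus a per-layer position that maps to a flat DIMM index. The sketch below is an assumption made for illustration: layout_sketch, MAX_LAYERS_SKETCH and flat_index() are hypothetical names, not the structures this patch adds.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAYERS_SKETCH 3 /* hypothetical upper bound for this sketch */

/* Hypothetical description of a memory controller hierarchy: each
 * layer (branch, channel, slot, csrow, ...) has a fixed size. */
struct layout_sketch {
	size_t n_layers;
	unsigned int size[MAX_LAYERS_SKETCH];
};

/* Map a per-layer position to a flat index in row-major order.
 * Returns -1 if any position is out of range for its layer. */
long flat_index(const struct layout_sketch *l, const unsigned int *pos)
{
	long idx = 0;
	size_t i;

	assert(l != NULL && pos != NULL);
	assert(l->n_layers <= MAX_LAYERS_SKETCH);
	for (i = 0; i < l->n_layers; i++) {
		if (pos[i] >= l->size[i])
			return -1;
		idx = idx * (long)l->size[i] + (long)pos[i];
	}
	return idx;
}
```

With this shape, a csrow/channel controller is simply a two-layer layout and an FB-DIMM controller a three-layer one, so the same lookup serves both.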
15 Dec, 2011
1 commit
-
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Cc: Doug Thompson
Cc: Paul Gortmaker
Cc: Lucas De Marchi
Cc: Borislav Petkov
Signed-off-by: Kay Sievers
Signed-off-by: Greg Kroah-Hartman
01 Nov, 2011
1 commit
-
As we'll need to use those structs for trace functions, they should
be in a more public place. So, move struct mem_ctl_info & friends
to edac.h.

No functional changes in this patch.
Signed-off-by: Mauro Carvalho Chehab
Signed-off-by: Doug Thompson
27 May, 2011
1 commit
-
synchronize_rcu() does the stuff as needed.
Signed-off-by: Lai Jiangshan
Cc: Doug Thompson
Cc: "Paul E. McKenney"
Cc: Mauro Carvalho Chehab
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Mar, 2011
1 commit
-
Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi
14 Jan, 2011
1 commit
-
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
Documentation/trace/events.txt: Remove obsolete sched_signal_send.
writeback: fix global_dirty_limits comment runtime -> real-time
ppc: fix comment typo singal -> signal
drivers: fix comment typo diable -> disable.
m68k: fix comment typo diable -> disable.
wireless: comment typo fix diable -> disable.
media: comment typo fix diable -> disable.
remove doc for obsolete dynamic-printk kernel-parameter
remove extraneous 'is' from Documentation/iostats.txt
Fix spelling milisec -> ms in snd_ps3 module parameter description
Fix spelling mistakes in comments
Revert conflicting V4L changes
i7core_edac: fix typos in comments
mm/rmap.c: fix comment
sound, ca0106: Fix assignment to 'channel'.
hrtimer: fix a typo in comment
init/Kconfig: fix typo
anon_inodes: fix wrong function name in comment
fix comment typos concerning "consistent"
poll: fix a typo in comment
...

Fix up trivial conflicts in:
- drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
- fs/ext4/ext4.h

Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
07 Jan, 2011
2 commits
-
Make the ->{get|set}_sdram_scrub_rate return the actual scrub rate
bandwidth it succeeded setting and remove superfluous arg pointer used
for that. A negative value returned still means that an error occurred
while setting the scrub rate. Document this for future reference.

Signed-off-by: Borislav Petkov
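
The return convention described here (the bandwidth actually set on success, a negative value on error) can be sketched in userspace as follows. The bandwidth table, the current_bw variable and both function names are hypothetical; a real driver would program hardware registers instead.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of supported bandwidths in descending order;
 * the values are made up for this sketch. */
static const uint32_t supported_bw[] = { 1600000, 800000, 400000, 100000 };
#define N_BW (sizeof(supported_bw) / sizeof(supported_bw[0]))

static uint32_t current_bw; /* stands in for a hardware register */

/* Program the highest supported bandwidth not exceeding new_bw.
 * Returns the bandwidth actually set, or a negative errno value. */
int set_scrub_rate_sketch(uint32_t new_bw)
{
	size_t i;

	for (i = 0; i < N_BW; i++) {
		if (supported_bw[i] <= new_bw) {
			current_bw = supported_bw[i];
			return (int)current_bw;
		}
	}
	return -EINVAL; /* request is below the lowest supported rate */
}

/* Report the bandwidth currently in effect. */
int get_scrub_rate_sketch(void)
{
	return (int)current_bw;
}
```

Because the result travels in the return value, callers no longer need an output pointer, and a failed request leaves the previous setting in place.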
-
Add a macro per printk level, shorten up error messages. Add relevant
information to KERN_INFO level. No functional change.

Signed-off-by: Borislav Petkov
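
The "one macro per level" idea can be sketched in plain C. Everything here is a userspace stand-in: sketch_log(), the level tags and the "sketch_edac" prefix are hypothetical and are not the macros this commit adds.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static char log_buf[256]; /* last message, kept so callers can inspect it */

/* Format one message with a level tag and a common prefix.
 * Returns the message length, or -1 on a formatting error. */
int sketch_log(const char *level, const char *fmt, ...)
{
	va_list ap;
	int n, m;

	n = snprintf(log_buf, sizeof(log_buf), "%s sketch_edac: ", level);
	if (n < 0 || (size_t)n >= sizeof(log_buf))
		return -1;
	va_start(ap, fmt);
	m = vsnprintf(log_buf + n, sizeof(log_buf) - (size_t)n, fmt, ap);
	va_end(ap);
	if (m < 0)
		return -1;
	fputs(log_buf, stderr);
	return n + m;
}

/* One macro per level keeps call sites short and consistent. */
#define sketch_err(...)  sketch_log("ERR", __VA_ARGS__)
#define sketch_warn(...) sketch_log("WARN", __VA_ARGS__)
#define sketch_info(...) sketch_log("INFO", __VA_ARGS__)
```

A call such as sketch_info("csrow %d ready\n", 3) then carries the level and prefix implicitly.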
23 Dec, 2010
1 commit
-
Conflicts:
MAINTAINERS
arch/arm/mach-omap2/pm24xx.c
drivers/scsi/bfa/bfa_fcpim.c

Needed to update to apply fixes for which the old branch was too
outdated.
09 Dec, 2010
1 commit
-
This corrects the misprint introduced when moving '#if
PAGE_SHIFT' from i7core_edac.c to edac_core.h (commit
e9144601d364d5b81f3e63949337f8507eb58dca)

Cc: Mauro Carvalho Chehab
Signed-off-by: Andrei Konovalov
Signed-off-by: Borislav Petkov
02 Nov, 2010
1 commit
-
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",

Signed-off-by: Uwe Kleine-König
Signed-off-by: Jiri Kosina
24 Oct, 2010
3 commits
-
Signed-off-by: Mauro Carvalho Chehab
-
There are two groups of sysfs attributes: one for rdimm and another
for udimm. Instead of changing dynamically the unique static struct
for handling udimm's, declare two vars and make them constant.

This avoids the risk of having two or more memory controllers, each
needing a different set of attributes.

While here, use const on all places where it is applicable.
Signed-off-by: Mauro Carvalho Chehab
edac_core: use const for constant sysfs arguments
Signed-off-by: Mauro Carvalho Chehab
-
With multi-sockets, more than one edac pci handler is enabled. Be sure to
un-register all instances.

Signed-off-by: Mauro Carvalho Chehab
03 Aug, 2010
2 commits
-
Fortify the interface to not accept negative values, remove
memctrl_int_store() as a result. Also, sanitize bandwidth setting by
making the argument a simple u32 instead of strange u32 pointer being
passed around for no obvious reason. Then, fix error handling and teach
it to return proper error values. Finally, make code more readable,
simplify debug messages.

Cc: Mauro Carvalho Chehab
Cc: Arthur Jones
Signed-off-by: Borislav Petkov
Acked-by: Doug Thompson
-
This option differs from EDAC_DEBUG only by printing the file and
line of where the debug statement is placed, which contains unneeded
information. So remove it.

Signed-off-by: Borislav Petkov
Acked-by: Doug Thompson
10 May, 2010
3 commits
-
Current code only works when there's just one memory
controller, since we need one kobj for each instance.

Signed-off-by: Mauro Carvalho Chehab
-
Signed-off-by: Mauro Carvalho Chehab
-
Currently, all sysfs nodes are stored at /sys/.*/mc. (regex)
However, sometimes it is needed to create attribute groups.

This patch extends edac_core to allow groups creation.
Signed-off-by: Mauro Carvalho Chehab
08 Dec, 2009
1 commit
-
Instead of using deeply-nested conditionals for dumping the DIMM type in
debug mode, add a strings array of the supported DIMM types.

This is useful in cases where an edac driver supports multiple DRAM
types and is only defined in debug builds.

Signed-off-by: Borislav Petkov
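
Replacing nested conditionals with a strings array indexed by the type value could look roughly like this. The enum values, the strings and mem_type_name() are a hypothetical subset chosen for the sketch, not the table this commit adds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of memory types for this sketch. */
enum mem_type_sketch { MT_EMPTY, MT_DDR, MT_DDR2, MT_DDR3, MT_COUNT };

/* Designated initializers keep each string next to its enum value. */
static const char *const mem_type_names[MT_COUNT] = {
	[MT_EMPTY] = "Empty",
	[MT_DDR]   = "DDR",
	[MT_DDR2]  = "DDR2",
	[MT_DDR3]  = "DDR3",
};

/* Bounds-checked lookup, so an unexpected value cannot index past
 * the end of the table. */
const char *mem_type_name(unsigned int t)
{
	if (t >= MT_COUNT || mem_type_names[t] == NULL)
		return "invalid";
	return mem_type_names[t];
}
```

Adding a new type then means one enum entry and one table line, rather than another branch in every debug dump.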
21 Sep, 2009
1 commit
-
trivial: fix typo "for for" in multiple files
Signed-off-by: Anand Gadiyar
Signed-off-by: Jiri Kosina
01 Jul, 2009
1 commit
-
Since some new MPC85xx SoCs now support DDR3 memory, add the DDR3 memory
type for MPC85xx EDAC.

Signed-off-by: Yang Shi
Cc: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Jun, 2009
1 commit
-
Add edac_device_alloc_index(), because for MAPLE platform there may
exist several EDAC driver modules that could make use of
edac_device_ctl_info structure at the same time. The index allocation
for these structures should be taken care of by EDAC core.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Harry Ciao
Cc: Doug Thompson
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Kumar Gala
Cc: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
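
Central index allocation for several independent driver modules can be sketched with a shared atomic counter. This is a userspace illustration using C11 atomics; the counter and alloc_index_sketch() are hypothetical names and the sketch does not claim to match the body of edac_device_alloc_index().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical counter shared by all callers. */
static atomic_int device_indexes = 0;

/* Hand out 0, 1, 2, ...; atomic_fetch_add() returns the value held
 * before the increment, so concurrent callers never share an index. */
int alloc_index_sketch(void)
{
	return atomic_fetch_add(&device_indexes, 1);
}
```

Keeping the counter in one place means no driver has to guess which indexes other modules have already taken.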
10 Jun, 2009
1 commit
-
This shortens debugfX() calls a bit.
Reviewed-by: Mauro Carvalho Chehab
CC: Doug Thompson
Signed-off-by: Borislav Petkov
14 Apr, 2009
1 commit
-
Fix the edac local pci_write_bits32 to properly note the 'escape' mask if
all ones in a 32-bit word.

Currently no consumer of this function uses that mask, so there is no
danger to existing code.

Signed-off-by: Jeff Haran
Signed-off-by: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
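
Reading the 'escape' mask as "an all-ones mask replaces every bit, so no read-modify-write is needed", the behaviour can be sketched against a simulated register. The fake_reg variable, the read counter and all function names are hypothetical; real code would use the PCI config space accessors.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t fake_reg;  /* stands in for a PCI config dword */
static unsigned int reads; /* counts simulated config-space reads */

uint32_t reg_read(void)
{
	reads++;
	return fake_reg;
}

void reg_write(uint32_t v)
{
	fake_reg = v;
}

/* Update only the bits selected by mask. The escape value must be
 * all ones across the full 32 bits; only then is the read skipped. */
void write_bits32_sketch(uint32_t value, uint32_t mask)
{
	if (mask != UINT32_C(0xffffffff)) {
		uint32_t old = reg_read();

		value = (value & mask) | (old & ~mask);
	}
	reg_write(value);
}
```

Comparing against a narrower constant such as 0xffff would make a 16-bit mask skip the read and clobber the upper half of the register, which is the kind of mistake a 32-bit escape check avoids.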
03 Apr, 2009
2 commits
-
Add edac_pci_alloc_index(), because for MAPLE platform there may exist
several EDAC driver modules that could make use of edac_pci_ctl_info
structure at the same time. The index allocation for these structures
should be taken care of by EDAC core.

Signed-off-by: Harry Ciao
Cc: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A patch for making debugging information more verbose for use in
development debugging.

By enabling the new option "More verbose debugging", information about
source file and line number will be added to debugging message.

This is sample output,
EDAC MC0: Giving out device to 'e7xxx_edac' 'E7205': DEV 0000:00:00.0
EDAC DEBUG: in drivers/edac/edac_pci.c, line at 48: edac_pci_alloc_ctl_info()
EDAC DEBUG: in drivers/edac/edac_pci.c, line at 334: edac_pci_add_device()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^Signed-off-by: Hitoshi Mitake
Signed-off-by: Doug Thompson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Aug, 2008
1 commit
-
This patch lets the files using linux/version.h match the files that
#include it.

Signed-off-by: Adrian Bunk
Signed-off-by: Linus Torvalds
06 May, 2008
1 commit
-
Commit 06916639e2fed9ee475efef2747a1b7429f8fe76 ("driver-core: add
dev_name() to help transition away from using bus_id") added a static
inline dev_name() and used it in dev_printk.

Unfortunately, drivers/edac/edac_core.h defines a macro called
dev_name(). Rename the latter.

Diagnosis by Tony Breeds and Michael Ellerman.
Signed-off-by: Stephen Rothwell
Acked-by: Doug Thompson
Signed-off-by: Linus Torvalds
08 Feb, 2008
1 commit
-
Add the definitions for the Rambus XDR memory type used by the Cell processor.
It's a pre-requisite for the followup Cell EDAC patch.

Signed-off-by: Benjamin Herrenschmidt
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Oct, 2007
1 commit
-
define global BIT macro
move all local BIT defines to the new globally defined macro.
Signed-off-by: Jiri Slaby
Cc: Paul Mackerras
Cc: Benjamin Herrenschmidt
Cc: Kumar Gala
Cc: Dmitry Torokhov
Cc: Jeff Garzik
Cc: James Bottomley
Cc: "Antonino A. Daplas"
Cc: Russell King
Acked-by: Ralf Baechle
Cc: "John W. Linville"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
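
A single shared bit helper replaces per-file copies of the same shift. The sketch below assumes the usual (1UL << nr) shape and uses a suffixed name to avoid clashing with any existing definition; the STATUS_* flags and has_flag() are hypothetical.

```c
#include <assert.h>

/* Assumed shape of a global bit helper, renamed for this sketch. */
#define BIT_SKETCH(nr) (1UL << (nr))

/* Hypothetical status flags built from bit numbers. */
#define STATUS_CE       BIT_SKETCH(0)
#define STATUS_UE       BIT_SKETCH(1)
#define STATUS_OVERFLOW BIT_SKETCH(5)

/* Nonzero when every bit of flag is set in status. */
int has_flag(unsigned long status, unsigned long flag)
{
	return (status & flag) == flag;
}
```

With one definition, every user gets the same unsigned long result type instead of whatever width each local macro happened to choose.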
12 Sep, 2007
1 commit
-
When EDAC is configured for EDAC DEBUGGING, the debug printk output level
was set TOO high (EMERG). This patch brings it down to a DEBUG level.

Signed-off-by: Doug Thompson
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Jul, 2007
6 commits
-
Some simple fixes to properly reference counter values from the block
attribute level of edac_device objects. Properly sequencing the array pointer
was added, resulting in correct identification of block level attributes from
their base class functions.

Added more verbose debug statement for event tracking.
Also during some corner testing, found a bug in the store/show sequence
of operations for the block attribute/controls management.

An old intermediate structure for 'blocks' was still in the processing
pipeline. This patch removes that old structure and correctly utilizes the
new struct edac_dev_sysfs_block_attribute for passing control from the sysfs
to the low level store/show function of the edac driver.

Now the proper kobj pointer is passed downward to the store/show
functions.

Signed-off-by: Doug Thompson
Cc: Greg KH
Cc: Alan Cox
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
With feedback, this patch corrects operation of the kobject release operation
on kobjects, attributes and controls for the edac_device.

Cc: Alan Cox alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson
Acked-by: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch refactors the 'releasing' of kobjects for the edac_mc type of
device. The correct pattern of kobject release is followed.

As internal kobjs are allocated they bump a ref count on the top level kobj.
It in turn has a module ref count on the edac_core module. When internal
kobjects are released, they dec the ref count on the top level kobj. When the
top level kobj reaches zero, it decrements the ref count on the edac_core
object, allowing it to be unloaded, as all resources have now been released.

Cc: Alan Cox alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson
Acked-by: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Refactoring of sysfs code necessitated the refactoring of the
edac_device_alloc() and edac_device_add_device() apis, moving the index
value to the alloc() function. This patch alters the in tree drivers to
utilize this new api signature.

Having the index value assigned later created a chicken-and-the-egg issue.
Moving it to the alloc() function allows for creating the necessary sysfs
entries with the proper index number.

Cc: Alan Cox alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Refactoring of sysfs code necessitated the refactoring of the edac_mc_alloc()
and edac_mc_add_mc() apis, moving the index value to the alloc() function.
This patch alters the in tree drivers to utilize this new api signature.

Having the index value assigned later created a chicken-and-the-egg issue.
Moving it to the alloc() function allows for creating the necessary sysfs
entries with the proper index number.

Cc: Alan Cox alan@lxorguk.ukuu.org.uk
Signed-off-by: Doug Thompson
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch fixes and enhances the driver level set of sysfs attributes that
can be added to the 'block' level of an edac_device type of driver.

There is a controller information structure, which contains one or more
instances of device. Each instance will have one or more blocks of device
specific counters. This patch fixes the ability to have more detailed
attributes/controls for each of the 'blocks', providing for the addition of
controls/attributes from the low level driver to user space via sysfs.

Cc: Alan Cox alan@lxorguk.ukuu.org.uk
Signed-off-by: Douglas Thompson
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds