Eric Lee / smarc-fsl-linux-kernel

30 Jul, 2012

1 commit

c2078e4c9 Merge branch 'devel'

* devel: (33 commits)
edac i5000, i5400: fix pointer math in i5000_get_mc_regs()
edac: allow specifying the error count with fake_inject
edac: add support for Calxeda highbank L2 cache ecc
edac: add support for Calxeda highbank memory controller
edac: create top-level debugfs directory
sb_edac: properly handle error count
i7core_edac: properly handle error count
edac: edac_mc_handle_error(): add an error_count parameter
edac: remove arch-specific parameter for the error handler
amd64_edac: Don't pass driver name as an error parameter
edac_mc: check for allocation failure in edac_mc_alloc()
edac: Increase version to 3.0.0
edac_mc: Cleanup per-dimm_info debug messages
edac: Convert debugfX to edac_dbg(X,
edac: Use more normal debugging macro style
edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs
Edac: Add ABI Documentation for the new device nodes
edac: move documentation ABI to ABI/testing/sysfs-devices-edac
i7core_edac: change the mem allocation scheme to make Documentation/kobject.txt happy
edac: change the mem allocation scheme to make Documentation/kobject.txt happy
...

Mauro Carvalho Chehab
2012-07-30 08:11:05 +0800

27 Jun, 2012

4 commits

f58d0dee0 edac i5000, i5400: fix pointer math in i5000_get_mc_regs()

"pvt->ambase" is a u64 datatype. The intent here is to fill the first
half in the first call to pci_read_config_dword() and the other half in
the second. Unfortunately the pointer math is wrong so we set the wrong
data.
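
The mistake described here is a common one with typed pointers: adding sizeof(u32) to a u32 pointer advances by four elements (16 bytes), not by 4 bytes. The following user-space sketch only illustrates that arithmetic. It is not the driver code, and all helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte distance covered by "ptr + sizeof(uint32_t)" on a uint32_t pointer.
 * Pointer arithmetic is scaled by the element size, so this moves by
 * four elements rather than four bytes. */
size_t offset_buggy_sketch(void)
{
    uint32_t buf[8] = {0};
    uint32_t *second = buf + sizeof(uint32_t);

    return (size_t)((char *)second - (char *)buf);
}

/* Byte distance covered by "ptr + 1": exactly one 32-bit element. */
size_t offset_fixed_sketch(void)
{
    uint32_t buf[8] = {0};
    uint32_t *second = buf + 1;

    return (size_t)((char *)second - (char *)buf);
}

/* A layout-independent alternative: read both halves into locals and
 * combine them with shifts instead of writing through a cast pointer. */
uint64_t combine_halves_sketch(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

With a 4-byte uint32_t, the first helper reports a 16-byte step and the second a 4-byte step, which is the difference between writing outside the 64-bit variable and filling its second half.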

Signed-off-by: Dan Carpenter
Signed-off-by: Mauro Carvalho Chehab

Dan Carpenter
2012-06-27 20:08:40 +0800

38ced28b2 edac: allow specifying the error count with fake_inject

In order to test if the error counters are properly incremented,
add a way to specify how many errors were generated by a trace.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-27 20:01:30 +0800

69154d069 edac: add support for Calxeda highbank L2 cache ecc

Add support for L2 ECC on Calxeda highbank platform.

Signed-off-by: Rob Herring
Signed-off-by: Mauro Carvalho Chehab

Rob Herring
2012-06-27 20:01:29 +0800

a1b01edb2 edac: add support for Calxeda highbank memory controller

Add support for memory controller on Calxeda Highbank platforms. Highbank
platforms support a single 4GB mini-DIMM with 1-bit correction and 2-bit
detection.

Signed-off-by: Rob Herring
Signed-off-by: Mauro Carvalho Chehab

Rob Herring
2012-06-27 20:00:57 +0800

12 Jun, 2012

24 commits

e7930ba49 edac: create top-level debugfs directory

Create a single, top-level "edac" directory for debugfs. An "mc[0-N]"
directory is then created for each memory controller. Individual drivers
can create additional entries such as h/w error injection control.

Signed-off-by: Rob Herring
Signed-off-by: Mauro Carvalho Chehab

Rob Herring
2012-06-12 23:15:49 +0800

c10538396 sb_edac: properly handle error count

Instead of reporting the error count via driver-specific details,
use the new way provided by edac_mc_handle_error.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 23:15:49 +0800

00d183392 i7core_edac: properly handle error count

Instead of generating a burst of errors or reporting the error
count via driver-specific details, use the new way provided by
edac_mc_handle_error.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 23:15:48 +0800

9eb07a7fb edac: edac_mc_handle_error(): add an error_count parameter

In order to avoid losing error events, it is desirable to group
error events together and generate a single trace for several identical
errors.

The trace API already allows reporting multiple errors. Change the
handle_error function to also allow that.

The changes at the drivers were made by this small script:

$file .=$_ while (<>);
$file =~ s/(edac_mc_handle_error)\s*\(([^\,]+)\,([^\,]+)\,/$1($2,$3, 1,/g;
print $file;
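
The effect of the new parameter can be sketched in plain C. This is only an illustration with hypothetical types and names, not the EDAC core code: a handler that takes a count lets a driver report several identical errors with one call (and one trace) instead of one call per error.

```c
/* Hypothetical error classes and counters, for illustration only. */
enum err_type_sketch {
    ERR_CORRECTED_SKETCH,
    ERR_UNCORRECTED_SKETCH
};

struct mc_counters_sketch {
    unsigned long ce_count;   /* corrected errors seen so far */
    unsigned long ue_count;   /* uncorrected errors seen so far */
};

/* Account for error_count identical errors in a single call. */
void handle_error_sketch(enum err_type_sketch type,
                         struct mc_counters_sketch *mc,
                         unsigned int error_count)
{
    if (type == ERR_CORRECTED_SKETCH)
        mc->ce_count += error_count;
    else
        mc->ue_count += error_count;
}
```

Existing callers keep their behavior by passing 1, which is what the script above inserts as the third argument of each call.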

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 23:15:47 +0800

03f7eae80 edac: remove arch-specific parameter for the error handler

Remove the arch-dependent parameter, as it was not used, since
the MCE tracepoint was never implemented. It probably doesn't
make sense to have an MCE-specific tracepoint, as this would
cost more bytes at the tracepoint, and tracepoints are not free.

The changes at the EDAC drivers were done by this small perl script:

$file .=$_ while (<>);
$file =~ s/(edac_mc_handle_error)\s*\(([^\;]+)\,([^\,\)]+)\s*\)/$1($2)/g;
print $file;

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:52 +0800

075f30901 amd64_edac: Don't pass driver name as an error parameter

The EDAC driver name doesn't help to handle EDAC errors. So,
remove it from the EDAC error messages, preserving only the
error_message.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:51 +0800

08a4a1369 edac_mc: check for allocation failure in edac_mc_alloc()

Add a check for kzalloc() failure here.

Signed-off-by: Dan Carpenter
Signed-off-by: Mauro Carvalho Chehab

Dan Carpenter
2012-06-12 00:23:51 +0800

5156a5f4e edac: Increase version to 3.0.0

Enough changes were introduced to justify bumping the version to
3.0.0:

- the EDAC core was redesigned to represent all types of
memory controllers;

- the EDAC API was redesigned to properly represent the memory
controller hierarchy;

- a tracepoint-based API was added to report memory errors.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:50 +0800

6e84d359b edac_mc: Cleanup per-dimm_info debug messages

The edac_mc_alloc() routine allocates one dimm_info device for all
possible memories, including the non-filled ones. The debug messages
there are somewhat confusing. So, clean them up by moving the code
that prints the memory location to edac_mc, and using it in both
edac_mc_sysfs and edac_mc.

Also, only dump information when DIMMs/ranks are actually
filled.

After this patch, a dimm-based memory controller will print the debug
info as:

[ 1011.380027] EDAC DEBUG: edac_mc_dump_csrow: csrow->csrow_idx = 0
[ 1011.380029] EDAC DEBUG: edac_mc_dump_csrow: csrow = ffff8801169be000
[ 1011.380031] EDAC DEBUG: edac_mc_dump_csrow: csrow->first_page = 0x0
[ 1011.380032] EDAC DEBUG: edac_mc_dump_csrow: csrow->last_page = 0x0
[ 1011.380034] EDAC DEBUG: edac_mc_dump_csrow: csrow->page_mask = 0x0
[ 1011.380035] EDAC DEBUG: edac_mc_dump_csrow: csrow->nr_channels = 3
[ 1011.380037] EDAC DEBUG: edac_mc_dump_csrow: csrow->channels = ffff8801149c2840
[ 1011.380039] EDAC DEBUG: edac_mc_dump_csrow: csrow->mci = ffff880117426000
[ 1011.380041] EDAC DEBUG: edac_mc_dump_channel: channel->chan_idx = 0
[ 1011.380042] EDAC DEBUG: edac_mc_dump_channel: channel = ffff8801149c2860
[ 1011.380044] EDAC DEBUG: edac_mc_dump_channel: channel->csrow = ffff8801169be000
[ 1011.380046] EDAC DEBUG: edac_mc_dump_channel: channel->dimm = ffff88010fe90400
...
[ 1011.380095] EDAC DEBUG: edac_mc_dump_dimm: dimm0: channel 0 slot 0 mapped as virtual row 0, chan 0
[ 1011.380097] EDAC DEBUG: edac_mc_dump_dimm: dimm = ffff88010fe90400
[ 1011.380099] EDAC DEBUG: edac_mc_dump_dimm: dimm->label = 'CPU#0Channel#0_DIMM#0'
[ 1011.380101] EDAC DEBUG: edac_mc_dump_dimm: dimm->nr_pages = 0x40000
[ 1011.380103] EDAC DEBUG: edac_mc_dump_dimm: dimm->grain = 8
[ 1011.380104] EDAC DEBUG: edac_mc_dump_dimm: dimm->nr_pages = 0x40000
...

(a rank-based memory controller would print, instead of "dimm?", "rank?"
on the above debug info)

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:49 +0800

956b9ba15 edac: Convert debugfX to edac_dbg(X,

Use a more common debugging style.

Remove __FILE__ uses, add missing newlines,
coalesce formats and align arguments.

Signed-off-by: Joe Perches
Signed-off-by: Mauro Carvalho Chehab

Joe Perches
2012-06-12 00:23:49 +0800

7e881856e edac: Use more normal debugging macro style

Convert macros to a simpler style and enforce appropriate
format checking when not CONFIG_EDAC_DEBUG.

Use fmt and __VA_ARGS__, neaten macros.

Move some string arrays to the debugfx uses and remove the
now unnecessary CONFIG_EDAC_DEBUG variable block definitions.
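
The macro style described above can be sketched in user-space C. Everything here is an assumption made for illustration (the names, the log buffer, and the use of snprintf() instead of the kernel's printing helpers), and the "##__VA_ARGS__" form is a GNU extension accepted by GCC and Clang. The point is that the disabled variant still lets the compiler check the format string against its arguments.

```c
#include <stdio.h>

/* Hypothetical log sink so the result can be inspected. */
char edac_log_sketch[128];
int edac_debug_level_sketch = 1;

/* Enabled variant: prefix the message with the calling function's name. */
#define edac_dbg_sketch(level, fmt, ...) \
do { \
    if ((level) <= edac_debug_level_sketch) \
        snprintf(edac_log_sketch, sizeof(edac_log_sketch), \
                 "EDAC DEBUG: %s: " fmt, __func__, ##__VA_ARGS__); \
} while (0)

/* Disabled variant: the call under "if (0)" never executes, but the
 * compiler still sees it and can type-check fmt against the arguments. */
#define edac_dbg_off_sketch(level, fmt, ...) \
do { \
    if (0) \
        printf(fmt, ##__VA_ARGS__); \
} while (0)

void probe_sketch(int mc_idx)
{
    edac_dbg_sketch(1, "mc%d\n", mc_idx);
    edac_dbg_off_sketch(1, "mc%d\n", mc_idx);
}
```

Because the function name comes from the macro itself, callers no longer need to pass __func__ or __FILE__ by hand.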

Signed-off-by: Joe Perches
Signed-off-by: Mauro Carvalho Chehab

Joe Perches
2012-06-12 00:23:48 +0800

dd23cd6eb edac: Don't add __func__ or __FILE__ for debugf[0-9] msgs

The debug macro already adds that. Most of the work here was
made by this small script:

$f .=$_ while (<>);

$f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*": /\1"/g;
$f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*/\1/g;
$f =~ s/(debugf[0-9]\s*\(\s*)__FILE__\s*"MC: /\1"/g;

$f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
$f =~ s/(debugf[0-9]\s*\(\")\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;
$f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+)__func__\s*\,\s*/\1\2/g;
$f =~ s/(debugf[0-9]\s*\(\"MC\:\s*)\%s[\:\,\(\)]*\s*([^\"]*\s*[^\)]+),\s*__func__\s*\)/\1\2)/g;

$f =~ s/\"MC\: \\n\"/"MC:\\n"/g;

print $f;

After running the script, manual cleanups were done to fix the remaining
places.

While here, removed the __LINE__ in most places, as it doesn't actually give
useful info there.

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:47 +0800

356f0a308 i7core_edac: change the mem allocation scheme to make Documentation/kobject.txt happy

Kernel kobjects have rigid rules: each container object should be
dynamically allocated, and can't be allocated into a single kmalloc.

EDAC never obeyed this rule: it has a single malloc function that
allocates all needed data into a single kzalloc.

As this is not accepted anymore, change the allocation scheme of the
EDAC *_info structs to enforce this Kernel standard.

Cc: Aristeu Rozanski
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:46 +0800

de3910eb7 edac: change the mem allocation scheme to make Documentation/kobject.txt happy

Kernel kobjects have rigid rules: each container object should be
dynamically allocated, and can't be allocated into a single kmalloc.

EDAC never obeyed this rule: it has a single malloc function that
allocates all needed data into a single kzalloc.

As this is not accepted anymore, change the allocation scheme of the
EDAC *_info structs to enforce this Kernel standard.
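
The difference between the two allocation schemes can be sketched in user-space C, with calloc() standing in for kzalloc(). The struct and function names are hypothetical and much simpler than the real EDAC structures. The idea is that each container object gets its own allocation, so it can later be released on its own.

```c
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the EDAC *_info structs. */
struct dimm_sketch {
    unsigned int idx;
};

struct mci_sketch {
    unsigned int tot_dimms;
    struct dimm_sketch **dimms;   /* array of pointers, not of structs */
};

void mci_free_sketch(struct mci_sketch *mci)
{
    unsigned int i;

    if (!mci)
        return;
    if (mci->dimms)
        for (i = 0; i < mci->tot_dimms; i++)
            free(mci->dimms[i]);  /* free(NULL) is a no-op */
    free(mci->dimms);
    free(mci);
}

struct mci_sketch *mci_alloc_sketch(unsigned int tot_dimms)
{
    struct mci_sketch *mci;
    unsigned int i;

    if (tot_dimms == 0)
        return NULL;

    mci = calloc(1, sizeof(*mci));
    if (!mci)
        return NULL;

    /* Zeroed pointer array: unfilled slots stay NULL on failure. */
    mci->dimms = calloc(tot_dimms, sizeof(*mci->dimms));
    if (!mci->dimms)
        goto error;
    mci->tot_dimms = tot_dimms;

    /* One allocation per object instead of one big block for all. */
    for (i = 0; i < tot_dimms; i++) {
        mci->dimms[i] = calloc(1, sizeof(*mci->dimms[i]));
        if (!mci->dimms[i])
            goto error;
        mci->dimms[i]->idx = i;
    }
    return mci;

error:
    mci_free_sketch(mci);
    return NULL;
}
```

The cost is more allocations and an explicit error path, which is also why each allocation result has to be checked.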

Acked-by: Chris Metcalf
Cc: Aristeu Rozanski
Cc: Doug Thompson
Cc: Greg K H
Cc: Borislav Petkov
Cc: Mark Gross
Cc: Tim Small
Cc: Ranganathan Desikan
Cc: "Arvind R."
Cc: Olof Johansson
Cc: Egor Martovetsky
Cc: Michal Marek
Cc: Jiri Kosina
Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Hitoshi Mitake
Cc: Andrew Morton
Cc: Shaohui Xie
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:45 +0800

e39f4ea9b edac: Only expose csrows/channels on legacy API if they're populated

This patch actually fixes a bug with the legacy API, where, at the
same csrow, some channels may have different DIMMs. This can happen
on FB-DIMM/RAMBUS and modern Intel controllers.

This is the case, for example, of Nehalem machines:

$ ./edac-ctl --layout
       +-----------------------------------+
       |                mc0                |
       | channel0  | channel1  | channel2  |
-------+-----------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |
slot1: |  1024 MB  |     0 MB  |     0 MB  |
slot0: |  1024 MB  |  1024 MB  |  1024 MB  |
-------+-----------------------------------+

Before this patch, non-filled memories were shown. Now, only what's
filled is there:

grep . /sys/devices/system/edac/mc/mc0/csrow*/ch?*
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch0_dimm_label:CPU#0Channel#0_DIMM#0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_dimm_label:CPU#0Channel#0_DIMM#1
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_dimm_label:CPU#0Channel#1_DIMM#0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_dimm_label:CPU#0Channel#2_DIMM#0

Thanks-to: Aristeu Rozanski Filho
Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:44 +0800

fd63312df edac: Move grain/dtype/edac_type calculus to be out of channel loop

The 3e7bddc changeset (edac: move dimm properties to struct memset_info)
moved the calculation inside a loop. However, as those values are common to
all channels on several drivers, it is better to do the calculation
outside the loop, to optimize the code.

Reported-by: Aristeu Rozanski Filho
Reviewed-by: Aristeu Rozanski
Cc: Mark Gross
Cc: Doug Thompson
Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Michal Marek
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:44 +0800

452a6bf95 edac: Add debufs nodes to allow doing fake error inject

Sometimes, it is useful to have a mechanism that generates fake
errors, in order to test the EDAC core code, and the userspace
tools.

Provide such mechanism by adding a few debugfs nodes.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:43 +0800

8ad6c78a6 edac: add a sysfs node to report the maximum location for the system

The userspace tools need to know what's the maximum location on each
system, as it helps to create nice maps showing how the memory was
filled at the system.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:43 +0800

199747106 edac: add a new per-dimm API and make the old per-virtual-rank API obsolete

The old EDAC API is broken. It only works fine for systems manufactured
before 2005 and for AMD 64. The reason is that it forces all memory
controller drivers to discover rank info.

Also, it doesn't allow grouping the several ranks into a DIMM.

So, what almost all modern drivers do is create fake virtual-rank
information, and use it to trick the EDAC core into accepting the driver.

While this works if the user has enough time to discover what DIMM slot
corresponds to each "virtual-rank" information, it prevents EDAC usage
for users with less available time. It also makes life hard for vendors
that may want to provide a table with their motherboards to the userspace
tool (edac-utils) as each driver has its own logic for the virtual
mapping.

So, the old API should be removed, in favor of a more flexible API that
allows newer drivers to not lie to the EDAC core.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Borislav Petkov
Cc: Randy Dunlap
Cc: Josh Boyer
Cc: Hui Wang
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:42 +0800

d90c00896 edac: Get rid of the old kobj's from the edac mc code

Now that all users of the old kobj raw access are gone,
we can get rid of the legacy kobj-based structures and
data.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Michal Marek
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:41 +0800

5c4cdb5ae i7core_edac: convert it to use struct device

Instead of relying on a complex logic inside the edac core to create
a "device tree-like" sysfs struct, just use device_add.

Reviewed-by: Aristeu Rozanski
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:41 +0800

c56087595 amd64_edac: convert sysfs logic to use struct device

Now that the EDAC core supports struct device, there's no sense
in having any logic in the EDAC core to simulate it. So, instead
of adding such logic there, change the logic in amd64_edac to
use it.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:40 +0800

ba004239e mpc85xx_edac: convert sysfs logic to use struct device

Now that the EDAC core supports struct device, there's no sense in
having any logic in the EDAC core to simulate it. So, instead of adding
such logic there, change the logic in mpc85xx_edac to use it.

Compile-tested only.

Reviewed-by: Aristeu Rozanski
Cc: Andrew Morton
Cc: Shaohui Xie
Cc: Jiri Kosina
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:39 +0800

7a623c039 edac: rewrite the sysfs code to use struct device

The EDAC subsystem uses the old struct sysdev approach,
creating all nodes using the raw sysfs API. This is bad,
as the API is deprecated.

As we'll be changing the EDAC API, let's first port the existing
code to struct device.

There's one drawback to this patch: driver-specific sysfs
nodes, used by mpc85xx_edac, amd64_edac and i7core_edac
won't be created anymore. While it would be possible to
also port the device-specific code, that would mix kobj with
struct device, which is not recommended. Also, it is easier and nicer
to move the code to the drivers, instead, as the core can get rid
of some complex logic that just emulates what device_add()
and device_create_file() already do.

The next patches will convert the driver-specific code to use
the device-specific calls. Then, the remaining bits of the old
sysfs API will be removed.

NOTE: a per-MC bus is required, otherwise devices with more than
one memory controller will hit a bug like the one below:

[ 819.094946] EDAC DEBUG: find_mci_by_dev: find_mci_by_dev()
[ 819.094948] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device() idx=1
[ 819.094952] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device(): creating device mc1
[ 819.094967] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device creating dimm0, located at channel 0 slot 0
[ 819.094984] ------------[ cut here ]------------
[ 819.100142] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xc1/0xf0()
[ 819.107282] Hardware name: S2600CP
[ 819.111078] sysfs: cannot create duplicate filename '/bus/edac/devices/dimm0'
[ 819.119062] Modules linked in: sb_edac(+) edac_core ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm microcode pcspkr iTCO_wdt iTCO_vendor_support igb i2c_i801 i2c_core sg ioatdma dca sr_mod cdrom sd_mod crc_t10dif ahci libahci isci libsas libata scsi_transport_sas scsi_mod wmi dm_mod [last unloaded: scsi_wait_scan]
[ 819.175748] Pid: 10902, comm: modprobe Not tainted 3.3.0-0.11.el7.v12.2.x86_64 #1
[ 819.184113] Call Trace:
[ 819.186868] [] warn_slowpath_common+0x7f/0xc0
[ 819.193573] [] warn_slowpath_fmt+0x46/0x50
[ 819.200000] [] sysfs_add_one+0xc1/0xf0
[ 819.206025] [] sysfs_do_create_link+0x135/0x220
[ 819.212944] [] ? sysfs_create_group+0x13/0x20
[ 819.219656] [] sysfs_create_link+0x13/0x20
[ 819.226109] [] bus_add_device+0xe6/0x1b0
[ 819.232350] [] device_add+0x2db/0x460
[ 819.238300] [] edac_create_dimm_object+0x84/0xf0 [edac_core]
[ 819.246460] [] edac_create_sysfs_mci_device+0xe8/0x290 [edac_core]
[ 819.255215] [] edac_mc_add_mc+0x5a/0x2c0 [edac_core]
[ 819.262611] [] sbridge_register_mci+0x1bc/0x279 [sb_edac]
[ 819.270493] [] sbridge_probe+0xef/0x175 [sb_edac]
[ 819.277630] [] ? pm_runtime_enable+0x58/0x90
[ 819.284268] [] local_pci_probe+0x5c/0xd0
[ 819.290508] [] __pci_device_probe+0xf1/0x100
[ 819.297117] [] pci_device_probe+0x3a/0x60
[ 819.303457] [] really_probe+0x73/0x270
[ 819.309496] [] driver_probe_device+0x4e/0xb0
[ 819.316104] [] __driver_attach+0xab/0xb0
[ 819.322337] [] ? driver_probe_device+0xb0/0xb0
[ 819.329151] [] bus_for_each_dev+0x56/0x90
[ 819.335489] [] driver_attach+0x1e/0x20
[ 819.341534] [] bus_add_driver+0x1b0/0x2a0
[ 819.347884] [] ? 0xffffffffa0346fff
[ 819.353641] [] driver_register+0x76/0x140
[ 819.359980] [] ? printk+0x51/0x53
[ 819.365524] [] ? 0xffffffffa0346fff
[ 819.371291] [] __pci_register_driver+0x56/0xd0
[ 819.378096] [] sbridge_init+0x54/0x1000 [sb_edac]
[ 819.385231] [] do_one_initcall+0x3f/0x170
[ 819.391577] [] sys_init_module+0xbe/0x230
[ 819.397926] [] system_call_fastpath+0x16/0x1b
[ 819.404633] ---[ end trace 1654fdd39556689f ]---

This happens because the bus is not being properly initialized.
Instead of putting the memory sub-devices inside the memory controller,
it is putting everything under the same directory:

$ tree /sys/bus/edac/
/sys/bus/edac/
├── devices
│ ├── all_channel_counts -> ../../../devices/system/edac/mc/mc0/all_channel_counts
│ ├── csrow0 -> ../../../devices/system/edac/mc/mc0/csrow0
│ ├── csrow1 -> ../../../devices/system/edac/mc/mc0/csrow1
│ ├── csrow2 -> ../../../devices/system/edac/mc/mc0/csrow2
│ ├── dimm0 -> ../../../devices/system/edac/mc/mc0/dimm0
│ ├── dimm1 -> ../../../devices/system/edac/mc/mc0/dimm1
│ ├── dimm3 -> ../../../devices/system/edac/mc/mc0/dimm3
│ ├── dimm6 -> ../../../devices/system/edac/mc/mc0/dimm6
│ ├── inject_addrmatch -> ../../../devices/system/edac/mc/mc0/inject_addrmatch
│ ├── mc -> ../../../devices/system/edac/mc
│ └── mc0 -> ../../../devices/system/edac/mc/mc0
├── drivers
├── drivers_autoprobe
├── drivers_probe
└── uevent

On a multi-memory controller system, the names "csrow%d" and "dimm%d"
should be under "mc%d", and not at the main hierarchy level.

So, we need to create a per-MC bus, in order to have its own namespace.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Greg K H
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-12 00:23:30 +0800

11 Jun, 2012

6 commits

8447c4d15 edac: Do alignment logic properly in edac_align_ptr()

The logic was checking the sizeof the structure being allocated to
determine whether an alignment fixup was required. This isn't right;
what we actually care about is the alignment of the actual pointer that's
about to be returned. This became an issue recently because struct
edac_mc_layer has a size that is not zero modulo eight, so we were
taking the correctly-aligned pointer and forcing it to be misaligned.
On Tile this caused an alignment exception.
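
The corrected check can be sketched as follows. This is an illustration under the assumption that the fix rounds the address itself up to the required alignment. The function name and the use of uintptr_t are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be non-zero).
 * The decision is based on the address, not on the size of the object
 * being placed there: an already aligned address is returned unchanged. */
uintptr_t align_up_sketch(uintptr_t addr, size_t align)
{
    uintptr_t r = addr % align;

    if (r == 0)
        return addr;
    return addr + (align - r);
}
```

With this form, a pointer that is already a multiple of eight stays where it is even when the structure size is not a multiple of eight, which is the case that went wrong before.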

Signed-off-by: Chris Metcalf
Signed-off-by: Mauro Carvalho Chehab

Chris Metcalf
2012-06-11 23:43:16 +0800

fd687502d edac: Rename the parent dev to pdev

As EDAC doesn't use struct device itself, it created a parent dev
pointer called "dev". Now that we'll be converting it to use
struct device, instead of struct sysdev, this needs to be fixed.

No functional changes.

Reviewed-by: Aristeu Rozanski
Acked-by: Chris Metcalf
Cc: Doug Thompson
Cc: Borislav Petkov
Cc: Mark Gross
Cc: Jason Uhlenkott
Cc: Tim Small
Cc: Ranganathan Desikan
Cc: "Arvind R."
Cc: Olof Johansson
Cc: Egor Martovetsky
Cc: Michal Marek
Cc: Jiri Kosina
Cc: Joe Perches
Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Hitoshi Mitake
Cc: Andrew Morton
Cc: "Niklas Söderlund"
Cc: Shaohui Xie
Cc: Josh Boyer
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-11 22:56:06 +0800

53f2d0289 RAS: Add a tracepoint for reporting memory controller events

Add a new tracepoint-based hardware events report method for
reporting Memory Controller events.

Part of the description below is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction"). In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely on reading chipset registers directly. A user-level
program decodes this into a somewhat human-readable format.

3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
decodes errors reported via machine check bank registers in AMD
processors to the console log using printk();

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users.

As part of a RAS subsystem, let's encapsulate the memory error hardware
events into a trace facility.

The tracepoint printk will be displayed like:

mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver detail])

Where:
[quant] is the quantity of errors
[error msg] is the driver-specific error message
(e. g. "memory read", "bus error", ...);
[location] is the location in terms of memory controller and
branch/channel/slot, channel/slot or csrow/channel;
[label] is the memory stick label;
[edac_mc detail] describes the address location of the error
and the syndrome;
[driver detail] is driver-specific error message details,
when needed/provided (e. g. "area:DMA", ...)

For example:

mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

Of course, any userspace tools meant to handle errors should not parse
the above data. They should, instead, use the binary fields provided by
the tracepoint, mapping them directly into their Management Information
Base.
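
To make the relation between the binary fields and the printed line concrete, here is a user-space sketch. The struct layout and field names are assumptions made for illustration and do not describe the real tracepoint ABI. It only shows how a line like the example above could be derived from separate fields, which is why tools are better off reading the fields than parsing the text.

```c
#include <stdio.h>

/* Hypothetical record; field names are illustrative only. */
struct mc_event_sketch {
    unsigned int error_count;             /* [quant] */
    const char *severity;                 /* Corrected, Uncorrected or Fatal */
    const char *msg;                      /* [error msg] */
    const char *label;                    /* [label], copied verbatim */
    unsigned int mc_index;                /* memory controller number */
    int top_layer, mid_layer, low_layer;  /* [location] */
    unsigned long page, offset;           /* [edac_mc detail] */
    unsigned int grain;
    unsigned long syndrome;
    const char *driver_detail;            /* [driver detail] */
};

/* Render the record in the textual layout shown above. */
int format_mc_event_sketch(char *buf, size_t len,
                           const struct mc_event_sketch *ev)
{
    return snprintf(buf, len,
                    "mc_event: %u %s error:%s on %s "
                    "(mc:%u location:%d:%d:%d page:0x%lx offset:0x%lx "
                    "grain:%u syndrome:0x%lx %s)",
                    ev->error_count, ev->severity, ev->msg, ev->label,
                    ev->mc_index, ev->top_layer, ev->mid_layer,
                    ev->low_layer, ev->page, ev->offset, ev->grain,
                    ev->syndrome, ev->driver_detail);
}
```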

NOTE: The original patch was providing an additional mechanism for
MCA-based trace events that also contained MCA error register data.
However, as no agreement was reached so far for the MCA-based trace
events, for now, let's add events only for memory errors.
A later patch is planned to change the tracepoint, for those types
of event.

Cc: Aristeu Rozanski
Cc: Doug Thompson
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-06-11 22:55:52 +0800

b9bc5ddb1 mpc85xx_edac: fix error: too few arguments to function 'edac_mc_alloc'

commit ca0907b "edac: Remove the legacy EDAC ABI" broke mpc85xx_edac
in the following manner:

mpc85xx_edac.c:983:35: error: too few arguments to function 'edac_mc_alloc'

this patch puts back the missing 'layers' argument.

[mchehab@redhat.com: As Ben sent a similar fix, I added his SOB on this patch]
Signed-off-by: Kim Phillips
Signed-off-by: Ben Collins
Signed-off-by: Mauro Carvalho Chehab

Kim Phillips
2012-06-11 22:49:51 +0800

2cbb587d3 edac: fix the error about memory type detection on SandyBridge

On SandyBridge, DDRIOA(Dev: 17 Func: 0 Offset: 328) is used
to detect whether DIMM is RDIMM/LRDIMM, not TA(Dev: 15 Func: 0).

Signed-off-by: Chen Gong
Signed-off-by: Mauro Carvalho Chehab

Chen Gong
2012-06-11 22:49:51 +0800

e35fca479 edac: avoid mce decoding crash after edac driver unloaded

Some EDAC drivers register themselves as MCE decoders via a
notifier chain. The current notifier chain implementation does not
support registering the same notifier twice; doing so corrupts the
list when elements are added or removed. For example, on a
SandyBridge platform, removing the sb_edac module and then triggering
an error causes an oops: no MCE decoder is registered any more, but
the notifier chain still points to an invalid callback function.
Here is an example:

Call Trace:
[] atomic_notifier_call_chain+0x1a/0x20
[] mce_log+0x46/0x180
[] apei_mce_report_mem_error+0x4a/0x60
[] ghes_do_proc+0x192/0x210
[] ghes_proc+0x46/0x70
[] ghes_notify_sci+0x48/0x80
[] notifier_call_chain+0x55/0x80
[] __blocking_notifier_call_chain+0x5a/0x80
[] ? acpi_os_wait_events_complete+0x23/0x23
[] blocking_notifier_call_chain+0x16/0x20
[] acpi_hed_notify+0x19/0x1b
[] acpi_device_notify+0x19/0x1b
[] acpi_ev_notify_dispatch+0x67/0x7f
[] acpi_os_execute_deferred+0x29/0x36
[] process_one_work+0x132/0x450
[] worker_thread+0x17b/0x3c0
[] ? manage_workers+0x120/0x120
[] kthread+0x9e/0xb0
[] kernel_thread_helper+0x4/0x10
[] ? kthread_freezable_should_stop+0x70/0x70
[] ? gs_change+0x13/0x13
Code: f3 49 89 d4 45 85 ed 4d 89 c6 48 8b 0f 74 48 48 85 c9 75 17 eb 41
0f 1f 80 00 00 00 00 41 83 ed 01 4c 89 f9 74 22 4d 85 ff 74 1d 8b
79 08 4c 89 e2 48 89 de 48 89 cf ff 11 4d 85 f6 74 04 41
RIP [] notifier_call_chain+0x46/0x80
RSP
CR2: ffffffffa01af838
---[ end trace 0100930068e73e6f ]---
BUG: unable to handle kernel paging request at fffffffffffffff8
IP: [] kthread_data+0x10/0x20
PGD 1a0d067 PUD 1a0e067 PMD 0
Oops: 0000 [#2] SMP

Only i7core_edac and sb_edac have such issues because they have more
than one memory controller which means they have to register mce
decoder many times.
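
One way to avoid the double registration is to tie the decoder registration to the first and last memory controller rather than to every controller. The sketch below shows that pattern with a use counter. All names are hypothetical stand-ins, and it does not claim to be how the patch itself solves the problem.

```c
/* Hypothetical stand-ins for adding a callback to the MCE decoder chain. */
int decoder_registered_sketch;   /* 1 while the callback is on the chain */
int register_calls_sketch;       /* how many times registration ran */
static int mc_users_sketch;      /* memory controllers currently probed */

static void decoder_register_sketch(void)
{
    decoder_registered_sketch = 1;
    register_calls_sketch++;
}

static void decoder_unregister_sketch(void)
{
    decoder_registered_sketch = 0;
}

/* Called once per memory controller found: only the first one registers. */
void mc_probe_sketch(void)
{
    if (mc_users_sketch++ == 0)
        decoder_register_sketch();
}

/* Called once per memory controller removed: only the last one unregisters. */
void mc_remove_sketch(void)
{
    if (mc_users_sketch > 0 && --mc_users_sketch == 0)
        decoder_unregister_sketch();
}
```

With two controllers, the decoder is registered exactly once and stays registered until the second controller is removed, so the chain never holds a stale or duplicated entry.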

Cc: # 3.2 and upper
Signed-off-by: Chen Gong
Signed-off-by: Mauro Carvalho Chehab

Chen Gong
2012-06-11 22:49:51 +0800

31 May, 2012

1 commit

bbd771474 Merge branch 'x86/trampoline' into x86/urgent

x86/trampoline contains an urgent commit which is necessarily on a
newer baseline.

Signed-off-by: H. Peter Anvin

H. Peter Anvin
2012-05-31 03:11:32 +0800

30 May, 2012

2 commits

403e1c5b7 Merge branch 'x86/mce' into x86/urgent

Merge in these fixlets.

Signed-off-by: Ingo Molnar

Ingo Molnar
2012-05-30 20:12:06 +0800

87a5af24e Merge git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac

Pull EDAC internal API changes from Mauro Carvalho Chehab:
"This changeset is the first part of a series of patches that fixes the
EDAC subsystem. This set changes the kernel EDAC API in order
to properly represent the Intel i3/i5/i7, Xeon 3xxx/5xxx/7xxx, and
Intel E5-xxxx memory controllers.

The EDAC core used to assume that:

- the DRAM chip select pin is directly accessed by the memory
controller

- when multiple channels are used, they're all filled with the
same type of memory.

None of the above premises has been true on Intel memory controllers
since 2002, when RAMBUS and FB-DIMMs were introduced and the Advanced
Memory Buffer, or some similar technology, started hiding direct
access to the DRAM pins.

So, the existing drivers for those chipsets had to lie to the EDAC
core, in general telling that just one channel is filled. That
produces some hard to understand error messages like:

EDAC MC0: CE row 3, channel 0, label "DIMM1": 1 Unknown error(s): memory read error on FATAL area : cpu=0 Err=0008:00c2 (ch=2), addr = 0xad1f73480 => socket=0, Channel=0(mask=2), rank=1

The location information there (row3 channel 0) is completely bogus:
it has no physical meaning; these are just arbitrary values that the
driver uses to talk with the EDAC core. The error actually happened
at CPU socket 0, channel 0, slot 1, but this is not reported anywhere,
as the EDAC core doesn't know anything about the memory layout. So,
only advanced users who know how the EDAC driver works and who test
their systems to see how DIMMs are mapped can actually benefit from
such error logs.

This patch series fixes the error report logic, in order to allow the
EDAC drivers to expose the memory architecture they use to the EDAC
core. So, as the EDAC core now understands how the memory is organized,
it can provide a useful report:

EDAC MC0: CE memory read error on DIMM1 (channel:0 slot:1 page:0x364b1b offset:0x600 grain:32 syndrome:0x0 - count:1 area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:4)

The location of the DIMM where the error happened is reported by "MC0"
(cpu socket #0), at "channel:0 slot:1" location, and matches the
physical location of the DIMM.

There are two remaining issues not covered by this patch series:

- The EDAC sysfs API will still report bogus values. So,
userspace tools like edac-utils will still use the bogus data;

- Add a new tracepoint-based way to get the binary information
about the errors.

Those are on a second series of patches (also at -next), but will
probably miss the train for 3.5, due to the slow review process."

Fix up trivial conflict (due to spelling correction of removed code) in
drivers/edac/edac_device.c

* git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (42 commits)
i7core: fix ranks information at the per-channel struct
i5000: Fix the fatal error handling
i5100_edac: Fix a warning when compiled with 32 bits
i82975x_edac: Test nr_pages earlier to save a few CPU cycles
e752x_edac: provide more info about how DIMMS/ranks are mapped
i5000_edac: Fix the logic that retrieves memory information
i5400_edac: improve debug messages to better represent the filled memory
edac: Cleanup the logs for i7core and sb edac drivers
edac: Initialize the dimm label with the known information
edac: Remove the legacy EDAC ABI
x38_edac: convert driver to use the new edac ABI
tile_edac: convert driver to use the new edac ABI
sb_edac: convert driver to use the new edac ABI
r82600_edac: convert driver to use the new edac ABI
ppc4xx_edac: convert driver to use the new edac ABI
pasemi_edac: convert driver to use the new edac ABI
mv64x60_edac: convert driver to use the new edac ABI
mpc85xx_edac: convert driver to use the new edac ABI
i82975x_edac: convert driver to use the new edac ABI
i82875p_edac: convert driver to use the new edac ABI
...

Linus Torvalds
2012-05-30 09:32:37 +0800

29 May, 2012

2 commits

0bf09e829 i7core: fix ranks information at the per-channel struct

There is a flag in the per-channel struct that indicates if there is
any 4R DIMM on it. The way the presence of this flag was reported
is not OK, as it might give the false idea that the channel was filled
with 2R memories:

[ 580.588701] EDAC DEBUG: get_dimm_config: Ch1 phy rd1, wr1 (0x063f7431): 2 ranks, UDIMMs
[ 580.588704] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

(in this case, just one 1R memory is filled on channel 1)

So, use a better way to represent the per-channel ranks information.
After the patch, it will show:

[ 2002.233978] EDAC DEBUG: get_dimm_config: Ch0 phy rd0, wr0 (0x063f7431): UDIMMs
[ 2002.233982] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
[ 2002.233988] EDAC DEBUG: get_dimm_config: dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400

(in this case, there aren't any 4R memories)

Reported-by: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-05-29 06:13:55 +0800

486dfb163 i5000: Fix the fatal error handling

The fatal error channel bits point to a single channel, and not
to a range of channels. Fix the code to properly report it,
instead of printing messages like:
kernel: EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)

Signed-off-by: Mauro Carvalho Chehab

Mauro Carvalho Chehab
2012-05-29 06:13:54 +0800