12 Jun, 2012
5 commits
-
Now that all users of the old kobj raw access are gone,
we can get rid of the legacy kobj-based structures and
data.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Michal Marek
Signed-off-by: Mauro Carvalho Chehab
-
Instead of relying on complex logic inside the EDAC core to create
a "device tree-like" sysfs struct, just use device_add.

Reviewed-by: Aristeu Rozanski
Signed-off-by: Mauro Carvalho Chehab
-
Now that the EDAC core supports struct device, there's no sense
in having any logic in the EDAC core to simulate it. So, instead
of adding such logic there, change the logic in amd64_edac to
use it.

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
Now that the EDAC core supports struct device, there's no sense in
having any logic in the EDAC core to simulate it. So, instead of adding
such logic there, change the logic in mpc85xx_edac to use it.

Compile-tested only.
Reviewed-by: Aristeu Rozanski
Cc: Andrew Morton
Cc: Shaohui Xie
Cc: Jiri Kosina
Signed-off-by: Mauro Carvalho Chehab
-
The EDAC subsystem uses the old struct sysdev approach,
creating all nodes using the raw sysfs API. This is bad,
as the API is deprecated.

As we'll be changing the EDAC API, let's first port the existing
code to struct device.

There's one drawback to this patch: driver-specific sysfs
nodes, used by mpc85xx_edac, amd64_edac and i7core_edac,
won't be created anymore. While it would be possible to
also port the device-specific code, that would mix kobj with
struct device, which is not recommended. Also, it is easier and nicer
to move the code to the drivers instead, as the core can get rid
of some complex logic that just emulates what device_add()
and device_create_file() already do.

The next patches will convert the driver-specific code to use
the device-specific calls. Then, the remaining bits of the old
sysfs API will be removed.

NOTE: a per-MC bus is required, otherwise devices with more than
one memory controller will hit a bug like the one below:

[ 819.094946] EDAC DEBUG: find_mci_by_dev: find_mci_by_dev()
[ 819.094948] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device() idx=1
[ 819.094952] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device(): creating device mc1
[ 819.094967] EDAC DEBUG: edac_create_sysfs_mci_device: edac_create_sysfs_mci_device creating dimm0, located at channel 0 slot 0
[ 819.094984] ------------[ cut here ]------------
[ 819.100142] WARNING: at fs/sysfs/dir.c:481 sysfs_add_one+0xc1/0xf0()
[ 819.107282] Hardware name: S2600CP
[ 819.111078] sysfs: cannot create duplicate filename '/bus/edac/devices/dimm0'
[ 819.119062] Modules linked in: sb_edac(+) edac_core ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun kvm microcode pcspkr iTCO_wdt iTCO_vendor_support igb i2c_i801 i2c_core sg ioatdma dca sr_mod cdrom sd_mod crc_t10dif ahci libahci isci libsas libata scsi_transport_sas scsi_mod wmi dm_mod [last unloaded: scsi_wait_scan]
[ 819.175748] Pid: 10902, comm: modprobe Not tainted 3.3.0-0.11.el7.v12.2.x86_64 #1
[ 819.184113] Call Trace:
[ 819.186868] [] warn_slowpath_common+0x7f/0xc0
[ 819.193573] [] warn_slowpath_fmt+0x46/0x50
[ 819.200000] [] sysfs_add_one+0xc1/0xf0
[ 819.206025] [] sysfs_do_create_link+0x135/0x220
[ 819.212944] [] ? sysfs_create_group+0x13/0x20
[ 819.219656] [] sysfs_create_link+0x13/0x20
[ 819.226109] [] bus_add_device+0xe6/0x1b0
[ 819.232350] [] device_add+0x2db/0x460
[ 819.238300] [] edac_create_dimm_object+0x84/0xf0 [edac_core]
[ 819.246460] [] edac_create_sysfs_mci_device+0xe8/0x290 [edac_core]
[ 819.255215] [] edac_mc_add_mc+0x5a/0x2c0 [edac_core]
[ 819.262611] [] sbridge_register_mci+0x1bc/0x279 [sb_edac]
[ 819.270493] [] sbridge_probe+0xef/0x175 [sb_edac]
[ 819.277630] [] ? pm_runtime_enable+0x58/0x90
[ 819.284268] [] local_pci_probe+0x5c/0xd0
[ 819.290508] [] __pci_device_probe+0xf1/0x100
[ 819.297117] [] pci_device_probe+0x3a/0x60
[ 819.303457] [] really_probe+0x73/0x270
[ 819.309496] [] driver_probe_device+0x4e/0xb0
[ 819.316104] [] __driver_attach+0xab/0xb0
[ 819.322337] [] ? driver_probe_device+0xb0/0xb0
[ 819.329151] [] bus_for_each_dev+0x56/0x90
[ 819.335489] [] driver_attach+0x1e/0x20
[ 819.341534] [] bus_add_driver+0x1b0/0x2a0
[ 819.347884] [] ? 0xffffffffa0346fff
[ 819.353641] [] driver_register+0x76/0x140
[ 819.359980] [] ? printk+0x51/0x53
[ 819.365524] [] ? 0xffffffffa0346fff
[ 819.371291] [] __pci_register_driver+0x56/0xd0
[ 819.378096] [] sbridge_init+0x54/0x1000 [sb_edac]
[ 819.385231] [] do_one_initcall+0x3f/0x170
[ 819.391577] [] sys_init_module+0xbe/0x230
[ 819.397926] [] system_call_fastpath+0x16/0x1b
[ 819.404633] ---[ end trace 1654fdd39556689f ]---

This happens because the bus is not being properly initialized.
Instead of putting the memory sub-devices inside the memory controller,
it is putting everything under the same directory:

$ tree /sys/bus/edac/
/sys/bus/edac/
├── devices
│ ├── all_channel_counts -> ../../../devices/system/edac/mc/mc0/all_channel_counts
│ ├── csrow0 -> ../../../devices/system/edac/mc/mc0/csrow0
│ ├── csrow1 -> ../../../devices/system/edac/mc/mc0/csrow1
│ ├── csrow2 -> ../../../devices/system/edac/mc/mc0/csrow2
│ ├── dimm0 -> ../../../devices/system/edac/mc/mc0/dimm0
│ ├── dimm1 -> ../../../devices/system/edac/mc/mc0/dimm1
│ ├── dimm3 -> ../../../devices/system/edac/mc/mc0/dimm3
│ ├── dimm6 -> ../../../devices/system/edac/mc/mc0/dimm6
│ ├── inject_addrmatch -> ../../../devices/system/edac/mc/mc0/inject_addrmatch
│ ├── mc -> ../../../devices/system/edac/mc
│ └── mc0 -> ../../../devices/system/edac/mc/mc0
├── drivers
├── drivers_autoprobe
├── drivers_probe
└── uevent

On a multi-memory controller system, the names "csrow%d" and "dimm%d"
should be under "mc%d", and not at the main hierarchy level.

So, we need to create a per-MC bus, in order to have its own namespace.
Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Cc: Greg K H
Signed-off-by: Mauro Carvalho Chehab
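
The duplicate-filename warning described in the commit above can be illustrated with a small userspace model. This is only a sketch, not kernel code: the `Bus` class, `register_shared()` and `register_per_mc()` are hypothetical names invented here, and the model assumes nothing more than "a bus is a flat namespace of device names".

```python
class DuplicateName(Exception):
    """Raised when a name already exists in a bus namespace."""


class Bus:
    """Toy stand-in for a bus: one flat namespace of device names."""

    def __init__(self, name):
        self.name = name
        self.devices = {}

    def add_device(self, dev_name, target):
        # Mirrors the spirit of the sysfs warning: names must be unique per bus.
        if dev_name in self.devices:
            raise DuplicateName(
                f"cannot create duplicate filename "
                f"'/bus/{self.name}/devices/{dev_name}'"
            )
        self.devices[dev_name] = target


def register_shared(mcs):
    """Every memory controller registers its children on one shared bus."""
    bus = Bus("edac")
    for mc, children in mcs.items():
        bus.add_device(mc, mc)
        for child in children:
            bus.add_device(child, f"{mc}/{child}")
    return bus


def register_per_mc(mcs):
    """Each memory controller gets its own bus, hence its own namespace."""
    buses = {}
    for mc, children in mcs.items():
        bus = Bus(mc)
        for child in children:
            bus.add_device(child, f"{mc}/{child}")
        buses[mc] = bus
    return buses


# Two controllers that both own a "dimm0", as on the system in the trace.
topology = {"mc0": ["dimm0", "dimm1"], "mc1": ["dimm0", "dimm1"]}
```

With the shared bus, registering `mc1` fails on its first `dimm0`; with one bus per controller, both `dimm0` entries coexist because each name only has to be unique within its own `mc%d`.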
11 Jun, 2012
3 commits
-
No functional changes. Just comment improvements.
Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
As EDAC doesn't use struct device itself, it created a parent dev
pointer called "pdev". Now that we'll be converting it to use
struct device instead of struct sysdev, this needs to be fixed.

No functional changes.
Reviewed-by: Aristeu Rozanski
Acked-by: Chris Metcalf
Cc: Doug Thompson
Cc: Borislav Petkov
Cc: Mark Gross
Cc: Jason Uhlenkott
Cc: Tim Small
Cc: Ranganathan Desikan
Cc: "Arvind R."
Cc: Olof Johansson
Cc: Egor Martovetsky
Cc: Michal Marek
Cc: Jiri Kosina
Cc: Joe Perches
Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Hitoshi Mitake
Cc: Andrew Morton
Cc: "Niklas Söderlund"
Cc: Shaohui Xie
Cc: Josh Boyer
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Mauro Carvalho Chehab
-
Add a new tracepoint-based hardware events report method for
reporting Memory Controller events.

Part of the description below is shamelessly copied from Tony
Luck's notes about the Hardware Error BoF during LPC 2010 [1].
Tony, thanks for your notes and discussions to generate the
h/w error reporting requirements.

[1] http://lwn.net/Articles/416669/

We have several subsystems & methods for reporting hardware errors:
1) EDAC ("Error Detection and Correction"). In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely on reading chipset registers directly. A user-level
program decodes it into a somewhat human-readable format.

3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
decodes errors reported via machine check bank registers in AMD
processors to the console log using printk().

Each of these mechanisms has a band of followers ... and none
of them appear to meet all the needs of all users.

As part of a RAS subsystem, let's encapsulate the memory error hardware
events into a trace facility.

The tracepoint printk will be displayed like:

mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver detail])

Where:
[quant] is the quantity of errors
[error msg] is the driver-specific error message
(e.g. "memory read", "bus error", ...);
[location] is the location in terms of memory controller and
branch/channel/slot, channel/slot or csrow/channel;
[label] is the memory stick label;
[edac_mc detail] describes the address location of the error
and the syndrome;
[driver detail] is driver-specific error message details,
when needed/provided (e.g. "area:DMA", ...)

For example:

mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)
Of course, any userspace tools meant to handle errors should not parse
the above data. They should, instead, use the binary fields provided by
the tracepoint, mapping them directly into their Management Information
Base.

NOTE: The original patch provided an additional mechanism for
MCA-based trace events that also contained MCA error register data.
However, as no agreement has been reached so far on the MCA-based trace
events, for now, let's add events only for memory errors.
A later patch is planned to change the tracepoint for those types
of event.

Cc: Aristeu Rozanski
Cc: Doug Thompson
Cc: Steven Rostedt
Cc: Frederic Weisbecker
Cc: Ingo Molnar
Signed-off-by: Mauro Carvalho Chehab
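
As a rough illustration of the advice in the commit above (work from the event's fields, not from the printed line), here is a minimal sketch. The `McEvent` class and its field names are assumptions made for this example and are not the tracepoint's real binary layout; treating "memory stick DIMM_1A" as the whole label is likewise inferred from the sample line.

```python
from dataclasses import dataclass


@dataclass
class McEvent:
    """Hypothetical userspace view of one memory error event."""
    count: int
    severity: str            # "Corrected", "Uncorrected" or "Fatal"
    msg: str                 # driver-specific error message
    label: str               # memory stick label
    mc: int                  # memory controller index
    location: tuple          # e.g. (branch, channel, slot)
    page: int
    offset: int
    grain: int
    syndrome: int
    driver_detail: str = ""  # optional, e.g. "area:DMA"

    def render(self):
        """Build the human-readable line from the structured fields."""
        loc = ":".join(str(part) for part in self.location)
        detail = (f"mc:{self.mc} location:{loc} page:{self.page:#x} "
                  f"offset:{self.offset:#x} grain:{self.grain} "
                  f"syndrome:{self.syndrome:#x}")
        if self.driver_detail:
            detail += f" {self.driver_detail}"
        return (f"{self.count} {self.severity} error:{self.msg} "
                f"on {self.label} ({detail})")


# The sample event from the commit message, expressed as fields.
event = McEvent(count=1, severity="Corrected", msg="memory read",
                label="memory stick DIMM_1A", mc=0, location=(0, 0, 0),
                page=0x586b6e, offset=0xa66, grain=32, syndrome=0x0,
                driver_detail="area:DMA")
line = event.render()
```

A tool built this way can test `event.severity` or `event.page` directly; the rendered string is only for people reading a log.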
29 May, 2012
32 commits
-
There is a flag in the per-channel struct that indicates if there are
any 4R DIMMs on it. The way the presence of this flag was reported
is not ok, as it might give the false idea that the channel was filled
with 2R memories:

[ 580.588701] EDAC DEBUG: get_dimm_config: Ch1 phy rd1, wr1 (0x063f7431): 2 ranks, UDIMMs
[ 580.588704] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

(in this case, just one 1R memory is filled on channel 1)

So, use a better way to represent the per-channel ranks information.
After the patch, it will show:

[ 2002.233978] EDAC DEBUG: get_dimm_config: Ch0 phy rd0, wr0 (0x063f7431): UDIMMs
[ 2002.233982] EDAC DEBUG: get_dimm_config: dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
[ 2002.233988] EDAC DEBUG: get_dimm_config: dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400

(in this case, there aren't any 4R memories)

Reported-by: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
The fatal error channel bits point to a single channel, and not
to a range of channels. Fix the code to properly report it,
instead of printing messages like:

kernel: EDAC MC0: INTERNAL ERROR: channel-b out of range (4 >= 4)

Signed-off-by: Mauro Carvalho Chehab
-
drivers/edac/i5100_edac.c: In function ‘i5100_init_csrows’:
drivers/edac/i5100_edac.c:862:3: warning: format ‘%zd’ expects argument of type ‘signed size_t’, but argument 5 has type ‘long unsigned int’ [-Wformat]

Reviewed-by: Aristeu Rozanski
Cc: "Niklas Söderlund"
Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
Avoid testing nr_pages twice, and avoid initializing some data that
won't be used.

Cleanup patch only.

Reported-by: Aristeu Rozanski Filho
Reviewed-by: Aristeu Rozanski
Cc: Ranganathan Desikan
Cc: "Arvind R."
Signed-off-by: Mauro Carvalho Chehab
-
No functional changes here. Only the comments got updated.

Reviewed-by: Aristeu Rozanski
Cc: Mark Gross
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
The logic there is broken: it basically creates two csrows for
each DIMM and assumes that all DIMMs are dual rank. Only one of
the csrows will contain the entire DIMM size. If single rank
memories are found, they'll be marked with 0 bytes.

The check for whether the AMB is present was also wrong.

Yet, as the error reports don't use the memory size in order to
credit an error to the right DIMM, that part of the driver seems
to work. That's probably why nobody detected the issue yet.

After this patch, the memory layout is now properly reported
when debug mode is enabled, and the number of ranks per DIMM is
now shown:

calculate_dimm_size: ----------------------------------------------------------
calculate_dimm_size: slot 3 0 MB | 0 MB | 0 MB | 0 MB |
calculate_dimm_size: slot 2 0 MB | 0 MB | 0 MB | 0 MB |
calculate_dimm_size: ----------------------------------------------------------
calculate_dimm_size: slot 1 0 MB | 0 MB | 0 MB | 0 MB |
calculate_dimm_size: slot 0 512 MB 1R| 512 MB 1R| 512 MB 1R| 512 MB 1R|
calculate_dimm_size: ----------------------------------------------------------
calculate_dimm_size: channel 0 | channel 1 | channel 2 | channel 3 |
calculate_dimm_size: branch 0 | branch 1 |

(1R above means that all memories on my test machine are single-ranked)

Reviewed-by: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
Improve the debug output messages in order to better represent the
memory controller hierarchy.

No functional changes when debug is disabled.

Reviewed-by: Aristeu Rozanski
Signed-off-by: Mauro Carvalho Chehab
-
Remove some information that is duplicated in the MCE log
and isn't of much use for the error. That data will be
added again when creating a trace function that outputs both
memory errors and MCE fields.

Cc: Aristeu Rozanski
Signed-off-by: Mauro Carvalho Chehab
-
While userspace doesn't fill the DIMM labels, fill them with the DIMM location,
as described by the memory model in use. This could eventually match what
is described by dmidecode, making it easier for people to identify the
memory.

For example, on an Intel motherboard where the DMI table is reliable,
the first memory stick is described as:

Memory Device
Array Handle: 0x0029
Error Information Handle: Not Provided
Total Width: 64 bits
Data Width: 64 bits
Size: 2048 MB
Form Factor: DIMM
Set: 1
Locator: A1_DIMM0
Bank Locator: A1_Node0_Channel0_Dimm0
Type:
Type Detail: Synchronous
Speed: 800 MHz
Manufacturer: A1_Manufacturer0
Serial Number: A1_SerNum0
Asset Tag: A1_AssetTagNum0
Part Number: A1_PartNum0

The memory named "A1_DIMM0" is physically located at the first
memory controller (node 0), at channel 0, dimm slot 0.

After this patch, the memory label will be filled with:

/sys/devices/system/edac/mc/csrow0/ch0_dimm_label:mc#0channel#0slot#0

And (after the new EDAC API patches) as:

/sys/devices/system/edac/mc/mc0/dimm0/dimm_label:mc#0channel#0slot#0

So, even if the memory label is not initialized by userspace, useful
information about the error location is filled in there, especially since
several systems/motherboards are provided with enough info to map from
channel/slot (or branch/channel/slot) into the DIMM label. So, letting the
EDAC core fill it by default is a good thing.

It should be noted that, as the label filling happens in
edac_mc_alloc(), drivers can override it to better describe the memories
(and some actually do it).

Cc: Aristeu Rozanski
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
Now that all drivers got converted to use the new ABI, we can
drop the old one.

Acked-by: Chris Metcalf
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Acked-by: Chris Metcalf
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Tim Small
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Josh Boyer
Cc: Jiri Kosina
Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Olof Johansson
Cc: Egor Martovetsky
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Andrew Morton
Cc: Shaohui Xie
Cc: Jiri Kosina
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Ranganathan Desikan
Cc: "Arvind R."
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Michal Marek
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Tim Small
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: "Niklas Söderlund"
Cc: Borislav Petkov
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Hitoshi Mitake
Cc: Borislav Petkov
Cc: Andrew Morton
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Jason Uhlenkott
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Mark Gross
Cc: Doug Thompson
Signed-off-by: Mauro Carvalho Chehab
-
The legacy edac ABI is going to be removed. Port the driver to use
and benefit from the new API functionality.

Cc: Dmitry Eremin-Solenikov
Cc: Benjamin Herrenschmidt
Cc: Michal Marek
Signed-off-by: Mauro Carvalho Chehab