07 Sep, 2013

1 commit

  • Pull Tile arch updates from Chris Metcalf:
    "These changes bring in a bunch of new functionality that has been
    maintained internally at Tilera over the last year, plus other stray
    bits of work that I've taken into the tile tree from other folks.

    The changes include some PCI root complex work, interrupt-driven
    console support, support for performing fast-path unaligned data
    fixups by kernel-based JIT code generation, CONFIG_PREEMPT support,
    vDSO support for gettimeofday(), a serial driver for the tilegx
    on-chip UART, KGDB support, more optimized string routines, support
    for ftrace and kprobes, improved ASLR, and many bug fixes.

    We also remove support for the old TILE64 chip, which is no longer
    buildable"

    * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: (85 commits)
    tile: refresh tile defconfig files
    tile: rework
    tile PCI RC: make default consistent DMA mask 32-bit
    tile: add null check for kzalloc in tile/kernel/setup.c
    tile: make __write_once a synonym for __read_mostly
    tile: remove support for TILE64
    tile: use asm-generic/bitops/builtin-*.h
    tile: eliminate no-op "noatomichash" boot argument
    tile: use standard tile_bundle_bits type in traps.c
    tile: simplify code referencing hypervisor API addresses
    tile: change to in comments
    tile: mark pcibios_init() as __init
    tile: check for correct compiler earlier in asm-offsets.c
    tile: use standard 'generic-y' model for
    tile: use asm-generic version of
    tile PCI RC: add comment about "PCI hole" problem
    tile: remove DEBUG_EXTRA_FLAGS kernel config option
    tile: add virt_to_kpte() API and clean up and document behavior
    tile: support FRAME_POINTER
    tile: support reporting Tilera hypervisor statistics
    ...

    Linus Torvalds
     

27 Aug, 2013

2 commits

  • dct_base and dct_limit obtain 32 bit register values when they read
    their respective pci config space registers. A left shift beyond 32 bits
    will cause them to wrap around. Similar case for chan_addr as can be
    seen from the bug report (link below). In the patch, we rectify this by
    casting chan_addr to u64 and by comparing dct_base and dct_limit against
    properly shifted sys_addr in order to compare the correct bits.
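
    The wrap-around described above can be sketched in plain C. This is
    only an illustration, not the driver code: the function names and the
    shift amount used in the usage note are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: 'reg' stands in for a 32-bit value read from PCI
 * config space. The helper names are hypothetical.
 */

/* Shift carried out in 32-bit arithmetic: bits shifted past bit 31 are lost. */
static uint64_t shift_in_32_bits(uint32_t reg, unsigned int shift)
{
	assert(shift < 32);	/* shifting by >= the type width is undefined */
	return (uint32_t)(reg << shift);
}

/* Widen to 64 bits first, then shift: the full result is preserved. */
static uint64_t shift_in_64_bits(uint32_t reg, unsigned int shift)
{
	assert(shift < 64);
	return (uint64_t)reg << shift;
}
```

    With reg = 0x20 and a shift of 27, the first helper yields 0 while the
    second yields 0x100000000, which is why the cast has to happen before
    the shift.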

    Reported-by: Dan Carpenter
    Signed-off-by: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/20130819132302.GA12171@elgon.mountain
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan
     
  • Basically we want to cover all 0x0-0xf models, i.e. Orochi and later.

    Cc: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/20130819192321.GF4165@pd.tnic
    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

15 Aug, 2013

1 commit


14 Aug, 2013

1 commit

  • The struct should be terminated by using empty braces in order to
    fix the following sparse warning.

    drivers/edac/cpc925_edac.c:792:10: warning: Using plain integer as NULL pointer
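
    A minimal sketch of the pattern, assuming a table whose first member
    is a pointer (the struct and names below are hypothetical, not the
    ones in cpc925_edac.c). Terminating with { 0 } would initialize that
    pointer from a plain integer, which is what sparse complains about.
    Note that empty braces are a GNU C extension before C23.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry; the first member is a pointer. */
struct dev_info {
	const char *name;
	int id;
};

static const struct dev_info dev_table[] = {
	{ "first",  1 },
	{ "second", 2 },
	{ }	/* terminator: every member zeroed, so name == NULL */
};

/* Count entries up to, but not including, the terminator. */
static size_t count_entries(const struct dev_info *tbl)
{
	size_t n = 0;

	assert(tbl != NULL);
	while (tbl[n].name != NULL)
		n++;
	return n;
}
```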

    Signed-off-by: Jingoo Han
    [ drop obvious comment ]
    Signed-off-by: Borislav Petkov

    Jingoo Han
     

12 Aug, 2013

2 commits

  • Now that we cache (family, model, stepping) locally, use them instead of
    boot_cpu_data.

    No functionality change.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • On newer models, support has been included for up to 4 DCTs; however,
    only DCT0 and DCT3 are currently configured (cf. BKDG Section 2.10).
    Also, the DRAM request routing algorithm is different for F15h M30h.
    Thus it is cleaner to use a brand new function rather than adding quirks
    to the more generic f1x_match_to_this_node(). Refer to "2.10.5 DRAM
    Routing Requests" in the BKDG for further info.

    Tested on Fam15h M30h with ECC turned on using mce_amd_inj facility and
    verified to be functionally correct.

    While at it, verify if erratum workarounds for E505 and E637 still hold.
    From email conversations within AMD, the current status of the errata
    is:

    * Erratum 505: fixed in model 0x1, stepping 0x1 and later.
    * Erratum 637: not fixed.

    Signed-off-by: Aravind Gopalakrishnan
    [ Cleanups, corrections ]
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan
     

09 Aug, 2013

2 commits

  • Make a local function static in order to fix the following sparse
    warning:

    drivers/edac/x38_edac.c:252:14: warning: symbol 'x38_map_mchbar' was not declared. Should it be static?

    Signed-off-by: Jingoo Han
    [ Boris: Correct commit message ]
    Signed-off-by: Borislav Petkov

    Jingoo Han
     
  • This local symbol is used only in this file.
    Fix the following sparse warnings:

    drivers/edac/i3200_edac.c:264:14: warning: symbol 'i3200_map_mchbar' was not declared. Should it be static?

    Signed-off-by: Jingoo Han
    Signed-off-by: Borislav Petkov

    Jingoo Han
     

29 Jul, 2013

1 commit

  • It can happen that configurations are running in a single-channel mode
    even with a dual-channel memory controller, by, say, putting the DIMMs
    only on the one channel and leaving the other empty. This causes a
    problem in init_csrows which implicitly assumes that when the second
    channel is enabled, i.e. channel 1, the struct dimm hierarchy will be
    present, which it is not.

    So always allocate two channels unconditionally.

    This provides for the nice side effect that the data structures are
    initialized so some day, when memory hotplug is supported, it should
    just work out of the box when all of a sudden a second channel appears.

    Reported-and-tested-by: Roger Leigh
    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

24 Jul, 2013

2 commits

  • strict_strtol() is obsolete and its use is discouraged. Use kstrtol()
    instead.
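
    For readers outside the kernel, the strict-parsing contract can be
    approximated in userspace. This is a sketch built on strtol(), not the
    kernel implementation; parse_long_strict() is a hypothetical name. It
    returns 0 on success and a negative errno value otherwise, and
    tolerates one trailing newline. Unlike the kernel helper, strtol()
    also skips leading whitespace.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace approximation of a strict string-to-long conversion. */
static int parse_long_strict(const char *s, int base, long *res)
{
	char *end;
	long val;

	assert(s != NULL && res != NULL);

	errno = 0;
	val = strtol(s, &end, base);
	if (errno == ERANGE)
		return -ERANGE;		/* value does not fit in a long */
	if (end == s)
		return -EINVAL;		/* no digits were consumed */
	if (*end == '\n')
		end++;			/* allow a single trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */

	*res = val;
	return 0;
}
```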

    Signed-off-by: Jingoo Han
    Signed-off-by: Borislav Petkov

    Jingoo Han
     
  • Fix the following:

    BUG: key ffff88043bdd0330 not in .data!
    ------------[ cut here ]------------
    WARNING: at kernel/lockdep.c:2987 lockdep_init_map+0x565/0x5a0()
    DEBUG_LOCKS_WARN_ON(1)
    Modules linked in: glue_helper sb_edac(+) edac_core snd acpi_cpufreq lrw gf128mul ablk_helper iTCO_wdt evdev i2c_i801 dcdbas button cryptd pcspkr iTCO_vendor_support usb_common lpc_ich mfd_core soundcore mperf processor microcode
    CPU: 2 PID: 599 Comm: modprobe Not tainted 3.10.0 #1
    Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A08 01/24/2013
    0000000000000009 ffff880439a1d920 ffffffff8160a9a9 ffff880439a1d958
    ffffffff8103d9e0 ffff88043af4a510 ffffffff81a16e11 0000000000000000
    ffff88043bdd0330 0000000000000000 ffff880439a1d9b8 ffffffff8103dacc
    Call Trace:
    dump_stack
    warn_slowpath_common
    warn_slowpath_fmt
    lockdep_init_map
    ? trace_hardirqs_on_caller
    ? trace_hardirqs_on
    debug_mutex_init
    __mutex_init
    bus_register
    edac_create_sysfs_mci_device
    edac_mc_add_mc
    sbridge_probe
    pci_device_probe
    driver_probe_device
    __driver_attach
    ? driver_probe_device
    bus_for_each_dev
    driver_attach
    bus_add_driver
    driver_register
    __pci_register_driver
    ? 0xffffffffa0010fff
    sbridge_init
    ? 0xffffffffa0010fff
    do_one_initcall
    load_module
    ? unset_module_init_ro_nx
    SyS_init_module
    tracesys
    ---[ end trace d24a70b0d3ddf733 ]---
    EDAC MC0: Giving out device to 'sbridge_edac.c' 'Sandy Bridge Socket#0': DEV 0000:3f:0e.0
    EDAC sbridge: Driver loaded.

    What happens is that bus_register needs a statically allocated lock_key
    because the latter is handed to lockdep. However, struct mem_ctl_info
    embeds struct bus_type (the whole struct, not a pointer to it) and the
    whole thing gets dynamically allocated.

    Fix this by using a statically allocated struct bus_type for the MC bus.
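
    The shape of the fix can be sketched in userspace C. All types and
    names below are hypothetical stand-ins (ex_bus for struct bus_type,
    ex_ctl for struct mem_ctl_info); the point is only that the
    dynamically allocated controller holds a pointer into a static array
    instead of embedding the bus structure.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_MAX_CTL 4	/* hypothetical upper bound on controllers */

struct ex_bus {		/* stand-in for the bus structure */
	const char *name;
	int lock_key;
};

struct ex_ctl {		/* stand-in for the controller structure */
	int idx;
	struct ex_bus *bus;	/* pointer, not an embedded struct */
};

/* Static storage: these addresses exist for the whole program lifetime. */
static struct ex_bus ex_buses[EX_MAX_CTL];

/* Allocate a controller dynamically, but point it at a static bus slot. */
static struct ex_ctl *ex_ctl_alloc(int idx)
{
	struct ex_ctl *ctl;

	if (idx < 0 || idx >= EX_MAX_CTL)
		return NULL;
	ctl = calloc(1, sizeof(*ctl));
	if (!ctl)
		return NULL;
	ctl->idx = idx;
	ctl->bus = &ex_buses[idx];
	return ctl;
}

/* Roughly the kind of check that rejects a key outside static storage. */
static int ex_is_static_bus(const struct ex_bus *bus)
{
	uintptr_t p = (uintptr_t)bus;

	assert(bus != NULL);
	return p >= (uintptr_t)&ex_buses[0] &&
	       p < (uintptr_t)&ex_buses[EX_MAX_CTL];
}
```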

    Signed-off-by: Borislav Petkov
    Acked-by: Mauro Carvalho Chehab
    Cc: Markus Trippelsdorf
    Cc: stable@kernel.org # v3.10
    Signed-off-by: Tony Luck

    Borislav Petkov
     

18 Jul, 2013

1 commit


14 Jul, 2013

1 commit

  • Pull MIPS updates from Ralf Baechle:
    "MIPS updates:

    - All the things that didn't make 3.10.
    - Removes the Windriver PPMC platform. Nobody will miss it.
    - Remove a workaround from kernel/irq/irqdomain.c which was there
    exclusively for MIPS. Patch by Grant Likely.
    - More small improvements for the SEAD 3 platform
    - Improvements on the BMIPS / SMP support for the BCM63xx series.
    - Various cleanups of dead leftovers.
    - Platform support for the Cavium Octeon-based EdgeRouter Lite.

    Two large KVM patchsets didn't make it for this pull request because
    their respective authors are vacationing"

    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (124 commits)
    MIPS: Kconfig: Add missing MODULES dependency to VPE_LOADER
    MIPS: BCM63xx: CLK: Add dummy clk_{set,round}_rate() functions
    MIPS: SEAD3: Disable L2 cache on SEAD-3.
    MIPS: BCM63xx: Enable second core SMP on BCM6328 if available
    MIPS: BCM63xx: Add SMP support to prom.c
    MIPS: define write{b,w,l,q}_relaxed
    MIPS: Expose missing pci_io{map,unmap} declarations
    MIPS: Malta: Update GCMP detection.
    Revert "MIPS: make CAC_ADDR and UNCAC_ADDR account for PHYS_OFFSET"
    MIPS: APSP: Remove
    SSB: Kconfig: Amend SSB_EMBEDDED dependencies
    MIPS: microMIPS: Fix improper definition of ISA exception bit.
    MIPS: Don't try to decode microMIPS branch instructions where they cannot exist.
    MIPS: Declare emulate_load_store_microMIPS as a static function.
    MIPS: Fix typos and cleanup comment
    MIPS: Cleanup indentation and whitespace
    MIPS: BMIPS: support booting from physical CPU other than 0
    MIPS: Only set cpu_has_mmips if SYS_SUPPORTS_MICROMIPS
    MIPS: GIC: Fix gic_set_affinity infinite loop
    MIPS: Don't save/restore OCTEON wide multiplier state on syscalls.
    ...

    Linus Torvalds
     

04 Jul, 2013

1 commit


11 Jun, 2013

1 commit

  • Use CAVIUM_OCTEON_SOC in most places where we used to use
    CPU_CAVIUM_OCTEON. This allows us to use CPU_CAVIUM_OCTEON in places
    where we have no OCTEON SOC.

    Remove CAVIUM_OCTEON_SIMULATOR as it doesn't really do anything, we can
    get the same configuration with CAVIUM_OCTEON_SOC.

    Signed-off-by: David Daney
    Cc: linux-mips@linux-mips.org
    Cc: linux-ide@vger.kernel.org
    Cc: linux-edac@vger.kernel.org
    Cc: linux-i2c@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: spi-devel-general@lists.sourceforge.net
    Cc: devel@driverdev.osuosl.org
    Cc: linux-usb@vger.kernel.org
    Acked-by: Greg Kroah-Hartman
    Acked-by: Wolfram Sang
    Acked-by: Mauro Carvalho Chehab
    Patchwork: https://patchwork.linux-mips.org/patch/5295/
    Signed-off-by: Ralf Baechle

    David Daney
     

08 Jun, 2013

2 commits


04 Jun, 2013

1 commit

  • Ever since commit 45f035ab9b8f ("CONFIG_HOTPLUG should be always on"),
    it has been basically impossible to build a kernel with CONFIG_HOTPLUG
    turned off. Remove all the remaining references to it.

    Cc: Russell King
    Cc: Doug Thompson
    Cc: Bjorn Helgaas
    Cc: Steven Whitehouse
    Cc: Arnd Bergmann
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Andrew Morton
    Signed-off-by: Stephen Rothwell
    Acked-by: Mauro Carvalho Chehab
    Acked-by: Hans Verkuil
    Signed-off-by: Greg Kroah-Hartman

    Stephen Rothwell
     

21 May, 2013

1 commit


10 May, 2013

1 commit


09 May, 2013

1 commit

  • I get the following warning on boot:

    ------------[ cut here ]------------
    WARNING: at drivers/base/core.c:575 device_create_file+0x9a/0xa0()
    Hardware name: -[8737R2A]-
    Write permission without 'store'
    ...

    Drilling down, this is related to dynamic channel ce_count attribute
    files sporting a S_IWUSR mode without a ->store() function. Looking
    around, it appears that they aren't supposed to have a ->store()
    function. So remove the bogus write permission to get rid of the
    warning.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Mauro Carvalho Chehab
    Cc: # 3.[89]
    [ shorten commit message ]
    Signed-off-by: Borislav Petkov

    Srivatsa S. Bhat
     

01 May, 2013

1 commit

  • Pull edac fixes from Mauro Carvalho Chehab:
    "Two edac fixes:

    - i7300_edac currently reports a wrong number of DIMMs when the
    memory controller is in single channel mode

    - on some Sandy Bridge machines, the EDAC driver bails out as one of
    the PCI IDs used by the driver is hidden by BIOS. As the driver
    uses it only to detect the type of memory, make it optional at the
    driver"

    * 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
    edac: sb_edac.c should not require prescence of IMC_DDRIO device
    i7300_edac: Fix memory detection in single mode

    Linus Torvalds
     

29 Apr, 2013

2 commits

  • The Sandy Bridge EDAC driver uses a register in the IMC_DDRIO CSR
    space to determine the type of DIMMs (registered or unregistered).
    But this device does not exist on some single socket Sandy Bridge
    servers. While the type of DIMMs is nice to know, it is not essential
    for this driver's other functions. So it seems harsh to have it
    refuse to load at all when it cannot find this device.

    Make the check for this device be optional. If it isn't present
    just report the memory type as "MEM_UNKNOWN".

    Signed-off-by: Tony Luck
    Signed-off-by: Mauro Carvalho Chehab

    Luck, Tony
     
  • When the machine is in single mode, only branch 0 channel 0
    is valid. However, the code does not honour this:

    [ 1952.639341] EDAC DEBUG: i7300_get_mc_regs: Memory controller operating on single mode
    ...
    [ 1952.639351] EDAC DEBUG: i7300_init_csrows: AMB-present CH0 = 0x1:
    [ 1952.639353] EDAC DEBUG: i7300_init_csrows: AMB-present CH1 = 0x0:
    [ 1952.639355] EDAC DEBUG: i7300_init_csrows: AMB-present CH2 = 0x0:
    [ 1952.639358] EDAC DEBUG: i7300_init_csrows: AMB-present CH3 = 0x0:
    ...
    [ 1952.639360] EDAC DEBUG: decode_mtr: MTR0 CH0: DIMMs are Present (mtr)
    [ 1952.639362] EDAC DEBUG: decode_mtr: WIDTH: x8
    [ 1952.639363] EDAC DEBUG: decode_mtr: ELECTRICAL THROTTLING is enabled
    [ 1952.639364] EDAC DEBUG: decode_mtr: NUMBANK: 4 bank(s)
    [ 1952.639366] EDAC DEBUG: decode_mtr: NUMRANK: single
    [ 1952.639367] EDAC DEBUG: decode_mtr: NUMROW: 16,384 - 14 rows
    [ 1952.639368] EDAC DEBUG: decode_mtr: NUMCOL: 1,024 - 10 columns
    [ 1952.639370] EDAC DEBUG: decode_mtr: SIZE: 512 MB
    [ 1952.639371] EDAC DEBUG: decode_mtr: ECC code is 8-byte-over-32-byte SECDED+ code
    [ 1952.639373] EDAC DEBUG: decode_mtr: Scrub algorithm for x8 is on enhanced mode
    [ 1952.639374] EDAC DEBUG: decode_mtr: MTR0 CH1: DIMMs are Present (mtr)
    [ 1952.639376] EDAC DEBUG: decode_mtr: WIDTH: x8
    [ 1952.639377] EDAC DEBUG: decode_mtr: ELECTRICAL THROTTLING is enabled
    [ 1952.639379] EDAC DEBUG: decode_mtr: NUMBANK: 4 bank(s)
    [ 1952.639380] EDAC DEBUG: decode_mtr: NUMRANK: single
    [ 1952.639381] EDAC DEBUG: decode_mtr: NUMROW: 16,384 - 14 rows
    [ 1952.639383] EDAC DEBUG: decode_mtr: NUMCOL: 1,024 - 10 columns
    [ 1952.639384] EDAC DEBUG: decode_mtr: SIZE: 512 MB
    [ 1952.639385] EDAC DEBUG: decode_mtr: ECC code is 8-byte-over-32-byte SECDED+ code
    [ 1952.639387] EDAC DEBUG: decode_mtr: Scrub algorithm for x8 is on enhanced mode
    ...
    [ 1952.639449] EDAC DEBUG: print_dimm_size: channel 0 | channel 1 | channel 2 | channel 3 |
    [ 1952.639451] EDAC DEBUG: print_dimm_size: -------------------------------------------------------------
    [ 1952.639453] EDAC DEBUG: print_dimm_size: csrow/SLOT 0 512 MB | 512 MB | 0 MB | 0 MB |
    [ 1952.639456] EDAC DEBUG: print_dimm_size: csrow/SLOT 1 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639458] EDAC DEBUG: print_dimm_size: csrow/SLOT 2 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639460] EDAC DEBUG: print_dimm_size: csrow/SLOT 3 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639462] EDAC DEBUG: print_dimm_size: csrow/SLOT 4 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639464] EDAC DEBUG: print_dimm_size: csrow/SLOT 5 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639466] EDAC DEBUG: print_dimm_size: csrow/SLOT 6 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639468] EDAC DEBUG: print_dimm_size: csrow/SLOT 7 0 MB | 0 MB | 0 MB | 0 MB |
    [ 1952.639470] EDAC DEBUG: print_dimm_size: -------------------------------------------------------------

    Instead of detecting a single memory at channel 0, it is showing
    twice the memory.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

19 Apr, 2013

1 commit


25 Mar, 2013

1 commit


16 Mar, 2013

2 commits

  • Both mci.mem_is_per_rank and mci.csbased denote the same thing: the
    memory controller is csrows based. Merge both fields into one.

    There's no need for the driver to actually fill it, as the core detects
    it by checking if one of the layers has the csrows type as part of the
    memory hierarchy:

    if (layers[i].type == EDAC_MC_LAYER_CHIP_SELECT)
            per_rank = true;
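
    As a self-contained sketch of that detection, assuming a simplified
    layer description (the enum and struct below are hypothetical
    stand-ins for the EDAC ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified layer types. */
enum ex_layer_type {
	EX_LAYER_BRANCH,
	EX_LAYER_CHANNEL,
	EX_LAYER_SLOT,
	EX_LAYER_CHIP_SELECT
};

struct ex_layer {
	enum ex_layer_type type;
	unsigned int size;
};

/* A controller is csrow based if any layer is of the chip-select type. */
static bool ex_is_csrow_based(const struct ex_layer *layers, size_t n)
{
	size_t i;

	assert(layers != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (layers[i].type == EX_LAYER_CHIP_SELECT)
			return true;
	return false;
}
```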

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Borislav Petkov

    Mauro Carvalho Chehab
     
  • We were filling the csrow size with a wrong value. 16a528ee3975 ("EDAC:
    Fix csrow size reported in sysfs") tried to address the issue. It fixed
    the report with the old API but not with the new one. Correct it for the
    new API too.

    Signed-off-by: Mauro Carvalho Chehab
    [ make it a per-csrow accounting regardless of ->channel_count ]
    Signed-off-by: Borislav Petkov

    Mauro Carvalho Chehab
     

05 Mar, 2013

1 commit


01 Mar, 2013

1 commit

  • Pull EDAC fixes and ghes-edac from Mauro Carvalho Chehab:
    "For:

    - Some fixes at edac drivers (i7core_edac, sb_edac, i3200_edac);
    - error injection support for i5100, when EDAC debug is enabled;
    - fix edac when it is loaded builtin (early init for the subsystem);
    - a "Firmware First" EDAC driver, allowing ghes to report errors via
    EDAC (ghes-edac).

    With regards to ghes-edac, this fixes a longstanding BZ at Red Hat
    that happens with Nehalem and Sandy Bridge CPUs: when both GHES and
    i7core_edac or sb_edac are running, the error reports are
    unpredictable, as both BIOS and OS race to access the registers. With
    ghes-edac, the EDAC core will refuse to register any other concurrent
    memory error driver.

    This patchset moves the ghes struct definitions to a separate header
    file (include/acpi/ghes.h) and adds 3 hooks at apei/ghes.c to
    register/unregister and to report errors via ghes-edac. Those changes
    were acked by ghes driver maintainer (Huang)."

    * 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (30 commits)
    i5100_edac: convert to use simple_open()
    ghes_edac: fix to use list_for_each_entry_safe() when delete list items
    ghes_edac: Fix RAS tracing
    ghes_edac: Make it compliant with UEFI spec 2.3.1
    ghes_edac: Improve driver's printk messages
    ghes_edac: Don't credit the same memory dimm twice
    ghes_edac: do a better job of filling EDAC DIMM info
    ghes_edac: add support for reporting errors via EDAC
    ghes_edac: Register at EDAC core the BIOS report
    ghes: add the needed hooks for EDAC error report
    ghes: move structures/enum to a header file
    edac: add support for error type "Info"
    edac: add support for raw error reports
    edac: reduce stack pressure by using a pre-allocated buffer
    edac: lock module owner to avoid error report conflicts
    edac: remove proc_name from mci structure
    edac: add a new memory layer type
    edac: initialize the core earlier
    edac: better report error conditions in debug mode
    i5100_edac: Remove two checkpatch warnings
    ...

    Linus Torvalds
     

26 Feb, 2013

9 commits

  • This removes an open coded simple_open() function and
    replaces file operations references to the function
    with simple_open() instead.

    Signed-off-by: Wei Yongjun
    Signed-off-by: Mauro Carvalho Chehab

    Wei Yongjun
     
  • Since we will remove items off the list using list_del() we need
    to use a safe version of the list_for_each_entry() macro aptly named
    list_for_each_entry_safe().
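
    The idea behind the safe variant can be shown with a plain singly
    linked list in userspace: the successor is saved before the current
    node is released, so the loop never reads freed memory. The types and
    helpers are hypothetical and unrelated to the kernel list API.

```c
#include <assert.h>
#include <stdlib.h>

struct ex_node {
	int value;
	struct ex_node *next;
};

/* Prepend a node; this sketch treats allocation failure as fatal. */
static struct ex_node *ex_push(struct ex_node *head, int value)
{
	struct ex_node *n = malloc(sizeof(*n));

	assert(n != NULL);
	n->value = value;
	n->next = head;
	return n;
}

/* Free every node and return how many were freed. */
static size_t ex_free_all(struct ex_node *head)
{
	struct ex_node *pos, *tmp;
	size_t freed = 0;

	for (pos = head; pos != NULL; pos = tmp) {
		tmp = pos->next;	/* save the successor before freeing pos */
		free(pos);
		freed++;
	}
	return freed;
}
```

    Writing the loop step as pos = pos->next instead would dereference a
    node that has already been freed.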

    Signed-off-by: Wei Yongjun
    Signed-off-by: Mauro Carvalho Chehab

    Wei Yongjun
     
  • With the current version of CPER, there's no way to associate an
    error with the memory error. So, the error location in EDAC
    layers is unused.

    As CPER has its own idea about memory architectural layers, just
    output whatever is there inside the driver's detail at the RAS
    tracepoint.

    The EDAC location is left untouched in case, at some point in the
    future, we can actually map the error onto the DIMM labels.

    Now, the error message:

    [ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 72.396627] {1}[Hardware Error]: APEI generic hardware error status
    [ 72.396628] {1}[Hardware Error]: severity: 2, corrected
    [ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
    [ 72.396632] {1}[Hardware Error]: flags: 0x01
    [ 72.396634] {1}[Hardware Error]: primary
    [ 72.396635] {1}[Hardware Error]: section_type: memory error
    [ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
    [ 72.396638] {1}[Hardware Error]: node: 3
    [ 72.396639] {1}[Hardware Error]: card: 0
    [ 72.396640] {1}[Hardware Error]: module: 0
    [ 72.396641] {1}[Hardware Error]: device: 0
    [ 72.396643] {1}[Hardware Error]: error_type: 18, unknown
    [ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

    Is properly represented on the trace event:

    kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)

    Tested on a 4-socket E5-4650 Sandy Bridge machine.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • The UEFI spec defines the memory error types and the bits that
    validate each field on the memory error record, in
    Appendix N, items N.2.5 (Memory Error Section) and
    N.2.11 (Error Status). Make the error description compliant with
    it, only showing the valid fields.
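
    A sketch of gating each field on its validation bit. The struct and
    helper names are hypothetical, and the bit values are assumptions
    based on my reading of the Memory Error Section layout; check them
    against the spec before relying on them. This is not the driver's
    formatting code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed validation bit positions for a few location fields. */
#define EX_VALID_NODE	0x0008u
#define EX_VALID_CARD	0x0010u
#define EX_VALID_MODULE	0x0020u
#define EX_VALID_BANK	0x0040u

struct ex_mem_err {
	uint64_t validation_bits;
	unsigned int node, card, module, bank;
};

/* Append "name:val " to buf if there is still room. */
static void ex_append(char *buf, size_t len, size_t *used,
		      const char *name, unsigned int val)
{
	int n;

	if (*used >= len)
		return;
	n = snprintf(buf + *used, len - *used, "%s:%u ", name, val);
	if (n > 0)
		*used += (size_t)n;
}

/* Format only the fields whose validation bit is set. */
static void ex_format_location(const struct ex_mem_err *e, char *buf, size_t len)
{
	size_t used = 0, n;

	assert(e != NULL && buf != NULL && len > 0);
	buf[0] = '\0';
	if (e->validation_bits & EX_VALID_NODE)
		ex_append(buf, len, &used, "node", e->node);
	if (e->validation_bits & EX_VALID_CARD)
		ex_append(buf, len, &used, "card", e->card);
	if (e->validation_bits & EX_VALID_MODULE)
		ex_append(buf, len, &used, "module", e->module);
	if (e->validation_bits & EX_VALID_BANK)
		ex_append(buf, len, &used, "bank", e->bank);

	n = strlen(buf);
	if (n > 0 && buf[n - 1] == ' ')
		buf[n - 1] = '\0';	/* drop the trailing separator */
}
```

    With validation_bits set to 0x40b9, the value shown in the debug log
    of this commit message, the bank bit is clear under these assumed
    definitions, so the bank field is skipped.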

    The EDAC error log is now properly reporting the error:

    [ 281.556854] mce: [Hardware Error]: Machine check events logged
    [ 281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 281.557044] {2}[Hardware Error]: APEI generic hardware error status
    [ 281.557046] {2}[Hardware Error]: severity: 2, corrected
    [ 281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
    [ 281.557050] {2}[Hardware Error]: flags: 0x01
    [ 281.557052] {2}[Hardware Error]: primary
    [ 281.557053] {2}[Hardware Error]: section_type: memory error
    [ 281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
    [ 281.557056] {2}[Hardware Error]: node: 3
    [ 281.557057] {2}[Hardware Error]: card: 0
    [ 281.557058] {2}[Hardware Error]: module: 1
    [ 281.557059] {2}[Hardware Error]: device: 0
    [ 281.557061] {2}[Hardware Error]: error_type: 18, unknown
    [ 281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
    [ 281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

    Tested on a 4-CPU E5-4650 Sandy Bridge machine.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Provide a better infrastructure for printk's inside the driver:
    - use edac_dbg() for debug messages;
    - standardize the usage of pr_info();
    - provide warning about the risk of relying on this
    driver.

    While here, change the size of the fake memory to 1 page. This is
    as good or as bad as 1000 pages, but it is easier for userspace to
    detect, as I don't expect that any machine implementing GHES would
    provide just 1 page ;)

    Signed-off-by: Mauro Carvalho Chehab

    Conflicts:
    drivers/edac/ghes_edac.c

    Mauro Carvalho Chehab
     
  • In my tests on a system with four E5-4650 CPUs, the GHES
    EDAC driver is called twice. As the SMBIOS DMI enumeration
    call walks all the DIMM sockets in the system, on
    this machine, equipped with 128 GB of RAM, the memory is
    displayed twice:

              +-----------------------+
              |    mc0    |    mc1    |
    ----------+-----------------------+
    memory45: |   8192 MB |   8192 MB |
    memory44: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory43: |      0 MB |      0 MB |
    memory42: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory41: |      0 MB |      0 MB |
    memory40: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory39: |   8192 MB |   8192 MB |
    memory38: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory37: |      0 MB |      0 MB |
    memory36: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory35: |      0 MB |      0 MB |
    memory34: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory33: |   8192 MB |   8192 MB |
    memory32: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory31: |      0 MB |      0 MB |
    memory30: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory29: |      0 MB |      0 MB |
    memory28: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory27: |   8192 MB |   8192 MB |
    memory26: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory25: |      0 MB |      0 MB |
    memory24: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory23: |      0 MB |      0 MB |
    memory22: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory21: |   8192 MB |   8192 MB |
    memory20: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory19: |      0 MB |      0 MB |
    memory18: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory17: |      0 MB |      0 MB |
    memory16: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory15: |   8192 MB |   8192 MB |
    memory14: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory13: |      0 MB |      0 MB |
    memory12: |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory11: |      0 MB |      0 MB |
    memory10: |      0 MB |      0 MB |
    ----------+-----------------------+
    memory9:  |   8192 MB |   8192 MB |
    memory8:  |      0 MB |      0 MB |
    ----------+-----------------------+
    memory7:  |      0 MB |      0 MB |
    memory6:  |   8192 MB |   8192 MB |
    ----------+-----------------------+
    memory5:  |      0 MB |      0 MB |
    memory4:  |      0 MB |      0 MB |
    ----------+-----------------------+
    memory3:  |   8192 MB |   8192 MB |
    memory2:  |      0 MB |      0 MB |
    ----------+-----------------------+
    memory1:  |      0 MB |      0 MB |
    memory0:  |   8192 MB |   8192 MB |
    ----------+-----------------------+

    Total sum of 256 GB.

    As there's no reliable way to credit DIMMs to the right memory
    controller, just put everything on memory controller 0 (which should
    always exist).

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Instead of just faking a random value for the DIMM data, get
    the information that is available via the DMI table.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Now that the EDAC core is capable of simply forwarding the errors via
    the userspace API, add a report mechanism for the GHES errors.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Register GHES with the EDAC MC core, in order to prevent other
    drivers from also handling errors and mangling the error data.

    The EDAC core will guarantee that just one driver is used,
    so the first one to register (BIOS first) will be the one that
    reports the hardware errors.
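
    The first-to-register-wins rule can be sketched as a single owner
    slot. Everything here is hypothetical, including the function names
    and the -EBUSY return value; it only illustrates the policy, not the
    EDAC core code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Name of the driver that currently owns error reporting, or NULL. */
static const char *ex_owner;

/* Claim ownership; only the first distinct caller succeeds. */
static int ex_claim(const char *name)
{
	assert(name != NULL);

	if (ex_owner && strcmp(ex_owner, name) != 0)
		return -EBUSY;	/* someone else registered first */
	ex_owner = name;
	return 0;
}

/* Release ownership, but only if the caller is the current owner. */
static void ex_release(const char *name)
{
	assert(name != NULL);

	if (ex_owner && strcmp(ex_owner, name) == 0)
		ex_owner = NULL;
}
```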

    For now, the EDAC driver does nothing but register with the
    EDAC core, preventing the hardware-driven mechanism from
    interfering with GHES.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab