12 Nov, 2014

1 commit

  • My static checker complains because the "e->location" has up to 256
    characters but we are copying it into the "pvt->detail_location" which
    only has space for 240 characters. That's not counting the surrounding
    text and the "e->other_detail" string which can be over 80 characters
    long.

    I am not familiar with this code but presumably it normally works.
    Let's add a limit though for safety.

    Signed-off-by: Dan Carpenter
    Acked-by: Mauro Carvalho Chehab
    Link: http://lkml.kernel.org/r/20140801082514.GD28869@mwanda
    Signed-off-by: Borislav Petkov

    Dan Carpenter
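
The limit described in this commit can be illustrated with a small userspace sketch. It is not the driver's code: the function name `format_detail` and the two size macros are hypothetical, chosen only to mirror the 256 and 240 character figures quoted in the message, and `snprintf()` stands in for whatever bounded copy the actual patch uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sizes for illustration: the commit message only states that
 * the source may hold up to 256 characters and the destination 240. */
#define LOCATION_MAX 256
#define DETAIL_MAX   240

/*
 * Format "APEI location: <location> <other>" into dst without ever writing
 * more than size bytes (terminating NUL included).
 * Returns 1 if the text had to be truncated, 0 otherwise.
 */
static int format_detail(char *dst, size_t size,
                         const char *location, const char *other)
{
    int needed;

    assert(dst != NULL && size > 0);

    /* snprintf() reports the length the full string would have needed. */
    needed = snprintf(dst, size, "APEI location: %s %s", location, other);

    return needed < 0 || (size_t)needed >= size;
}
```

Calling `format_detail()` with a 240-byte destination and a 255-character location string returns 1 and leaves a NUL-terminated, 239-character result instead of overrunning the buffer.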
     

07 Feb, 2014

1 commit


24 Oct, 2013

2 commits

  • The latest UEFI spec (2.4 at the time of writing) adds some new
    fields for memory error reporting. Add these new fields to the
    ghes_edac interface.

    Signed-off-by: Chen, Gong
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Tony Luck

    Chen, Gong
     
  • In the latest UEFI spec (2.4 at the time of writing), the memory
    error definition for CPER (UEFI 2.4 Appendix N, Common Platform
    Error Record) adds some new fields. These fields help map a
    memory error to an actual DIMM location.

    Original-author: Tony Luck
    Signed-off-by: Chen, Gong
    Reviewed-by: Borislav Petkov
    Reviewed-by: Mauro Carvalho Chehab
    Acked-by: Naveen N. Rao
    Signed-off-by: Tony Luck

    Chen, Gong
     

26 Feb, 2013

8 commits

  • Since we will remove items off the list using list_del() we need
    to use a safe version of the list_for_each_entry() macro aptly named
    list_for_each_entry_safe().

    Signed-off-by: Wei Yongjun
    Signed-off-by: Mauro Carvalho Chehab

    Wei Yongjun
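
The reason a "safe" iterator is needed can be sketched in plain userspace C. This is an illustration only, assuming a hand-rolled singly linked list: `struct node`, `push()` and `free_all()` are hypothetical names, and the loop imitates the idea behind list_for_each_entry_safe() (remember the next element before the current one goes away) rather than the kernel macro itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list element, for illustration only. */
struct node {
    int id;
    struct node *next;
};

/* Push a new element at the head; returns the new head
 * (or the unchanged head if allocation fails). */
static struct node *push(struct node *head, int id)
{
    struct node *n = malloc(sizeof(*n));

    if (!n)
        return head;
    n->id = id;
    n->next = head;
    return n;
}

/* Remove and free every element; returns how many were freed. */
static int free_all(struct node **head)
{
    struct node *pos, *tmp;
    int freed = 0;

    assert(head != NULL);

    for (pos = *head; pos != NULL; pos = tmp) {
        tmp = pos->next;   /* read the link while pos is still valid */
        free(pos);         /* after this, pos->next must not be touched */
        freed++;
    }
    *head = NULL;
    return freed;
}
```

A plain loop that advances with `pos = pos->next` after `free(pos)` would read freed memory; keeping the successor in `tmp` is the same precaution the safe list macro takes.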
     
  • With the current version of CPER, there's no way to associate an
    error with the memory error. So, the error location in EDAC
    layers is unused.

    As CPER has its own idea about memory architectural layers, just
    output whatever is there inside the driver's detail at the RAS
    tracepoint.

    The EDAC location is left untouched in case a future change makes
    it possible to map the error onto the DIMM labels.

    Now, the error message:

    [ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 72.396627] {1}[Hardware Error]: APEI generic hardware error status
    [ 72.396628] {1}[Hardware Error]: severity: 2, corrected
    [ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
    [ 72.396632] {1}[Hardware Error]: flags: 0x01
    [ 72.396634] {1}[Hardware Error]: primary
    [ 72.396635] {1}[Hardware Error]: section_type: memory error
    [ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
    [ 72.396638] {1}[Hardware Error]: node: 3
    [ 72.396639] {1}[Hardware Error]: card: 0
    [ 72.396640] {1}[Hardware Error]: module: 0
    [ 72.396641] {1}[Hardware Error]: device: 0
    [ 72.396643] {1}[Hardware Error]: error_type: 18, unknown
    [ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

    Is properly represented on the trace event:

    kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)

    Tested on a 4-socket E5-4650 Sandy Bridge machine.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • The UEFI spec defines the memory error types and the bits that
    validate each field of the memory error record, in
    Appendix N, items N.2.5 (Memory Error Section) and
    N.2.11 (Error Status). Make the error description compliant with
    it, showing only the valid fields.

    The EDAC error log is now properly reporting the error:

    [ 281.556854] mce: [Hardware Error]: Machine check events logged
    [ 281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
    [ 281.557044] {2}[Hardware Error]: APEI generic hardware error status
    [ 281.557046] {2}[Hardware Error]: severity: 2, corrected
    [ 281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
    [ 281.557050] {2}[Hardware Error]: flags: 0x01
    [ 281.557052] {2}[Hardware Error]: primary
    [ 281.557053] {2}[Hardware Error]: section_type: memory error
    [ 281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
    [ 281.557056] {2}[Hardware Error]: node: 3
    [ 281.557057] {2}[Hardware Error]: card: 0
    [ 281.557058] {2}[Hardware Error]: module: 1
    [ 281.557059] {2}[Hardware Error]: device: 0
    [ 281.557061] {2}[Hardware Error]: error_type: 18, unknown
    [ 281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
    [ 281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)

    Tested on a 4-CPU E5-4650 Sandy Bridge machine.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
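
A minimal sketch of "showing only the valid fields" follows. It is not the driver's implementation: `struct mem_err`, `append_field()` and `format_location()` are hypothetical, and the mask values are assumptions chosen to be consistent with the validation_bits value 0x000040b9 in the log above, where node, card, module and device are all reported.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed validation-bit masks (illustrative, not taken from the commit). */
#define MEM_VALID_NODE   0x0008
#define MEM_VALID_CARD   0x0010
#define MEM_VALID_MODULE 0x0020
#define MEM_VALID_DEVICE 0x0080

/* Hypothetical, trimmed-down memory error record. */
struct mem_err {
    uint64_t validation_bits;
    uint16_t node, card, module, device;
};

/* Append "name:value" (space separated) at offset used; returns the new
 * offset, clamped so the buffer always stays NUL-terminated. */
static size_t append_field(char *buf, size_t size, size_t used,
                           const char *name, unsigned int value)
{
    int n;

    if (used >= size)
        return used;
    n = snprintf(buf + used, size - used, "%s%s:%u",
                 used ? " " : "", name, value);
    if (n < 0)
        return used;
    used += (size_t)n;
    return used < size ? used : size - 1;
}

/* Emit only the fields whose validation bit is set; returns the length. */
static size_t format_location(char *buf, size_t size, const struct mem_err *e)
{
    size_t used = 0;

    assert(buf != NULL && size > 0 && e != NULL);
    buf[0] = '\0';

    if (e->validation_bits & MEM_VALID_NODE)
        used = append_field(buf, size, used, "node", e->node);
    if (e->validation_bits & MEM_VALID_CARD)
        used = append_field(buf, size, used, "card", e->card);
    if (e->validation_bits & MEM_VALID_MODULE)
        used = append_field(buf, size, used, "module", e->module);
    if (e->validation_bits & MEM_VALID_DEVICE)
        used = append_field(buf, size, used, "device", e->device);
    return used;
}
```

With validation_bits set to 0x40b9 and node 3, card 0, module 1, device 0, the sketch produces "node:3 card:0 module:1 device:0"; clearing a bit drops the corresponding field instead of printing a meaningless zero.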
     
  • Provide a better infrastructure for printk's inside the driver:
    - use edac_dbg() for debug messages;
    - standardize the usage of pr_info();
    - provide a warning about the risk of relying on this driver.

    While here, change the size of the fake memory to 1 page. This is
    as good or as bad as 1000 pages, but it is easier for userspace to
    detect, as I don't expect that any machine implementing GHES would
    provide just 1 page available ;)

    Signed-off-by: Mauro Carvalho Chehab

    Conflicts:
    drivers/edac/ghes_edac.c

    Mauro Carvalho Chehab
     
  • In my tests on a system with 4 E5-4650 CPUs, the GHES
    EDAC driver is called twice. As the SMBIOS DMI enumeration
    call scans all the DIMM sockets in the system, on
    this machine, equipped with 128 GB of RAM, the memory is
    displayed twice:

              +-----------------------+
              |    mc0    |    mc1    |
    ----------+-----------------------+
    memory45: | 8192 MB | 8192 MB |
    memory44: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory43: | 0 MB | 0 MB |
    memory42: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory41: | 0 MB | 0 MB |
    memory40: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory39: | 8192 MB | 8192 MB |
    memory38: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory37: | 0 MB | 0 MB |
    memory36: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory35: | 0 MB | 0 MB |
    memory34: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory33: | 8192 MB | 8192 MB |
    memory32: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory31: | 0 MB | 0 MB |
    memory30: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory29: | 0 MB | 0 MB |
    memory28: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory27: | 8192 MB | 8192 MB |
    memory26: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory25: | 0 MB | 0 MB |
    memory24: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory23: | 0 MB | 0 MB |
    memory22: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory21: | 8192 MB | 8192 MB |
    memory20: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory19: | 0 MB | 0 MB |
    memory18: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory17: | 0 MB | 0 MB |
    memory16: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory15: | 8192 MB | 8192 MB |
    memory14: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory13: | 0 MB | 0 MB |
    memory12: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory11: | 0 MB | 0 MB |
    memory10: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory9: | 8192 MB | 8192 MB |
    memory8: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory7: | 0 MB | 0 MB |
    memory6: | 8192 MB | 8192 MB |
    ----------+-----------------------+
    memory5: | 0 MB | 0 MB |
    memory4: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory3: | 8192 MB | 8192 MB |
    memory2: | 0 MB | 0 MB |
    ----------+-----------------------+
    memory1: | 0 MB | 0 MB |
    memory0: | 8192 MB | 8192 MB |
    ----------+-----------------------+

    Total sum of 256 GB.

    As there's no reliable way to credit DIMMs to the right memory
    controller, just put everything on memory controller 0 (which should
    always exist).

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Instead of just faking a random value for the DIMM data, get
    the information that is available via the DMI table.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Now that the EDAC core is capable of simply forwarding errors via
    the userspace API, add a reporting mechanism for the GHES errors.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Register GHES with the EDAC MC core, to prevent other
    drivers from also handling errors and mangling the error data.

    The EDAC core will ensure that just one driver is used,
    so the first one to register (BIOS first) will be the one that
    reports the hardware errors.

    For now, the EDAC driver does nothing but register with the
    EDAC core, preventing the hardware-driven mechanism from
    interfering with GHES.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab