27 Apr, 2017

1 commit


10 Apr, 2017

10 commits

  • Change them to have the edac_ prefix.

    No functionality change.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Move the remaining functionality to edac_mc.c. Convert "edac_report=" to
    a module parameter.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Remove the old URLs.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Move all the EDAC core functionality behind CONFIG_EDAC and get rid of
    that indirection. Update defconfigs which had it.

    While at it, fix dependencies such that EDAC depends on RAS for the
    tracepoints.

    Signed-off-by: Borislav Petkov
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Chris Metcalf
    Cc: linux-edac@vger.kernel.org

    Borislav Petkov
     
  • ... and this happens only when CONFIG_RAS is enabled.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • ... as part of moving stuff away from edac_stub.c

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • ... and the glue around it. It is not needed anymore.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Use mc_devices list instead to check whether we have EDAC driver
    instances successfully registered with EDAC core.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Apparently, some machines used to report DRAM errors through a PCI SERR
    NMI. This is why we have a call into EDAC in the NMI handler. See

    c0d121720220 ("drivers/edac: add new nmi rescan").

    From looking at the patch above, that's two drivers: e752x_edac.c and
    e7xxx_edac.c. Now, I wanna say those are old machines which are probably
    decommissioned already.

    Tony says that "[t]he newest CPU supported by either of those drivers
    is the Xeon E7520 (a.k.a. "Nehalem") released in Q1'2010. Possibly some
    folks are still using these ... but people that hold onto h/w for 7
    years generally cling to old s/w too ... so I'd guess it unlikely that
    we will get complaints for breaking these in upstream."

    So even if there is a small number still in use, we did load EDAC with
    edac_op_state == EDAC_OPSTATE_POLL by default (we still do, in fact)
    which means a default EDAC setup without any parameters supplied on the
    command line or otherwise would never even log the error in the NMI
    handler because we're polling by default:

    inline int edac_handler_set(void)
    {
            if (edac_op_state == EDAC_OPSTATE_POLL)
                    return 0;

            return atomic_read(&edac_handlers);
    }

    So, long story short, I'd like to get rid of that nastiness called
    edac_stub.c and confine all the EDAC drivers solely to drivers/edac/. If
    we ever have to do stuff like that again, it should be notifiers we're
    using and not some insanity like this one.

    Signed-off-by: Borislav Petkov
    Acked-by: Thomas Gleixner
    Cc: Tony Luck

    Borislav Petkov
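
    The default-polling argument above can be illustrated with a small
    userspace sketch. The enum values, the C11 atomic counter and the
    demo function are stand-ins chosen for illustration (assumptions, not
    the kernel's definitions); only the control flow mirrors the helper
    quoted in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in values; the kernel's real enum lives in the EDAC headers. */
enum edac_opstate { EDAC_OPSTATE_POLL, EDAC_OPSTATE_NMI };

static enum edac_opstate edac_op_state = EDAC_OPSTATE_POLL; /* default */
static atomic_int edac_handlers;   /* number of registered NMI handlers */

/* Same logic as the quoted helper: polling mode short-circuits to 0. */
static int edac_handler_set(void)
{
	if (edac_op_state == EDAC_OPSTATE_POLL)
		return 0;

	return atomic_load(&edac_handlers);
}

/* Hypothetical walk-through of both modes. */
static void demo_nmi_visibility(void)
{
	atomic_store(&edac_handlers, 1);     /* pretend one driver registered */
	assert(edac_handler_set() == 0);     /* POLL: the NMI path sees nothing */

	edac_op_state = EDAC_OPSTATE_NMI;
	assert(edac_handler_set() == 1);     /* NMI: the handler count is visible */
}
```

    With the default state left untouched, the NMI path is told that no
    handler is set, regardless of how many drivers have registered.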
     
  • ... like the rest of the file.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

07 Apr, 2017

2 commits

  • Remove unused code reserved for upcoming CPUs.

    Reported-by: Dan Carpenter
    Signed-off-by: Sergey Temerkhanov
    Cc: David Daney
    Cc: Jan.Glauber@cavium.com
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170406113834.17153-1-s.temerkhanov@gmail.com
    Signed-off-by: Borislav Petkov

    Sergey Temerkhanov
     
  • Shift the node number by 3 bits instead of 8 allowing proper functioning
    with default EDAC_MAX_MCS.

    Signed-off-by: Sergey Temerkhanov
    Cc: David Daney
    Cc: Jan.Glauber@cavium.com
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170406113755.17082-1-s.temerkhanov@gmail.com
    Signed-off-by: Borislav Petkov

    Sergey Temerkhanov
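
    A rough sketch of why the shift width matters. It assumes the MC
    index is built as (node << shift) | controller and that the default
    EDAC_MAX_MCS is 16; both are assumptions made for illustration, and
    the helper names mc_index() and mc_index_valid() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define EDAC_MAX_MCS 16   /* assumed default limit, for illustration only */

/* Hypothetical helper: combine node and per-node controller number. */
static unsigned int mc_index(unsigned int node, unsigned int ctrl,
			     unsigned int shift)
{
	assert(ctrl < (1u << shift));   /* controller must fit below the node bits */
	return (node << shift) | ctrl;
}

/* An index is usable only if it stays below the assumed core limit. */
static bool mc_index_valid(unsigned int idx)
{
	return idx < EDAC_MAX_MCS;
}
```

    Under these assumptions, node 1 with an 8-bit shift yields index 256,
    which is out of range, while a 3-bit shift keeps node 1 within 8..15.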
     

06 Apr, 2017

1 commit

  • The peripherals' RAS functionality only exists on the Arria10 SoCFPGA.
    The Cyclone5 initialization generates EDAC warnings when the peripherals
    aren't found in the device tree. Fix by checking for Arria10 in the init
    functions.

    Signed-off-by: Thor Thayer
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1491415262-5018-1-git-send-email-thor.thayer@linux.intel.com
    Signed-off-by: Borislav Petkov

    Thor Thayer
     

05 Apr, 2017

1 commit

  • Fix a typo that disabled the MCI interrupts using the wrong bitmask.

    Signed-off-by: Jan Glauber
    Cc: David Daney
    Cc: Ralf Baechle
    Cc: Sergey Temerkhanov
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170405102739.6301-1-jglauber@cavium.com
    Signed-off-by: Borislav Petkov

    Jan Glauber
     

27 Mar, 2017

1 commit

  • Add support for Cavium ThunderX EDAC capable on-chip peripherals, namely
    the DRAM controller (LMC), cache coherent processor interconnect (CCPI)
    and level 2 cache blocks (L2C-TAD, L2C-MCI, L2C-CBC)

    Signed-off-by: Sergey Temerkhanov
    Cc: David.Daney@cavium.com
    Cc: Jan.Glauber@cavium.com
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170324222837.60583-1-s.temerkhanov@gmail.com
    Signed-off-by: Borislav Petkov

    Sergey Temerkhanov
     

26 Mar, 2017

1 commit


23 Mar, 2017

2 commits

  • Provide debugfs function stubs when EDAC_DEBUG is not enabled so that we
    don't fail the build:

    drivers/edac/pnd2_edac.c: In function ‘pnd2_init’:
    drivers/edac/pnd2_edac.c:1521:2: error: implicit declaration of function ‘setup_pnd2_debug’ [-Werror=implicit-function-declaration]
    setup_pnd2_debug();
    ^
    drivers/edac/pnd2_edac.c: In function ‘pnd2_exit’:
    drivers/edac/pnd2_edac.c:1529:2: error: implicit declaration of function ‘teardown_pnd2_debug’ [-Werror=implicit-function-declaration]
    teardown_pnd2_debug();
    ^

    Signed-off-by: Borislav Petkov

    Borislav Petkov
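
    The stub pattern described above can be sketched as follows. The two
    function names come from the build errors quoted in the commit
    message; the CONFIG_EDAC_DEBUG spelling, the debug_nodes counter and
    the *_sketch() callers are assumptions for illustration, not the
    driver's actual code.

```c
#include <assert.h>

static int debug_nodes;   /* stand-in for created debugfs entries */

#ifdef CONFIG_EDAC_DEBUG
/* Real implementations: only built when the debug option is on. */
static void setup_pnd2_debug(void)    { debug_nodes = 1; }
static void teardown_pnd2_debug(void) { debug_nodes = 0; }
#else
/* Empty stubs keep callers compiling when the option is off. */
static inline void setup_pnd2_debug(void)    {}
static inline void teardown_pnd2_debug(void) {}
#endif

/* Reports whether the debug entries exist in this build. */
static int pnd2_debug_active(void)
{
	return debug_nodes;
}

/* Callers need no #ifdef of their own. */
static int pnd2_init_sketch(void)
{
	setup_pnd2_debug();
	return 0;
}

static void pnd2_exit_sketch(void)
{
	teardown_pnd2_debug();
}
```

    Keeping the conditional next to the definitions means the init and
    exit paths read the same in both configurations.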
     
  • The debugfs.c functionality relies on DEBUG_FS so select it.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

16 Mar, 2017

1 commit

  • Initial target for this driver is the Intel Apollo Lake platform and
    Denverton micro-server; they use the same internal memory controller IP
    called Pondicherry2.

    Memory controller registers are not in PCI config space like earlier
    Intel memory controllers. For the Apollo Lake platform they are accessed
    via a "side-band" interface; for the Denverton micro-server they are
    accessed via PCI config space and memory-mapped I/O. This driver is for
    Apollo Lake and Denverton, but only Denverton is fully enabled while we
    wait for the sideband driver.

    Apollo Lake driver and initial cut at Denverton driver by Tony Luck.
    Extensive cleanup, refactoring and basic verification by Qiuxu Zhuo.

    Signed-off-by: Tony Luck
    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170308174539.14432-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     

09 Mar, 2017

1 commit

  • The MTR_DRAM_WIDTH macro returns the data width. It is sometimes used
    as if it returned a boolean true if the width is 8. Fix the tests where
    MTR_DRAM_WIDTH is misused.

    Signed-off-by: Jérémy Lefaure
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170309011809.8340-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Borislav Petkov

    Jérémy Lefaure
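
    A minimal sketch of the misuse, assuming a width macro that yields
    either 4 or 8. The bit position used here is hypothetical and may not
    match the drivers; the point is only that both possible values are
    non-zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: bit 6 selects x8 devices, otherwise x4. */
#define MTR_DRAM_WIDTH(mtr) ((((mtr) >> 6) & 0x1) ? 8 : 4)

/* Buggy test, spelled out: the width is treated as a truth value. */
static bool is_x8_buggy(unsigned int mtr)
{
	int width = MTR_DRAM_WIDTH(mtr);

	return width != 0;   /* what `if (MTR_DRAM_WIDTH(mtr))` amounts to */
}

/* Fixed test: compare against the width explicitly. */
static bool is_x8_fixed(unsigned int mtr)
{
	return MTR_DRAM_WIDTH(mtr) == 8;
}
```

    Since 4 and 8 are both non-zero, the buggy form reports every device
    as x8; the explicit comparison distinguishes the two widths.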
     

07 Mar, 2017

1 commit


21 Feb, 2017

1 commit

  • Pull RAS updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Assign notifier chain priorities for all RAS related handlers to
    make the ordering explicit (Borislav Petkov)

    - Improve the AMD MCA banks sysfs output (Yazen Ghannam)

    - Various cleanups and restructuring of the x86 RAS code (Borislav
    Petkov)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
    x86/ras: Get rid of mce_process_work()
    EDAC/mce/amd: Dump TSC value
    EDAC/mce/amd: Unexport amd_decode_mce()
    x86/ras/amd/inj: Change dependency
    x86/ras: Flip the TSC-adding logic
    x86/ras/amd: Make sysfs names of banks more user-friendly
    x86/ras/therm_throt: Do not log a fake MCE for thermal events
    x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

    Linus Torvalds
     

16 Feb, 2017

1 commit

  • Currently, the IPID and Syndrome are printed on the same line as the
    Address. There are cases when we can have a valid Syndrome but not a
    valid Address.

    For example, the MCA_SYND register can be used to hold more detailed
    error info that the hardware folks can use. It's not just DRAM ECC
    syndromes. There are some error types that aren't related to memory that
    may have valid syndromes, like some errors related to links in the Data
    Fabric, etc.

    In these cases, the IPID and Syndrome are not printed at the same log
    level as the rest of the stanza, so users won't see them on the console.

    Console:
    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Dmesg:
    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    , Syndrome: 0x000000010b404000, IPID: 0x0001002e00000002
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Print the IPID first and on a new line. The IPID should always be
    printed on SMCA systems. The Syndrome will then be printed with the IPID
    and at the same log level when valid:

    [Hardware Error]: CPU:16 (17:1:0) MC22_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd82000000002080b
    [Hardware Error]: IPID: 0x0001002e00000002, Syndrome: 0x000000010b404000
    [Hardware Error]: Power, Interrupts, etc. Extended Error Code: 2

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1487192182-2474-1-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

14 Feb, 2017

1 commit


10 Feb, 2017

1 commit

  • Fix the following sparse warnings:

    drivers/edac/fsl_ddr_edac.c:148:1: warning:
    symbol 'dev_attr_inject_data_hi' was not declared. Should it be static?
    drivers/edac/fsl_ddr_edac.c:150:1: warning:
    symbol 'dev_attr_inject_data_lo' was not declared. Should it be static?
    drivers/edac/fsl_ddr_edac.c:152:1: warning:
    symbol 'dev_attr_inject_ctrl' was not declared. Should it be static?

    Signed-off-by: Wei Yongjun
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170209150424.15124-1-weiyj.lk@gmail.com
    Signed-off-by: Borislav Petkov

    Wei Yongjun
     

03 Feb, 2017

1 commit

  • The L2 cache controller on the T2080 SoC has similar capabilities to the
    others already supported by the mpc85xx_edac driver. Add it to the list
    of compatible devices.

    Signed-off-by: Chris Packham
    Acked-by: Johannes Thumshirn
    Acked-by: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: devicetree@vger.kernel.org
    Cc: linux-edac
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20170201231624.28843-1-chris.packham@alliedtelesis.co.nz
    Signed-off-by: Borislav Petkov

    Chris Packham
     

28 Jan, 2017

8 commits

  • Match one of the devices in amd64_cpuids[] before loading the module.
    This is an additional sanity check against users trying to load
    amd64_edac_mod on unsupported systems.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-9-git-send-email-Yazen.Ghannam@amd.com
    [ Get rid of err_ret label, make it a bit more readable this way. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Having ECC disabled on a node doesn't necessarily mean that it's
    disabled for the entire system. So let's return a non-failing code when
    ECC is disabled on a node. This way we can skip initialization for the
    node but still continue with the remaining nodes.

    After probing all instances, make sure we have at least one MC device
    allocated.

    This issue was seen, and the fix tested, on Fam15h and Fam17h MCM systems.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-8-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
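
    The probing flow described above might look roughly like this. The
    function names, the boolean ECC flag and the -ENODEV return value are
    all assumptions for illustration; the commit message only states that
    a node with ECC disabled returns a non-failing code and that at least
    one MC device must exist after all nodes are probed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-node probe: returns 0 in both cases and reports via
 * *allocated whether an MC device was set up for this node. */
static int probe_node(bool ecc_enabled, bool *allocated)
{
	*allocated = false;
	if (!ecc_enabled)
		return 0;          /* skip this node, but do not fail the load */

	*allocated = true;
	return 0;
}

/* Probe every node, then require at least one allocated MC device. */
static int probe_all(const bool *ecc_enabled, int nr_nodes)
{
	int allocated_count = 0;

	for (int i = 0; i < nr_nodes; i++) {
		bool allocated;
		int err = probe_node(ecc_enabled[i], &allocated);

		if (err)
			return err;
		if (allocated)
			allocated_count++;
	}

	return allocated_count ? 0 : -ENODEV;
}
```

    A system with ECC enabled on only some nodes still loads in this
    sketch, while one with ECC disabled everywhere is rejected.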
     
  • We need to know if any MC devices have been allocated.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-7-git-send-email-Yazen.Ghannam@amd.com
    [ Prettify text. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • amd64_{debug,notice} don't have any users, so remove them.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-6-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Print the node number when informing that DRAM ECC is disabled so
    that we can show which nodes have DRAM ECC disabled. Also, print more
    detailed system information as edac_dbg(), so as to not bother general
    users.

    Switch amd64_notice to amd64_info to match the message above it.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-5-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • We have a few functions that register/unregister an ECC error decoding
    routine. These functions are called when we init/remove instances.
    However, they are global and so don't need to be registered/unregistered
    multiple times.

    So move them out of the init/remove instance functions and into the
    module init/exit routines.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-4-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Jump to memory freeing routines when init_one_instance() fails.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-3-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Users may not be familiar with the concept of deferred errors. There is
    no action for users to take on this type of error, so give more context
    in the error message to make this more clear.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-2-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

26 Jan, 2017

1 commit

  • REDMEMB[17] is the ECC_Locator bit, which, when set, identifies the
    CS[3:2] as the symbols in error, and thus the second channel.

    The macro computing it was wrong so get rid of it (it was used at one
    place only) and get rid of the conditional too. Generates better code
    this way anyway.

    Signed-off-by: Borislav Petkov
    Reported-by: David Binderman
    Reviewed-by: Mauro Carvalho Chehab

    Borislav Petkov
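
    One way to test that bit without a conditional, shown as a sketch
    only. The macro name and helper below are hypothetical and the
    driver's actual replacement expression may differ; the bit position
    is the one named in the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define REDMEMB_ECC_LOCATOR_BIT 17   /* hypothetical name for REDMEMB[17] */

/* Returns 1 (second channel) when the ECC_Locator bit is set, else 0. */
static unsigned int channel_offset(uint32_t redmemb)
{
	return !!(redmemb & (UINT32_C(1) << REDMEMB_ECC_LOCATOR_BIT));
}
```

    The result can be added directly to a base channel number, which
    avoids a separate branch.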
     

24 Jan, 2017

3 commits

  • Assign all notifiers on the MCE decode chain a priority so that they get
    called in the correct order.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • Dump the TSC value of the time when the MCE got logged.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     
  • It is not used outside of the driver anymore.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

23 Jan, 2017

1 commit

  • Function sbridge_register_mci() sets pvt->info.show_interleave_mode
    to knl_show_interleave_mode() on Knights Landing and
    show_interleave_mode() anywhere else.

    Merge show_interleave_mode() and knl_show_interleave_mode() in a single
    implementation and use it without an indirect function pointer.

    Signed-off-by: Nicolas Iooss
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170122172806.10412-1-nicolas.iooss_linux@m4x.org
    [ Call it get_intlv_mode_str(). ]
    Signed-off-by: Borislav Petkov

    Nicolas Iooss