09 Mar, 2018

1 commit

  • commit bf8486709ac7fad99e4040dea73fe466c57a4ae1 upstream.

    Commit

    3286d3eb906c ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")

    decreased NUM_CHANNELS from 8 to 4, but this is not enough for Knights
    Landing which supports up to 6 channels.

    This caused out-of-bounds writes to the pvt->mirror_mode and pvt->tolm
    variables, which don't play a critical role on the KNL code path, so the
    memory corruption wasn't causing any visible driver failures.

    The easiest way of fixing it is to change NUM_CHANNELS to 6. Do that.

    An alternative solution would be to restructure the KNL part of the
    driver to 2MC/3channel representation.
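
The overflow described above can be illustrated with a small userspace sketch. The struct layout, field order and helper names below are assumptions made for illustration only; they are not the driver's real definitions.

```c
#include <assert.h>
#include <stddef.h>

#define OLD_NUM_CHANNELS 4	/* value after commit 3286d3eb906c */
#define NEW_NUM_CHANNELS 6	/* value chosen by this fix */
#define KNL_MAX_CHANNELS 6	/* channels Knights Landing can expose */

/* Hypothetical, heavily simplified stand-in for the driver's private data. */
struct pvt_sketch {
	void *chan_dev[OLD_NUM_CHANNELS];	/* per-channel slots */
	int mirror_mode;			/* assumed to follow the array */
	unsigned long long tolm;
};

/* Nonzero if per-channel slot `chan` lies outside an array of `n` slots. */
static int slot_out_of_bounds(size_t chan, size_t n)
{
	assert(n > 0);
	return chan >= n;
}

/* Byte offset at which slot `chan` would be written, in bounds or not. */
static size_t slot_offset(size_t chan)
{
	return offsetof(struct pvt_sketch, chan_dev) + chan * sizeof(void *);
}
```

With four slots, a write to channel 4 or 5 lands at or beyond the offset of the fields that follow the array, which is the kind of silent corruption the commit message describes.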

    Reported-by: Dan Carpenter
    Signed-off-by: Anna Karbownik
    Cc: Mauro Carvalho Chehab
    Cc: Tony Luck
    Cc: jim.m.snow@intel.com
    Cc: krzysztof.paliswiat@intel.com
    Cc: lukasz.odzioba@intel.com
    Cc: qiuxu.zhuo@intel.com
    Cc: linux-edac
    Cc:
    Fixes: 3286d3eb906c ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")
    Link: http://lkml.kernel.org/r/1519312693-4789-1-git-send-email-anna.karbownik@intel.com
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov
    Signed-off-by: Greg Kroah-Hartman

    Anna Karbownik
     

10 Dec, 2017

1 commit

  • [ Upstream commit a8e9b186f153a44690ad0363a56716e7077ad28c ]

    Add missing break statement in order to prevent the code from falling
    through.
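
A minimal sketch of the bug class follows. The commit message does not say which switch statement was affected, so the enum and the values below are purely hypothetical.

```c
#include <assert.h>

enum kind_sketch { KIND_A, KIND_B, KIND_OTHER };	/* hypothetical */

/* Buggy variant: KIND_A runs on into the KIND_B case. */
static int value_buggy(enum kind_sketch k)
{
	int v;

	switch (k) {
	case KIND_A:
		v = 4;
		/* fall through - the break is missing here */
	case KIND_B:
		v = 8;
		break;
	default:
		v = 16;
		break;
	}
	return v;
}

/* Fixed variant: every case ends with a break. */
static int value_fixed(enum kind_sketch k)
{
	int v;

	switch (k) {
	case KIND_A:
		v = 4;
		break;
	case KIND_B:
		v = 8;
		break;
	default:
		v = 16;
		break;
	}
	return v;
}
```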

    Signed-off-by: Gustavo A. R. Silva
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20171016174029.GA19757@embeddedor.com
    Signed-off-by: Borislav Petkov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
     

21 Nov, 2017

1 commit

  • commit 15cc3ae001873845b5d842e212478a6570c7d938 upstream.

    Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
    server (DELL PowerEdge 730xd):

    EDAC sbridge: Some needed devices are missing
    EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
    EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Failed to register device with error -19.

    The refactored sb_edac driver creates the IMC1 (the 2nd memory
    controller) if any IMC1 device is present. In this case only
    HA1_TA of IMC1 was present, but the driver expected to find
    HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.

    The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
    Zhang inserted one DIMM per channel for each CPU and ran a random error
    address injection test with this patch:

    4024 addresses fell in TOLM hole area
    12715 addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
    12774 addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
    12798 addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
    12913 addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
    12674 addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
    12686 addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
    12882 addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
    12934 addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
    106400 addresses were injected totally.

    The test result shows that all the 4 channels belong to IMC0 per CPU, so
    the server really only has one IMC per CPU.

    In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
    implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
    used and should be ignored.

    Thus, do not create a second memory controller if the key HA1 is absent.

    [1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
    [2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf

    Reported-and-tested-by: Yi Zhang
    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov
    Signed-off-by: Greg Kroah-Hartman

    Qiuxu Zhuo
     

02 Aug, 2017

1 commit

  • Basically, there are full memory mirroring and address range partial
    memory mirroring (supported by Haswell EX and Broadwell EX) modes.

    a) In full memory mirroring, the memory behind each memory controller
    is mirrored, i.e. the memory is split into two identical mirrors
    (primary and secondary), half of the memory is reserved for redundancy.

    b) In address range partial memory mirroring, the memory size (range)
    of primary and secondary behind each memory controller can be user
    defined by the TAD0 register. The rest of memory ranges defined by
    TAD1/TAD2/... in that memory controller are non-mirrored.

    For more detail on memory mirroring, see the following link written by Tony Luck:

    https://01.org/lkp/blogs/tonyluck/2016/address-range-partial-memory-mirroring-linux

    Currently the sb_edac driver only supports address decoding in full
    memory mirroring and non-mirroring modes. In address range partial
    memory mirroring mode, it may fail to decode an address that falls in a
    non-mirroring area, as in the following failure log:

    mce: Uncorrected hardware memory error in user-access at 566d53a400
    Memory failure: 0x566d53a: Killing einj_mem_uc:4647 due to hardware memory corruption
    Memory failure: 0x566d53a: recovery action for dirty LRU page: Recovered
    mce: [Hardware Error]: Machine check events logged
    EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
    EDAC sbridge MC1: CPU 48: Machine Check Event: 0 Bank 7: ec00000000010090
    EDAC sbridge MC1: TSC 4b914aa5a99dab
    EDAC sbridge MC1: ADDR 566d53a400
    EDAC sbridge MC1: MISC 1443a0c86
    EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1499712764 SOCKET 2 APIC 80
    EDAC MC1: 0 UE Can't discover the memory rank for ch addr 0x7fb54e900 on any memory ( page:0x0 offset:0x0 grain:32)
    mce: [Hardware Error]: Machine check events logged

    Therefore, classify memory mirroring modes and make the address decoding
    in address range partial memory mode correct.
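
The classification of mirroring modes described in (a) and (b) can be sketched as below. The enum and function names are assumptions for illustration, and the real decoder does considerably more than this predicate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three modes discussed in the text. */
enum mirroring_mode_sketch {
	NON_MIRRORING,
	ADDR_RANGE_MIRRORING,
	FULL_MIRRORING
};

/*
 * Is the address range described by TAD rule `tad_idx` mirrored?
 * Full mirroring covers every range; address range partial mirroring
 * covers only the range defined by TAD0.
 */
static bool tad_range_is_mirrored(enum mirroring_mode_sketch mode,
				  unsigned int tad_idx)
{
	switch (mode) {
	case FULL_MIRRORING:
		return true;
	case ADDR_RANGE_MIRRORING:
		return tad_idx == 0;
	case NON_MIRRORING:
	default:
		return false;
	}
}
```

A decoder that only knows the first and last mode would treat a TAD1 address on a partially mirrored system as mirrored, which is consistent with the failure shown in the log.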

    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170730180651.30060-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

17 Jul, 2017

1 commit

  • It is a write-only variable so get rid of it.

    Signed-off-by: Borislav Petkov
    Acked-by: Robert Richter
    Acked-by: Michal Simek
    Acked-by: Thor Thayer
    Acked-by: Tony Luck
    Cc: Mark Gross
    Cc: Tim Small
    Cc: Ranganathan Desikan
    Cc: "Arvind R."
    Cc: Jason Baron
    Cc: "Sören Brinkmann"
    Cc: Ralf Baechle
    Cc: David Daney
    Cc: Loc Ho
    Cc: linux-edac@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-mips@linux-mips.org

    Borislav Petkov
     

14 Jun, 2017

1 commit

  • Xiaolong Ye reported the following failure on Broadwell D server:

    EDAC sbridge: Some needed devices are missing
    EDAC MC: Removed device 0 for sbridge_edac.c Broadwell SrcID#0_Ha#0: DEV 0000:ff:12.0
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Failed to register device with error -19.

    Broadwell D (only IMC0 per socket) and Broadwell X (IMC0 and IMC1 per
    socket) use the same PCI device IDs for IMC0 per socket, so they
    share pci_dev_descr_broadwell_table (n_imcs_per_sock=2). In this case,
    Broadwell D wrongly creates the nonexistent SOCK EDAC memory controller
    and reports the above error messages, since it has no IMC1 per socket.

    Avoid creating the nonexistent SOCK memory controller.

    Reported-and-tested-by: Xiaolong Ye
    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170608113351.25323-1-qiuxu.zhuo@intel.com
    [ Massage. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

25 May, 2017

8 commits

  • Collapse 'case:' in *_mci_bind_devs() and update driver version from
    1.1.1 to 1.1.2.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000934.87971-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • This is based on previous work by Patrick Geary, see Link.

    Additional cleanups on top:

    - Remove the code to read MCMTR from pci_ha1_ta and CHN_TO_HA macro,
    now that TA0 and TA1 are unified.

    - Remove get_pdev_same_bus(), since in get_dimm_config() the
    variable "pvt->pci_ta" for KNL is also ready, we can simply use
    pci_read_config_dword(pvt->pci_ta, KNL_MCMTR, &pvt->info.mcmtr) to read
    MCMTR.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: https://lkml.kernel.org/r/57884350.1030401@supermicro.com
    Link: http://lkml.kernel.org/r/20170523000910.87925-1-qiuxu.zhuo@intel.com
    [ Make __populate_dimms() return int. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • We don't need this quirk anymore now that the EDAC memory controller
    representation matches the hardware.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000834.87881-1-qiuxu.zhuo@intel.com
    [ Commit message. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • ... to slim down get_dimm_config().

    No functionality change.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • It is called "sb_edac.c" now.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Tony pointed out: "currently the driver pretends there is one big
    8-channel memory controller per socket instead of 2 4-channel
    controllers. This is fine with all memory controller populated with
    symmetrical DIMM configurations, but runs into difficulties on
    asymmetrical setups".

    Restructure the driver to assign an EDAC memory controller to each real
    h/w memory controller to resolve the issue.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000731.87793-1-qiuxu.zhuo@intel.com
    [ Break some lines at convenient points. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • EDAC assigns logical memory controller numbers in the order that we find
    memory controllers, which depends on which PCI bus they are on. Some
    systems end up with MC0 on socket0, others (e.g. Haswell) have MC0 on
    socket3.

    All this is made more confusing for users because we use the string
    "Socket" while generating names for memory controllers, but the number
    that we attach there is the memory controller number. E.g.

    EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell Socket#0: DEV 0000:ff:12.0 (INTERRUPT)

    Change the names to say "SrcID#%d" (where the number we use is read from
    the h/w associated with the memory controller instead of some logical
    number internal to the EDAC driver). New message:

    EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell SrcID#3: DEV 0000:ff:12.0 (INTERRUPT)
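
As a rough sketch of the naming change, the controller name can be built from the CPU name and the source ID read from hardware. The helper name and signature are hypothetical; only the "SrcID#%d" style of the string comes from the commit message.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format a controller name such as "Haswell SrcID#3".
 * Returns the number of characters snprintf() would have written.
 */
static int format_ctl_name(char *buf, size_t len,
			   const char *cpu_name, unsigned int src_id)
{
	assert(buf != NULL && cpu_name != NULL && len > 0);
	return snprintf(buf, len, "%s SrcID#%u", cpu_name, src_id);
}
```

The point of the change is that `src_id` comes from the hardware associated with the memory controller rather than from the order in which EDAC happened to enumerate it.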

    Reported-by: Andrey Korolyov
    Reported-by: Patrick Geary
    Signed-off-by: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000603.87748-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     
  • Each of the PCI device IDs belongs to a CPU socket, or to one of the
    integrated memory controllers. Provide an enum to specify the domain of
    each, and distinguish the resource number in each domain: the number
    of the PCI device IDs per integrated memory controller/socket, and the
    number of integrated memory controllers per socket.
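
One possible shape for such a description is sketched below. The device IDs, the field names other than n_imcs_per_sock, and the helpers are hypothetical and only illustrate how a per-device domain plus per-domain counts fit together.

```c
#include <assert.h>
#include <stddef.h>

/* Domain a PCI device ID belongs to (names are assumptions). */
enum domain_sketch { DOM_IMC, DOM_SOCK };

struct pci_id_descr_sketch {
	unsigned int dev_id;		/* made-up PCI device ID */
	enum domain_sketch dom;
};

struct pci_id_table_sketch {
	const struct pci_id_descr_sketch *descr;
	size_t n_descr;
	unsigned int n_devs_per_imc;	/* IMC-domain devices per controller */
	unsigned int n_sock_devs;	/* socket-domain devices per socket */
	unsigned int n_imcs_per_sock;	/* memory controllers per socket */
};

static const struct pci_id_descr_sketch example_descr[] = {
	{ 0x1001, DOM_IMC },
	{ 0x1002, DOM_IMC },
	{ 0x2001, DOM_SOCK },
};

static const struct pci_id_table_sketch example_table = {
	example_descr, 3, 2, 1, 2
};

/* Count the descriptors that belong to one domain. */
static unsigned int count_in_domain(const struct pci_id_table_sketch *t,
				    enum domain_sketch dom)
{
	unsigned int n = 0;
	size_t i;

	assert(t != NULL);
	for (i = 0; i < t->n_descr; i++)
		if (t->descr[i].dom == dom)
			n++;
	return n;
}

/* Total devices expected on one socket. */
static unsigned int devs_per_socket(const struct pci_id_table_sketch *t)
{
	assert(t != NULL);
	return t->n_imcs_per_sock * t->n_devs_per_imc + t->n_sock_devs;
}
```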

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000533.87704-1-qiuxu.zhuo@intel.com
    [ Realign pci_dev_descr_knl members. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

10 Apr, 2017

1 commit


21 Feb, 2017

1 commit

  • Pull RAS updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Assign notifier chain priorities for all RAS related handlers to
    make the ordering explicit (Borislav Petkov)

    - Improve the AMD MCA banks sysfs output (Yazen Ghannam)

    - Various cleanups and restructuring of the x86 RAS code (Borislav
    Petkov)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
    x86/ras: Get rid of mce_process_work()
    EDAC/mce/amd: Dump TSC value
    EDAC/mce/amd: Unexport amd_decode_mce()
    x86/ras/amd/inj: Change dependency
    x86/ras: Flip the TSC-adding logic
    x86/ras/amd: Make sysfs names of banks more user-friendly
    x86/ras/therm_throt: Do not log a fake MCE for thermal events
    x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

    Linus Torvalds
     

24 Jan, 2017

1 commit

  • Assign all notifiers on the MCE decode chain a priority so that they get
    called in the correct order.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

23 Jan, 2017

1 commit

  • Function sbridge_register_mci() sets pvt->info.show_interleave_mode
    to knl_show_interleave_mode() on Knights Landing and
    show_interleave_mode() anywhere else.

    Merge show_interleave_mode() and knl_show_interleave_mode() in a single
    implementation and use it without an indirect function pointer.
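
The refactoring pattern can be sketched as a single function that branches on the CPU type instead of going through a per-type function pointer. The bit position and the returned strings are placeholders, not the driver's actual values; only the name get_intlv_mode_str() comes from the commit message.

```c
#include <assert.h>
#include <string.h>

enum type_sketch { OTHER_CPU, KNIGHTS_LANDING_CPU };	/* hypothetical subset */

/* One implementation for all CPU types, called directly. */
static const char *get_intlv_mode_str(unsigned int reg, enum type_sketch t)
{
	unsigned int bit = (reg >> 1) & 1;	/* placeholder bit position */

	if (t == KNIGHTS_LANDING_CPU)
		return bit ? "knl-mode-1" : "knl-mode-0";	/* placeholder text */

	return bit ? "mode-1" : "mode-0";			/* placeholder text */
}
```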

    Signed-off-by: Nicolas Iooss
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170122172806.10412-1-nicolas.iooss_linux@m4x.org
    [ Call it get_intlv_mode_str(). ]
    Signed-off-by: Borislav Petkov

    Nicolas Iooss
     

15 Dec, 2016

1 commit


19 Oct, 2016

2 commits

  • Add Knights Mill (KNM) to the list of CPU models supported by sb_edac.

    Signed-off-by: Piotr Luc
    Reviewed-by: Dave Hansen
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20161013153105.2517-6-piotr.luc@intel.com
    Signed-off-by: Borislav Petkov

    Piotr Luc
     
  • We now have symbolic names for a bunch of Intel CPU models via
    asm/intel-family.h. The original conversion missed the EDAC drivers.
    Convert them.

    Signed-off-by: Dave Hansen
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20160929204321.9FAE5F84@viggo.jf.intel.com
    [ Remove comment, macro name is descriptive enough. ]
    Signed-off-by: Borislav Petkov

    Dave Hansen
     

05 Oct, 2016

1 commit

  • Pull EDAC updates from Borislav Petkov:
    "A lot of movement in the EDAC tree this time around, coarse summary
    below:

    - Altera Arria10 enablement of NAND, DMA, USB, QSPI and SD-MMC FIFO
    buffers (Thor Thayer)

    - split the memory controller part out of mpc85xx and share it with a
    new Freescale ARM Layerscape driver (York Sun)

    - amd64_edac fixes (Yazen Ghannam)

    - misc cleanups, refactoring and fixes all over the place"

    * tag 'edac_for_4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (37 commits)
    EDAC, altera: Add IRQ Flags to disable IRQ while handling
    EDAC, altera: Correct EDAC IRQ error message
    EDAC, amd64: Autoload module using x86_cpu_id
    EDAC, sb_edac: Remove NULL pointer check on array pci_tad
    EDAC: Remove NO_IRQ from powerpc-only drivers
    EDAC, fsl_ddr: Fix error return code in fsl_mc_err_probe()
    EDAC, fsl_ddr: Add entry to MAINTAINERS
    EDAC: Move Doug Thompson to CREDITS
    EDAC, I3000: Orphan driver
    EDAC, fsl_ddr: Replace simple_strtoul() with kstrtoul()
    EDAC, layerscape: Add Layerscape EDAC support
    EDAC, fsl_ddr: Fix IRQ dispose warning when module is removed
    EDAC, fsl_ddr: Add support for little endian
    EDAC, fsl_ddr: Add missing DDR DRAM types
    EDAC, fsl_ddr: Rename macros and names
    EDAC, fsl-ddr: Separate FSL DDR driver from MPC85xx
    EDAC, mpc85xx: Replace printk() with pr_* format
    EDAC, mpc85xx: Drop setting/clearing RFXE bit in HID1
    EDAC, altera: Rename MC trigger to common name
    EDAC, altera: Rename device trigger to common name
    ...

    Linus Torvalds
     

13 Sep, 2016

1 commit

  • pvt->pci_tad is a NUM_CHANNELS array of struct pci_dev pointers and
    hence cannot be NULL, so the NULL pointer check on pci_tad is redundant.
    Remove it.
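
The reasoning can be shown with a simplified struct. The struct, the NUM_CHANNELS value and the helper are illustrative assumptions; the point is only that an array embedded in a struct always has an address, while its elements may still be NULL.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CHANNELS 4	/* illustrative value */

/* Hypothetical stand-in: the array is embedded, not a pointer. */
struct pvt_tad_sketch {
	void *pci_tad[NUM_CHANNELS];
};

/* Count the populated slots; only the elements need a NULL check. */
static int count_tad_devices(const struct pvt_tad_sketch *pvt)
{
	int i, n = 0;

	assert(pvt != NULL);
	/* `pvt->pci_tad` itself can never be NULL, so it is not tested. */
	for (i = 0; i < NUM_CHANNELS; i++)
		if (pvt->pci_tad[i])
			n++;
	return n;
}
```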

    Signed-off-by: Colin Ian King
    Acked-by: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20160908083801.14766-1-colin.king@canonical.com
    Signed-off-by: Borislav Petkov

    Colin Ian King
     

08 Aug, 2016

1 commit

  • On the Intel Xeon Phi Knights Landing processor family, the channels of the
    memory controllers have an atypical arrangement: MC0 is mapped to CH3,4,5
    and MC1 is mapped to CH0,1,2. This causes the EDAC driver to report the
    channel name incorrectly.

    We missed this change earlier, so the code already contains a similar
    comment, but the translation function is incorrect.

    Without this patch:
    errors in DIMM_A and DIMM_D were reported in DIMM_D
    errors in DIMM_B and DIMM_E were reported in DIMM_E
    errors in DIMM_C and DIMM_F were reported in DIMM_F

    Correct this.
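
A minimal sketch of the mapping stated above (MC0 to CH3,4,5 and MC1 to CH0,1,2) follows. The function name and signature are assumptions and may differ from the driver's own helper.

```c
#include <assert.h>

/*
 * Translate a per-controller channel (0..2) on memory controller
 * `mc` (0 or 1) into the channel number 0..5.
 */
static unsigned int knl_channel_remap_sketch(unsigned int mc,
					     unsigned int chan)
{
	assert(mc < 2 && chan < 3);
	return mc == 0 ? chan + 3 : chan;
}
```

A translation that added 3 for both controllers would fold channels 0..2 onto 3..5, which matches the symptom listed in the commit message.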

    Hubert Chrzaniuk:
    - rebased to 4.8
    - comments and code cleanup

    Fixes: d0cdf9003140 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
    Reviewed-by: Tony Luck
    Cc: Mauro Carvalho Chehab
    Cc: Hubert Chrzaniuk
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Cc: lukasz.odzioba@intel.com
    Cc: mchehab@kernel.org
    Cc: # v4.5..
    Link: http://lkml.kernel.org/r/1469231089-22837-1-git-send-email-lukasz.odzioba@intel.com
    Signed-off-by: Lukasz Odzioba
    [ Boris: Simplify a bit by removing char mc. ]
    Signed-off-by: Borislav Petkov

    Lukasz Odzioba
     

16 Jul, 2016

1 commit

  • In commit 2c1ea4c700af ("EDAC, sb_edac: Use cpu family/model in driver
    detection") I broke Knights Landing because I failed to notice that it
    called a wrapper macro "sbridge_get_all_devices_knl" instead of
    "sbridge_get_all_devices" like all the other types.

    Now that we include the processor type in the pci_id_table structure we
    can skip the wrappers and just have the sbridge_get_all_devices() check
    the type to decide whether to allow duplicate devices and controllers to
    have registers spread across buses.

    Fixes: 2c1ea4c700af ("EDAC, sb_edac: Use cpu family/model in driver detection")
    Tested-by: Lukasz Odzioba
    Acked-by: Aristeu Rozanski
    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Tony Luck
     

03 Jun, 2016

2 commits

  • In commit

    2c1ea4c700af ("EDAC, sb_edac: Use cpu family/model in driver detection")

    we switched from using PCI ids to determine which platform we are
    running on to using CPU model instead.

    I forgot that Broadwell-DE has its own distinct model number different
    from Broadwell-EP or -EX.

    Fixing this isn't just adding a line to the array of cpuids: the
    existing code assumed a 1:1 mapping between entries in that array and the
    "enum type" values. Add the type to the pci_id_table structure to remove
    this dependency and allow two Broadwell cpu models.
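
The idea of decoupling the CPU model list from the platform type can be sketched as a lookup table in which several models share one type. The struct and function are hypothetical, and the model numbers are given for illustration and should be checked against asm/intel-family.h.

```c
#include <assert.h>
#include <stddef.h>

enum type_sketch { HASWELL_T, BROADWELL_T };	/* subset, names assumed */

struct model_entry_sketch {
	unsigned int model;		/* family 6 model number */
	enum type_sketch type;		/* driver-internal platform type */
};

/* Two Broadwell models map to the same type, so the mapping is not 1:1. */
static const struct model_entry_sketch models[] = {
	{ 0x3f, HASWELL_T },	/* Haswell server */
	{ 0x4f, BROADWELL_T },	/* Broadwell-EP/-EX */
	{ 0x56, BROADWELL_T },	/* Broadwell-DE */
};

/* Returns 0 and stores the type on a match, -1 otherwise. */
static int lookup_type(unsigned int model, enum type_sketch *out)
{
	size_t i;

	assert(out != NULL);
	for (i = 0; i < sizeof(models) / sizeof(models[0]); i++) {
		if (models[i].model == model) {
			*out = models[i].type;
			return 0;
		}
	}
	return -1;
}
```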

    Signed-off-by: Tony Luck
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Fixes: 2c1ea4c700af ("EDAC, sb_edac: Use cpu family/model in driver detection")
    Link: http://lkml.kernel.org/r/b3cffe40dec6dfe0235a5d52a504f0ba86a07ce7.1464902605.git.tony.luck@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     
  • Broadwell made a small change to the rank target register moving the
    target rank ID field up from bits 16:19 to bits 20:23.

    Also found that the offset field grew by one bit in the IVY_BRIDGE to
    HASWELL transition, so fix the RIR_OFFSET() macro too.
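
The field move can be illustrated with a generic bit-field extractor. The helper names are assumptions; the bit ranges 16:19 and 20:23 are the ones given in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of a 32-bit register value. */
static uint32_t get_bitfield32(uint32_t v, unsigned int lo, unsigned int hi)
{
	uint32_t width, mask;

	assert(lo <= hi && hi < 32);
	width = hi - lo + 1;
	mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
	return (v >> lo) & mask;
}

/* Target rank ID: bits 20:23 on Broadwell, bits 16:19 before it. */
static uint32_t rank_target_sketch(uint32_t reg, int is_broadwell)
{
	return is_broadwell ? get_bitfield32(reg, 20, 23)
			    : get_bitfield32(reg, 16, 19);
}
```

Reading the old bit range on Broadwell returns an unrelated value, so the decoded rank would be wrong even though the register read itself succeeds.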

    Signed-off-by: Tony Luck
    Cc: stable@vger.kernel.org # v3.19+
    Cc: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/2943fb819b1f7e396681165db9c12bb3df0e0b16.1464735623.git.tony.luck@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     

17 May, 2016

1 commit

  • Pull EDAC updates from Borislav Petkov:
    "It was pretty busy in EDAC land this time:

    - Altera Arria10 L2 cache and On-Chip RAM ECC handling (Thor Thayer)

    - Remove ad-hoc buffering of MCE records in sb_edac and i7core_edac
    (Tony Luck)

    - Do not register sb_edac with pci_register_driver() (Tony Luck)

    - Add support for Skylake to ie31200_edac (Jason Baron)

    - Do not register amd64_edac with pci_register_driver() (Borislav
    Petkov)

    ... plus the usual round of cleanups and fixes all over the place"

    * tag 'edac_for_4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (25 commits)
    EDAC, amd64_edac: Drop pci_register_driver() use
    EDAC, ie31200_edac: Add Skylake support
    EDAC, sb_edac: Use cpu family/model in driver detection
    EDAC, i7core: Remove double buffering of error records
    EDAC, amd64_edac: Issue driver banner only on success
    ARM: socfpga: Initialize Arria10 OCRAM ECC on startup
    EDAC: Increment correct counter in edac_inc_ue_error()
    EDAC, sb_edac: Remove double buffering of error records
    EDAC: Fix used after kfree() error in edac_unregister_sysfs()
    EDAC, altera: Avoid unused function warnings
    EDAC, altera: Remove useless casts
    ARM: socfpga: Enable Arria10 OCRAM ECC on startup
    EDAC, altera: Add Arria10 OCRAM ECC support
    Documentation: dt: socfpga: Add Altera Arria10 OCRAM binding
    EDAC, altera: Make OCRAM ECC dependency check generic
    EDAC, altera: Add register offset for ECC Enable
    EDAC, altera: Extract error inject operations to a struct fops
    ARM: socfpga: Enable Arria10 L2 cache ECC on startup
    EDAC, altera: Add Arria10 L2 Cache ECC handling
    Documentation, dt, socfpga: Add Altera Arria10 L2 cache binding
    ...

    Linus Torvalds
     

03 May, 2016

1 commit

  • Instead of picking a random PCI ID from the dozen or so we need to
    access, just use x86_match_cpu() to pick based on CPU model number. The
    choosing of PCI devices has been problematic in the past, see

    11249e739929 ("sb_edac: Fix detection on SNB machines")

    which fixed problems introduced by

    d0585cd815fa ("sb_edac: Claim a different PCI device").

    This is especially ugly if future hardware might not even have
    EDAC-relevant registers in PCI config space and we would still be
    required to choose some "random" PCI devices to scan for just so our
    driver loads.

    Is this cleaner/clearer? It deletes much more code than it adds. Only
    tested on Broadwell. The driver loads/unloads and loads again. Still
    decodes errors too.

    Signed-off-by: Tony Luck
    Suggested-by: Borislav Petkov
    Signed-off-by: Borislav Petkov

    Tony Luck
     

29 Apr, 2016

1 commit

  • Both of these drivers can return NOTIFY_BAD, but this terminates
    processing other callbacks that were registered later on the chain.
    Since the driver did nothing to log the error it seems wrong to prevent
    other interested parties from seeing it. E.g. neither of them had even
    bothered to check the type of the error to see if it was a memory error
    before the return NOTIFY_BAD.
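
The effect on later callbacks can be sketched with a mock notifier chain in userspace. The chain walker and callbacks are hypothetical; the constants mirror the kernel's notifier return values, and the "new" callback assumes the fix is to return a non-terminating value such as NOTIFY_DONE.

```c
#include <assert.h>

#define NOTIFY_DONE		0x0000
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

typedef int (*notifier_fn_sketch)(void *data);

/* Walk the chain in order; stop once a callback sets the stop mask. */
static int call_chain_sketch(notifier_fn_sketch *chain, int n, void *data)
{
	int i, called = 0;

	for (i = 0; i < n; i++) {
		int ret = chain[i](data);

		called++;
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return called;
}

/* Old behaviour: bail out with NOTIFY_BAD without handling the event. */
static int edac_cb_old(void *data)
{
	(void)data;
	return NOTIFY_BAD;
}

/* Assumed new behaviour: decline the event but let the chain continue. */
static int edac_cb_new(void *data)
{
	(void)data;
	return NOTIFY_DONE;
}

/* A later listener that counts the events it sees. */
static int other_cb(void *data)
{
	(*(int *)data)++;
	return NOTIFY_DONE;
}
```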

    Signed-off-by: Tony Luck
    Acked-by: Aristeu Rozanski
    Acked-by: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc:
    Link: http://lkml.kernel.org/r/72937355dd92318d2630979666063f8a2853495b.1461864507.git.tony.luck@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     

23 Apr, 2016

1 commit

  • In the bad old days the functions from x86_mce_decoder_chain could be
    called in machine check context. So we used to carefully copy them and
    defer processing until later. But in

    f29a7aff4bd60 ("x86/mce: Avoid potential deadlock due to printk() in MCE context")

    we switched the logging code to save the record in a genpool, and call
    the functions that registered to be notified later from a work queue.

    So drop all the double buffering and do all the work we want to do as
    soon as sbridge_mce_check_error() is called.

    Signed-off-by: Tony Luck
    Cc: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: patrickg@supermicro.com
    Link: http://lkml.kernel.org/r/100025611cd780d9bca72792b2b2146760da53e0.1460756761.git.tony.luck@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     

22 Apr, 2016

2 commits

  • Haswell and Broadwell can be configured to hash the channel
    interleave function using bits [27:12] of the physical address.

    On those processor models we must check to see if hashing is
    enabled (bit21 of the HASWELL_HASYSDEFEATURE2 register) and
    act accordingly.

    Based on a patch by patrickg

    Tested-by: Patrick Geary
    Signed-off-by: Tony Luck
    Acked-by: Mauro Carvalho Chehab
    Cc: Aristeu Rozanski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-edac@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Tony Luck
     
  • In commit:

    eb1af3b71f9d ("Fix computation of channel address")

    I switched the "sck_way" variable from holding the log2 value read
    from the h/w to instead be the actual number. Unfortunately it
    is needed in log2 form when used to shift the address.
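
The difference between the two forms can be shown with a toy shift. This is only a sketch of the bug class with hypothetical helper names; the driver's real channel address computation is more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Number of interleave ways encoded by a log2 register field. */
static unsigned int ways_from_log2(unsigned int field)
{
	assert(field < 32);
	return 1u << field;
}

/* Buggy: shifts by the number of ways instead of its log2. */
static uint64_t shift_buggy(uint64_t addr, unsigned int field)
{
	return addr >> ways_from_log2(field);
}

/* Correct: shifts by the log2 value read from the hardware. */
static uint64_t shift_correct(uint64_t addr, unsigned int field)
{
	assert(field < 64);
	return addr >> field;
}
```

For a 4-way interleave (field value 2), the correct variant shifts by 2 while the buggy one shifts by 4.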

    Tested-by: Patrick Geary
    Signed-off-by: Tony Luck
    Acked-by: Mauro Carvalho Chehab
    Cc: Aristeu Rozanski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-edac@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: eb1af3b71f9d ("Fix computation of channel address")
    Signed-off-by: Ingo Molnar

    Tony Luck
     

15 Mar, 2016

1 commit

  • Pull RAS updates from Ingo Molnar:
    "Various RAS updates:

    - AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
    MCA) error decoding support (Aravind Gopalakrishnan)

    - x86 memcpy_mcsafe() support, to enable smart(er) hardware error
    recovery in NVDIMM drivers, based on an extension of the x86
    exception handling code. (Tony Luck)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    EDAC/sb_edac: Fix computation of channel address
    x86/mm, x86/mce: Add memcpy_mcsafe()
    x86/mce/AMD: Document some functionality
    x86/mce: Clarify comments regarding deferred error
    x86/mce/AMD: Fix logic to obtain block address
    x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
    x86/mce: Move MCx_CONFIG MSR definitions
    x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
    x86/mm: Expand the exception table logic to allow new handling options
    x86/mce/AMD: Set MCAX Enable bit
    x86/mce/AMD: Carve out threshold block preparation
    x86/mce/AMD: Fix LVT offset configuration for thresholding
    x86/mce/AMD: Reduce number of blocks scanned per bank
    x86/mce/AMD: Do not perform shared bank check for future processors
    x86/mce: Fix order of AMD MCE init function call

    Linus Torvalds
     

11 Mar, 2016

1 commit

  • Large memory Haswell-EX systems with multiple DIMMs per channel were
    sometimes reporting the wrong DIMM.

    Found three problems:

    1) Debug printouts for socket and channel interleave were not interpreting
    the register fields correctly. The socket interleave field is a 2^X
    value (0=1, 1=2, 2=4, 3=8). The channel interleave is X+1 (0=1, 1=2,
    2=3, 3=4).

    2) Actual use of the socket interleave value didn't interpret as 2^X

    3) Conversion of address to channel address was complicated, and wrong.
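
The two encodings named in problem 1 can be written down directly. The function names are hypothetical; the mappings (socket interleave 2^X, channel interleave X+1) are taken from the text above.

```c
#include <assert.h>

/* Socket interleave field: 0=1, 1=2, 2=4, 3=8 ways. */
static unsigned int socket_ways(unsigned int field)
{
	assert(field <= 3);
	return 1u << field;
}

/* Channel interleave field: 0=1, 1=2, 2=3, 3=4 ways. */
static unsigned int channel_ways(unsigned int field)
{
	assert(field <= 3);
	return field + 1;
}
```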

    Signed-off-by: Tony Luck
    Acked-by: Aristeu Rozanski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Mauro Carvalho Chehab
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-edac@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Luck, Tony
     

08 Mar, 2016

1 commit

  • Correct a typo introduced by

    d0cdf9003140 ("EDAC, sb_edac: Add Knights Landing (Xeon Phi gen 2) support")

    As a result, under some configurations DIMMs were not correctly
    recognized. The problem affects only the Xeon Phi architecture.

    Signed-off-by: Hubert Chrzaniuk
    Acked-by: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Link: http://lkml.kernel.org/r/1457361045-26221-1-git-send-email-hubert.chrzaniuk@intel.com
    Signed-off-by: Borislav Petkov

    Hubert Chrzaniuk
     

11 Dec, 2015

1 commit

  • Knights Landing does not come with a register that could be used to fetch
    the DIMM width. However, the value is fixed for this architecture, so it
    can be hardcoded.

    Signed-off-by: Hubert Chrzaniuk
    Cc: Doug Thompson
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Link: http://lkml.kernel.org/r/1449840082-18673-1-git-send-email-hubert.chrzaniuk@intel.com
    Signed-off-by: Borislav Petkov

    Hubert Chrzaniuk
     

06 Dec, 2015

3 commits

  • Knights Landing is the next generation architecture for the HPC market.

    KNL introduces the concept of a tile and a CHA (Cache/Home Agent) for
    memory accesses.

    Some things are fixed in KNL:
    - There is a single DIMM slot per channel.
    - There are 2 memory controllers with 3 channels each; however,
      from the EDAC standpoint, they are presented as a single memory
      controller with 6 channels. Representing 2 MCs with 3 channels each
      would require a major redesign of the EDAC core driver.

    Basically, two functionalities are added/extended:
    - During driver initialization the KNL topology is recognized, i.e.
      which channels are populated with what DIMM sizes
      (knl_get_dimm_capacity function).
    - Handling of MCE errors (channel swizzling).

    Reviewed-by: Tony Luck
    Signed-off-by: Jim Snow
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Link: http://lkml.kernel.org/r/1449136134-23706-5-git-send-email-hubert.chrzaniuk@intel.com
    [ Rebase to 4.4-rc3. ]
    Signed-off-by: Hubert Chrzaniuk
    Signed-off-by: Borislav Petkov

    Jim Snow
     
  • Add options to sbridge_get_all_devices() to allow for duplicate device
    IDs and devices that are scattered across multiple PCI buses.

    Signed-off-by: Jim Snow
    Acked-by: Tony Luck
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Link: http://lkml.kernel.org/r/1449136134-23706-4-git-send-email-hubert.chrzaniuk@intel.com
    [ Rebase to 4.4-rc3. ]
    Signed-off-by: Hubert Chrzaniuk
    Signed-off-by: Borislav Petkov

    Jim Snow
     
  • SAD limit, interleave mode and DRAM related functionalities are now
    virtualized, so that overriding them is easier.

    Signed-off-by: Jim Snow
    Acked-by: Tony Luck
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Cc: lukasz.anaczkowski@intel.com
    Link: http://lkml.kernel.org/r/1449136134-23706-3-git-send-email-hubert.chrzaniuk@intel.com
    [ Rebase to 4.4-rc3. ]
    Signed-off-by: Hubert Chrzaniuk
    Signed-off-by: Borislav Petkov

    Jim Snow