13 Oct, 2020

1 commit

  • Pull EDAC updates from Borislav Petkov:

    - Add Amazon's Annapurna Labs memory controller EDAC driver (Talel
    Shenhar)

    - New AMD CPUs support (Yazen Ghannam)

    - The usual misc fixes and cleanups all over the subsystem

    * tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
    EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh
    EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location
    EDAC/aspeed: Use module_platform_driver() to simplify
    EDAC, sb_edac: Simplify switch statement
    EDAC/ti: Fix handling of platform_get_irq() error
    EDAC/aspeed: Fix handling of platform_get_irq() error
    EDAC/i5100: Fix error handling order in i5100_init_one()
    EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara
    EDAC/socfpga: Transfer SoCFPGA EDAC maintainership
    EDAC/thunderx: Make symbol lmc_dfs_ents static
    EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver
    dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding
    EDAC/mce_amd: Add new error descriptions for existing types
    EDAC: Replace HTTP links with HTTPS ones

    Linus Torvalds
     

09 Sep, 2020

1 commit

  • clang static analyzer reports this problem

    sb_edac.c:959:2: warning: Undefined or garbage value
    returned to caller
    return type;
    ^~~~~~~~~~~

    This is a false positive.

    However, by initializing type to DEV_UNKNOWN, case 3 can be removed
    from the switch, saving a comparison and a jump.

    Signed-off-by: Tom Rix
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200907153225.7294-1-trix@redhat.com

    Tom Rix
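
    The change described above can be sketched as follows. This is an
    illustration only, not the driver's code: the enum values, the function
    name decode_width() and the field-to-width mapping are assumptions made
    for the example.

```c
/* Hypothetical stand-ins for the kernel's enum dev_type values. */
enum dev_type { DEV_UNKNOWN, DEV_X4, DEV_X8, DEV_X16 };

/*
 * Decode a 2-bit width field (mapping assumed for illustration).
 * Because type starts out as DEV_UNKNOWN, the value 3 needs no case
 * of its own and every path returns an initialized value.
 */
static enum dev_type decode_width(unsigned int field)
{
	enum dev_type type = DEV_UNKNOWN;

	switch (field) {
	case 2:
		type = DEV_X16;
		break;
	case 1:
		type = DEV_X8;
		break;
	case 0:
		type = DEV_X4;
		break;
	}

	return type;
}
```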
     

18 Aug, 2020

1 commit

  • IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
    the stack as part of machine check delivery is valid or not.

    Various drivers copied a code fragment that uses the RIPV bit to
    determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
    or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
    RIPV is set as "FATAL").

    Reverse the tests so that the error is marked fatal when RIPV is not set.

    Reported-by: Gabriele Paoloni
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com

    Tony Luck
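
    A minimal sketch of the corrected test, assuming simplified stand-ins
    for the kernel types: enum err_type and classify() are hypothetical
    names, while RIPV being bit 0 of IA32_MCG_STATUS follows the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* RIPV is bit 0 of IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0)

/* Simplified stand-ins for HW_EVENT_ERR_{CORRECTED,UNCORRECTED,FATAL}. */
enum err_type { ERR_CORRECTED, ERR_UNCORRECTED, ERR_FATAL };

/*
 * Classify an error. An uncorrected error is fatal only when RIPV is
 * clear, i.e. when execution cannot safely resume at the saved RIP.
 */
static enum err_type classify(uint64_t mcgstatus, bool uncorrected)
{
	bool ripv = (mcgstatus & MCG_STATUS_RIPV) != 0;

	if (!uncorrected)
		return ERR_CORRECTED;

	return ripv ? ERR_UNCORRECTED : ERR_FATAL;
}
```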
     

17 Aug, 2020

1 commit

  • Rationale:
    Reduces the attack surface for MITM attacks on kernel developers who
    open the links, as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
    For each file:
      If not .svg:
        For each line:
          If doesn't contain `\bxmlns\b`:
            For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
              If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
                If both the HTTP and HTTPS versions
                return 200 OK and serve the same content:
                  Replace HTTP with HTTPS.

    [ bp: Merge all EDAC patches into a single one. ]

    Signed-off-by: Alexander A. Klimov
    Signed-off-by: Borislav Petkov
    Acked-by: Tero Kristo # ti_edac
    Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de

    Alexander A. Klimov
     

14 Apr, 2020

2 commits

  • When acpi_extlog was added, we were worried that the same error would
    be reported more than once by different subsystems. But in the ensuing
    years I've seen complaints that people could not find an error log
    (because this mechanism suppressed the log they were looking for).

    Rip it all out. People are smart enough to notice the same address from
    different reporting mechanisms.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com

    Tony Luck
     
  • If the handler took any action to log or deal with the error, set a bit
    in mce->kflags so that the default handler on the end of the machine
    check chain can see what has been done.

    Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
    skip over errors already processed by CEC.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com

    Tony Luck
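
    The flag-based handoff can be pictured roughly like this. The flag
    values, struct mce_sketch and edac_handler() are hypothetical; the
    kernel defines its own MCE_HANDLED_* bits and struct mce. Only the
    idea is taken from the message above: skip what CEC already processed,
    otherwise record that this handler acted, and never stop the chain.

```c
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define HANDLED_CEC  (1u << 0)
#define HANDLED_EDAC (1u << 1)

/* Same value as the kernel's NOTIFY_DONE: let the chain continue. */
#define NOTIFY_DONE 0

/* Reduced stand-in for struct mce. */
struct mce_sketch {
	uint64_t status;
	uint64_t kflags;
};

/*
 * EDAC-style handler: do nothing for errors CEC already dealt with,
 * otherwise (pretend to) decode the error and record that in kflags
 * so the last handler on the chain can see what has been done.
 */
static int edac_handler(struct mce_sketch *m)
{
	if (m->kflags & HANDLED_CEC)
		return NOTIFY_DONE;

	/* ... decoding and logging would happen here ... */
	m->kflags |= HANDLED_EDAC;

	return NOTIFY_DONE;
}
```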
     

25 Mar, 2020

1 commit

  • The new macro set has a consistent namespace and uses C99 initializers
    instead of the crufty C89 ones.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200320131509.673579000@linutronix.de

    Thomas Gleixner
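
    To illustrate the difference between the two initializer styles, here
    is a small sketch. struct cpu_id and both macros are made up for the
    example and are not the kernel's x86_cpu_id macros; the field values
    are arbitrary.

```c
/* Reduced, hypothetical match structure. */
struct cpu_id {
	unsigned short vendor;
	unsigned short family;
	unsigned short model;
	unsigned long driver_data;
};

/* C89 style: positional, so it silently breaks if fields are reordered. */
#define OLD_STYLE_MATCH(m, data) { 0, 6, (m), (unsigned long)(data) }

/* C99 style: every field is named, so field order no longer matters. */
#define NEW_STYLE_MATCH(m, data) {			\
	.vendor      = 0,				\
	.family      = 6,				\
	.model       = (m),				\
	.driver_data = (unsigned long)(data),		\
}
```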
     

09 Nov, 2019

1 commit

  • The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info.
    Clean up this interface to only pass the mci struct and replace this
    macro with a new function edac_get_dimm().

    Also introduce an edac_get_dimm_by_index() function for later use.
    It returns a DIMM pointer from an index alone, which can be useful
    if the DIMM's position within the layers of the memory controller
    or the exact size of the layers is unknown.

    Small style changes made for some hunks after applying the semantic
    patch.

    Semantic patch used:

    @@ expression mci, a, b,c; @@

    -EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c)
    +edac_get_dimm(mci, a, b, c)

    [ bp: Touchups. ]

    Signed-off-by: Robert Richter
    Signed-off-by: Borislav Petkov
    Reviewed-by: Mauro Carvalho Chehab
    Cc: "linux-edac@vger.kernel.org"
    Cc: James Morse
    Cc: Jason Baron
    Cc: Qiuxu Zhuo
    Cc: Tero Kristo
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com

    Robert Richter
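
    The relationship between the two lookups can be sketched as below,
    assuming a row-major layout of DIMMs across the layers. The reduced
    structs and the function names are hypothetical and much simpler than
    struct mem_ctl_info; this is not the EDAC core's implementation.

```c
#include <stddef.h>

#define MAX_LAYERS 3

struct layer { unsigned int size; };
struct dimm { unsigned int idx; };

/* Reduced stand-in for struct mem_ctl_info. */
struct mci_sketch {
	unsigned int n_layers;
	struct layer layers[MAX_LAYERS];
	unsigned int tot_dimms;
	struct dimm *dimms;
};

/* Look up a DIMM by flat index; NULL if the index is out of range. */
static struct dimm *get_dimm_by_index(struct mci_sketch *mci,
				      unsigned int index)
{
	if (index >= mci->tot_dimms)
		return NULL;

	return &mci->dimms[index];
}

/* Fold up to three layer positions into a flat index, then look it up. */
static struct dimm *get_dimm(struct mci_sketch *mci, int a, int b, int c)
{
	unsigned int index;

	if (a < 0 || b < 0 || c < 0)
		return NULL;

	index = (unsigned int)a;
	if (mci->n_layers > 1)
		index = index * mci->layers[1].size + (unsigned int)b;
	if (mci->n_layers > 2)
		index = index * mci->layers[2].size + (unsigned int)c;

	return get_dimm_by_index(mci, index);
}
```

    Callers that know the layer positions pass only mci plus the positions;
    callers that know nothing about the layout can iterate by flat index.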
     

01 Oct, 2019

1 commit

  • There are several unused variables in this driver, probably because
    it was a modified copy of another driver. Get rid of them.

    drivers/edac/sb_edac.c: In function ‘knl_get_dimm_capacity’:
    drivers/edac/sb_edac.c:1343:16: warning: variable ‘sad_size’ set but not used [-Wunused-but-set-variable]
    1343 | u64 sad_base, sad_size, sad_limit = 0;
    | ^~~~~~~~
    drivers/edac/sb_edac.c: In function ‘sbridge_mce_output_error’:
    drivers/edac/sb_edac.c:2955:8: warning: variable ‘type’ set but not used [-Wunused-but-set-variable]
    2955 | char *type, *optype, msg[256];
    | ^~~~
    drivers/edac/sb_edac.c: In function ‘sbridge_unregister_mci’:
    drivers/edac/sb_edac.c:3203:22: warning: variable ‘pvt’ set but not used [-Wunused-but-set-variable]
    3203 | struct sbridge_pvt *pvt;
    | ^~~
    At top level:
    drivers/edac/sb_edac.c:266:18: warning: ‘correrrthrsld’ defined but not used [-Wunused-const-variable=]
    266 | static const u32 correrrthrsld[] = {
    | ^~~~~~~~~~~~~
    drivers/edac/sb_edac.c:257:18: warning: ‘correrrcnt’ defined but not used [-Wunused-const-variable=]
    257 | static const u32 correrrcnt[] = {
    | ^~~~~~~~~~

    Acked-by: Borislav Petkov
    Acked-by: Tony Luck
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

28 Aug, 2019

1 commit

  • Currently big microservers have _XEON_D while small microservers have
    _X. Make it uniformly _D.

    for i in `git grep -l "\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*_\(X\|XEON_D\)"`
    do
        sed -i -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*ATOM.*\)_X/\1_D/g' \
               -e 's/\(\(INTEL_FAM6_\|VULNWL_INTEL\|INTEL_CPU_FAM6\).*\)_XEON_D/\1_D/g' ${i}
    done

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Tony Luck
    Cc: x86@kernel.org
    Cc: Dave Hansen
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Link: https://lkml.kernel.org/r/20190827195122.677152989@infradead.org

    Peter Zijlstra
     

21 Jun, 2019

1 commit

  • The variable tad_base is being set to a value that is never read and is
    being over-written on the next iteration of a for-loop. This assignment
    is therefore redundant and can be removed.

    Signed-off-by: Colin Ian King
    Signed-off-by: Borislav Petkov
    Acked-by: Tony Luck
    Cc: James Morse
    Cc: kernel-janitors@vger.kernel.org
    Cc: linux-edac
    Cc: Mauro Carvalho Chehab
    Cc: Qiuxu Zhuo
    Link: https://lkml.kernel.org/r/20190508224201.27120-1-colin.king@canonical.com

    Colin Ian King
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this file may be distributed under the terms of the gnu general
    public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 9 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070034.395589349@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 Sep, 2018

1 commit

  • The count of errors is picked up from bits 52:38 of the machine check
    bank status register. But this is the count of *corrected* errors. If an
    uncorrected error is being logged, the h/w sets this field to 0. Which
    means that when edac_mc_handle_error() is called, the EDAC core will
    carefully add zero to the appropriate uncorrected error counts.

    Signed-off-by: Tony Luck
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov
    Cc: stable@vger.kernel.org
    Cc: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180928213934.19890-1-tony.luck@intel.com

    Tony Luck
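
    The problem can be made concrete with a short sketch. It assumes that
    an uncorrected error should be counted as one event instead of using
    the zeroed hardware field; the message above does not spell out the
    fix, so treat that as an assumption. get_bitfield() and error_count()
    are hypothetical helper names; bit 61 (UC) follows the Intel SDM layout
    of the bank status register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of v. */
static uint64_t get_bitfield(uint64_t v, unsigned int lo, unsigned int hi)
{
	return (v >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/*
 * Number of errors to report for one logged event. Bits 52:38 hold
 * the corrected error count; for an uncorrected error (bit 61 set)
 * the hardware leaves that field at 0, so count the event as 1.
 */
static uint32_t error_count(uint64_t status)
{
	bool uncorrected = get_bitfield(status, 61, 61) != 0;

	if (uncorrected)
		return 1;

	return (uint32_t)get_bitfield(status, 38, 52);
}
```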
     

23 Sep, 2018

1 commit

  • The {i3200|i7core|sb|skx}_edac drivers show DIMM capacity using the
    wrong unit symbol: 'Mb' - megabit. Fix them by replacing 'Mb' with
    'MiB' - mebibyte.

    [Tony: These are all "edac_dbg()" messages, so this won't break scripts
    that parse console logs.]

    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Acked-by: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180919003433.16475-1-tony.luck@intel.com

    Qiuxu Zhuo
     

15 Sep, 2018

1 commit

  • A static checker gave the following warnings:

    drivers/edac/sb_edac.c:1030 ibridge_get_ha() warn: signedness bug returning '(-22)'
    drivers/edac/sb_edac.c:1037 knl_get_ha() warn: signedness bug returning '(-22)'

    Both because the functions are declared to return a "u8", but try to
    return -EINVAL for the error case.

    Fix by returning 0xff (since the caller doesn't look at, or pass on, the
    return value).

    Reported-by: Dan Carpenter
    Signed-off-by: Tony Luck
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180914201905.GA30946@agluck-desk
    Signed-off-by: Borislav Petkov

    Luck, Tony
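
    Why the checker complained can be shown in a few lines. The function
    names and SKETCH_EINVAL are hypothetical; the explicit cast only makes
    visible the conversion that the original code performed implicitly.

```c
#include <stdint.h>

/* Stand-in for EINVAL (22 on Linux), defined locally for the example. */
#define SKETCH_EINVAL 22

/* Buggy pattern: a negative errno squeezed into an unsigned 8-bit return. */
static uint8_t get_ha_buggy(int valid)
{
	int err = -SKETCH_EINVAL;

	/* -22 wraps to 234, which looks like an ordinary value. */
	return valid ? 0 : (uint8_t)err;
}

/* Fixed pattern: return an explicit out-of-band marker instead. */
static uint8_t get_ha_fixed(int valid)
{
	return valid ? 0 : 0xff;
}
```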
     

11 Sep, 2018

2 commits

  • sb_edac sometimes reports the wrong DIMM for a memory error found by
    the patrol scrubber. That is because the hardware provides only a 4KB
    page-aligned address for the error case.

    This means that the EDAC driver will point at the DIMM matching offset
    0x0 in the 4KB page, but because of interleaving across channels and
    ranks, the actual DIMM involved may be different if the error is on some
    other cache line within the page.

    Therefore, reconstruct the socket/iMC/channel information from the "mce"
    structure passed to the EDAC driver. The DIMM cannot be determined, so
    pass "dimm=-1" to the EDAC core. It will report that all the DIMMs on
    that channel may be affected.

    Signed-off-by: Qiuxu Zhuo
    Cc: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180907230828.13901-3-tony.luck@intel.com
    [ Improve comments on the functions to convert bank number
    to memory controller number. Minor cleanup to commit message. ]
    Signed-off-by: Tony Luck
    [ Massage commit message more. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • Users of the mce_register_decode_chain() are called for every logged
    error. EDAC drivers should check:

    1) Is this a memory error? [bit 7 in status register]
    2) Is there a valid address? [bit 58 in status register]
    3) Is the address a system address? [bitfield 8:6 in misc register]

    The sb_edac driver performed check 1 twice, waited far too long to
    perform check 2, and did not do check 3 at all.

    Fix it by moving the test for valid address from
    sbridge_mce_output_error() into sbridge_mce_check_error() and add a test
    for the type immediately after. Delete the redundant check for the type
    of the error from sbridge_mce_output_error().

    Signed-off-by: Qiuxu Zhuo
    Cc: Aristeu Rozanski
    Cc: Mauro Carvalho Chehab
    Cc: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180907230828.13901-2-tony.luck@intel.com
    [ Re-word commit message. ]
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
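
    The three checks listed above could be sketched like this. The helper
    names are hypothetical, and the address-mode value 2 (physical address)
    is taken from the Intel SDM encoding of the misc register, not from the
    message itself, so treat it as an assumption. The driver's real filter
    may test more bits than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) of v. */
static uint64_t get_bitfield(uint64_t v, unsigned int lo, unsigned int hi)
{
	return (v >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* Assumed encoding: address mode 2 in misc bits 8:6 = physical address. */
#define ADDR_MODE_PHYS 2

/* Decide up front whether an EDAC driver should handle this error. */
static bool edac_should_handle(uint64_t status, uint64_t misc)
{
	/* 1) Is this a memory error? */
	if (!get_bitfield(status, 7, 7))
		return false;

	/* 2) Is there a valid address? */
	if (!get_bitfield(status, 58, 58))
		return false;

	/* 3) Is the address a system address? */
	if (get_bitfield(misc, 6, 8) != ADDR_MODE_PHYS)
		return false;

	return true;
}
```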
     

03 Sep, 2018

1 commit

  • Replace the custom-grown macro with the generic INTEL_CPU_FAM6() one.

    No functional change intended.

    Signed-off-by: Andy Shevchenko
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180831082341.72363-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Borislav Petkov

    Andy Shevchenko
     

25 Jul, 2018

1 commit

  • Extend the driver to check whether segment number and bus number match
    when deciding how to group memory controller PCI devices to CPU sockets.

    Signed-off-by: Masayoshi Mizuma
    Reviewed-by: Tony Luck
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180724190213.26359-1-msys.mizuma@gmail.com
    [ Cleanup commit message. ]
    Signed-off-by: Borislav Petkov

    Masayoshi Mizuma
     

17 Mar, 2018

1 commit

  • In preparation for enabling -Wvla, remove VLA and replace it with a
    fixed-length array instead.

    Also, remove max_interleave as it is no longer needed.

    Reviewed-by: Mauro Carvalho Chehab
    Signed-off-by: Gustavo A. R. Silva
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20180314182131.GA25259@embeddedgus
    Signed-off-by: Borislav Petkov

    Gustavo A. R. Silva
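
    As an illustration of the pattern, here is a sketch that swaps a
    variable-length array for a fixed-length one with a bounds check.
    MAX_INTERLEAVE, its value and count_interleave_ways() are hypothetical
    and do not reproduce the driver's code.

```c
/* Hypothetical compile-time upper bound on interleave targets. */
#define MAX_INTERLEAVE 8

/*
 * Count the distinct interleave targets among n entries.
 * The scratch array used to be sized at run time (a VLA); a fixed
 * bound plus an explicit range check keeps stack usage predictable.
 * Returns -1 if n exceeds the bound.
 */
static int count_interleave_ways(const unsigned int *pkgs, unsigned int n)
{
	unsigned int seen[MAX_INTERLEAVE];	/* was: unsigned int seen[n]; */
	unsigned int ways = 0, i, j;

	if (n > MAX_INTERLEAVE)
		return -1;

	for (i = 0; i < n; i++) {
		for (j = 0; j < ways; j++)
			if (seen[j] == pkgs[i])
				break;
		if (j == ways)
			seen[ways++] = pkgs[i];
	}

	return (int)ways;
}
```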
     

23 Feb, 2018

1 commit

  • Commit

    3286d3eb906c ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")

    decreased NUM_CHANNELS from 8 to 4, but this is not enough for Knights
    Landing which supports up to 6 channels.

    This caused out-of-bounds writes to the pvt->mirror_mode and pvt->tolm
    variables, which don't play a critical role on the KNL code path, so the
    memory corruption wasn't causing any visible driver failures.

    The easiest way of fixing it is to change NUM_CHANNELS to 6. Do that.

    An alternative solution would be to restructure the KNL part of the
    driver to a 2MC/3channel representation.

    Reported-by: Dan Carpenter
    Signed-off-by: Anna Karbownik
    Cc: Mauro Carvalho Chehab
    Cc: Tony Luck
    Cc: jim.m.snow@intel.com
    Cc: krzysztof.paliswiat@intel.com
    Cc: lukasz.odzioba@intel.com
    Cc: qiuxu.zhuo@intel.com
    Cc: linux-edac
    Fixes: 3286d3eb906c ("EDAC, sb_edac: Drop NUM_CHANNELS from 8 back to 4")
    Link: http://lkml.kernel.org/r/1519312693-4789-1-git-send-email-anna.karbownik@intel.com
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov

    Anna Karbownik
     

19 Oct, 2017

1 commit


11 Oct, 2017

1 commit

  • When figuring out the size of the DIMMs in SNC2 or SNC4 cluster mode, the
    current algorithm ignores the contribution of some of the channels, so EDAC
    never learns of the DIMMs attached to those channels (and sysfs is not
    populated for them).

    Instead of selectively iterating from 0 to interlv_ways when looking for all the
    participants in the interleave, do an exhaustive search and iterate from 0 to
    KNL_MAX_CHANNELS. The algorithm is already smart enough to consider participants
    only one time.

    This works fine in all KNL cluster modes and even when there are missing DIMMs
    as the contribution of those channels is 0.

    Signed-off-by: Luis Felipe Sandoval Castro
    Acked-by: Tony Luck
    Cc: Mauro Carvalho Chehab
    Cc: arozansk@redhat.com
    Cc: linux-edac
    Cc: qiuxu.zhuo@intel.com
    Link: http://lkml.kernel.org/r/1506606882-90521-1-git-send-email-luis.felipe.sandoval.castro@intel.com
    Signed-off-by: Borislav Petkov

    Luis Felipe Sandoval Castro
     

27 Sep, 2017

1 commit

  • Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
    server (DELL PowerEdge 730xd):

    EDAC sbridge: Some needed devices are missing
    EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
    EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Failed to register device with error -19.

    The refactored sb_edac driver creates the IMC1 (the 2nd memory
    controller) if any IMC1 device is present. In this case only
    HA1_TA of IMC1 was present, but the driver expected to find
    HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.

    The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
    Zhang inserted one DIMM per channel for each CPU, and ran a random error
    address injection test with this patch applied:

    4024 addresses fell in TOLM hole area
    12715 addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
    12774 addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
    12798 addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
    12913 addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
    12674 addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
    12686 addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
    12882 addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
    12934 addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
    106400 addresses were injected in total.

    The test result shows that all the 4 channels belong to IMC0 per CPU, so
    the server really only has one IMC per CPU.

    In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
    implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
    used and should be ignored.

    Thus, do not create a second memory controller if the key HA1 is absent.

    [1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
    [2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf

    Reported-and-tested-by: Yi Zhang
    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Fixes: e2f747b1f42a ("EDAC, sb_edac: Assign EDAC memory controller per h/w controller")
    Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
    [ Massage commit message. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

25 Sep, 2017

1 commit

  • Change x86 EDAC platform drivers to verify the module owner at the
    beginning of their module init functions. This allows them to fail their
    init immediately when ghes_edac is enabled. A similar change can be made
    to other EDAC drivers if necessary.

    Also, remove ".c" from module names of pnd2_edac, sb_edac, and skx_edac.

    Signed-off-by: Toshi Kani
    Suggested-by: Borislav Petkov
    Cc: Mauro Carvalho Chehab
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170823225447.15608-6-toshi.kani@hpe.com
    Signed-off-by: Borislav Petkov

    Toshi Kani
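
    A rough sketch of such an early owner check follows. get_owner() and
    current_owner are hypothetical stand-ins for the EDAC core's ownership
    query, and sketch_init() is not the driver's init function; returning
    -EBUSY is likewise an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define EDAC_MOD_STR "sb_edac"

/* Hypothetical stand-in for the EDAC core's record of the owner. */
static const char *current_owner;

static const char *get_owner(void)
{
	return current_owner;
}

/*
 * Module init sketch: bail out before touching any hardware if a
 * different EDAC driver (for example ghes_edac) already owns EDAC.
 */
static int sketch_init(void)
{
	const char *owner = get_owner();

	if (owner && strncmp(owner, EDAC_MOD_STR, sizeof(EDAC_MOD_STR)))
		return -EBUSY;

	/* ... the normal probing would follow here ... */
	return 0;
}
```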
     

21 Sep, 2017

1 commit


02 Aug, 2017

1 commit

  • There are two memory mirroring modes: full memory mirroring, and address
    range partial memory mirroring (supported by Haswell EX and Broadwell EX).

    a) In full memory mirroring, the memory behind each memory controller
    is mirrored, i.e. the memory is split into two identical mirrors
    (primary and secondary), half of the memory is reserved for redundancy.

    b) In address range partial memory mirroring, the memory size (range)
    of primary and secondary behind each memory controller can be user
    defined by the TAD0 register. The rest of memory ranges defined by
    TAD1/TAD2/... in that memory controller are non-mirrored.

    For more detail on memory mirroring, see the following link written by Tony Luck:

    https://01.org/lkp/blogs/tonyluck/2016/address-range-partial-memory-mirroring-linux

    Currently the sb_edac driver only supports address decoding in full
    memory mirroring and non-mirroring modes. In address range partial
    memory mirroring mode, it may fail to decode an address that falls in a
    non-mirroring area (the following log shows one such failure).

    mce: Uncorrected hardware memory error in user-access at 566d53a400
    Memory failure: 0x566d53a: Killing einj_mem_uc:4647 due to hardware memory corruption
    Memory failure: 0x566d53a: recovery action for dirty LRU page: Recovered
    mce: [Hardware Error]: Machine check events logged
    EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
    EDAC sbridge MC1: CPU 48: Machine Check Event: 0 Bank 7: ec00000000010090
    EDAC sbridge MC1: TSC 4b914aa5a99dab
    EDAC sbridge MC1: ADDR 566d53a400
    EDAC sbridge MC1: MISC 1443a0c86
    EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1499712764 SOCKET 2 APIC 80
    EDAC MC1: 0 UE Can't discover the memory rank for ch addr 0x7fb54e900 on any memory ( page:0x0 offset:0x0 grain:32)
    mce: [Hardware Error]: Machine check events logged

    Therefore, classify memory mirroring modes and make the address decoding
    in address range partial memory mode correct.

    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170730180651.30060-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

17 Jul, 2017

1 commit

  • It is a write-only variable so get rid of it.

    Signed-off-by: Borislav Petkov
    Acked-by: Robert Richter
    Acked-by: Michal Simek
    Acked-by: Thor Thayer
    Acked-by: Tony Luck
    Cc: Mark Gross
    Cc: Tim Small
    Cc: Ranganathan Desikan
    Cc: "Arvind R."
    Cc: Jason Baron
    Cc: "Sören Brinkmann"
    Cc: Ralf Baechle
    Cc: David Daney
    Cc: Loc Ho
    Cc: linux-edac@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-mips@linux-mips.org

    Borislav Petkov
     

14 Jun, 2017

1 commit

  • Xiaolong Ye reported the following failure on a Broadwell D server:

    EDAC sbridge: Some needed devices are missing
    EDAC MC: Removed device 0 for sbridge_edac.c Broadwell SrcID#0_Ha#0: DEV 0000:ff:12.0
    EDAC sbridge: Couldn't find mci handler
    EDAC sbridge: Failed to register device with error -19.

    Broadwell D (only IMC0 per socket) and Broadwell X (IMC0 and IMC1 per
    socket) use the same PCI device IDs for IMC0 per socket, so they
    share pci_dev_descr_broadwell_table (n_imcs_per_sock=2). In this case,
    Broadwell D wrongly creates the nonexistent SOCK EDAC memory controller
    and reports the above error messages, since it has no IMC1 per socket.

    Avoid creating the nonexistent SOCK memory controller.

    Reported-and-tested-by: Xiaolong Ye
    Signed-off-by: Qiuxu Zhuo
    Cc: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170608113351.25323-1-qiuxu.zhuo@intel.com
    [ Massage. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     

25 May, 2017

8 commits

  • Collapse 'case:' in *_mci_bind_devs() and update driver version from
    1.1.1 to 1.1.2.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000934.87971-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • This is based on previous work by Patrick Geary, see Link.

    Additional cleanups on top:

    - Remove the code to read MCMTR from pci_ha1_ta and CHN_TO_HA macro,
    now that TA0 and TA1 are unified.

    - Remove get_pdev_same_bus(), since in get_dimm_config() the
    variable "pvt->pci_ta" for KNL is also ready, we can simply use
    pci_read_config_dword(pvt->pci_ta, KNL_MCMTR, &pvt->info.mcmtr) to read
    MCMTR.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: https://lkml.kernel.org/r/57884350.1030401@supermicro.com
    Link: http://lkml.kernel.org/r/20170523000910.87925-1-qiuxu.zhuo@intel.com
    [ Make __populate_dimms() return int. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • We don't need this quirk anymore now that the EDAC memory controller
    representation matches the hardware.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000834.87881-1-qiuxu.zhuo@intel.com
    [ Commit message. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • ... to slim down get_dimm_config().

    No functionality change.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • It is called "sb_edac.c" now.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Tony pointed out: "currently the driver pretends there is one big
    8-channel memory controller per socket instead of 2 4-channel
    controllers. This is fine with all memory controller populated with
    symmetrical DIMM configurations, but runs into difficulties on
    asymmetrical setups".

    Restructure the driver to assign an EDAC memory controller to each real
    h/w memory controller to resolve the issue.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000731.87793-1-qiuxu.zhuo@intel.com
    [ Break some lines at convenient points. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
     
  • EDAC assigns logical memory controller numbers in the order that we find
    memory controllers, which depends on which PCI bus they are on. Some
    systems end up with MC0 on socket0, others (e.g. Haswell) have MC0 on
    socket3.

    All this is made more confusing for users because we use the string
    "Socket" while generating names for memory controllers, but the number
    that we attach there is the memory controller number. E.g.

    EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell Socket#0: DEV 0000:ff:12.0 (INTERRUPT)

    Change the names to say "SrcID#%d" (where the number we use is read from
    the h/w associated with the memory controller instead of some logical
    number internal to the EDAC driver). New message:

    EDAC MC0: Giving out device to module sbridge_edac.c controller
    Haswell SrcID#3: DEV 0000:ff:12.0 (INTERRUPT)

    Reported-by: Andrey Korolyov
    Reported-by: Patrick Geary
    Signed-off-by: Tony Luck
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000603.87748-1-qiuxu.zhuo@intel.com
    Signed-off-by: Borislav Petkov

    Tony Luck
     
  • Each of the PCI device IDs belongs to a CPU socket, or to one of the
    integrated memory controllers. Provide an enum to specify the domain of
    each, and distinguish the resource number in each domain: the number
    of the PCI device IDs per integrated memory controller/socket, and the
    number of integrated memory controllers per socket.

    Signed-off-by: Qiuxu Zhuo
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170523000533.87704-1-qiuxu.zhuo@intel.com
    [ Realign pci_dev_descr_knl members. ]
    Signed-off-by: Borislav Petkov

    Qiuxu Zhuo
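
    One way to picture the per-device domain tagging is sketched below.
    The struct layouts are reduced, the device IDs are invented, and
    devs_in_domain() is a hypothetical helper; only the idea of tagging
    each device ID with a domain and recording the number of IMCs per
    socket comes from the message above.

```c
#include <stddef.h>

/* Domain a PCI device belongs to: one of the IMCs, or the socket. */
enum domain { IMC0 = 0, IMC1, SOCK };

struct pci_id_descr {
	int dev_id;		/* PCI device ID (values below are invented) */
	enum domain dom;
};

struct pci_id_table {
	const struct pci_id_descr *descr;
	size_t n_devs;
	int n_imcs_per_sock;
};

/* Count the table entries that belong to the given domain. */
static size_t devs_in_domain(const struct pci_id_table *t, enum domain dom)
{
	size_t i, n = 0;

	for (i = 0; i < t->n_devs; i++)
		if (t->descr[i].dom == dom)
			n++;

	return n;
}

/* Example table: two devices per IMC and one per socket. */
static const struct pci_id_descr example_descr[] = {
	{ 0x1001, IMC0 }, { 0x1002, IMC0 },
	{ 0x1003, IMC1 }, { 0x1004, IMC1 },
	{ 0x1005, SOCK },
};

static const struct pci_id_table example_table = {
	.descr = example_descr,
	.n_devs = sizeof(example_descr) / sizeof(example_descr[0]),
	.n_imcs_per_sock = 2,
};
```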
     

10 Apr, 2017

1 commit


21 Feb, 2017

1 commit

  • Pull RAS updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Assign notifier chain priorities for all RAS related handlers to
    make the ordering explicit (Borislav Petkov)

    - Improve the AMD MCA banks sysfs output (Yazen Ghannam)

    - Various cleanups and restructuring of the x86 RAS code (Borislav
    Petkov)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
    x86/ras: Get rid of mce_process_work()
    EDAC/mce/amd: Dump TSC value
    EDAC/mce/amd: Unexport amd_decode_mce()
    x86/ras/amd/inj: Change dependency
    x86/ras: Flip the TSC-adding logic
    x86/ras/amd: Make sysfs names of banks more user-friendly
    x86/ras/therm_throt: Do not log a fake MCE for thermal events
    x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

    Linus Torvalds
     

24 Jan, 2017

1 commit

  • Assign all notifiers on the MCE decode chain a priority so that they get
    called in the correct order.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Tony Luck
    Cc: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
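
    Priority-ordered registration on a chain can be sketched as follows.
    struct notifier and chain_register() are simplified, hypothetical
    stand-ins for the kernel's notifier infrastructure, and the priority
    numbers in the usage are arbitrary.

```c
#include <stddef.h>

/* Simplified notifier: a higher priority is called earlier. */
struct notifier {
	int priority;
	const char *name;
	struct notifier *next;
};

/*
 * Insert n so that the list stays sorted by descending priority.
 * Entries with equal priority keep their registration order.
 */
static void chain_register(struct notifier **head, struct notifier *n)
{
	while (*head && n->priority <= (*head)->priority)
		head = &(*head)->next;

	n->next = *head;
	*head = n;
}
```

    With explicit priorities the call order no longer depends on the order
    in which the handlers happened to register.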