18 Aug, 2020

1 commit

  • IA32_MCG_STATUS.RIPV indicates whether the return RIP value pushed onto
    the stack as part of machine check delivery is valid or not.

    Various drivers copied a code fragment that uses the RIPV bit to
    determine the severity of the error as either HW_EVENT_ERR_UNCORRECTED
    or HW_EVENT_ERR_FATAL, but this check is reversed (marking errors where
    RIPV is set as "FATAL").

    Reverse the tests so that the error is marked fatal when RIPV is not set.

    Reported-by: Gabriele Paoloni
    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Cc:
    Link: https://lkml.kernel.org/r/20200707194324.14884-1-tony.luck@intel.com

    Tony Luck
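
The corrected test can be modeled in a few lines. This is only an illustrative sketch in Python, not the driver code: the function name severity_from_ripv() and the string stand-ins for the HW_EVENT_ERR_* values are assumptions made for the example. RIPV is bit 0 of IA32_MCG_STATUS.

```python
# Illustrative model only; not the kernel code.
MCG_STATUS_RIPV = 1 << 0  # bit 0 of IA32_MCG_STATUS: return RIP is valid

# Stand-ins for the kernel's enum values (assumed representation).
HW_EVENT_ERR_UNCORRECTED = "uncorrected"
HW_EVENT_ERR_FATAL = "fatal"


def severity_from_ripv(mcgstatus: int) -> str:
    """Pick the severity from the RIPV bit, using the fixed (non-reversed) test."""
    if mcgstatus & MCG_STATUS_RIPV:
        # The return RIP pushed on the stack is valid.
        return HW_EVENT_ERR_UNCORRECTED
    # RIPV is not set: mark the error fatal.
    return HW_EVENT_ERR_FATAL
```

The reversed check described above would have returned the two values the other way around.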
     

15 Aug, 2020

1 commit


11 Aug, 2020

1 commit

  • The Intel uncore driver may claim some of the pci ids from ie31200 which
    means that the ie31200 edac driver will not initialize them as part of
    pci_register_driver().

    Add a fallback for this case that uses pci_get_device() to get a
    reference on the device so that it can still be configured. This is
    similar in approach to other edac drivers.

    Signed-off-by: Jason Baron
    Cc: Borislav Petkov
    Cc: Mauro Carvalho Chehab
    Cc: linux-edac
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/1594923911-10885-1-git-send-email-jbaron@akamai.com

    Jason Baron
     

04 Aug, 2020

2 commits

  • Pull EDAC updates from Tony Luck:
    "Boris is on vacation and asked me to send you the EDAC changes"

    * tag 'edac_updates_for_5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
    EDAC: Fix reference count leaks
    EDAC: Remove edac_get_dimm_by_index()
    EDAC/ghes: Scan the system once on driver init
    EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvt
    EDAC/ghes: Setup DIMM label from DMI and use it in error reports
    EDAC, {skx,i10nm}: Use CPU stepping macro to pass configurations
    EDAC/mc: Call edac_inc_ue_error() before panic
    EDAC, pnd2: Set MCE_PRIO_EDAC priority for pnd2_mce_dec notifier

    Linus Torvalds
     
  • Pull x86 RAS updates from Ingo Molnar:
    "Boris is on vacation and he asked us to send you the pending RAS bits:

    - Print the PPIN field on CPUs that fill them out

    - Fix an MCE injection bug

    - Simplify a kzalloc in dev_mcelog_init_device()"

    * tag 'ras-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce, EDAC/mce_amd: Print PPIN in machine check records
    x86/mce/dev-mcelog: Use struct_size() helper in kzalloc()
    x86/mce/inject: Fix a wrong assignment of i_mce.status

    Linus Torvalds
     

23 Jun, 2020

1 commit


22 Jun, 2020

1 commit


19 Jun, 2020

1 commit

  • Commit:

    da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")

    added support for F15h, model 0x60 CPUs but in doing so, failed to read
    back the SCRCTRL PCI config register on F15h CPUs which are *not* model
    0x60. Add that read so that doing

    $ cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate

    can show the previously set DRAM scrub rate.

    Fixes: da92110dfdfa ("EDAC, amd64_edac: Extend scrub rate support to F15hM60h")
    Reported-by: Anders Andersson
    Signed-off-by: Borislav Petkov
    Cc: #v4.4..
    Link: https://lkml.kernel.org/r/CAKkunMbNWppx_i6xSdDHLseA2QQmGJqj_crY=NF-GZML5np4Vw@mail.gmail.com

    Borislav Petkov
     

17 Jun, 2020

2 commits

  • When kobject_init_and_add() returns an error, it should be handled
    because kobject_init_and_add() takes a reference even when it fails. If
    this function returns an error, kobject_put() must be called to properly
    clean up the memory associated with the object.

    Therefore, replace the kfree() call with kobject_put() and add a
    missing kobject_put() in the edac_device_register_sysfs_main_kobj()
    error path.

    [ bp: Massage and merge into a single patch. ]

    Fixes: b2ed215a3338 ("Kobject: change drivers/edac to use kobject_init_and_add")
    Signed-off-by: Qiushi Wu
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200528202238.18078-1-wu000273@umn.edu
    Link: https://lkml.kernel.org/r/20200528203526.20908-1-wu000273@umn.edu

    Qiushi Wu
     
  • Scan the hardware and determine how many DIMMs a machine has only
    once, on driver init. After that scan completes, struct ghes_hw_desc
    contains a representation of the hardware which the driver can then
    use for later initialization.

    Then, copy the DIMM information into the respective EDAC core
    representation of those.

    Get rid of ghes_edac_dimm_fill and use a struct dimm_info array
    directly.

    This way, hw detection and further driver initialization are nicely
    and logically split. Further additions should all be added to
    ghes_scan_system() and the hw representation extended as needed.

    There should be no functionality change resulting from this patch.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

16 Jun, 2020

5 commits

  • The struct members list and ghes of struct ghes_edac_pvt are unused;
    remove them. While at it, rename the struct to the shorter
    ghes_pvt.

    Signed-off-by: Robert Richter
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com

    Robert Richter
     
  • The ghes driver reports errors with 'unknown label' even if the actual
    DIMM label is known, e.g.:

    EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
    module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
    page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
    node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
    location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
    DRAM memory)

    Fix this by using struct dimm_info's label string in error reports:

    EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
    rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
    page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
    node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
    location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
    DRAM memory)

    The labels are initialized by reading the bank and device strings
    from DMI. Now, the label information can also be read from sysfs. E.g. a
    ThunderX2 system will show the following:

    /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
    /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
    /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
    /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
    /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
    /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
    /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
    /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
    /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
    /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
    /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
    /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
    /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
    /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
    /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
    /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0

    Since dimm_labels can be rewritten, that label will be used in a later
    error report:

    # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
    # # some error injection here
    # dmesg | grep foobar
    [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
    module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
    page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
    node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
    location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
    memory)

    [ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]

    Signed-off-by: Robert Richter
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com

    Robert Richter
     
  • Use the X86_MATCH_INTEL_FAM6_MODEL_STEPPINGS() macro to pass CPU
    stepping specific configurations to {skx,i10nm}_init(), so that the
    CPU stepping check can be deleted from i10nm_init().

    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200509010822.76331-1-qiuxu.zhuo@intel.com

    Qiuxu Zhuo
     
  • By calling edac_inc_ue_error() before panic, we get a correct UE error
    count for core dump analysis.

    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200610065846.3626-2-zhenzhong.duan@gmail.com

    Zhenzhong Duan
     
  • Set MCE_PRIO_EDAC priority for the pnd2_mce_dec notifier to avoid
    giving it MCE_PRIO_LOWEST priority by default.

    Signed-off-by: Zhenzhong Duan
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200610065846.3626-1-zhenzhong.duan@gmail.com

    Zhenzhong Duan
     

14 Jun, 2020

2 commits

  • Pull more Kbuild updates from Masahiro Yamada:

    - fix build rules in binderfs sample

    - fix build errors when Kbuild recurses to the top Makefile

    - convert '---help---' in Kconfig to 'help'

    * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    treewide: replace '---help---' in Kconfig files with 'help'
    kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables
    samples: binderfs: really compile this sample and fix build issues

    Linus Torvalds
     
  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
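
The effect of that substitution on the indentation variants listed above can be checked with an equivalent regular expression. This is a sketch in Python rather than sed; convert_help() and the sample text are made up for the example, and the pattern uses [ \t]* in place of [[:space:]]* because the whole text is processed at once instead of line by line.

```python
import re

# Any leading run of spaces/tabs followed by '---help---' at the start of a line.
HELP_RE = re.compile(r"^[ \t]*---help---", flags=re.MULTILINE)


def convert_help(text: str) -> str:
    """Replace every indented '---help---' marker with one tab + 'help'."""
    return HELP_RE.sub("\thelp", text)


# Hypothetical Kconfig fragment mixing three of the indentation styles.
sample = (
    "config FOO\n"
    "    ---help---\n"    # 4 spaces
    " \t---help---\n"     # 1 space + 1 tab
    "\t  ---help---\n"    # 1 tab + 2 spaces
)
```

Lines that already read as one tab followed by 'help' do not match the pattern and pass through unchanged.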
     

11 Jun, 2020

1 commit


01 Jun, 2020

1 commit


29 May, 2020

1 commit


23 May, 2020

1 commit


20 May, 2020

1 commit

  • The skx_edac driver wrongly uses the mtr register to retrieve two fields
    close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
    to get the two fields.

    Cc:
    Signed-off-by: Qiuxu Zhuo
    Reported-by: Matthew Riley
    Acked-by: Aristeu Rozanski
    Signed-off-by: Tony Luck
    Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com

    Qiuxu Zhuo
     

28 Apr, 2020

2 commits

  • The i10nm_edac driver failed to load on Ice Lake and Tremont/Jacobsville
    servers if their CPU stepping >= 4 and failed on Ice Lake-D servers from
    stepping 0. The root cause was that for Ice Lake and Tremont/Jacobsville
    servers with CPU stepping >=4, the offset for bus number configuration
    register was updated from 0xcc to 0xd0. For Ice Lake-D servers, all the
    steppings use the updated 0xd0 offset.

    Fix the issue by using the appropriate offset for bus number
    configuration register according to the CPU model number and stepping.

    Reported-by: Jerry Chen
    Reported-and-tested-by: Jin Wen
    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Reviewed-by: Borislav Petkov
    Link: https://lore.kernel.org/linux-edac/20200427084022.GC11036@zn.tnic

    Qiuxu Zhuo
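
The selection rule described above can be written down as a small lookup. This is a sketch in Python under stated assumptions, not the driver code: the model identifiers and the function busno_cfg_offset() are hypothetical names, and only the two offsets (0xcc and 0xd0) and the stepping threshold come from the commit message.

```python
# Hypothetical model identifiers; the driver matches on CPU model numbers.
ICELAKE_X = "icelake_x"
TREMONT_D = "tremont_d"  # Tremont/Jacobsville
ICELAKE_D = "icelake_d"

OLD_BUSNO_CFG_OFFSET = 0xCC
NEW_BUSNO_CFG_OFFSET = 0xD0


def busno_cfg_offset(model: str, stepping: int) -> int:
    """Return the bus number configuration register offset for a CPU."""
    if model == ICELAKE_D:
        # All Ice Lake-D steppings use the updated offset.
        return NEW_BUSNO_CFG_OFFSET
    if model in (ICELAKE_X, TREMONT_D):
        # Ice Lake and Tremont/Jacobsville switched at stepping 4.
        return NEW_BUSNO_CFG_OFFSET if stepping >= 4 else OLD_BUSNO_CFG_OFFSET
    raise ValueError(f"unsupported model: {model}")
```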
     
  • The device ID for configuration agent PCI device and the offset for
    bus number configuration register can be CPU model specific. So add
    a new structure res_config to make them configurable and pass res_config
    to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.

    Signed-off-by: Qiuxu Zhuo
    Signed-off-by: Tony Luck
    Reviewed-by: Borislav Petkov
    Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic

    Qiuxu Zhuo
     

24 Apr, 2020

1 commit

  • Fix the following gcc warning:

    drivers/edac/amd8131_edac.c:47:21: warning: ‘bridge_str’ defined but not
    used [-Wunused-const-variable=]
    static char * const bridge_str[] = {
    ^~~~~~~~~~

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Borislav Petkov
    Reviewed-by: Robert Richter
    Link: https://lkml.kernel.org/r/20200415085006.6732-1-yanaijie@huawei.com

    Jason Yan
     

23 Apr, 2020

1 commit

  • Make a couple of symbols static, as reported by sparse.

    [ bp: Massage. ]

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/1587624744-97240-1-git-send-email-zou_wei@huawei.com

    Zou Wei
     

14 Apr, 2020

5 commits

  • When acpi_extlog was added, we were worried that the same error would
    be reported more than once by different subsystems. But in the ensuing
    years I've seen complaints that people could not find an error log
    (because this mechanism suppressed the log they were looking for).

    Rip it all out. People are smart enough to notice the same address from
    different reporting mechanisms.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-8-tony.luck@intel.com

    Tony Luck
     
  • If the handler took any action to log or deal with the error, set a bit
    in mce->kflags so that the default handler at the end of the machine
    check chain can see what has been done.

    Get rid of NOTIFY_STOP returns. Make the EDAC and dev-mcelog handlers
    skip over errors already processed by CEC.

    Signed-off-by: Tony Luck
    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200214222720.13168-5-tony.luck@intel.com

    Tony Luck
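
The idea of recording handled state in a flags word, instead of stopping the chain, can be modeled as follows. This is a simplified Python sketch, not the kernel implementation: the flag names, the corrected field, and the rule that the default handler logs only when no bit is set are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical flag bits standing in for the bits kept in mce->kflags.
HANDLED_CEC = 1 << 0
HANDLED_EDAC = 1 << 1
HANDLED_MCELOG = 1 << 2


@dataclass
class Mce:
    corrected: bool  # assumed field: whether CEC would take this error
    kflags: int = 0


def cec_handler(m: Mce) -> None:
    if m.corrected:
        m.kflags |= HANDLED_CEC


def edac_handler(m: Mce) -> None:
    if m.kflags & HANDLED_CEC:
        return  # skip errors already processed by CEC
    m.kflags |= HANDLED_EDAC


def mcelog_handler(m: Mce) -> None:
    if m.kflags & HANDLED_CEC:
        return  # skip errors already processed by CEC
    m.kflags |= HANDLED_MCELOG


def run_chain(m: Mce) -> bool:
    """Run every handler (no early stop); return True if the default
    handler at the end of the chain would still need to log the error."""
    for handler in (cec_handler, edac_handler, mcelog_handler):
        handler(m)
    return m.kflags == 0
```

Because no handler returns a stop code, each one decides for itself whether to act by looking at the bits set earlier in the chain.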
     
  • ... because no one should be interested in spurious MCEs anyway. Make
    the filtering unconditional and move it to amd_filter_mce().

    Signed-off-by: Borislav Petkov
    Tested-by: Tony Luck
    Link: https://lkml.kernel.org/r/20200407163414.18058-2-bp@alien8.de

    Borislav Petkov
     
  • Fix the following gcc warning:

    drivers/edac/xgene_edac.c:1486:7: warning: variable ‘address’ set but
    not used [-Wunused-but-set-variable]
    u32 address;
    ^~~~~~~

    Remove the unused macro RBERRADDR_RD while at it.

    Reported-by: Hulk Robot
    Signed-off-by: Jason Yan
    Signed-off-by: Borislav Petkov
    Link: https://lkml.kernel.org/r/20200409093259.20069-1-yanaijie@huawei.com

    Jason Yan
     
  • Fix spelling (s/Aramda/Armada/) in a log message and in a comment. While
    at it, add a trailing '\n' in messages.

    Signed-off-by: Christophe JAILLET
    Signed-off-by: Borislav Petkov
    Reviewed-by: Jan Luebbe
    Link: https://lkml.kernel.org/r/20200413041556.3514-1-christophe.jaillet@wanadoo.fr

    Christophe JAILLET
     

31 Mar, 2020

1 commit

  • Pull perf updates from Ingo Molnar:
    "The main changes in this cycle were:

    Kernel side changes:

    - A couple of x86/cpu cleanups and changes were grandfathered in due
    to patch dependencies. These clean up the set of CPU model/family
    matching macros with a consistent namespace and C99 initializer
    style.

    - A bunch of updates to various low level PMU drivers:
    * AMD Family 19h L3 uncore PMU
    * Intel Tiger Lake uncore support
    * misc fixes to LBR TOS sampling

    - optprobe fixes

    - perf/cgroup: optimize cgroup event sched-in processing

    - misc cleanups and fixes

    Tooling side changes are to:

    - perf {annotate,expr,record,report,stat,test}

    - perl scripting

    - libapi, libperf and libtraceevent

    - vendor events on Intel and S390, ARM cs-etm

    - Intel PT updates

    - Documentation changes and updates to core facilities

    - misc cleanups, fixes and other enhancements"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits)
    cpufreq/intel_pstate: Fix wrong macro conversion
    x86/cpu: Cleanup the now unused CPU match macros
    hwrng: via_rng: Convert to new X86 CPU match macros
    crypto: Convert to new CPU match macros
    ASoC: Intel: Convert to new X86 CPU match macros
    powercap/intel_rapl: Convert to new X86 CPU match macros
    PCI: intel-mid: Convert to new X86 CPU match macros
    mmc: sdhci-acpi: Convert to new X86 CPU match macros
    intel_idle: Convert to new X86 CPU match macros
    extcon: axp288: Convert to new X86 CPU match macros
    thermal: Convert to new X86 CPU match macros
    hwmon: Convert to new X86 CPU match macros
    platform/x86: Convert to new CPU match macros
    EDAC: Convert to new X86 CPU match macros
    cpufreq: Convert to new X86 CPU match macros
    ACPI: Convert to new X86 CPU match macros
    x86/platform: Convert to new CPU match macros
    x86/kernel: Convert to new CPU match macros
    x86/kvm: Convert to new CPU match macros
    x86/perf/events: Convert to new CPU match macros
    ...

    Linus Torvalds
     

30 Mar, 2020

1 commit


25 Mar, 2020

2 commits


18 Mar, 2020

1 commit


17 Mar, 2020

1 commit

  • On the ZynqMP platform, zynqmp_get_error_info() is used to read out
    error information. In this function, the pinf->col parameter is not
    used (it is only used by the Zynq platform's zynq_get_error_info()). So
    there's no need to print pinf->col on ZynqMP.

    In order to differentiate on which platform handle_error() is executed,
    use DDR_ECC_INTR_SUPPORT as the check condition to distinguish between
    Zynq and ZynqMP platforms.

    [ bp: Massage. ]

    Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
    Signed-off-by: Sherry Sun
    Signed-off-by: Borislav Petkov
    Reviewed-by: Manish Narani
    Link: https://lkml.kernel.org/r/1584365679-27443-1-git-send-email-sherry.sun@nxp.com

    Sherry Sun
     

27 Feb, 2020

1 commit

  • handle_error() currently calls snprintf() a couple of times in
    succession to output the message for a CE/UE, therefore overwriting each
    part of the message which was formatted with the previous snprintf()
    call. As a result, only the part of the message from the last snprintf()
    call will be printed.

    The simplest and most effective way to fix this problem is to combine
    the whole message into one format string and pass it to a single
    snprintf() call.

    [ bp: Massage. ]

    Fixes: b500b4a029d57 ("EDAC, synopsys: Add ECC support for ZynqMP DDR controller")
    Signed-off-by: Sherry Sun
    Signed-off-by: Borislav Petkov
    Reviewed-by: James Morse
    Cc: Manish Narani
    Link: https://lkml.kernel.org/r/1582792452-32575-1-git-send-email-sherry.sun@nxp.com

    Sherry Sun
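
The overwrite can be reproduced with a small model of a fixed buffer. This is a Python sketch rather than C: snprintf_at_start() is a hypothetical helper that mimics how snprintf(buf, size, ...) always writes from the start of the buffer, and the message fields are invented for the example.

```python
def snprintf_at_start(buf: bytearray, text: str) -> None:
    """Mimic snprintf(buf, size, ...): write from offset 0, truncate, NUL-terminate."""
    data = text.encode()[: len(buf) - 1]
    buf[: len(data)] = data
    buf[len(data)] = 0


def c_string(buf: bytearray) -> str:
    """Read the buffer the way C would: up to the first NUL byte."""
    return bytes(buf).split(b"\0", 1)[0].decode()


def buggy_message() -> str:
    buf = bytearray(64)
    # Successive calls each start at offset 0 and overwrite the previous part.
    snprintf_at_start(buf, "type:%s " % "CE")
    snprintf_at_start(buf, "row:%d" % 5)
    return c_string(buf)


def fixed_message() -> str:
    buf = bytearray(64)
    # One call with the whole message keeps every part.
    snprintf_at_start(buf, "type:%s row:%d" % ("CE", 5))
    return c_string(buf)
```

Here buggy_message() keeps only the text of the last call, while fixed_message() keeps both fields.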
     

20 Feb, 2020

1 commit

  • The driver supports error detection and correction on devices with an
    ARM DMC-520 memory controller.

    Signed-off-by: Lei Wang
    Signed-off-by: Shiping Ji
    Signed-off-by: Borislav Petkov
    Reviewed-by: James Morse
    Link: https://lkml.kernel.org/r/83b48c70-dc06-d0d4-cae9-a2187fca628b@gmail.com

    Lei Wang
     

19 Feb, 2020

1 commit