28 Jan, 2017

5 commits

  • Match one of the devices in amd64_cpuids[] before loading the module.
    This is an additional sanity check against users trying to load
    amd64_edac_mod on unsupported systems.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-9-git-send-email-Yazen.Ghannam@amd.com
    [ Get rid of err_ret label, make it a bit more readable this way. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
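
    The check described above can be sketched in user-space C roughly as
    follows. The table, the struct and the function name are hypothetical
    stand-ins for amd64_cpuids[] and the kernel's matching helper, and the
    family list is illustrative only.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of amd64_cpuids[]. */
struct cpu_id {
	unsigned int family;
};

/* Illustrative list only; the families the driver supports may differ. */
static const struct cpu_id supported_ids[] = {
	{ 0x0f }, { 0x10 }, { 0x15 }, { 0x16 }, { 0x17 },
};

/* Return 0 if the family is in the table, -ENODEV otherwise. */
static int check_cpu_supported(unsigned int family)
{
	size_t i;

	for (i = 0; i < sizeof(supported_ids) / sizeof(supported_ids[0]); i++) {
		if (supported_ids[i].family == family)
			return 0;
	}

	return -ENODEV;
}
```

    A module init routine would call such a check first and bail out before
    touching any hardware if it fails.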
     
  • Having ECC disabled on a node doesn't necessarily mean that it's
    disabled for the entire system. So let's return a non-failing code when
    ECC is disabled on a node. This way we can skip initialization for the
    node but still continue with the remaining nodes.

    After probing all instances, make sure we have at least one MC device
    allocated.

    This issue was seen, and the fix tested, on Fam15h and Fam17h MCM systems.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-8-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
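
    A minimal sketch of the control flow described above, assuming
    hypothetical helper names (probe_one_node(), probe_all_nodes()) and a
    boolean per node in place of the real register reads:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical probe of one node: returns 0 both when the node was set up
 * and when it was skipped because ECC is disabled; negative on a real error.
 */
static int probe_one_node(bool ecc_enabled, bool *allocated)
{
	*allocated = false;

	if (!ecc_enabled)
		return 0;	/* skip this node, but do not fail the whole probe */

	*allocated = true;	/* stands in for allocating an MC device */
	return 0;
}

/* Probe every node, then require that at least one MC device exists. */
static int probe_all_nodes(const bool *ecc_enabled, size_t num_nodes)
{
	size_t i, num_mcs = 0;

	for (i = 0; i < num_nodes; i++) {
		bool allocated;
		int err = probe_one_node(ecc_enabled[i], &allocated);

		if (err)
			return err;
		if (allocated)
			num_mcs++;
	}

	return num_mcs ? 0 : -ENODEV;
}
```

    With this shape, a system where only some nodes have ECC enabled still
    loads, while a system with ECC disabled everywhere is rejected.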
     
  • Print the node number when informing that DRAM ECC is disabled so
    that we can show which nodes have DRAM ECC disabled. Also, print more
    detailed system information as edac_dbg(), so as to not bother general
    users.

    Switch amd64_notice to amd64_info to match the message above it.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485537863-2707-5-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • We have a few functions that register/unregister an ECC error decoding
    routine. These functions are called when we init/remove instances.
    However, they are global and so don't need to be registered/unregistered
    multiple times.

    So move them out of the init/remove instance functions and into the
    module init/exit routines.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-4-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
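
    The move can be pictured with the sketch below. All names are
    hypothetical, and a counter stands in for the real decoder hook so that
    the effect (one registration per module, not one per node) is visible.

```c
static int decoder_registered;	/* how many times the global hook is registered */

static void register_decoder(void)
{
	decoder_registered++;
}

static void unregister_decoder(void)
{
	decoder_registered--;
}

/* Per-node work only; no global registration happens here. */
static void init_instance(int node)
{
	(void)node;
}

static void remove_instance(int node)
{
	(void)node;
}

static void module_init_sketch(int num_nodes)
{
	int node;

	for (node = 0; node < num_nodes; node++)
		init_instance(node);

	register_decoder();	/* once per module, not once per node */
}

static void module_exit_sketch(int num_nodes)
{
	int node;

	unregister_decoder();

	for (node = 0; node < num_nodes; node++)
		remove_instance(node);
}
```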
     
  • Jump to memory freeing routines when init_one_instance() fails.

    Signed-off-by: Yazen Ghannam
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1485297149-13733-3-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

16 Jan, 2017

1 commit


04 Dec, 2016

1 commit

  • When the call to zalloc_cpumask_var() fails, returning "false" is
    improper: "false" evaluates to 0, and a return value of 0 means no
    error. Return -ENOMEM instead.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189071

    Signed-off-by: Pan Bian
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1480831638-5361-1-git-send-email-bianpan201604@163.com
    Signed-off-by: Borislav Petkov

    Pan Bian
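
    The problem is easy to reproduce in plain C. The sketch below uses a
    hypothetical setup_mask() with calloc() standing in for
    zalloc_cpumask_var(); the simulate_failure flag exists only to make the
    failure path testable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical setup routine whose callers treat 0 as success. */
static int setup_mask(size_t nbytes, bool simulate_failure, void **out)
{
	void *mask = simulate_failure ? NULL : calloc(1, nbytes);

	if (!mask) {
		/* "return false;" here would return 0 and look like success. */
		return -ENOMEM;
	}

	*out = mask;
	return 0;
}
```

    A caller that checks "if (ret)" sees the allocation failure only with
    the negative errno value.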
     

01 Dec, 2016

1 commit


30 Nov, 2016

5 commits

  • Add Fam17h to the list of families to autoload amd64_edac_mod.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-18-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • How we need to decode UMC errors is different from how we decode bus
    errors, so let's define a new function for this. We also need a way to
    determine the UMC channel since we're not guaranteed that there is a
    fixed relation between channel and MCA bank.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1480359593-80369-1-git-send-email-Yazen.Ghannam@amd.com
    [ Fold in decode_synd_reg(), simplify. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • We need to determine the EDAC capabilities from all UMCs on the node. We
    should only check UMCs that are enabled and make sure they all agree.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-15-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
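
    One way to read "make sure they all agree" is that a capability counts
    for the node only if every enabled UMC reports it. The sketch below
    takes that interpretation; the capability bits, struct umc_info and
    node_caps() are hypothetical and do not reflect the real register
    layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_ECC		(1u << 0)	/* hypothetical capability bits */
#define CAP_CHIPKILL	(1u << 1)

struct umc_info {
	bool enabled;
	uint32_t caps;
};

/*
 * A capability is reported for the node only if every enabled UMC has it.
 * Returns 0 if no UMC is enabled.
 */
static uint32_t node_caps(const struct umc_info *umcs, size_t n)
{
	uint32_t caps = CAP_ECC | CAP_CHIPKILL;
	bool any_enabled = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!umcs[i].enabled)
			continue;	/* disabled UMCs do not get a vote */
		any_enabled = true;
		caps &= umcs[i].caps;
	}

	return any_enabled ? caps : 0;
}
```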
     
  • The UMCs on Fam17h are independent memory controllers so we need to
    read the capabilities from all UMCs and make sure they agree. Once
    we determine what capabilities are available we should save them for
    convenience.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1480431116-94683-1-git-send-email-Yazen.Ghannam@amd.com
    [ Simplify f17h_determine_edac_ctl_cap(), preinit edac_mode in init_csrows(). ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Read a few more UMC registers and provide debug output in order to be as
    similar as possible to older AMD systems.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1480344621-14966-1-git-send-email-Yazen.Ghannam@amd.com
    [ Remove unneeded K8 check and comments, fixup others. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

29 Nov, 2016

3 commits

  • Fam17h has new register offsets and fields for setting up the DRAM
    scrubber so add support for this.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-17-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Fam17h has a different set of registers and bitfields. Most of these
    registers are read through SMN (System Management Network) rather
    than PCI config space. Also, the derivation of various values is now
    different.

    Update amd64_edac to read the appropriate registers and extract the
    correct values for Fam17h.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-12-git-send-email-Yazen.Ghannam@amd.com
    [ Save us the indentation level in read_mc_regs(), add defines ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Fam17h needs PCI device functions 0 and 6 instead of 1 and 2 as on older
    systems. Update struct amd64_pvt to hold the new functions and reserve
    them if on Fam17h.

    Also, allocate an array of UMC structs within our newly allocated PVT
    struct.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-11-git-send-email-Yazen.Ghannam@amd.com
    [ init_one_instance() error handling, shorten lines, unbreak >80 cols lines. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

25 Nov, 2016

2 commits

  • Add a family type and associated ops for Fam17h. Define a struct to hold
    all the UMC registers that we need. Make this a part of struct amd64_pvt
    in order to maximize code reuse in the rest of the driver.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-10-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • Update the ecc_enabled() function to work on Fam17h. This entails
    reading a different set of registers and using the SMN (System
    Management Network) rather than PCI devices.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc: x86-ml
    Link: http://lkml.kernel.org/r/1479423463-8536-9-git-send-email-Yazen.Ghannam@amd.com
    [ Fixup ecc_en assignment and get_umc_base(). ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

24 Nov, 2016

1 commit

  • It's not recommended for the OS to try and force-enable ECC checking.
    This is considered a firmware task since it includes memory training,
    etc, so don't change ECC settings on Fam17h or newer systems and inform
    the user.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1479850816-1595-1-git-send-email-Yazen.Ghannam@amd.com
    [ Put the "forcing" message in an else branch. ]
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

21 Nov, 2016

3 commits

  • Currently, deferred errors are classified as correctable in EDAC. Add a
    new error type for deferred errors so that they are correctly reported
    to the user.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1479423463-8536-7-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
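
    As an illustration only, a three-way classification might look like the
    sketch below. The enum, the struct and the order of the checks are
    assumptions; the flags are already-decoded booleans, not the real MCA
    status register bits.

```c
#include <stdbool.h>

enum err_type {
	ERR_CORRECTED,
	ERR_DEFERRED,
	ERR_UNCORRECTED,
};

/* Hypothetical decoded status flags; not the real MCA register layout. */
struct err_status {
	bool uncorrected;
	bool deferred;
};

static enum err_type classify(struct err_status st)
{
	if (st.deferred)
		return ERR_DEFERRED;	/* previously lumped in with corrected */
	if (st.uncorrected)
		return ERR_UNCORRECTED;
	return ERR_CORRECTED;
}
```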
     
  • We only use __log_bus_error() to log DRAM ECC errors, so let's change
    the name to reflect this. We'll also use this function for DRAM ECC
    errors on Fam17h, but we'll call it from a different function than
    decode_bus_error().

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1479423463-8536-6-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     
  • AMD Fam17h will not be using PCI function 2 for EDAC, but will continue
    to use function 3. So let's get the name of F3 instead of F2 to support
    Fam17h and previous families.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1479423463-8536-5-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

21 Sep, 2016

1 commit


08 Aug, 2016

1 commit

  • Fam15hMod60h systems are using the channel decode of Fam15hMod30h which
    gives incorrect results. Fam15hMod60h systems should use the generic
    channel decode method plus a couple more cases.

    Signed-off-by: Yazen Ghannam
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1470236355-30039-1-git-send-email-Yazen.Ghannam@amd.com
    Signed-off-by: Borislav Petkov

    Yazen Ghannam
     

16 Jun, 2016

1 commit


18 May, 2016

1 commit

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
    gitignore: fix wording
    mfd: ab8500-debugfs: fix "between" in printk
    memstick: trivial fix of spelling mistake on management
    cpupowerutils: bench: fix "average"
    treewide: Fix typos in printk
    IB/mlx4: printk fix
    pinctrl: sirf/atlas7: fix printk spelling
    serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
    w1: comment spelling s/minmum/minimum/
    Blackfin: comment spelling s/divsor/divisor/
    metag: Fix misspellings in comments.
    ia64: Fix misspellings in comments.
    hexagon: Fix misspellings in comments.
    tools/perf: Fix misspellings in comments.
    cris: Fix misspellings in comments.
    c6x: Fix misspellings in comments.
    blackfin: Fix misspelling of 'register' in comment.
    avr32: Fix misspelling of 'definitions' in comment.
    treewide: Fix typos in printk
    Doc: treewide : Fix typos in DocBook/filesystem.xml
    ...

    Linus Torvalds
     

10 May, 2016

1 commit

  • - remove homegrown instances counting.
    - take F3 PCI device from amd_nb caching instead of F2 which was used with the
    PCI core.

    With those changes, the driver doesn't need to register a PCI driver and
    relies on the northbridges caching which we do anyway on AMD.

    Signed-off-by: Borislav Petkov
    Cc: Yazen Ghannam

    Borislav Petkov
     

27 Apr, 2016

1 commit


18 Apr, 2016

1 commit


25 Jan, 2016

1 commit

  • dct_sel_base_off is declared as a u64 but we're only using the lower 32
    bits because of a shift wrapping bug. This can possibly truncate the
    upper 16 bits of DctSelBaseOffset[47:26], causing us to misdecode the CS
    row.

    Fixes: c8e518d5673d ('amd64_edac: Sanitize f10_get_base_addr_offset')
    Signed-off-by: Dan Carpenter
    Cc: Aravind Gopalakrishnan
    Cc: linux-edac
    Cc:
    Link: http://lkml.kernel.org/r/20160120095451.GB19898@mwanda
    Signed-off-by: Borislav Petkov

    Dan Carpenter
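
    The class of bug is shown in the sketch below, assuming a 32-bit
    unsigned int. The function names are hypothetical, and the mask is
    chosen so that the shifted field lands in bits [47:26] as in the
    description above.

```c
#include <stdint.h>

/* Buggy: mask and shift are evaluated in 32 bits, so the high bits are lost. */
static uint64_t base_off_32bit(uint32_t reg)
{
	return (reg & 0xFFFFFC00u) << 16;
}

/* Fixed: widen to 64 bits before shifting. */
static uint64_t base_off_64bit(uint32_t reg)
{
	return (uint64_t)(reg & 0xFFFFFC00u) << 16;
}
```

    Declaring the destination as a 64-bit type is not enough on its own,
    because the shift is carried out in the type of its left operand before
    the assignment happens.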
     

04 Nov, 2015

1 commit

  • Pull RAS changes from Ingo Molnar:
    "The main system reliability related changes were from x86, but also
    some generic RAS changes:

    - AMD MCE error injection subsystem enhancements. (Aravind
    Gopalakrishnan)

    - Fix MCE and CPU hotplug interaction bug. (Ashok Raj)

    - kcrash bootup robustness fix. (Baoquan He)

    - kcrash cleanups. (Borislav Petkov)

    - x86 microcode driver rework: simplify it by unmodularizing it and
    other cleanups. (Borislav Petkov)"

    * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
    x86/mce: Add a Scalable MCA vendor flags bit
    MAINTAINERS: Unify the microcode driver section
    x86/microcode/intel: Move #ifdef DEBUG inside the function
    x86/microcode/amd: Remove maintainers from comments
    x86/microcode: Remove modularization leftovers
    x86/microcode: Merge the early microcode loader
    x86/microcode: Unmodularize the microcode driver
    x86/mce: Fix thermal throttling reporting after kexec
    kexec/crash: Say which char is the unrecognized
    x86/setup/crash: Check memblock_reserve() retval
    x86/setup/crash: Cleanup some more
    x86/setup/crash: Remove alignment variable
    x86/setup: Cleanup crashkernel reservation functions
    x86/amd_nb, EDAC: Rename amd_get_node_id()
    x86/setup: Do not reserve crashkernel high memory if low reservation failed
    x86/microcode/amd: Do not overwrite final patch levels
    x86/microcode/amd: Extract current patch level read to a function
    x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
    x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
    ...

    Linus Torvalds
     

21 Oct, 2015

1 commit

  • This function doesn't give us the "Node ID" as the function name
    suggests. Rather, it receives a PCI device as argument, checks
    the available F3 PCI device IDs in the system and returns the
    index of the matching Bus/Device IDs.

    Rename it to amd_pci_dev_to_node_id().

    No functional change is introduced.

    Suggested-by: Ingo Molnar
    Signed-off-by: Aravind Gopalakrishnan
    Signed-off-by: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Mauro Carvalho Chehab
    Cc: Peter Zijlstra
    Cc: Suravee Suthikulpanit
    Cc: Thomas Gleixner
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1445246268-26285-3-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Aravind Gopalakrishnan
     

29 Sep, 2015

1 commit

  • The scrub rate control register has moved to function 2 in PCI config
    space and is at a different offset on family 0x15, models 0x60 and
    later. The minimum recommended scrub rate has also changed. (Refer to
    D18F2x1c9_dct[1:0][DramScrub] in Fam15hM60h BKDG).

    Adjust set_scrub_rate() and get_scrub_rate() functions to accommodate
    this.

    Tested on F15hM60h, Fam15h, models 00h-0fh and Fam10h systems.

    Signed-off-by: Aravind Gopalakrishnan
    Cc: linux-edac
    Link: http://lkml.kernel.org/r/1443440593-2316-2-git-send-email-Aravind.Gopalakrishnan@amd.com
    [ Cleanup conditionals. ]
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan
     

20 May, 2015

1 commit

  • While testing asynchronous PCI probe on this driver I noticed it failed
    because the driver checks if any of the PCI devices have been bound to
    the driver after registering it, which obviously does not work if
    probing is asynchronous.

    While patches and discussions on how the driver should behave are
    ongoing, let's enforce synchronous probe for this driver for now.

    Reviewed-by: Tejun Heo
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Dmitry Torokhov
    Signed-off-by: Greg Kroah-Hartman

    Luis R. Rodriguez
     

23 Feb, 2015

2 commits


17 Feb, 2015

1 commit

  • When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
    the kernel fatally dereferences unallocated structures, see splat below;
    this occurs on at least NumaConnect systems.

    Fix by checking if a memory controller info structure was found.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
    IP: [] decode_bus_error+0x2f/0x2b0
    PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
    Oops: 0000 [#2] SMP
    Modules linked in:
    CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G D 3.19.0 #1
    Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b 01/28/2015
    task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
    RIP: 0010:[] [] decode_bus_error+0x2f/0x2b0
    RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
    RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
    RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
    RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
    R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
    R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
    FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
    Stack:
    0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
    000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
    ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
    Call Trace:

    ? vprintk_default
    ? printk
    amd_decode_mce
    notifier_call_chain
    atomic_notifier_call_chain
    mce_log
    machine_check_poll
    mce_timer_fn
    ? mce_cpu_restart
    call_timer_fn.isra.29
    run_timer_softirq
    __do_softirq
    irq_exit
    smp_apic_timer_interrupt
    apic_timer_interrupt

    ? down_read_trylock
    __do_page_fault
    ? __schedule
    do_page_fault
    page_fault

    Signed-off-by: Daniel J Blueman
    Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
    Cc: stable@vger.kernel.org
    [ Boris: massage commit message ]
    Signed-off-by: Borislav Petkov

    Daniel J Blueman
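
    The shape of the fix can be sketched as follows. The table, the lookup
    and the handler are hypothetical; only the limit of 16 is taken from the
    description above.

```c
#include <stddef.h>

#define MAX_MCS 16	/* mirrors the EDAC_MAX_MCS value quoted above */

struct mc_info {
	int errors_logged;
};

static struct mc_info *mc_table[MAX_MCS];

/* Hypothetical lookup: nodes beyond the table have no structure. */
static struct mc_info *find_mc(unsigned int node_id)
{
	return node_id < MAX_MCS ? mc_table[node_id] : NULL;
}

/* Returns 1 if the error was logged, 0 if there was nothing to log it against. */
static int handle_error(unsigned int node_id)
{
	struct mc_info *mci = find_mc(node_id);

	if (!mci)
		return 0;	/* bail out instead of dereferencing NULL */

	mci->errors_logged++;
	return 1;
}
```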
     

05 Nov, 2014

1 commit

  • By popular demand, enable amd64_edac on 32-bit too.

    Boris:
    - update Kconfig text.
    - add a warning on load which states that 32-bit configurations are unsupported.

    Signed-off-by: Tomasz Pala
    Link: http://lkml.kernel.org/r/20141102102212.GA7034@polanet.pl
    Signed-off-by: Borislav Petkov

    Tomasz Pala
     

30 Oct, 2014

1 commit

  • This patch adds support for ECC error decoding for F15h M60h processor.
    Aside from the usual changes, the patch adds support for some new features
    in the processor:
    - DDR4 (unbuffered, registered); LRDIMM DDR3 support
      - relevant debug messages have been modified/added to report these
        memory types
    - new dbam_to_cs mappers
      - if (F15h M60h && LRDIMM); we need a 'multiplier' value to find
        cs_size. This multiplier value is obtained from the per-dimm
        DCSM register. So, change the interface to accept a 'cs_mask_nr'
        value to facilitate this calculation
    - switch-casing determine_memory_type()
      - done to cleanse the function of too many if-else statements
        and improve readability
      - This is now called early in read_mc_regs() to cache dram_type

    Misc cleanup:
    - amd64_pci_table[] is condensed by using PCI_VDEVICE macro.

    Testing details:
    Tested the patch by injecting 'ECC' type errors using mce_amd_inj
    and error decoding works fine.

    Signed-off-by: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/1414617483-4941-1-git-send-email-Aravind.Gopalakrishnan@amd.com
    [ Boris: determine_memory_type() cleanups ]
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan
     

23 Sep, 2014

1 commit

  • Rationale behind this change:
    - F2x1xx addresses were stopped from being mapped explicitly to DCT1
      from F15h (OR) onwards. They use _dct[0:1] mechanism to access the
      registers. So we should move away from using address ranges to select
      DCT for these families.
    - On newer processors, the address ranges used to indicate DCT1 (0x140,
      0x1a0) have different meanings than what is assumed currently.

    Changes introduced:
    - amd64_read_dct_pci_cfg() now takes in dct value and uses it for
      'selecting the dct'
    - Update usage of the function. Keep in mind that different families
      have specific handling requirements
    - Remove [k8|f10]_read_dct_pci_cfg() as they don't do much different
      from amd64_read_pci_cfg()
      - Move the k8 specific check to amd64_read_pci_cfg
    - Remove f15_read_dct_pci_cfg() and move logic to amd64_read_dct_pci_cfg()
    - Remove now needless .read_dct_pci_cfg

    Testing:
    - Tested on Fam 10h; Fam15h Models: 00h, 30h; Fam16h using 'EDAC_DEBUG'
      and mce_amd_inj
      - driver obtains info from F2x registers and caches it in pvt
        structures correctly
      - ECC decoding works fine

    Signed-off-by: Aravind Gopalakrishnan
    Link: http://lkml.kernel.org/r/1410799058-3149-1-git-send-email-aravind.gopalakrishnan@amd.com
    Signed-off-by: Borislav Petkov

    Aravind Gopalakrishnan