11 Jun, 2012

1 commit

  • Add a new tracepoint-based hardware events report method for
    reporting Memory Controller events.

    Part of the description below is shamelessly copied from Tony
    Luck's notes about the Hardware Error BoF during LPC 2010 [1].
    Tony, thanks for your notes and discussions to generate the
    h/w error reporting requirements.

    [1] http://lwn.net/Articles/416669/

    We have several subsystems & methods for reporting hardware errors:

    1) EDAC ("Error Detection and Correction"). In its original form
    this consisted of a platform specific driver that read topology
    information and error counts from chipset registers and reported
    the results via a sysfs interface.

    2) mcelog - x86 specific decoding of machine check bank registers
    reporting in binary form via /dev/mcelog. Recent additions make use
    of the APEI extensions that were documented in version 4.0a of the
    ACPI specification to acquire more information about errors without
    having to rely on reading chipset registers directly. A user-level
    program decodes this into a somewhat human-readable format.

    3) drivers/edac/mce_amd.c - this driver hooks into the mcelog path and
    decodes errors reported via machine check bank registers in AMD
    processors to the console log using printk().

    Each of these mechanisms has a band of followers ... and none
    of them appears to meet all the needs of all users.

    As part of a RAS subsystem, let's encapsulate the memory error hardware
    events into a trace facility.

    The tracepoint printk will be displayed like:

    mc_event: [quant] (Corrected|Uncorrected|Fatal) error:[error msg] on [label] ([location] [edac_mc detail] [driver detail])

    Where:
    [quant] is the quantity of errors
    [error msg] is the driver-specific error message
    (e. g. "memory read", "bus error", ...);
    [location] is the location in terms of memory controller and
    branch/channel/slot, channel/slot or csrow/channel;
    [label] is the memory stick label;
    [edac_mc detail] describes the address location of the error
    and the syndrome;
    [driver detail] is driver-specific error message details,
    when needed/provided (e. g. "area:DMA", ...)

    For example:

    mc_event: 1 Corrected error:memory read on memory stick DIMM_1A (mc:0 location:0:0:0 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 area:DMA)

    Of course, any userspace tools meant to handle errors should not parse
    the above data. They should, instead, use the binary fields provided by
    the tracepoint, mapping them directly into their Management Information
    Base.
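
    As a rough illustration of how the text line relates to structured
    fields, here is a minimal userspace sketch. The struct, enum and
    function names are hypothetical and are not the kernel's tracepoint
    definition; the sketch only mirrors the format described above.

```c
#include <stdio.h>

/* Hypothetical mirror of the event fields; names are illustrative only. */
enum mc_err_type { MC_CORRECTED, MC_UNCORRECTED, MC_FATAL };

struct mc_event_sample {
	unsigned int quant;        /* [quant]: number of errors */
	enum mc_err_type type;     /* Corrected, Uncorrected or Fatal */
	const char *msg;           /* [error msg] */
	const char *label;         /* [label] */
	const char *location;      /* [location] */
	const char *detail;        /* [edac_mc detail] */
	const char *driver_detail; /* [driver detail], may be NULL or empty */
};

static const char *mc_err_type_name(enum mc_err_type t)
{
	switch (t) {
	case MC_CORRECTED:
		return "Corrected";
	case MC_UNCORRECTED:
		return "Uncorrected";
	default:
		return "Fatal";
	}
}

/* Render the human-readable line; returns what snprintf() returns. */
static int mc_event_format(char *buf, size_t len,
			   const struct mc_event_sample *ev)
{
	int has_drv = ev->driver_detail && ev->driver_detail[0];

	return snprintf(buf, len, "mc_event: %u %s error:%s on %s (%s %s%s%s)",
			ev->quant, mc_err_type_name(ev->type), ev->msg,
			ev->label, ev->location, ev->detail,
			has_drv ? " " : "",
			has_drv ? ev->driver_detail : "");
}
```

    A management tool would read fields such as quant and type directly
    from the event record; the formatted string is only for humans.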

    NOTE: The original patch provided an additional mechanism for
    MCA-based trace events that also contained MCA error register data.
    However, as no agreement has been reached so far on the MCA-based trace
    events, for now, let's add events only for memory errors.
    A later patch is planned to change the tracepoint for those types
    of events.

    Cc: Aristeu Rozanski
    Cc: Doug Thompson
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

29 May, 2012

2 commits

  • Now that all drivers got converted to use the new ABI, we can
    drop the old one.

    Acked-by: Chris Metcalf
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Change the EDAC internal representation to work with non-csrow
    based memory controllers.

    There are lots of those memory controllers nowadays, and more
    are coming. So, the EDAC internal representation needs to be
    changed, in order to work with those memory controllers, while
    preserving backward compatibility with the old ones.

    The edac core was written with the idea that memory controllers
    are able to directly access csrows.

    This is not true for FB-DIMM and RAMBUS memory controllers.

    Also, some recent advanced memory controllers don't present a per-csrow
    view. Instead, they view memory as DIMMs rather than ranks.

    So, change the allocation and error report routines to allow
    them to work with all types of architectures.

    This will allow the removal of several hacks with FB-DIMM and RAMBUS
    memory controllers.
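
    One way to picture a representation that is not tied to csrows is a
    list of layers (for example branch/channel/slot, or csrow/channel)
    whose positions are flattened into a single DIMM index. The sketch
    below is only an assumption about how such a mapping could look; the
    type and function names are hypothetical and not taken from the patch.

```c
/* Hypothetical layer description; names are illustrative only. */
enum example_layer_type {
	EX_LAYER_BRANCH,
	EX_LAYER_CHANNEL,
	EX_LAYER_SLOT,
	EX_LAYER_CHIP_SELECT
};

struct example_layer {
	enum example_layer_type type;
	unsigned int size; /* number of positions in this layer */
};

/*
 * Flatten a per-layer position (outermost layer first) into a single
 * index, row-major. Returns -1 if any position is out of range.
 */
static int example_flat_index(const struct example_layer *layers,
			      unsigned int n_layers,
			      const unsigned int *pos)
{
	unsigned int i, index = 0;

	for (i = 0; i < n_layers; i++) {
		if (pos[i] >= layers[i].size)
			return -1;
		index = index * layers[i].size + pos[i];
	}
	return (int)index;
}
```

    With this shape, a csrow-based controller and an FB-DIMM controller
    differ only in the layer table they pass in.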

    Also, several tests were done on different platforms using different
    x86 drivers.

    TODO: multi-rank DIMMs are currently represented by multiple DIMM
    entries in struct dimm_info. That means that changing a label for one
    rank won't change the same label for the other ranks on the same DIMM.
    This bug has been present since the beginning of EDAC, so it is not a big
    deal. However, on several drivers, it is possible to fix this issue, but
    it should be a per-driver fix, as the csrow => DIMM arrangement may not
    be equal for all. So, don't try to fix it here yet.

    I tried to make this patch as short as possible, preceding it with
    several other patches that simplified the logic here. Yet, as the
    internal API changes, all drivers need changes. The changes are
    generally bigger in the drivers for FB-DIMMs.

    Cc: Aristeu Rozanski
    Cc: Doug Thompson
    Cc: Borislav Petkov
    Cc: Mark Gross
    Cc: Jason Uhlenkott
    Cc: Tim Small
    Cc: Ranganathan Desikan
    Cc: "Arvind R."
    Cc: Olof Johansson
    Cc: Egor Martovetsky
    Cc: Chris Metcalf
    Cc: Michal Marek
    Cc: Jiri Kosina
    Cc: Joe Perches
    Cc: Dmitry Eremin-Solenikov
    Cc: Benjamin Herrenschmidt
    Cc: Hitoshi Mitake
    Cc: Andrew Morton
    Cc: "Niklas Söderlund"
    Cc: Shaohui Xie
    Cc: Josh Boyer
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

15 Dec, 2011

1 commit


01 Nov, 2011

1 commit


27 May, 2011

1 commit


31 Mar, 2011

1 commit


14 Jan, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

07 Jan, 2011

2 commits

  • Make the ->{get|set}_sdram_scrub_rate return the actual scrub rate
    bandwidth it succeeded in setting and remove the superfluous arg pointer
    used for that. A negative return value still means that an error occurred
    while setting the scrub rate. Document this for future reference.
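
    The return convention can be sketched as follows. This is a userspace
    illustration under stated assumptions: the bandwidth table and the
    function name are hypothetical, and no real driver is being quoted.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical supported bandwidths in bytes/s, highest first. */
static const uint32_t example_supported_bw[] = {
	1600000000u, 800000000u, 400000000u, 100000000u
};

/*
 * Pick the highest supported bandwidth that does not exceed the request.
 * Returns the bandwidth actually selected (>= 0), or -EINVAL if none fits.
 */
static int example_set_scrub_rate(uint32_t requested_bw)
{
	size_t i;
	size_t n = sizeof(example_supported_bw) / sizeof(example_supported_bw[0]);

	for (i = 0; i < n; i++) {
		if (example_supported_bw[i] <= requested_bw)
			return (int)example_supported_bw[i];
	}
	return -EINVAL;
}
```

    The caller needs no output pointer: a non-negative result is the rate
    in effect, and a negative result is an error code.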

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • Add a macro per printk level, shorten up error messages. Add relevant
    information to KERN_INFO level. No functional change.

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

23 Dec, 2010

1 commit


09 Dec, 2010

1 commit

  • This corrects the misprint introduced when moving '#if
    PAGE_SHIFT' from i7core_edac.c to edac_core.h (commit
    e9144601d364d5b81f3e63949337f8507eb58dca)

    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrei Konovalov
    Signed-off-by: Borislav Petkov

    Andrei Konovalov
     

02 Nov, 2010

1 commit

  • "gadget", "through", "command", "maintain", "maintain", "controller", "address",
    "between", "initiali[zs]e", "instead", "function", "select", "already",
    "equal", "access", "management", "hierarchy", "registration", "interest",
    "relative", "memory", "offset", "already",

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Jiri Kosina

    Uwe Kleine-König
     

24 Oct, 2010

3 commits


03 Aug, 2010

2 commits

  • Fortify the interface to not accept negative values, remove
    memctrl_int_store() as a result. Also, sanitize bandwidth setting by
    making the argument a simple u32 instead of a strange u32 pointer being
    passed around for no obvious reason. Then, fix error handling and teach
    it to return proper error values. Finally, make code more readable,
    simplify debug messages.
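
    A store routine that refuses negative input and hands a plain u32 to
    the driver might look like the userspace sketch below. The function
    name is hypothetical; the kernel has its own string-parsing helpers,
    which are not shown here.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal string into a uint32_t. Rejects negative numbers
 * (strtoul() would otherwise silently wrap them), empty input,
 * trailing garbage and values that do not fit in 32 bits.
 * Returns 0 on success or -EINVAL on failure.
 */
static int example_parse_u32(const char *s, uint32_t *out)
{
	char *end;
	unsigned long v;

	while (isspace((unsigned char)*s))
		s++;
	if (*s == '-')
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (end == s || errno == ERANGE || v > UINT32_MAX)
		return -EINVAL;
	if (*end == '\n') /* sysfs writes usually end with a newline */
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = (uint32_t)v;
	return 0;
}
```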

    Cc: Mauro Carvalho Chehab
    Cc: Arthur Jones
    Signed-off-by: Borislav Petkov
    Acked-by: Doug Thompson

    Borislav Petkov
     
  • This option differs from EDAC_DEBUG only by printing the file and
    line of where the debug statement is placed, which contains unneeded
    information. So remove it.

    Signed-off-by: Borislav Petkov
    Acked-by: Doug Thompson

    Borislav Petkov
     

10 May, 2010

3 commits


08 Dec, 2009

1 commit

  • Instead of using deeply-nested conditionals for dumping the DIMM type in
    debug mode, add a strings array of the supported DIMM types.

    The array is useful in cases where an EDAC driver supports multiple
    DRAM types, and it is only defined in debug builds.
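
    The idea of replacing nested conditionals with a lookup table can be
    sketched like this. The enum values and strings are hypothetical
    stand-ins, not the actual list of types EDAC supports.

```c
/* Hypothetical memory types; the real enum lives in the EDAC headers. */
enum example_mem_type {
	EX_MEM_EMPTY,
	EX_MEM_DDR,
	EX_MEM_DDR2,
	EX_MEM_DDR3,
	EX_MEM_COUNT
};

/* One string per type, indexed by the enum value. */
static const char * const example_mem_type_names[EX_MEM_COUNT] = {
	[EX_MEM_EMPTY] = "Empty",
	[EX_MEM_DDR]   = "DDR",
	[EX_MEM_DDR2]  = "DDR2",
	[EX_MEM_DDR3]  = "DDR3",
};

/* Bounds-checked lookup so a bad value cannot index past the table. */
static const char *example_mem_type_name(enum example_mem_type t)
{
	if ((unsigned int)t >= EX_MEM_COUNT)
		return "Unknown";
	return example_mem_type_names[t];
}
```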

    Signed-off-by: Borislav Petkov

    Borislav Petkov
     

21 Sep, 2009

1 commit


01 Jul, 2009

1 commit


19 Jun, 2009

1 commit

  • Add edac_device_alloc_index(), because on the MAPLE platform there may
    be several EDAC driver modules that make use of the
    edac_device_ctl_info structure at the same time. The index allocation
    for these structures should be taken care of by the EDAC core.
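
    Central index allocation of this kind can be sketched with an atomic
    counter. This is a userspace illustration using C11 atomics, with
    hypothetical names; it is not a copy of the kernel function.

```c
#include <stdatomic.h>

/* Shared counter owned by the "core"; starts at zero. */
static atomic_int example_device_indexes = 0;

/*
 * Hand out the next unused index. atomic_fetch_add() returns the value
 * before the increment, so concurrent callers never get the same index.
 */
static int example_alloc_index(void)
{
	return atomic_fetch_add(&example_device_indexes, 1);
}
```

    Each driver module calls the allocator instead of hard-coding an
    index, so two modules loaded together cannot collide.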

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Harry Ciao
    Cc: Doug Thompson
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Kumar Gala
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harry Ciao
     

10 Jun, 2009

1 commit


14 Apr, 2009

1 commit

  • Fix the EDAC-local pci_write_bits32() to properly recognize the 'escape'
    mask of all ones in a 32-bit word.

    Currently no consumer of this function uses that mask, so there is no
    danger to existing code.
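
    The masked-write logic with an all-ones escape can be sketched as
    below. A plain variable stands in for the PCI config register, and the
    names are hypothetical; only the masking logic is being illustrated.

```c
#include <stdint.h>

/* Stand-in for a 32-bit PCI configuration register. */
static uint32_t example_reg;

/*
 * Write 'value' into the bits selected by 'mask'. A mask of all ones
 * is the escape case: the read-modify-write is skipped and the value
 * is written as is. For a 32-bit word that mask is 0xffffffff.
 */
static void example_write_bits32(uint32_t value, uint32_t mask)
{
	if (mask != 0xffffffffu) {
		uint32_t buf = example_reg; /* read current contents */

		value &= mask;  /* keep only the bits being written */
		buf &= ~mask;   /* clear those bits in the old value */
		value |= buf;   /* merge untouched bits back in */
	}
	example_reg = value;
}
```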

    Signed-off-by: Jeff Haran
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Haran
     

03 Apr, 2009

2 commits

  • Add edac_pci_alloc_index(), because on the MAPLE platform there may be
    several EDAC driver modules that make use of the edac_pci_ctl_info
    structure at the same time. The index allocation for these structures
    should be taken care of by the EDAC core.

    Signed-off-by: Harry Ciao
    Cc: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harry Ciao
     
  • Make debugging information more verbose for use during development
    debugging.

    When the new option "More verbose debugging" is enabled, the source
    file and line number are added to each debugging message.

    Sample output:

    EDAC MC0: Giving out device to 'e7xxx_edac' 'E7205': DEV 0000:00:00.0
    EDAC DEBUG: in drivers/edac/edac_pci.c, line at 48: edac_pci_alloc_ctl_info()
    EDAC DEBUG: in drivers/edac/edac_pci.c, line at 334: edac_pci_add_device()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Signed-off-by: Hitoshi Mitake
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hitoshi Mitake
     

24 Aug, 2008

1 commit


06 May, 2008

1 commit

  • Commit 06916639e2fed9ee475efef2747a1b7429f8fe76 ("driver-core: add
    dev_name() to help transition away from using bus_id") added a static
    inline dev_name() and used it in dev_printk.

    Unfortunately, drivers/edac/edac_core.h defines a macro called
    dev_name(). Rename the latter.

    Diagnosis by Tony Breeds and Michael Ellerman.

    Signed-off-by: Stephen Rothwell
    Acked-by: Doug Thompson
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

08 Feb, 2008

1 commit


20 Oct, 2007

1 commit

  • define global BIT macro

    move all local BIT defines to the new globally defined macro.
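
    For readers unfamiliar with the idiom, a single-bit mask macro is
    usually written along these lines. The status flag names and the
    helper below are hypothetical and only show how such a macro is used.

```c
/* Single-bit mask: bit number 'nr' set, all others clear. */
#define BIT(nr) (1UL << (nr))

/* Hypothetical status flags built from the shared macro. */
#define EXAMPLE_STATUS_CE BIT(0) /* corrected error seen */
#define EXAMPLE_STATUS_UE BIT(1) /* uncorrected error seen */

/* Returns 1 if the uncorrected-error flag is set, 0 otherwise. */
static int example_has_ue(unsigned long status)
{
	return (status & EXAMPLE_STATUS_UE) != 0;
}
```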

    Signed-off-by: Jiri Slaby
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Kumar Gala
    Cc: Dmitry Torokhov
    Cc: Jeff Garzik
    Cc: James Bottomley
    Cc: "Antonino A. Daplas"
    Cc: Russell King
    Acked-by: Ralf Baechle
    Cc: "John W. Linville"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

12 Sep, 2007

1 commit


20 Jul, 2007

6 commits

  • Some simple fixes to properly reference counter values from the block
    attribute level of edac_device objects. Properly sequencing the array pointer
    was added, resulting in correct identification of block level attributes from
    their base class functions.

    Added more verbose debug statement for event tracking.

    Also during some corner testing, found a bug in the store/show sequence
    of operations for the block attribute/controls management.

    An old intermediate structure for 'blocks' was still in the processing
    pipeline. This patch removes that old structure and correctly utilizes the
    new struct edac_dev_sysfs_block_attribute for passing control from the sysfs
    to the low level store/show function of the edac driver.

    Now the proper kobj pointer is passed downward to the store/show
    functions.

    Signed-off-by: Doug Thompson
    Cc: Greg KH
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • With feedback, this patch corrects operation of the kobject release operation
    on kobjects, attributes and controls for the edac_device.

    Cc: Alan Cox alan@lxorguk.ukuu.org.uk
    Signed-off-by: Doug Thompson
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Douglas Thompson
     
  • This patch refactors the 'releasing' of kobjects for the edac_mc type of
    device. The correct pattern of kobject release is followed.

    As internal kobjs are allocated, they bump a ref count on the top level kobj.
    It in turn holds a module ref count on the edac_core module. When internal
    kobjects are released, they decrement the ref count on the top level kobj.
    When the top level kobj's ref count reaches zero, it decrements the ref count
    on the edac_core module, allowing it to be unloaded, as all resources have
    now been released.
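
    The chained reference counting described above can be modelled in a
    few lines. This is a simplified userspace model with hypothetical
    names and plain integers; it does not use the real kobject API.

```c
/* Stand-in for the module that owns the top-level object. */
struct example_module {
	int refs;
};

/* Stand-in for the top-level kobject. */
struct example_top {
	int refs;
	int released;
	struct example_module *owner;
};

/* Creating the top-level object pins the owning module. */
static void example_top_init(struct example_top *top,
			     struct example_module *mod)
{
	top->refs = 1;
	top->released = 0;
	top->owner = mod;
	mod->refs++;
}

/* Dropping the last reference releases the object and unpins the module. */
static void example_top_put(struct example_top *top)
{
	if (--top->refs == 0) {
		top->released = 1;
		top->owner->refs--;
	}
}

/* Each internal object holds one reference on the top-level object. */
static void example_child_add(struct example_top *top)
{
	top->refs++;
}

static void example_child_release(struct example_top *top)
{
	example_top_put(top);
}
```

    The module reference is dropped only after every child and the
    initial reference are gone, which is what makes unloading safe.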

    Cc: Alan Cox alan@lxorguk.ukuu.org.uk
    Signed-off-by: Doug Thompson
    Acked-by: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • Refactoring of the sysfs code necessitated refactoring the
    edac_device_alloc() and edac_device_add_device() APIs, moving the index
    value to the alloc() function. This patch alters the in-tree drivers to
    use the new API signature.

    Assigning the index value later created a chicken-and-egg issue.
    Moving it to the alloc() function allows the necessary sysfs
    entries to be created with the proper index number.

    Cc: Alan Cox alan@lxorguk.ukuu.org.uk
    Signed-off-by: Doug Thompson
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • Refactoring of the sysfs code necessitated refactoring the edac_mc_alloc()
    and edac_mc_add_mc() APIs, moving the index value to the alloc() function.
    This patch alters the in-tree drivers to use the new API signature.

    Assigning the index value later created a chicken-and-egg issue.
    Moving it to the alloc() function allows the necessary sysfs
    entries to be created with the proper index number.

    Cc: Alan Cox alan@lxorguk.ukuu.org.uk
    Signed-off-by: Doug Thompson
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Thompson
     
  • This patch fixes and enhances the driver level set of sysfs attributes that
    can be added to the 'block' level of an edac_device type of driver.

    There is a controller information structure, which contains one or more
    instances of device. Each instance will have one or more blocks of device
    specific counters. This patch fixes the ability to have more detailed
    attributes/controls for each of the 'blocks', providing for the addition of
    controls/attributes from the low level driver to user space via sysfs.

    Cc: Alan Cox alan@lxorguk.ukuu.org.uk
    Signed-off-by: Douglas Thompson
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Douglas Thompson