28 Apr, 2014

40 commits

  • The problem was initially reported by Wendy, who tried to pass through
    an IPR adapter, which was connected directly to the PHB root port, to a
    KVM-based guest. When doing that, pci_reset_bridge_secondary_bus() was
    called by the VFIO driver and linkDown was detected by the root port.
    That caused all PEs to be frozen.

    The patch fixes the issue by routing the reset for the secondary bus
    of the root port to the underlying firmware. For that, one more weak
    function, pci_reset_secondary_bus(), is introduced so that individual
    platforms can override it and do a platform-specific reset of the
    bridge's secondary bus.

    Reported-by: Wendy Xiong
    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
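
    The weak-function mechanism this entry relies on can be sketched in
    user-space C as follows. This is only an illustration under stated
    assumptions: struct pci_dev here is a hypothetical stand-in, the names
    generic_reset_secondary_bus(), platform_reset_secondary_bus() and
    reset_bridge_secondary_bus() are invented for the sketch, and the weak
    attribute is the GCC/Clang spelling rather than the kernel's own macro.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
    int generic_resets;   /* counts runs of the generic reset path */
};

/* Generic reset path. The real one drives the bridge's secondary bus
 * reset; this stand-in only records that it ran. */
void generic_reset_secondary_bus(struct pci_dev *bridge)
{
    bridge->generic_resets++;
}

/* Weak default (GCC/Clang attribute). A platform object file can provide
 * a strong platform_reset_secondary_bus() and the linker will prefer it,
 * for example to hand a root port reset over to firmware. */
__attribute__((weak)) void platform_reset_secondary_bus(struct pci_dev *bridge)
{
    generic_reset_secondary_bus(bridge);
}

/* Entry point that callers such as a pass-through driver would use. */
void reset_bridge_secondary_bus(struct pci_dev *bridge)
{
    assert(bridge);
    platform_reset_secondary_bus(bridge);
}
```

    With no strong definition linked in, the call ends up in the generic
    path; a platform that needs firmware help only has to define the
    strong symbol in its own file.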
     
  • Basically, we have 3 types of resets to fulfil PE reset: fundamental,
    hot and PHB reset. For the latter 2 cases, we need the PCI bus reset
    hold and settlement delays as specified by the PCI spec. PowerNV and
    pSeries platforms run on top of different firmware, and some of the
    delays are already covered by the underlying firmware (PowerNV).

    The patch unifies the delays so that they are done in the backend
    instead of the EEH core.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • Resetting a root port involves more work than resetting PCIe switch
    ports, so the root port reset should be done in firmware instead of
    in the kernel itself. The problem was introduced by commit 5b2e198e
    ("powerpc/powernv: Rework EEH reset").

    Cc: linux-stable
    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
    overwritten by EEH_STATE_NOT_SUPPORT because of the missing
    "break" there. The patch fixes the issue.

    Reported-by: Joe Perches
    Cc: linux-stable
    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
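
    The effect of a missing "break" like the one described in this entry
    can be reproduced with a small stand-alone sketch. The firmware status
    codes (0, 5) and the numeric values of the EEH_STATE_* constants below
    are made up for illustration and are not taken from
    pseries_eeh_get_state().

```c
#include <assert.h>

/* Illustrative values; the kernel's real constants may differ. */
enum {
    EEH_STATE_NOT_SUPPORT = -1,
    EEH_STATE_UNAVAILABLE = 1,
    EEH_STATE_NORMAL      = 2
};

/* fw_state is a made-up firmware status code: 0 = normal,
 * 5 = temporarily unavailable, anything else = not supported. */
int get_state_buggy(int fw_state)
{
    int result;

    assert(fw_state >= 0);
    switch (fw_state) {
    case 0:
        result = EEH_STATE_NORMAL;
        break;
    case 5:
        result = EEH_STATE_UNAVAILABLE;
        /* fall through - the "break" is missing here */
    default:
        result = EEH_STATE_NOT_SUPPORT;
        break;
    }
    return result;
}

/* Same translation with the "break" restored. */
int get_state_fixed(int fw_state)
{
    int result;

    assert(fw_state >= 0);
    switch (fw_state) {
    case 0:
        result = EEH_STATE_NORMAL;
        break;
    case 5:
        result = EEH_STATE_UNAVAILABLE;
        break;
    default:
        result = EEH_STATE_NOT_SUPPORT;
        break;
    }
    return result;
}
```

    In the buggy variant the "unavailable" case runs on into the default
    branch, so callers can never observe EEH_STATE_UNAVAILABLE.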
     
  • Once one specific PE has been marked as EEH_PE_ISOLATED, it's in
    the middle of recovery or has been removed permanently. We needn't
    report the frozen PE again. Otherwise, we will endlessly report
    the same frozen PE.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The issue was detected in a somewhat complicated test case where
    we have multiple hierarchical PEs, as shown in the following figure:

    +-----------------+
    | PE#3     p2p#0  |
    |          p2p#1  |
    +-----------------+
             |
    +-----------------+
    | PE#4     pdev#0 |
    |          pdev#1 |
    +-----------------+

    PE#4 (which has 2 PCI devices) is the child of PE#3, which has 2 p2p
    bridges. We accidentally hit a less-known scenario: PE#4 was removed
    permanently from the system because of a permanent failure (e.g.
    exceeding the max allowed failure times in the last hour), then we
    detected EEH errors on PE#3 and tried to recover it. However, the
    eeh_dev instances for pdev#0/1 were not detached from PE#4, which was
    still connected to PE#3. All of that happened because we rely on the
    count-based pcibios_release_device(), which isn't reliable enough.
    When doing recovery for PE#3, we still applied hotplug to PE#4 and
    pdev#0/1, which were not valid any more. Eventually, we ran into a
    kernel crash.

    The patch fixes the above issue from two aspects. For unplug, we simply
    skip those permanently removed PEs, whose state is (EEH_PE_STATE_ISOLATED
    && !EEH_PE_STATE_RECOVERING) and whose frozen count is greater than
    EEH_MAX_ALLOWED_FREEZES. For plug, we mark all permanently removed EEH
    devices with EEH_DEV_REMOVED and return 0xFF's on reads of their PCI
    config space so that the PCI core will omit them.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
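
    The unplug-side test from this entry (isolated, not recovering, and
    frozen more often than allowed) can be written as a small predicate.
    The sketch below is an assumption-laden illustration: struct eeh_pe is
    reduced to two fields, the flag names follow the shorter
    EEH_PE_ISOLATED / EEH_PE_RECOVERING spelling used elsewhere in this
    log, and the bit values and the limit of 5 are chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values and limit; not copied from the kernel. */
#define EEH_PE_ISOLATED         (1 << 0)
#define EEH_PE_RECOVERING       (1 << 1)
#define EEH_MAX_ALLOWED_FREEZES 5

/* Reduced, hypothetical version of the PE descriptor. */
struct eeh_pe {
    int state;          /* bitwise OR of EEH_PE_* flags */
    int freeze_count;   /* how often this PE has been frozen */
};

/* True when the PE was given up on for good and must be skipped
 * while unplugging devices below a parent PE. */
bool pe_removed_permanently(const struct eeh_pe *pe)
{
    assert(pe);
    return (pe->state & EEH_PE_ISOLATED) &&
           !(pe->state & EEH_PE_RECOVERING) &&
           pe->freeze_count > EEH_MAX_ALLOWED_FREEZES;
}
```

    A PE that is isolated but still recovering does not match, so normal
    recovery of a child PE is left untouched.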
     
  • The patch introduces bootarg "eeh=off" to disable EEH functionality.
    Also, it creates /sys/kernel/debug/powerpc/eeh_enable to disable
    or enable EEH functionality. By default, we have the functionality
    enabled.

    For the PowerNV platform, we fall back to the conventional
    mechanism of clearing the frozen PE during PCI config access if EEH
    functionality is disabled. Otherwise, we rely on EEH for error
    recovery.

    The patch also fixes the issue that the case of disabled EEH
    functionality was not covered in function ioda_eeh_event(). Those
    events driven by interrupt should be cleared to avoid endless
    reporting.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • There are 2 EEH subsystem variables: eeh_subsystem_enabled and
    eeh_probe_mode. We needn't maintain 2 variables; we can just
    have one variable and introduce different flags. The patch also
    introduces an additional flag, EEH_FORCE_DISABLED, which will be used
    to disable the EEH subsystem via boot parameter ("eeh=off") in future.
    Besides, the patch introduces flag EEH_ENABLED, which will be
    changed to disable or enable EEH functionality on the fly through
    a debugfs entry in future.

    With the patch applied, the criteria to check whether EEH
    functionality is enabled change to:

    !EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
    Other cases                        : Disabled

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
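
    The enable criteria listed in this entry map onto a single flag test.
    The following is a minimal sketch, assuming illustrative bit values
    for the two flags and passing the flag word as a parameter instead of
    reading a kernel global; the helper name eeh_enabled() is used here
    only for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values for the two flags named in the text. */
#define EEH_ENABLED        0x1u
#define EEH_FORCE_DISABLED 0x2u

/* Enabled only when EEH_ENABLED is set and EEH_FORCE_DISABLED is not. */
bool eeh_enabled(unsigned int flags)
{
    /* This sketch knows only the two flags above. */
    assert((flags & ~(EEH_ENABLED | EEH_FORCE_DISABLED)) == 0);

    if (flags & EEH_FORCE_DISABLED)
        return false;
    return (flags & EEH_ENABLED) != 0;
}
```

    Keeping the forced-off state in its own bit means a later runtime
    toggle of EEH_ENABLED cannot undo a disable requested at boot.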
     
  • When calling into eeh_gather_pci_data() on the pSeries platform, we
    possibly don't have a pci_dev instance yet, but eeh_dev is always
    ready. So we use the cached capability from eeh_dev instead of pci_dev
    for the log dump there. In order to keep things unified, we also cache
    PCI capability positions in eeh_dev for PowerNV.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
    function eeh_gather_pci_data().

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • We have suffered from recursive frozen PEs a lot, which were caused
    by IO accesses during the PE reset. Ben came up with the good
    idea to keep the PE frozen until recovery (BAR restore) gets done.
    With that, IO accesses during PE reset are dropped by hardware
    and won't cause a recursive frozen PE any more.

    The patch implements the idea. We don't clear the frozen state
    until PE reset is done completely. During that period, the EEH
    core expects an unfrozen state from the backend to keep going. So we
    have to reuse the EEH_PE_RESET flag, which has been set during PE
    reset, to return normal state from the backend. The side effect is
    that we have to clear the frozen state twice (by PE reset and by
    clearing it explicitly), but that's harmless.

    We have some limitations on pHyp. pHyp doesn't allow enabling
    IO or DMA for an unfrozen PE. So we don't enable them on an unfrozen
    PE in eeh_pci_enable(). We have to enable IO before grabbing logs on
    pHyp. Otherwise, 0xFF's are always returned from PCI config space.
    Also, we had the wrong return value from eeh_pci_enable() for the
    EEH_OPT_THAW_DMA case. The patch fixes it too.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The EEH PowerNV backends need to use their own PCI config
    accessors, as the normal ones could be blocked during PE reset.
    The patch also removes the unnecessary parameter "hose" from the
    function ioda_eeh_bridge_reset().

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • We've observed multiple PE reset failures because of PCI-CFG
    access during that period. Potentially, some device drivers
    can't support EEH very well and can't put the device into a
    quiescent state before PE reset. So those device drivers might
    produce PCI-CFG accesses during PE reset. Also, we could have
    PCI-CFG access from user space (e.g. "lspci"). Since access to
    a frozen PE should return 0xFF's, we can block PCI-CFG access
    during the period of PE reset so that we won't get recursive EEH
    errors.

    The patch adds flag EEH_PE_RESET, which is kept during PE reset.
    The PowerNV/pSeries PCI-CFG accessors reuse the flag to block
    PCI-CFG accordingly.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
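
    Blocking config reads while a PE is in reset, as this entry describes,
    could look roughly like the sketch below. It is not the PowerNV or
    pSeries accessor: struct eeh_pe is reduced to a state word,
    raw_cfg_read() is a hypothetical stand-in for the firmware access, the
    EEH_PE_RESET bit value is invented, and returning 0 for a blocked read
    is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

#define EEH_PE_RESET (1 << 2)   /* illustrative bit value */

/* Reduced, hypothetical version of the PE descriptor. */
struct eeh_pe {
    int state;   /* bitwise OR of EEH_PE_* flags */
};

/* Hypothetical raw accessor standing in for the firmware call. */
static uint32_t raw_cfg_read(int offset)
{
    return 0x12340000u | (uint32_t)offset;
}

/* Reads size bytes (1, 2 or 4) of config space into *val. While the PE
 * is being reset the hardware is not touched and every byte reads as
 * 0xFF, which is what a frozen PE would return anyway. */
int cfg_read(const struct eeh_pe *pe, int offset, int size, uint32_t *val)
{
    uint32_t mask;

    assert(pe && val);
    assert(size == 1 || size == 2 || size == 4);

    mask = 0xFFFFFFFFu >> (8 * (4 - size));
    if (pe->state & EEH_PE_RESET) {
        *val = mask;
        return 0;
    }
    *val = raw_cfg_read(offset) & mask;
    return 0;
}
```

    Callers that already cope with all-ones data from a frozen PE need no
    extra handling for the blocked case.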
     
  • When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
    However, we should clear it only if the PE reset has cleared the
    frozen state successfully. Otherwise, the flag should be kept.
    The patch fixes the issue.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • For some fields (e.g. LEM, MMIO, DMA) in the PHB diag-data dump, it's
    meaningless to print them if they have non-zero values in the
    corresponding mask registers because we always have non-zero values
    in the mask registers. The patch only prints those fields if we
    have non-zero values in the primary registers (e.g. LEM, MMIO, DMA
    status) so that we can save a couple of lines. The patch also removes
    the unnecessary blank line before "brdgCtl:" and the two leading
    spaces used as prefix in each line, as Ben suggested.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
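
    The printing rule from this entry (emit a field only when its primary
    register is non-zero, regardless of the mask) is sketched below. The
    helper dump_field(), its output format and the "lemFir" label used in
    the usage note are assumptions for illustration and do not reproduce
    the kernel's actual diag-data layout.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Writes "name: <status> <mask>" into buf only when the status register
 * is non-zero. Returns the number of characters produced, or 0 when the
 * field was skipped. The mask value never decides whether to print. */
int dump_field(char *buf, size_t len, const char *name,
               uint64_t status, uint64_t mask)
{
    assert(buf && name && len > 0);

    if (status == 0) {
        buf[0] = '\0';
        return 0;
    }
    return snprintf(buf, len, "%s: %016" PRIx64 " %016" PRIx64 "\n",
                    name, status, mask);
}
```

    Calling dump_field(buf, sizeof(buf), "lemFir", 0, mask) leaves buf
    empty even when mask is all ones, which is how the dump gets shorter.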
     
  • The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state,
    which is protected by CONFIG_EEH. We don't need that. Instead, we
    can have pnv_phb::flags and maintain all flags there, which is
    the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED
    to PNV_PHB_FLAG_EEH.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't
    so useful any more and it duplicates EEH_PE_ISOLATED. The
    patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD duplicates
    EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD
    would be removed from the system. However, it's safe to replace
    that with EEH_PE_ISOLATED.

    The patch also clears EEH_PE_RECOVERING after a fenced PHB has been
    handled, whether recovery failed or succeeded. It makes the PHB PE
    state consistent with:

    PHB functions normally              NONE
    PHB has been removed                EEH_PE_ISOLATED
    PHB fenced, recovery in progress    EEH_PE_ISOLATED | RECOVERING

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • This patch fixes this section mismatch:

    WARNING: vmlinux.o(.text+0x1efc4): Section mismatch in reference from
    the function apm821xx_pciex_init_port_hw() to the function
    .init.text:ppc4xx_pciex_wait_on_sdr.isra.9()

    The function apm821xx_pciex_init_port_hw() references the function
    __init ppc4xx_pciex_wait_on_sdr.isra.9(). This is often because
    apm821xx_pciex_init_port_hw lacks a __init annotation or the
    annotation of ppc4xx_pciex_wait_on_sdr.isra.9 is wrong.

    apm821xx_pciex_init_port_hw is only referenced by a struct in
    __initdata, so it should be safe to add __init to
    apm821xx_pciex_init_port_hw.

    Signed-off-by: Alistair Popple
    Signed-off-by: Benjamin Herrenschmidt

    Alistair Popple
     
  • When the guest cedes the vcpu or the vcpu has no guest to
    run, it naps. Clear the runlatch bit of the vcpu before
    napping to indicate an idle cpu.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • The secondary threads in the core are kept offline before launching guests
    in kvm on powerpc: "371fefd6f2dc4666:KVM: PPC: Allow book3s_hv guests to use
    SMT processor modes."

    Hence their runlatch bits are cleared. When the secondary threads are called
    in to start a guest, their runlatch bits need to be set to indicate that they
    are busy. The primary thread already has its runlatch bit set, but there
    is no harm in setting this bit once again. Hence set the runlatch bit for
    all threads before they start the guest.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • Up until now we have been setting the runlatch bits for a busy CPU and
    clearing it when a CPU enters idle state. The runlatch bit has thus
    been consistent with the utilization of a CPU as long as the CPU is online.

    However when a CPU is hotplugged out the runlatch bit is not cleared. It
    needs to be cleared to indicate an unused CPU. Hence this patch has the
    runlatch bit cleared for an offline CPU just before entering an idle state
    and sets it immediately after it exits the idle state.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • While testing memory hot-remove, I found the following deadlock:

    Process #1141 is drmgr, trying to remove some memory, i.e. memory499.
    It holds the memory_hotplug_mutex, and blocks when trying to remove file
    "online" under dir memory499, in kernfs_drain(), at
    wait_event(root->deactivate_waitq,
    atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);

    Process #1120 is trying to online memory499 by
    echo 1 > memory499/online

    In .kernfs_fop_write, it uses kernfs_get_active() to increase
    &kn->active, thus blocking process #1141. It is itself blocked later
    when trying to acquire memory_hotplug_mutex, which is held by process
    #1141.

    The backtraces of both processes are shown below:

    [] 0xc000000001b18600
    [] .__switch_to+0x144/0x200
    [] .online_pages+0x74/0x7b0
    [] .memory_subsys_online+0x9c/0x150
    [] .device_online+0xb8/0x120
    [] .online_store+0xb4/0xc0
    [] .dev_attr_store+0x64/0xa0
    [] .sysfs_kf_write+0x7c/0xb0
    [] .kernfs_fop_write+0x154/0x1e0
    [] .vfs_write+0xe0/0x260
    [] .SyS_write+0x64/0x110
    [] syscall_exit+0x0/0x7c

    [] 0xc000000001b18600
    [] .__switch_to+0x144/0x200
    [] .__kernfs_remove+0x204/0x300
    [] .kernfs_remove_by_name_ns+0x68/0xf0
    [] .sysfs_remove_file_ns+0x38/0x60
    [] .device_remove_attrs+0x54/0xc0
    [] .device_del+0x158/0x250
    [] .device_unregister+0x34/0xa0
    [] .unregister_memory_section+0x164/0x170
    [] .__remove_pages+0x108/0x4c0
    [] .arch_remove_memory+0x60/0xc0
    [] .remove_memory+0x8c/0xe0
    [] .pseries_remove_memblock+0xd4/0x160
    [] .pseries_memory_notifier+0x27c/0x290
    [] .notifier_call_chain+0x8c/0x100
    [] .__blocking_notifier_call_chain+0x6c/0xe0
    [] .of_property_notify+0x7c/0xc0
    [] .of_update_property+0x3c/0x1b0
    [] .ofdt_write+0x3dc/0x740
    [] .proc_reg_write+0xac/0x110
    [] .vfs_write+0xe0/0x260
    [] .SyS_write+0x64/0x110
    [] syscall_exit+0x0/0x7c

    This patch uses lock_device_hotplug() to protect remove_memory() called
    in pseries_remove_memblock(), which is also stated before function
    remove_memory():

    * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
    * and online/offline operations before this call, as required by
    * try_offline_node().
    */
    void __ref remove_memory(int nid, u64 start, u64 size)

    With this lock held, the other process (#1120 above) trying to online the
    memory block will retry the system call when calling
    lock_device_hotplug_sysfs(), and finally get a "No such device" error.

    Signed-off-by: Li Zhong
    Signed-off-by: Benjamin Herrenschmidt

    Li Zhong
     
  • module_init should return 0 or a negative errno.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Bump the boot wrapper BOOT_COMMAND_LINE_SIZE to match the
    kernel.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • I've had a report that the current limit is too small for
    an automated network based installer. Bump it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We have two definitions of COMMAND_LINE_SIZE, one for the kernel
    and one for the boot wrapper. I assume this is so the boot
    wrapper can be self sufficient and not rely on kernel headers.

    Having two defines with the same name is confusing; I just
    updated the wrong one when trying to bump it.

    Make the boot wrapper define unique by calling it
    BOOT_COMMAND_LINE_SIZE.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The catalog version number was changed from a be32 (with preceding
    32 bits of padding) to a be64; update the code to treat it as a be64.

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
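
    Why the be32-to-be64 change in this entry matters can be shown with
    two byte-wise decoders. This is a generic sketch rather than the
    catalog parsing code: read_be64() and read_be32() are helper names
    invented for the example, and the 8-byte buffers in the usage note
    are made-up sample data.

```c
#include <assert.h>
#include <stdint.h>

/* Decode 8 big-endian bytes into a host integer, independent of the
 * host's byte order. */
uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;

    assert(p);
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode 4 big-endian bytes the same way. */
uint32_t read_be32(const unsigned char *p)
{
    uint32_t v = 0;

    assert(p);
    for (int i = 0; i < 4; i++)
        v = (v << 8) | p[i];
    return v;
}
```

    For the bytes 00 00 00 00 00 00 00 03 both readings agree: the be64
    is 3 and the be32 after 4 bytes of padding is also 3. Once the upper
    half is used, as in 00 00 00 01 00 00 00 03, only the 64-bit reading
    sees the full value, while the old be32 view still reports 3.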
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • fixup for "powerpc/perf: Add support for the hv gpci (get performance
    counter info) interface".

    Makes the "not enabled" message less awful (and hidden unless
    debugging).

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • fixup for "powerpc/perf: Add support for the hv 24x7 interface"

    Makes the "not enabled" message less awful (and hides it in most cases).

    Signed-off-by: Cody P Schafer
    Signed-off-by: Benjamin Herrenschmidt

    Cody P Schafer
     
  • The if condition check was based on a draft ISA doc. Remove the same.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We have two copies of code that creates an OPAL sg list. Consolidate
    these into a common set of helpers and fix the endian issues.

    The flash interface embedded a version number in the num_entries
    field, whereas the dump interface did not. Since versioning
    wasn't added to the flash interface and it is impossible to add
    this in a backwards compatible way, just remove it.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Fix little endian issues with the OPAL error log code.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Stewart Smith
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The bitmap in opal_poll_events and opal_handle_interrupt is
    big endian, so we need to byteswap it on little endian builds.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We had some duplication of the internal OPAL functions.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Using size_t in our APIs is asking for trouble, especially
    when some OPAL calls use size_t pointers.

    Signed-off-by: Anton Blanchard
    Reviewed-by: Stewart Smith
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard