26 Apr, 2013

36 commits

  • Provides basic enablement for the perf branch stack sampling framework on
    POWER8 processor based platforms. Adds new BHRB-related elements to the
    cpu_hw_event structure to represent the current BHRB configuration and
    filter settings, to manage context, and to hold the BHRB output buffer
    during a PMU interrupt before it is passed to user space. This also
    enables processing of BHRB data and converts it into the generic perf
    branch stack data format.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Benjamin Herrenschmidt

    Anshuman Khandual
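
    A rough sketch of the per-CPU BHRB state the commit above describes; the
    field names and the entry count are illustrative assumptions, not
    necessarily those used in the patch:

      struct cpu_hw_events {
              /* ... existing perf bookkeeping ... */
              int                      bhrb_users;       /* events currently using the BHRB */
              void                     *bhrb_context;    /* context used for filtering */
              u64                      bhrb_filter;      /* current hardware filter config */
              struct perf_branch_entry bhrb_entries[32]; /* filled from the BHRB at PMU interrupt */
              struct perf_branch_stack bhrb_stack;       /* generic format handed to perf */
      };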
     
  • This patch populates BHRB-specific data in the power_pmu structure. It
    also implements POWER8-specific BHRB filter and configuration functions.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Benjamin Herrenschmidt

    Anshuman Khandual
     
  • This patch adds a couple of generic functions to the power_pmu structure
    to configure the BHRB and its filters. It also adds a field representing
    the number of BHRB entries present on the PMU. A new PMU flag, PPMU_BHRB,
    indicates the presence of the BHRB feature.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Benjamin Herrenschmidt

    Anshuman Khandual
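
    Roughly, the additions described above give struct power_pmu a shape like
    the sketch below; the flag value and member names are assumptions:

      #define PPMU_BHRB       0x00000200      /* assumed bit value; BHRB is present */

      struct power_pmu {
              /* ... existing members ... */
              int     bhrb_nr;                                    /* number of BHRB entries */
              u64     (*bhrb_filter_map)(u64 branch_sample_type); /* map perf filter to HW filter */
              void    (*config_bhrb)(u64 pmu_bhrb_filter);        /* program the hardware filter */
      };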
     
  • This patch adds the basic assembly code to read the BHRB buffer. BHRB
    entries are valid only after a PMU interrupt has happened (when
    MMCR0[PMAO]=1) and the BHRB has been frozen. The BHRB should not be read
    while it is still enabled (MMCR0[PMAE]=1) and being updated, as this can
    produce non-deterministic results.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Benjamin Herrenschmidt

    Anshuman Khandual
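
    A minimal sketch of how the frozen buffer might be walked from C using the
    new assembly helper; read_bhrb() is the helper's assumed name and the
    zero-entry termination is an assumption:

      /* Called from the PMU interrupt handler, i.e. with MMCR0[PMAO]=1 and
       * MMCR0[PMAE]=0, so the BHRB is frozen and stable while we read it. */
      static void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
      {
              int idx = 0;
              u64 val;

              while (idx < ppmu->bhrb_nr) {
                      val = read_bhrb(idx++);   /* assembly helper added by this patch */
                      if (!val)                 /* assume a zero entry terminates the buffer */
                              break;
                      /* decode val into cpuhw->bhrb_entries[] here */
              }
      }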
     
  • This patch adds the new POWER8 instruction encodings for reading
    and clearing Branch History Rolling Buffer entries. The new
    instruction 'mfbhrbe' (move from branch history rolling buffer
    entry) reads BHRB buffer entries, and the instruction 'clrbhrb'
    (clear branch history rolling buffer) clears the entire buffer.
    The 'clrbhrb' instruction has a straightforward encoding. The
    encoding for reading BHRB entries is 'mfbhrbe RT, BHRBE', which
    takes two arguments: the index of the BHRB entry to read and a
    general purpose register that receives the value read from that
    entry.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Benjamin Herrenschmidt

    Anshuman Khandual
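
    In ppc-opcode.h style the encodings above would look something like the
    following; the opcode constants, macro names and field positions here are
    assumptions for illustration, not the authoritative values:

      #define PPC_INST_MFBHRBE   0x7c00025c    /* assumed base opcode */
      #define PPC_INST_CLRBHRB   0x7c00035c    /* assumed base opcode */

      /* mfbhrbe RT, BHRBE: destination GPR plus an entry index */
      #define PPC_MFBHRBE(r, n)  stringify_in_c(.long PPC_INST_MFBHRBE | \
                                         ___PPC_RT(r) | (((n) & 0x3ff) << 11))
      #define PPC_CLRBHRB        stringify_in_c(.long PPC_INST_CLRBHRB)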
     
  • This patch adds support for the power8 PMU to perf.

    Work is ongoing to add generic cache events.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • On power8 we have a new SIER (Sampled Instruction Event Register), which
    captures information about instructions when we have random sampling
    enabled.

    Add support for loading the SIER into pt_regs, overloading regs->dar.
    Also set the new NO_SIPR flag in regs->result if we don't have SIPR.

    Update regs_sihv/sipr() to look for SIPR/SIHV in SIER.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • On power8 the presence or absence of SIPR depends on settings at runtime,
    so convert to using a dynamic flag for NO_SIPR. Existing backends that
    previously set NO_SIPR now simply set the dynamic flag unconditionally.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • Add an accessor for regs->result so we can use it to store more flags in
    future.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • On power8 the SIPR and SIHV are not in MMCRA, so convert the routines
    to take regs and change the names accordingly.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • In perf_ip_adjust() we potentially use the MMCRA[SLOT] field to adjust
    the reported IP of a sampled instruction.

    Currently the logic is written so that if the backend does NOT have
    the PPMU_ALT_SIPR flag set then we assume MMCRA[SLOT] exists.

    However on power8 we do not want to set ALT_SIPR (it's in a third
    location), and we also do not have MMCRA[SLOT].

    So add a new flag which only indicates whether MMCRA[SLOT] exists.

    Naively we'd set it on everything except power6/7, because they set
    ALT_SIPR, and we've reversed the polarity of the flag. But it's more
    complicated than that.

    mpc7450 is 32-bit, and uses its own version of perf_ip_adjust()
    which doesn't use MMCRA[SLOT], so it doesn't need the new flag set and
    the behaviour is unchanged.

    PPC970 (and I assume power4) don't have MMCRA[SLOT], so shouldn't have
    the new flag set. This is a behaviour change on those cpus, though we
    were probably getting lucky and the bits in question were 0.

    power5 and power5+ set the new flag, behaviour unchanged.

    power6 & power7 do not set the new flag, behaviour unchanged.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
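
    A sketch of the resulting check in perf_ip_adjust(); the flag name
    PPMU_HAS_SSLOT and the assumption that MMCRA is stashed in regs->dsisr are
    illustrative, not guaranteed to match the patch exactly:

      static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
      {
              unsigned long mmcra = regs->dsisr;      /* MMCRA saved at sample time */

              if (ppmu->flags & PPMU_HAS_SSLOT) {     /* only trust MMCRA[SLOT] if it exists */
                      unsigned long slot = (mmcra & MMCRA_SLOT) >> MMCRA_SLOT_SHIFT;
                      if (slot > 1)
                              return 4 * (slot - 1);
              }
              return 0;
      }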
     
  • For both HV and guest kernels, initialise the PMU registers to something sane.

    Signed-off-by: Michael Ellerman
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     
  • Ben found the root cause. Commit 37f02195bee9c25ce44e25204f40b7961a6d7c9d
    ("powerpc/pci: fix PCI-e devices rescan issue on powerpc platform")
    overwrites the IOMMU table of a PCI device while enabling the device.
    The patch fixes up the IOMMU table after that point.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The patch builds a 32-bit DMA space for individual PEs on PHB3. The
    TVE# is determined by the combination of the PE# and fixed bits from
    the DMA address, which are zero for the 32-bit DMA space.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • A TCE should be invalidated when it is created or freed. The approach
    for doing this differs between IODA1 and IODA2 compliant PHBs, so the
    patch uses separate invalidation functions for IODA1 and IODA2
    compliant PHBs. Notably, the PCI address is used to invalidate the
    corresponding TCE on the IODA2-compliant PHB3.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The EOI handler for MSI/MSI-X interrupts on P8 (PHB3) needs additional
    steps to handle the P/Q bits in the IVE before EOIing the corresponding
    interrupt. The patch changes the EOI handler to cover that. We have an
    individual IRQ chip in each PHB instance. At MSI IRQ setup time, the
    IRQ chip is copied over from the original one for that IRQ, and the EOI
    handler is replaced with one that handles the P/Q bits (as Ben
    suggested).

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • As Michael Ellerman suggested, add CONFIG_POWERNV_MSI for the PowerNV
    platform, similar to CONFIG_PSERIES_MSI for the pSeries platform. For
    now, it is not made dependent on CONFIG_EEH since that is not yet ready
    to be enabled.

    Apart from that, CONFIG_PPC_MSI_BITMAP is also enabled when
    CONFIG_POWERNV_MSI is selected.

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • The patch initializes PHB3 during the system boot stage. The flag
    "PNV_PHB_MODEL_PHB3" is introduced to differentiate the IODA2-compatible
    PHB3 from other types of PHBs.

    Signed-off-by: Benjamin Herrenschmidt

    Gavin Shan
     
  • Building a 64-bit powerpc kernel with PR KVM enabled currently gives
    this error:

    AS arch/powerpc/kernel/head_64.o
    arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
    arch/powerpc/kernel/exceptions-64s.S:258: Error: attempt to move .org backwards
    make[2]: *** [arch/powerpc/kernel/head_64.o] Error 1

    This happens because the MASKABLE_EXCEPTION_PSERIES macro turns into
    33 instructions, but we only have space for 32 at the decrementer
    interrupt vector (from 0x900 to 0x980).

    In the code generated by the MASKABLE_EXCEPTION_PSERIES macro, we
    currently have two instances of the HMT_MEDIUM macro, which has the
    effect of setting the SMT thread priority to medium. One is the
    first instruction, and is overwritten by a no-op on processors where
    we save the PPR (processor priority register), that is, POWER7 or
    later. The other is after we have saved the PPR.

    In order to reduce the code at 0x900 by one instruction, we omit the
    first HMT_MEDIUM. On processors without SMT this will have no effect
    since HMT_MEDIUM is a no-op there. On POWER5 and RS64 machines this
    will mean that the first few instructions take a little longer in the
    case where a decrementer interrupt occurs when the hardware thread is
    running at low SMT priority. On POWER6 and later machines, the
    hardware automatically boosts the thread priority when a decrementer
    interrupt is taken if the thread priority was below medium, so this
    change won't make any difference.

    The alternative would be to branch out of line after saving the CFAR.
    However, that would incur an extra overhead on all processors, whereas
    the approach adopted here only adds overhead on older threaded processors.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     
  • There are instances in which we do not want topology updates to occur.
    To allow this, a /proc interface (/proc/powerpc/topology_updates)
    is introduced so that topology updates can be enabled and disabled.

    This patch also adds a prrn_is_enabled() call so that PRRN events are
    handled in the kernel only if topology updating is enabled.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
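
    A hedged sketch of the gating described above; prrn_is_enabled() is named
    in the commit, while the backing variable and its default are assumptions:

      static int topology_updates_enabled = 1;  /* toggled via /proc/powerpc/topology_updates */

      int prrn_is_enabled(void)
      {
              return topology_updates_enabled;
      }

    The RTAS event-scan path can then consult prrn_is_enabled() before acting
    on a PRRN event.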
     
  • The Linux kernel and platform firmware negotiate their mutual support
    of the PRRN option via the ibm,client-architecture-support interface.
    This patch simply sets the appropriate fields in the client architecture
    vector to indicate Linux support for PRRN and will allow the firmware to
    report PRRN events via the RTAS event-scan mechanism.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
     
  • The new PRRN firmware feature provides a more convenient and event-driven
    interface than VPHN for notifying Linux of changes to the NUMA affinity of
    platform resources. However, for practical reasons, it may not be feasible
    for some customers to update to the latest firmware. For these customers,
    the VPHN feature supported on previous firmware versions may still be the
    best option.

    The VPHN feature was previously disabled due to races with the load
    balancing code when accessing the NUMA cpu maps, but the new stop_machine()
    approach protects the NUMA cpu maps from these concurrent accesses. It
    should be safe to re-enable this feature now.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Jesse Larrew
     
  • Commit 18ad51dd34 ("powerpc: Add VDSO version of getcpu") adds
    vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3.

    This patch ensures that this information is also updated when the NUMA
    affinity of a cpu changes.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Jesse Larrew
     
  • The new PRRN firmware feature allows CPU and memory resources to be
    transparently reassigned across NUMA boundaries. When this happens, the
    kernel must update the node maps to reflect the new affinity information.

    Although the NUMA maps can be protected by locking primitives during the
    update itself, this is insufficient to prevent concurrent accesses to these
    structures. Since cpumask_of_node() hands out a pointer to these
    structures, they can still be modified outside of the lock. Furthermore,
    tracking down each usage of these pointers and adding locks would be quite
    invasive and difficult to maintain.

    The approach used is to make a list of affected cpus and call stop_machine
    to have the update routine run on each of the affected cpus, allowing them
    to update themselves. Each cpu finds itself in the list of cpus and makes
    the appropriate updates. We need each cpu to do this for itself in order
    to handle calls to vdso_getcpu_init() added in a subsequent patch.

    Situations like these are best handled using stop_machine(). Since the NUMA
    affinity updates are exceptionally rare events, this approach has the
    benefit of not adding any overhead while accessing the NUMA maps during
    normal operation.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
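
    A minimal sketch of the stop_machine() usage described above; the callback
    and structure names are assumptions:

      #include <linux/stop_machine.h>

      struct topology_update_data {
              struct topology_update_data *next;  /* list of affected cpus */
              unsigned int cpu;
              int new_nid;
      };

      /* Runs on every cpu with the machine stopped; each affected cpu
       * finds its own entry in the list and updates its own mapping. */
      static int update_cpu_topology(void *data)
      {
              struct topology_update_data *update;

              for (update = data; update; update = update->next) {
                      if (smp_processor_id() != update->cpu)
                              continue;
                      unmap_cpu_from_node(update->cpu);
                      map_cpu_to_node(update->cpu, update->new_nid);
              }
              return 0;
      }

      static void run_topology_update(struct topology_update_data *updates,
                                      const struct cpumask *updated_cpus)
      {
              /* Nothing can read the NUMA maps mid-update while stopped. */
              stop_machine(update_cpu_topology, updates, updated_cpus);
      }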
     
  • Platform events such as partition migration or the new PRRN firmware
    feature can cause the NUMA characteristics of a CPU to change, and these
    changes will be reflected in the device tree nodes for the affected
    CPUs.

    This patch registers a handler for Open Firmware device tree updates
    and reconfigures the CPU and node maps whenever the associativity
    changes. Currently, this is accomplished by marking the affected CPUs in
    the cpu_associativity_changes_mask and allowing
    arch_update_cpu_topology() to retrieve the new associativity information
    using hcall_vphn().

    Protecting the NUMA cpu maps from concurrent access during an update
    operation will be addressed in a subsequent patch in this series.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Jesse Larrew
     
  • Update the numa code to use the updated firmware_has_feature() when checking
    for type 1 affinity.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
     
  • The firmware_has_feature() function makes it easy to check for supported
    features of the hypervisor. This patch extends the capability of
    firmware_has_feature() to include checking for specified bits
    in vector 5 of the architecture vector as reported in the device tree.

    As part of this, the #defines used for the architecture vector are re-defined
    such that each option has the index into vector 5 and the feature bit encoded
    into it. This makes checking for architecture bits when initializing data
    for firmware_has_feature() much easier.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
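
    The index-plus-bit encoding described above can be pictured like this; the
    macro names and the example option value are assumptions:

      /* Each option encodes its byte index into vector 5 plus the bit within that byte. */
      #define OV5_FEAT(x)     ((x) & 0xff)    /* feature bit mask within the byte */
      #define OV5_INDX(x)     ((x) >> 8)      /* byte index into vector 5 */

      /* Hypothetical option living at byte 0x18, bit 0x40 */
      #define OV5_EXAMPLE     0x1840

      static bool vec5_has(unsigned int feat, const u8 *vec5, unsigned int len)
      {
              return OV5_INDX(feat) < len &&
                     (vec5[OV5_INDX(feat)] & OV5_FEAT(feat));
      }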
     
  • When iterating over the entries in firmware_features_table we only need
    to go over the actual number of entries in the array instead of declaring
    it to be bigger and checking to make sure there is a valid entry in every
    slot.

    This patch removes the FIRMWARE_MAX_FEATURES #define and replaces the
    array looping with the use of ARRAY_SIZE().

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
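
    In other words, the lookup loop now bounds itself with ARRAY_SIZE() instead
    of a padded FIRMWARE_MAX_FEATURES; a sketch, with the struct and field
    names assumed for illustration:

      static void __init fw_feature_enable(const char *name)
      {
              int i;

              /* Bounded by the real number of entries; no padded maximum,
               * and no need to check each slot for validity. */
              for (i = 0; i < ARRAY_SIZE(firmware_features_table); i++) {
                      if (!strcmp(firmware_features_table[i].name, name))
                              powerpc_firmware_features |= firmware_features_table[i].val;
              }
      }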
     
  • As part of handling PRRN events we need to check the vector 5 architecture
    vector bits reported in the device tree to ensure PRRN event handling is
    enabled. To do this, firmware_has_feature() is updated (in a subsequent
    patch) to check the vector 5 bits. To avoid having to re-define bits in
    the architecture vector, the bit definitions are moved to prom.h.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
     
  • A PRRN event is signaled via the RTAS event-scan mechanism, which
    returns a Hot Plug Event message "fixed part" indicating "Platform
    Resource Reassignment". In response to the Hot Plug Event message,
    we must call ibm,update-nodes to determine which resources were
    reassigned and then ibm,update-properties to obtain the new affinity
    information about those resources.

    The PRRN event-scan RTAS message contains only the "fixed part" with
    the "Type" field set to the value 160 and no Extended Event Log. The
    four-byte Extended Event Log Length field is re-purposed (since no
    Extended Event Log message is included) to pass the "scope" parameter
    that causes ibm,update-nodes to return the nodes affected by the
    specific resource reassignment.

    This patch adds a handler for RTAS events. The function
    pseries_devicetree_update() (from mobility.c) is used to make the
    ibm,update-nodes/ibm,update-properties RTAS calls. Updating the NUMA maps
    (handled by a subsequent patch) will require significant processing,
    so pseries_devicetree_update() is called from an asynchronous workqueue
    to allow event processing to continue.

    PRRN RTAS events on pseries systems are rare events that have to be
    initiated from the HMC console for the system by an IBM tech. This allows
    us to assume that these events are widely spaced. Additionally, all work
    on the queue is flushed before handling any new work to ensure we only have
    one event in flight being handled at a time.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Jesse Larrew
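
    A sketch of the asynchronous handling described above; pseries_devicetree_update()
    is the routine named in the commit, while the work item names and the exact
    flush call are assumptions:

      #include <linux/workqueue.h>

      static int prrn_update_scope;

      static void prrn_work_fn(struct work_struct *work)
      {
              /* ibm,update-nodes / ibm,update-properties via the mobility code */
              pseries_devicetree_update(prrn_update_scope);
      }

      static DECLARE_WORK(prrn_work, prrn_work_fn);

      static void prrn_schedule_update(u32 scope)
      {
              flush_work(&prrn_work);          /* only one event in flight at a time */
              prrn_update_scope = scope;
              schedule_work(&prrn_work);
      }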
     
  • Correct parsing of the buffer returned from ibm,update-properties. The first
    element is a length and the path to the property, which is slightly different
    from the list of properties in the buffer, so we need to handle it
    specifically.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
     
  • Newer firmware on Power systems can transparently reassign platform resources
    (CPU and Memory) in use. For instance, if a processor or memory unit is
    predicted to fail, the platform may transparently move the processing to an
    equivalent unused processor or the memory state to an equivalent unused
    memory unit. However, reassigning resources across NUMA boundaries may alter
    the performance of the partition. When such reassignment is necessary, the
    Platform Resource Reassignment Notification (PRRN) option provides a
    mechanism to inform the Linux kernel of changes to the NUMA affinity of
    its platform resources.

    When rtasd receives a PRRN event, it needs to make a series of RTAS
    calls (ibm,update-nodes and ibm,update-properties) to retrieve the
    updated device tree information. These calls are already handled in the
    pseries_devicetree_update() routine used in partition migration.

    This patch exposes pseries_devicetree_update() to make it accessible
    to other pseries routines, and also updates pseries_devicetree_update()
    to take a 32-bit scope parameter. The scope value, which was previously
    hard coded to 1 for partition migration, is used for the RTAS calls
    ibm,update-nodes/properties to update the device tree.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Benjamin Herrenschmidt

    Nathan Fontenot
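
    The exposed interface would look roughly like this; the s32 type is inferred
    from the "32-bit scope parameter" wording above:

      /* Previously static in mobility.c with the scope hard coded to 1 */
      int pseries_devicetree_update(s32 scope);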
     
  • POWER8 allows us to take interrupts with the MMU on. This gives us a
    second set of vectors offset at 0x4000.

    Unfortunately, when copying these vectors we missed checking for MSR HV
    for hardware interrupts (0x500). This results in us trying to use
    HSRR0/1 when HV=0, rather than SRR0/1, on HW IRQs.

    The below fixes this to check CPU_FTR_HVMODE when patching the code at
    0x4500.

    Also, we remove the check for CPU_FTR_ARCH_206 since relocation-on IRQs
    are only available in arch 2.07 and beyond.

    Thanks to benh for helping find this.

    Signed-off-by: Michael Neuling
    CC:
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     
  • In __restore_cpu_power8 we determine if we are HV and, if not, we return
    before setting HV-only resources.

    Unfortunately we forgot to restore the link register from r11 before
    returning.

    This happens at boot and results in secondary CPUs not coming online.

    This adds the missing link register restore.

    Signed-off-by: Michael Neuling
    CC:
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     
  • In __after_prom_start we copy the kernel down to zero in two calls to
    copy_and_flush. Soon after the first call (the copy from 0 to
    copy_to_here:) we jump to the newly copied code.

    Unfortunately there's no isync between the copy of this code and the
    jump to it. Hence it's possible that stale instructions could still be
    in the icache or pipeline before we branch to it.

    We've seen this on real machines and it results in no console output
    after:
    calling quiesce...
    returning from prom_init

    The below adds an isync to ensure that the copy and flushing have
    completed before any branch to the new instructions occurs.

    Signed-off-by: Michael Neuling
    CC:
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     
  • We are currently out of free bits in AT_HWCAP. With POWER8, we have
    several hardware features that we need to advertise.

    Tested on POWER and x86.

    Signed-off-by: Michael Neuling
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     

24 Apr, 2013

4 commits