14 Aug, 2015

11 commits

  • Provide a kernel API and a sysfs entry which allow a user to specify
    that when a card is PERSTed, its image will stay the same, allowing
    it to participate in EEH.

    cxl_reset is used to reflash the card. In that case, we cannot safely
    assert that the image will not change. Therefore, disallow cxl_reset
    if the flag is set.

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • If the driver doesn't participate in EEH, the AFUs will be removed
    by cxl_remove, which will be invoked by EEH.

    If the driver does participate in EEH, the vPHB needs to stick around
    so that it can participate.

    In both cases, we shouldn't remove the AFU/vPHB.

    Reviewed-by: Cyril Bur
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • As with an adapter, some aspects of initialisation are done only once
    in the lifetime of an AFU: for example, allocating memory, or setting
    up sysfs/debugfs files.

    However, we may want to be able to do some parts of the initialisation
    multiple times: for example, in error recovery we want to be able to
    tear down and then re-map IO memory and IRQs.

    Therefore, refactor AFU init/teardown as follows.

    - Create two new functions: 'cxl_configure_afu', and its pair
    'cxl_deconfigure_afu'. As with the adapter functions,
    these (de)configure resources that do not need to last the entire
    lifetime of the AFU.

    - Allocating and releasing memory remain the task of 'cxl_alloc_afu'
    and 'cxl_release_afu'.

    - Once-only functions that do not involve allocating/releasing memory
    stay in the overarching 'cxl_init_afu'/'cxl_remove_afu' pair.
    However, the task of picking an AFU mode and activating it has been
    broken out.

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Some aspects of initialisation are done only once in the lifetime of
    an adapter: for example, allocating memory for the adapter,
    allocating the adapter number, or setting up sysfs/debugfs files.

    However, we may want to be able to do some parts of the
    initialisation multiple times: for example, in error recovery we
    want to be able to tear down and then re-map IO memory and IRQs.

    Therefore, refactor CXL init/teardown as follows.

    - Keep the overarching functions 'cxl_init_adapter' and its pair,
    'cxl_remove_adapter'.

    - Move all 'once only' allocation/freeing steps to the existing
    'cxl_alloc_adapter' function, and its pair 'cxl_release_adapter'
    (This involves moving allocation of the adapter number out of
    cxl_init_adapter.)

    - Create two new functions: 'cxl_configure_adapter', and its pair
    'cxl_deconfigure_adapter'. These two functions 'wire up' the
    hardware --- they (de)configure resources that do not need to
    last the entire lifetime of the adapter.
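
    The split described above can be sketched in user-space C. This is only an illustration of the lifecycle, not the driver's code: every `sketch_*` name and the struct layout are hypothetical stand-ins for the functions named in the commit message.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical adapter state, for illustration only. */
struct sketch_adapter {
    bool configured;      /* "wired up" state, may toggle many times */
    int configure_count;  /* how often the hardware was (re)configured */
};

/* Once-only step: allocate memory that lasts the adapter's lifetime. */
static struct sketch_adapter *sketch_alloc_adapter(void)
{
    return calloc(1, sizeof(struct sketch_adapter));
}

static void sketch_release_adapter(struct sketch_adapter *a)
{
    free(a);
}

/* Repeatable step: stands in for mapping IO memory and IRQs. */
static int sketch_configure_adapter(struct sketch_adapter *a)
{
    a->configured = true;
    a->configure_count++;
    return 0;
}

static void sketch_deconfigure_adapter(struct sketch_adapter *a)
{
    a->configured = false;
}

/* Overarching pair: alloc + configure, and deconfigure + release. */
static struct sketch_adapter *sketch_init_adapter(void)
{
    struct sketch_adapter *a = sketch_alloc_adapter();

    if (!a)
        return NULL;
    if (sketch_configure_adapter(a)) {
        sketch_release_adapter(a);
        return NULL;
    }
    return a;
}

static void sketch_remove_adapter(struct sketch_adapter *a)
{
    sketch_deconfigure_adapter(a);
    sketch_release_adapter(a);
}

/* Error recovery reuses the repeatable pair without reallocating. */
static int sketch_recover_adapter(struct sketch_adapter *a)
{
    sketch_deconfigure_adapter(a);
    return sketch_configure_adapter(a);
}
```

    In this sketch a recovery pass leaves the allocation untouched and only bumps `configure_count`, which is the property the refactoring is after.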

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • - MMIO pointer unmapping is guarded by a null pointer check.
    However, iounmap doesn't null the pointer, it just invalidates it.
    Therefore, explicitly null the pointer after unmapping.

    - afu_desc_mmio also needs to be unmapped.

    - PCI regions are allocated in cxl_map_adapter_regs.
    Therefore they should be released in unmap, not elsewhere.
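
    The first point can be sketched in plain C. Here `fake_iounmap()` is a hypothetical user-space stand-in for the kernel's `iounmap()`, and the struct and helper names are illustrative rather than taken from the driver.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device with one mapped register window. */
struct fake_dev {
    void *mmio;   /* NULL means "not mapped" */
};

/* Stand-in for iounmap(): it releases the mapping but has no way
 * to reset the caller's copy of the pointer. */
static void fake_iounmap(void *addr)
{
    free(addr);
}

static int fake_map(struct fake_dev *dev)
{
    dev->mmio = malloc(64);
    return dev->mmio ? 0 : -1;
}

/* Safe to call repeatedly: the NULL guard only works because the
 * pointer is explicitly reset after unmapping. */
static void fake_unmap(struct fake_dev *dev)
{
    if (dev->mmio) {
        fake_iounmap(dev->mmio);
        dev->mmio = NULL;
    }
}
```

    Without the explicit reset, a second call to `fake_unmap()` would pass the guard and unmap a stale pointer.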

    Acked-by: Cyril Bur
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Check if an IRQ is mapped before releasing it.

    This will simplify future EEH code by allowing unconditional unmapping
    of IRQs.

    Acked-by: Cyril Bur
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Previously the SPA was allocated and freed upon entering and leaving
    AFU-directed mode. This causes some issues for error recovery - contexts
    hold a pointer inside the SPA, and they may persist after the AFU has
    been detached.

    We would ideally like to allocate the SPA when the AFU is allocated, and
    not release it until the AFU is released. However, we don't know how big
    the SPA needs to be until we read the AFU descriptor.

    Therefore, restructure the code:

    - Allocate the SPA only once, on the first attach.

    - Release the SPA only when the entire AFU is being released (not
    detached). Guard the release with a NULL check, so we don't free
    if it was never allocated (e.g. dedicated mode)
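
    A minimal user-space sketch of this allocate-once, release-late pattern follows. The `fake_afu` struct and the `fake_*` functions are hypothetical; they only model the ownership rules described above, not the real SPA handling.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical AFU state, for illustration only. */
struct fake_afu {
    void *spa;     /* NULL until the first attach */
    int allocs;    /* number of times the SPA was allocated */
};

/* Attach: allocate the SPA on first use only; later attaches reuse it. */
static int fake_attach(struct fake_afu *afu, size_t spa_size)
{
    if (!afu->spa) {
        afu->spa = calloc(1, spa_size);
        if (!afu->spa)
            return -1;
        afu->allocs++;
    }
    return 0;
}

/* Detach: deliberately leaves the SPA alone, since contexts may
 * still hold pointers into it. */
static void fake_detach(struct fake_afu *afu)
{
    (void)afu;
}

/* Release: frees the SPA, guarded so that an AFU which never
 * attached (e.g. dedicated mode) is handled safely. */
static void fake_release(struct fake_afu *afu)
{
    if (afu->spa) {
        free(afu->spa);
        afu->spa = NULL;
    }
}
```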

    Acked-by: Cyril Bur
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • If the PCI channel has gone down, don't attempt to poke the hardware.

    We need to guard every time cxl_whatever_(read|write) is called. This
    is because a call to those functions will dereference an offset into an
    mmio register, and the mmio mappings get invalidated in the EEH
    teardown.

    Check in the read/write functions in the header.
    We give them the same semantics as usual PCI operations:
    - a write to a channel that is down is ignored.
    - a read from a channel that is down returns all fs.

    Also, we try to access the MMIO space of a vPHB device as part of the
    PCI disable path. Because that's a read that bypasses most of our usual
    checks, we handle it explicitly.

    As far as user visible warnings go:
    - Check link state in file ops, return -EIO if down.
    - Be reasonably quiet if there's an error in a teardown path,
    or when we already know the hardware is going down.
    - Throw a big WARN if someone tries to start a CXL operation
    while the card is down. This gives a useful stacktrace for
    debugging whatever is doing that.
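
    The read/write semantics listed above can be sketched as follows. This is a user-space model under stated assumptions: `fake_adapter`, `link_ok` and the `fake_*64()` accessors are hypothetical names, and a plain array stands in for the MMIO registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical adapter: a link flag plus a few "registers". */
struct fake_adapter {
    bool link_ok;
    uint64_t regs[4];
};

/* A read from a channel that is down returns all fs. */
static uint64_t fake_read64(const struct fake_adapter *a, unsigned int idx)
{
    if (!a->link_ok)
        return ~0ULL;
    return a->regs[idx];
}

/* A write to a channel that is down is ignored. */
static void fake_write64(struct fake_adapter *a, unsigned int idx,
                         uint64_t val)
{
    if (a->link_ok)
        a->regs[idx] = val;
}
```

    Putting the check inside the accessors means callers never dereference a mapping that may already have been torn down.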

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • We're about to make these more complex, so make them functions
    first.

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • In the complete hotplug case, EEH PEs are supposed to be released
    and set to NULL. Normally, this is done by eeh_remove_device(),
    which is called from pcibios_release_device().

    However, if something is holding a kref to the device, it will not
    be released, and the PE will remain. eeh_add_device_late() has
    a check for this which will explicitly destroy the PE in this case.

    This check in eeh_add_device_late() occurs after a call to
    eeh_ops->probe(). On PowerNV, probe is a pointer to pnv_eeh_probe(),
    which will exit without probing if there is an existing PE.

    This means that on PowerNV, devices with outstanding krefs will not
    be rediscovered by EEH correctly after a complete hotplug. This is
    affecting CXL (CAPI) devices in the field.

    Put the probe after the kref check so that the PE is destroyed
    and affected devices are correctly rediscovered by EEH.

    Fixes: d91dafc02f42 ("powerpc/eeh: Delay probing EEH device during hotplug")
    Cc: stable@vger.kernel.org
    Cc: Gavin Shan
    Signed-off-by: Daniel Axtens
    Acked-by: Gavin Shan
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Section 3.7 of Version 1.2 of the Power8 Processor User's Manual
    prescribes that updates to HID0 be preceded by a SYNC instruction and
    followed by an ISYNC instruction (Page 91).

    Create an inline function named update_power8_hid0() which follows this
    recipe and invoke it from the static split core path.

    Signed-off-by: Gautham R. Shenoy
    Reviewed-by: Sam Bobroff
    Tested-by: Sam Bobroff
    Signed-off-by: Michael Ellerman

    Gautham R. Shenoy
     

12 Aug, 2015

8 commits

  • Replace hard coded values with existing DRCONF flags while processing
    detected LMBs from the device tree. Does not change any functionality.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • The value of 'valid' is always zero when 'esid' is zero, and if 'esid'
    is non-zero then the value of 'valid' is irrelevant because we are using
    logical or in the if expression.

    In fact 'valid' can be dropped completely from dump_segments() by
    simply doing the check with SLB_ESID_V directly in the if.
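
    The reasoning can be checked with a small sketch. It assumes the old condition OR-ed 'valid' with 'esid', as the description suggests; the function names and the `SKETCH_SLB_ESID_V` constant are hypothetical and used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the SLB "valid" bit, for illustration only. */
#define SKETCH_SLB_ESID_V 0x0000000008000000ULL

/* Old shape: a separate 'valid' variable OR-ed into the test. */
static bool nonempty_old(uint64_t esid)
{
    uint64_t valid = esid & SKETCH_SLB_ESID_V;

    return valid || esid;
}

/* New shape: 'valid' is derived from esid's own bits, so it can
 * never be set when esid is zero and adds nothing when it is not. */
static bool nonempty_new(uint64_t esid)
{
    return esid != 0;
}

/* The valid bit is tested directly where it is actually needed. */
static bool entry_is_valid(uint64_t esid)
{
    return (esid & SKETCH_SLB_ESID_V) != 0;
}
```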

    Signed-off-by: Anshuman Khandual
    [mpe: Rewrite change log]
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • The code to fetch the SLB size from the device tree wants to first look
    for "slb-size" and then if that's not found "ibm,slb-size".

    We can simplify the code by looking for the properties and then if we
    find one of them we set mmu_slb_size.

    We also change the function name from check_cpu_slb_size() to
    init_mmu_slb_size() as the function doesn't check anything, it only
    initialises mmu_slb_size.

    Signed-off-by: Anshuman Khandual
    [mpe: Rewrite change log]
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • This patch adds some documentation to patch_slb_encoding() explaining
    how it works.

    Signed-off-by: Anshuman Khandual
    [mpe: Update change log and mention the signedness of the immediate]
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • The SLB code uses 'slot' and 'entry' interchangeably, change it to always
    use 'entry'.

    Signed-off-by: Anshuman Khandual
    [mpe: Rewrite change log]
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • This patch just removes one redundant entry for one extern variable
    'slb_compare_rr_to_size' from the scope. This patch does not change
    any functionality.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Michael Ellerman

    Anshuman Khandual
     
  • An IO address, tagged with __iomem, is passed to debugfs_create_file
    as private data. This requires that it be cast to void *. The cast
    drops the __iomem annotation and so creates a sparse warning:

    drivers/misc/cxl/debugfs.c:51:57: warning: cast removes address space of expression

    The address space marker is added back in the file operations
    (fops_io_u64).

    Silence the warning with __force.

    Signed-off-by: Daniel Axtens
    Acked-by: Michael Neuling
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • A few declarations were identified by sparse as needing to be static:

    drivers/misc/cxl/irq.c:408:6: warning: symbol 'afu_irq_name_free' was not declared. Should it be static?
    drivers/misc/cxl/irq.c:467:6: warning: symbol 'afu_register_hwirqs' was not declared. Should it be static?
    drivers/misc/cxl/file.c:254:6: warning: symbol 'afu_compat_ioctl' was not declared. Should it be static?
    drivers/misc/cxl/file.c:399:30: warning: symbol 'afu_master_fops' was not declared. Should it be static?

    Make them static.

    Signed-off-by: Daniel Axtens
    Acked-by: Michael Neuling
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     

11 Aug, 2015

1 commit


06 Aug, 2015

12 commits

  • Add a new powerpc-specific trace clock using the timebase register,
    similar to x86-tsc. This gives us
    - a fast, monotonic, hardware clock source for trace entries, and
    - a clock that can be used to correlate events across cpus as well as across
    hypervisor and guests.

    Signed-off-by: Naveen N. Rao
    Acked-by: Steven Rostedt
    Signed-off-by: Michael Ellerman

    Naveen N. Rao
     
  • In case of error, the functions platform_get_resource() and kmalloc()
    return NULL, not ERR_PTR(). The IS_ERR() test in the return value check
    should be replaced with a NULL test.
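
    Why the IS_ERR() test misses a NULL return can be seen in a user-space sketch of the error-pointer convention. The `sketch_*` helpers below are simplified stand-ins written for this illustration, not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Error pointers occupy the top SKETCH_MAX_ERRNO values of the
 * address space; this mirrors the usual kernel convention. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True only for pointers in the reserved error range. NULL is at
 * the opposite end of the address space, so it is NOT an error
 * pointer by this test. */
static int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

    A function that signals failure with NULL therefore needs a plain `if (!ptr)` check; the error-pointer test lets the failure through unnoticed.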

    Signed-off-by: Wei Yongjun
    Signed-off-by: Michael Ellerman

    Wei Yongjun
     
  • wf_find_control(), wf_find_sensor(), and wf_is_overtemp() are exported
    but unused. Remove these three functions.

    Signed-off-by: Paul Bolle
    Signed-off-by: Michael Ellerman

    Paul Bolle
     
  • wf_critical_overtemp() is exported. But nothing uses that export.
    That's unsurprising because there's no header that defines it. Stop
    exporting that function and make it static.

    Signed-off-by: Paul Bolle
    Signed-off-by: Michael Ellerman

    Paul Bolle
     
  • wf_unregister_client() increments the client count when a client
    unregisters. That is obviously incorrect. Decrement that client count
    instead.

    Fixes: 75722d3992f5 ("[PATCH] ppc64: Thermal control for SMU based machines")

    Signed-off-by: Paul Bolle
    Signed-off-by: Michael Ellerman

    Paul Bolle
     
  • break; break; isn't useful.

    Remove one.

    Signed-off-by: Joe Perches
    Signed-off-by: Michael Ellerman

    Joe Perches
     
  • Use %pR to simplify the debug code. This also makes the debug info more
    readable.

    Signed-off-by: Kevin Hao
    [mpe: Unsplit multi-line printk strings]
    Signed-off-by: Michael Ellerman

    Kevin Hao
     
  • Currently when attaching a context in dedicated mode, we ignore the
    result of add_process_element(), which could potentially fail.

    If add_process_element() returns an error, pass it back to the caller.

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Invoke the new opal_cec_reboot2() call with reboot type
    OPAL_REBOOT_PLATFORM_ERROR (for unrecoverable HMI interrupts) to inform
    BMC/OCC about this error, so that BMC can collect relevant data for error
    analysis and decide what component to de-configure before rebooting.

    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman

    Mahesh Salgaonkar
     
  • On non-recoverable MCE errors in kernel space, the Linux kernel panics
    and the system reboots. On BMC-based systems opal-prd runs as a daemon
    in the host. Hence, a kernel crash may prevent opal-prd from detecting
    and analyzing the MCE error. This may land us in a situation where the
    faulty memory never gets de-configured and Linux keeps hitting the same
    MCE error again and again. If this happens at an early stage of kernel
    initialization, Linux will keep crashing and rebooting in a loop.

    This patch fixes this issue by invoking the new opal_cec_reboot2() call with
    reboot type OPAL_REBOOT_PLATFORM_ERROR to inform BMC/OCC about this
    error, so that BMC can collect relevant data for error analysis and
    decide what component to de-configure before rebooting.

    This patch depends on the OPAL patchset posted on the skiboot mailing list
    at https://lists.ozlabs.org/pipermail/skiboot/2015-July/001771.html that
    introduces the opal_cec_reboot2() OPAL call.

    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman

    Mahesh Salgaonkar
     
  • In the event of an unrecovered HMI, the existing code panics as soon as
    it receives the first unrecovered HMI event. This makes the host report
    only partial information about HMIs before the panic. There may be more
    errors which caused the HMI, and hence more HMI events waiting to be
    pulled by the host. This patch implements logic to pull and display all
    the HMI events before going down the panic path.

    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman

    Mahesh Salgaonkar
     
  • The V2 version of HMI event now carries additional information for
    Malfunction Alert. It now contains error information about CORE and NX
    checkstop. This patch checks and displays the check stop reason before
    panic.

    Signed-off-by: Mahesh Salgaonkar
    Acked-by: Stewart Smith
    Signed-off-by: Michael Ellerman

    Mahesh Salgaonkar
     

30 Jul, 2015

3 commits

  • Wire up the syscall number and regs so the tests work on powerpc.

    With the powerpc kernel support just merged, all tests pass on ppc64,
    ppc64 (compat), ppc64le, ppc, ppc64e and ppc64e (compat).

    Acked-by: Kees Cook
    Signed-off-by: Michael Ellerman

    Michael Ellerman
     
  • The seccomp_bpf test uses BPF_LD|BPF_W|BPF_ABS to load 32-bit values
    from seccomp_data->args. On big endian machines this will load the high
    word of the argument, which is not what the test wants.

    Borrow a hack from samples/seccomp/bpf-helper.h which changes the offset
    on big endian to account for this.
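
    The endianness issue can be reproduced in portable C. The sketch below does not use the seccomp headers; `low_word_offset()` and `load_low_word()` are hypothetical helpers that show where the low 32 bits of a 64-bit argument sit in memory on each byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of the low 32 bits inside a 64-bit value: 0 on
 * little endian, sizeof(uint32_t) on big endian. Detected at run
 * time here to keep the sketch portable. */
static size_t low_word_offset(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first ? 0 : sizeof(uint32_t);
}

/* Mimics a 32-bit absolute load aimed at the low word of a
 * 64-bit argument slot. */
static uint32_t load_low_word(const uint64_t *arg)
{
    uint32_t word;

    memcpy(&word, (const unsigned char *)arg + low_word_offset(),
           sizeof(word));
    return word;
}
```

    A load at offset 0 would return the high word on a big endian machine, which is the behaviour the test needed to avoid.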

    Signed-off-by: Michael Ellerman
    Acked-by: Kees Cook

    Michael Ellerman
     
  • This commit enables seccomp filter on powerpc, now that we have all the
    necessary pieces in place.

    To support seccomp's desire to modify the syscall return value under
    some circumstances, we use a different ABI to the ptrace ABI. That is we
    use r3 as the syscall return value, and orig_gpr3 is the first syscall
    parameter.

    This means the seccomp code, or a ptracer via SECCOMP_RET_TRACE, will
    see -ENOSYS preloaded in r3. This is identical to the behaviour on x86,
    and allows seccomp or the ptracer to either leave the -ENOSYS or change
    it to something else, as well as rejecting or not the syscall by
    modifying r0.

    If seccomp does not reject the syscall, we restore the register state to
    match what ptrace and audit expect, ie. r3 is the first syscall
    parameter again. We do this restore using orig_gpr3, which may have been
    modified by seccomp, which allows seccomp to modify the first syscall
    parameter and allow the syscall to proceed.

    We need to #ifdef the additional handling of r3 for seccomp, so move
    it all out of line.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman
     

29 Jul, 2015

5 commits

  • SIG_SYS was added in commit a0727e8ce513 "signal, x86: add SIGSYS info
    and make it synchronous."

    Because we use the asm-generic struct siginfo, we got support for
    SIG_SYS for free as part of that commit.

    However there was no compat handling added for powerpc. That means we've
    been advertising the existence of siginfo._sifields._sigsys to compat
    tasks, but not actually filling in the fields correctly.

    Luckily it looks like no one has noticed, presumably because the only
    user of SIGSYS in the kernel is seccomp filter, which we don't support
    yet.

    So before we enable seccomp filter, add compat handling for SIGSYS.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman
     
  • The documentation for syscall_get_nr() in asm-generic says:

    Note this returns int even on 64-bit machines. Only 32 bits of
    system call number can be meaningful. If the actual arch value
    is 64 bits, this truncates to 32 bits so 0xffffffff means -1.

    However our implementation was never updated to reflect this.

    Generally it's not important, but there is one case where it matters.

    For seccomp filter with SECCOMP_RET_TRACE, the tracer will set
    regs->gpr[0] to -1 to reject the syscall. When the task is a compat
    task, this means we end up with 0xffffffff in r0 because ptrace will
    zero extend the 32-bit value.

    If syscall_get_nr() returns an unsigned long, then a 64-bit kernel will
    see a positive value in r0 and will incorrectly allow the syscall
    through seccomp.
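
    The effect of the return type can be shown with a short sketch. The two functions are hypothetical models of the wide and narrow return types, not the kernel's syscall_get_nr(); the narrowing conversion relies on the wrap-around behaviour that GCC and Clang document for out-of-range conversions to int.

```c
#include <stdint.h>

/* Models returning the full 64-bit register: the zero-extended
 * 0xffffffff from a compat tracer looks like a large positive
 * syscall number. */
static int64_t nr_wide(uint64_t r0)
{
    return (int64_t)r0;
}

/* Models returning int: only the low 32 bits are meaningful, so
 * 0xffffffff becomes -1 and the syscall is rejected as intended. */
static int nr_narrow(uint64_t r0)
{
    return (int)(uint32_t)r0;
}
```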

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman
     
  • Currently syscall_get_arguments() is used by syscall tracepoints, and
    collect_syscall() which is used in some debugging as well as
    /proc/pid/syscall.

    The current implementation just copies regs->gpr[3 .. 5] out, which is
    fine for all the current use cases.

    When we enable seccomp filter, that will also start using
    syscall_get_arguments(). However for seccomp filter we want to use r3
    as the return value of the syscall, and orig_gpr3 as the first
    parameter. This will allow seccomp to modify the return value in r3.

    To support this we need to modify syscall_get_arguments() to return
    orig_gpr3 instead of r3. This is safe for all uses because orig_gpr3
    always contains the r3 value that was passed to the syscall. We store it
    in the syscall entry path and never modify it.

    Update syscall_set_arguments() while we're here, even though it's never
    used.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman
     
  • Currently syscall_get_arguments() has two loops, one for compat and one
    for regular tasks. In preparation for the next patch, which changes which
    registers we use, switch it to only have one loop, so we only have one
    place to update.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman
     
  • Currently the only caller of syscall_set_return_value() is seccomp
    filter, which is not enabled on powerpc.

    This means we have not noticed that our implementation of
    syscall_set_return_value() negates error, even though the value passed
    in is already negative.

    So remove the negation in syscall_set_return_value(), and expect the
    caller to do it like all other implementations do.

    Also add a comment about the ccr handling.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Kees Cook

    Michael Ellerman