20 Oct, 2018

3 commits

  • commit 8183d99f4a22c2abbc543847a588df3666ef0c0c upstream.

    Feature fixups need to use patch_instruction() early in the boot,
    even before the code is relocated to its final address, requiring
    patch_instruction() to use PTRRELOC() in order to address data.

    But feature fixups are applied to code before it is set read-only,
    even for modules. Therefore, feature fixups can use
    raw_patch_instruction() instead.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Reported-by: David Gounaris
    Tested-by: David Gounaris
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 96dc89d526ef77604376f06220e3d2931a0bfd58 ]

    Currently we store the userspace r1 in PACATMSCRATCH before finally
    saving it to the thread struct.

    In theory an exception could be taken here (like a machine check or
    SLB miss) that could write PACATMSCRATCH and hence corrupt the
    userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
    others do.

    We've never actually seen this happen but it's theoretically
    possible. Either way, the code is fragile as it is.

    This patch saves r1 to the kernel stack (which can't fault) before we
    turn MSR[RI] back on. PACATMSCRATCH is still used but only with
    MSR[RI] off. We then copy r1 from the kernel stack to the thread
    struct once we have MSR[RI] back on.

    Suggested-by: Breno Leitao
    Signed-off-by: Michael Neuling
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     
  • [ Upstream commit cf13435b730a502e814c63c84d93db131e563f5f ]

    When we treclaim we store the userspace checkpointed r13 to a scratch
    SPR and then later save the scratch SPR to the user thread struct.

    Unfortunately, this doesn't work as accessing the user thread struct
    can take an SLB fault and the SLB fault handler will write the same
    scratch SPRG that now contains the userspace r13.

    To fix this, we store r13 to the kernel stack (which can't fault)
    before we access the user thread struct.

    Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
    as a random userspace segfault with r13 looking like a kernel address.

    Signed-off-by: Michael Neuling
    Reviewed-by: Breno Leitao
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     

18 Oct, 2018

1 commit

  • commit 4628a64591e6cee181237060961e98c615c33966 upstream.

    Currently the _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a
    result we will see warnings such as:

    BUG: Bad page map in process JobWrk0013 pte:800001803875ea25 pmd:7624381067
    addr:00007f0930720000 vm_flags:280000f9 anon_vma: (null) mapping:ffff97f2384056f0 index:0
    file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage: (null)
    CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G W 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased)
    Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
    Call Trace:
    dump_stack+0x5a/0x75
    print_bad_pte+0x217/0x2c0
    ? enqueue_task_fair+0x76/0x9f0
    _vm_normal_page+0xe5/0x100
    zap_pte_range+0x148/0x740
    unmap_page_range+0x39a/0x4b0
    unmap_vmas+0x42/0x90
    unmap_region+0x99/0xf0
    ? vma_gap_callbacks_rotate+0x1a/0x20
    do_munmap+0x255/0x3a0
    vm_munmap+0x54/0x80
    SyS_munmap+0x1d/0x30
    do_syscall_64+0x74/0x150
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    ...

    when mprotect(2) gets used on DAX mappings. Also there is a wide variety
    of other failures that can result from the missing _PAGE_DEVMAP flag
    when the area gets used by get_user_pages() later.

    Fix the problem by including _PAGE_DEVMAP in a set of flags that get
    preserved by mprotect(2).

    Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP")
    Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64")
    Cc:
    Signed-off-by: Jan Kara
    Acked-by: Michal Hocko
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

13 Oct, 2018

3 commits

  • commit b45ba4a51cde29b2939365ef0c07ad34c8321789 upstream.

    Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init
    sections") accesses 'init_mem_is_free' flag too early, before the
    kernel is relocated. This provokes early boot failure (before the
    console is active).

    As it is not necessary to do this verification that early, this
    patch moves the test into patch_instruction() instead of
    __patch_instruction().

    This modification also has the advantage of avoiding unnecessary
    remappings.

    Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections")
    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 51c3c62b58b357e8d35e4cc32f7b4ec907426fe3 upstream.

    This stops us from doing code patching in init sections after they've
    been freed.

    In this chain:
    kvm_guest_init() ->
    kvm_use_magic_page() ->
    fault_in_pages_readable() ->
    __get_user() ->
    __get_user_nocheck() ->
    barrier_nospec();

    We have a code patching location at barrier_nospec() and
    kvm_guest_init() is an init function. This whole chain gets inlined,
    so when we free the init section (hence kvm_guest_init()), this code
    goes away and hence should no longer be patched.

    We've seen this as userspace memory corruption when using a memory
    checker while doing partition migration testing on powervm (this
    starts the code patching post migration via
    /sys/kernel/mobility/migration). In theory, it could also happen when
    using /sys/kernel/debug/powerpc/barrier_nospec.

    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Michael Neuling
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     
  • commit 8cf4c05712f04a405f0dacebcca8f042b391694a upstream.

    patch_instruction() uses almost the same sequence as
    __patch_instruction().

    This patch refactors it so that patch_instruction() uses
    __patch_instruction() instead of duplicating code.

    Signed-off-by: Christophe Leroy
    Acked-by: Balbir Singh
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

10 Oct, 2018

1 commit

  • [ Upstream commit 46dec40fb741f00f1864580130779aeeaf24fb3d ]

    This fixes a bug which causes guest virtual addresses to get translated
    to guest real addresses incorrectly when the guest is using the HPT MMU
    and has more than 256GB of RAM, or more specifically has a HPT larger
    than 2GB. This has showed up in testing as a failure of the host to
    emulate doorbell instructions correctly on POWER9 for HPT guests with
    more than 256GB of RAM.

    The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate()
    is stored as an int, and in forming the HPTE address, the index gets
    shifted left 4 bits as an int before being signed-extended to 64 bits.
    The simple fix is to make the variable a long int, matching the
    return type of kvmppc_hv_find_lock_hpte(), which is what calculates
    the index.

    Fixes: 697d3899dcb4 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
    Signed-off-by: Paul Mackerras
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     

04 Oct, 2018

2 commits

  • [ Upstream commit d3d4ffaae439981e1e441ebb125aa3588627c5d8 ]

    We use PHB in mode1 which uses bit 59 to select a correct DMA window.
    However there is mode2, which uses bits 59:55 and allows up to 32 DMA
    windows per PE.

    Even though the documentation does not clearly specify that, it seems
    that the actual hardware does not support bits 59:55 even in mode1; in
    other words, we can create a window as big as 1<
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     
  • [ Upstream commit 8950329c4a64c6d3ca0bc34711a1afbd9ce05657 ]

    Memory reservation for crashkernel could fail if there are holes around
    kdump kernel offset (128M). Fail gracefully in such cases and print an
    error message.

    Signed-off-by: Hari Bathini
    Tested-by: David Gibson
    Reviewed-by: Dave Young
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hari Bathini
     

26 Sep, 2018

2 commits

  • [ Upstream commit 51eaa08f029c7343df846325d7cf047be8b96e81 ]

    The call to of_find_compatible_node() returns a pointer with an
    incremented refcount, so it must be explicitly decremented after the
    last use. As the node is only being used here to check for presence,
    and the result is not actually used in the success path, the
    reference can be dropped immediately.

    Signed-off-by: Nicholas Mc Guire
    Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
    Signed-off-by: Paul Mackerras
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Mc Guire
     
  • [ Upstream commit bd90284cc6c1c9e8e48c8eadd0c79574fcce0b81 ]

    The intention here is to consume and discard the remaining buffer
    upon error. This works if there has not been a previous partial write.
    If there has been, then total_len is no longer the total number of
    bytes to copy. total_len is always "bytes left to copy", so it should
    be added to the written bytes.

    This code may not be exercised any more if partial writes will not be
    hit, but this is a small bugfix before a larger change.

    Reviewed-by: Benjamin Herrenschmidt
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     

20 Sep, 2018

1 commit

  • [ Upstream commit 9eab9901b015f489199105c470de1ffc337cfabb ]

    We've encountered a performance issue when multiple processors stress
    {get,put}_mmio_atsd_reg(). These functions contend for
    mmio_atsd_usage, an unsigned long used as a bitmask.

    The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
    and clear_bit_unlock(). As implemented, both of these will require
    a (successful) stwcx to that same cache line.

    What we end up with is thread A, attempting to unlock, being slowed by
    other threads repeatedly attempting to lock. A's stwcx instructions
    fail and retry because the memory reservation is lost every time a
    different thread beats it to the punch.

    There may be a long-term way to fix this at a larger scale, but for
    now resolve the immediate problem by gating our call to
    test_and_set_bit_lock() with one to test_bit(), which is obviously
    implemented without using a store.

    Fixes: 1ab66d1fbada ("powerpc/powernv: Introduce address translation services for Nvlink2")
    Signed-off-by: Reza Arbab
    Acked-by: Alistair Popple
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Reza Arbab
     

15 Sep, 2018

5 commits

  • [ Upstream commit 74e96bf44f430cf7a01de19ba6cf49b361cdfd6e ]

    The global mce data buffer used to copy the rtas error log is 2048
    (RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
    extended_log_length from the rtas error log header, then use the max of
    extended_log_length and RTAS_ERROR_LOG_MAX as the size of data to be
    copied. Ideally the platform (phyp) will never send an extended error
    log with size > 2048. But if that happens, then we have a risk of buffer
    overrun and corruption. Fix this by using min_t instead.

    Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
    Reported-by: Michal Suchanek
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     
  • [ Upstream commit 78ee9946371f5848ddfc88ab1a43867df8f17d83 ]

    Because rfi_flush_fallback runs immediately before the return to
    userspace it currently runs with the user r1 (stack pointer). This
    means if we oops in there we will report a bad kernel stack pointer in
    the exception entry path, eg:

    Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
    Oops: Bad kernel stack pointer, sig: 6 [#1]
    LE SMP NR_CPUS=32 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
    NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
    REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
    MSR: 9000000002803031 CR: 44000442 XER: 20000000
    CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
    GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
    ...
    NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
    LR [0000000010053e00] 0x10053e00

    Although the NIP tells us where we were, and the TRAP number tells us
    what happened, it would still be nicer if we could report the actual
    exception rather than barfing about the stack pointer.

    We can do that fairly simply by loading the kernel stack pointer on
    entry and restoring the user value before returning. That way we see a
    regular oops such as:

    Unrecoverable exception 4100 at c00000000000239c
    Oops: Unrecoverable exception, sig: 6 [#1]
    LE SMP NR_CPUS=32 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
    NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
    REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
    MSR: 9000000002803031 CR: 44000442 XER: 20000000
    CFAR: c00000000000bac8 IRQMASK: 0
    ...
    NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
    LR [0000000010053e00] 0x10053e00
    Call Trace:
    [c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)

    Note this shouldn't make the kernel stack pointer vulnerable to a
    meltdown attack, because it should be flushed from the cache before we
    return to userspace. The user r1 value will be in the cache, because
    we load it in the return path, but that is harmless.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • [ Upstream commit f5daf77a55ef0e695cc90c440ed6503073ac5e07 ]

    Fix build errors and warnings in t1042rdb_diu.c by adding header files
    and MODULE_LICENSE().

    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: data definition has no type or storage class
    early_initcall(t1042rdb_diu_init);
    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: error: type defaults to 'int' in declaration of 'early_initcall' [-Werror=implicit-int]
    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: parameter names (without types) in function declaration

    and
    WARNING: modpost: missing MODULE_LICENSE() in arch/powerpc/platforms/85xx/t1042rdb_diu.o

    Signed-off-by: Randy Dunlap
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Scott Wood
    Cc: Kumar Gala
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • [ Upstream commit c42d3be0c06f0c1c416054022aa535c08a1f9b39 ]

    The problem is that the calculation should be "end - start + 1", but
    the plus one is missing.

    Fixes: 8626816e905e ("powerpc: add support for MPIC message register API")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Tyrel Datwyler
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • [ Upstream commit f7a6947cd49b7ff4e03f1b4f7e7b223003d752ca ]

    Currently if you build a 32-bit powerpc kernel and use get_user() to
    load a u64 value it will fail to build with eg:

    kernel/rseq.o: In function `rseq_get_rseq_cs':
    kernel/rseq.c:123: undefined reference to `__get_user_bad'

    This is hitting the check in __get_user_size() that makes sure the
    size we're copying doesn't exceed the size of the destination:

    #define __get_user_size(x, ptr, size, retval)
    do {
    retval = 0;
    __chk_user_ptr(ptr);
    if (size > sizeof(x))
    (x) = __get_user_bad();

    Which doesn't immediately make sense because the size of the
    destination is u64, but it's not really, because __get_user_check()
    etc. internally create an unsigned long and copy into that:

    #define __get_user_check(x, ptr, size)
    ({
    long __gu_err = -EFAULT;
    unsigned long __gu_val = 0;

    The problem being that on 32-bit unsigned long is not big enough to
    hold a u64. We can fix this with a trick from hpa in the x86 code, we
    statically check the type of x and set the type of __gu_val to either
    unsigned long or unsigned long long.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     

10 Sep, 2018

4 commits

  • commit 8cfbdbdc24815417a3ab35101ccf706b9a23ff17 upstream.

    Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
    the pinned physical page", 2018-07-17) added some checks to ensure
    that guest DMA mappings don't attempt to map more than the guest is
    entitled to access. However, errors in the logic mean that legitimate
    guest requests to map pages for DMA are being denied in some
    situations. Specifically, if the first page of the range passed to
    mm_iommu_get() is mapped with a normal page, and subsequent pages are
    mapped with transparent huge pages, we end up with mem->pageshift ==
    0. That means that the page size checks in mm_iommu_ua_to_hpa() and
    mm_iommu_ua_to_hpa_rm() will always fail for every page in that
    region, and thus the guest can never map any memory in that region for
    DMA, typically leading to a flood of error messages like this:

    qemu-system-ppc64: VFIO_MAP_DMA: -22
    qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)

    The logic errors in mm_iommu_get() are:

    (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
    call (meaning that find_linux_pte() returns the pte for the
    first address in the range, not the address we are currently up
    to);
    (b) use of 'pageshift' as the variable to receive the hugepage shift
    returned by find_linux_pte() - for a normal page this gets set
    to 0, leading to us setting mem->pageshift to 0 when we conclude
    that the pte returned by find_linux_pte() didn't match the page
    we were looking at;
    (c) comparing 'compshift', which is a page order, i.e. log base 2 of
    the number of pages, with 'pageshift', which is a log base 2 of
    the number of bytes.

    To fix these problems, this patch introduces 'cur_ua' to hold the
    current user address and uses that in the find_linux_pte() call;
    introduces 'pteshift' to hold the hugepage shift found by
    find_linux_pte(); and compares 'pteshift' with 'compshift +
    PAGE_SHIFT' rather than 'compshift'.

    The patch also moves the local_irq_restore to the point after the PTE
    pointer returned by find_linux_pte() has been dereferenced because
    otherwise the PTE could change underneath us, and adds a check to
    avoid doing the find_linux_pte() call once mem->pageshift has been
    reduced to PAGE_SHIFT, as an optimization.

    Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     
  • commit db2173198b9513f7add8009f225afa1f1c79bcc6 upstream.

    The generic code is racy when multiple children of a PCI bridge try to
    enable it simultaneously.

    This leads to drivers trying to access a device through a
    not-yet-enabled bridge, and thus to EEH errors under various
    circumstances when using parallel driver probing.

    There is work going on to fix that properly in the PCI core but it
    will take some time.

    x86 gets away with it because (outside of hotplug), the BIOS enables
    all the bridges at boot time.

    This patch does the same thing on powernv by enabling all bridges that
    have child devices at boot time, thus avoiding subsequent races. It's
    suitable for backporting to stable and distros, while the proper PCI
    fix will probably be significantly more invasive.

    Signed-off-by: Benjamin Herrenschmidt
    Cc: stable@vger.kernel.org
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     
  • commit cd813e1cd7122f2c261dce5b54d1e0c97f80e1a5 upstream.

    During a Machine Check interrupt on the pseries platform, register r3
    points to the RTAS extended event log passed by the hypervisor. Since
    the hypervisor uses r3 to pass the pointer to the rtas log, it stores
    the original r3 value at the start of the memory (first 8 bytes)
    pointed to by r3. Since the hypervisor stores this info and the rtas
    log is in BE format, Linux should make sure to restore the r3 value in
    the correct endian format.

    Without this patch, when the MCE handler, after recovery, returns to
    the code that caused the MCE, it may end up with a Data SLB access
    interrupt for an invalid address, followed by a kernel panic or hang.

    Severe Machine check interrupt [Recovered]
    NIP [d00000000ca301b8]: init_module+0x1b8/0x338 [bork_kernel]
    Initiator: CPU
    Error type: SLB [Multihit]
    Effective address: d00000000ca70000
    cpu 0xa: Vector: 380 (Data SLB Access) at [c0000000fc7775b0]
    pc: c0000000009694c0: vsnprintf+0x80/0x480
    lr: c0000000009698e0: vscnprintf+0x20/0x60
    sp: c0000000fc777830
    msr: 8000000002009033
    dar: a803a30c000000d0
    current = 0xc00000000bc9ef00
    paca = 0xc00000001eca5c00 softe: 3 irq_happened: 0x01
    pid = 8860, comm = insmod
    vscnprintf+0x20/0x60
    vprintk_emit+0xb4/0x4b0
    vprintk_func+0x5c/0xd0
    printk+0x38/0x4c
    init_module+0x1c0/0x338 [bork_kernel]
    do_one_initcall+0x54/0x230
    do_init_module+0x8c/0x248
    load_module+0x12b8/0x15b0
    sys_finit_module+0xa8/0x110
    system_call+0x58/0x6c
    --- Exception: c00 (System Call) at 00007fff8bda0644
    SP (7fffdfbfe980) is in userspace

    This patch fixes this issue.

    Fixes: a08a53ea4c97 ("powerpc/le: Enable RTAS events support")
    Cc: stable@vger.kernel.org # v3.15+
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     
  • commit 1bd6a1c4b80a28d975287630644e6b47d0f977a5 upstream.

    Crash memory ranges is an array of memory ranges of the crashing kernel
    to be exported as a dump via /proc/vmcore file. The size of the array
    is set based on INIT_MEMBLOCK_REGIONS, which works alright in most cases
    where memblock memory regions count is less than INIT_MEMBLOCK_REGIONS
    value. But this count can grow beyond INIT_MEMBLOCK_REGIONS value since
    commit 142b45a72e22 ("memblock: Add array resizing support").

    On large memory systems with a few DLPAR operations, the memblock memory
    regions count could be larger than INIT_MEMBLOCK_REGIONS value. On such
    systems, registering fadump results in crash or other system failures
    like below:

    task: c00007f39a290010 ti: c00000000b738000 task.ti: c00000000b738000
    NIP: c000000000047df4 LR: c0000000000f9e58 CTR: c00000000010f180
    REGS: c00000000b73b570 TRAP: 0300 Tainted: G L X (4.4.140+)
    MSR: 8000000000009033 CR: 22004484 XER: 20000000
    CFAR: c000000000008500 DAR: 000007a450000000 DSISR: 40000000 SOFTE: 0
    ...
    NIP [c000000000047df4] smp_send_reschedule+0x24/0x80
    LR [c0000000000f9e58] resched_curr+0x138/0x160
    Call Trace:
    resched_curr+0x138/0x160 (unreliable)
    check_preempt_curr+0xc8/0xf0
    ttwu_do_wakeup+0x38/0x150
    try_to_wake_up+0x224/0x4d0
    __wake_up_common+0x94/0x100
    ep_poll_callback+0xac/0x1c0
    __wake_up_common+0x94/0x100
    __wake_up_sync_key+0x70/0xa0
    sock_def_readable+0x58/0xa0
    unix_stream_sendmsg+0x2dc/0x4c0
    sock_sendmsg+0x68/0xa0
    ___sys_sendmsg+0x2cc/0x2e0
    __sys_sendmsg+0x5c/0xc0
    SyS_socketcall+0x36c/0x3f0
    system_call+0x3c/0x100

    as array index overflow is not checked for while setting up crash
    memory ranges, causing memory corruption. To resolve this issue,
    dynamically allocate memory for crash memory ranges and resize it
    incrementally, in units of page size, on hitting the array size limit.

    Fixes: 2df173d9e85d ("fadump: Initialize elfcore header and add PT_LOAD program headers.")
    Cc: stable@vger.kernel.org # v3.4+
    Signed-off-by: Hari Bathini
    Reviewed-by: Mahesh Salgaonkar
    [mpe: Just use PAGE_SIZE directly, fixup variable placement]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Hari Bathini
     

05 Sep, 2018

1 commit

  • [ Upstream commit b9c1e60e7bf4e64ac1b4f4d6d593f0bb57886973 ]

    None of the JITs is allowed to implement exit paths from the BPF
    insn mappings other than BPF_JMP | BPF_EXIT. In the BPF core code
    we have a couple of rewrites in eBPF (e.g. LD_ABS / LD_IND) and
    in eBPF to cBPF translation to retain old existing behavior where
    exceptions may occur; they are also tightly controlled by the
    verifier where it disallows some of the features such as BPF to
    BPF calls when legacy LD_ABS / LD_IND ops are present in the BPF
    program. During recent review of all BPF_XADD JIT implementations
    I noticed that the ppc64 one is buggy in that it contains two
    jumps to exit paths. This is problematic as this can bypass verifier
    expectations e.g. pointed out in commit f6b1b3bf0d5f ("bpf: fix
    subprog verifier bypass by div/mod by 0 exception"). The first
    exit path is obsoleted by the fix in ca36960211eb ("bpf: allow xadd
    only on aligned memory") anyway, and for the second one we need to
    do a fetch, add and store loop if the reservation from lwarx/ldarx
    was lost in the meantime.

    Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
    Reviewed-by: Naveen N. Rao
    Reviewed-by: Sandipan Das
    Tested-by: Sandipan Das
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

03 Aug, 2018

12 commits

  • [ Upstream commit 9dcb3df4281876731e4e8bff7940514d72375154 ]

    The interrupt controller inside the Wii's Hollywood chip is connected to
    two masters, the "Broadway" PowerPC and the "Starlet" ARM926, each with
    their own interrupt status and mask registers.

    When booting the Wii with mini[1], interrupts from the SD card
    controller (IRQ 7) are handled by the ARM, because mini provides SD
    access over IPC. Linux however can't currently use or disable this IPC
    service, so both sides try to handle IRQ 7 without coordination.

    Let's instead make sure that all interrupts that are unmasked on the PPC
    side are masked on the ARM side; this will also make sure that Linux can
    properly talk to the SD card controller (and potentially other devices).

    If access to a device through IPC is desired in the future, interrupts
    from that device should not be handled by Linux directly.

    [1]: https://github.com/lewurm/mini

    Signed-off-by: Jonathan Neuschäfer
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jonathan Neuschäfer
     
  • [ Upstream commit 4ea69b2fd623dee2bbc77d3b6b7d8c0924e2026a ]

    For multi-function programs, loading the address of a callee
    function to a register requires emitting instructions whose
    count varies from one to five depending on the nature of the
    address.

    Since we come to know of the callee's address only before the
    extra pass, the number of instructions required to load this
    address may vary from what was previously generated. This can
    make the JITed image grow or shrink.

    To avoid this, we should always generate a constant five-instruction
    sequence when loading function addresses, by padding the optimized
    load sequence with NOPs.

    Signed-off-by: Sandipan Das
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sandipan Das
     
  • [ Upstream commit e4ccb1dae6bdef228d729c076c38161ef6e7ca34 ]

    New binutils generate the following warning

    AS arch/powerpc/kernel/head_8xx.o
    arch/powerpc/kernel/head_8xx.S: Assembler messages:
    arch/powerpc/kernel/head_8xx.S:916: Warning: invalid register expression

    This patch fixes it.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit eae5f709a4d738c52b6ab636981755d76349ea9e ]

    __printf is useful to verify format and arguments. Fix arg mismatch
    reported by gcc, remove the following warnings (with W=1):

    arch/powerpc/kernel/prom_init.c:1467:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1471:31: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1504:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1505:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1506:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1507:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1508:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1509:33: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:1975:39: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘unsigned int’
    arch/powerpc/kernel/prom_init.c:1986:27: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:2567:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:2567:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:2569:38: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 2 has type ‘long unsigned int’
    arch/powerpc/kernel/prom_init.c:2569:46: error: format ‘%x’ expects argument of type ‘unsigned int’, but argument 3 has type ‘long unsigned int’

    The patch also includes argument mismatch fixes for the
    #define DEBUG_PROM case (warnings not listed here).

    This patch also fixes the following warnings revealed by checkpatch:

    WARNING: Prefer using '"%s...", __func__' to using 'alloc_up', this function's name, in a string
    #101: FILE: arch/powerpc/kernel/prom_init.c:1235:
    + prom_debug("alloc_up(%lx, %lx)\n", size, align);

    and

    WARNING: Prefer using '"%s...", __func__' to using 'alloc_down', this function's name, in a string
    #138: FILE: arch/powerpc/kernel/prom_init.c:1278:
    + prom_debug("alloc_down(%lx, %lx, %s)\n", size, align,
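    The underlying fix is mechanical: a `long unsigned int` argument
    needs `%lx`/`%lu`, not `%x`/`%u`. A minimal userspace sketch of the
    same rule (standard snprintf(); the kernel's prom_printf() formats
    are checked the same way, and the helper name here is illustrative):

```c
#include <stdio.h>

/* Using %x here would reproduce the -Wformat error quoted above:
 * %x expects unsigned int, but addr is unsigned long.  %lx matches. */
static int format_addr(char *buf, size_t n, unsigned long addr)
{
        return snprintf(buf, n, "addr=%lx", addr);
}
```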

    Signed-off-by: Mathieu Malaterre
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit 5a4b475cf8511da721f20ba432c244061db7139f ]

    Since the value of x is never intended to be read, declare it with gcc
    attribute as unused. Fix warning treated as error with W=1:

    arch/powerpc/platforms/powermac/bootx_init.c:471:21: error: variable ‘x’ set but not used [-Werror=unused-but-set-variable]
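    The fix is a one-word annotation. A sketch of the pattern (the
    kernel spells it `__maybe_unused`, from <linux/compiler.h>; here it
    is expanded to the underlying gcc attribute, and the function is a
    hypothetical stand-in for the bootx_init.c code):

```c
#define __maybe_unused __attribute__((unused))

/* The read itself is needed (it has a side effect on the volatile
 * location); the value is not, so mark the variable accordingly to
 * silence -Wunused-but-set-variable. */
static int read_and_discard(volatile int *reg)
{
        int x __maybe_unused;

        x = *reg;
        return 0;
}
```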

    Suggested-by: Christophe Leroy
    Signed-off-by: Mathieu Malaterre
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit f72cf3f1d49f2c35d6cb682af2e8c93550f264e4 ]

    Add a missing prototype for function `note_bootable_part` to silence a
    warning treated as error with W=1:

    arch/powerpc/platforms/powermac/setup.c:361:12: error: no previous prototype for ‘note_bootable_part’ [-Werror=missing-prototypes]
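    With -Werror=missing-prototypes, defining an external function
    without a prior declaration is an error; the fix is to declare it
    first, normally in a header. A sketch with a hypothetical function
    name:

```c
/* The added prototype; in the kernel it lives in a header (or just
 * above the definition) so the definition below has been declared. */
int find_boot_partition(int partition);

int find_boot_partition(int partition)
{
        return partition >= 0 ? partition : -1;
}
```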

    Suggested-by: Christophe Leroy
    Signed-off-by: Mathieu Malaterre
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit b87a358b4a1421abd544c0b554b1b7159b2b36c0 ]

    Add a missing include .

    These functions can all be static, make it so. Fix warnings treated as
    errors with W=1:

    arch/powerpc/platforms/chrp/time.c:41:13: error: no previous prototype for ‘chrp_time_init’ [-Werror=missing-prototypes]
    arch/powerpc/platforms/chrp/time.c:66:5: error: no previous prototype for ‘chrp_cmos_clock_read’ [-Werror=missing-prototypes]
    arch/powerpc/platforms/chrp/time.c:74:6: error: no previous prototype for ‘chrp_cmos_clock_write’ [-Werror=missing-prototypes]
    arch/powerpc/platforms/chrp/time.c:86:5: error: no previous prototype for ‘chrp_set_rtc_time’ [-Werror=missing-prototypes]
    arch/powerpc/platforms/chrp/time.c:130:6: error: no previous prototype for ‘chrp_get_rtc_time’ [-Werror=missing-prototypes]
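    For functions used only within one file, the alternative fix is the
    `static` keyword: the symbol becomes file-local and
    -Wmissing-prototypes no longer applies. A sketch with hypothetical
    stand-ins for the clock helpers:

```c
static unsigned char cmos[128];         /* fake CMOS register file */

/* static: file-local, so no external prototype is required */
static int clock_read(int addr)
{
        return cmos[addr & 127];
}

static void clock_write(int val, int addr)
{
        cmos[addr & 127] = (unsigned char)val;
}
```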

    Signed-off-by: Mathieu Malaterre
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit c89ca593220931c150cffda24b4d4ccf82f13fc8 ]

    The header file was missing from the includes. Fix the
    following warning, treated as error with W=1:

    arch/powerpc/kernel/pci_32.c:286:6: error: no previous prototype for ‘sys_pciconfig_iobase’ [-Werror=missing-prototypes]

    Signed-off-by: Mathieu Malaterre
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit 926bc2f100c24d4842b3064b5af44ae964c1d81c ]

    The stores to update the SLB shadow area must be made as they appear
    in the C code, so that the hypervisor does not see an entry with
    mismatched vsid and esid. Use WRITE_ONCE for this.

    GCC has been observed to elide the first store to esid in the update,
    which means that if the hypervisor interrupts the guest after storing
    to vsid, it could see an entry with old esid and new vsid, which may
    possibly result in memory corruption.
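    A userspace sketch of the pattern (the kernel's WRITE_ONCE() in
    <linux/compiler.h> is more elaborate, and the struct here is a
    simplified stand-in; the key point is the volatile access, which the
    compiler may neither elide nor reorder against other volatile
    accesses):

```c
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct slb_shadow_entry {
        unsigned long esid;
        unsigned long vsid;
};

static struct slb_shadow_entry shadow;

static void shadow_update(unsigned long esid, unsigned long vsid)
{
        /* Clear esid first so an observer (the hypervisor, in the
         * real code) never sees a stale esid paired with the new
         * vsid; the volatile stores cannot be merged or dropped. */
        WRITE_ONCE(shadow.esid, 0);
        WRITE_ONCE(shadow.vsid, vsid);
        WRITE_ONCE(shadow.esid, esid);
}
```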

    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit 46d4be41b987a6b2d25a2ebdd94cafb44e21d6c5 ]

    Correct two cases where eeh_pcid_get() is used to reference the driver's
    module but the reference is dropped before the driver pointer is used.

    In eeh_rmv_device() also refactor a little so that only two calls to
    eeh_pcid_put() are needed rather than three, and the reference isn't
    taken at all if it isn't needed.
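    The bug pattern, reduced to a sketch with hypothetical types
    (eeh_pcid_get()/eeh_pcid_put() pin the driver's module in the same
    get/use/put shape):

```c
struct demo_driver {
        int refs;
        int (*handler)(void);
};

static int ok_handler(void)
{
        return 1;
}

static struct demo_driver *driver_get(struct demo_driver *d)
{
        d->refs++;
        return d;
}

static void driver_put(struct demo_driver *d)
{
        d->refs--;
}

static int call_handler(struct demo_driver *d)
{
        struct demo_driver *drv = driver_get(d);
        int rc = drv->handler();        /* use while the ref is held */

        driver_put(drv);                /* drop it only afterwards */
        return rc;
}
```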

    Signed-off-by: Sam Bobroff
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sam Bobroff
     
  • [ Upstream commit a6b3964ad71a61bb7c61d80a60bea7d42187b2eb ]

    A no-op form of ori (or immediate of 0 into r31 and the result stored
    in r31) has been re-tasked as a speculation barrier. The instruction
    only acts as a barrier on newer machines with appropriate firmware
    support. On older CPUs it remains a harmless no-op.

    Implement barrier_nospec using this instruction.

    mpe: The semantics of the instruction are believed to be that it
    prevents execution of subsequent instructions until preceding branches
    have been fully resolved and are no longer executing speculatively.
    There is no further documentation available at this time.
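    A sketch of how such a barrier is typically used (the powerpc
    encoding is the one described above; the non-powerpc fallback and
    the bounds-check example are illustrative only):

```c
/* "ori 31,31,0" is an architectural no-op that newer firmware
 * re-tasks as a speculation barrier.  Elsewhere, fall back to a plain
 * compiler barrier so this sketch builds on any architecture. */
#ifdef __powerpc__
#define barrier_nospec() __asm__ __volatile__("ori 31,31,0" ::: "memory")
#else
#define barrier_nospec() __asm__ __volatile__("" ::: "memory")
#endif

static int bounds_checked_read(const int *array, int len, int idx)
{
        if (idx < 0 || idx >= len)
                return -1;
        barrier_nospec();       /* no speculation past the check */
        return array[idx];
}
```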

    Signed-off-by: Michal Suchanek
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michal Suchanek
     
  • [ Upstream commit 1128bb7813a896bd608fb622eee3c26aaf33b473 ]

    commit 87a156fb18fe1 ("Align hot loops of some string functions")
    degraded the performance of string functions by adding useless
    nops.

    A simple benchmark on an 8xx calling memchr() 100000 times with a
    match on the first byte runs in 41668 TB ticks before this patch
    and in 35986 TB ticks after it, an improvement of approximately
    10%.

    Another benchmark doing the same with a memchr() matching the 128th
    byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
    after it; so regardless of the number of loops, removing those
    useless nops improves the test by 5683 TB ticks.

    Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

28 Jul, 2018

1 commit

  • commit 76fa4975f3ed12d15762bc979ca44078598ed8ee upstream.

    A VM which has:
    - a DMA capable device passed through to it (eg. network card);
    - running a malicious kernel that ignores H_PUT_TCE failure;
    - capability of using IOMMU pages bigger than physical pages
    can create an IOMMU mapping that exposes (for example) 16MB of
    the host physical memory to the device when only 64K was allocated to the VM.

    The remaining 16MB - 64K will be some other content of host memory, possibly
    including pages of the VM, but also pages of host kernel memory, host
    programs or other VMs.

    The attacking VM does not control the location of the page it can map,
    and is only allowed to map as many pages as it has pages of RAM.

    We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
    an IOMMU page is contained in the physical page so the PCI hardware won't
    get access to unassigned host memory; however this check is missing in
    the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far
    and did not hit this yet: the very first time a mapping happens,
    tbl::it_userspace is not allocated yet, so we fall back to
    userspace, which in turn calls the VFIO IOMMU driver; this fails
    and the guest does not retry.

    This stores the smallest preregistered page size in the preregistered
    region descriptor and changes the mm_iommu_xxx API to check this against
    the IOMMU page size.

    This calculates the maximum page size as the minimum of the natural
    region alignment and the compound page size. For the page shift it
    uses the shift returned by find_linux_pte(), which indicates how the
    page is mapped in the current userspace: if the page is huge and the
    shift is non-zero, it is a leaf PTE and the page is mapped within
    the range.
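    The check itself reduces to a page-shift comparison. A sketch
    (hypothetical helper name; the real code compares the IOMMU page
    shift against the smallest preregistered host page shift):

```c
/* A TCE mapping of size (1UL << iommu_shift) is safe only if the
 * backing host page, of size (1UL << host_shift), fully contains it;
 * otherwise the device would see memory beyond the VM's allocation. */
static int iommu_page_contained(unsigned int iommu_shift,
                                unsigned int host_shift)
{
        return iommu_shift <= host_shift;
}
```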

    Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     

25 Jul, 2018

1 commit

  • commit b03897cf318dfc47de33a7ecbc7655584266f034 upstream.

    On 64-bit servers, SPRN_SPRG3 and its userspace read-only mirror
    SPRN_USPRG3 are used as userspace VDSO write and read registers
    respectively.

    SPRN_SPRG3 is lost when we enter stop4 and above, and is currently not
    restored. As a result, any read from SPRN_USPRG3 returns zero on an
    exit from stop4 (Power9 only) and above.

    Thus in this situation, on POWER9, any call to sched_getcpu() always
    returns zero, as on powerpc, we call __kernel_getcpu() which relies
    upon SPRN_USPRG3 to report the CPU and NUMA node information.

    Fix this by restoring SPRN_SPRG3 on wake up from a deep stop state
    with the sprg_vdso value that is cached in PACA.

    Fixes: e1c1cfed5432 ("powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle")
    Cc: stable@vger.kernel.org # v4.14+
    Reported-by: Florian Weimer
    Signed-off-by: Gautham R. Shenoy
    Reviewed-by: Michael Ellerman
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Gautham R. Shenoy
     

03 Jul, 2018

3 commits

  • commit 722cde76d68e8cc4f3de42e71c82fd40dea4f7b9 upstream.

    Unregister fadump on the kexec down path, otherwise the fadump
    registration in the new kexec-ed kernel complains that fadump is
    already registered. That makes the new kernel continue using the
    fadump registered by the previous kernel, which may lead to invalid
    vmcore generation. Fix this by un-registering fadump in
    fadump_cleanup(), which is called during the kexec path, so that the
    new kernel can register fadump with new, valid values.

    Fixes: b500afff11f6 ("fadump: Invalidate registration and release reserved memory for general use.")
    Cc: stable@vger.kernel.org # v3.4+
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     
  • commit ac9816dcbab53c57bcf1d7b15370b08f1e284318 upstream.

    Init all present CPUs for deep states instead of all possible CPUs.
    Init fails if a possible CPU is guarded, leaving only non-deep
    states available for cpuidle/hotplug.

    Stewart says, this means that for single threaded workloads, if you
    guard out a CPU core you'll not get WoF (Workload Optimised
    Frequency), which means that performance goes down when you wouldn't
    expect it to.
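    A sketch of the fix (hypothetical names; the kernel change amounts
    to iterating present CPUs rather than all possible CPUs):

```c
static int init_cpu(int cpu, const int guarded[])
{
        return guarded[cpu] ? -1 : 0;   /* guarded CPUs cannot init */
}

static int init_deep_states(int ncpus, const int present[],
                            const int guarded[])
{
        for (int cpu = 0; cpu < ncpus; cpu++) {
                if (!present[cpu])
                        continue;       /* the fix: skip absent CPUs */
                if (init_cpu(cpu, guarded))
                        return -1;      /* one failure aborts init */
        }
        return 0;
}
```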

    Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus")
    Cc: stable@vger.kernel.org # v3.19+
    Signed-off-by: Akshay Adiga
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Akshay Adiga
     
  • commit 75743649064ec0cf5ddd69f240ef23af66dde16e upstream.

    NX can set the 3rd bit in the CR register for XER[SO] (summary
    overflow), which is not related to the paste request. The current
    paste function returns failure for a successful request when this
    bit is set. So mask this bit out and check the proper return status.
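    A sketch of the masking (the big-endian CR0 layout is standard,
    but treat the macro names and values here as illustrative):

```c
#define CR0_SHIFT 28    /* CR0 is the top nibble of the 32-bit CR   */
#define CR0_MASK  0xE   /* keep LT/GT/EQ, drop the XER[SO] copy     */

/* Extract the paste status from CR0, ignoring the summary-overflow
 * bit that NX may set independently of the paste result. */
static int paste_status(unsigned int cr)
{
        return (cr >> CR0_SHIFT) & CR0_MASK;
}
```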

    Fixes: 2392c8c8c045 ("powerpc/powernv/vas: Define copy/paste interfaces")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Haren Myneni
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Haren Myneni