26 Jan, 2019

1 commit

  • [ Upstream commit 8d4a862276a9c30a269d368d324fb56529e6d5fd ]

    Currently xmon needs to take devtree_lock (through rtas_token()) during
    its invocation (at crash time). If there is a crash while devtree_lock
    is held, then xmon tries to take the lock, spins forever, and never
    gets into the interactive debugger, as in the following case:

    int *ptr = NULL;
    raw_spin_lock_irqsave(&devtree_lock, flags);
    *ptr = 0xdeadbeef;

    This patch avoids calling rtas_token(), and thus trying to take the
    same lock, at crash time. The new mechanism gets the token at
    initialization time (xmon_init()) and simply consumes it at crash
    time.

    This allows xmon to be invoked regardless of whether devtree_lock is
    held.
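    The init-time caching pattern described above can be sketched in a few
    lines. This is an illustrative userspace model, not the kernel's actual
    code; the names (xmon_rtas_token, fake_rtas_token, xmon_init_sketch)
    are hypothetical stand-ins.

```c
#include <assert.h>

/* Sketch: resolve the RTAS token once at init time (when taking
 * devtree_lock is safe) and only read the cached value at crash time,
 * so the crash path never touches the lock. */

static int xmon_rtas_token;        /* cached at init, read at crash time */

static int fake_rtas_token(void)   /* stands in for rtas_token(), which
                                    * takes devtree_lock internally */
{
    return 42;
}

static void xmon_init_sketch(void)
{
    xmon_rtas_token = fake_rtas_token();  /* lock taken here, safely */
}

static int xmon_crash_path_sketch(void)
{
    return xmon_rtas_token;               /* no lock needed at crash time */
}
```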

    Signed-off-by: Breno Leitao
    Reviewed-by: Thiago Jung Bauermann
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Breno Leitao
     

13 Jan, 2019

6 commits

  • commit e1c3743e1a20647c53b719dbf28b48f45d23f2cd upstream.

    On a signal handler return, the user could set a context with MSR[TS] bits
    set, and these bits would be copied to task regs->msr.

    At restore_tm_sigcontexts(), after the current task's regs->msr[TS]
    bits are set, several __get_user() calls are made and then a
    recheckpoint is executed.

    This is a problem since a page fault (in kernel space) could happen
    when calling __get_user(). If that happens, the process's MSR[TS] bits
    are already set, but the recheckpoint has not been executed and the
    SPRs are still invalid.

    The page fault can cause the current process to be de-scheduled with
    MSR[TS] active and without tm_recheckpoint() being called, and, more
    importantly, without the TEXASR[FS] bit set.

    Since TEXASR might not have the FS bit set, when the process is
    scheduled back it will try to reclaim, which will be aborted because
    the CPU is not in the suspended state, and then recheckpoint. This
    recheckpoint will restore thread->texasr into the TEXASR SPR, which
    might be zero, hitting a BUG_ON().

    kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
    cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
    pc: c000000000054550: restore_gprs+0xb0/0x180
    lr: 0000000000000000
    sp: c00000041f157950
    msr: 8000000100021033
    current = 0xc00000041f143000
    paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
    pid = 1021, comm = kworker/11:1
    kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
    Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
    enter ? for help
    [c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
    [c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
    [c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
    [c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
    [c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
    [c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
    [c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c

    This patch simply delays setting MSR[TS], so that if a page fault
    occurs in the __get_user() section, regs->msr[TS] is not yet set while
    the TM structures are still invalid, thus avoiding TM operations for
    in-kernel exceptions and a possible process reschedule.

    With this patch, the MSR[TS] will only be set just before recheckpointing
    and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
    invalid state.

    Other than that, if CONFIG_PREEMPT is set, a preemption could occur
    just after setting MSR[TS] and before tm_recheckpoint(), so this block
    must be atomic from a preemption perspective; hence the code calls
    preempt_disable()/preempt_enable() around it.

    It is not possible to move tm_recheckpoint earlier, because the
    checkpointed registers must first be fetched from userspace with
    __get_user(); thus the only way to avoid this undesired behavior is to
    delay setting MSR[TS].

    The 32-bit signal handler appears to be safe from this particular
    issue, but it might be exposed to the preemption issue, so preemption
    is disabled in that chunk of code as well.
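    The reordering this patch describes can be modeled in plain C. This is
    an illustrative simulation, not kernel code: the flags and function
    names are hypothetical, and it only demonstrates that with the fixed
    ordering a fault during the copies can never observe MSR[TS] set.

```c
#include <assert.h>
#include <stdbool.h>

static bool msr_ts_set;
static bool faulted_with_ts_set;

/* Models a __get_user() that may take a kernel page fault. */
static void get_user_may_fault(bool fault)
{
    if (fault && msr_ts_set)
        faulted_with_ts_set = true;   /* the buggy window */
}

/* Fixed ordering: copy everything first, then set TS and recheckpoint
 * as one preemption-atomic unit. */
static void restore_tm_fixed(bool fault_during_copy)
{
    msr_ts_set = false;
    faulted_with_ts_set = false;

    /* copy all checkpointed state from userspace first ... */
    get_user_may_fault(fault_during_copy);

    /* ... then, with preemption disabled, set TS; tm_recheckpoint()
     * would run immediately after, setting TEXASR[FS] = 1 */
    msr_ts_set = true;
}
```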

    Changes from v2:
    * Run the critical section with preempt_disable.

    Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
    Cc: stable@vger.kernel.org (v3.9+)
    Signed-off-by: Breno Leitao
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Breno Leitao
     
  • commit 813af51f5d30a2da6a2523c08465f9726e51772e upstream.

    Clang needs to be told which target it is building for when cross
    compiling.

    Link: https://github.com/ClangBuiltLinux/linux/issues/259
    Signed-off-by: Joel Stanley
    Tested-by: Daniel Axtens # powerpc 64-bit BE
    Acked-by: Michael Ellerman
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Masahiro Yamada
    [nc: Use 'ifeq ($(cc-name),clang)' instead of 'ifdef CONFIG_CC_IS_CLANG'
    because that config does not exist in 4.14; the Kconfig rewrite
    that added that config happened in 4.18]
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Joel Stanley
     
  • commit aea447141c7e7824b81b49acd1bc785506fba46e upstream.

    The powerpc kernel uses setjmp which causes a warning when building
    with clang:

    In file included from arch/powerpc/xmon/xmon.c:51:
    ./arch/powerpc/include/asm/setjmp.h:15:13: error: declaration of
    built-in function 'setjmp' requires inclusion of the header
    [-Werror,-Wbuiltin-requires-header]
    extern long setjmp(long *);
    ^
    ./arch/powerpc/include/asm/setjmp.h:16:13: error: declaration of
    built-in function 'longjmp' requires inclusion of the header
    [-Werror,-Wbuiltin-requires-header]
    extern void longjmp(long *, long);
    ^

    This *is* the header, and we're not using the built-in setjmp but
    rather the one in arch/powerpc/kernel/misc.S. As the compiler warning
    does not make sense, disable it for the files where setjmp is used.

    Signed-off-by: Joel Stanley
    Reviewed-by: Nick Desaulniers
    [mpe: Move subdir-ccflags in xmon/Makefile to not clobber -Werror]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Joel Stanley
     
  • commit 6977f95e63b9b3fb4a5973481a800dd9f48a1338 upstream.

    Signed-off-by: Nicholas Piggin
    Reviewed-by: Joel Stanley
    Signed-off-by: Michael Ellerman
    [nc: Adjust context due to lack of f2910f0e6835 and 2a056f58fd33]
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit 462951cd32e1496dc64b00051dfb777efc8ae5d8 ]

    For some configs the build fails with:

    arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers':
    arch/powerpc/mm/dump_linuxpagetables.c:306:39: error: 'PKMAP_BASE' undeclared (first use in this function)
    arch/powerpc/mm/dump_linuxpagetables.c:314:50: error: 'LAST_PKMAP' undeclared (first use in this function)

    These come from highmem.h, including that fixes the build.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Michael Ellerman
     
  • [ Upstream commit 5564597d51c8ff5b88d95c76255e18b13b760879 ]

    Commit 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper
    as a relocatable ET_DYN", 2011-04-12) changed the procedure descriptor
    at the start of crt0.S to have a hard-coded start address of 0x500000
    rather than a reference to _zimage_start, presumably because having
    a reference to a symbol introduced a relocation which is awkward to
    handle in a position-independent executable. Unfortunately, what is
    at 0x500000 in the COFF image is not the first instruction, but the
    procedure descriptor itself, that is, a word containing 0x500000,
    which is not a valid instruction. Hence, booting a COFF zImage
    results in a "DEFAULT CATCH!, code=FFF00700" message from Open
    Firmware.

    This fixes the problem by (a) putting the procedure descriptor in the
    data section and (b) adding a branch to _zimage_start as the first
    instruction in the program.

    Fixes: 6975a783d7b4 ("powerpc/boot: Allow building the zImage wrapper as a relocatable ET_DYN")
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Paul Mackerras
     

21 Dec, 2018

1 commit

  • commit 78e7b15e17ac175e7eed9e21c6f92d03d3b0a6fa upstream.

    The arch_teardown_msi_irqs() function assumes that controller ops
    pointers were already checked in arch_setup_msi_irqs(), but this
    assumption is wrong: arch_teardown_msi_irqs() can be called even when
    arch_setup_msi_irqs() returns an error (-ENOSYS).

    This can happen in the following scenario:
    - msi_capability_init() calls pci_msi_setup_msi_irqs()
    - pci_msi_setup_msi_irqs() returns -ENOSYS
    - msi_capability_init() notices the error and calls free_msi_irqs()
    - free_msi_irqs() calls pci_msi_teardown_msi_irqs()

    This is easier to see when CONFIG_PCI_MSI_IRQ_DOMAIN is not set and
    pci_msi_setup_msi_irqs() and pci_msi_teardown_msi_irqs() are just
    aliases to arch_setup_msi_irqs() and arch_teardown_msi_irqs().

    The call to free_msi_irqs() upon pci_msi_setup_msi_irqs() failure
    seems legit, as it does additional cleanup; e.g.
    list_del(&entry->list) and kfree(entry) inside free_msi_irqs() do
    happen (MSI descriptors are allocated before pci_msi_setup_msi_irqs()
    is called and need to be cleaned up if that fails).
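    The scenario above reduces to a simple rule: the teardown path must
    re-check the op pointers itself. A minimal userspace sketch, with
    hypothetical struct and function names (not the kernel's actual MSI
    types):

```c
#include <assert.h>
#include <stddef.h>
#include <errno.h>

struct fake_msi_ops {
    int  (*setup)(void);
    void (*teardown)(void);
};

static int arch_setup_sketch(struct fake_msi_ops *ops)
{
    if (!ops->setup)
        return -ENOSYS;        /* caller will still invoke teardown */
    return ops->setup();
}

static int arch_teardown_sketch(struct fake_msi_ops *ops)
{
    if (!ops->teardown)        /* the check the patch adds */
        return -1;
    ops->teardown();
    return 0;
}
```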

    Fixes: 6b2fd7efeb88 ("PCI/MSI/PPC: Remove arch_msi_check_device()")
    Cc: stable@vger.kernel.org # v3.18+
    Signed-off-by: Radu Rendec
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Radu Rendec
     

01 Dec, 2018

3 commits

  • [ Upstream commit 437ccdc8ce629470babdda1a7086e2f477048cbd ]

    When the VPHN function is not supported, the kernel prints the message
    'VPHN function not supported. Disabling polling...' on every CPU
    hotplug event. This floods dmesg when a KVM guest tries to hotplug a
    huge number of vcpus, so print the message only once and suppress
    further prints.
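    The print-once behaviour is the familiar printk_once()-style pattern.
    A userspace model of it (the counter stands in for the dmesg print;
    names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

static int warnings_emitted;

static void vphn_unsupported_warn_once(void)
{
    static bool warned;       /* latched on first call */

    if (!warned) {
        warned = true;
        warnings_emitted++;   /* stands in for the dmesg print */
    }
}
```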

    Signed-off-by: Satheesh Rajendran
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Satheesh Rajendran
     
  • [ Upstream commit 43c6494fa1499912c8177e71450c0279041152a6 ]

    Back in 2006 Ben added some workarounds for a misbehaviour in the
    Spider IO bridge used on early Cell machines, see commit
    014da7ff47b5 ("[POWERPC] Cell "Spider" MMIO workarounds"). Later these
    were made to be generic, ie. not tied specifically to Spider.

    The code stashes a token in the high bits (59-48) of virtual addresses
    used for IO (eg. returned from ioremap()). This works fine when using
    the Hash MMU, but when we're using the Radix MMU the bits used for the
    token overlap with some of the bits of the virtual address.

    This is because the maximum virtual address is larger with Radix, up
    to c00fffffffffffff, and in fact we use that high part of the address
    range for ioremap(), see RADIX_KERN_IO_START.

    As it happens the bits that are used overlap with the bits that
    differentiate an IO address vs a linear map address. If the resulting
    address lies outside the linear mapping we will crash (see below), if
    not we just corrupt memory.

    virtio-pci 0000:00:00.0: Using 64-bit direct DMA at offset 800000000000000
    Unable to handle kernel paging request for data at address 0xc000000080000014
    ...
    CFAR: c000000000626b98 DAR: c000000080000014 DSISR: 42000000 IRQMASK: 0
    GPR00: c0000000006c54fc c00000003e523378 c0000000016de600 0000000000000000
    GPR04: c00c000080000014 0000000000000007 0fffffff000affff 0000000000000030
    ^^^^
    ...
    NIP [c000000000626c5c] .iowrite8+0xec/0x100
    LR [c0000000006c992c] .vp_reset+0x2c/0x90
    Call Trace:
    .pci_bus_read_config_dword+0xc4/0x120 (unreliable)
    .register_virtio_device+0x13c/0x1c0
    .virtio_pci_probe+0x148/0x1f0
    .local_pci_probe+0x68/0x140
    .pci_device_probe+0x164/0x220
    .really_probe+0x274/0x3b0
    .driver_probe_device+0x80/0x170
    .__driver_attach+0x14c/0x150
    .bus_for_each_dev+0xb8/0x130
    .driver_attach+0x34/0x50
    .bus_add_driver+0x178/0x2f0
    .driver_register+0x90/0x1a0
    .__pci_register_driver+0x6c/0x90
    .virtio_pci_driver_init+0x2c/0x40
    .do_one_initcall+0x64/0x280
    .kernel_init_freeable+0x36c/0x474
    .kernel_init+0x24/0x160
    .ret_from_kernel_thread+0x58/0x7c

    This hasn't been a problem because CONFIG_PPC_IO_WORKAROUNDS, which
    enables this code, is usually not enabled. It is only enabled when
    selected by PPC_CELL_NATIVE, which is only selected by
    PPC_IBM_CELL_BLADE, and that in turn depends on BIG_ENDIAN. So in
    order to hit the bug you need to build a big endian kernel with IBM
    Cell Blade support enabled, as well as Radix MMU support, and then
    boot that on Power9 using the Radix MMU.

    Still we can fix the bug, so let's do that. We simply use fewer bits
    for the token; taking the union of the restrictions on the address
    from both Hash and Radix, we end up with 8 bits we can use for the
    token. The only user of the token is iowa_mem_find_bus() which only
    supports 8 token values, so 8 bits is plenty for that.
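    The collision described above is easy to check with mask arithmetic.
    This standalone demo uses the old token field (bits 59-48) from the
    commit text; the Hash-era address used for contrast is illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define OLD_TOKEN_MASK  0x0fff000000000000UL   /* token bits 59-48 */
#define RADIX_MAX_VADDR 0xc00fffffffffffffUL   /* max Radix kernel VA */

/* True if stashing a token in 'mask' would clobber real address bits. */
static int token_field_collides(uint64_t addr, uint64_t mask)
{
    return (addr & mask) != 0;
}
```

    With Radix, the top of the ioremap() range has real address bits inside
    the old token field, so the token corrupts the address; a typical
    Hash-era address does not.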

    Fixes: 566ca99af026 ("powerpc/mm/radix: Add dummy radix_enabled()")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Michael Ellerman
     
  • [ Upstream commit 28c5bcf74fa07c25d5bd118d1271920f51ce2a98 ]

    TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
    <trace/define_trace.h>, so like that #include, they should be outside
    #ifdef protection.

    They also need to be #undefed before defining, in case multiple trace
    headers are included by the same C file. This became the case on
    book3e after commit cf4a6085151a ("powerpc/mm: Add missing tracepoint for
    tlbie"), leading to the following build error:

    CC arch/powerpc/kvm/powerpc.o
    In file included from arch/powerpc/kvm/powerpc.c:51:0:
    arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
    [-Werror]
    #define TRACE_INCLUDE_PATH .
    ^
    In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
    from arch/powerpc/kvm/powerpc.c:48:
    ./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
    the previous definition
    #define TRACE_INCLUDE_PATH asm
    ^
    cc1: all warnings being treated as errors
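    The #undef-before-#define pattern the patch applies can be shown
    standalone: without the #undef, a second trace header redefining the
    macro trips -Werror, as in the build log above. (The stringification
    helpers are just for observing the final value.)

```c
#include <assert.h>
#include <string.h>

#define TRACE_INCLUDE_PATH asm   /* set by an earlier trace header */

#undef  TRACE_INCLUDE_PATH       /* the fix: drop any prior value */
#define TRACE_INCLUDE_PATH .     /* redefine cleanly, no warning */

#define STR_(x) #x
#define STR(x)  STR_(x)

static const char *trace_path = STR(TRACE_INCLUDE_PATH);
```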

    Reported-by: Christian Zigotzky
    Signed-off-by: Scott Wood
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Scott Wood
     

21 Nov, 2018

8 commits

  • [ Upstream commit 3f7daf3d7582dc6628ac40a9045dd1bbd80c5f35 ]

    When hot-removing memory, release_mem_region_adjustable() splits iomem
    resources if they are not the exact size of the memory being
    hot-removed. Adding this memory back to the kernel adds a new
    resource.

    Eg a node has memory 0x0 - 0xfffffffff. Hot-removing 1GB from
    0xf40000000 results in the single resource 0x0-0xfffffffff being split
    into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.

    When we hot-add the memory back we now have three resources:
    0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.

    This is an issue if we try to remove some memory that overlaps
    resources. Eg when trying to remove 2GB at address 0xf40000000,
    release_mem_region_adjustable() fails as it expects the chunk of memory
    to be within the boundaries of a single resource. We then get the
    warning: "Unable to release resource" and attempting to use memtrace
    again gives us this error: "bash: echo: write error: Resource
    temporarily unavailable"

    This patch makes memtrace remove memory in chunks that are always the
    same size from an address that is always equal to end_of_memory -
    n*size, for some n. So hotremoving and hotadding memory of different
    sizes will now not attempt to remove memory that spans multiple
    resources.

    Signed-off-by: Rashmica Gupta
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Rashmica Gupta
     
  • [ Upstream commit ee9d21b3b3583712029a0db65a4b7c081d08d3b3 ]

    When building with clang crt0's _zimage_start is not marked weak, which
    breaks the build when linking the kernel image:

    $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
    0000000000000058 g .text 0000000000000000 _zimage_start

    ld: arch/powerpc/boot/wrapper.a(crt0.o): in function '_zimage_start':
    (.text+0x58): multiple definition of '_zimage_start';
    arch/powerpc/boot/pseries-head.o:(.text+0x0): first defined here

    Clang requires the .weak directive to appear after the symbol is
    declared. The binutils manual says:

    This directive sets the weak attribute on the comma separated list of
    symbol names. If the symbols do not already exist, they will be
    created.

    So it appears this is different with clang. The only reference I could
    see for this was an OpenBSD mailing list post[1].

    Changing it to be after the declaration fixes building with Clang, and
    still works with GCC.

    $ objdump -t arch/powerpc/boot/crt0.o |grep _zimage_start$
    0000000000000058 w .text 0000000000000000 _zimage_start

    Reported to clang as https://bugs.llvm.org/show_bug.cgi?id=38921

    [1] https://groups.google.com/forum/#!topic/fa.openbsd.tech/PAgKKen2YCY

    Signed-off-by: Joel Stanley
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joel Stanley
     
  • [ Upstream commit 803d690e68f0c5230183f1a42c7d50a41d16e380 ]

    When a process allocates a hugepage, the following leak is reported by
    kmemleak. This is a false positive caused by the pointer to the table
    being stored in the PGD as a physical memory address rather than a
    virtual memory pointer.

    unreferenced object 0xc30f8200 (size 512):
    comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] huge_pte_alloc+0xdc/0x1f8
    [] hugetlb_fault+0x560/0x8f8
    [] follow_hugetlb_page+0x14c/0x44c
    [] __get_user_pages+0x1c4/0x3dc
    [] __mm_populate+0xac/0x140
    [] vm_mmap_pgoff+0xb4/0xb8
    [] ksys_mmap_pgoff+0xcc/0x1fc
    [] ret_from_syscall+0x0/0x38

    See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as
    memory leaks when using kmemleak") for detailed explanation.

    To fix that, this patch tells kmemleak to ignore the allocated
    hugepage table.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit f5e284803a7206d43e26f9ffcae5de9626d95e37 ]

    When enumerating page size definitions to check hardware support,
    we construct a constant which is (1U << (def->shift - 10)).

    However, the array of page size definitions is only initialised for
    the various MMU_PAGE_* constants, so it contains a number of
    0-initialised elements with def->shift == 0. This means we end up
    shifting by a very large number, which gives the following UBSan
    splat:

    ================================================================================
    UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
    shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
    Call Trace:
    [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
    [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
    [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
    [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
    [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
    [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
    ================================================================================

    Fix this by first checking if the element exists (shift != 0) before
    constructing the constant.
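    The fix above is a one-line guard. A standalone illustration (struct
    and function names are made up, not the kernel's):

```c
#include <assert.h>

struct page_size_def { unsigned int shift; };

/* Build a support mask, skipping zero-initialised slots: without the
 * check, shift == 0 makes (shift - 10) a huge unsigned shift count. */
static unsigned int supported_mask(const struct page_size_def *defs, int n)
{
    unsigned int mask = 0;

    for (int i = 0; i < n; i++) {
        if (!defs[i].shift)     /* the check the patch adds */
            continue;
        mask |= 1U << (defs[i].shift - 10);
    }
    return mask;
}
```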

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • [ Upstream commit f9bc28aedfb5bbd572d2d365f3095c1becd7209b ]

    If an error occurs during an unplug operation, it's possible for
    eeh_dump_dev_log() to be called when edev->pdn is null, which
    currently leads to dereferencing a null pointer.

    Handle this by skipping the error log for those devices.

    Signed-off-by: Sam Bobroff
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sam Bobroff
     
  • [ Upstream commit 0d923962ab69c27cca664a2d535e90ef655110ca ]

    When we're running on Book3S with the Radix MMU enabled the page table
    dump currently prints the wrong addresses because it uses the wrong
    start address.

    Fix it to use PAGE_OFFSET rather than KERN_VIRT_START.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • [ Upstream commit b851ba02a6f3075f0f99c60c4bc30a4af80cf428 ]

    The recent module relocation overflow crash demonstrated that we
    have no range checking on REL32 relative relocations. This patch
    implements a basic check, the same kernel that previously oopsed
    and rebooted now continues with some of these errors when loading
    the module:

    module_64: x_tables: REL32 527703503449812 out of range!

    Possibly other relocations (ADDR32, REL16, TOC16, etc.) should also have
    overflow checks.
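    The essence of such a range check: a REL32 value must fit in a signed
    32-bit field, otherwise the loader should reject the relocation rather
    than silently truncate it. A minimal sketch (not the kernel's actual
    module_64.c code):

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if 'value' fits in a signed 32-bit relocation field. */
static int rel32_in_range(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```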

    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • [ Upstream commit daf00ae71dad8aa05965713c62558aeebf2df48e ]

    Commit b96672dd840f ("powerpc: Machine check interrupt is a non-
    maskable interrupt") added a call to nmi_enter() at the beginning of
    the machine check restart exception handler. Due to that,
    in_interrupt() always returns true regardless of the state before
    entering the exception, and die() panics even when the system was not
    already in interrupt.

    This patch calls nmi_exit() before calling die() in order to restore
    the interrupt state we had before calling nmi_enter().

    Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

14 Nov, 2018

1 commit

  • commit 0f99153def98134403c9149128e59d3e1786cf04 upstream.

    mpic_get_primary_version() is not defined when not using MPIC.
    The compile error looks like:

    arch/powerpc/sysdev/built-in.o: In function `fsl_of_msi_probe':
    fsl_msi.c:(.text+0x150c): undefined reference to `fsl_mpic_primary_get_version'

    Signed-off-by: Jia Hongtao
    Signed-off-by: Scott Wood
    Reported-by: Radu Rendec
    Fixes: 807d38b73b6 ("powerpc/mpic: Add get_version API both for internal and external use")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

04 Nov, 2018

1 commit

  • [ Upstream commit c1e150ceb61e4a585bad156da15c33bfe89f5858 ]

    When CONFIG_NUMA is not set, the build fails with:

    arch/powerpc/platforms/pseries/hotplug-cpu.c:335:4:
    error: implicit declaration of function 'update_numa_cpu_lookup_table'

    So we have to add update_numa_cpu_lookup_table() as an empty function
    when CONFIG_NUMA is not set.

    Fixes: 1d9a090783be ("powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove")
    Signed-off-by: Corentin Labbe
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Corentin Labbe
     

20 Oct, 2018

3 commits

  • commit 8183d99f4a22c2abbc543847a588df3666ef0c0c upstream.

    Feature fixups need to use patch_instruction() early in boot, even
    before the code is relocated to its final address, requiring
    patch_instruction() to use PTRRELOC() in order to address data.

    But feature fixups are applied to code before it is set read-only,
    even for modules. Therefore, feature fixups can use
    raw_patch_instruction() instead.

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Reported-by: David Gounaris
    Tested-by: David Gounaris
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 96dc89d526ef77604376f06220e3d2931a0bfd58 ]

    Currently we store the userspace r1 in PACATMSCRATCH before finally
    saving it to the thread struct.

    In theory an exception could be taken here (like a machine check or
    SLB miss) that could write PACATMSCRATCH and hence corrupt the
    userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
    others do.

    We've never actually seen this happen but it's theoretically
    possible. Either way, the code is fragile as it is.

    This patch saves r1 to the kernel stack (which can't fault) before we
    turn MSR[RI] back on. PACATMSCRATCH is still used but only with
    MSR[RI] off. We then copy r1 from the kernel stack to the thread
    struct once we have MSR[RI] back on.

    Suggested-by: Breno Leitao
    Signed-off-by: Michael Neuling
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     
  • [ Upstream commit cf13435b730a502e814c63c84d93db131e563f5f ]

    When we treclaim we store the userspace checkpointed r13 to a scratch
    SPR and then later save the scratch SPR to the user thread struct.

    Unfortunately, this doesn't work as accessing the user thread struct
    can take an SLB fault and the SLB fault handler will write the same
    scratch SPRG that now contains the userspace r13.

    To fix this, we store r13 to the kernel stack (which can't fault)
    before we access the user thread struct.

    Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
    as a random userspace segfault with r13 looking like a kernel address.

    Signed-off-by: Michael Neuling
    Reviewed-by: Breno Leitao
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     

18 Oct, 2018

1 commit

  • commit 4628a64591e6cee181237060961e98c615c33966 upstream.

    Currently the _PAGE_DEVMAP bit is not preserved across mprotect(2)
    calls. As a result we see warnings such as:

    BUG: Bad page map in process JobWrk0013 pte:800001803875ea25 pmd:7624381067
    addr:00007f0930720000 vm_flags:280000f9 anon_vma: (null) mapping:ffff97f2384056f0 index:0
    file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage: (null)
    CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G W 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased)
    Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
    Call Trace:
    dump_stack+0x5a/0x75
    print_bad_pte+0x217/0x2c0
    ? enqueue_task_fair+0x76/0x9f0
    _vm_normal_page+0xe5/0x100
    zap_pte_range+0x148/0x740
    unmap_page_range+0x39a/0x4b0
    unmap_vmas+0x42/0x90
    unmap_region+0x99/0xf0
    ? vma_gap_callbacks_rotate+0x1a/0x20
    do_munmap+0x255/0x3a0
    vm_munmap+0x54/0x80
    SyS_munmap+0x1d/0x30
    do_syscall_64+0x74/0x150
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    ...

    when mprotect(2) gets used on DAX mappings. Also there is a wide variety
    of other failures that can result from the missing _PAGE_DEVMAP flag
    when the area gets used by get_user_pages() later.

    Fix the problem by including _PAGE_DEVMAP in a set of flags that get
    preserved by mprotect(2).
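    The fix amounts to adding a bit to a "flags that survive protection
    changes" mask. A userspace model with made-up bit values (F_DEVMAP
    stands in for _PAGE_DEVMAP; the real kernel helper is pte_modify()):

```c
#include <assert.h>
#include <stdint.h>

#define F_RW     (1UL << 0)
#define F_DEVMAP (1UL << 1)          /* stands in for _PAGE_DEVMAP */

#define CHG_MASK_BUGGY 0UL           /* devmap not in the preserved set */
#define CHG_MASK_FIXED F_DEVMAP      /* devmap preserved across mprotect */

/* Keep the bits in chg_mask, replace everything else with newprot. */
static uint64_t pte_modify_sketch(uint64_t pte, uint64_t newprot,
                                  uint64_t chg_mask)
{
    return (pte & chg_mask) | newprot;
}
```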

    Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP")
    Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64")
    Cc:
    Signed-off-by: Jan Kara
    Acked-by: Michal Hocko
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

13 Oct, 2018

3 commits

  • commit b45ba4a51cde29b2939365ef0c07ad34c8321789 upstream.

    Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init
    sections") accesses 'init_mem_is_free' flag too early, before the
    kernel is relocated. This provokes early boot failure (before the
    console is active).

    As it is not necessary to do this verification that early, this
    patch moves the test into patch_instruction() instead of
    __patch_instruction().

    This modification also has the advantage of avoiding unnecessary
    remappings.

    Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections")
    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 51c3c62b58b357e8d35e4cc32f7b4ec907426fe3 upstream.

    This stops us from doing code patching in init sections after they've
    been freed.

    In this chain:
    kvm_guest_init() ->
    kvm_use_magic_page() ->
    fault_in_pages_readable() ->
    __get_user() ->
    __get_user_nocheck() ->
    barrier_nospec();

    We have a code patching location at barrier_nospec() and
    kvm_guest_init() is an init function. This whole chain gets inlined,
    so when we free the init section (hence kvm_guest_init()), this code
    goes away and hence should no longer be patched.

    We have seen this as userspace memory corruption when using a memory
    checker while doing partition migration testing on PowerVM (this
    starts the code patching post-migration via
    /sys/kernel/mobility/migration). In theory, it could also happen when
    using /sys/kernel/debug/powerpc/barrier_nospec.
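    The guard this patch introduces can be sketched as a flag check before
    patching. Addresses and names below are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

static bool init_mem_is_free;

static bool in_init_section(unsigned long addr)
{
    return addr >= 0x1000 && addr < 0x2000;   /* fake init range */
}

/* Refuse to patch addresses in the init section once it has been freed:
 * the code there no longer exists, so patching it would corrupt whatever
 * now occupies those pages. */
static int patch_instruction_sketch(unsigned long addr)
{
    if (init_mem_is_free && in_init_section(addr))
        return -1;
    return 0;   /* would patch here */
}
```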

    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Michael Neuling
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     
  • commit 8cf4c05712f04a405f0dacebcca8f042b391694a upstream.

    patch_instruction() uses almost the same sequence as
    __patch_instruction().

    This patch refactors it so that patch_instruction() uses
    __patch_instruction() instead of duplicating code.

    Signed-off-by: Christophe Leroy
    Acked-by: Balbir Singh
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

10 Oct, 2018

1 commit

  • [ Upstream commit 46dec40fb741f00f1864580130779aeeaf24fb3d ]

    This fixes a bug which causes guest virtual addresses to get translated
    to guest real addresses incorrectly when the guest is using the HPT MMU
    and has more than 256GB of RAM, or more specifically has a HPT larger
    than 2GB. This has showed up in testing as a failure of the host to
    emulate doorbell instructions correctly on POWER9 for HPT guests with
    more than 256GB of RAM.

    The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate()
    is stored as an int, and in forming the HPTE address, the index gets
    shifted left 4 bits as an int before being sign-extended to 64 bits.
    The simple fix is to make the variable a long int, matching the
    return type of kvmppc_hv_find_lock_hpte(), which is what calculates
    the index.

    Fixes: 697d3899dcb4 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
    Signed-off-by: Paul Mackerras
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     

04 Oct, 2018

2 commits

  • [ Upstream commit d3d4ffaae439981e1e441ebb125aa3588627c5d8 ]

    We use PHB in mode1 which uses bit 59 to select a correct DMA window.
    However there is mode2 which uses bits 59:55 and allows up to 32 DMA
    windows per PE.

    Even though the documentation does not clearly specify it, it seems that
    the actual hardware does not support bits 59:55 even in mode1, in other
    words we can create a window as big as 1<<58 but DMA simply won't work.
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     
  • [ Upstream commit 8950329c4a64c6d3ca0bc34711a1afbd9ce05657 ]

    Memory reservation for crashkernel could fail if there are holes around
    kdump kernel offset (128M). Fail gracefully in such cases and print an
    error message.

    Signed-off-by: Hari Bathini
    Tested-by: David Gibson
    Reviewed-by: Dave Young
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hari Bathini
     

26 Sep, 2018

2 commits

  • [ Upstream commit 51eaa08f029c7343df846325d7cf047be8b96e81 ]

    The call to of_find_compatible_node() returns a pointer with an
    incremented refcount, so it must be explicitly decremented after the
    last use. Since here it is only used to check for node presence, and
    the result is not actually used in the success path, it can be
    dropped immediately.

    Signed-off-by: Nicholas Mc Guire
    Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
    Signed-off-by: Paul Mackerras
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Mc Guire
     
  • [ Upstream commit bd90284cc6c1c9e8e48c8eadd0c79574fcce0b81 ]

    The intention here is to consume and discard the remaining buffer
    upon error. This works if there has not been a previous partial write.
    If there has been, then total_len is no longer the total number of
    bytes to copy. total_len is always "bytes left to copy", so it should
    be added to the number of bytes already written.

    This code may not be exercised any more if partial writes will not be
    hit, but this is a small bugfix before a larger change.

    Reviewed-by: Benjamin Herrenschmidt
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     

20 Sep, 2018

1 commit

  • [ Upstream commit 9eab9901b015f489199105c470de1ffc337cfabb ]

    We've encountered a performance issue when multiple processors stress
    {get,put}_mmio_atsd_reg(). These functions contend for
    mmio_atsd_usage, an unsigned long used as a bitmask.

    The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
    and clear_bit_unlock(). As implemented, both of these will require
    a (successful) stwcx to that same cache line.

    What we end up with is thread A, attempting to unlock, being slowed by
    other threads repeatedly attempting to lock. A's stwcx instructions
    fail and retry because the memory reservation is lost every time a
    different thread beats it to the punch.

    There may be a long-term way to fix this at a larger scale, but for
    now resolve the immediate problem by gating our call to
    test_and_set_bit_lock() with one to test_bit(), which is obviously
    implemented without using a store.

    Fixes: 1ab66d1fbada ("powerpc/powernv: Introduce address translation services for Nvlink2")
    Signed-off-by: Reza Arbab
    Acked-by: Alistair Popple
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Reza Arbab
     

15 Sep, 2018

5 commits

  • [ Upstream commit 74e96bf44f430cf7a01de19ba6cf49b361cdfd6e ]

    The global mce data buffer used to copy the rtas error log is 2048
    (RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
    extended_log_length from the rtas error log header, then use the max of
    extended_log_length and RTAS_ERROR_LOG_MAX as the size of data to be copied.
    Ideally the platform (phyp) will never send extended error log with
    size > 2048. But if that happens, then we have a risk of buffer overrun
    and corruption. Fix this by using min_t instead.

    Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
    Reported-by: Michal Suchanek
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     
  • [ Upstream commit 78ee9946371f5848ddfc88ab1a43867df8f17d83 ]

    Because rfi_flush_fallback runs immediately before the return to
    userspace it currently runs with the user r1 (stack pointer). This
    means if we oops in there we will report a bad kernel stack pointer in
    the exception entry path, eg:

    Bad kernel stack pointer 7ffff7150e40 at c0000000000023b4
    Oops: Bad kernel stack pointer, sig: 6 [#1]
    LE SMP NR_CPUS=32 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1246 Comm: klogd Not tainted 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
    NIP: c0000000000023b4 LR: 0000000010053e00 CTR: 0000000000000040
    REGS: c0000000fffe7d40 TRAP: 4100 Not tainted (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
    MSR: 9000000002803031 CR: 44000442 XER: 20000000
    CFAR: c00000000000bac8 IRQMASK: c0000000f1e66a80
    GPR00: 0000000002000000 00007ffff7150e40 00007fff93a99900 0000000000000020
    ...
    NIP [c0000000000023b4] rfi_flush_fallback+0x34/0x80
    LR [0000000010053e00] 0x10053e00

    Although the NIP tells us where we were, and the TRAP number tells us
    what happened, it would still be nicer if we could report the actual
    exception rather than barfing about the stack pointer.

    We can do that fairly simply by loading the kernel stack pointer on
    entry and restoring the user value before returning. That way we see a
    regular oops such as:

    Unrecoverable exception 4100 at c00000000000239c
    Oops: Unrecoverable exception, sig: 6 [#1]
    LE SMP NR_CPUS=32 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1251 Comm: klogd Not tainted 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
    NIP: c00000000000239c LR: 0000000010053e00 CTR: 0000000000000040
    REGS: c0000000f1e17bb0 TRAP: 4100 Not tainted (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
    MSR: 9000000002803031 CR: 44000442 XER: 20000000
    CFAR: c00000000000bac8 IRQMASK: 0
    ...
    NIP [c00000000000239c] rfi_flush_fallback+0x3c/0x80
    LR [0000000010053e00] 0x10053e00
    Call Trace:
    [c0000000f1e17e30] [c00000000000b9e4] system_call+0x5c/0x70 (unreliable)

    Note this shouldn't make the kernel stack pointer vulnerable to a
    meltdown attack, because it should be flushed from the cache before we
    return to userspace. The user r1 value will be in the cache, because
    we load it in the return path, but that is harmless.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • [ Upstream commit f5daf77a55ef0e695cc90c440ed6503073ac5e07 ]

    Fix build errors and warnings in t1042rdb_diu.c by adding header files
    and MODULE_LICENSE().

    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: data definition has no type or storage class
    early_initcall(t1042rdb_diu_init);
    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: error: type defaults to 'int' in declaration of 'early_initcall' [-Werror=implicit-int]
    ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: parameter names (without types) in function declaration

    and
    WARNING: modpost: missing MODULE_LICENSE() in arch/powerpc/platforms/85xx/t1042rdb_diu.o

    Signed-off-by: Randy Dunlap
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Scott Wood
    Cc: Kumar Gala
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • [ Upstream commit c42d3be0c06f0c1c416054022aa535c08a1f9b39 ]

    The problem is that the calculation should be "end - start + 1" but
    the plus one is missing.

    Fixes: 8626816e905e ("powerpc: add support for MPIC message register API")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Tyrel Datwyler
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • [ Upstream commit f7a6947cd49b7ff4e03f1b4f7e7b223003d752ca ]

    Currently if you build a 32-bit powerpc kernel and use get_user() to
    load a u64 value it will fail to build with eg:

    kernel/rseq.o: In function `rseq_get_rseq_cs':
    kernel/rseq.c:123: undefined reference to `__get_user_bad'

    This is hitting the check in __get_user_size() that makes sure the
    size we're copying doesn't exceed the size of the destination:

    #define __get_user_size(x, ptr, size, retval)   \
    do {                                             \
        retval = 0;                                  \
        __chk_user_ptr(ptr);                         \
        if (size > sizeof(x))                        \
            (x) = __get_user_bad();                  \

    Which doesn't immediately make sense because the size of the
    destination is u64, but it's not really, because __get_user_check()
    etc. internally create an unsigned long and copy into that:

    #define __get_user_check(x, ptr, size)           \
    ({                                               \
        long __gu_err = -EFAULT;                     \
        unsigned long __gu_val = 0;                  \

    The problem being that on 32-bit unsigned long is not big enough to
    hold a u64. We can fix this with a trick from hpa in the x86 code, we
    statically check the type of x and set the type of __gu_val to either
    unsigned long or unsigned long long.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     

10 Sep, 2018

1 commit

  • commit 8cfbdbdc24815417a3ab35101ccf706b9a23ff17 upstream.

    Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
    the pinned physical page", 2018-07-17) added some checks to ensure
    that guest DMA mappings don't attempt to map more than the guest is
    entitled to access. However, errors in the logic mean that legitimate
    guest requests to map pages for DMA are being denied in some
    situations. Specifically, if the first page of the range passed to
    mm_iommu_get() is mapped with a normal page, and subsequent pages are
    mapped with transparent huge pages, we end up with mem->pageshift ==
    0. That means that the page size checks in mm_iommu_ua_to_hpa() and
    mm_iommu_ua_to_hpa_rm() will always fail for every page in that
    region, and thus the guest can never map any memory in that region for
    DMA, typically leading to a flood of error messages like this:

    qemu-system-ppc64: VFIO_MAP_DMA: -22
    qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)

    The logic errors in mm_iommu_get() are:

    (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
    call (meaning that find_linux_pte() returns the pte for the
    first address in the range, not the address we are currently up
    to);
    (b) use of 'pageshift' as the variable to receive the hugepage shift
    returned by find_linux_pte() - for a normal page this gets set
    to 0, leading to us setting mem->pageshift to 0 when we conclude
    that the pte returned by find_linux_pte() didn't match the page
    we were looking at;
    (c) comparing 'compshift', which is a page order, i.e. log base 2 of
    the number of pages, with 'pageshift', which is a log base 2 of
    the number of bytes.

    To fix these problems, this patch introduces 'cur_ua' to hold the
    current user address and uses that in the find_linux_pte() call;
    introduces 'pteshift' to hold the hugepage shift found by
    find_linux_pte(); and compares 'pteshift' with 'compshift +
    PAGE_SHIFT' rather than 'compshift'.

    The patch also moves the local_irq_restore to the point after the PTE
    pointer returned by find_linux_pte() has been dereferenced because
    otherwise the PTE could change underneath us, and adds a check to
    avoid doing the find_linux_pte() call once mem->pageshift has been
    reduced to PAGE_SHIFT, as an optimization.

    Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras