13 Jan, 2019

1 commit

  • [ Upstream commit 462951cd32e1496dc64b00051dfb777efc8ae5d8 ]

    For some configs the build fails with:

    arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers':
    arch/powerpc/mm/dump_linuxpagetables.c:306:39: error: 'PKMAP_BASE' undeclared (first use in this function)
    arch/powerpc/mm/dump_linuxpagetables.c:314:50: error: 'LAST_PKMAP' undeclared (first use in this function)

    These come from highmem.h; including it fixes the build.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Michael Ellerman
     

01 Dec, 2018

1 commit

  • [ Upstream commit 437ccdc8ce629470babdda1a7086e2f477048cbd ]

    When the VPHN function is not supported, the kernel prints the
    message 'VPHN function not supported. Disabling polling...' during
    CPU hotplug events. Currently it prints on every hotplug event,
    which floods dmesg when a KVM guest tries to hotplug a huge number
    of vCPUs; let's print it just once and suppress further prints.
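
    One common way to get this "print once" behaviour (a sketch; the
    exact upstream change may differ) is the printk_once() family:

        /* pr_info_once() emits the message only the first time it is
         * reached, so repeated hotplug events no longer flood dmesg. */
        pr_info_once("VPHN function not supported. Disabling polling...\n");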

    Signed-off-by: Satheesh Rajendran
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Satheesh Rajendran
     

21 Nov, 2018

3 commits

  • [ Upstream commit 803d690e68f0c5230183f1a42c7d50a41d16e380 ]

    When a process allocates a hugepage, the following leak is
    reported by kmemleak. This is a false positive: the pointer to
    the table is stored in the PGD as a physical address rather than
    a virtual pointer, so kmemleak cannot see the reference.

    unreferenced object 0xc30f8200 (size 512):
    comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] huge_pte_alloc+0xdc/0x1f8
    [] hugetlb_fault+0x560/0x8f8
    [] follow_hugetlb_page+0x14c/0x44c
    [] __get_user_pages+0x1c4/0x3dc
    [] __mm_populate+0xac/0x140
    [] vm_mmap_pgoff+0xb4/0xb8
    [] ksys_mmap_pgoff+0xcc/0x1fc
    [] ret_from_syscall+0x0/0x38

    See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as
    memory leaks when using kmemleak") for a detailed explanation.

    To fix that, this patch tells kmemleak to ignore the allocated
    hugepage table.
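
    A minimal sketch of the idea (allocation details abridged, not the
    exact upstream diff):

        void *table = kmem_cache_alloc(cachep, GFP_KERNEL);

        if (!table)
                return -ENOMEM;
        /* The only reference kept is the physical address written into
         * the PGD, which kmemleak cannot follow, so stop tracking the
         * object to avoid the false positive. */
        kmemleak_ignore(table);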

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit f5e284803a7206d43e26f9ffcae5de9626d95e37 ]

    When enumerating page size definitions to check hardware support,
    we construct a constant which is (1U << (def->shift - 10)).

    However, the array of page size definitions is only initialised for
    various MMU_PAGE_* constants, so it contains a number of 0-initialised
    elements with def->shift == 0. This means we end up shifting by a
    very large number, which gives the following UBSan splat:

    ================================================================================
    UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
    shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
    Call Trace:
    [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
    [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
    [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
    [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
    [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
    [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
    ================================================================================

    Fix this by first checking if the element exists (shift != 0) before
    constructing the constant.
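
    The guard can be illustrated with a small standalone sketch (the
    struct and table here are hypothetical stand-ins for the kernel's
    sparse mmu_psize_defs[] array):

        #include <stdio.h>

        struct psize_def { unsigned int shift; };

        /* Mostly zero-initialised, like the MMU_PAGE_* indexed array. */
        static struct psize_def defs[8] = { [2] = { .shift = 12 } };

        int main(void)
        {
                for (int i = 0; i < 8; i++) {
                        unsigned int shift = defs[i].shift;

                        /* Check the entry is populated before building the
                         * constant; (1U << (0 - 10)) is undefined behaviour. */
                        if (!shift)
                                continue;

                        printf("entry %d: %u KB pages\n", i, 1U << (shift - 10));
                }
                return 0;
        }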

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • [ Upstream commit 0d923962ab69c27cca664a2d535e90ef655110ca ]

    When we're running on Book3S with the Radix MMU enabled the page table
    dump currently prints the wrong addresses because it uses the wrong
    start address.

    Fix it to use PAGE_OFFSET rather than KERN_VIRT_START.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     

13 Oct, 2018

1 commit

  • commit 51c3c62b58b357e8d35e4cc32f7b4ec907426fe3 upstream.

    This stops us from doing code patching in init sections after they've
    been freed.

    In this chain:
    kvm_guest_init() ->
    kvm_use_magic_page() ->
    fault_in_pages_readable() ->
    __get_user() ->
    __get_user_nocheck() ->
    barrier_nospec();

    We have a code patching location at barrier_nospec() and
    kvm_guest_init() is an init function. This whole chain gets inlined,
    so when we free the init section (hence kvm_guest_init()), this code
    goes away and hence should no longer be patched.

    We saw this as userspace memory corruption when using a memory
    checker while doing partition migration testing on PowerVM (this
    starts the code patching post migration via
    /sys/kernel/mobility/migration). In theory, it could also happen when
    using /sys/kernel/debug/powerpc/barrier_nospec.
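
    The fix is essentially a guard in the instruction patching path; a
    rough sketch (the init_mem_is_free flag is illustrative, and
    init_section_contains() is the generic sections helper):

        /* Refuse to patch an address inside the init sections once they
         * have been freed; the code at that address no longer exists. */
        if (init_mem_is_free && init_section_contains(addr, 4))
                return 0;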

    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Michael Neuling
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     

10 Sep, 2018

1 commit

  • commit 8cfbdbdc24815417a3ab35101ccf706b9a23ff17 upstream.

    Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
    the pinned physical page", 2018-07-17) added some checks to ensure
    that guest DMA mappings don't attempt to map more than the guest is
    entitled to access. However, errors in the logic mean that legitimate
    guest requests to map pages for DMA are being denied in some
    situations. Specifically, if the first page of the range passed to
    mm_iommu_get() is mapped with a normal page, and subsequent pages are
    mapped with transparent huge pages, we end up with mem->pageshift ==
    0. That means that the page size checks in mm_iommu_ua_to_hpa() and
    mm_iommu_ua_to_hpa_rm() will always fail for every page in that
    region, and thus the guest can never map any memory in that region for
    DMA, typically leading to a flood of error messages like this:

    qemu-system-ppc64: VFIO_MAP_DMA: -22
    qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)

    The logic errors in mm_iommu_get() are:

    (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
        call (meaning that find_linux_pte() returns the pte for the
        first address in the range, not the address we are currently up
        to);
    (b) use of 'pageshift' as the variable to receive the hugepage shift
        returned by find_linux_pte() - for a normal page this gets set
        to 0, leading to us setting mem->pageshift to 0 when we conclude
        that the pte returned by find_linux_pte() didn't match the page
        we were looking at;
    (c) comparing 'compshift', which is a page order, i.e. log base 2 of
        the number of pages, with 'pageshift', which is a log base 2 of
        the number of bytes.

    To fix these problems, this patch introduces 'cur_ua' to hold the
    current user address and uses that in the find_linux_pte() call;
    introduces 'pteshift' to hold the hugepage shift found by
    find_linux_pte(); and compares 'pteshift' with 'compshift +
    PAGE_SHIFT' rather than 'compshift'.

    The patch also moves the local_irq_restore to the point after the PTE
    pointer returned by find_linux_pte() has been dereferenced because
    otherwise the PTE could change underneath us, and adds a check to
    avoid doing the find_linux_pte() call once mem->pageshift has been
    reduced to PAGE_SHIFT, as an optimization.

    Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     

03 Aug, 2018

1 commit

  • [ Upstream commit 926bc2f100c24d4842b3064b5af44ae964c1d81c ]

    The stores to update the SLB shadow area must be made as they appear
    in the C code, so that the hypervisor does not see an entry with
    mismatched vsid and esid. Use WRITE_ONCE for this.

    GCC has been observed to elide the first store to esid in the update,
    which means that if the hypervisor interrupts the guest after storing
    to vsid, it could see an entry with old esid and new vsid, which may
    possibly result in memory corruption.
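
    The resulting store sequence looks roughly like this (field names are
    illustrative; the point is the explicit ordering plus WRITE_ONCE):

        /* Clear the ESID first so the entry is invalid while the VSID
         * changes, and use WRITE_ONCE so the compiler cannot elide or
         * reorder the intermediate stores. */
        WRITE_ONCE(shadow->save_area[index].esid, 0);
        WRITE_ONCE(shadow->save_area[index].vsid, cpu_to_be64(new_vsid));
        WRITE_ONCE(shadow->save_area[index].esid, cpu_to_be64(new_esid));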

    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     

28 Jul, 2018

1 commit

  • commit 76fa4975f3ed12d15762bc979ca44078598ed8ee upstream.

    A VM which has:
    - a DMA capable device passed through to it (eg. network card);
    - running a malicious kernel that ignores H_PUT_TCE failure;
    - capability of using IOMMU pages bigger than physical pages
    can create an IOMMU mapping that exposes (for example) 16MB of
    the host physical memory to the device when only 64K was allocated to the VM.

    The remaining 16MB - 64K will be some other content of host memory, possibly
    including pages of the VM, but also pages of host kernel memory, host
    programs or other VMs.

    The attacking VM does not control the location of the page it can map,
    and is only allowed to map as many pages as it has pages of RAM.

    We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
    an IOMMU page is contained in the physical page so the PCI hardware won't
    get access to unassigned host memory; however this check is missing in
    the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so
    far and have not hit this yet because the very first time the mapping
    happens we do not have tbl::it_userspace allocated yet and fall back
    to userspace, which in turn calls the VFIO IOMMU driver; this fails
    and the guest does not retry.

    This stores the smallest preregistered page size in the preregistered
    region descriptor and changes the mm_iommu_xxx API to check this against
    the IOMMU page size.

    This calculates maximum page size as a minimum of the natural region
    alignment and compound page size. For the page shift this uses the shift
    returned by find_linux_pte() which indicates how the page is mapped to
    the current userspace - if the page is huge and this is not a zero, then
    it is a leaf pte and the page is mapped within the range.

    Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     

05 Jun, 2018

3 commits

  • commit aa0ab02ba992eb956934b21373e0138211486ddd upstream.

    On the 8xx, the page size is set in the PMD entry and applies to
    all pages of the page table pointed by the said PMD entry.

    When an app has some regular pages allocated (e.g. see below) and tries
    to mmap() a huge page at a hint address covered by the same PMD entry,
    the kernel accepts the hint although the 8xx cannot handle different
    page sizes in the same PMD entry.

    10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
    10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc

    mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
    MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000

    This results in the app remaining forever in do_page_fault()/hugetlb_fault(),
    and when interrupting that app we get the following warning:

    [162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
    [162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W 4.14.6 #85
    [162980.035744] task: c67e2c00 task.stack: c668e000
    [162980.035783] NIP: c000fe18 LR: c00e1eec CTR: c00f90c0
    [162980.035830] REGS: c668fc20 TRAP: 0700 Tainted: G W (4.14.6)
    [162980.035854] MSR: 00029032 CR: 24044224 XER: 20000000
    [162980.036003]
    [162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
    [162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
    [162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
    [162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
    [162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
    [162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
    [162980.036861] Call Trace:
    [162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
    [162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
    [162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
    [162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
    [162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
    [162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
    [162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
    [162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
    [162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
    [162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
    [162980.037781] Instruction dump:
    [162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
    [162980.038014] 2f890000 419e0010 712a0ff0 408200e0 54a9000a 7f984840 419d0094
    [162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
    [162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
    [162985.363322] BUG: non-zero nr_ptes on freeing mm: -1

    In order to fix this, this patch uses the address space "slices"
    implemented for BOOK3S/64 and enhanced to support PPC32 by the
    preceding patch.

    This patch modifies the context.id on the 8xx to be in the range
    [1:16] instead of [0:15] in order to identify context.id == 0 as
    an uninitialised context, as done on BOOK3S.

    This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
    selected for the 8xx.

    Although we could in theory have as many slices as PMD entries, the
    current slices implementation limits the number of low slices to 16.
    This limitation does not prevent us from fixing the initial issue,
    although it is suboptimal. It will be cured in a subsequent patch.

    Fixes: 4b91428699477 ("powerpc/8xx: Implement support of hugepages")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit db3a528db41caaa6dfd4c64e9f5efb1c81a80467 upstream.

    In preparation for the following patch which will fix an issue on
    the 8xx by re-using the 'slices', this patch enhances the
    'slices' implementation to support 32-bit CPUs.

    On PPC32, the address space is limited to 4Gbytes, hence only the low
    slices will be used.

    The high slices use bitmaps. As bitmap functions are not prepared to
    handle bitmaps of size 0, this patch ensures that bitmap functions
    are called only when SLICE_NUM_HIGH is not zero.

    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 326691ad4f179e6edc7eb1271e618dd673e4736d upstream.

    bitmap_or() and bitmap_andnot() can work properly with dst identical
    to src1 or src2. There is no need of an intermediate result bitmap
    that is copied back to dst in a second step.
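
    In other words (a sketch; the size constant is borrowed from the
    slice code for illustration):

        /* Before: OR into a temporary bitmap, then copy back to dst. */
        bitmap_or(tmp, dst, src, SLICE_NUM_HIGH);
        bitmap_copy(dst, tmp, SLICE_NUM_HIGH);

        /* After: bitmap_or() handles dst == src1, so operate in place. */
        bitmap_or(dst, dst, src, SLICE_NUM_HIGH);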

    Signed-off-by: Christophe Leroy
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

02 May, 2018

1 commit

  • commit fb5924fddf9ee31db04da7ad4e8c3434a387101b upstream.

    This patch adds support for flushing potentially dirty cache lines
    when memory is hot-plugged/hot-un-plugged. The support is currently
    limited to 64 bit systems.

    The bug was exposed when mappings for a device were actually
    hot-unplugged and plugged back in later. A similar issue was observed
    during the development of memtrace, but memtrace does its own
    flushing of the region via a custom routine.

    These patches do a flush on both hotplug and unplug to clear any
    stale data in the cache w.r.t. the mappings; there is a small race
    window where a clean cache line may be created again just prior to
    tearing down the mapping.

    The patches were tested by disabling the flush routines in memtrace
    and doing I/O on the trace file. The system immediately
    checkstops (quite reliably if, prior to the hot-unplug of the memtrace
    region, we memset the regions we are about to hot-unplug). After these
    patches no custom flushing is needed in the memtrace code.

    Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Balbir Singh
    Acked-by: Reza Arbab
    Reviewed-by: Rashmica Gupta
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Balbir Singh
     

26 Apr, 2018

2 commits

  • [ Upstream commit ea05ba7c559c8e5a5946c3a94a2a266e9a6680a6 ]

    This patch fixes some problems encountered at runtime with
    configurations that support memory-less nodes, or that hot-add CPUs
    into nodes that are memoryless during system execution after boot. The
    problems of interest include:

    * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
    them are allowed to be 'possible' and 'online'. Memory allocations
    for those nodes are taken from another node that does have memory
    until and if memory is hot-added to the node.

    * Nodes which have no resources assigned at boot, but which may still
    be referenced subsequently by affinity or associativity attributes,
    are kept in the list of 'possible' nodes for powerpc. Hot-add of
    memory or CPUs to the system can reference these nodes and bring
    them online instead of redirecting the references to one of the set
    of nodes known to have memory at boot.

    Note that this software operates under the context of CPU hotplug. We
    are not doing memory hotplug in this code, but rather updating the
    kernel's CPU topology (i.e. arch_update_cpu_topology /
    numa_update_cpu_topology). We are initializing a node that may be used
    by CPUs or memory before it can be referenced as invalid by a CPU
    hotplug operation. CPU hotplug operations are protected by a range of
    APIs including cpu_maps_update_begin/cpu_maps_update_done,
    cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
    Memory hotplug operations, including try_online_node, are protected by
    mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the
    case of CPUs being hot-added to a previously memoryless node, the
    try_online_node operation occurs wholly within the CPU locks with no
    overlap. Using HMC hot-add/hot-remove operations, we have been able to
    add and remove CPUs to and from any possible node without failures.
    HMC operations involve a degree of self-serialization, though.

    Signed-off-by: Michael Bringmann
    Reviewed-by: Nathan Fontenot
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Bringmann
     
  • [ Upstream commit a346137e9142b039fd13af2e59696e3d40c487ef ]

    On powerpc systems which allow 'hot-add' of CPU or memory resources,
    it may occur that the new resources are to be inserted into nodes that
    were not used for these resources at bootup. In the kernel, any node
    that is used must be defined and initialized. These empty nodes may
    occur when:

    * Dedicated vs. shared resources. Shared resources require information
    such as the VPHN hcall for CPU assignment to nodes. Associativity
    decisions made based on dedicated resource rules, such as
    associativity properties in the device tree, may vary from decisions
    made using the values returned by the VPHN hcall.

    * memoryless nodes at boot. Nodes need to be defined as 'possible' at
    boot for operation with other code modules. Previously, the powerpc
    code would limit the set of possible nodes to those which have
    memory assigned at boot, and were thus online. Subsequent add/remove
    of CPUs or memory would only work with this subset of possible
    nodes.

    * memoryless nodes with CPUs at boot. Due to the previous restriction
    on nodes, nodes that had CPUs but no memory were being collapsed
    into other nodes that did have memory at boot. In practice this
    meant that the node assignment presented by the runtime kernel
    differed from the affinity and associativity attributes presented by
    the device tree or VPHN hcalls. Nodes that might be known to the
    pHyp were not 'possible' in the runtime kernel because they did not
    have memory at boot.

    This patch ensures that sufficient nodes are defined to support
    configuration requirements after boot, as well as at boot. This patch
    set fixes a couple of problems.

    * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
    them are allowed to be 'possible' and 'online'. Memory allocations
    for those nodes are taken from another node that does have memory
    until and if memory is hot-added to the node.

    * Nodes which have no resources assigned at boot, but which may still
    be referenced subsequently by affinity or associativity attributes,
    are kept in the list of 'possible' nodes for powerpc. Hot-add of
    memory or CPUs to the system can reference these nodes and bring
    them online instead of redirecting to one of the set of nodes that
    were known to have memory at boot.

    This patch extracts the value of the lowest domain level (number of
    allocable resources) from the device tree property
    "ibm,max-associativity-domains" to use as the maximum number of nodes
    to setup as possibly available in the system. This new setting will
    override the instruction:

    nodes_and(node_possible_map, node_possible_map, node_online_map);

    presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

    If the "ibm,max-associativity-domains" property is not present at
    boot, no operation will be performed to define or enable additional
    nodes, or enable the above 'nodes_and()'.

    Signed-off-by: Michael Bringmann
    Reviewed-by: Nathan Fontenot
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Bringmann
     

24 Apr, 2018

1 commit

  • commit dbfcf3cb9c681aa0c5d0bb46068f98d5b1823dd3 upstream.

    On POWER9, since commit cc3d2940133d ("powerpc/64: Enable use of radix
    MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and
    HPT bits in the client-architecture-support (CAS) vector, which tells
    the hypervisor that we can do either radix or HPT. According to PAPR,
    if we use this combination we are promising to do a H_REGISTER_PROC_TBL
    hcall later on to let the hypervisor know whether we are doing radix
    or HPT. We currently do this call if we are doing radix but not if
    we are doing HPT. If the hypervisor is able to support both radix
    and HPT guests, it would be entitled to defer allocation of the HPT
    until the H_REGISTER_PROC_TBL call, and to fail any attempts to create
    HPTEs until the H_REGISTER_PROC_TBL call. Thus we need to do a
    H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may
    crash at boot time.

    This adds the code to call H_REGISTER_PROC_TBL in this case, before
    we attempt to create any HPT entries using H_ENTER.

    Fixes: cc3d2940133d ("powerpc/64: Enable use of radix MMU under hypervisor on POWER9")
    Cc: stable@vger.kernel.org # v4.11+
    Signed-off-by: Paul Mackerras
    Reviewed-by: Suraj Jitindar Singh
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     

22 Feb, 2018

4 commits

  • commit 4dd5f8a99e791a8c6500e3592f3ce81ae7edcde1 upstream.

    This patch splits the linear mapping if the hot-unplug range is
    smaller than the mapping size. The code detects if the mapping needs
    to be split into a smaller size and if so, uses the stop machine
    infrastructure to clear the existing mapping and then remap the
    remaining range using a smaller page size.

    The code will skip any region of the mapping that overlaps with kernel
    text and warn about it once. We don't want to remove a mapping where
    the kernel text and the LMB we intend to remove overlap in the same
    TLB mapping as it may affect the currently executing code.

    I've tested these changes under a KVM guest with 2 vCPUs; from a
    split-mapping point of view, some of the caveats mentioned above
    applied to the testing I did.

    Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")
    Signed-off-by: Balbir Singh
    [mpe: Tweak change log to match updated behaviour]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Balbir Singh
     
  • commit 62e984ddfd6b056d399e24113f5e6a7145e579d8 upstream.

    Radix guests normally invalidate process-scoped translations when a
    new PID is allocated, but migrated guests do not, so migrated guests
    sometimes crash. This is especially easy to reproduce when the
    migration happens within the first 10 seconds after guest boot
    starts, on the same machine.

    This adds the "Invalidate process-scoped translations" flush to fix
    radix guest migration.

    Fixes: 2ee13be34b13 ("KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00")
    Cc: stable@vger.kernel.org # v4.10+
    Signed-off-by: Alexey Kardashevskiy
    Tested-by: Laurent Vivier
    Tested-by: Daniel Henrique Barboza
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     
  • commit 1d9a090783bef19fe8cdec878620d22f05191316 upstream.

    When DLPAR removing a CPU, the unmapping of the CPU from a node in
    unmap_cpu_from_node() should also invalidate the CPU's entry in the
    numa_cpu_lookup_table. There is no guarantee that on a subsequent
    DLPAR add of the CPU the associativity will be the same, and the CPU
    could thus end up in a different node. Invalidating the entry in the
    numa_cpu_lookup_table causes the associativity to be read from the
    device tree at the time of the add.

    The current behavior of not invalidating the CPU's entry in the
    numa_cpu_lookup_table can result in scenarios where the topology
    layout of CPUs in the partition does not match the device tree
    or the topology reported by the HMC.

    This bug looks like it was introduced in 2004 in the commit titled
    "ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the
    linux-fullhist tree. Hence tag it for all stable releases.

    Cc: stable@vger.kernel.org
    Signed-off-by: Nathan Fontenot
    Reviewed-by: Tyrel Datwyler
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nathan Fontenot
     
  • commit 8d81296cfcce89013a714feb8d25004a156f8181 upstream.

    radix__flush_tlb_all() is called only in the kexec path, in real mode,
    and any tracepoint hit at this stage will make kexec fail if enabled.

    To verify, enable the tlbie tracepoint before kexec:

    $ echo 1 > /sys/kernel/debug/tracing/events/powerpc/tlbie/enable
    == kexec into new kernel and kexec fails.

    Fix this by not calling trace_tlbie from radix__flush_tlb_all().

    Fixes: 0428491cba92 ("powerpc/mm: Trace tlbie(l) instructions")
    Cc: stable@vger.kernel.org # v4.13+
    Signed-off-by: Mahesh Salgaonkar
    Acked-by: Balbir Singh
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     

10 Jan, 2018

1 commit

  • commit ecb101aed86156ec7cd71e5dca668e09146e6994 upstream.

    The recent refactoring of the powerpc page fault handler in commit
    c3350602e876 ("powerpc/mm: Make bad_area* helper functions") caused
    access to protected memory regions to indicate SEGV_MAPERR instead of
    the traditional SEGV_ACCERR in the si_code field of a user-space
    signal handler. This can confuse debug libraries that temporarily
    change the protection of memory regions, and expect to use SEGV_ACCERR
    as an indication to restore access to a region.

    This commit restores the previous behavior. The following program
    exhibits the issue:

    $ ./repro read || echo "FAILED"
    $ ./repro write || echo "FAILED"
    $ ./repro exec || echo "FAILED"

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <assert.h>
    #include <sys/mman.h>

    static void segv_handler(int n, siginfo_t *info, void *arg) {
            _exit(info->si_code == SEGV_ACCERR ? 0 : 1);
    }

    int main(int argc, char **argv)
    {
            void *p = NULL;
            struct sigaction act = {
                    .sa_sigaction = segv_handler,
                    .sa_flags = SA_SIGINFO,
            };

            assert(argc == 2);
            p = mmap(NULL, getpagesize(),
                     (strcmp(argv[1], "write") == 0) ? PROT_READ : 0,
                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);

            assert(sigaction(SIGSEGV, &act, NULL) == 0);
            if (strcmp(argv[1], "read") == 0)
                    printf("%c", *(unsigned char *)p);
            else if (strcmp(argv[1], "write") == 0)
                    *(unsigned char *)p = 0;
            else if (strcmp(argv[1], "exec") == 0)
                    ((void (*)(void))p)();
            return 1; /* failed to generate SEGV */
    }

    Fixes: c3350602e876 ("powerpc/mm: Make bad_area* helper functions")
    Signed-off-by: John Sperbeck
    Acked-by: Benjamin Herrenschmidt
    [mpe: Add commit references in change log]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    John Sperbeck
     

05 Dec, 2017

1 commit

  • commit a3961f824cdbe7eb431254dc7d8f6f6767f474aa upstream.

    Rebooting into a new kernel with kexec fails in trace_tlbie() which is
    called from native_hpte_clear(). This happens if the running kernel
    has CONFIG_LOCKDEP enabled. With lockdep enabled, the tracepoints
    always execute a few RCU checks regardless of whether tracing is on or
    off. We are already in the last phase of kexec sequence in real mode
    with HILE_BE set. At this point the RCU check ends up in
    RCU_LOCKDEP_WARN and causes kexec to fail.

    Fix this by not calling trace_tlbie() from native_hpte_clear().

    mpe: It's not safe to call trace points at this point in the kexec
    path, even if we could avoid the RCU checks/warnings. The only
    solution is to not call them.

    Fixes: 0428491cba92 ("powerpc/mm: Trace tlbie(l) instructions")
    Signed-off-by: Mahesh Salgaonkar
    Reported-by: Aneesh Kumar K.V
    Suggested-by: Michael Ellerman
    Acked-by: Naveen N. Rao
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Mahesh Salgaonkar
     

30 Nov, 2017

6 commits

  • commit 35602f82d0c765f991420e319c8d3a596c921eb8 upstream.

    While mapping hints with a length that cross 128TB are disallowed,
    MAP_FIXED allocations that cross 128TB are allowed. These are failing
    on hash (on radix they succeed). Add an additional case for fixed
    mappings to expand the addr_limit when crossing 128TB.

    Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • commit effc1b25088502fbd30305c79773de2d1f7470a6 upstream.

    Hash unconditionally resets the addr_limit to default (128TB) when the
    mm context is initialised. If a process has > 128TB mappings when it
    forks, the child will not get the 512TB addr_limit, so accesses to
    valid > 128TB mappings will fail in the child.

    Fix this by only resetting the addr_limit to default if it was 0.
    Non-zero indicates it was duplicated from the parent (0 means exec()).
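
    As a sketch (macro and field names follow the powerpc code of that
    era, not necessarily the exact diff):

        /* 0 means a fresh exec(); a forked child inherits a non-zero
         * addr_limit from its parent and must keep it. */
        if (!mm->context.addr_limit)
                mm->context.addr_limit = DEFAULT_MAP_WINDOW_USER64;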

    Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • commit 6a72dc038b615229a1b285829d6c8378d15c2347 upstream.

    When allocating VA space with a hint that crosses 128TB, the SLB
    addr_limit variable is not expanded if addr is not > 128TB, but the
    slice allocation looks at task_size, which is 512TB. This results in
    slice_check_fit() incorrectly succeeding because the slice_count
    truncates off bit 128 of the requested mask, so the comparison to the
    available mask succeeds.

    Fix this by using mm->context.addr_limit instead of mm->task_size for
    testing allocation limits. This causes such allocations to fail.

    Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
    Reported-by: Florian Weimer
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • commit 7ece370996b694ae263025e056ad785afc1be5ab upstream.

    Currently userspace is able to request mmap() search between 128T-512T
    by specifying a hint address that is greater than 128T. But that means
    a hint of 128T exactly will return an address below 128T, which is
    confusing and wrong.

    So fix the logic to check the hint is greater than *or equal* to 128T.
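
    Roughly (a sketch; names follow the powerpc code of that era):

        /* A hint of exactly 128TB should already select the extended
         * 512TB address space, hence >= rather than >. */
        if (addr >= DEFAULT_MAP_WINDOW_USER64)
                mm->context.addr_limit = TASK_SIZE_USER64;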

    Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
    Suggested-by: Aneesh Kumar K.V
    Suggested-by: Nicholas Piggin
    [mpe: Split out of Nick's bigger patch]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 85e3f1adcb9d49300b0a943bb93f9604be375bfb upstream.

    Radix VA space allocations test addresses against mm->task_size which
    is 512TB, even in cases where the intention is to limit allocation to
    below 128TB.

    This results in mmap with a hint address below 128TB but address +
    length above 128TB succeeding when it should fail (as hash does after
    the previous patch).

    Set the high address limit to be considered up front, and base
    subsequent allocation checks on that consistently.

    Fixes: f4ea6dcb08ea ("powerpc/mm: Enable mappings above 128TB")
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     
  • commit f79ad50ea3c73fb1ea5b09e95c864e5bb263adfb upstream.

    When using the radix MMU on Power9 DD1, to work around a hardware
    problem, radix__pte_update() is required to do a two stage update of
    the PTE. First we write a zero value into the PTE, then we flush the
    TLB, and then we write the new PTE value.

    In the normal case that works OK, but it does not work if we're
    updating the PTE that maps the code we're executing, because the
    mapping is removed by the TLB flush and we can no longer execute from
    it. Unfortunately the STRICT_RWX code needs to do exactly that.

    The exact symptoms when we hit this case vary: sometimes we print an
    oops and then get stuck after that, but I've also seen a machine just
    get stuck continually page faulting with no oops printed. The variance
    is presumably due to the exact layout of the text and the page size
    used for the mappings. In all cases we are unable to boot to a shell.

    There are possible solutions such as creating a second mapping of the
    TLB flush code, executing from that, and then jumping back to the
    original. However we don't want to add that level of complexity for a
    DD1 workaround.

    So just detect that we're running on Power9 DD1 and refrain from
    changing the permissions, effectively disabling STRICT_RWX on Power9
    DD1.
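
    A sketch of the guard (the CPU feature check is the real API; the
    surrounding function is not shown):

        /* On Power9 DD1 the two-stage PTE update would invalidate the
         * mapping of the code performing the update, so skip changing
         * the permissions rather than lose our own text mapping. */
        if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
                WARN_ONCE(1, "skipped changing memory permissions on Power9 DD1\n");
                return;
        }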

    Fixes: 7614ff3272a1 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix")
    Reported-by: Andrew Jeffery
    [Changelog as suggested by Michael Ellerman ]
    Signed-off-by: Balbir Singh
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Balbir Singh
     

04 Nov, 2017

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    "Some more powerpc fixes for 4.14.

    This is bigger than I like to send at rc7, but that's at least partly
    because I didn't send any fixes last week. If it wasn't for the IMC
    driver, which is new and getting heavy testing, the diffstat would
    look a bit better. I've also added ftrace on big endian to my test
    suite, so we shouldn't break that again in future.

    - A fix to the handling of misaligned paste instructions (P9 only),
    where a change to a #define has caused the check for the
    instruction to always fail.

    - The preempt handling was unbalanced in the radix THP flush (P9
    only). Though we don't generally use preempt we want to keep it
    working as much as possible.

    - Two fixes for IMC (P9 only), one when booting with restricted
    number of CPUs and one in the error handling when initialisation
    fails due to firmware etc.

    - A revert to fix function_graph on big endian machines, and then a
    rework of the reverted patch to fix kprobes blacklist handling on
    big endian machines.

    Thanks to: Anju T Sudhakar, Guilherme G. Piccoli, Madhavan Srinivasan,
    Naveen N. Rao, Nicholas Piggin, Paul Mackerras"

    * tag 'powerpc-4.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/perf: Fix core-imc hotplug callback failure during imc initialization
    powerpc/kprobes: Dereference function pointers only if the address does not belong to kernel text
    Revert "powerpc64/elfv1: Only dereference function descriptor for non-text symbols"
    powerpc/64s/radix: Fix preempt imbalance in TLB flush
    powerpc: Fix check for copy/paste instructions in alignment handler
    powerpc/perf: Fix IMC allocation routine

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

26 Oct, 2017

1 commit


10 Oct, 2017

1 commit

  • It turns out that not all paths calling arch_update_cpu_topology() hold
    cpu_hotplug_lock, but that's OK because those paths can't race with
    any concurrent hotplug events.

    Warnings were reported with the following trace:

    lockdep_assert_cpus_held
    arch_update_cpu_topology
    sched_init_domains
    sched_init_smp
    kernel_init_freeable
    kernel_init
    ret_from_kernel_thread

    Which is safe because it's called early in boot when hotplug is not
    live yet.

    And also this trace:

    lockdep_assert_cpus_held
    arch_update_cpu_topology
    partition_sched_domains
    cpuset_update_active_cpus
    sched_cpu_deactivate
    cpuhp_invoke_callback
    cpuhp_down_callbacks
    cpuhp_thread_fun
    smpboot_thread_fn
    kthread
    ret_from_kernel_thread

    Which is safe because it's called as part of CPU hotplug, so although
    we don't hold the CPU hotplug lock, there is another thread driving
    the CPU hotplug operation which does hold the lock, and there is no
    race.

    Thanks to tglx for deciphering it for us.

    Fixes: 3e401f7a2e51 ("powerpc: Only obtain cpu_hotplug_lock if called by rtasd")
    Signed-off-by: Thiago Jung Bauermann
    Signed-off-by: Michael Ellerman

    Thiago Jung Bauermann
     

04 Oct, 2017

1 commit

  • flush_tlb_kernel_range() may call smp_call_function_many() which expects
    interrupts to be enabled. This results in a traceback.

    WARNING: CPU: 0 PID: 1 at kernel/smp.c:416 smp_call_function_many+0xcc/0x2fc
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-rc1-00009-g0666f56 #1
    task: cf830000 task.stack: cf82e000
    NIP: c00a93c8 LR: c00a9634 CTR: 00000001
    REGS: cf82fde0 TRAP: 0700 Not tainted (4.14.0-rc1-00009-g0666f56)
    MSR: 00021000 CR: 24000082 XER: 00000000

    GPR00: c00a9634 cf82fe90 cf830000 c050ad3c c0015a54 00000000 00000001 00000001
    GPR08: 00000001 00000000 00000000 cf82e000 24000084 00000000 c0003150 00000000
    GPR16: 00000000 00000000 00000000 00000000 00000000 00000001 00000000 c0510000
    GPR24: 00000000 c0015a54 00000000 c050ad3c c051823c c050ad3c 00000025 00000000
    NIP [c00a93c8] smp_call_function_many+0xcc/0x2fc
    LR [c00a9634] smp_call_function+0x3c/0x50
    Call Trace:
    [cf82fe90] [00000010] 0x10 (unreliable)
    [cf82fed0] [c00a9634] smp_call_function+0x3c/0x50
    [cf82fee0] [c0015d2c] flush_tlb_kernel_range+0x20/0x38
    [cf82fef0] [c001524c] mark_initmem_nx+0x154/0x16c
    [cf82ff20] [c001484c] free_initmem+0x20/0x4c
    [cf82ff30] [c000316c] kernel_init+0x1c/0x108
    [cf82ff40] [c000f3a8] ret_from_kernel_thread+0x5c/0x64
    Instruction dump:
    7c0803a6 7d808120 38210040 4e800020 3d20c052 812981a0 2f890000 40beffac
    3d20c051 8929ac64 2f890000 40beff9c 4bffff94 7fc3f378 7f64db78

    Fixes: 3184cc4b6f6a ("powerpc/mm: Fix kernel RAM protection after freeing ...")
    Fixes: e611939fc8ec ("powerpc/mm: Ensure change_page_attr() doesn't ...")
    Cc: Christophe Leroy
    Signed-off-by: Guenter Roeck
    Reviewed-by: Christophe Leroy
    Signed-off-by: Michael Ellerman

    Guenter Roeck
     

01 Sep, 2017

2 commits


31 Aug, 2017

2 commits

  • When we map memory at boot we print out the ranges of real addresses
    that we mapped and the page size that was used.

    Currently it's a bit ugly:

    Mapped range 0x0 - 0x2000000000 with 0x40000000
    Mapped range 0x200000000000 - 0x202000000000 with 0x40000000

    Pad the addresses so they line up, and print the page size using
    actual units, eg:

    Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages
    Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages
    Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages

    Signed-off-by: Michael Ellerman

    Michael Ellerman
     
  • Make the printks look a bit nicer by adding a prefix.

    Signed-off-by: Michael Ellerman

    Michael Ellerman
     

23 Aug, 2017

3 commits