06 Apr, 2019

2 commits

  • [ Upstream commit 81b61324922c67f73813d8a9c175f3c153f6a1c6 ]

    On pseries systems, performing a partition migration can result in
    altering the nodes a CPU is assigned to on the destination system. For
    example, pre-migration on the source system CPUs are in nodes 1 and 3,
    post-migration on the destination system CPUs are in nodes 2 and 3.

    Handling the node change for a CPU can cause corruption in the slab
    cache if we hit a timing window where a CPU's node is changed while cache_reap()
    is invoked. The corruption occurs because the slab cache code appears
    to rely on the CPU and slab cache pages being on the same node.

    The current dynamic updating of a CPU's node done in arch/powerpc/mm/numa.c
    does not prevent us from hitting this scenario.

    Changing the device tree property update notification handler that
    recognizes an affinity change for a CPU to do a full DLPAR remove and
    add of the CPU instead of dynamically changing its node resolves this
    issue.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Michael W. Bringmann
    Tested-by: Michael W. Bringmann
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Nathan Fontenot
     
  • [ Upstream commit 5330367fa300742a97e20e953b1f77f48392faae ]

    After we ALIGN up the address we need to make sure we didn't overflow
    and end up with a zero address. In that case, we need to make sure that
    the returned address is greater than mmap_min_addr.

    This fixes selftest va_128TBswitch --run-hugetlb reporting failures when
    run as a non-root user for

    mmap(-1, MAP_HUGETLB)

    The bug is that a non-root user requesting address -1 will be given address 0
    which will then fail, whereas they should have been given something else that
    would have succeeded.

    With this change we also avoid the first mmap(-1, MAP_HUGETLB) returning
    a NULL address as the mmap address. So we think this is not a security
    issue, because it only affects
    whether we choose an address below mmap_min_addr, not whether we
    actually allow that address to be mapped. ie. there are existing capability
    checks to prevent a user mapping below mmap_min_addr and those will still be
    honoured even without this fix.
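
    A minimal sketch of the resulting check, assuming a helper of this
    shape (the helper name is hypothetical; the real diff touches the
    hugetlb/slice get_unmapped_area paths):

    /* After ALIGN() the candidate may have wrapped to zero, so it must
     * also clear mmap_min_addr before it can be returned. */
    static bool hugetlb_candidate_ok(unsigned long addr, unsigned long len,
                                     unsigned long high_limit)
    {
            return addr >= mmap_min_addr &&   /* rejects a wrap to 0 */
                   high_limit - len >= addr;  /* still fits below the limit */
    }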

    Fixes: 484837601d4d ("powerpc/mm: Add radix support for hugetlb")
    Reviewed-by: Laurent Dufour
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Aneesh Kumar K.V
     

03 Apr, 2019

1 commit

  • commit 10c5e83afd4a3f01712d97d3bb1ae34d5b74a185 upstream.

    In order to protect against speculation attacks on
    indirect branches, the branch predictor is flushed at
    kernel entry to protect for the following situations:
    - userspace process attacking another userspace process
    - userspace process attacking the kernel
    Basically when the privilege level changes (i.e. the
    kernel is entered), the branch predictor state is flushed.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     

15 Feb, 2019

1 commit

  • commit 579b9239c1f38665b21e8d0e6ee83ecc96dbd6bb upstream.

    With support for the split pmd lock, we use the pmd page's pmd_huge_pte
    pointer to store the deposited page table. In those configs, when we
    move page tables we need to make sure we move the deposited page table
    to the correct pmd page. Otherwise this can result in a crash when we
    withdraw the deposited page table, because we can find pmd_huge_pte NULL.

    eg:

    __split_huge_pmd+0x1070/0x1940
    __split_huge_pmd+0xe34/0x1940 (unreliable)
    vma_adjust_trans_huge+0x110/0x1c0
    __vma_adjust+0x2b4/0x9b0
    __split_vma+0x1b8/0x280
    __do_munmap+0x13c/0x550
    sys_mremap+0x220/0x7e0
    system_call+0x5c/0x70
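
    The generic pattern for keeping the deposit with the right pmd page
    when a huge pmd moves looks like this sketch (both helpers are
    existing kernel APIs; locking elided):

    pgtable_t pgtable;

    /* withdraw the page table deposited on the old pmd page ... */
    pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
    /* ... and re-deposit it on the pmd page it now belongs to */
    pgtable_trans_huge_deposit(mm, new_pmd, pgtable);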

    Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.")
    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

13 Feb, 2019

1 commit

  • [ Upstream commit ffca395b11c4a5a6df6d6345f794b0e3d578e2d0 ]

    On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
    a no-exec fault generates DSISR_PROTFAULT error bits,
    not DSISR_NOEXEC_OR_G.

    This patch adds DSISR_PROTFAULT in the test mask.
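
    A sketch of the widened test, with the surrounding condition
    approximated from the commit text:

    /* on the 8xx a no-exec fault arrives as a protection fault, so
     * accept either bit when the access was an instruction fetch */
    if (is_exec && (error_code & (DSISR_NOEXEC_OR_G | DSISR_PROTFAULT)))
            return true;    /* kernel execute fault */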

    Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Christophe Leroy
     

13 Jan, 2019

2 commits

  • [ Upstream commit 9ef34630a4614ee1cd478f9859ebea55d55f10ec ]

    The "altmap" is used to provide a pool of memory that is reserved for
    the vmemmap backing of hot-plugged memory. This is useful when adding
    large amounts of ZONE_DEVICE memory to a system with a limited amount of
    normal memory.

    On ppc64 we use huge pages to map the vmemmap, which requires the backing
    storage to be contiguous and aligned to the hugepage size. The altmap
    implementation allows the altmap provider to reserve a few PFNs at
    the start of the range for its own use, and when this occurs the
    first chunk of the altmap is not usable for hugepage mappings. On hash
    there is no sane way to fall back to a normal sized page mapping so we
    fail the allocation. This results in memory hotplug failing with
    ENOMEM when the new range doesn't fall into an existing vmemmap block.

    This patch handles this case by falling back to using system memory
    rather than failing if we cannot allocate from the altmap. This
    fallback should only ever be used for the first vmemmap block so it
    should not cause excess memory consumption.
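
    A sketch of the fallback (helper names approximated; error handling
    elided):

    void *p = NULL;

    if (altmap)
            p = altmap_alloc_block_buf(page_size, altmap);  /* may fail for the first chunk */
    if (!p)
            p = vmemmap_alloc_block_buf(page_size, node);   /* fall back to system memory */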

    Fixes: 7b73d978a5d0 ("mm: pass the vmem_altmap to vmemmap_populate")
    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Oliver O'Halloran
     
  • [ Upstream commit 462951cd32e1496dc64b00051dfb777efc8ae5d8 ]

    For some configs the build fails with:

    arch/powerpc/mm/dump_linuxpagetables.c: In function 'populate_markers':
    arch/powerpc/mm/dump_linuxpagetables.c:306:39: error: 'PKMAP_BASE' undeclared (first use in this function)
    arch/powerpc/mm/dump_linuxpagetables.c:314:50: error: 'LAST_PKMAP' undeclared (first use in this function)

    These come from highmem.h, including that fixes the build.
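
    The fix is thus a one-line include in dump_linuxpagetables.c (assuming
    the usual linux/ wrapper header):

    #include <linux/highmem.h>      /* PKMAP_BASE, LAST_PKMAP */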

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Michael Ellerman
     

01 Dec, 2018

1 commit

  • [ Upstream commit 437ccdc8ce629470babdda1a7086e2f477048cbd ]

    When the VPHN function is not supported, the kernel prints the message
    'VPHN function not supported. Disabling polling...' on every CPU
    hotplug event. This floods dmesg when a KVM guest tries to hotplug a
    huge number of vCPUs, so print the message once and suppress further
    kernel prints.
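
    One way to do that is a *_once printk variant, a sketch (whether the
    patch uses this exact variant is not shown here):

    pr_warn_once("VPHN function not supported. Disabling polling...\n");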

    Signed-off-by: Satheesh Rajendran
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Satheesh Rajendran
     

21 Nov, 2018

5 commits

  • commit cc4ebf5c0a3440ed0a32d25c55ebdb6ce5f3c0bc upstream.

    This reverts commit 4f94b2c7462d9720b2afa7e8e8d4c19446bb31ce.

    That commit was buggy, as it used rlwinm instead of rlwimi.
    Instead of fixing that bug, we revert the previous commit in order to
    reduce the dependency between L1 entries and L2 entries.

    Fixes: 4f94b2c7462d9 ("powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 803d690e68f0c5230183f1a42c7d50a41d16e380 ]

    When a process allocates a hugepage, the following leak is
    reported by kmemleak. This is a false positive which is
    due to the pointer to the table being stored in the PGD
    as a physical memory address and not a virtual memory pointer.

    unreferenced object 0xc30f8200 (size 512):
    comm "mmap", pid 374, jiffies 4872494 (age 627.630s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] huge_pte_alloc+0xdc/0x1f8
    [] hugetlb_fault+0x560/0x8f8
    [] follow_hugetlb_page+0x14c/0x44c
    [] __get_user_pages+0x1c4/0x3dc
    [] __mm_populate+0xac/0x140
    [] vm_mmap_pgoff+0xb4/0xb8
    [] ksys_mmap_pgoff+0xcc/0x1fc
    [] ret_from_syscall+0x0/0x38

    See commit a984506c542e2 ("powerpc/mm: Don't report PUDs as
    memory leaks when using kmemleak") for detailed explanation.

    To fix that, this patch tells kmemleak to ignore the allocated
    hugepage table.
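
    A sketch of the resulting call, assuming "new" is the freshly
    allocated table:

    /* the PGD stores a physical address, so kmemleak cannot see this
     * reference; stop it from reporting the object as leaked */
    kmemleak_ignore(new);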

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit f5e284803a7206d43e26f9ffcae5de9626d95e37 ]

    When enumerating page size definitions to check hardware support,
    we construct a constant which is (1U << (def->shift - 10)).

    However, the array of page size definitions is only initialised for
    various MMU_PAGE_* constants, so it contains a number of 0-initialised
    elements with def->shift == 0. This means we end up shifting by a
    very large number, which gives the following UBSan splat:

    ================================================================================
    UBSAN: Undefined behaviour in /home/dja/dev/linux/linux/arch/powerpc/mm/tlb_nohash.c:506:21
    shift exponent 4294967286 is too large for 32-bit type 'unsigned int'
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc3-00045-ga604f927b012-dirty #6
    Call Trace:
    [c00000000101bc20] [c000000000a13d54] .dump_stack+0xa8/0xec (unreliable)
    [c00000000101bcb0] [c0000000004f20a8] .ubsan_epilogue+0x18/0x64
    [c00000000101bd30] [c0000000004f2b10] .__ubsan_handle_shift_out_of_bounds+0x110/0x1a4
    [c00000000101be20] [c000000000d21760] .early_init_mmu+0x1b4/0x5a0
    [c00000000101bf10] [c000000000d1ba28] .early_setup+0x100/0x130
    [c00000000101bf90] [c000000000000528] start_here_multiplatform+0x68/0x80
    ================================================================================

    Fix this by first checking if the element exists (shift != 0) before
    constructing the constant.
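
    A sketch of the guard, with the loop shape approximated from the
    commit text (ps holds the hardware-reported page size):

    for (i = 0; i < MMU_PAGE_COUNT; i++) {
            struct mmu_psize_def *def = &mmu_psize_defs[i];

            if (!def->shift)
                    continue;   /* 0-initialised slot: shift would underflow */
            if (ps == (1U << (def->shift - 10)))
                    return i;
    }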

    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • [ Upstream commit 37e9c674e7e6f445e12cb1151017bd4bacdd1e2d ]

    This patch fixes the following warnings (obtained with make W=1).

    arch/powerpc/mm/slice.c: In function 'slice_range_to_mask':
    arch/powerpc/mm/slice.c:73:12: error: comparison is always true due to limited range of data type [-Werror=type-limits]
    if (start < SLICE_LOW_TOP) {
    ^
    arch/powerpc/mm/slice.c:81:20: error: comparison is always false due to limited range of data type [-Werror=type-limits]
    if ((start + len) > SLICE_LOW_TOP) {
    ^
    arch/powerpc/mm/slice.c: In function 'slice_mask_for_free':
    arch/powerpc/mm/slice.c:136:17: error: comparison is always true due to limited range of data type [-Werror=type-limits]
    if (high_limit < SLICE_LOW_TOP) {
    ^
    arch/powerpc/mm/slice.c:195:39: error: comparison is always false due to limited range of data type [-Werror=type-limits]
    if (SLICE_NUM_HIGH && ((start + len) > SLICE_LOW_TOP)) {
    ^
    arch/powerpc/mm/slice.c: In function 'slice_scan_available':
    arch/powerpc/mm/slice.c:306:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
    if (addr < SLICE_LOW_TOP) {
    ^
    arch/powerpc/mm/slice.c: In function 'get_slice_psize':
    arch/powerpc/mm/slice.c:709:11: error: comparison is always true due to limited range of data type [-Werror=type-limits]
    if (addr < SLICE_LOW_TOP) {
    ^
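
    The warnings are silenced by widening the value before comparing, for
    instance via a small helper (a sketch; the helper in the actual commit
    may differ in detail):

    static inline bool slice_addr_is_low(unsigned long addr)
    {
            u64 tmp = (u64)addr;

            return tmp < SLICE_LOW_TOP;     /* compare in 64-bit space */
    }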

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • [ Upstream commit 0d923962ab69c27cca664a2d535e90ef655110ca ]

    When we're running on Book3S with the Radix MMU enabled the page table
    dump currently prints the wrong addresses because it uses the wrong
    start address.

    Fix it to use PAGE_OFFSET rather than KERN_VIRT_START.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     

14 Nov, 2018

1 commit

  • commit bc276ecba132caccb1fda5863a652c15def2b8c6 upstream.

    PPC_INVALIDATE_ERAT is slbia IH=7 which is a new variant introduced
    with POWER9, and the result is undefined on earlier CPUs.

    Commits 7b9f71f974 ("powerpc/64s: POWER9 machine check handler") and
    d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on
    POWER9") caused POWER7/8 code to use this instruction. Remove it. An
    ERAT flush can be made by invalidating the SLB, but before POWER9 that
    requires a flush and rebolt.

    Fixes: 7b9f71f974 ("powerpc/64s: POWER9 machine check handler")
    Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
    Cc: stable@vger.kernel.org # v4.11+
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Nicholas Piggin
     

07 Oct, 2018

1 commit

  • Michael writes:
    "powerpc fixes for 4.19 #4

    Four regression fixes.

    A fix for a change to lib/xz which broke our zImage loader when
    building with XZ compression. OK'ed by Herbert who merged the
    original patch.

    The recent fix we did to avoid patching __init text broke some 32-bit
    machines, fix that.

    Our show_user_instructions() could be tricked into printing kernel
    memory, add a check to avoid that.

    And a fix for a change to our NUMA initialisation logic, which causes
    crashes in some kdump configurations.

    Thanks to:
    Christophe Leroy, Hari Bathini, Jann Horn, Joel Stanley, Meelis
    Roos, Murilo Opsfelder Araujo, Srikar Dronamraju."

    * tag 'powerpc-4.19-4' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/numa: Skip onlining a offline node in kdump path
    powerpc: Don't print kernel instructions in show_user_instructions()
    powerpc/lib: fix book3s/32 boot failure due to code patching
    lib/xz: Put CRC32_POLY_LE in xz_private.h

    Greg Kroah-Hartman
     

05 Oct, 2018

1 commit

  • With commit 2ea626306810 ("powerpc/topology: Get topology for shared
    processors at boot"), a kdump kernel on a shared LPAR may crash.

    The necessary conditions are
    - Shared LPAR with at least 2 nodes having memory and CPUs.
    - Memory requirement for kdump kernel must be met by the first N-1
    nodes where there are at least N nodes with memory and CPUs.

    Example numactl of such a machine.
    $ numactl -H
    available: 5 nodes (0,2,5-7)
    node 0 cpus:
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 255 MB
    node 2 free: 189 MB
    node 5 cpus: 24 25 26 27 28 29 30 31
    node 5 size: 4095 MB
    node 5 free: 4024 MB
    node 6 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
    node 6 size: 6353 MB
    node 6 free: 5998 MB
    node 7 cpus: 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39
    node 7 size: 7640 MB
    node 7 free: 7164 MB
    node distances:
    node 0 2 5 6 7
    0: 10 40 40 40 40
    2: 40 10 40 40 40
    5: 40 40 10 40 40
    6: 40 40 40 10 20
    7: 40 40 40 20 10

    Steps to reproduce.
    1. Load / start kdump service.
    2. Trigger a kdump (for example : echo c > /proc/sysrq-trigger)

    When booting a kdump kernel with 2048M:

    kexec: Starting switchover sequence.
    I'm in purgatory
    Using 1TB segments
    hash-mmu: Initializing hash mmu with SLB
    Linux version 4.19.0-rc5-master+ (srikar@linux-xxu6) (gcc version 4.8.5 (SUSE Linux)) #1 SMP Thu Sep 27 19:45:00 IST 2018
    Found initrd at 0xc000000009e70000:0xc00000000ae554b4
    Using pSeries machine description
    -----------------------------------------------------
    ppc64_pft_size = 0x1e
    phys_mem_size = 0x88000000
    dcache_bsize = 0x80
    icache_bsize = 0x80
    cpu_features = 0x000000ff8f5d91a7
    possible = 0x0000fbffcf5fb1a7
    always = 0x0000006f8b5c91a1
    cpu_user_features = 0xdc0065c2 0xef000000
    mmu_features = 0x7c006001
    firmware_features = 0x00000007c45bfc57
    htab_hash_mask = 0x7fffff
    physical_start = 0x8000000
    -----------------------------------------------------
    numa: NODE_DATA [mem 0x87d5e300-0x87d67fff]
    numa: NODE_DATA(0) on node 6
    numa: NODE_DATA [mem 0x87d54600-0x87d5e2ff]
    Top of RAM: 0x88000000, Total RAM: 0x88000000
    Memory hole size: 0MB
    Zone ranges:
    DMA [mem 0x0000000000000000-0x0000000087ffffff]
    DMA32 empty
    Normal empty
    Movable zone start for each node
    Early memory node ranges
    node 6: [mem 0x0000000000000000-0x0000000087ffffff]
    Could not find start_pfn for node 0
    Initmem setup node 0 [mem 0x0000000000000000-0x0000000000000000]
    On node 0 totalpages: 0
    Initmem setup node 6 [mem 0x0000000000000000-0x0000000087ffffff]
    On node 6 totalpages: 34816

    Unable to handle kernel paging request for data at address 0x00000060
    Faulting instruction address: 0xc000000008703a54
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in:
    CPU: 11 PID: 1 Comm: swapper/11 Not tainted 4.19.0-rc5-master+ #1
    NIP: c000000008703a54 LR: c000000008703a38 CTR: 0000000000000000
    REGS: c00000000b673440 TRAP: 0380 Not tainted (4.19.0-rc5-master+)
    MSR: 8000000002009033 CR: 24022022 XER: 20000002
    CFAR: c0000000086fc238 IRQMASK: 0
    GPR00: c000000008703a38 c00000000b6736c0 c000000009281900 0000000000000000
    GPR04: 0000000000000000 0000000000000000 fffffffffffff001 c00000000b660080
    GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000220
    GPR12: 0000000000002200 c000000009e51400 0000000000000000 0000000000000008
    GPR16: 0000000000000000 c000000008c152e8 c000000008c152a8 0000000000000000
    GPR20: c000000009422fd8 c000000009412fd8 c000000009426040 0000000000000008
    GPR24: 0000000000000000 0000000000000000 c000000009168bc8 c000000009168c78
    GPR28: c00000000b126410 0000000000000000 c00000000916a0b8 c00000000b126400
    NIP [c000000008703a54] bus_add_device+0x84/0x1e0
    LR [c000000008703a38] bus_add_device+0x68/0x1e0
    Call Trace:
    [c00000000b6736c0] [c000000008703a38] bus_add_device+0x68/0x1e0 (unreliable)
    [c00000000b673740] [c000000008700194] device_add+0x454/0x7c0
    [c00000000b673800] [c00000000872e660] __register_one_node+0xb0/0x240
    [c00000000b673860] [c00000000839a6bc] __try_online_node+0x12c/0x180
    [c00000000b673900] [c00000000839b978] try_online_node+0x58/0x90
    [c00000000b673930] [c0000000080846d8] find_and_online_cpu_nid+0x158/0x190
    [c00000000b673a10] [c0000000080848a0] numa_update_cpu_topology+0x190/0x580
    [c00000000b673c00] [c000000008d3f2e4] smp_cpus_done+0x94/0x108
    [c00000000b673c70] [c000000008d5c00c] smp_init+0x174/0x19c
    [c00000000b673d00] [c000000008d346b8] kernel_init_freeable+0x1e0/0x450
    [c00000000b673dc0] [c0000000080102e8] kernel_init+0x28/0x160
    [c00000000b673e30] [c00000000800b65c] ret_from_kernel_thread+0x5c/0x80
    Instruction dump:
    60000000 60000000 e89e0020 7fe3fb78 4bff87d5 60000000 7c7d1b79 4082008c
    e8bf0050 e93e0098 3b9f0010 2fa50000 38630018 419e0114 7f84e378
    ---[ end trace 593577668c2daa65 ]---

    However, a regular kernel with 4096M (2048M of which gets reserved for
    the crash kernel) boots properly.

    Unlike regular kernels, which mark all available nodes as online, the
    kdump kernel marks only just enough nodes as online and marks the rest
    as offline at boot. However, the kdump kernel boots with all available
    CPUs. With commit 2ea626306810 ("powerpc/topology: Get topology for
    shared processors at boot"), all CPUs are onlined on their respective
    nodes at boot time. try_online_node() tries to online the offline
    nodes but fails, as all needed subsystems are not yet initialized.

    As part of the fix, detect and skip early onlining of an offline node.

    Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
    Reported-by: Pavithra Prakash
    Signed-off-by: Srikar Dronamraju
    Tested-by: Hari Bathini
    Signed-off-by: Michael Ellerman

    Srikar Dronamraju
     

29 Sep, 2018

1 commit

  • Michael writes:
    "powerpc fixes for 4.19 #3

    A reasonably big batch of fixes due to me being away for a few weeks.

    A fix for the TM emulation support on Power9, which could result in
    corrupting the guest r11 when running under KVM.

    Two fixes to the TM code which could lead to userspace GPR corruption
    if we take an SLB miss at exactly the wrong time.

    Our dynamic patching code had a bug that meant we could patch freed
    __init text, which could lead to corrupting userspace memory.

    csum_ipv6_magic() didn't work on little endian platforms since we
    optimised it recently.

    A fix for an endian bug when reading a device tree property telling
    us how many storage keys the machine has available.

    Fix a crash seen on some configurations of PowerVM when migrating the
    partition from one machine to another.

    A fix for a regression in the setup of our CPU to NUMA node mapping
    in KVM guests.

    A fix to our selftest Makefiles to make them work since a recent
    change to the shared Makefile logic."

    * tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    selftests/powerpc: Fix Makefiles for headers_install change
    powerpc/numa: Use associativity if VPHN hcall is successful
    powerpc/tm: Avoid possible userspace r1 corruption on reclaim
    powerpc/tm: Fix userspace r13 corruption
    powerpc/pseries: Fix unitialized timer reset on migration
    powerpc/pkeys: Fix reading of ibm, processor-storage-keys property
    powerpc: fix csum_ipv6_magic() on little endian platforms
    powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
    powerpc: Avoid code patching freed init sections
    KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds

    Greg Kroah-Hartman
     

25 Sep, 2018

1 commit

    Currently associativity is used to look up the node-id even if the
    preceding VPHN hcall failed. However, this can cause a CPU to be made
    part of the wrong node (most likely node 0). This is because
    VPHN is not enabled on KVM guests.

    With 2ea6263 ("powerpc/topology: Get topology for shared processors at
    boot"), associativity is used and the wrong node is set. Hence KVM
    guest topology is broken.

    For example, a 4-node KVM guest would previously have reported:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3
    node 0 size: 1746 MB
    node 0 free: 1604 MB
    node 1 cpus: 4 5 6 7
    node 1 size: 2044 MB
    node 1 free: 1765 MB
    node 2 cpus: 8 9 10 11
    node 2 size: 2044 MB
    node 2 free: 1837 MB
    node 3 cpus: 12 13 14 15
    node 3 size: 2044 MB
    node 3 free: 1903 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    It would now report:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 1746 MB
    node 0 free: 1244 MB
    node 1 cpus:
    node 1 size: 2044 MB
    node 1 free: 2032 MB
    node 2 cpus: 1
    node 2 size: 2044 MB
    node 2 free: 2028 MB
    node 3 cpus:
    node 3 size: 2044 MB
    node 3 free: 2032 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    Fix this by skipping associativity lookup if the VPHN hcall failed.
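
    A sketch of the guard (function shapes approximated from numa.c):

    rc = hcall_vphn(cpu, flags, associativity);
    if (rc == H_SUCCESS)
            new_nid = associativity_to_nid(associativity);
    else
            new_nid = NUMA_NO_NODE;     /* don't trust a stale buffer */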

    Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman

    Srikar Dronamraju
     

24 Sep, 2018

1 commit

  • After migration of a powerpc LPAR, the kernel executes code to
    update the system state to reflect new platform characteristics.

    Such changes include modifications to device tree properties provided
    to the system by PHYP. Property notifications received by the
    post_mobility_fixup() code are passed along to the kernel in general
    through a call to of_update_property() which in turn passes such
    events back to all modules through entries like the '.notifier_call'
    function within the NUMA module.

    When the NUMA module updates its state, it resets its event timer. If
    this occurs after a previous call to stop_topology_update() or on a
    system without VPHN enabled, the code runs into an uninitialized timer
    structure and crashes. This patch adds a safety check along the path
    to the problem code.
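
    A sketch of the safety check (interval illustrative; vphn_enabled is
    an existing flag in numa.c, though the fix's exact guard may differ):

    static void reset_topology_timer(void)
    {
            /* the timer is only initialised when VPHN polling is on */
            if (vphn_enabled)
                    mod_timer(&topology_timer, jiffies + 60 * HZ);
    }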

    An example crash log is as follows.

    ibmvscsi 30000081: Re-enabling adapter!
    ------------[ cut here ]------------
    kernel BUG at kernel/time/timer.c:958!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
    ...
    NIP mod_timer+0x4c/0x400
    LR reset_topology_timer+0x40/0x60
    Call Trace:
    0xc0000003f9407830 (unreliable)
    reset_topology_timer+0x40/0x60
    dt_update_callback+0x100/0x120
    notifier_call_chain+0x90/0x100
    __blocking_notifier_call_chain+0x60/0x90
    of_property_notify+0x90/0xd0
    of_update_property+0x104/0x150
    update_dt_property+0xdc/0x1f0
    pseries_devicetree_update+0x2d0/0x510
    post_mobility_fixup+0x7c/0xf0
    migration_store+0xa4/0xc0
    kobj_attr_store+0x30/0x60
    sysfs_kf_write+0x64/0xa0
    kernfs_fop_write+0x16c/0x240
    __vfs_write+0x40/0x200
    vfs_write+0xc8/0x240
    ksys_write+0x5c/0x100
    system_call+0x58/0x6c

    Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
    Cc: stable@vger.kernel.org # v3.10+
    Signed-off-by: Michael Bringmann
    Signed-off-by: Michael Ellerman

    Michael Bringmann
     

20 Sep, 2018

1 commit

  • scan_pkey_feature() uses of_property_read_u32_array() to read the
    ibm,processor-storage-keys property and calls be32_to_cpu() on the
    value it gets. The problem is that of_property_read_u32_array() already
    returns the value converted to the CPU byte order.

    The value of pkeys_total ends up more or less sane because there's a min()
    call in pkey_initialize() which reduces pkeys_total to 32. So in practice
    the kernel ignores the fact that the hypervisor reserved one key for
    itself (the device tree advertises 31 keys in my test VM).

    This is wrong, but the effect in practice is that when a process tries to
    allocate the 32nd key, it gets an -EINVAL error instead of -ENOSPC which
    would indicate that there aren't any keys available.
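
    A sketch of the corrected read (property name as quoted above; dn is
    assumed to be the CPU device node):

    u32 pkeys;

    /* of_property_read_u32_array() already returns host-endian values,
     * so no be32_to_cpu() conversion is needed afterwards */
    if (!of_property_read_u32_array(dn, "ibm,processor-storage-keys",
                                    &pkeys, 1))
            pkeys_total = pkeys;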

    Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
    Cc: stable@vger.kernel.org # v4.16+
    Signed-off-by: Thiago Jung Bauermann
    Signed-off-by: Michael Ellerman

    Thiago Jung Bauermann
     

18 Sep, 2018

1 commit

  • This stops us from doing code patching in init sections after they've
    been freed.

    In this chain:
    kvm_guest_init() ->
    kvm_use_magic_page() ->
    fault_in_pages_readable() ->
    __get_user() ->
    __get_user_nocheck() ->
    barrier_nospec();

    We have a code patching location at barrier_nospec() and
    kvm_guest_init() is an init function. This whole chain gets inlined,
    so when we free the init section (hence kvm_guest_init()), this code
    goes away and hence should no longer be patched.
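
    A sketch of the guard in the patching path (flag and helper per my
    reading; treat as an approximation of the fix):

    /* init text has been freed; patching it would scribble over
     * whatever now owns those pages */
    if (init_mem_is_free && init_section_contains(addr, 4))
            return 0;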

    We saw this as userspace memory corruption when using a memory
    checker while doing partition migration testing on PowerVM (this
    starts the code patching post migration via
    /sys/kernel/mobility/migration). In theory, it could also happen when
    using /sys/kernel/debug/powerpc/barrier_nospec.

    Cc: stable@vger.kernel.org # 4.13+
    Signed-off-by: Michael Neuling
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Christophe Leroy
    Signed-off-by: Michael Ellerman

    Michael Neuling
     

12 Sep, 2018

1 commit

  • At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
    which in turn reads the old TCE and if it was a valid entry, marks
    the physical page dirty if it was mapped for writing. Since it is in
    real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
    to get the page struct. However SetPageDirty() itself reads the compound
    page head and returns a virtual address for the head page struct and
    setting dirty bit for that kills the system.

    This adds additional dirty bit tracking into the MM/IOMMU API for use
    in the real mode. Note that this does not change how VFIO and
    KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
    - use the lowest bit of the cached host phys address to carry
    the dirty bit;
    - mark pages dirty when they are unpinned which happens when
    the preregistered memory is released which always happens in virtual
    mode;
    - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
    - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
    in the new mm_iommu_ua_mark_dirty_rm() helper;
    - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
    caller anyway) to reduce the real mode KVM and IOMMU knowledge
    across different subsystems.

    This removes realmode_pfn_to_page() as it is not used anymore.

    While we're at it, remove some EXPORT_SYMBOL_GPL() as that code is for
    the real mode only and modules cannot call it anyway.
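
    A sketch of the low-bit encoding (the macro name is hypothetical;
    mem->hpas is the cached host phys address array):

    #define MM_IOMMU_PAGE_DIRTY     0x1UL

    /* real mode: only flag the cached host phys address */
    mem->hpas[entry] |= MM_IOMMU_PAGE_DIRTY;

    /* virtual mode, at unpin time, where SetPageDirty() is safe */
    if (mem->hpas[entry] & MM_IOMMU_PAGE_DIRTY)
            SetPageDirty(pfn_to_page(mem->hpas[entry] >> PAGE_SHIFT));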

    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Alexey Kardashevskiy
     

27 Aug, 2018

1 commit

  • Pull IDA updates from Matthew Wilcox:
    "A better IDA API:

    id = ida_alloc(ida, GFP_xxx);
    ida_free(ida, id);

    rather than the cumbersome ida_simple_get(), ida_simple_remove().

    The new IDA API is similar to ida_simple_get() but better named. The
    internal restructuring of the IDA code removes the bitmap
    preallocation nonsense.

    I hope the net -200 lines of code is convincing"

    * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
    ida: Change ida_get_new_above to return the id
    ida: Remove old API
    test_ida: check_ida_destroy and check_ida_alloc
    test_ida: Convert check_ida_conv to new API
    test_ida: Move ida_check_max
    test_ida: Move ida_check_leaf
    idr-test: Convert ida_check_nomem to new API
    ida: Start new test_ida module
    target/iscsi: Allocate session IDs from an IDA
    iscsi target: fix session creation failure handling
    drm/vmwgfx: Convert to new IDA API
    dmaengine: Convert to new IDA API
    ppc: Convert vas ID allocation to new IDA API
    media: Convert entity ID allocation to new IDA API
    ppc: Convert mmu context allocation to new IDA API
    Convert net_namespace to new IDA API
    cb710: Convert to new IDA API
    rsxx: Convert to new IDA API
    osd: Convert to new IDA API
    sd: Convert to new IDA API
    ...

    Linus Torvalds
     

25 Aug, 2018

1 commit

  • Pull powerpc fixes from Michael Ellerman:

    - An implementation for the newly added hv_ops->flush() for the OPAL
    hvc console driver backends, I forgot to apply this after merging the
    hvc driver changes before the merge window.

    - Enable all PCI bridges at boot on powernv, to avoid races when
    multiple children of a bridge try to enable it simultaneously. This
    is a workaround until the PCI core can be enhanced to fix the races.

    - A fix to query PowerVM for the correct system topology at boot before
    initialising sched domains, seen in some configurations to cause
    broken scheduling etc.

    - A fix for pte_access_permitted() on "nohash" platforms.

    - Two commits to fix SIGBUS when using remap_pfn_range() seen on Power9
    due to a workaround when using the nest MMU (GPUs, accelerators).

    - Another fix to the VFIO code used by KVM, the previous fix had some
    bugs which caused guests to not start in some configurations.

    - A handful of other minor fixes.

    Thanks to: Aneesh Kumar K.V, Benjamin Herrenschmidt, Christophe Leroy,
    Hari Bathini, Luke Dashjr, Mahesh Salgaonkar, Nicholas Piggin, Paul
    Mackerras, Srikar Dronamraju.

    * tag 'powerpc-4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/mce: Fix SLB rebolting during MCE recovery path.
    KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages
    powerpc/mm/radix: Only need the Nest MMU workaround for R -> RW transition
    powerpc/mm/books3s: Add new pte bit to mark pte temporarily invalid.
    powerpc/nohash: fix pte_access_permitted()
    powerpc/topology: Get topology for shared processors at boot
    powerpc64/ftrace: Include ftrace.h needed for enable/disable calls
    powerpc/powernv/pci: Work around races in PCI bridge enabling
    powerpc/fadump: cleanup crash memory ranges support
    powerpc/powernv: provide a console flush operation for opal hvc driver
    powerpc/traps: Avoid rate limit messages from show unhandled signals
    powerpc/64s: Fix PACA_IRQ_HARD_DIS accounting in idle_power4()

    Linus Torvalds
     

23 Aug, 2018

3 commits

  • The commit e7e81847478 ("powerpc/64s: move machine check SLB flushing
    to mm/slb.c") introduced a bug in reloading bolted SLB entries. Unused
    bolted entries are stored with .esid=0 in the slb_shadow area, and
    that value is now used directly as the RB input to slbmte, which means
    the RB[52:63] index field is set to 0, which causes SLB entry 0 to be
    cleared.

    Fix this by storing the index bits in the unused bolted entries, which
    directs the slbmte to the right place.

    The SLB shadow area is also used by the hypervisor, but PAPR is okay
    with that, from LoPAPR v1.1, 14.11.1.3 SLB Shadow Buffer:

    Note: SLB is filled sequentially starting at index 0
    from the shadow buffer ignoring the contents of
    RB field bits 52-63
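
    A sketch of the change as described, with unused shadow entries
    carrying their index rather than a zero esid:

    static inline void slb_shadow_clear(enum slb_index index)
    {
            /* keep RB[52:63] valid so a rebolt lands in the right slot */
            WRITE_ONCE(get_slb_shadow()->save_area[index].esid,
                       cpu_to_be64(index));
    }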

    Fixes: e7e81847478b ("powerpc/64s: move machine check SLB flushing to mm/slb.c")
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman

    Mahesh Salgaonkar
     
  • Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in
    the pinned physical page", 2018-07-17) added some checks to ensure
    that guest DMA mappings don't attempt to map more than the guest is
    entitled to access. However, errors in the logic mean that legitimate
    guest requests to map pages for DMA are being denied in some
    situations. Specifically, if the first page of the range passed to
    mm_iommu_get() is mapped with a normal page, and subsequent pages are
    mapped with transparent huge pages, we end up with mem->pageshift ==
    0. That means that the page size checks in mm_iommu_ua_to_hpa() and
    mm_iommu_ua_to_hpa_rm() will always fail for every page in that
    region, and thus the guest can never map any memory in that region for
    DMA, typically leading to a flood of error messages like this:

    qemu-system-ppc64: VFIO_MAP_DMA: -22
    qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800000000000000, 0x10000, 0x7fff63ff0000) = -22 (Invalid argument)

    The logic errors in mm_iommu_get() are:

    (a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte()
    call (meaning that find_linux_pte() returns the pte for the
    first address in the range, not the address we are currently up
    to);
    (b) use of 'pageshift' as the variable to receive the hugepage shift
    returned by find_linux_pte() - for a normal page this gets set
    to 0, leading to us setting mem->pageshift to 0 when we conclude
    that the pte returned by find_linux_pte() didn't match the page
    we were looking at;
    (c) comparing 'compshift', which is a page order, i.e. log base 2 of
    the number of pages, with 'pageshift', which is a log base 2 of
    the number of bytes.

    To fix these problems, this patch introduces 'cur_ua' to hold the
    current user address and uses that in the find_linux_pte() call;
    introduces 'pteshift' to hold the hugepage shift found by
    find_linux_pte(); and compares 'pteshift' with 'compshift +
    PAGE_SHIFT' rather than 'compshift'.

    The patch also moves the local_irq_restore to the point after the PTE
    pointer returned by find_linux_pte() has been dereferenced because
    otherwise the PTE could change underneath us, and adds a check to
    avoid doing the find_linux_pte() call once mem->pageshift has been
    reduced to PAGE_SHIFT, as an optimization.
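
    A sketch of the three corrections, using the variable names from the
    text above (the surrounding loop, locking, and error handling are
    elided):

    unsigned long cur_ua = ua + (i << PAGE_SHIFT);           /* fix (a) */
    unsigned int pteshift;
    pte_t *pte;

    pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);  /* fix (b) */
    if (pte && pteshift == compshift + PAGE_SHIFT)           /* fix (c) */
            mem->pageshift = min(mem->pageshift, pteshift);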

    Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the pinned physical page")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman

    Paul Mackerras
     
  • The Nest MMU workaround is only needed for RW upgrades. Avoid doing
    that for other PTE updates.

    We also avoid clearing the PTE while marking it invalid, because other
    page table walkers would find the PTE none, which can result in
    unexpected behaviour. Instead we clear _PAGE_PRESENT and set the
    software PTE bit _PAGE_INVALID. pte_present() is already updated to
    check for both bits. This makes sure page table walkers will find the
    PTE present and that things like pte_pfn(pte) return the right value.
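
    A sketch of the transition (pte_update() is the book3s/64 primitive;
    the exact call site is approximated):

    /* clear PRESENT and set the software INVALID bit in one atomic
     * update, so a concurrent walker never observes a none PTE */
    old_pte = pte_update(mm, addr, ptep, _PAGE_PRESENT, _PAGE_INVALID, 0);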

    Based on an original patch from Benjamin Herrenschmidt

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

21 Aug, 2018

1 commit

    On a shared LPAR, PHYP will not update the CPU associativity at boot
    time. Just after boot the system does recognize itself as a shared
    LPAR and triggers a request for the correct CPU associativity. But by
    then the scheduler would have already created/destroyed its sched
    domains.

    This causes:
    - Broken load balancing across nodes, causing islands of cores.
    - Performance degradation, especially if the system is lightly loaded.
    - dmesg to wrongly report all CPUs to be in node 0.
    - Messages in dmesg saying "borken" topology.
    - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
    node sched domain"), possible rcu stalls at boot up.

    The sched_domains_numa_masks table, which is used to generate cpumasks,
    is only created at boot time just before creating sched domains and is
    never updated. Hence, it's better to get the topology correct before
    the sched domains are created.

    For example on 64 core Power 8 shared LPAR, dmesg reports

    Brought up 512 CPUs
    Node 0 CPUs: 0-511
    Node 1 CPUs:
    Node 2 CPUs:
    Node 3 CPUs:
    Node 4 CPUs:
    Node 5 CPUs:
    Node 6 CPUs:
    Node 7 CPUs:
    Node 8 CPUs:
    Node 9 CPUs:
    Node 10 CPUs:
    Node 11 CPUs:
    ...
    BUG: arch topology borken
    the DIE domain not a subset of the NUMA domain
    BUG: arch topology borken
    the DIE domain not a subset of the NUMA domain

    numactl/lscpu output will still be correct with cores spreading across
    all nodes:

    Socket(s): 64
    NUMA node(s): 12
    Model: 2.0 (pvr 004d 0200)
    Model name: POWER8 (architected), altivec supported
    Hypervisor vendor: pHyp
    Virtualization type: para
    L1d cache: 64K
    L1i cache: 32K
    NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
    NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
    NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
    NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
    NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
    NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
    NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
    NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
    NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
    NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
    NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
    NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

    Currently on this LPAR, the scheduler detects two levels of NUMA and
    creates NUMA sched domains for all CPUs, but it finds a single DIE
    domain consisting of all CPUs. Hence it deletes all NUMA sched
    domains.

    To address this, detect the shared processor and update the topology
    soon after CPUs are set up, so that the correct topology is in place
    just before the scheduler creates its sched domains.

    With the fix, dmesg reports:

    numa: Node 0 CPUs: 0-7 32-39 64-71 96-103 176-183 272-279 368-375 464-471
    numa: Node 1 CPUs: 8-15 40-47 72-79 104-111 184-191 280-287 376-383 472-479
    numa: Node 2 CPUs: 16-23 48-55 80-87 112-119 192-199 288-295 384-391 480-487
    numa: Node 3 CPUs: 24-31 56-63 88-95 120-127 200-207 296-303 392-399 488-495
    numa: Node 4 CPUs: 208-215 304-311 400-407 496-503
    numa: Node 5 CPUs: 168-175 264-271 360-367 456-463
    numa: Node 6 CPUs: 128-135 224-231 320-327 416-423
    numa: Node 7 CPUs: 136-143 232-239 328-335 424-431
    numa: Node 8 CPUs: 216-223 312-319 408-415 504-511
    numa: Node 9 CPUs: 144-151 240-247 336-343 432-439
    numa: Node 10 CPUs: 152-159 248-255 344-351 440-447
    numa: Node 11 CPUs: 160-167 256-263 352-359 448-455

    and lscpu also reports:

    Socket(s): 64
    NUMA node(s): 12
    Model: 2.0 (pvr 004d 0200)
    Model name: POWER8 (architected), altivec supported
    Hypervisor vendor: pHyp
    Virtualization type: para
    L1d cache: 64K
    L1i cache: 32K
    NUMA node0 CPU(s): 0-7,32-39,64-71,96-103,176-183,272-279,368-375,464-471
    NUMA node1 CPU(s): 8-15,40-47,72-79,104-111,184-191,280-287,376-383,472-479
    NUMA node2 CPU(s): 16-23,48-55,80-87,112-119,192-199,288-295,384-391,480-487
    NUMA node3 CPU(s): 24-31,56-63,88-95,120-127,200-207,296-303,392-399,488-495
    NUMA node4 CPU(s): 208-215,304-311,400-407,496-503
    NUMA node5 CPU(s): 168-175,264-271,360-367,456-463
    NUMA node6 CPU(s): 128-135,224-231,320-327,416-423
    NUMA node7 CPU(s): 136-143,232-239,328-335,424-431
    NUMA node8 CPU(s): 216-223,312-319,408-415,504-511
    NUMA node9 CPU(s): 144-151,240-247,336-343,432-439
    NUMA node10 CPU(s): 152-159,248-255,344-351,440-447
    NUMA node11 CPU(s): 160-167,256-263,352-359,448-455

    Reported-by: Manjunatha H R
    Signed-off-by: Srikar Dronamraju
    [mpe: Trim / format change log]
    Tested-by: Michael Ellerman
    Signed-off-by: Michael Ellerman

    Srikar Dronamraju
     

18 Aug, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - a few Y2038 fixes

    - ntfs fixes

    - arch/sh tweaks

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (111 commits)
    mm/hmm.c: remove unused variables align_start and align_end
    fs/userfaultfd.c: remove redundant pointer uwq
    mm, vmacache: hash addresses based on pmd
    mm/list_lru: introduce list_lru_shrink_walk_irq()
    mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
    mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
    mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
    mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
    mm/sparse: delete old sparse_init and enable new one
    mm/sparse: add new sparse_init_nid() and sparse_init()
    mm/sparse: move buffer init/fini to the common place
    mm/sparse: use the new sparse buffer functions in non-vmemmap
    mm/sparse: abstract sparse buffer allocations
    mm/hugetlb.c: don't zero 1GiB bootmem pages
    mm, page_alloc: double zone's batchsize
    mm/oom_kill.c: document oom_lock
    mm/hugetlb: remove gigantic page support for HIGHMEM
    mm, oom: remove sleep from under oom_lock
    kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
    mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
    ...

    Linus Torvalds
     
  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref: commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    In this patch all the callers of handle_mm_fault() are changed to
    return the vm_fault_t type.
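
    A sketch of a converted caller:

    vm_fault_t fault;

    fault = handle_mm_fault(vma, address, flags);
    if (fault & VM_FAULT_ERROR)
            return fault;   /* propagate the typed value, not a bare int */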

    Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Cc: Matthew Wilcox
    Cc: Richard Henderson
    Cc: Tony Luck
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Jonas Bonn
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Palmer Dabbelt
    Cc: Yoshinori Sato
    Cc: David S. Miller
    Cc: Richard Weinberger
    Cc: Guan Xuetao
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Levin, Alexander (Sasha Levin)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

13 Aug, 2018

2 commits

  • Add statistics that show how memory is mapped within the kernel linear mapping.
    This is similar to commit 37cd944c8d8f ("s390/pgtable: add mapping statistics")

    We don't do this for hash translation mode. Hash uses one size
    (mmu_linear_psize) to map the kernel linear mapping, and we print the
    linear psize during boot as below.

    "Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24"

    A sample output looks like:

    DirectMap4k: 0 kB
    DirectMap64k: 18432 kB
    DirectMap2M: 1030144 kB
    DirectMap1G: 11534336 kB
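
    A sketch of how the counters behind that output are maintained (shape
    inferred from the s390 commit referenced above):

    /* called whenever part of the linear mapping is (un)mapped */
    void update_page_count(int psize, long count)
    {
            atomic_long_add(count, &direct_pages_count[psize]);
    }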

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     
  • Merge our fixes branch from the 4.18 cycle to resolve some minor
    conflicts.

    Michael Ellerman
     

07 Aug, 2018

3 commits

  • In Makefiles if we're testing a CONFIG_FOO symbol for equality with 'y'
    we can instead just use ifdef. The latter reads more easily, so convert to
    it where possible.

    Signed-off-by: Rodrigo R. Galvao
    Reviewed-by: Mauro S. M. Rodrigues
    Signed-off-by: Michael Ellerman

    Rodrigo R. Galvao
     
  • The page table fragment allocator uses the main page refcount racily
    with respect to speculative references. A customer observed a BUG due
    to page table page refcount underflow in the fragment allocator. This
    can be caused by the fragment allocator set_page_count stomping on a
    speculative reference, and then the speculative failure handler
    decrements the new reference, and the underflow eventually pops when
    the page tables are freed.

    Fix this by using a dedicated field in the struct page for the page
    table fragment allocator.
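
    A sketch of the dedicated field in use (fragment hand-out logic
    elided; PTE_FRAG_NR is the number of fragments per page):

    /* pt_frag_refcount is disjoint from the struct page refcount that
     * speculative references manipulate, so they can no longer race */
    atomic_set(&page->pt_frag_refcount, PTE_FRAG_NR);

    /* on the last fragment free, release the backing page */
    if (atomic_dec_and_test(&page->pt_frag_refcount))
            __free_page(page);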

    Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
    Cc: stable@vger.kernel.org # v3.10+
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Michael Ellerman

    Nicholas Piggin
     
  • The kernel page table caches are tied to init_mm, so there is no
    more need for them after userspace is finished.

    destroy_context() gets called when we drop the last reference for an
    mm, which can be much later than the task exit due to other lazy mm
    references to it. We can free the page table cache pages on task exit
    because they only cache the userspace page tables and kernel threads
    should not access user space addresses.

    The mapping for kernel threads itself is maintained in init_mm and
    page table cache for that is attached to init_mm.

    Signed-off-by: Nicholas Piggin
    Reviewed-by: Aneesh Kumar K.V
    [mpe: Merge change log additions from Aneesh]
    Signed-off-by: Michael Ellerman

    Nicholas Piggin
     

30 Jul, 2018

2 commits