17 Apr, 2019

1 commit

  • commit 897bc3df8c5aebb54c32d831f917592e873d0559 upstream.

    Commit e1c3743e1a20 ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
    moved a code block around, and this block uses a 'msr' variable outside of
    a CONFIG_PPC_TRANSACTIONAL_MEM block. However, the 'msr' variable is
    declared inside a CONFIG_PPC_TRANSACTIONAL_MEM block, causing a possible
    error when CONFIG_PPC_TRANSACTIONAL_MEM is not defined:

    error: 'msr' undeclared (first use in this function)

    This is not causing a compilation error in the mainline kernel, because
    'msr' is being used as an argument of MSR_TM_ACTIVE(), which is defined as
    the following when CONFIG_PPC_TRANSACTIONAL_MEM is *not* set:

    #define MSR_TM_ACTIVE(x) 0

    This patch fixes the issue by avoiding any use of the 'msr' variable
    outside the CONFIG_PPC_TRANSACTIONAL_MEM block, rather than relying on
    the MSR_TM_ACTIVE() definition.
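
    For illustration, a minimal standalone sketch of the pattern (the
    config option and macro names follow the commit text; the rest is
    invented for the demo):

    #include <stdio.h>

    /* #define CONFIG_PPC_TRANSACTIONAL_MEM */   /* toggle to compare */

    #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
    #define MSR_TM_ACTIVE(x) ((x) & 1)           /* stand-in for the real test */
    #else
    #define MSR_TM_ACTIVE(x) 0
    #endif

    int main(void)
    {
    #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
        unsigned long msr = 1;
    #endif
        /* With TM off, MSR_TM_ACTIVE(msr) expands to plain 0, so 'msr'
         * never reaches the compiler and no error is raised. Any other
         * use of 'msr' here fails with "'msr' undeclared". */
        if (MSR_TM_ACTIVE(msr))
            printf("transaction active\n");
        return 0;
    }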

    Cc: stable@vger.kernel.org
    Reported-by: Christoph Biedl
    Fixes: e1c3743e1a20 ("powerpc/tm: Set MSR[TS] just prior to recheckpoint")
    Signed-off-by: Breno Leitao
    Signed-off-by: Michael Ellerman
    Signed-off-by: Michael Neuling
    Signed-off-by: Sasha Levin

    Breno Leitao
     

06 Apr, 2019

5 commits

  • [ Upstream commit 81b61324922c67f73813d8a9c175f3c153f6a1c6 ]

    On pseries systems, performing a partition migration can result in
    altering the nodes a CPU is assigned to on the destination system. For
    example, pre-migration on the source system CPUs are in nodes 1 and 3,
    post-migration on the destination system CPUs are in nodes 2 and 3.

    Handling the node change for a CPU can cause corruption in the slab
    cache if we hit a timing window where a CPU's node is changed while cache_reap()
    is invoked. The corruption occurs because the slab cache code appears
    to rely on the CPU and slab cache pages being on the same node.

    The current dynamic updating of a CPU's node done in arch/powerpc/mm/numa.c
    does not prevent us from hitting this scenario.

    Changing the device tree property update notification handler that
    recognizes an affinity change for a CPU to do a full DLPAR remove and
    add of the CPU instead of dynamically changing its node resolves this
    issue.

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Michael W. Bringmann
    Tested-by: Michael W. Bringmann
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Nathan Fontenot
     
  • [ Upstream commit eddd0b332304d554ad6243942f87c2fcea98c56b ]

    The ppc64 specific implementation of the reliable stacktracer,
    save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
    trace" whenever it finds an exception frame on the stack. Stack frames
    are classified as exception frames if the STACK_FRAME_REGS_MARKER
    magic, as written by exception prologues, is found at a particular
    location.

    However, as observed by Joe Lawrence, it is possible in practice that
    non-exception stack frames can alias with prior exception frames and
    thus, that the reliable stacktracer can find a stale
    STACK_FRAME_REGS_MARKER on the stack. This in turn falsely reports an
    unreliable stacktrace and blocks any live patching transition from
    finishing. Said condition lasts until the stack frame is
    overwritten/initialized by a function call or other means.

    In principle, we could mitigate this by making the exception frame
    classification condition in save_stack_trace_tsk_reliable() stronger:
    in addition to testing for STACK_FRAME_REGS_MARKER, we could also take
    into account that for all exceptions executing on the kernel stack
    - their stack frames' backlink pointers always match what is saved
    in their pt_regs instance's ->gpr[1] slot and that
    - their exception frame size equals STACK_INT_FRAME_SIZE, a value
    uncommonly large for non-exception frames.

    However, while these are currently true, relying on them would make
    the reliable stacktrace implementation more sensitive towards future
    changes in the exception entry code. Note that false negatives, i.e.
    not detecting exception frames, would silently break the live patching
    consistency model.

    Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
    rely on STACK_FRAME_REGS_MARKER as well.

    Make the exception exit code clear the on-stack
    STACK_FRAME_REGS_MARKER for those exceptions running on the "normal"
    kernel stack and returning to kernelspace: because the topmost frame
    is ignored by the reliable stack tracer anyway, returns to userspace
    don't need to take care of clearing the marker.

    Furthermore, as I don't have the ability to test this on Book 3E or 32
    bits, limit the change to Book 3S and 64 bits.

    Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
    Reported-by: Joe Lawrence
    Signed-off-by: Nicolai Stange
    Signed-off-by: Joe Lawrence
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Nicolai Stange
     
  • [ Upstream commit 5330367fa300742a97e20e953b1f77f48392faae ]

    After we ALIGN up the address we need to make sure we didn't overflow
    and end up with a zero address. In that case, we need to make sure that
    the returned address is greater than mmap_min_addr.

    This fixes the selftest va_128TBswitch --run-hugetlb reporting failures
    when run as a non-root user for

    mmap(-1, MAP_HUGETLB)

    The bug is that a non-root user requesting address -1 will be given address 0
    which will then fail, whereas they should have been given something else that
    would have succeeded.
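
    For illustration, a small userspace sketch of the overflow (ALIGN_UP
    is a stand-in for the kernel's ALIGN(); the alignment value is
    arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    #define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

    int main(void)
    {
        uint64_t align = 1ULL << 24;        /* e.g. a 16MB huge page */
        uint64_t addr = UINT64_MAX;         /* userspace asked for -1 */
        uint64_t res = ALIGN_UP(addr, align);

        /* The addition wraps, so the "aligned" address is 0, which is
         * below mmap_min_addr and must not be handed back to the caller. */
        printf("ALIGN_UP(0x%llx) = 0x%llx\n",
               (unsigned long long)addr, (unsigned long long)res);
        return 0;
    }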

    With this change we also avoid the first mmap(-1, MAP_HUGETLB) returning a
    NULL address as the mmap address. So we think this is not a security issue, because it only affects
    whether we choose an address below mmap_min_addr, not whether we
    actually allow that address to be mapped. ie. there are existing capability
    checks to prevent a user mapping below mmap_min_addr and those will still be
    honoured even without this fix.

    Fixes: 484837601d4d ("powerpc/mm: Add radix support for hugetlb")
    Reviewed-by: Laurent Dufour
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Aneesh Kumar K.V
     
  • [ Upstream commit e7140639b1de65bba435a6bd772d134901141f86 ]

    When building with -Wsometimes-uninitialized, Clang warns:

    arch/powerpc/xmon/ppc-dis.c:157:7: warning: variable 'opcode' is used
    uninitialized whenever 'if' condition is false
    [-Wsometimes-uninitialized]
      if (cpu_has_feature(CPU_FTRS_POWER9))
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    arch/powerpc/xmon/ppc-dis.c:167:7: note: uninitialized use occurs here
      if (opcode == NULL)
          ^~~~~~
    arch/powerpc/xmon/ppc-dis.c:157:3: note: remove the 'if' if its
    condition is always true
      if (cpu_has_feature(CPU_FTRS_POWER9))
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    arch/powerpc/xmon/ppc-dis.c:132:38: note: initialize the variable
    'opcode' to silence this warning
      const struct powerpc_opcode *opcode;
                                          ^
                                          = NULL
    1 warning generated.

    This warning seems to make no sense on the surface because opcode is set
    to NULL right below this statement. However, there is a comma instead of
    semicolon to end the dialect assignment, meaning that the opcode
    assignment only happens in the if statement. Properly terminate that
    line so that Clang no longer warns.
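
    A minimal standalone reproduction of the comma-operator pitfall
    (illustrative names, not the actual xmon code):

    #include <stdio.h>

    int cpu_has_power9;    /* stands in for cpu_has_feature(CPU_FTRS_POWER9) */

    int main(void)
    {
        unsigned long dialect = 0;
        const char *opcode;    /* meant to be set to NULL unconditionally */

        if (cpu_has_power9)
            dialect |= 1,      /* comma, not semicolon... */

        opcode = NULL;         /* ...so this runs only inside the 'if'! */

        if (opcode == NULL)    /* may read an uninitialized pointer */
            printf("dialect=%lu, opcode is NULL\n", dialect);
        return 0;
    }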

    Fixes: 5b102782c7f4 ("powerpc/xmon: Enable disassembly files (compilation changes)")
    Signed-off-by: Nathan Chancellor
    Reviewed-by: Nick Desaulniers
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Nathan Chancellor
     
  • [ Upstream commit 11f5acce2fa43b015a8120fa7620fa4efd0a2952 ]

    We store 2 multilevel tables in iommu_table - one for the hardware and
    one with the corresponding userspace addresses. Before allocating
    the tables, the iommu_table_group_ops::get_table_size() hook returns
    the combined size of the two, and the VFIO SPAPR TCE IOMMU driver adjusts
    the locked_vm counter correctly. When the table is actually allocated,
    the amount of allocated memory is stored in iommu_table::it_allocated_size
    and used to decrement the locked_vm counter when we release the memory
    used by the table; .get_table_size() and .create_table() calculate it
    independently but the result is expected to be the same.

    However the allocator does not add the userspace table size to
    .it_allocated_size, so when we destroy the table because of VFIO PCI
    unplug (i.e. the VFIO container is gone but the userspace keeps running),
    we decrement locked_vm by just half of the size of the memory we are
    releasing.

    To make things worse, since we enabled on-demand allocation of
    indirect levels, it_allocated_size contains only the amount of memory
    actually allocated at table creation time, which can be just a
    fraction of the total. This is not a problem when incrementing locked_vm
    (as the get_table_size() value is used) but it is when decrementing.

    As a result, we leak locked_vm and may not be able to allocate more
    IOMMU tables after a few iterations of hotplug/unplug.

    This sets it_allocated_size in the pnv_pci_ioda2_ops::create_table()
    hook to what pnv_pci_ioda2_get_table_size() returns so from now on we
    have a single place which calculates the maximum memory a table can
    occupy. The original meaning of it_allocated_size is somewhat lost now
    though.

    We do not ditch it_allocated_size here, and we do not call
    get_table_size() from vfio_iommu_spapr_tce.c when decrementing
    locked_vm, as we may have multiple IOMMU groups per container and even
    though they are all supposed to have the same get_table_size()
    implementation, there is a small chance of failure or confusion.
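
    A toy model of the accounting asymmetry (the sizes are made up; it
    only mirrors the increment/decrement mismatch described above):

    #include <stdio.h>

    int main(void)
    {
        long locked_vm = 0;
        const long get_table_size = 1024;    /* both tables, full depth */
        const long it_allocated_size = 256;  /* partial, on-demand levels */

        for (int i = 0; i < 4; i++) {
            locked_vm += get_table_size;     /* table create: full size */
            locked_vm -= it_allocated_size;  /* table destroy: partial size */
        }
        /* locked_vm never returns to 0, so each hotplug/unplug cycle
         * leaks accounting until allocations start to fail. */
        printf("leaked locked_vm: %ld\n", locked_vm);
        return 0;
    }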

    Fixes: 090bad39b237 ("powerpc/powernv: Add indirect levels to it_userspace")
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Alexey Kardashevskiy
     

03 Apr, 2019

15 commits

  • commit d9470757398a700d9450a43508000bcfd010c7a4 upstream.

    Chandan reported that fstests' generic/026 test hit a crash:

    BUG: Unable to handle kernel data access at 0xc00000062ac40000
    Faulting instruction address: 0xc000000000092240
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
    CPU: 0 PID: 27828 Comm: chacl Not tainted 5.0.0-rc2-next-20190115-00001-g6de6dba64dda #1
    NIP: c000000000092240 LR: c00000000066a55c CTR: 0000000000000000
    REGS: c00000062c0c3430 TRAP: 0300 Not tainted (5.0.0-rc2-next-20190115-00001-g6de6dba64dda)
    MSR: 8000000002009033 CR: 44000842 XER: 20000000
    CFAR: 00007fff7f3108ac DAR: c00000062ac40000 DSISR: 40000000 IRQMASK: 0
    GPR00: 0000000000000000 c00000062c0c36c0 c0000000017f4c00 c00000000121a660
    GPR04: c00000062ac3fff9 0000000000000004 0000000000000020 00000000275b19c4
    GPR08: 000000000000000c 46494c4500000000 5347495f41434c5f c0000000026073a0
    GPR12: 0000000000000000 c0000000027a0000 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: c00000062ea70020 c00000062c0c38d0 0000000000000002 0000000000000002
    GPR24: c00000062ac3ffe8 00000000275b19c4 0000000000000001 c00000062ac30000
    GPR28: c00000062c0c38d0 c00000062ac30050 c00000062ac30058 0000000000000000
    NIP memcmp+0x120/0x690
    LR xfs_attr3_leaf_lookup_int+0x53c/0x5b0
    Call Trace:
    xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
    xfs_da3_node_lookup_int+0x32c/0x5a0
    xfs_attr_node_addname+0x170/0x6b0
    xfs_attr_set+0x2ac/0x340
    __xfs_set_acl+0xf0/0x230
    xfs_set_acl+0xd0/0x160
    set_posix_acl+0xc0/0x130
    posix_acl_xattr_set+0x68/0x110
    __vfs_setxattr+0xa4/0x110
    __vfs_setxattr_noperm+0xac/0x240
    vfs_setxattr+0x128/0x130
    setxattr+0x248/0x600
    path_setxattr+0x108/0x120
    sys_setxattr+0x28/0x40
    system_call+0x5c/0x70
    Instruction dump:
    7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 2c050000
    4182ff6c 20c50008 54c61838 7d201c28 7d293436 7d4a3436 7c295040

    The instruction dump decodes as:
    subfic r6,r5,8
    rlwinm r6,r6,3,0,28
    ldbrx r9,0,r3
    ldbrx r10,0,r4
    Signed-off-by: Michael Ellerman
    Reviewed-by: Segher Boessenkool
    Tested-by: Chandan Rajendra
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit ce9afe08e71e3f7d64f337a6e932e50849230fc2 upstream.

    In cpu_to_drc_index() in the case when FW_FEATURE_DRC_INFO is absent,
    we currently use of_read_property() to obtain the pointer to the array
    corresponding to the property "ibm,drc-indexes". The elements of this
    array are of type __be32, but are accessed without any conversion to
    the OS-endianness, which is buggy on a Little Endian OS.

    Fix this by using the of_property_read_u32_index() accessor function to
    safely read the elements of the array.
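
    For illustration, a userspace sketch of the bug class (ntohl() stands
    in for the kernel's be32_to_cpu(); device-tree cells are always
    big-endian):

    #include <stdio.h>
    #include <stdint.h>
    #include <arpa/inet.h>

    int main(void)
    {
        uint32_t be_cell = htonl(7);    /* a __be32 cell holding 7 */

        /* On a little-endian host the raw access gives a wrong value;
         * converting first is correct on any endianness. */
        printf("raw access:      %u\n", be_cell);
        printf("with conversion: %u\n", ntohl(be_cell));
        return 0;
    }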

    Fixes: e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes")
    Cc: stable@vger.kernel.org # v4.16+
    Reported-by: Pavithra R. Prakash
    Signed-off-by: Gautham R. Shenoy
    Reviewed-by: Vaidyanathan Srinivasan
    [mpe: Make the WARN_ON a WARN_ON_ONCE so it's not retriggerable]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Gautham R. Shenoy
     
  • commit 86be36f6502c52ddb4b85938145324fd07332da1 upstream.

    Yauheni Kaliuta pointed out that PTR_TO_STACK store/load verifier test
    was failing on powerpc64 BE, and rightfully indicated that the PPC_LD()
    macro is not masking away the last two bits of the offset per the ISA,
    resulting in the generation of 'lwa' instruction instead of the intended
    'ld' instruction.

    Segher also pointed out that we can't simply mask away the last two bits
    as that will result in loading/storing from/to a memory location that
    was not intended.

    This patch addresses this by using ldx/stdx if the offset is not
    word-aligned. We load the offset into a temporary register (TMP_REG_2)
    and use that as the index register in a subsequent ldx/stdx. We fix
    the PPC_LD() macro to mask off the last two bits, but enhance PPC_BPF_LL()
    and PPC_BPF_STL() to factor in the offset value and generate the proper
    instruction sequence. We also convert all existing users of PPC_LD() and
    PPC_STD() to use these macros. All existing uses of these macros have
    been audited to ensure that TMP_REG_2 can be clobbered.
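
    To make the ISA constraint concrete, here is a hedged standalone
    sketch of the DS instruction form (the encoding follows the ISA; the
    helper is invented for the demo). The low two bits of the displacement
    field are opcode-extension bits: for primary opcode 58, 0b00 selects
    'ld' and 0b10 selects 'lwa', so an unmasked unaligned offset silently
    changes the instruction:

    #include <stdio.h>
    #include <stdint.h>

    /* DS-form: opcode(6) | RT(5) | RA(5) | DS(14) | XO(2). */
    static uint32_t ds_encode(uint32_t rt, uint32_t ra, int32_t off)
    {
        /* Buggy style: the whole offset lands in the low 16 bits, so
         * an unaligned offset leaks into the XO field. */
        return (58u << 26) | (rt << 21) | (ra << 16) | (off & 0xffff);
    }

    int main(void)
    {
        uint32_t aligned = ds_encode(3, 1, 4);      /* XO = 0 -> ld */
        uint32_t unaligned = ds_encode(3, 1, 6);    /* XO = 2 -> lwa */

        printf("off=4: 0x%08x XO=%u (ld)\n", aligned, aligned & 3);
        printf("off=6: 0x%08x XO=%u (lwa!)\n", unaligned, unaligned & 3);
        return 0;
    }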

    Fixes: 156d0e290e96 ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
    Cc: stable@vger.kernel.org # v4.9+

    Reported-by: Yauheni Kaliuta
    Signed-off-by: Naveen N. Rao
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Naveen N. Rao
     
  • commit 92edf8df0ff2ae86cc632eeca0e651fd8431d40d upstream.

    When I updated the spectre_v2 reporting to handle software count cache
    flush I got the logic wrong when there's no software count cache
    enabled at all.

    The result is that on systems with the software count cache flush
    disabled we print:

    Mitigation: Indirect branch cache disabled, Software count cache flush

    Which correctly indicates that the count cache is disabled, but
    incorrectly says the software count cache flush is enabled.

    The root of the problem is that we are trying to handle all
    combinations of options. But we know now that we only expect to see
    the software count cache flush enabled if the other options are false.

    So split the two cases, which simplifies the logic and fixes the bug.
    We were also missing a space before "(hardware accelerated)".

    The result is we see one of:

    Mitigation: Indirect branch serialisation (kernel only)
    Mitigation: Indirect branch cache disabled
    Mitigation: Software count cache flush
    Mitigation: Software count cache flush (hardware accelerated)
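
    A runnable model of the split logic (printf() stands in for
    seq_buf_printf(); the flag names follow the commit text, and the
    control flow is a sketch of the described fix, not the exact
    security.c diff):

    #include <stdio.h>
    #include <stdbool.h>

    enum count_cache_flush { FLUSH_NONE, FLUSH_SW, FLUSH_HW };

    static void report(bool bcs, bool ccd, enum count_cache_flush flush)
    {
        if (bcs || ccd) {
            /* hardware options present: never mention the SW flush */
            printf("Mitigation: ");
            if (bcs)
                printf("Indirect branch serialisation (kernel only)");
            if (bcs && ccd)
                printf(", ");
            if (ccd)
                printf("Indirect branch cache disabled");
            printf("\n");
        } else if (flush != FLUSH_NONE) {
            printf("Mitigation: Software count cache flush");
            if (flush == FLUSH_HW)
                printf(" (hardware accelerated)");
            printf("\n");
        } else {
            printf("Vulnerable\n");
        }
    }

    int main(void)
    {
        report(false, true, FLUSH_NONE);    /* no bogus SW-flush claim */
        report(false, false, FLUSH_HW);
        return 0;
    }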

    Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache flush")
    Cc: stable@vger.kernel.org # v4.19+
    Signed-off-by: Michael Ellerman
    Reviewed-by: Michael Neuling
    Reviewed-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 27da80719ef132cf8c80eb406d5aeb37dddf78cc upstream.

    The commit identified below adds the MC_BTB_FLUSH macro only when
    CONFIG_PPC_FSL_BOOK3E is defined. This results in the following error
    on some configs (seen several times with kisskb randconfig_defconfig):

    arch/powerpc/kernel/exceptions-64e.S:576: Error: Unrecognized opcode: `mc_btb_flush'
    make[3]: *** [scripts/Makefile.build:367: arch/powerpc/kernel/exceptions-64e.o] Error 1
    make[2]: *** [scripts/Makefile.build:492: arch/powerpc/kernel] Error 2
    make[1]: *** [Makefile:1043: arch/powerpc] Error 2
    make: *** [Makefile:152: sub-make] Error 2

    This patch adds a blank definition of MC_BTB_FLUSH for other cases.

    Fixes: 10c5e83afd4a ("powerpc/fsl: Flush the branch predictor at each kernel entry (64bit)")
    Cc: Diana Craciun
    Signed-off-by: Christophe Leroy
    Reviewed-by: Daniel Axtens
    Reviewed-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 039daac5526932ec731e4499613018d263af8b3e upstream.

    Fixed the following build warning:
    powerpc-linux-gnu-ld: warning: orphan section `__btb_flush_fixup' from
    `arch/powerpc/kernel/head_44x.o' being placed in section
    `__btb_flush_fixup'.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit dfa88658fb0583abb92e062c7a9cd5a5b94f2a46 upstream.

    Report branch predictor state flush as a mitigation for
    Spectre variant 2.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 3bc8ea8603ae4c1e09aca8de229ad38b8091fcb3 upstream.

    If the user chooses not to use the mitigations, replace
    the code sequence with nops.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit e7aa61f47b23afbec41031bc47ca8d6cb6516abc upstream.

    Switching from the guest to the host is another place
    where speculative accesses can be exploited.
    Flush the branch predictor when entering KVM.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 7fef436295bf6c05effe682c8797dfcb0deb112a upstream.

    In order to protect against speculation attacks on
    indirect branches, the branch predictor is flushed at
    kernel entry to protect against the following situations:
    - a userspace process attacking another userspace process
    - a userspace process attacking the kernel
    Basically, when the privilege level changes (i.e. the kernel
    is entered), the branch predictor state is flushed.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 10c5e83afd4a3f01712d97d3bb1ae34d5b74a185 upstream.

    In order to protect against speculation attacks on
    indirect branches, the branch predictor is flushed at
    kernel entry to protect against the following situations:
    - a userspace process attacking another userspace process
    - a userspace process attacking the kernel
    Basically, when the privilege level changes (i.e. the
    kernel is entered), the branch predictor state is flushed.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit f633a8ad636efb5d4bba1a047d4a0f1ef719aa06 upstream.

    When the command line argument is present, the Spectre variant 2
    mitigations are disabled.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 98518c4d8728656db349f875fcbbc7c126d4c973 upstream.

    In order to flush the branch predictor the guest kernel performs
    writes to the BUCSR register, which is hypervisor privileged. However,
    the branch predictor is flushed at each KVM entry, so it has
    already been flushed; just return to the guest as soon as
    possible.

    Signed-off-by: Diana Craciun
    [mpe: Tweak comment formatting]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 1cbf8990d79ff69da8ad09e8a3df014e1494462b upstream.

    The BUCSR register can be used to invalidate the entries in the
    branch prediction mechanisms.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     
  • commit 76a5eaa38b15dda92cd6964248c39b5a6f3a4e9d upstream.

    In order to protect against speculation attacks (Spectre
    variant 2) on NXP PowerPC platforms, the branch predictor
    should be flushed when the privilege level is changed.
    This patch adds the infrastructure to fix up at runtime
    the code sections that perform the branch predictor flush,
    depending on a boot argument which is added later in a
    separate patch.

    Signed-off-by: Diana Craciun
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Diana Craciun
     

27 Mar, 2019

1 commit

  • commit b5b4453e7912f056da1ca7572574cada32ecb60c upstream.

    Jakub Drnec reported:
    Setting the realtime clock can sometimes make the monotonic clock go
    back by over a hundred years. Decreasing the realtime clock across
    the y2k38 threshold is one reliable way to reproduce. Allegedly this
    can also happen just by running ntpd, I have not managed to
    reproduce that other than booting with rtc at >2038 and then running
    ntp. When this happens, anything with timers (e.g. openjdk) breaks
    rather badly.

    And included a test case (slightly edited for brevity):
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    long get_time(void) {
        struct timespec tp;
        clock_gettime(CLOCK_MONOTONIC, &tp);
        return tp.tv_sec + tp.tv_nsec / 1000000000;
    }

    int main(void) {
        long last = get_time();
        while (1) {
            long now = get_time();
            if (now < last) {
                printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
        }
        return 0;
    }

    Which when run concurrently with:
    # date -s 2040-1-1
    # date -s 2037-1-1

    Will detect the clock going backward.

    The root cause is that wtom_clock_sec in struct vdso_data is only a
    32-bit signed value, even though we set its value to be equal to
    tk->wall_to_monotonic.tv_sec which is 64-bits.

    Because the monotonic clock starts at zero when the system boots, the
    wall_to_monotonic.tv_sec offset is negative for current and future
    dates. Currently on a freshly booted system the offset will be in the
    vicinity of negative 1.5 billion seconds.

    However if the wall clock is set past the Y2038 boundary, the offset
    from wall to monotonic becomes less than negative 2^31, and no longer
    fits in 32-bits. When that value is assigned to wtom_clock_sec it is
    truncated and becomes positive, causing the VDSO assembly code to
    calculate CLOCK_MONOTONIC incorrectly.

    That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
    it is not meant to do. Worse, if the time is then set back before the
    Y2038 boundary CLOCK_MONOTONIC will jump backward.
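
    A short illustration of the truncation (the offset value is
    representative of a post-Y2038 wall clock, not taken from a real
    system):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* wall-to-monotonic offset once the clock is set past Y2038 */
        int64_t wall_to_monotonic = -2210000000LL;    /* < -2^31 */
        int32_t wtom_clock_sec = (int32_t)wall_to_monotonic;

        /* The truncated value flips positive, so the VDSO math that
         * adds it to the wall clock jumps ahead by ~2^32 seconds. */
        printf("64-bit: %lld\n", (long long)wall_to_monotonic);
        printf("32-bit: %d\n", wtom_clock_sec);
        return 0;
    }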

    We can fix it simply by storing the full 64-bit offset in the
    vdso_data, and using that in the VDSO assembly code. We also shuffle
    some of the fields in vdso_data to avoid creating a hole.

    The original commit that added the CLOCK_MONOTONIC support to the VDSO
    did actually use a 64-bit value for wtom_clock_sec, see commit
    a7f290dad32e ("[PATCH] powerpc: Merge vdso's and add vdso support to
    32 bits kernel") (Nov 2005). However just 3 days later it was
    converted to 32-bits in commit 0c37ec2aa88b ("[PATCH] powerpc: vdso
    fixes (take #2)"), and the bug has existed since then AFAICS.

    Fixes: 0c37ec2aa88b ("[PATCH] powerpc: vdso fixes (take #2)")
    Cc: stable@vger.kernel.org # v2.6.15+
    Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
    Reported-by: Jakub Drnec
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     

24 Mar, 2019

11 commits

  • commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.

    kvm_arch_memslots_updated() is at this point in time an x86-specific
    hook for handling MMIO generation wraparound. x86 stashes 19 bits of
    the memslots generation number in its MMIO sptes in order to avoid
    full page fault walks for repeat faults on emulated MMIO addresses.
    Because only 19 bits are used, wrapping the MMIO generation number is
    possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
    the generation has changed so that it can invalidate all MMIO sptes in
    case the effective MMIO generation has wrapped so as to avoid using a
    stale spte, e.g. a (very) old spte that was created with generation==0.

    Given that the purpose of kvm_arch_memslots_updated() is to prevent
    consuming stale entries, it needs to be called before the new generation
    is propagated to memslots. Invalidating the MMIO sptes after updating
    memslots means that there is a window where a vCPU could dereference
    the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
    spte that was created with (pre-wrap) generation==0.

    Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 9bf3d3c4e4fd82c7174f4856df372ab2a71005b9 upstream.

    Today's message is useless:

    [ 42.253267] Kernel stack overflow in process (ptrval), r1=c65500b0

    This patch fixes it:

    [ 66.905235] Kernel stack overflow in process sh[356], r1=c65560b0

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc: stable@vger.kernel.org # v4.15+
    Signed-off-by: Christophe Leroy
    [mpe: Use task_pid_nr()]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 0bbea75c476b77fa7d7811d6be911cc7583e640f upstream.

    It looks like book3s/32 doesn't set RI on machine check, so
    checking RI before calling die() will always be fatal,
    although this is not an issue in most cases.

    Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable interrupt")
    Fixes: daf00ae71dad ("powerpc/traps: restore recoverability of machine_check interrupts")
    Signed-off-by: Christophe Leroy
    Cc: stable@vger.kernel.org
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 35f2806b481f5b9207f25e1886cba5d1c4d12cc7 upstream.

    We added runtime allocation of 16G pages in commit 4ae279c2c96a
    ("powerpc/mm/hugetlb: Allow runtime allocation of 16G."). That was done
    to enable 16G allocation on PowerNV and KVM configs. In the case of a
    KVM config, we would mostly have the entire guest RAM backed by 16G
    hugetlb pages for this to work. PAPR does support partial backing of
    guest RAM with hugepages via the ibm,expected#pages property of the
    memory node in the device tree. This means the rest of the guest RAM
    won't be backed by 16G contiguous pages in the host, and hence a hash
    page table insertion can fail in such cases.

    An example error message will look like

    hash-mmu: mm: Hashing failure ! EA=0x7efc00000000 access=0x8000000000000006 current=readback
    hash-mmu: trap=0x300 vsid=0x67af789 ssize=1 base psize=14 psize 14 pte=0xc000000400000386
    readback[12260]: unhandled signal 7 at 00007efc00000000 nip 00000000100012d0 lr 000000001000127c code 2

    This patch addresses that by preventing runtime allocation of 16G
    hugepages in the LPAR config. To allocate 16G hugetlb pages one needs
    to specify them on the kernel command line: hugepagesz=16G hugepages=

    With radix translation mode we don't run into this issue.

    This change will prevent runtime allocation of 16G hugetlb pages on
    kvm with hash translation mode. However, with the current upstream it
    was observed that a 16G hugetlbfs-backed guest doesn't boot at all.

    We observe boot failure with the below message:
    [131354.647546] KVM: map_vrma at 0 failed, ret=-4

    That means this patch is not resulting in an observable regression.
    Once we fix the boot issue with 16G hugetlb backed memory, we need to
    use ibm,expected#pages memory node attribute to indicate 16G page
    reservation to the guest. This will also enable partial backing of
    guest RAM with 16G pages.

    Fixes: 4ae279c2c96a ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit ca6d5149d2ad0a8d2f9c28cbe379802260a0a5e0 upstream.

    GCC 8 warns about the logic in vr_get/set(), which with -Werror breaks
    the build:

    In function ‘user_regset_copyin’,
        inlined from ‘vr_set’ at arch/powerpc/kernel/ptrace.c:628:9:
    include/linux/regset.h:295:4: error: ‘memcpy’ offset [-527, -529] is
    out of the bounds [0, 16] of object ‘vrsave’ with type ‘union
    <anonymous>’ [-Werror=array-bounds]
    arch/powerpc/kernel/ptrace.c: In function ‘vr_set’:
    arch/powerpc/kernel/ptrace.c:623:5: note: ‘vrsave’ declared here
        } vrsave;

    This has been identified as a regression in GCC, see GCC bug 88273.

    However we can avoid the warning and also simplify the logic and make
    it more robust.

    Currently we pass -1 as end_pos to user_regset_copyout(). This says
    "copy up to the end of the regset".

    The definition of the regset is:
    [REGSET_VMX] = {
        .core_note_type = NT_PPC_VMX, .n = 34,
        .size = sizeof(vector128), .align = sizeof(vector128),
        .active = vr_active, .get = vr_get, .set = vr_set
    },

    The end is calculated as (n * size), ie. 34 * sizeof(vector128).

    In vr_get/set() we pass start_pos as 33 * sizeof(vector128), meaning
    we can copy up to sizeof(vector128) into/out-of vrsave.

    The on-stack vrsave is defined as:
    union {
        elf_vrreg_t reg;
        u32 word;
    } vrsave;

    And elf_vrreg_t is:
        typedef __vector128 elf_vrreg_t;

    So there is no bug, but we rely on all those sizes lining up,
    otherwise we would have a kernel stack exposure/overwrite on our
    hands.

    Rather than relying on that we can pass an explicit end_pos based on
    the sizeof(vrsave). The result should be exactly the same but it's
    more obviously not over-reading/writing the stack and it avoids the
    compiler warning.
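
    The arithmetic, spelled out with the sizes from the definitions
    quoted above:

    #include <stdio.h>

    int main(void)
    {
        const unsigned size_v128 = 16;                 /* sizeof(vector128) */
        const unsigned regset_end = 34 * size_v128;    /* 544, implied by end_pos = -1 */
        const unsigned start_pos = 33 * size_v128;     /* 528, where vrsave starts */

        /* With end_pos = -1 the copy window is 16 bytes, exactly
         * sizeof(vrsave) -- correct, but only because the sizes line up.
         * Passing start_pos + sizeof(vrsave) as an explicit end_pos
         * makes the bound hold by construction. */
        printf("copy window: %u bytes\n", regset_end - start_pos);
        return 0;
    }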

    Reported-by: Meelis Roos
    Reported-by: Mathieu Malaterre
    Cc: stable@vger.kernel.org
    Tested-by: Mathieu Malaterre
    Tested-by: Meelis Roos
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit fe1ef6bcdb4fca33434256a802a3ed6aacf0bd2f upstream.

    Commit 8792468da5e1 "powerpc: Add the ability to save FPU without
    giving it up" unexpectedly removed the MSR_FE0 and MSR_FE1 bits from
    the bitmask used to update the MSR of the previous thread in
    __giveup_fpu() causing a KVM-PR MacOS guest to lockup and panic the
    host kernel.

    Leaving FE0/1 enabled means unrelated processes might receive FPEs
    when they're not expecting them and crash. In particular if this
    happens to init the host will then panic.

    eg (transcribed):
    qemu-system-ppc[837]: unhandled signal 8 at 12cc9ce4 nip 12cc9ce4 lr 12cc9ca4 code 0
    systemd[1]: unhandled signal 8 at 202f02e0 nip 202f02e0 lr 001003d4 code 0
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

    Reinstate these bits to the MSR bitmask to enable MacOS guests to run
    under 32-bit KVM-PR once again without issue.

    Fixes: 8792468da5e1 ("powerpc: Add the ability to save FPU without giving it up")
    Cc: stable@vger.kernel.org # v4.6+
    Signed-off-by: Mark Cave-Ayland
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Mark Cave-Ayland
     
  • commit 19f8a5b5be2898573a5e1dc1db93e8d40117606a upstream.

    Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
    only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
    inside pnv_cpu_offline(), with the aim of changing the LPCR value in
    the SLW image to disable wakeups from the decrementer while a CPU is
    offline. However, pnv_cpu_offline() gets called each time a secondary
    CPU thread is woken up to participate in running a KVM guest, that is,
    not just when a CPU is offlined.

    Since opal_slw_set_reg() is a very slow operation (with observed
    execution times around 20 milliseconds), this means that an offline
    secondary CPU can often be busy doing the opal_slw_set_reg() call
    when the primary CPU wants to grab all the secondary threads so that
    it can run a KVM guest. This leads to messages like "KVM: couldn't
    grab CPU n" being printed and guest execution failing.

    There is no need to reprogram the SLW image on every KVM guest entry
    and exit. So that we do it only when a CPU is really transitioning
    between online and offline, this moves the calls to
    pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().

    Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     
  • commit 36da5ff0bea2dc67298150ead8d8471575c54c7d upstream.

    The 83xx has 8 SPRG registers and uses at least SPRG4
    for DTLB handling LRU.

    Fixes: 2319f1239592 ("powerpc/mm: e300c2/c3/c4 TLB errata workaround")
    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 7b62f9bd2246b7d3d086e571397c14ba52645ef1 upstream.

    Currently the opal log is globally readable. It is kernel policy to
    limit the visibility of physical addresses / kernel pointers to root.
    Given this, and the fact that the opal log may contain such information,
    it would be better to limit its readability to root.

    Fixes: bfc36894a48b ("powerpc/powernv: Add OPAL message log interface")
    Cc: stable@vger.kernel.org # v3.15+
    Signed-off-by: Jordan Niethe
    Reviewed-by: Stewart Smith
    Reviewed-by: Andrew Donnellan
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Jordan Niethe
     
  • commit 6d183ca8baec983dc4208ca45ece3c36763df912 upstream.

    The 'nobats' kernel parameter and some options like CONFIG_DEBUG_PAGEALLOC
    deny the use of BATs for mapping memory.

    This patch makes sure that the Wii-specific RAM mapping function
    takes this into account as well.

    Fixes: de32400dd26e ("wii: use both mem1 and mem2 as ram")
    Cc: stable@vger.kernel.org
    Reviewed-by: Jonathan Neuschafer
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit 9580b71b5a7863c24a9bd18bcd2ad759b86b1eff upstream.

    Clear the on-stack STACK_FRAME_REGS_MARKER on exception exit in order
    to avoid confusing stacktrace like the one below.

    Call Trace:
    [c0e9dca0] [c01c42a0] print_address_description+0x64/0x2bc (unreliable)
    [c0e9dcd0] [c01c4684] kasan_report+0xfc/0x180
    [c0e9dd10] [c0895130] memchr+0x24/0x74
    [c0e9dd30] [c00a9e38] msg_print_text+0x124/0x574
    [c0e9dde0] [c00ab710] console_unlock+0x114/0x4f8
    [c0e9de40] [c00adc60] vprintk_emit+0x188/0x1c4
    --- interrupt: c0e9df00 at 0x400f330
    LR = init_stack+0x1f00/0x2000
    [c0e9de80] [c00ae3c4] printk+0xa8/0xcc (unreliable)
    [c0e9df20] [c0c27e44] early_irq_init+0x38/0x108
    [c0e9df50] [c0c15434] start_kernel+0x310/0x488
    [c0e9dff0] [00003484] 0x3484

    With this patch the trace becomes:

    Call Trace:
    [c0e9dca0] [c01c42c0] print_address_description+0x64/0x2bc (unreliable)
    [c0e9dcd0] [c01c46a4] kasan_report+0xfc/0x180
    [c0e9dd10] [c0895150] memchr+0x24/0x74
    [c0e9dd30] [c00a9e58] msg_print_text+0x124/0x574
    [c0e9dde0] [c00ab730] console_unlock+0x114/0x4f8
    [c0e9de40] [c00adc80] vprintk_emit+0x188/0x1c4
    [c0e9de80] [c00ae3e4] printk+0xa8/0xcc
    [c0e9df20] [c0c27e44] early_irq_init+0x38/0x108
    [c0e9df50] [c0c15434] start_kernel+0x310/0x488
    [c0e9dff0] [00003484] 0x3484

    Cc: stable@vger.kernel.org
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

27 Feb, 2019

1 commit

  • [ Upstream commit fb0bdec51a4901b7dd088de0a1e365e1b9f5cd21 ]

    Commit 8c8c10b90d88 ("powerpc/8xx: fix handling of early NULL pointer
    dereference") moved the loading of r6 earlier in the code. As some
    functions are called inbetween, r6 needs to be loaded again with the
    address of swapper_pg_dir in order to set PTE pointers for
    the Abatron BDI.

    Fixes: 8c8c10b90d88 ("powerpc/8xx: fix handling of early NULL pointer dereference")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Christophe Leroy
     

15 Feb, 2019

1 commit

  • commit 579b9239c1f38665b21e8d0e6ee83ecc96dbd6bb upstream.

    With support for split pmd lock, we use the pmd page's pmd_huge_pte
    pointer to store the deposited page table. In those configs, when we
    move page tables we need to make sure we move the deposited page table
    to the correct pmd page. Otherwise this can result in a crash when we
    withdraw the deposited page table, because pmd_huge_pte can be found NULL.

    eg:

    __split_huge_pmd+0x1070/0x1940
    __split_huge_pmd+0xe34/0x1940 (unreliable)
    vma_adjust_trans_huge+0x110/0x1c0
    __vma_adjust+0x2b4/0x9b0
    __split_vma+0x1b8/0x280
    __do_munmap+0x13c/0x550
    sys_mremap+0x220/0x7e0
    system_call+0x5c/0x70

    Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.")
    Cc: stable@vger.kernel.org # v4.18+
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     

13 Feb, 2019

5 commits

  • [ Upstream commit 0db6896ff6332ba694f1e61b93ae3b2640317633 ]

    For fadump to work successfully there should not be any holes in the
    reserved memory ranges where the kernel has asked firmware to move the
    content of old kernel memory in the event of a crash. Now that fadump
    uses CMA for the reserved area, this memory area is not protected from
    hot-remove operations unless it is CMA-allocated. Hence, the fadump
    service can fail to re-register after a hot-remove operation if the
    hot-removed memory belongs to the fadump reserved region. To avoid this,
    make sure that memory from the fadump reserved area is not hot-removable
    if fadump is registered.

    However, if the user still wants to remove that memory, they can do so
    by manually stopping the fadump service before the hot-remove operation.

    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Mahesh Salgaonkar
     
  • [ Upstream commit ffca395b11c4a5a6df6d6345f794b0e3d578e2d0 ]

    On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
    a no-exec fault generates DSISR_PROTFAULT error bits,
    not DSISR_NOEXEC_OR_G.

    This patch adds DSISR_PROTFAULT to the test mask.

    Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Christophe Leroy
     
  • [ Upstream commit bdbf649efe21173cae63b4b71db84176420f9039 ]

    The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
    table and a table with userspace addresses; the latter is used for
    marking pages dirty when corresponding TCEs are unmapped from
    the hardware table.

    a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels
    on demand") enabled on-demand allocation of the hardware table,
    however it missed the other table so it has still been fully allocated
    at the boot time. This fixes the issue by allocating a single level,
    just like we do for the hardware table.

    Fixes: a68bd1267b72 ("powerpc/powernv/ioda: Allocate indirect TCE levels on demand")
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Alexey Kardashevskiy
     
  • [ Upstream commit 17cfccc91545682513541924245abb876d296063 ]

    MMCRA[34:36] and MMCRA[38:44] expose the thresholding counter value.
    The thresholding counter can be used to count latency cycles, such as
    from a load miss to its reload. But the threshold counter value is not
    relevant when the sampled instruction type is unknown or reserved, so
    fix the thresholding counter value to zero when the sampled instruction
    type is unknown or reserved.

    Fixes: 170a315f41c6 ("powerpc/perf: Support to export MMCRA[TEC*] field to userspace")
    Signed-off-by: Madhavan Srinivasan
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Madhavan Srinivasan
     
  • [ Upstream commit 05a4ab823983d9136a460b7b5e0d49ee709a6f86 ]

    With the following piece of code, the following compilation warning
    is encountered:

    if (_IOC_DIR(ioc) != _IOC_NONE) {
        int verify = _IOC_DIR(ioc) & _IOC_READ ? VERIFY_WRITE : VERIFY_READ;

        if (!access_ok(verify, ioarg, _IOC_SIZE(ioc))) {

    drivers/platform/test/dev.c: In function 'my_ioctl':
    drivers/platform/test/dev.c:219:7: warning: unused variable 'verify' [-Wunused-variable]
    int verify = _IOC_DIR(ioc) & _IOC_READ ? VERIFY_WRITE : VERIFY_READ;

    This patch fixes it by referencing 'type' in the macro, although
    doing nothing with it.
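
    A standalone model of the fix pattern (the macros here are mock-ups,
    not the real powerpc access_ok()):

    #include <stdio.h>

    /* A macro that ignores its 'type' argument leaves the caller's
     * variable genuinely unused after preprocessing... */
    #define access_ok_old(type, addr, size) (1)
    /* ...while evaluating it through a (void) cast keeps the variable
     * "used" without changing behaviour, as the commit describes. */
    #define access_ok_new(type, addr, size) ((void)(type), 1)

    int main(void)
    {
        int verify = 0;    /* -Wunused-variable fires with access_ok_old() */

        if (access_ok_new(verify, 0, 0))
            printf("ok\n");
        return 0;
    }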

    Signed-off-by: Christophe Leroy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Sasha Levin

    Christophe Leroy