01 Oct, 2020

5 commits

  • commit d3f7b1bb204099f2f7306318896223e8599bb6a2 upstream.

    Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE and pass the pXd value down to the
    next gup_pXd_range function by value, e.g.:

        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);

    This function passes a reference to that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, each task may use 5-, 4- or 3-level address translation, so
    which levels are folded differs from task to task. The logic is
    therefore more complex, and a non-iterable pointer to a local copy
    leads to severe problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
                unsigned long next;
                pud_t *pudp;

                // pud_offset returns &p4d itself (a pointer to a value on stack)
                pudp = pud_offset(&p4d, addr);
                do {
                        // on the second iteration, reading a "random" stack value
                        pud_t pud = READ_ONCE(*pudp);

                        // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                        next = pud_addr_end(addr, end);
                        ...
                } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

                return 1;
        }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing the pXd_offset
    primitives: the top-level page table offset is always calculated in
    pgd_offset, and pXd_offset just returns the value passed in whenever it
    has to act as folded.

    What is crucial for gup_fast and what has been overlooked is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, pass the original pXdp pointers down to the
    gup_pXd_range functions in addition to the pXd values, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
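
    A minimal sketch of the resulting pattern (the generic fallback shown
    here assumes no architecture-specific override):

        #ifndef pud_offset_lockless
        #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
        #endif

        static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
                                 unsigned long end, unsigned int flags,
                                 struct page **pages, int *nr)
        {
                pud_t *pudp;

                /* both the entry value and the original pointer are passed,
                 * so a folded level iterates over the real page table */
                pudp = pud_offset_lockless(p4dp, p4d, addr);
                ...
        }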

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: stable@vger.kernel.org [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit fcb2b70cdb194157678fb1a75f9ff499aeba3d2a ]

    Add __init to reserve_memory_end, reserve_oldmem and remove_oldmem.
    Sometimes these functions are not inlined, and then the build
    complains about a section mismatch.

    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Ilya Leoshkevich
     
  • [ Upstream commit 8719b6d29d2851fa84c4074bb2e5adc022911ab8 ]

    request_irq() is preferred over setup_irq(). Invocations of setup_irq()
    occur after memory allocators are ready.

    Per tglx[1], setup_irq() existed in olden days when allocators were not
    ready by the time early interrupts were initialized.

    Hence replace setup_irq() by request_irq().

    [1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos
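
    An illustrative before/after (handler and IRQ names are hypothetical):

        /* old: the caller supplies a static struct irqaction */
        static struct irqaction ext_irqaction = {
                .name    = "EXT",
                .handler = ext_int_handler,
        };
        setup_irq(EXT_IRQ, &ext_irqaction);

        /* new: request_irq() allocates the irqaction itself */
        if (request_irq(EXT_IRQ, ext_int_handler, 0, "EXT", NULL))
                panic("Failed to register EXT interrupt\n");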

    Signed-off-by: afzal mohammed
    Message-Id:
    [heiko.carstens@de.ibm.com: replace pr_err with panic]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    afzal mohammed
     
  • [ Upstream commit 32dab6828c42f087439d3e2617dc7283546bd8f7 ]

    Use kzalloc() to allocate the auxiliary buffer structure initialized
    with all zeroes, to avoid random values in the trace output.

    Avoid double access to SBD hardware flags.
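
    A sketch of the allocation pattern (the structure name is illustrative):

        /* zero-initialized, so unset fields don't leak random values
         * into the trace output */
        aux = kzalloc(sizeof(*aux), GFP_KERNEL);
        if (!aux)
                goto no_aux;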

    Signed-off-by: Thomas Richter
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Thomas Richter
     
  • [ Upstream commit 7bcaad1f9fac889f5fcd1a383acf7e00d006da41 ]

    CALL_ON_STACK is intended to be used for temporary stack switching with
    potential return to the caller.

    When CALL_ON_STACK is misused to switch from the nodat stack to the
    task stack, the back_chain information would later lead the stack
    unwinder from the task stack into the (per-cpu) nodat stack, which is
    reused for other purposes. This would yield confusing unwinding
    results or errors.

    To avoid that, introduce CALL_ON_STACK_NORETURN to be used instead. It
    makes sure that back_chain is zeroed and the unwinder finishes
    gracefully, ending up at the task pt_regs.
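
    A conceptual sketch of the helper (not necessarily the exact upstream
    asm): switch to the new stack, zero that stack's back_chain so the
    unwinder stops there, then call the function and never return.

        #define CALL_ON_STACK_NORETURN(fn, stack)                              \
        ({                                                                     \
                asm volatile(                                                  \
                        "       la      15,0(%[_stack])\n"                     \
                        "       xc      %[_bc](8,15),%[_bc](15)\n"             \
                        "       brasl   14,%[_fn]\n"                           \
                        ::[_bc] "i" (offsetof(struct stack_frame, back_chain)),\
                          [_stack] "a" (stack), [_fn] "X" (fn));               \
                BUG();                                                         \
        })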

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     

10 Sep, 2020

1 commit

  • [ Upstream commit 1196f12a2c960951d02262af25af0bb1775ebcc2 ]

    Since commit a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context}
    to per-cpu variables") the lockdep code itself uses percpu variables.
    This leads to recursions because the percpu macros call
    preempt_enable(), which might call trace_preempt_on().
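
    A sketch of the direction of the fix, assuming the arch percpu macros
    simply switch to the non-tracing preemption helpers:

        /* was preempt_disable()/preempt_enable(); the notrace variants
         * avoid recursing into the tracer from within lockdep */
        preempt_disable_notrace();
        /* ... percpu access ... */
        preempt_enable_notrace();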

    Signed-off-by: Sven Schnelle
    Reviewed-by: Vasily Gorbik
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     

03 Sep, 2020

1 commit

  • [ Upstream commit 535e4fc623fab2e09a0653fc3a3e17f382ad0251 ]

    The node distance is hardcoded to 0, which causes trouble
    for some user-level applications. In particular, "libnuma"
    expects the distance of a node to itself to be LOCAL_DISTANCE.
    This update removes the offending node distance override.
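
    With the override gone, the generic fallback applies, which is what
    libnuma expects (sketch of the asm-generic definition, for reference):

        #ifndef node_distance
        #define node_distance(from, to) \
                ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
        #endif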

    Cc: stable@vger.kernel.org # 4.4
    Fixes: 3a368f742da1 ("s390/numa: add core infrastructure")
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Sasha Levin
     

26 Aug, 2020

2 commits

  • [ Upstream commit fd78c59446b8d050ecf3e0897c5a486c7de7c595 ]

    The key member of the runtime instrumentation control block contains
    only the access key, not the complete storage key. Therefore the value
    must be shifted by four bits. Since existing user space does not
    necessarily query and set the access key correctly, just ignore the
    user space provided key and use the correct one.
    Note: this is only relevant for debugging purposes in case somebody
    compiles a kernel with a default storage access key set to a value not
    equal to zero.
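
    A sketch of the resulting assignment (the control block field name
    follows the description):

        /* the access key is the high four bits of the storage key */
        cb->key = PAGE_DEFAULT_KEY >> 4;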

    Fixes: 262832bc5acd ("s390/ptrace: add runtime instrumention register get/set")
    Reported-by: Claudio Imbrenda
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Heiko Carstens
     
  • [ Upstream commit 9eaba29c7985236e16468f4e6a49cc18cf01443e ]

    The key member of the runtime instrumentation control block contains
    only the access key, not the complete storage key. Therefore the value
    must be shifted by four bits.
    Note: this is only relevant for debugging purposes in case somebody
    compiles a kernel with a default storage access key set to a value not
    equal to zero.

    Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
    Reported-by: Claudio Imbrenda
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Heiko Carstens
     

19 Aug, 2020

1 commit

  • commit ba925fa35057a062ac98c3e8138b013ce4ce351c upstream.

    During s390_enable_sie(), we need to take care of splitting all qemu user
    process THP mappings. This is currently done with follow_page(FOLL_SPLIT),
    by simply iterating over all vma ranges, with PAGE_SIZE increment.

    This logic is sub-optimal and can result in a lot of unnecessary overhead,
    especially when using qemu and ASAN with a large shadow map. Ilya reported
    a significant system slow-down with one CPU busy for a long time and overall
    unresponsiveness.

    Fix this by using walk_page_vma() and directly calling split_huge_pmd()
    only for present pmds, which greatly reduces overhead.
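
    A sketch of the new walk, following the description (callback and ops
    names assumed):

        static int thp_split_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
                                            unsigned long end, struct mm_walk *walk)
        {
                /* only present pmds reach the pmd_entry callback */
                split_huge_pmd(walk->vma, pmd, addr);
                return 0;
        }

        static const struct mm_walk_ops thp_split_walk_ops = {
                .pmd_entry = thp_split_walk_pmd_entry,
        };

        static inline void thp_split_mm(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;

                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        vma->vm_flags &= ~VM_HUGEPAGE;
                        vma->vm_flags |= VM_NOHUGEPAGE;
                        walk_page_vma(vma, &thp_split_walk_ops, NULL);
                }
        }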

    Cc: stable@vger.kernel.org # v5.4+
    Reported-by: Ilya Leoshkevich
    Tested-by: Ilya Leoshkevich
    Acked-by: Christian Borntraeger
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     

16 Jul, 2020

6 commits

  • [ Upstream commit d6df52e9996dcc2062c3d9c9123288468bb95b52 ]

    To be able to patch kernel code before paging is initialized, do a
    plain memcpy if DAT is off. This is required to enable early jump
    label initialization.
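
    A sketch of the dispatch, assuming a machine flag that records whether
    DAT is enabled:

        void notrace s390_kernel_write(void *dst, const void *src, size_t size)
        {
                if (!(S390_lowcore.machine_flags & MACHINE_FLAG_DAT)) {
                        /* paging is not up yet, write directly */
                        memcpy(dst, src, size);
                        return;
                }
                /* ... otherwise use the usual DAT-aware write path ... */
        }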

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     
  • [ Upstream commit cb2cceaefb4c4dc28fc27ff1f1b2d258bfc10353 ]

    s390_kernel_write()'s function type is almost identical to memcpy().
    Change its return type to "void *" so they can be used interchangeably.
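
    The signature change, in short:

        /* before */
        void s390_kernel_write(void *dst, const void *src, size_t size);
        /* after: the same shape as memcpy() */
        void *s390_kernel_write(void *dst, const void *src, size_t size);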

    Cc: linux-s390@vger.kernel.org
    Cc: heiko.carstens@de.ibm.com
    Signed-off-by: Josh Poimboeuf
    Acked-by: Joe Lawrence
    Acked-by: Miroslav Benes
    Acked-by: Gerald Schaefer # s390
    Signed-off-by: Jiri Kosina
    Signed-off-by: Sasha Levin

    Josh Poimboeuf
     
  • commit 528a9539348a0234375dfaa1ca5dbbb2f8f8e8d2 upstream.

    If the pmd is soft dirty, we must mark the pte as soft dirty (and not
    dirty). This fixes some cases for guest migration with huge page
    backings.
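
    A conceptual sketch of the propagation (bit and helper names from the
    s390 pgtable definitions):

        /* soft dirty must map to soft dirty, not to dirty */
        if (pmd_val(pmd) & _SEGMENT_ENTRY_SOFT_DIRTY)
                pte = pte_mksoft_dirty(pte);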

    Cc: stable@vger.kernel.org # 4.8
    Fixes: bc29b7ac1d9f ("s390/mm: clean up pte/pmd encoding")
    Reviewed-by: Christian Borntraeger
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Janosch Frank
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Janosch Frank
     
  • commit 95e61b1b5d6394b53d147c0fcbe2ae70fbe09446 upstream.

    Command line parameters might set static keys. This is true for s390 at
    least since commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1
    and init_on_free=1 boot options"). To avoid the following WARN:

        static_key_enable_cpuslocked(): static key 'init_on_alloc+0x0/0x40' used
        before call to jump_label_init()

    call jump_label_init() just before parse_early_param().
    jump_label_init() is safe to call multiple times (x86 does that), doesn't
    do any memory allocations and hence should be safe to call that early.
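
    A sketch of the resulting ordering (the surrounding function is
    illustrative):

        void __init startup_init(void)
        {
                ...
                jump_label_init();      /* idempotent, no allocations */
                parse_early_param();    /* may flip static keys, e.g.
                                         * init_on_alloc= */
                ...
        }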

    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Cc: stable@vger.kernel.org # 5.3: d6df52e9996d: s390/maccess: add no DAT mode to kernel_write
    Cc: stable@vger.kernel.org # 5.3
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit 998f5bbe3dbdab81c1cfb1aef7c3892f5d24f6c7 ]

    Currently, if early_pgm_check_handler is called, it ends up in a pgm
    check loop. The problem is that early_pgm_check_handler is instrumented
    by KASAN but executed without the DAT flag enabled, which leads to an
    addressing exception when KASAN checks try to access shadow memory.

    Fix that by executing early handlers with the DAT flag on under KASAN,
    as expected.

    Reported-and-tested-by: Alexander Egorenkov
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     
  • [ Upstream commit 774911290c589e98e3638e73b24b0a4d4530e97c ]

    The current number of KVM_IRQCHIP_NUM_PINS results in an order 3
    allocation (32kb) for each guest start/restart. This can result in OOM
    killer activity even with free swap when the memory is fragmented
    enough:

    kernel: qemu-system-s39 invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0
    kernel: CPU: 1 PID: 357274 Comm: qemu-system-s39 Kdump: loaded Not tainted 5.4.0-29-generic #33-Ubuntu
    kernel: Hardware name: IBM 8562 T02 Z06 (LPAR)
    kernel: Call Trace:
    kernel: ([] show_stack+0x7a/0xc0)
    kernel: [] dump_stack+0x8a/0xc0
    kernel: [] dump_header+0x62/0x258
    kernel: [] oom_kill_process+0x172/0x180
    kernel: [] out_of_memory+0xee/0x580
    kernel: [] __alloc_pages_slowpath+0xd18/0xe90
    kernel: [] __alloc_pages_nodemask+0x2a4/0x320
    kernel: [] kmalloc_order+0x34/0xb0
    kernel: [] kmalloc_order_trace+0x32/0xe0
    kernel: [] kvm_set_irq_routing+0xa6/0x2e0
    kernel: [] kvm_arch_vm_ioctl+0x544/0x9e0
    kernel: [] kvm_vm_ioctl+0x396/0x760
    kernel: [] do_vfs_ioctl+0x376/0x690
    kernel: [] ksys_ioctl+0x84/0xb0
    kernel: [] __s390x_sys_ioctl+0x2a/0x40
    kernel: [] system_call+0xd8/0x2c8

    As far as I can tell s390x does not use the iopins, as we bail out for
    anything other than KVM_IRQ_ROUTING_S390_ADAPTER, and the chip/pin is
    only used for KVM_IRQ_ROUTING_IRQCHIP. So let us use a small number to
    reduce the memory footprint.
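
    The footprint reduction itself is a one-line change of the limit (the
    new value follows the upstream commit):

        #define KVM_IRQCHIP_NUM_PINS 1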

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Cornelia Huck
    Reviewed-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200617083620.5409-1-borntraeger@de.ibm.com
    Signed-off-by: Sasha Levin

    Christian Borntraeger
     

09 Jul, 2020

1 commit

  • [ Upstream commit 827c4913923e0b441ba07ba4cc41e01181102303 ]

    When insanely large debug buffers are specified, a kernel warning is
    printed. The debug code does handle the error gracefully, though.
    Instead of duplicating the check, let us silence the warning to
    avoid crashes when panic_on_warn is used.
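
    A sketch of the silencing (the allocation site is illustrative):

        /* the caller already handles a failed allocation gracefully */
        areas = kmalloc_array(nr_areas, sizeof(*areas),
                              GFP_KERNEL | __GFP_NOWARN);
        if (!areas)
                return NULL;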

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Christian Borntraeger
     

01 Jul, 2020

4 commits

  • [ Upstream commit 478237a595120a18e9b52fd2c57a6e8b7a01e411 ]

    clock_getres in the vDSO library has to preserve the same behaviour
    as posix_get_hrtimer_res().

    In particular, posix_get_hrtimer_res() does:

        sec = 0;
        ns = hrtimer_resolution;

    and hrtimer_resolution depends on the enablement of the high
    resolution timers, which can happen either at compile or at run time.

    Fix the s390 vdso implementation of clock_getres by keeping a copy of
    hrtimer_resolution in the vdso data and using that directly.
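
    A sketch of the kernel-side update (the vdso data field name follows
    the description):

        /* keep the vdso's copy coherent with the kernel's resolution */
        vdso_data->hrtimer_res = hrtimer_resolution;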

    Link: https://lkml.kernel.org/r/20200324121027.21665-1-vincenzo.frascino@arm.com
    Signed-off-by: Vincenzo Frascino
    Acked-by: Martin Schwidefsky
    [heiko.carstens@de.ibm.com: use llgf for proper zero extension]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vincenzo Frascino
     
  • [ Upstream commit 2b2a25845d534ac6d55086e35c033961fdd83a26 ]

    Currently, the VDSO is being linked through $(CC). This does not match
    how the rest of the kernel links objects, which is through the $(LD)
    variable.

    When clang is built in a default configuration, it first attempts to use
    the target triple's default linker, which is just ld. However, the user
    can override this through the CLANG_DEFAULT_LINKER cmake define so that
    clang uses another linker by default, such as LLVM's own linker, ld.lld.
    This can be useful to get more optimized links across various different
    projects.

    However, this is problematic for the s390 vDSO because ld.lld does not
    have any s390 emulation support:

    https://github.com/llvm/llvm-project/blob/llvmorg-10.0.1-rc1/lld/ELF/Driver.cpp#L132-L150

    Thus, if a user is using a toolchain with ld.lld as the default, they
    will see an error, even if they have specified ld.bfd through the LD
    make variable:

        $ make -j"$(nproc)" -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- LLVM=1 \
               LD=s390x-linux-gnu-ld \
               defconfig arch/s390/kernel/vdso64/
        ld.lld: error: unknown emulation: elf64_s390
        clang-11: error: linker command failed with exit code 1 (use -v to see invocation)

    Normally, '-fuse-ld=bfd' could be used to get around this; however, this
    can be fragile, depending on paths and variable naming. The cleaner
    solution for the kernel is to take advantage of the fact that $(LD) can
    be invoked directly, which bypasses the heuristics of $(CC) and respects
    the user's choice. Similar changes have been done for ARM, ARM64, and
    MIPS.

    Link: https://lkml.kernel.org/r/20200602192523.32758-1-natechancellor@gmail.com
    Link: https://github.com/ClangBuiltLinux/linux/issues/1041
    Signed-off-by: Nathan Chancellor
    Reviewed-by: Nick Desaulniers
    [heiko.carstens@de.ibm.com: add --build-id flag]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Nathan Chancellor
     
  • [ Upstream commit 873e5a763d604c32988c4a78913a8dab3862d2f9 ]

    When strace wants to update the syscall number, it sets GPR2
    to the desired number and updates the GPR via PTRACE_SETREGSET.
    It doesn't update regs->int_code, which would cause the old syscall
    to be executed on syscall restart. As we cannot change the ptrace ABI and
    don't have a field for the interruption code, check whether the tracee
    is in a syscall and the last instruction was svc. In that case assume
    that the tracer wants to update the syscall number and copy the GPR2
    value to regs->int_code.
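
    A heavily simplified sketch of the heuristic (the svc-check helper is
    hypothetical):

        /* tracee sits in a syscall and the previous instruction was svc:
         * assume the tracer meant to change the syscall number */
        if (test_pt_regs_flag(regs, PIF_SYSCALL) &&
            instruction_is_svc(regs->psw.addr - 2))
                regs->int_code = regs->gprs[2];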

    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     
  • [ Upstream commit 00332c16b1604242a56289ff2b26e283dbad0812 ]

    Tracing expects to see invalid syscalls, so pass them through.
    The syscall path in entry.S checks the syscall number before
    looking up the handler, so it is still safe.

    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     

24 Jun, 2020

1 commit

  • commit b3583fca5fb654af2cfc1c08259abb9728272538 upstream.

    If both the tracer and the tracee are compat processes, and gprs[2]
    is assigned a value by __poke_user_compat, then the higher 32 bits
    of gprs[2] are cleared, IS_ERR_VALUE() always returns false, and
    syscall_get_error() always returns 0.

    Fix the implementation by sign-extending the value for compat processes
    the same way as x86 implementation does.

    The bug was exposed to user space by commit 201766a20e30f ("ptrace: add
    PTRACE_GET_SYSCALL_INFO request") and detected by strace test suite.

    This change fixes strace syscall tampering on s390.
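
    A sketch of the fixed helper, mirroring the x86 approach mentioned
    above:

        static inline long syscall_get_error(struct task_struct *task,
                                             struct pt_regs *regs)
        {
                unsigned long error = regs->gprs[2];

        #ifdef CONFIG_COMPAT
                if (test_tsk_thread_flag(task, TIF_31BIT))
                        /* sign-extend so a 32-bit error value is still
                         * recognized by IS_ERR_VALUE() in 64 bits */
                        error = (long)(int)error;
        #endif
                return IS_ERR_VALUE(error) ? error : 0;
        }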

    Link: https://lkml.kernel.org/r/20200602180051.GA2427@altlinux.org
    Fixes: 753c4dd6a2fa2 ("[S390] ptrace changes")
    Cc: Elvira Khabirova
    Cc: stable@vger.kernel.org # v2.6.28+
    Signed-off-by: Dmitry V. Levin
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Dmitry V. Levin
     

17 Jun, 2020

1 commit

  • [ Upstream commit e1750a3d9abbea2ece29cac8dc5a6f5bc19c1492 ]

    After disabling a function, the original handle is logged instead of
    the disabled handle.

    Link: https://lkml.kernel.org/r/20200522183922.5253-1-ptesarik@suse.com
    Fixes: 17cdec960cf7 ("s390/pci: Recover handle in clp_set_pci_fn()")
    Reviewed-by: Pierre Morel
    Signed-off-by: Petr Tesarik
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Petr Tesarik
     

07 Jun, 2020

2 commits

  • [ Upstream commit ac8372f3b4e41015549b331a4f350224661e7fc6 ]

    On s390, the layout of normal and large ptes (i.e. pmds/puds) differs.
    Therefore, set_huge_pte_at() does a conversion from a normal pte to
    the corresponding large pmd/pud. So, when converting an empty pte, this
    should result in an empty pmd/pud, which would return true for
    pmd/pud_none().

    However, after conversion we also mark the pmd/pud as large, and
    therefore present. For empty ptes, this will result in an empty pmd/pud
    that is also marked as large, and pmd/pud_none() would not return true.

    There is currently no issue with this behaviour, as set_huge_pte_at()
    does not seem to be called for empty ptes. It would be valid though, so
    let's fix this by not marking empty ptes as large in set_huge_pte_at().
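
    A conceptual sketch of the conversion guard (the helper's shape is
    assumed):

        /* only a present pte may be marked large, and thus present,
         * after conversion; an empty pte yields an empty pmd/pud */
        if (pte_present(pte))
                rste |= _SEGMENT_ENTRY_LARGE;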

    This was found by testing a patch from Anshuman Khandual, which is
    currently discussed on LKML ("mm/debug: Add more arch page table helper
    tests").

    Signed-off-by: Gerald Schaefer
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Gerald Schaefer
     
  • [ Upstream commit b4adfe55915d8363e244e42386d69567db1719b9 ]

    A typical backtrace acquired from an ftraced function currently looks
    like the following (e.g. for "path_openat"):

        arch_stack_walk+0x15c/0x2d8
        stack_trace_save+0x50/0x68
        stack_trace_call+0x15a/0x3b8
        ftrace_graph_caller+0x0/0x1c
        0x3e0007e3c98

    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     

27 May, 2020

3 commits

  • commit 70b690547d5ea1a3d135a4cc39cd1e08246d0c3a upstream.

    initrd_start must not point at the location where the initrd is loaded
    into the crashkernel memory, but at the location it will be at after
    the crashkernel memory is swapped with the memory at address 0.

    Fixes: ee337f5469fd ("s390/kexec_file: Add crash support to image loader")
    Reported-by: Lianbo Jiang
    Signed-off-by: Philipp Rudo
    Tested-by: Lianbo Jiang
    Link: https://lore.kernel.org/r/20200512193956.15ae3f23@laptop2-ibm.local
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Philipp Rudo
     
  • commit 4c1cbcbd6c56c79de2c07159be4f55386bb0bef2 upstream.

    With certain kernel configurations, the R_390_JMP_SLOT relocation type
    might be generated, which is not expected by the KASLR relocation code,
    and the kernel stops with the message "Unknown relocation type".

    This was found with a zfcpdump kernel config, where CONFIG_MODULES=n
    and CONFIG_VFIO=n. In that case, symbol_get() is used on undefined
    __weak symbols in virt/kvm/vfio.c, which results in the generation
    of R_390_JMP_SLOT relocation types.

    Fix this by handling R_390_JMP_SLOT similar to R_390_GLOB_DAT.
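
    The handling then looks like this (a sketch of the relocation switch):

        case R_390_GLOB_DAT:    /* Create GOT entry. */
        case R_390_JMP_SLOT:    /* Create PLT entry. */
                *(u64 *)loc = val;
                break;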

    Fixes: 805bc0bc238f ("s390/kernel: build a relocatable kernel")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Philipp Rudo
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit f058599e22d59e594e5aae1dc10560568d8f4a8b upstream.

    The s390_mmio_read/write syscalls are currently broken when running with
    MIO.

    The new pcistb_mio/pcistg_mio/pcilg_mio instructions are executed
    similarly to normal load/store instructions and do address translation
    in the current address space. That means inside the kernel they are
    aware of mappings into kernel address space while outside the kernel
    they use user space mappings (usually created through mmap'ing a PCI
    device file).

    Now when existing user space applications use the s390_pci_mmio_write
    and s390_pci_mmio_read syscalls, they pass I/O addresses that are mapped
    into user space so as to be usable with the new instructions without
    needing a syscall. Accessing these addresses with the old instructions
    as done currently leads to a kernel panic.

    Also, for such a user space mapping there may not exist an equivalent
    kernel space mapping which means we can't just use the new instructions
    in kernel space.

    Instead of replicating user mappings in the kernel which then might
    collide with other mappings, we can conceptually execute the new
    instructions as if executed by the user space application using the
    secondary address space. This even allows us to directly store to the
    user pointer without the need for copy_to/from_user().

    Cc: stable@vger.kernel.org
    Fixes: 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions")
    Signed-off-by: Niklas Schnelle
    Reviewed-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Niklas Schnelle
     

14 May, 2020

1 commit

  • commit 5615e74f48dcc982655543e979b6c3f3f877e6f6 upstream.

    In LPAR we will only get an intercept for FC==3 for the PQAP
    instruction. Running nested under z/VM can result in other intercepts
    as well, since ECA_APIE is an effective bit: if one hypervisor layer
    has turned this bit off, the end result will be that we will get
    intercepts for all function codes. Usually the first one will be a
    query like PQAP(QCI). So the WARN_ON_ONCE is not right. Let us simply
    remove it.

    Cc: Pierre Morel
    Cc: Tony Krowiak
    Cc: stable@vger.kernel.org # v5.3+
    Fixes: e5282de93105 ("s390: ap: kvm: add PQAP interception for AQIC")
    Link: https://lore.kernel.org/kvm/20200505083515.2720-1-borntraeger@de.ibm.com
    Reported-by: Qian Cai
    Signed-off-by: Christian Borntraeger
    Reviewed-by: David Hildenbrand
    Reviewed-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

02 May, 2020

2 commits

  • commit 86dbf32da150339ca81509fa2eb84c814b55258b upstream.

    With the introduction of CPU directed interrupts, the kernel
    parameter pci=force_floating was introduced to fall back
    to the previous behavior of using floating irqs.

    However we were still setting the affinity in that case,
    both in __irq_alloc_descs() and via the irq_set_affinity
    callback in struct irq_chip.

    For the former only set the affinity in the directed case.

    The latter is explicitly set in zpci_directed_irq_init()
    so we can just leave it unset for the floating case.

    Fixes: e979ce7bced2 ("s390/pci: provide support for CPU directed interrupts")
    Co-developed-by: Alexander Schmidt
    Signed-off-by: Alexander Schmidt
    Signed-off-by: Niklas Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Niklas Schnelle
     
  • commit 8ebf6da9db1b2a20bb86cc1bee2552e894d03308 upstream.

    Switching tracers includes instruction patching. To prevent an
    instruction from being patched while it is read, the patching is done
    in stop_machine 'context'. This also means that any function called
    during stop_machine must not be traced. Thus add 'notrace' to all
    functions called within stop_machine.
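
    A sketch of the annotation on one of the affected helpers (per the
    Fixes tags; the exact body is assumed):

        /* runs from the stop_machine() callback, so must not be traced */
        void notrace diag_stat_inc_norecursion(enum diag_stat_enum nr)
        {
                this_cpu_inc(diag_stat.counter[nr]);
        }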

    Fixes: 1ec2772e0c3c ("s390/diag: add a statistic for diagnose calls")
    Fixes: 38f2c691a4b3 ("s390: improve wait logic of stop_machine")
    Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield")
    Signed-off-by: Philipp Rudo
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Philipp Rudo
     

29 Apr, 2020

2 commits

  • commit 316ec154810960052d4586b634156c54d0778f74 upstream.

    A page table upgrade in a kernel section that uses secondary address
    mode will mess up the kernel instructions as follows:

    Consider the following scenario: two threads are sharing memory.
    On CPU1 thread 1 does e.g. strnlen_user(). That gets to

        old_fs = enable_sacf_uaccess();
        len = strnlen_user_srst(src, size);

    and

        "   la    %2,0(%1)\n"
        "   la    %3,0(%0,%1)\n"
        "   slgr  %0,%0\n"
        "   sacf  256\n"
        "0: srst  %3,%2\n"

    in strnlen_user_srst(). At that point we are in secondary space mode,
    control register 1 points to the kernel page table and instruction
    fetching happens via c1, rather than the usual c13. Interrupts are not
    disabled, for obvious reasons.

    On CPU2 thread 2 does a MAP_FIXED mmap(), forcing an upgrade of the
    page table from a 3-level to e.g. a 4-level one. We'd allocated the new
    top-level table, set it up and now we hit this:

        notify = 1;
        spin_unlock_bh(&mm->page_table_lock);
        }
        if (notify)
                on_each_cpu(__crst_table_upgrade, mm, 0);

    OK, we need to actually change over to use of the new page table and
    we need that to happen in all threads that are currently running.
    Which happens to include thread 1. An IPI is delivered and we have

        static void __crst_table_upgrade(void *arg)
        {
                struct mm_struct *mm = arg;

                if (current->active_mm == mm)
                        set_user_asce(mm);
                __tlb_flush_local();
        }

    run on CPU1. That does

        static inline void set_user_asce(struct mm_struct *mm)
        {
                S390_lowcore.user_asce = mm->context.asce;

    OK, user page table address updated...

                __ctl_load(S390_lowcore.user_asce, 1, 1);

    ... and control register 1 set to it.

                clear_cpu_flag(CIF_ASCE_PRIMARY);
        }

    IPI is run in home space mode, so it's fine - insns are fetched
    using c13, which always points to kernel page table. But as soon
    as we return from the interrupt, previous PSW is restored, putting
    CPU1 back into secondary space mode, at which point we no longer
    get the kernel instructions from the kernel mapping.

    The fix is to only fix up the control registers that are currently in
    use for user processes during the page table update. We must also
    disable interrupts in enable_sacf_uaccess to synchronize the cr and
    thread.mm_segment updates against the on_each_cpu().

    Fixes: 0aaba41b58bc ("s390: remove all code using the access register mode")
    Cc: stable@vger.kernel.org # 4.15+
    Reported-by: Al Viro
    Reviewed-by: Gerald Schaefer
    References: CVE-2020-11884
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     
  • commit 97daa028f3f621adff2c4f7b15fe0874e5b5bd6c upstream.

    Return the index of the last valid slot from gfn_to_memslot_approx() if
    its binary search loop yielded an out-of-bounds index. The index can
    be out-of-bounds if the specified gfn is less than the base of the
    lowest memslot (which is also the last valid memslot).

    Note, the sole caller, kvm_s390_get_cmma(), ensures used_slots is
    non-zero.
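
    A sketch of the clamp after the binary search (per the description):

        /* gfn below the lowest memslot: the search ran off the end */
        if (start >= slots->used_slots)
                return slots->used_slots - 1;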

    Fixes: afdad61615cc3 ("KVM: s390: Fix storage attributes migration with memory slots")
    Cc: stable@vger.kernel.org # 4.19.x: 0774a964ef56: KVM: Fix out of range accesses to memslots
    Cc: stable@vger.kernel.org # 4.19.x
    Signed-off-by: Sean Christopherson
    Message-Id:
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

23 Apr, 2020

3 commits

  • [ Upstream commit 1493e0f944f3c319d11e067c185c904d01c17ae5 ]

    We have to properly retry again by returning -EINVAL immediately in
    case somebody else instantiated the table concurrently. The goto was
    missing in this function only. The code now matches the other, similar
    shadowing functions.

    We are overwriting an existing region 2 table entry. All allocated pages
    are added to the crst_list to be freed later, so they are not lost
    forever. However, when unshadowing the region 2 table, we wouldn't trigger
    unshadowing of the original shadowed region 3 table that we replaced. It
    would get unshadowed when the original region 3 table is modified. As it's
    not connected to the page table hierarchy anymore, it's not going to get
    used anymore. However, for a limited time, this page table will stick
    around, so it's in some sense a temporary memory leak.

    Identified by manual code inspection. I don't think this classifies as
    stable material.

    Fixes: 998f637cc4b9 ("s390/mm: avoid races on region/segment/page table shadowing")
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-4-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit 4141b6a5e9f171325effc36a22eb92bf961e7a5c ]

    When perf record -e SF_CYCLES_BASIC_DIAG runs with a very high
    frequency, the samples arrive faster than the perf process can
    save them to file. Eventually, for longer running processes, this
    leads to the situation where the trace buffers allocated by perf
    slowly fill up. At one point the auxiliary trace buffer is full
    and the CPU Measurement sampling facility is turned off. Furthermore
    a warning is printed to the kernel log buffer:

    cpum_sf: The AUX buffer with 0 pages for the diagnostic-sampling
    mode is full

    The number of allocated pages for the auxiliary trace buffer is shown
    as zero pages. That is wrong.

    Fix this by saving the number of allocated pages before entering the
    work loop in the interrupt handler. When the interrupt handler processes
    the samples, it may detect the buffer full condition and stop sampling,
    reducing the buffer size to zero.
    Print the correct value in the error message:

    cpum_sf: The AUX buffer with 256 pages for the diagnostic-sampling
    mode is full
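
    A sketch of the fix (field names assumed): snapshot the page count
    before the work loop can shrink it, then report the snapshot.

        unsigned long num_sdb = aux->sfb.num_sdb;

        /* ... the work loop may stop sampling and shrink the buffer ... */

        pr_err("The AUX buffer with %lu pages for the "
               "diagnostic-sampling mode is full\n", num_sdb);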

    Signed-off-by: Thomas Richter
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Thomas Richter
     
  • [ Upstream commit 872f27103874a73783aeff2aac2b41a489f67d7c ]

    /proc/cpuinfo should not print information about CPU 0 when it is offline.

    Fixes: 281eaa8cb67c ("s390/cpuinfo: simplify locking and skip offline cpus early")
    Signed-off-by: Alexander Gordeev
    Reviewed-by: Heiko Carstens
    [heiko.carstens@de.ibm.com: shortened commit message]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Alexander Gordeev
     

17 Apr, 2020

3 commits

  • commit 6c7c851f1b666a8a455678a0b480b9162de86052 upstream.

    Show the full diag statistic table and not just parts of it.

    The issue surfaced in a KVM guest with a number of vcpus
    defined smaller than NR_DIAG_STAT.

    Fixes: 1ec2772e0c3c ("s390/diag: add a statistic for diagnose calls")
    Cc: stable@vger.kernel.org
    Signed-off-by: Michael Mueller
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Michael Mueller
     
  • commit 4d4cee96fb7a3cc53702a9be8299bf525be4ee98 upstream.

    Whenever we get an -EFAULT, we failed to read in guest 2 physical
    address space. Such addressing exceptions are reported via a program
    intercept to the nested hypervisor.

    Having faked the intercept, we have to return to guest 2. Instead,
    right now we would be returning -EFAULT from the intercept handler,
    eventually crashing the VM. The correct thing to do is to return 1, as
    rc == 1 is the internal representation of "we have to go back into g2".

    Addressing exceptions can only happen if the g2->g3 page tables
    reference invalid g2 addresses (say, either a table or the final page
    is not accessible), so something that basically never happens in sane
    environments.

    Identified by manual code inspection.

    Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-3-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    [borntraeger@de.ibm.com: fix patch description]
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit a1d032a49522cb5368e5dfb945a85899b4c74f65 upstream.

    In case we have a region 1 the following calculation

        (31 + ((gmap->asce & _ASCE_TYPE_MASK) >> 2) * 11)

    results in 64. As shifts beyond the operand size are undefined, the
    compiler is free to use instructions like sllg. sllg will only use 6
    bits of the shift value (here 64), resulting in no shift at all. That
    means that ALL addresses will be rejected.

    This can result in endless loops, e.g. when the prefix cannot get
    mapped.

    Fixes: 4be130a08420 ("s390/mm: add shadow gmap support")
    Tested-by: Janosch Frank
    Reported-by: Janosch Frank
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-2-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    [borntraeger@de.ibm.com: fix patch description, remove WARN_ON_ONCE]
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

13 Apr, 2020

1 commit

  • commit 0b38b5e1d0e2f361e418e05c179db05bb688bbd6 upstream.

    When userspace executes a syscall or gets interrupted,
    BEAR contains a kernel address when returning to userspace.
    This makes it pretty easy to figure out where the kernel is
    mapped even with KASLR enabled. To fix this, add lpswe to
    lowcore and always execute it there, so userspace sees only
    the lowcore address of lpswe. For this we have to extend
    both critical_cleanup and the SWITCH_ASYNC macro to also check
    for lpswe addresses in lowcore.

    Fixes: b2d24b97b2a9 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
    Cc: stable@vger.kernel.org # v5.2+
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Sven Schnelle