01 Oct, 2020

5 commits

  • commit d3f7b1bb204099f2f7306318896223e8599bb6a2 upstream.

    Currently, to make sure that every page table entry is read just once,
    gup_fast walks perform READ_ONCE and pass the pXd value down to the
    next gup_pXd_range function by value, e.g.:

        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);

    This function passes a reference to that local value copy to pXd_offset,
    and might get the very same pointer in return. This happens when the
    level is folded (on most arches), and that pointer should not be
    iterated.

    On s390, each task may use 5-, 4- or 3-level address translation, so
    which levels are folded differs from task to task. The logic is
    therefore more complex, and a non-iterable pointer to a local copy
    leads to severe problems.

    Here is an example of what happens with gup_fast on s390, for a task
    with 3-level paging, crossing a 2 GB pud boundary:

        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
                unsigned long next;
                pud_t *pudp;

                // pud_offset returns &p4d itself (a pointer to a value on stack)
                pudp = pud_offset(&p4d, addr);
                do {
                        // on the second iteration, reading a "random" stack value
                        pud_t pud = READ_ONCE(*pudp);

                        // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                        next = pud_addr_end(addr, end);
                        ...
                } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack

                return 1;
        }

    This happens since s390 moved to common gup code with commit
    d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and
    commit 1a42010cdc26 ("s390/mm: convert to the generic
    get_user_pages_fast code").

    s390 tried to mimic static level folding by changing the pXd_offset
    primitives: the top-level page table offset is always calculated in
    pgd_offset, and pXd_offset just returns the value passed in whenever it
    has to act as folded.

    What is crucial for gup_fast and what has been overlooked is that
    PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly.
    And the latter is not possible with dynamic folding.

    To fix the issue, pass the original pXdp pointers down to the
    gup_pXd_range functions in addition to the pXd values, and introduce
    pXd_offset_lockless helpers, which take an additional pXd entry value
    parameter. This has already been discussed in

    https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
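
    A minimal sketch of the resulting pattern (the generic fallback shown
    here assumes no architecture-specific override):

        #ifndef pud_offset_lockless
        #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
        #endif

        static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr,
                                 unsigned long end, unsigned int flags,
                                 struct page **pages, int *nr)
        {
                pud_t *pudp;

                /* both the entry value and the original pointer are passed,
                 * so a folded level iterates over the real page table */
                pudp = pud_offset_lockless(p4dp, p4d, addr);
                ...
        }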

    Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Andrew Morton
    Reviewed-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Mike Rapoport
    Reviewed-by: John Hubbard
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Arnd Bergmann
    Cc: Andrey Ryabinin
    Cc: Heiko Carstens
    Cc: Christian Borntraeger
    Cc: Claudio Imbrenda
    Cc: stable@vger.kernel.org [5.2+]
    Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit fcb2b70cdb194157678fb1a75f9ff499aeba3d2a ]

    Add __init to reserve_memory_end, reserve_oldmem and remove_oldmem.
    Sometimes these functions are not inlined, and then the build
    complains about a section mismatch.

    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Ilya Leoshkevich
     
  • [ Upstream commit 8719b6d29d2851fa84c4074bb2e5adc022911ab8 ]

    request_irq() is preferred over setup_irq(). Invocations of setup_irq()
    occur after memory allocators are ready.

    Per tglx[1], setup_irq() existed in olden days when allocators were not
    ready by the time early interrupts were initialized.

    Hence replace setup_irq() by request_irq().

    [1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos
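
    An illustrative before/after (handler and IRQ names are hypothetical):

        /* old: the caller supplies a static struct irqaction */
        static struct irqaction ext_irqaction = {
                .name    = "EXT",
                .handler = ext_int_handler,
        };
        setup_irq(EXT_IRQ, &ext_irqaction);

        /* new: request_irq() allocates the irqaction itself */
        if (request_irq(EXT_IRQ, ext_int_handler, 0, "EXT", NULL))
                panic("Failed to register EXT interrupt\n");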

    Signed-off-by: afzal mohammed
    Message-Id:
    [heiko.carstens@de.ibm.com: replace pr_err with panic]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    afzal mohammed
     
  • [ Upstream commit 32dab6828c42f087439d3e2617dc7283546bd8f7 ]

    Use kzalloc() to allocate the auxiliary buffer structure initialized
    with all zeroes, to avoid random values in the trace output.

    Avoid double access to SBD hardware flags.
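
    A sketch of the allocation pattern (the structure name is illustrative):

        /* zero-initialized, so unset fields don't leak random values
         * into the trace output */
        aux = kzalloc(sizeof(*aux), GFP_KERNEL);
        if (!aux)
                goto no_aux;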

    Signed-off-by: Thomas Richter
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Thomas Richter
     
  • [ Upstream commit 7bcaad1f9fac889f5fcd1a383acf7e00d006da41 ]

    CALL_ON_STACK is intended to be used for temporary stack switching with
    potential return to the caller.

    When CALL_ON_STACK is misused to switch from the nodat stack to the
    task stack, the back_chain information would later lead the stack
    unwinder from the task stack into the (per-cpu) nodat stack, which is
    reused for other purposes. This would yield confusing unwinding
    results or errors.

    To avoid that, introduce CALL_ON_STACK_NORETURN to be used instead. It
    makes sure that back_chain is zeroed and the unwinder finishes
    gracefully, ending up at the task pt_regs.
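
    A conceptual sketch of the helper (not necessarily the exact upstream
    asm): switch to the new stack, zero that stack's back_chain so the
    unwinder stops there, then call the function and never return.

        #define CALL_ON_STACK_NORETURN(fn, stack)                              \
        ({                                                                     \
                asm volatile(                                                  \
                        "       la      15,0(%[_stack])\n"                     \
                        "       xc      %[_bc](8,15),%[_bc](15)\n"             \
                        "       brasl   14,%[_fn]\n"                           \
                        ::[_bc] "i" (offsetof(struct stack_frame, back_chain)),\
                          [_stack] "a" (stack), [_fn] "X" (fn));               \
                BUG();                                                         \
        })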

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     

10 Sep, 2020

1 commit

  • [ Upstream commit 1196f12a2c960951d02262af25af0bb1775ebcc2 ]

    Since commit a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context}
    to per-cpu variables") the lockdep code itself uses percpu variables.
    This leads to recursions because the percpu macros call
    preempt_enable(), which might call trace_preempt_on().
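
    A sketch of the direction of the fix, assuming the arch percpu macros
    simply switch to the non-tracing preemption helpers:

        /* was preempt_disable()/preempt_enable(); the notrace variants
         * avoid recursing into the tracer from within lockdep */
        preempt_disable_notrace();
        /* ... percpu access ... */
        preempt_enable_notrace();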

    Signed-off-by: Sven Schnelle
    Reviewed-by: Vasily Gorbik
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     

03 Sep, 2020

1 commit

  • [ Upstream commit 535e4fc623fab2e09a0653fc3a3e17f382ad0251 ]

    The node distance is hardcoded to 0, which causes trouble
    for some user-level applications. In particular, "libnuma"
    expects the distance of a node to itself to be LOCAL_DISTANCE.
    This update removes the offending node distance override.
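
    With the override gone, the generic fallback applies, which is what
    libnuma expects (sketch of the asm-generic definition, for reference):

        #ifndef node_distance
        #define node_distance(from, to) \
                ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
        #endif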

    Cc: stable@vger.kernel.org # 4.4
    Fixes: 3a368f742da1 ("s390/numa: add core infrastructure")
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Sasha Levin
     

26 Aug, 2020

2 commits

  • [ Upstream commit fd78c59446b8d050ecf3e0897c5a486c7de7c595 ]

    The key member of the runtime instrumentation control block contains
    only the access key, not the complete storage key. Therefore the value
    must be shifted by four bits. Since existing user space does not
    necessarily query and set the access key correctly, just ignore the
    user space provided key and use the correct one.
    Note: this is only relevant for debugging purposes in case somebody
    compiles a kernel with a default storage access key set to a value not
    equal to zero.
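
    A sketch of the resulting assignment (the control block field name
    follows the description):

        /* the access key is the high four bits of the storage key */
        cb->key = PAGE_DEFAULT_KEY >> 4;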

    Fixes: 262832bc5acd ("s390/ptrace: add runtime instrumention register get/set")
    Reported-by: Claudio Imbrenda
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Heiko Carstens
     
  • [ Upstream commit 9eaba29c7985236e16468f4e6a49cc18cf01443e ]

    The key member of the runtime instrumentation control block contains
    only the access key, not the complete storage key. Therefore the value
    must be shifted by four bits.
    Note: this is only relevant for debugging purposes in case somebody
    compiles a kernel with a default storage access key set to a value not
    equal to zero.

    Fixes: e4b8b3f33fca ("s390: add support for runtime instrumentation")
    Reported-by: Claudio Imbrenda
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Heiko Carstens
     

19 Aug, 2020

1 commit

  • commit ba925fa35057a062ac98c3e8138b013ce4ce351c upstream.

    During s390_enable_sie(), we need to take care of splitting all qemu user
    process THP mappings. This is currently done with follow_page(FOLL_SPLIT),
    by simply iterating over all vma ranges, with PAGE_SIZE increment.

    This logic is sub-optimal and can result in a lot of unnecessary overhead,
    especially when using qemu and ASAN with a large shadow map. Ilya reported
    a significant system slow-down with one CPU busy for a long time and overall
    unresponsiveness.

    Fix this by using walk_page_vma() and directly calling split_huge_pmd()
    only for present pmds, which greatly reduces overhead.
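
    A sketch of the new walk, following the description (callback and ops
    names assumed):

        static int thp_split_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
                                            unsigned long end, struct mm_walk *walk)
        {
                /* only present pmds reach the pmd_entry callback */
                split_huge_pmd(walk->vma, pmd, addr);
                return 0;
        }

        static const struct mm_walk_ops thp_split_walk_ops = {
                .pmd_entry = thp_split_walk_pmd_entry,
        };

        static inline void thp_split_mm(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;

                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        vma->vm_flags &= ~VM_HUGEPAGE;
                        vma->vm_flags |= VM_NOHUGEPAGE;
                        walk_page_vma(vma, &thp_split_walk_ops, NULL);
                }
        }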

    Cc: stable@vger.kernel.org # v5.4+
    Reported-by: Ilya Leoshkevich
    Tested-by: Ilya Leoshkevich
    Acked-by: Christian Borntraeger
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     

16 Jul, 2020

6 commits

  • [ Upstream commit d6df52e9996dcc2062c3d9c9123288468bb95b52 ]

    To be able to patch kernel code before paging is initialized, do a
    plain memcpy if DAT is off. This is required to enable early jump
    label initialization.
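
    A sketch of the dispatch, assuming a machine flag that records whether
    DAT is enabled:

        void notrace s390_kernel_write(void *dst, const void *src, size_t size)
        {
                if (!(S390_lowcore.machine_flags & MACHINE_FLAG_DAT)) {
                        /* paging is not up yet, write directly */
                        memcpy(dst, src, size);
                        return;
                }
                /* ... otherwise use the usual DAT-aware write path ... */
        }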

    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     
  • [ Upstream commit cb2cceaefb4c4dc28fc27ff1f1b2d258bfc10353 ]

    s390_kernel_write()'s function type is almost identical to memcpy().
    Change its return type to "void *" so they can be used interchangeably.
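
    The signature change, in short:

        /* before */
        void s390_kernel_write(void *dst, const void *src, size_t size);
        /* after: the same shape as memcpy() */
        void *s390_kernel_write(void *dst, const void *src, size_t size);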

    Cc: linux-s390@vger.kernel.org
    Cc: heiko.carstens@de.ibm.com
    Signed-off-by: Josh Poimboeuf
    Acked-by: Joe Lawrence
    Acked-by: Miroslav Benes
    Acked-by: Gerald Schaefer # s390
    Signed-off-by: Jiri Kosina
    Signed-off-by: Sasha Levin

    Josh Poimboeuf
     
  • commit 528a9539348a0234375dfaa1ca5dbbb2f8f8e8d2 upstream.

    If the pmd is soft dirty, we must mark the pte as soft dirty (and not
    dirty). This fixes some cases for guest migration with huge page
    backings.
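
    A conceptual sketch of the propagation (bit and helper names from the
    s390 pgtable definitions):

        /* soft dirty must map to soft dirty, not to dirty */
        if (pmd_val(pmd) & _SEGMENT_ENTRY_SOFT_DIRTY)
                pte = pte_mksoft_dirty(pte);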

    Cc: stable@vger.kernel.org # 4.8
    Fixes: bc29b7ac1d9f ("s390/mm: clean up pte/pmd encoding")
    Reviewed-by: Christian Borntraeger
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Janosch Frank
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Janosch Frank
     
  • commit 95e61b1b5d6394b53d147c0fcbe2ae70fbe09446 upstream.

    Command line parameters might set static keys. This is true for s390 at
    least since commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1
    and init_on_free=1 boot options"). To avoid the following WARN:

        static_key_enable_cpuslocked(): static key 'init_on_alloc+0x0/0x40' used
        before call to jump_label_init()

    call jump_label_init() just before parse_early_param().
    jump_label_init() is safe to call multiple times (x86 does that), doesn't
    do any memory allocations and hence should be safe to call that early.
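
    A sketch of the resulting ordering (the surrounding function is
    illustrative):

        void __init startup_init(void)
        {
                ...
                jump_label_init();      /* idempotent, no allocations */
                parse_early_param();    /* may flip static keys, e.g.
                                         * init_on_alloc= */
                ...
        }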

    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Cc: stable@vger.kernel.org # 5.3: d6df52e9996d: s390/maccess: add no DAT mode to kernel_write
    Cc: stable@vger.kernel.org # 5.3
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Vasily Gorbik
     
  • [ Upstream commit 998f5bbe3dbdab81c1cfb1aef7c3892f5d24f6c7 ]

    Currently, if early_pgm_check_handler is called, it ends up in a pgm
    check loop. The problem is that early_pgm_check_handler is instrumented
    by KASAN but executed without the DAT flag enabled, which leads to an
    addressing exception when KASAN checks try to access shadow memory.

    Fix that by executing early handlers with the DAT flag on under KASAN,
    as expected.

    Reported-and-tested-by: Alexander Egorenkov
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     
  • [ Upstream commit 774911290c589e98e3638e73b24b0a4d4530e97c ]

    The current number of KVM_IRQCHIP_NUM_PINS results in an order 3
    allocation (32kb) for each guest start/restart. This can result in OOM
    killer activity even with free swap when the memory is fragmented
    enough:

    kernel: qemu-system-s39 invoked oom-killer: gfp_mask=0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), order=3, oom_score_adj=0
    kernel: CPU: 1 PID: 357274 Comm: qemu-system-s39 Kdump: loaded Not tainted 5.4.0-29-generic #33-Ubuntu
    kernel: Hardware name: IBM 8562 T02 Z06 (LPAR)
    kernel: Call Trace:
    kernel: ([] show_stack+0x7a/0xc0)
    kernel: [] dump_stack+0x8a/0xc0
    kernel: [] dump_header+0x62/0x258
    kernel: [] oom_kill_process+0x172/0x180
    kernel: [] out_of_memory+0xee/0x580
    kernel: [] __alloc_pages_slowpath+0xd18/0xe90
    kernel: [] __alloc_pages_nodemask+0x2a4/0x320
    kernel: [] kmalloc_order+0x34/0xb0
    kernel: [] kmalloc_order_trace+0x32/0xe0
    kernel: [] kvm_set_irq_routing+0xa6/0x2e0
    kernel: [] kvm_arch_vm_ioctl+0x544/0x9e0
    kernel: [] kvm_vm_ioctl+0x396/0x760
    kernel: [] do_vfs_ioctl+0x376/0x690
    kernel: [] ksys_ioctl+0x84/0xb0
    kernel: [] __s390x_sys_ioctl+0x2a/0x40
    kernel: [] system_call+0xd8/0x2c8

    As far as I can tell s390x does not use the iopins, as we bail out for
    anything other than KVM_IRQ_ROUTING_S390_ADAPTER, and the chip/pin is
    only used for KVM_IRQ_ROUTING_IRQCHIP. So let us use a small number to
    reduce the memory footprint.
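
    The footprint reduction itself is a one-line change of the limit (the
    new value follows the upstream commit):

        #define KVM_IRQCHIP_NUM_PINS 1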

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Cornelia Huck
    Reviewed-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200617083620.5409-1-borntraeger@de.ibm.com
    Signed-off-by: Sasha Levin

    Christian Borntraeger
     

09 Jul, 2020

1 commit

  • [ Upstream commit 827c4913923e0b441ba07ba4cc41e01181102303 ]

    When insanely large debug buffers are specified, a kernel warning is
    printed. The debug code does handle the error gracefully, though.
    Instead of duplicating the check, let us silence the warning to
    avoid crashes when panic_on_warn is used.
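
    A sketch of the silencing (the allocation site is illustrative):

        /* the caller already handles a failed allocation gracefully */
        areas = kmalloc_array(nr_areas, sizeof(*areas),
                              GFP_KERNEL | __GFP_NOWARN);
        if (!areas)
                return NULL;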

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Christian Borntraeger
     

01 Jul, 2020

4 commits

  • [ Upstream commit 478237a595120a18e9b52fd2c57a6e8b7a01e411 ]

    clock_getres in the vDSO library has to preserve the same behaviour
    as posix_get_hrtimer_res().

    In particular, posix_get_hrtimer_res() does:

        sec = 0;
        ns = hrtimer_resolution;

    and hrtimer_resolution depends on the enablement of the high
    resolution timers, which can happen either at compile or at run time.

    Fix the s390 vdso implementation of clock_getres by keeping a copy of
    hrtimer_resolution in the vdso data and using that directly.
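
    A sketch of the kernel-side update (the vdso data field name follows
    the description):

        /* keep the vdso's copy coherent with the kernel's resolution */
        vdso_data->hrtimer_res = hrtimer_resolution;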

    Link: https://lkml.kernel.org/r/20200324121027.21665-1-vincenzo.frascino@arm.com
    Signed-off-by: Vincenzo Frascino
    Acked-by: Martin Schwidefsky
    [heiko.carstens@de.ibm.com: use llgf for proper zero extension]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vincenzo Frascino
     
  • [ Upstream commit 2b2a25845d534ac6d55086e35c033961fdd83a26 ]

    Currently, the VDSO is being linked through $(CC). This does not match
    how the rest of the kernel links objects, which is through the $(LD)
    variable.

    When clang is built in a default configuration, it first attempts to use
    the target triple's default linker, which is just ld. However, the user
    can override this through the CLANG_DEFAULT_LINKER cmake define so that
    clang uses another linker by default, such as LLVM's own linker, ld.lld.
    This can be useful to get more optimized links across various different
    projects.

    However, this is problematic for the s390 vDSO because ld.lld does not
    have any s390 emulation support:

    https://github.com/llvm/llvm-project/blob/llvmorg-10.0.1-rc1/lld/ELF/Driver.cpp#L132-L150

    Thus, if a user is using a toolchain with ld.lld as the default, they
    will see an error, even if they have specified ld.bfd through the LD
    make variable:

        $ make -j"$(nproc)" -s ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- LLVM=1 \
               LD=s390x-linux-gnu-ld \
               defconfig arch/s390/kernel/vdso64/
        ld.lld: error: unknown emulation: elf64_s390
        clang-11: error: linker command failed with exit code 1 (use -v to see invocation)

    Normally, '-fuse-ld=bfd' could be used to get around this; however, this
    can be fragile, depending on paths and variable naming. The cleaner
    solution for the kernel is to take advantage of the fact that $(LD) can
    be invoked directly, which bypasses the heuristics of $(CC) and respects
    the user's choice. Similar changes have been done for ARM, ARM64, and
    MIPS.

    Link: https://lkml.kernel.org/r/20200602192523.32758-1-natechancellor@gmail.com
    Link: https://github.com/ClangBuiltLinux/linux/issues/1041
    Signed-off-by: Nathan Chancellor
    Reviewed-by: Nick Desaulniers
    [heiko.carstens@de.ibm.com: add --build-id flag]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Nathan Chancellor
     
  • [ Upstream commit 873e5a763d604c32988c4a78913a8dab3862d2f9 ]

    When strace wants to update the syscall number, it sets GPR2
    to the desired number and updates the GPR via PTRACE_SETREGSET.
    It doesn't update regs->int_code, which would cause the old syscall
    to be executed on syscall restart. As we cannot change the ptrace ABI and
    don't have a field for the interruption code, check whether the tracee
    is in a syscall and the last instruction was svc. In that case assume
    that the tracer wants to update the syscall number and copy the GPR2
    value to regs->int_code.
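
    A heavily simplified sketch of the heuristic (the svc-check helper is
    hypothetical):

        /* tracee sits in a syscall and the previous instruction was svc:
         * assume the tracer meant to change the syscall number */
        if (test_pt_regs_flag(regs, PIF_SYSCALL) &&
            instruction_is_svc(regs->psw.addr - 2))
                regs->int_code = regs->gprs[2];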

    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     
  • [ Upstream commit 00332c16b1604242a56289ff2b26e283dbad0812 ]

    Tracing expects to see invalid syscalls, so pass them through.
    The syscall path in entry.S checks the syscall number before
    looking up the handler, so it is still safe.

    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Sven Schnelle
     

24 Jun, 2020

1 commit

  • commit b3583fca5fb654af2cfc1c08259abb9728272538 upstream.

    If both the tracer and the tracee are compat processes, and gprs[2]
    is assigned a value by __poke_user_compat, then the higher 32 bits
    of gprs[2] are cleared, IS_ERR_VALUE() always returns false, and
    syscall_get_error() always returns 0.

    Fix the implementation by sign-extending the value for compat processes
    the same way as x86 implementation does.

    The bug was exposed to user space by commit 201766a20e30f ("ptrace: add
    PTRACE_GET_SYSCALL_INFO request") and detected by strace test suite.

    This change fixes strace syscall tampering on s390.
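
    A sketch of the fixed helper, mirroring the x86 approach mentioned
    above:

        static inline long syscall_get_error(struct task_struct *task,
                                             struct pt_regs *regs)
        {
                unsigned long error = regs->gprs[2];

        #ifdef CONFIG_COMPAT
                if (test_tsk_thread_flag(task, TIF_31BIT))
                        /* sign-extend so a 32-bit error value is still
                         * recognized by IS_ERR_VALUE() in 64 bits */
                        error = (long)(int)error;
        #endif
                return IS_ERR_VALUE(error) ? error : 0;
        }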

    Link: https://lkml.kernel.org/r/20200602180051.GA2427@altlinux.org
    Fixes: 753c4dd6a2fa2 ("[S390] ptrace changes")
    Cc: Elvira Khabirova
    Cc: stable@vger.kernel.org # v2.6.28+
    Signed-off-by: Dmitry V. Levin
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Dmitry V. Levin
     

17 Jun, 2020

1 commit

  • [ Upstream commit e1750a3d9abbea2ece29cac8dc5a6f5bc19c1492 ]

    After disabling a function, the original handle is logged instead of
    the disabled handle.

    Link: https://lkml.kernel.org/r/20200522183922.5253-1-ptesarik@suse.com
    Fixes: 17cdec960cf7 ("s390/pci: Recover handle in clp_set_pci_fn()")
    Reviewed-by: Pierre Morel
    Signed-off-by: Petr Tesarik
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Petr Tesarik
     

07 Jun, 2020

2 commits

  • [ Upstream commit ac8372f3b4e41015549b331a4f350224661e7fc6 ]

    On s390, the layout of normal and large ptes (i.e. pmds/puds) differs.
    Therefore, set_huge_pte_at() does a conversion from a normal pte to
    the corresponding large pmd/pud. So, when converting an empty pte, this
    should result in an empty pmd/pud, which would return true for
    pmd/pud_none().

    However, after conversion we also mark the pmd/pud as large, and
    therefore present. For empty ptes, this will result in an empty pmd/pud
    that is also marked as large, and pmd/pud_none() would not return true.

    There is currently no issue with this behaviour, as set_huge_pte_at()
    does not seem to be called for empty ptes. It would be valid though, so
    let's fix this by not marking empty ptes as large in set_huge_pte_at().
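
    A conceptual sketch of the conversion guard (the helper's shape is
    assumed):

        /* only a present pte may be marked large, and thus present,
         * after conversion; an empty pte yields an empty pmd/pud */
        if (pte_present(pte))
                rste |= _SEGMENT_ENTRY_LARGE;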

    This was found by testing a patch from Anshuman Khandual, which is
    currently discussed on LKML ("mm/debug: Add more arch page table helper
    tests").

    Signed-off-by: Gerald Schaefer
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Gerald Schaefer
     
  • [ Upstream commit b4adfe55915d8363e244e42386d69567db1719b9 ]

    A typical backtrace acquired from an ftraced function currently looks
    like the following (e.g. for "path_openat"):

        arch_stack_walk+0x15c/0x2d8
        stack_trace_save+0x50/0x68
        stack_trace_call+0x15a/0x3b8
        ftrace_graph_caller+0x0/0x1c
        0x3e0007e3c98

    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Vasily Gorbik
     

27 May, 2020

3 commits

  • commit 70b690547d5ea1a3d135a4cc39cd1e08246d0c3a upstream.

    initrd_start must not point at the location where the initrd is loaded
    into the crashkernel memory, but at the location it will be at after
    the crashkernel memory is swapped with the memory at address 0.

    Fixes: ee337f5469fd ("s390/kexec_file: Add crash support to image loader")
    Reported-by: Lianbo Jiang
    Signed-off-by: Philipp Rudo
    Tested-by: Lianbo Jiang
    Link: https://lore.kernel.org/r/20200512193956.15ae3f23@laptop2-ibm.local
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Philipp Rudo
     
  • commit 4c1cbcbd6c56c79de2c07159be4f55386bb0bef2 upstream.

    With certain kernel configurations, the R_390_JMP_SLOT relocation type
    might be generated, which is not expected by the KASLR relocation code,
    and the kernel stops with the message "Unknown relocation type".

    This was found with a zfcpdump kernel config, where CONFIG_MODULES=n
    and CONFIG_VFIO=n. In that case, symbol_get() is used on undefined
    __weak symbols in virt/kvm/vfio.c, which results in the generation
    of R_390_JMP_SLOT relocation types.

    Fix this by handling R_390_JMP_SLOT similar to R_390_GLOB_DAT.
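
    The handling then looks like this (a sketch of the relocation switch):

        case R_390_GLOB_DAT:    /* Create GOT entry. */
        case R_390_JMP_SLOT:    /* Create PLT entry. */
                *(u64 *)loc = val;
                break;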

    Fixes: 805bc0bc238f ("s390/kernel: build a relocatable kernel")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Philipp Rudo
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit f058599e22d59e594e5aae1dc10560568d8f4a8b upstream.

    The s390_mmio_read/write syscalls are currently broken when running with
    MIO.

    The new pcistb_mio/pcistg_mio/pcilg_mio instructions are executed
    similarly to normal load/store instructions and do address translation
    in the current address space. That means inside the kernel they are
    aware of mappings into kernel address space while outside the kernel
    they use user space mappings (usually created through mmap'ing a PCI
    device file).

    Now when existing user space applications use the s390_pci_mmio_write
    and s390_pci_mmio_read syscalls, they pass I/O addresses that are mapped
    into user space so as to be usable with the new instructions without
    needing a syscall. Accessing these addresses with the old instructions
    as done currently leads to a kernel panic.

    Also, for such a user space mapping there may not exist an equivalent
    kernel space mapping which means we can't just use the new instructions
    in kernel space.

    Instead of replicating user mappings in the kernel which then might
    collide with other mappings, we can conceptually execute the new
    instructions as if executed by the user space application using the
    secondary address space. This even allows us to directly store to the
    user pointer without the need for copy_to/from_user().

    Cc: stable@vger.kernel.org
    Fixes: 71ba41c9b1d9 ("s390/pci: provide support for MIO instructions")
    Signed-off-by: Niklas Schnelle
    Reviewed-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Niklas Schnelle
     

14 May, 2020

1 commit

  • commit 5615e74f48dcc982655543e979b6c3f3f877e6f6 upstream.

    In LPAR we will only get an intercept for FC==3 for the PQAP
    instruction. Running nested under z/VM can result in other intercepts
    as well, since ECA_APIE is an effective bit: if one hypervisor layer
    has turned this bit off, the end result will be that we will get
    intercepts for all function codes. Usually the first one will be a
    query like PQAP(QCI). So the WARN_ON_ONCE is not right. Let us simply
    remove it.

    Cc: Pierre Morel
    Cc: Tony Krowiak
    Cc: stable@vger.kernel.org # v5.3+
    Fixes: e5282de93105 ("s390: ap: kvm: add PQAP interception for AQIC")
    Link: https://lore.kernel.org/kvm/20200505083515.2720-1-borntraeger@de.ibm.com
    Reported-by: Qian Cai
    Signed-off-by: Christian Borntraeger
    Reviewed-by: David Hildenbrand
    Reviewed-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

02 May, 2020

2 commits

  • commit 86dbf32da150339ca81509fa2eb84c814b55258b upstream.

    With the introduction of CPU directed interrupts, the kernel
    parameter pci=force_floating was introduced to fall back
    to the previous behavior of using floating irqs.

    However we were still setting the affinity in that case,
    both in __irq_alloc_descs() and via the irq_set_affinity
    callback in struct irq_chip.

    For the former only set the affinity in the directed case.

    The latter is explicitly set in zpci_directed_irq_init()
    so we can just leave it unset for the floating case.

    Fixes: e979ce7bced2 ("s390/pci: provide support for CPU directed interrupts")
    Co-developed-by: Alexander Schmidt
    Signed-off-by: Alexander Schmidt
    Signed-off-by: Niklas Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Niklas Schnelle
     
  • commit 8ebf6da9db1b2a20bb86cc1bee2552e894d03308 upstream.

    Switching tracers includes instruction patching. To prevent an
    instruction from being patched while it is read, the patching is done
    in stop_machine 'context'. This also means that any function called
    during stop_machine must not be traced. Thus add 'notrace' to all
    functions called within stop_machine.
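
    A sketch of the annotation on one of the affected helpers (per the
    Fixes tags; the exact body is assumed):

        /* runs from the stop_machine() callback, so must not be traced */
        void notrace diag_stat_inc_norecursion(enum diag_stat_enum nr)
        {
                this_cpu_inc(diag_stat.counter[nr]);
        }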

    Fixes: 1ec2772e0c3c ("s390/diag: add a statistic for diagnose calls")
    Fixes: 38f2c691a4b3 ("s390: improve wait logic of stop_machine")
    Fixes: 4ecf0a43e729 ("processor: get rid of cpu_relax_yield")
    Signed-off-by: Philipp Rudo
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Philipp Rudo
     

29 Apr, 2020

2 commits

  • commit 316ec154810960052d4586b634156c54d0778f74 upstream.

    A page table upgrade in a kernel section that uses secondary address
    mode will mess up the kernel instructions as follows:

    Consider the following scenario: two threads are sharing memory.
    On CPU1 thread 1 does e.g. strnlen_user(). That gets to

        old_fs = enable_sacf_uaccess();
        len = strnlen_user_srst(src, size);

    and

        "   la    %2,0(%1)\n"
        "   la    %3,0(%0,%1)\n"
        "   slgr  %0,%0\n"
        "   sacf  256\n"
        "0: srst  %3,%2\n"

    in strnlen_user_srst(). At that point we are in secondary space mode,
    control register 1 points to the kernel page table and instruction
    fetching happens via c1, rather than the usual c13. Interrupts are not
    disabled, for obvious reasons.

    On CPU2 thread 2 does a MAP_FIXED mmap(), forcing an upgrade of the
    page table from a 3-level to e.g. a 4-level one. We'd allocated the new
    top-level table, set it up and now we hit this:

        notify = 1;
        spin_unlock_bh(&mm->page_table_lock);
        }
        if (notify)
                on_each_cpu(__crst_table_upgrade, mm, 0);

    OK, we need to actually change over to use of the new page table and
    we need that to happen in all threads that are currently running.
    Which happens to include thread 1. An IPI is delivered and we have

        static void __crst_table_upgrade(void *arg)
        {
                struct mm_struct *mm = arg;

                if (current->active_mm == mm)
                        set_user_asce(mm);
                __tlb_flush_local();
        }

    run on CPU1. That does

        static inline void set_user_asce(struct mm_struct *mm)
        {
                S390_lowcore.user_asce = mm->context.asce;

    OK, user page table address updated...

                __ctl_load(S390_lowcore.user_asce, 1, 1);

    ... and control register 1 set to it.

                clear_cpu_flag(CIF_ASCE_PRIMARY);
        }

    IPI is run in home space mode, so it's fine - insns are fetched
    using c13, which always points to kernel page table. But as soon
    as we return from the interrupt, previous PSW is restored, putting
    CPU1 back into secondary space mode, at which point we no longer
    get the kernel instructions from the kernel mapping.

    The fix is to only fix up the control registers that are currently in
    use for user processes during the page table update. We must also
    disable interrupts in enable_sacf_uaccess to synchronize the cr and
    thread.mm_segment updates against the on_each_cpu().

    Fixes: 0aaba41b58bc ("s390: remove all code using the access register mode")
    Cc: stable@vger.kernel.org # 4.15+
    Reported-by: Al Viro
    Reviewed-by: Gerald Schaefer
    References: CVE-2020-11884
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     
  • commit 97daa028f3f621adff2c4f7b15fe0874e5b5bd6c upstream.

    Return the index of the last valid slot from gfn_to_memslot_approx() if
    its binary search loop yielded an out-of-bounds index. The index can
    be out-of-bounds if the specified gfn is less than the base of the
    lowest memslot (which is also the last valid memslot).

    Note, the sole caller, kvm_s390_get_cmma(), ensures used_slots is
    non-zero.
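
    A sketch of the clamp after the binary search (per the description):

        /* gfn below the lowest memslot: the search ran off the end */
        if (start >= slots->used_slots)
                return slots->used_slots - 1;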

    Fixes: afdad61615cc3 ("KVM: s390: Fix storage attributes migration with memory slots")
    Cc: stable@vger.kernel.org # 4.19.x: 0774a964ef56: KVM: Fix out of range accesses to memslots
    Cc: stable@vger.kernel.org # 4.19.x
    Signed-off-by: Sean Christopherson
    Message-Id:
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

23 Apr, 2020

3 commits

  • [ Upstream commit 1493e0f944f3c319d11e067c185c904d01c17ae5 ]

    We have to properly retry again by returning -EINVAL immediately in
    case somebody else instantiated the table concurrently. The goto was
    missing in this function only. The code now matches the other, similar
    shadowing functions.

    We are overwriting an existing region 2 table entry. All allocated pages
    are added to the crst_list to be freed later, so they are not lost
    forever. However, when unshadowing the region 2 table, we wouldn't trigger
    unshadowing of the original shadowed region 3 table that we replaced. It
    would get unshadowed when the original region 3 table is modified. As it's
    not connected to the page table hierarchy anymore, it's not going to get
    used anymore. However, for a limited time, this page table will stick
    around, so it's in some sense a temporary memory leak.

    Identified by manual code inspection. I don't think this classifies as
    stable material.

    Fixes: 998f637cc4b9 ("s390/mm: avoid races on region/segment/page table shadowing")
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-4-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Sasha Levin

    David Hildenbrand
     
  • [ Upstream commit 4141b6a5e9f171325effc36a22eb92bf961e7a5c ]

    When perf record -e SF_CYCLES_BASIC_DIAG runs with a very high
    frequency, the samples arrive faster than the perf process can
    save them to file. Eventually, for longer running processes, this
    leads to the situation where the trace buffers allocated by perf
    slowly fill up. At one point the auxiliary trace buffer is full
    and the CPU Measurement sampling facility is turned off. Furthermore
    a warning is printed to the kernel log buffer:

    cpum_sf: The AUX buffer with 0 pages for the diagnostic-sampling
    mode is full

    The number of allocated pages for the auxiliary trace buffer is shown
    as zero pages. That is wrong.

    Fix this by saving the number of allocated pages before entering the
    work loop in the interrupt handler. When the interrupt handler processes
    the samples, it may detect the buffer full condition and stop sampling,
    reducing the buffer size to zero.
    Print the correct value in the error message:

    cpum_sf: The AUX buffer with 256 pages for the diagnostic-sampling
    mode is full
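
    A sketch of the fix (field names assumed): snapshot the page count
    before the work loop can shrink it, then report the snapshot.

        unsigned long num_sdb = aux->sfb.num_sdb;

        /* ... the work loop may stop sampling and shrink the buffer ... */

        pr_err("The AUX buffer with %lu pages for the "
               "diagnostic-sampling mode is full\n", num_sdb);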

    Signed-off-by: Thomas Richter
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Thomas Richter
     
  • [ Upstream commit 872f27103874a73783aeff2aac2b41a489f67d7c ]

    /proc/cpuinfo should not print information about CPU 0 when it is offline.

    Fixes: 281eaa8cb67c ("s390/cpuinfo: simplify locking and skip offline cpus early")
    Signed-off-by: Alexander Gordeev
    Reviewed-by: Heiko Carstens
    [heiko.carstens@de.ibm.com: shortened commit message]
    Signed-off-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Sasha Levin

    Alexander Gordeev
     

17 Apr, 2020

3 commits

  • commit 6c7c851f1b666a8a455678a0b480b9162de86052 upstream.

    Show the full diag statistic table and not just parts of it.

    The issue surfaced in a KVM guest with a number of vcpus
    defined smaller than NR_DIAG_STAT.

    Fixes: 1ec2772e0c3c ("s390/diag: add a statistic for diagnose calls")
    Cc: stable@vger.kernel.org
    Signed-off-by: Michael Mueller
    Reviewed-by: Heiko Carstens
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Michael Mueller
     
  • commit 4d4cee96fb7a3cc53702a9be8299bf525be4ee98 upstream.

    Whenever we get an -EFAULT, we failed to read in guest 2 physical
    address space. Such addressing exceptions are reported via a program
    intercept to the nested hypervisor.

    Having faked the intercept, we have to return to guest 2. Instead,
    right now we would be returning -EFAULT from the intercept handler,
    eventually crashing the VM. The correct thing to do is to return 1, as
    rc == 1 is the internal representation of "we have to go back into g2".

    Addressing exceptions can only happen if the g2->g3 page tables
    reference invalid g2 addresses (say, either a table or the final page
    is not accessible), so something that basically never happens in sane
    environments.

    Identified by manual code inspection.

    Fixes: a3508fbe9dc6 ("KVM: s390: vsie: initial support for nested virtualization")
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-3-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    [borntraeger@de.ibm.com: fix patch description]
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     
  • commit a1d032a49522cb5368e5dfb945a85899b4c74f65 upstream.

    In case we have a region 1 the following calculation

        (31 + ((gmap->asce & _ASCE_TYPE_MASK) >> 2) * 11)

    results in 64. As shifts beyond the operand size are undefined, the
    compiler is free to use instructions like sllg. sllg will only use 6
    bits of the shift value (here 64), resulting in no shift at all. That
    means that ALL addresses will be rejected.

    This can result in endless loops, e.g. when the prefix cannot get
    mapped.

    Fixes: 4be130a08420 ("s390/mm: add shadow gmap support")
    Tested-by: Janosch Frank
    Reported-by: Janosch Frank
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: David Hildenbrand
    Link: https://lore.kernel.org/r/20200403153050.20569-2-david@redhat.com
    Reviewed-by: Claudio Imbrenda
    Reviewed-by: Christian Borntraeger
    [borntraeger@de.ibm.com: fix patch description, remove WARN_ON_ONCE]
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

13 Apr, 2020

1 commit

  • commit 0b38b5e1d0e2f361e418e05c179db05bb688bbd6 upstream.

    When userspace executes a syscall or gets interrupted,
    BEAR contains a kernel address when returning to userspace.
    This makes it pretty easy to figure out where the kernel is
    mapped even with KASLR enabled. To fix this, add lpswe to
    lowcore and always execute it there, so userspace sees only
    the lowcore address of lpswe. For this we have to extend
    both critical_cleanup and the SWITCH_ASYNC macro to also check
    for lpswe addresses in lowcore.

    Fixes: b2d24b97b2a9 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
    Cc: stable@vger.kernel.org # v5.2+
    Reviewed-by: Gerald Schaefer
    Signed-off-by: Sven Schnelle
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Greg Kroah-Hartman

    Sven Schnelle