13 Jan, 2021

1 commit

  • [ Upstream commit 87dbc209ea04645fd2351981f09eff5d23f8e2e9 ]

    Make <asm-generic/local64.h> mandatory in include/asm-generic/Kbuild and
    remove all arch/*/include/asm/local64.h arch-specific files since they
    only #include <asm-generic/local64.h>.

    This fixes build errors on arch/c6x/ and arch/nios2/ for
    block/blk-iocost.c.

    Build-tested on 21 of 25 arch-es. (tools problems on the others)

    Yes, we could even rename <asm-generic/local64.h> to
    <linux/local64.h> and change all #includes to use
    <linux/local64.h> instead.

    Link: https://lkml.kernel.org/r/20201227024446.17018-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Suggested-by: Christoph Hellwig
    Reviewed-by: Masahiro Yamada
    Cc: Jens Axboe
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Peter Zijlstra
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Randy Dunlap
     

06 Jan, 2021

1 commit


30 Dec, 2020

5 commits

  • commit 454efcf82ea17d7efeb86ebaa20775a21ec87d27 upstream.

    When a machine check interrupt is triggered during idle, the code
    is using the async timer/clock for idle time calculation. It should use
    the machine check enter timer/clock which is passed to the macro.

    Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S")
    Cc: <stable@vger.kernel.org> # 5.8
    Reviewed-by: Heiko Carstens
    Signed-off-by: Sven Schnelle
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Sven Schnelle
     
  • commit e259b3fafa7de362b04ecd86e7fa9a9e9273e5fb upstream.

    During removal of the critical section cleanup the calculation
    of mt_cycles during idle was removed. This causes invalid
    accounting on systems with SMT enabled.

    Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S")
    Cc: <stable@vger.kernel.org> # 5.8
    Reviewed-by: Heiko Carstens
    Signed-off-by: Sven Schnelle
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Sven Schnelle
     
  • commit 613775d62ec60202f98d2c5f520e6e9ba6dd4ac4 upstream.

    diag308 subcode 0 performs a clear reset which includes the reset of
    all registers in the system. While this is the preferred behavior when
    loading a normal kernel via kexec, it prevents the crash kernel from
    storing the register values in the dump. To prevent this, use subcode 1
    when loading a crash kernel instead.

    Fixes: ee337f5469fd ("s390/kexec_file: Add crash support to image loader")
    Cc: <stable@vger.kernel.org> # 4.17
    Signed-off-by: Philipp Rudo
    Reported-by: Xiaoying Yan
    Tested-by: Lianbo Jiang
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Philipp Rudo
     
  • commit b5e438ebd7e808d1d2435159ac4742e01a94b8da upstream.

    Not resetting the SMT siblings might leave them in an unpredictable
    state. One of the observed problems was that the CPU timer wasn't
    reset and therefore large system time values were accounted during
    CPU bringup.

    Cc: <stable@vger.kernel.org> # 4.0
    Fixes: 10ad34bc76dfb ("s390: add SMT support")
    Reviewed-by: Heiko Carstens
    Signed-off-by: Sven Schnelle
    Signed-off-by: Heiko Carstens
    Signed-off-by: Greg Kroah-Hartman

    Sven Schnelle
     
  • [ Upstream commit f22b9c219a798e1bf11110a3d2733d883e6da059 ]

    The CALL_ON_STACK tests use the no_dat stack to switch to a different
    stack for unwinding tests. If an interrupt or machine check happens
    while using that stack, and previously being on the async stack, the
    interrupt / machine check entry code (SWITCH_ASYNC) will assume that
    the previous context did not use the async stack and happily use the
    async stack again.

    This will lead to stack corruption of the previous context.

    To solve this, disable both interrupts and machine checks before
    switching to the no_dat stack.

    Fixes: 7868249fbbc8 ("s390/test_unwind: add CALL_ON_STACK tests")
    Signed-off-by: Heiko Carstens
    Signed-off-by: Sasha Levin

    Heiko Carstens
     

03 Dec, 2020

2 commits

  • With commit 58c644ba512c ("sched/idle: Fix arch_cpu_idle() vs
    tracing") common code calls arch_cpu_idle() with a lockdep state that
    tells irqs are on.

    This doesn't work very well for s390: psw_idle() will enable interrupts
    to wait for an interrupt. As soon as an interrupt occurs the interrupt
    handler will verify if the old context was psw_idle(). If that is the
    case the interrupt enablement bits in the old program status word will
    be cleared.

    A subsequent test in both the external as well as the io interrupt
    handler checks if in the old context interrupts were enabled. Due to
    the above patching of the old program status word it is assumed the
    old context had interrupts disabled, and therefore a call to
    TRACE_IRQS_OFF (aka trace_hardirqs_off_caller) is skipped. This in
    turn makes lockdep incorrectly "think" that interrupts are enabled
    within the interrupt handler.

    Fix this by unconditionally calling TRACE_IRQS_OFF when entering
    interrupt handlers. Also unconditionally call TRACE_IRQS_ON when
    leaving interrupt handlers.

    This leaves the special psw_idle() case, which now returns with
    interrupts disabled, but has an "irqs on" lockdep state. So callers of
    psw_idle() must adjust the state on their own, if required. This is
    currently only __udelay_disabled().

    Fixes: 58c644ba512c ("sched/idle: Fix arch_cpu_idle() vs tracing")
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Heiko Carstens

    Heiko Carstens
     
  • The directed MSIs are delivered to CPUs whose address is
    written to the MSI message address. The current code assumes
    that a CPU logical number (as it is seen by the kernel)
    is also the CPU address.

    The above assumption is not correct, as the CPU address
    is rather the value returned by the STAP instruction. That
    value does not necessarily match the kernel logical CPU
    number.

    Fixes: e979ce7bced2 ("s390/pci: provide support for CPU directed interrupts")
    Cc: <stable@vger.kernel.org> # v5.2+
    Signed-off-by: Alexander Gordeev
    Reviewed-by: Halil Pasic
    Reviewed-by: Niklas Schnelle
    Signed-off-by: Niklas Schnelle
    Signed-off-by: Heiko Carstens

    Alexander Gordeev
     

30 Nov, 2020

1 commit


28 Nov, 2020

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "ARM:
    - Fix alignment of the new HYP sections
    - Fix GICR_TYPER access from userspace

    S390:
    - do not reset the global diag318 data for per-cpu reset
    - do not mark memory as protected too early
    - fix for destroy page ultravisor call

    x86:
    - fix for SEV debugging
    - fix incorrect return code
    - fix for 'noapic' with PIC in userspace and LAPIC in kernel
    - fix for 5-level paging"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT
    KVM: x86: Fix split-irqchip vs interrupt injection window request
    KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint
    MAINTAINERS: Update email address for Sean Christopherson
    MAINTAINERS: add uv.c also to KVM/s390
    s390/uv: handle destroy page legacy interface
    KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace
    KVM: SVM: fix error return code in svm_create_vcpu()
    KVM: SVM: Fix offset computation bug in __sev_dbg_decrypt().
    KVM: arm64: Correctly align nVHE percpu data
    KVM: s390: remove diag318 reset code
    KVM: s390: pv: Mark mm as protected after the set secure parameters and improve cleanup

    Linus Torvalds
     

25 Nov, 2020

1 commit


24 Nov, 2020

1 commit

  • We call arch_cpu_idle() with RCU disabled, but then use
    local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

    Switch all arch_cpu_idle() implementations to use
    raw_local_irq_{en,dis}able() and carefully manage the
    lockdep,rcu,tracing state like we do in entry.

    (XXX: we really should change arch_cpu_idle() to not return with
    interrupts enabled)

    Reported-by: Sven Schnelle
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Mark Rutland
    Tested-by: Mark Rutland
    Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org

    Peter Zijlstra
     

23 Nov, 2020

1 commit

  • We need to disable interrupts in load_fpu_regs(). Otherwise an
    interrupt might come in after the registers are loaded, but before
    CIF_FPU is cleared in load_fpu_regs(). When the interrupt returns,
    CIF_FPU will be cleared and the registers will never be restored.

    The entry.S code usually saves the interrupt state in __SF_EMPTY on the
    stack when disabling/restoring interrupts. sie64a however saves the pointer
    to the sie control block in __SF_SIE_CONTROL, which references the same
    location. This is non-obvious to the reader. To avoid clobbering the sie
    control block pointer in load_fpu_regs(), move the __SIE_* offsets eight
    bytes after __SF_EMPTY on the stack.

    Cc: <stable@vger.kernel.org> # 5.8
    Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S")
    Reported-by: Pierre Morel
    Signed-off-by: Sven Schnelle
    Acked-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Signed-off-by: Heiko Carstens

    Sven Schnelle
     

19 Nov, 2020

1 commit


18 Nov, 2020

2 commits

  • Older firmware can return rc=0x107 rrc=0xd for destroy page if the
    page is already non-secure. This should be handled like a success
    as already done by newer firmware.
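
A minimal user-space sketch of the rule described above. Only the values 0x107 and 0xd come from the commit message; the struct, the function names, and the `cc` parameter (standing in for "the ultravisor call reported a failure") are hypothetical simplifications, not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Simplified stand-in for the response header of an ultravisor call. */
struct uv_header_model {
    uint16_t rc;   /* return code */
    uint16_t rrc;  /* return reason code */
};

/*
 * Map the outcome of a "destroy page" call to 0 or a negative errno.
 * cc != 0 models a call that did not complete successfully.
 */
static inline int destroy_page_result(int cc, struct uv_header_model hdr)
{
    if (cc == 0)
        return 0;
    /* Older firmware: page was already non-secure, treat as success. */
    if (hdr.rc == 0x107 && hdr.rrc == 0xd)
        return 0;
    return -EINVAL;
}

/* Small self-check of the three interesting cases. */
static inline void check_destroy_page_result(void)
{
    struct uv_header_model ok = { 0x0001, 0x0 };
    struct uv_header_model legacy = { 0x107, 0xd };
    struct uv_header_model other = { 0x107, 0x1 };

    assert(destroy_page_result(0, ok) == 0);
    assert(destroy_page_result(1, legacy) == 0);
    assert(destroy_page_result(1, other) == -EINVAL);
}
```

The point of the design is that callers never need to know which firmware generation they run on: both variants end up reporting success for a page that is already non-secure.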

    Signed-off-by: Christian Borntraeger
    Fixes: 1a80b54d1ce1 ("s390/uv: add destroy page call")
    Reviewed-by: David Hildenbrand
    Acked-by: Cornelia Huck
    Reviewed-by: Janosch Frank

    Christian Borntraeger
     
  • Pull s390 fixes from Heiko Carstens:

    - fix system call exit path; avoid return to user space with any
    TIF/CIF/PIF set

    - fix file permission for cpum_sfb_size parameter

    - another small defconfig update

    * tag 's390-5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/cpum_sf.c: fix file permission for cpum_sfb_size
    s390: update defconfigs
    s390: fix system call exit path

    Linus Torvalds
     

17 Nov, 2020

1 commit


12 Nov, 2020

2 commits

  • This file is installed by the s390 CPU Measurement sampling
    facility device driver to export supported minimum and
    maximum sample buffer sizes.
    This file is read by lscpumf tool to display the details
    of the device driver capabilities. The lscpumf tool might
    be invoked by a non-root user. In this case it does not
    print anything because the file contents can not be read.

    Fix this by allowing read access for all users. Reading
    the file contents is ok, changing the file contents is
    left to the root user only.
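
As an illustration of the permission change, the following user-space sketch compares two file modes. The concrete modes 0640 and 0644 are assumptions made for the example (the commit message only says that read access is extended to all users while writes stay with root), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* True if users outside owner and group may read a file with this mode. */
static inline bool others_can_read(mode_t mode)
{
    return (mode & S_IROTH) != 0;
}

/* True if users outside owner and group may write a file with this mode. */
static inline bool others_can_write(mode_t mode)
{
    return (mode & S_IWOTH) != 0;
}

static inline void check_sfb_size_modes(void)
{
    const mode_t before = 0640; /* assumed old mode: unreadable for others */
    const mode_t after = 0644;  /* assumed new mode: readable for everyone */

    assert(!others_can_read(before));
    assert(others_can_read(after));
    assert(!others_can_write(after)); /* writing stays restricted */
}
```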

    For further reference and details see:
    [1] https://github.com/ibm-s390-tools/s390-tools/issues/97

    Fixes: 69f239ed335a ("s390/cpum_sf: Dynamically extend the sampling buffer if overflows occur")
    Cc: <stable@vger.kernel.org> # 3.14
    Signed-off-by: Thomas Richter
    Acked-by: Sumanth Korikkar
    Signed-off-by: Heiko Carstens

    Thomas Richter
     
  • Signed-off-by: Heiko Carstens

    Heiko Carstens
     

11 Nov, 2020

2 commits

  • The diag318 data must be set to 0 by VM-wide reset events
    triggered by diag308. As such, KVM should not handle
    resetting this data via the VCPU ioctls.

    Fixes: 23a60f834406 ("s390/kvm: diagnose 0x318 sync and reset")
    Signed-off-by: Collin Walling
    Reviewed-by: Christian Borntraeger
    Reviewed-by: Janosch Frank
    Acked-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger
    Link: https://lore.kernel.org/r/20201104181032.109800-1-walling@linux.ibm.com

    Collin Walling
     
  • We can only have protected guest pages after a successful set secure
    parameters call as only then the UV allows imports and unpacks.

    By moving the test we can now also check for it in s390_reset_acc()
    and do an early return if it is 0.

    Signed-off-by: Janosch Frank
    Fixes: 29b40f105ec8 ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
    Reviewed-by: Cornelia Huck
    Signed-off-by: Christian Borntraeger

    Janosch Frank
     

10 Nov, 2020

2 commits

    struct perf_sample_data lives on-stack, so we should be careful about its
    size. Furthermore, the pt_regs copy in there is only because x86_64 is a
    trainwreck; solve it differently.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Steven Rostedt
    Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org

    Peter Zijlstra
     
  • __perf_output_begin() has an on-stack struct perf_sample_data in the
    unlikely case it needs to generate a LOST record. However, every call
    to perf_output_begin() must already have a perf_sample_data on-stack.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20201030151954.985416146@infradead.org

    Peter Zijlstra
     

09 Nov, 2020

1 commit

  • The system call exit path is running with interrupts enabled while
    checking for TIF/PIF/CIF bits which require special handling. If all
    bits have been checked interrupts are disabled and the kernel exits to
    user space.
    The problem is that after checking all bits and before interrupts are
    disabled, bits can already be set again due to interrupt handling.

    This means that the kernel can exit to user space with some
    TIF/PIF/CIF bits set, which should never happen. E.g. TIF_NEED_RESCHED
    might be set, which might lead to additional latencies, since that bit
    will only be recognized with the next exit to user space.

    Fix this by checking the corresponding bits only when interrupts are
    disabled.

    Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S")
    Cc: <stable@vger.kernel.org> # 5.8
    Acked-by: Sven Schnelle
    Signed-off-by: Heiko Carstens

    Heiko Carstens
     

03 Nov, 2020

6 commits

    Under some circumstances, in particular with "Reconfigure I/O Path",
    a zPCI function may first appear in Standby through a PCI event with
    PEC 0x0302 which initially makes it visible to the zPCI subsystem.
    Only after that is it configured with a zPCI event with PEC 0x0301.
    If the zbus is still missing a PCI function zero (devfn == 0) when the
    PCI event 0x0301 is handled zdev->zbus->bus is still NULL and gets
    dereferenced in common code.
    Check for this case and enable but don't scan the zPCI function.
    This matches what would happen if we immediately got the 0x0301
    configuration request or the function was included in CLP List PCI.
    In all cases the PCI functions with devfn != 0 will be scanned once
    function 0 appears.

    Fixes: 3047766bc6ec ("s390/pci: fix enabling a reserved PCI function")
    Cc: <stable@vger.kernel.org> # 5.8
    Signed-off-by: Niklas Schnelle
    Acked-by: Pierre Morel
    Signed-off-by: Heiko Carstens

    Niklas Schnelle
     
  • The call to rcu_cpu_starting() in smp_init_secondary() is not early
    enough in the CPU-hotplug onlining process, which results in lockdep
    splats as follows:

    WARNING: suspicious RCU usage
    -----------------------------
    kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    RCU used illegally from offline CPU!
    rcu_scheduler_active = 1, debug_locks = 1
    no locks held by swapper/1/0.

    Call Trace:
    show_stack+0x158/0x1f0
    dump_stack+0x1f2/0x238
    __lock_acquire+0x2640/0x4dd0
    lock_acquire+0x3a8/0xd08
    _raw_spin_lock_irqsave+0xc0/0xf0
    clockevents_register_device+0xa8/0x528
    init_cpu_timer+0x33e/0x468
    smp_init_secondary+0x11a/0x328
    smp_start_secondary+0x82/0x88

    This is avoided by moving the call to rcu_cpu_starting up near the
    beginning of the smp_init_secondary() function. Note that the
    raw_smp_processor_id() is required in order to avoid calling into
    lockdep before RCU has declared the CPU to be watched for readers.

    Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
    Signed-off-by: Qian Cai
    Acked-by: Paul E. McKenney
    Signed-off-by: Heiko Carstens

    Qian Cai
     
  • Signed-off-by: Heiko Carstens

    Heiko Carstens
     
  • Signed-off-by: Heiko Carstens

    Heiko Carstens
     
  • Signed-off-by: Heiko Carstens

    Heiko Carstens
     
  • pmd/pud_deref() assume that they will never operate on large pmd/pud
    entries, and therefore only use the non-large _xxx_ENTRY_ORIGIN mask.
    With commit 9ec8fa8dc331b ("s390/vmemmap: extend modify_pagetable()
    to handle vmemmap"), that assumption is no longer true, at least for
    pmd_deref().

    In theory, we could end up with wrong addresses because some of the
    non-address bits of a large entry would not be masked out.
    In practice, this does not (yet) show any impact, because vmemmap_free()
    is currently never used for s390.

    Fix pmd/pud_deref() to check for the entry type and use the
    _xxx_ENTRY_ORIGIN_LARGE mask for large entries.
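
The effect of choosing the wrong mask can be sketched in user space as follows. The mask and flag values are illustrative assumptions modelled on a segment-table entry layout, and the function names are hypothetical; this is not the kernel's pmd_deref() implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed, illustrative values; check the architecture headers for real ones. */
#define SEG_ENTRY_ORIGIN        (~UINT64_C(0x7ff))    /* page table origin */
#define SEG_ENTRY_ORIGIN_LARGE  (~UINT64_C(0xfffff))  /* large page address */
#define SEG_ENTRY_LARGE         UINT64_C(0x0400)      /* entry maps a large page */

/* Old behavior: always apply the non-large origin mask. */
static inline uint64_t pmd_deref_old(uint64_t pmd)
{
    return pmd & SEG_ENTRY_ORIGIN;
}

/* Fixed behavior: pick the mask according to the entry type. */
static inline uint64_t pmd_deref_fixed(uint64_t pmd)
{
    uint64_t mask = SEG_ENTRY_ORIGIN;

    if (pmd & SEG_ENTRY_LARGE)
        mask = SEG_ENTRY_ORIGIN_LARGE;
    return pmd & mask;
}

static inline void check_pmd_deref(void)
{
    /* Large entry with one extra non-address bit (0x2000) set. */
    uint64_t large = UINT64_C(0x80100000) | SEG_ENTRY_LARGE | UINT64_C(0x2000);

    assert(pmd_deref_old(large) == UINT64_C(0x80102000));   /* stray bit leaks */
    assert(pmd_deref_fixed(large) == UINT64_C(0x80100000)); /* clean address */
}
```

With the non-large mask, the stray bit survives and the resulting address is off by 0x2000; the type-aware variant masks it out.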

    While at it, also move pmd/pud_pfn() around, in order to avoid code
    duplication, because they do the same thing.

    Fixes: 9ec8fa8dc331b ("s390/vmemmap: extend modify_pagetable() to handle vmemmap")
    Cc: <stable@vger.kernel.org> # 5.9
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Alexander Gordeev
    Signed-off-by: Heiko Carstens

    Gerald Schaefer
     

26 Oct, 2020

2 commits

  • Currently s390 build is broken.

    SECTCMP .boot.data
    error: section .boot.data differs between vmlinux and arch/s390/boot/compressed/vmlinux
    make[2]: *** [arch/s390/boot/section_cmp.boot.data] Error 1
    SECTCMP .boot.preserved.data
    error: section .boot.preserved.data differs between vmlinux and arch/s390/boot/compressed/vmlinux
    make[2]: *** [arch/s390/boot/section_cmp.boot.preserved.data] Error 1
    make[1]: *** [bzImage] Error 2

    Commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo)
    to __section("foo")") converted all __section(foo) to __section("foo").
    This is wrong for __bootdata / __bootdata_preserved macros which want
    variable names to be a part of intermediate section names
    .boot.data.<var name> and .boot.preserved.data.<var name>. Those sections are later
    sorted by alignment + name and merged together into final .boot.data
    / .boot.preserved.data sections. Those sections must be identical in
    the decompressor and the decompressed kernel (that is checked during
    the build).
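
The difference between a quoted section name and one that embeds the variable name can be sketched with plain preprocessor stringification. The macro names below are hypothetical and only build strings; the sketch assumes that the fix concatenates a fixed prefix with the stringified variable name, which matches the behavior the message describes.

```c
#include <assert.h>
#include <string.h>

/*
 * After a mechanical conversion to a quoted name, "var" sits inside a
 * string literal and is no longer substituted by the preprocessor.
 */
#define BOOTDATA_SECTION_QUOTED(var)  ".boot.data.var"

/* Stringify the argument and let adjacent literals concatenate. */
#define BOOTDATA_SECTION_PER_VAR(var) ".boot.data." #var

static inline void check_bootdata_sections(void)
{
    /* Every variable ends up with the same literal section name. */
    assert(strcmp(BOOTDATA_SECTION_QUOTED(early_value), ".boot.data.var") == 0);
    assert(strcmp(BOOTDATA_SECTION_QUOTED(other_value), ".boot.data.var") == 0);

    /* Each variable gets its own intermediate section name. */
    assert(strcmp(BOOTDATA_SECTION_PER_VAR(early_value),
                  ".boot.data.early_value") == 0);
    assert(strcmp(BOOTDATA_SECTION_PER_VAR(other_value),
                  ".boot.data.other_value") == 0);
}
```

Per-variable names matter here because the intermediate sections are sorted by alignment and name before being merged, so both builds must produce the same set of names.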

    Fixes: 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")")
    Signed-off-by: Vasily Gorbik
    Signed-off-by: Heiko Carstens

    Vasily Gorbik
     
  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
     

24 Oct, 2020

2 commits

  • Pull virtio updates from Michael Tsirkin:
    "vhost, vdpa, and virtio cleanups and fixes

    A very quiet cycle, no new features"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    MAINTAINERS: add URL for virtio-mem
    vhost_vdpa: remove unnecessary spin_lock in vhost_vring_call
    vringh: fix __vringh_iov() when riov and wiov are different
    vdpa/mlx5: Setup driver only if VIRTIO_CONFIG_S_DRIVER_OK
    s390: virtio: PV needs VIRTIO I/O device protection
    virtio: let arch advertise guest's memory access restrictions
    vhost_vdpa: Fix duplicate included kernel.h
    vhost: reduce stack usage in log_used
    virtio-mem: Constify mem_id_table
    virtio_input: Constify id_table
    virtio-balloon: Constify id_table
    vdpa/mlx5: Fix failure to bring link up
    vdpa/mlx5: Make use of a specific 16 bit endianness API

    Linus Torvalds
     
  • Pull arch task_work cleanups from Jens Axboe:
    "Two cleanups that don't fit other categories:

    - Finally get the task_work_add() cleanup done properly, so we don't
    have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
    all callers, and also fixes up the documentation for
    task_work_add().

    - While working on some TIF related changes for 5.11, this
    TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
    duplication for how that is handled"

    * tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
    task_work: cleanup notification modes
    tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()

    Linus Torvalds
     

23 Oct, 2020

3 commits

  • Pull Kbuild updates from Masahiro Yamada:

    - Support 'make compile_commands.json' to generate the compilation
    database more easily, avoiding stale entries

    - Support 'make clang-analyzer' and 'make clang-tidy' for static checks
    using clang-tidy

    - Preprocess scripts/modules.lds.S to allow CONFIG options in the
    module linker script

    - Drop cc-option tests from compiler flags supported by our minimal
    GCC/Clang versions

    - Always use a 12-digit commit hash for CONFIG_LOCALVERSION_AUTO=y

    - Use sha1 build id for both BFD linker and LLD

    - Improve deb-pkg for reproducible builds and rootless builds

    - Remove stale, useless scripts/namespace.pl

    - Turn -Wreturn-type warning into error

    - Fix build error of deb-pkg when CONFIG_MODULES=n

    - Replace 'hostname' command with more portable 'uname -n'

    - Various Makefile cleanups

    * tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
    kbuild: Use uname for LINUX_COMPILE_HOST detection
    kbuild: Only add -fno-var-tracking-assignments for old GCC versions
    kbuild: remove leftover comment for filechk utility
    treewide: remove DISABLE_LTO
    kbuild: deb-pkg: clean up package name variables
    kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n
    kbuild: enforce -Werror=return-type
    scripts: remove namespace.pl
    builddeb: Add support for all required debian/rules targets
    builddeb: Enable rootless builds
    builddeb: Pass -n to gzip for reproducible packages
    kbuild: split the build log of kallsyms
    kbuild: explicitly specify the build id style
    scripts/setlocalversion: make git describe output more reliable
    kbuild: remove cc-option test of -Werror=date-time
    kbuild: remove cc-option test of -fno-stack-check
    kbuild: remove cc-option test of -fno-strict-overflow
    kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles
    kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan
    kbuild: do not create built-in objects for external module builds
    ...

    Linus Torvalds
     
  • Pull VFIO updates from Alex Williamson:

    - New fsl-mc vfio bus driver supporting userspace drivers of objects
    within NXP's DPAA2 architecture (Diana Craciun)

    - Support for exposing zPCI information on s390 (Matthew Rosato)

    - Fixes for "detached" VFs on s390 (Matthew Rosato)

    - Fixes for pin-pages and dma-rw accesses (Yan Zhao)

    - Cleanups and optimize vconfig regen (Zenghui Yu)

    - Fix duplicate irq-bypass token registration (Alex Williamson)

    * tag 'vfio-v5.10-rc1' of git://github.com/awilliam/linux-vfio: (30 commits)
    vfio iommu type1: Fix memory leak in vfio_iommu_type1_pin_pages
    vfio/pci: Clear token on bypass registration failure
    vfio/fsl-mc: fix the return of the uninitialized variable ret
    vfio/fsl-mc: Fix the dead code in vfio_fsl_mc_set_irq_trigger
    vfio/fsl-mc: Fixed vfio-fsl-mc driver compilation on 32 bit
    MAINTAINERS: Add entry for s390 vfio-pci
    vfio-pci/zdev: Add zPCI capabilities to VFIO_DEVICE_GET_INFO
    vfio/fsl-mc: Add support for device reset
    vfio/fsl-mc: Add read/write support for fsl-mc devices
    vfio/fsl-mc: trigger an interrupt via eventfd
    vfio/fsl-mc: Add irq infrastructure for fsl-mc devices
    vfio/fsl-mc: Added lock support in preparation for interrupt handling
    vfio/fsl-mc: Allow userspace to MMAP fsl-mc device MMIO regions
    vfio/fsl-mc: Implement VFIO_DEVICE_GET_REGION_INFO ioctl call
    vfio/fsl-mc: Implement VFIO_DEVICE_GET_INFO ioctl
    vfio/fsl-mc: Scan DPRC objects on vfio-fsl-mc driver bind
    vfio: Introduce capability definitions for VFIO_DEVICE_GET_INFO
    s390/pci: track whether util_str is valid in the zpci_dev
    s390/pci: stash version in the zpci_dev
    vfio/fsl-mc: Add VFIO framework skeleton for fsl-mc devices
    ...

    Linus Torvalds
     
  • Pull initial set_fs() removal from Al Viro:
    "Christoph's set_fs base series + fixups"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Allow a NULL pos pointer to __kernel_read
    fs: Allow a NULL pos pointer to __kernel_write
    powerpc: remove address space overrides using set_fs()
    powerpc: use non-set_fs based maccess routines
    x86: remove address space overrides using set_fs()
    x86: make TASK_SIZE_MAX usable from assembly code
    x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
    lkdtm: remove set_fs-based tests
    test_bitmap: remove user bitmap tests
    uaccess: add infrastructure for kernel builds with set_fs()
    fs: don't allow splice read/write without explicit ops
    fs: don't allow kernel reads and writes without iter ops
    sysctl: Convert to iter interfaces
    proc: add a read_iter method to proc proc_ops
    proc: cleanup the compat vs no compat file ops
    proc: remove a level of indentation in proc_get_inode

    Linus Torvalds
     

21 Oct, 2020

1 commit

    If protected virtualization is active on s390, VIRTIO has only restricted
    access to the guest memory.
    Define CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS and export
    arch_has_restricted_virtio_memory_access to advertise this to VIRTIO when
    that is the case.

    Signed-off-by: Pierre Morel
    Reviewed-by: Cornelia Huck
    Reviewed-by: Halil Pasic
    Link: https://lore.kernel.org/r/1599728030-17085-3-git-send-email-pmorel@linux.ibm.com
    Signed-off-by: Michael S. Tsirkin
    Acked-by: Christian Borntraeger

    Pierre Morel
     

19 Oct, 2020

1 commit

    There is a use case in which System Management Software (SMS) wants to
    give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
    case of Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall,
    process_madvise(2). It uses the pidfd of an external process to give the
    hint. It also supports a vector of address ranges because an Android app
    has thousands of vmas due to zygote, so it is a waste of CPU and power to
    call the syscall once for each vma. (In a test comparing 2000-vma
    syscalls with a 1-vector syscall, it showed a 15% performance
    improvement. I think it would be bigger in real practice because the
    test ran in a very cache-friendly environment.)

    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    future, we could find more use cases for other advice values, so let's
    make it part of the API since we are introducing a new syscall at this
    moment. With that, existing madvise(2) users could replace it with
    process_madvise(2) with their own pid if they want batched address
    range support.

    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS) or one that otherwise has the right
    to ptrace the target (e.g., by having the same UID) can use it
    successfully. The flag argument is reserved for future use if we need to
    extend the API.

    I think supporting every hint that madvise has or will have in
    process_madvise is rather risky, because we are not sure all hints make
    sense from an external process, and the implementation of a hint may rely
    on the caller being in the current context, so it could be error-prone.
    Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in this patch.

    If someone wants to add other hints, we can hear the use case and review
    it for each hint. That is safer for maintenance than introducing a
    buggy syscall that is hard to fix later.

    So finally, the API is as follows:

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about address ranges of an external process as well as
    of the local process. It applies the advice to the address ranges of
    the process described by iovec and vlen. The goal of such advice is to
    improve system or application performance.

    The pidfd argument selects the process referred to by the PID file
    descriptor specified in pidfd. (See pidfd_open(2) for further
    information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
        void  *iov_base; /* starting address */
        size_t iov_len;  /* number of bytes to be advised */
    };

    The iovec describes address ranges beginning at the address iov_base
    and spanning iov_len bytes.

    The vlen argument represents the number of elements in iovec.

    The advice is indicated in the advice argument, which is one of the
    following at this moment if the target process specified by pidfd is
    external.

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise() supports every advice value that madvise(2) has if
    the target process is in the same thread group as the calling process,
    so users could use process_madvise(2) to extend existing madvise(2)
    usage to support vector address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes if an error occurred. The caller should check the return value
    to determine whether a partial advice occurred.
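
The following is a hedged user-space sketch of how a caller might invoke the syscall described above and detect a partial result. The helper names are hypothetical. The fallback numbers (syscall 440, MADV_COLD 20) are assumptions that hold for the generic Linux ABI on most architectures and are only used when the installed headers lack the definitions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed fallbacks for older headers. */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Thin wrapper around the raw syscall; returns bytes advised or -1. */
static inline ssize_t process_madvise_raw(int pidfd, const struct iovec *iov,
                                          unsigned long vlen, int advice,
                                          unsigned int flags)
{
    return syscall(SYS_process_madvise, pidfd, iov, vlen, advice, flags);
}

/* Sum of all range lengths, i.e. the number of bytes requested. */
static inline size_t iov_total_len(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    assert(iov != NULL || vlen == 0);
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/* Classify a return value: -1 error, 0 partial advice, 1 fully advised. */
static inline int advise_status(ssize_t ret, const struct iovec *iov,
                                size_t vlen)
{
    if (ret < 0)
        return -1;
    return (size_t)ret == iov_total_len(iov, vlen) ? 1 : 0;
}
```

A caller would typically obtain `pidfd` from pidfd_open(2), fill an iovec array with the target's address ranges, call `process_madvise_raw(pidfd, iov, vlen, MADV_COLD, 0)`, and pass the result to `advise_status()` to decide whether to retry the remaining ranges.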

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How is the race (i.e., object validation) handled between an
    external process giving a hint and the target process receiving that
    hint?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the target
    process can run between the time the calling process inspects the
    target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. The
    same applies to other APIs such as move_pages and process_vm_writev.
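
    The SIGSTOP option mentioned above could look roughly like the sketch
    below. It assumes the target is a direct child of the caller, because
    only then can waitpid(2) confirm that the stop has taken effect; for
    an unrelated process the caller would need another way to confirm it,
    or one of the other options (ptrace, freezer cgroup). The helper names
    stop_child_and_wait() and resume_process() are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Send SIGSTOP to a direct child and block until the kernel reports it
 * as stopped. Returns 0 once the child is stopped, -1 if the signal
 * could not be sent or the child terminated instead of stopping.
 */
static inline int stop_child_and_wait(pid_t child)
{
    int status;

    if (kill(child, SIGSTOP) < 0)
        return -1;
    if (waitpid(child, &status, WUNTRACED) < 0)
        return -1;
    /* Anything other than a stop means the child has been reaped. */
    return WIFSTOPPED(status) ? 0 : -1;
}

/* Let a previously stopped process run again. */
static inline int resume_process(pid_t pid)
{
    return kill(pid, SIGCONT) < 0 ? -1 : 0;
}
```

    The process_madvise call would be issued between the two helpers, so
    that the target cannot change its own address space in the meantime.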

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it because other threads could unmap the memory right
    before. That's where we need synchronization through another API or
    through design on the user side. It shouldn't be part of the API
    itself. If someone needs synchronization that is more fine-grained
    than process level, two ideas were suggested: cookie[2] and
    anon-fd[3]. Both could be implemented using the last, reserved
    argument of the API, but I don't think that is necessary right now:
    we already have ways to prevent the race, so I don't want to add
    complexity with a more fine-grained synchronization model.

    To keep the API extensible, an unsigned long is reserved as the last
    argument so that we can support this in the future if someone really
    needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and of hinting the
    wrong VMA. Furthermore, we want to act on the hint in the caller's
    context, not the callee's, because the callee is usually limited by
    cpuset/cgroups or is even in a frozen state, so it can't act by itself
    quickly enough, which causes more thrashing/kills. It also doesn't work
    if the target process is already ptraced (e.g., strace, debugger,
    minidump) because a process can have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie, which is updated whenever
    the VMA layout of the process address space changes - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim