12 Jan, 2019

1 commit

  • The function at issue does not fully validate the content of the
    structure pointed to by the log parameter, even though that content has
    just been copied from userspace and therefore needs validation. Fix that.

    Moreover, change the type of n to unsigned long as that is the type
    returned by kvm_dirty_bitmap_bytes().
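
    As a hedged illustration of the kind of checking involved, here is a
    minimal, self-contained C sketch; the structure, field and helper names
    are stand-ins, not the kernel's exact ones:

    #include <errno.h>
    #include <stdint.h>

    /* Illustrative stand-in for the request copied in from userspace. */
    struct clear_dirty_log_req {
            uint32_t slot;
            uint64_t first_page;
            uint64_t num_pages;
    };

    /* one dirty bit per page, rounded up to whole bytes */
    static unsigned long dirty_bitmap_bytes(unsigned long npages)
    {
            return (npages + 7) / 8;
    }

    static int validate_clear_req(const struct clear_dirty_log_req *log,
                                  unsigned long slot_npages,
                                  unsigned long *bitmap_bytes)
    {
            /* the requested range must lie inside the memslot, no wraparound */
            if (log->first_page > slot_npages ||
                log->num_pages > slot_npages - log->first_page)
                    return -EINVAL;

            /* unsigned long, matching what the bitmap-size helper returns */
            *bitmap_bytes = dirty_bitmap_bytes(slot_npages);
            return 0;
    }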

    Signed-off-by: Tomas Bortoli
    Reported-by: syzbot+028366e52c9ace67deb3@syzkaller.appspotmail.com
    [Squashed the fix from Paolo. - Radim.]
    Signed-off-by: Radim Krčmář

    Tomas Bortoli
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is a concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle and architecture-related in the future that could make the scheme
    not work. We also find that there is no point in passing the 'address' to
    pte_alloc since it is unused. This patch therefore removes the argument
    tree-wide, resulting in a nice negative diff, while also ensuring along
    the way that the enabled architectures do not do anything funky with the
    'address' argument that would go unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    The following fix-ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )
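
    For illustration, the net effect of the script on a typical prototype is
    simply to drop the trailing address parameter (simplified; exact
    signatures vary per architecture):

    /* before */
    pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);

    /* after */
    pte_t *pte_alloc_one_kernel(struct mm_struct *mm);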

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
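
    A typical conversion therefore looks like this (illustrative):

    /* before */
    if (!access_ok(VERIFY_WRITE, buf, count))
            return -EFAULT;

    /* after: the VERIFY_xyz argument is simply dropped */
    if (!access_ok(buf, count))
            return -EFAULT;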

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Dec, 2018

1 commit

  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, i.e. why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users of
    mmu notifiers that wish to maintain their own data structures without
    having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback is
    happening, and thus the device driver has to assume that it is always an
    munmap(). This is inefficient, as it means the driver needs to re-allocate
    the device page table on the next page fault and rebuild the whole device
    driver data structure for the range.

    Other use cases besides munmap() also exist; for instance, it is pointless
    for the device driver to invalidate the device page table when the
    invalidation is only for soft-dirty tracking. Or the device driver can
    optimize away an mprotect() that changes the page table access permissions
    for the range.

    This patchset enables all these optimizations for device drivers. I do not
    include any of them in this series, but another patchset I am posting will
    leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want to
    add a parameter, use a structure to group all the parameters for the
    mmu_notifier invalidate_range_start/end callbacks. No functional changes
    with this patch.
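
    As a rough sketch of the idea (field names are illustrative; the real
    structure carries more context than shown here):

    struct mm_struct;

    /* all invalidate_range_start/end arguments travel in one place */
    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
            /* later patches add the contextual event information here */
    };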

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

27 Dec, 2018

2 commits

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions with
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     
  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - selftests improvements
    - large PUD support for HugeTLB
    - single-stepping fixes
    - improved tracing
    - various timer and vGIC fixes

    x86:
    - Processor Tracing virtualization
    - STIBP support
    - some correctness fixes
    - refactorings and splitting of vmx.c
    - use the Hyper-V range TLB flush hypercall
    - reduce order of vcpu struct
    - WBNOINVD support
    - do not use -ftrace for __noclone functions
    - nested guest support for PAUSE filtering on AMD
    - more Hyper-V enlightenments (direct mode for synthetic timers)

    PPC:
    - nested VFIO

    s390:
    - bugfixes only this time"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
    KVM: x86: Add CPUID support for new instruction WBNOINVD
    kvm: selftests: ucall: fix exit mmio address guessing
    Revert "compiler-gcc: disable -ftracer for __noclone functions"
    KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
    KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
    KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
    MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
    KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
    KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
    KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
    KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
    KVM: Make kvm_set_spte_hva() return int
    KVM: Replace old tlb flush function with new one to flush a specified range.
    KVM/MMU: Add tlb flush with range helper function
    KVM/VMX: Add hv tlb range flush support
    x86/hyper-v: Add HvFlushGuestAddressList hypercall support
    KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
    KVM: x86: Disable Intel PT when VMXON in L1 guest
    KVM: x86: Set intercept for Intel PT MSRs read/write
    KVM: x86: Implement Intel PT MSRs read/write emulation
    ...

    Linus Torvalds
     

26 Dec, 2018

1 commit

  • Pull arm64 festive updates from Will Deacon:
    "In the end, we ended up with quite a lot more than I expected:

    - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
    kernel-side support to come later)

    - Support for per-thread stack canaries, pending an update to GCC
    that is currently undergoing review

    - Support for kexec_file_load(), which permits secure boot of a kexec
    payload but also happens to improve the performance of kexec
    dramatically because we can avoid the sucky purgatory code from
    userspace. Kdump will come later (requires updates to libfdt).

    - Optimisation of our dynamic CPU feature framework, so that all
    detected features are enabled via a single stop_machine()
    invocation

    - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
    they can benefit from global TLB entries when KASLR is not in use

    - 52-bit virtual addressing for userspace (kernel remains 48-bit)

    - Patch in LSE atomics for per-cpu atomic operations

    - Custom preempt.h implementation to avoid unconditional calls to
    preempt_schedule() from preempt_enable()

    - Support for the new 'SB' Speculation Barrier instruction

    - Vectorised implementation of XOR checksumming and CRC32
    optimisations

    - Workaround for Cortex-A76 erratum #1165522

    - Improved compatibility with Clang/LLD

    - Support for TX2 system PMUS for profiling the L3 cache and DMC

    - Reflect read-only permissions in the linear map by default

    - Ensure MMIO reads are ordered with subsequent calls to Xdelay()

    - Initial support for memory hotplug

    - Tweak the threshold when we invalidate the TLB by-ASID, so that
    mremap() performance is improved for ranges spanning multiple PMDs.

    - Minor refactoring and cleanups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits)
    arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset()
    arm64: sysreg: Use _BITUL() when defining register bits
    arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches
    arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4
    arm64: docs: document pointer authentication
    arm64: ptr auth: Move per-thread keys from thread_info to thread_struct
    arm64: enable pointer authentication
    arm64: add prctl control for resetting ptrauth keys
    arm64: perf: strip PAC when unwinding userspace
    arm64: expose user PAC bit positions via ptrace
    arm64: add basic pointer authentication support
    arm64/cpufeature: detect pointer authentication
    arm64: Don't trap host pointer auth use to EL2
    arm64/kvm: hide ptrauth from guests
    arm64/kvm: consistently handle host HCR_EL2 flags
    arm64: add pointer authentication register bits
    arm64: add comments about EC exception levels
    arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned
    arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
    arm64: enable per-task stack canaries
    ...

    Linus Torvalds
     

21 Dec, 2018

5 commits

  • This patch moves the TLB flush from kvm_set_pte_rmapp() to
    kvm_mmu_notifier_change_pte() in order to avoid a redundant TLB flush.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • This patch makes kvm_set_spte_hva() return int so that the caller can
    check the return value to determine whether a TLB flush is needed.
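
    In effect (simplified; the per-architecture signatures differ slightly):

    /* before */
    void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

    /* after: the return value lets the caller decide whether to flush the TLB */
    int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);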

    Signed-off-by: Lan Tianyu
    Acked-by: Paul Mackerras
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • Signed-off-by: Wei Yang
    [Preserved the iff and a probably intentional weird bracket notation.
    Also dropped the style change to make a single-purpose patch. - Radim]
    Signed-off-by: Radim Krčmář

    Wei Yang
     
  • Since the offset is added directly to the hva from the
    gfn_to_hva_cache, a negative offset could result in an out of bounds
    write. The existing BUG_ON only checks for addresses beyond the end of
    the gfn_to_hva_cache, not for addresses before the start of the
    gfn_to_hva_cache.

    Note that all current call sites have non-negative offsets.
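
    Illustratively, the bound check has to cover both ends of the cached
    range, not just the upper one (a self-contained sketch with simplified
    names, not the kernel's exact code):

    #include <errno.h>

    /* Reject writes before the start as well as past the end of the cache. */
    static int check_cached_range(long offset, unsigned long len,
                                  unsigned long cache_len)
    {
            if (offset < 0)
                    return -EINVAL;                     /* before the start */
            if ((unsigned long)offset > cache_len ||
                len > cache_len - (unsigned long)offset)
                    return -EINVAL;                     /* past the end */
            return 0;
    }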

    Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Peter Shier
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Sean Christopherson
    Signed-off-by: Radim Krčmář

    Jim Mattson
     
  • Previously, in the case where (gpa + len) wrapped around, the entire
    region was not validated, contrary to what the comment claimed. It doesn't
    actually seem that wraparound should be allowed here at all.

    Furthermore, since some callers don't check the return code from this
    function, it seems prudent to clear ghc->memslot in the event of an
    error.
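
    A minimal sketch of the wraparound check involved (simplified names, not
    the kernel's exact code):

    #include <errno.h>
    #include <stdint.h>

    /* Refuse a (gpa, len) range that wraps around the address space. */
    static int check_no_wraparound(uint64_t gpa, uint64_t len)
    {
            if (len && gpa + len - 1 < gpa)
                    return -EINVAL;
            return 0;
    }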

    Fixes: 8f964525a121f ("KVM: Allow cross page reads and writes from cached translations.")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Marc Orr
    Cc: Andrew Honig
    Signed-off-by: Radim Krčmář

    Jim Mattson
     

20 Dec, 2018

9 commits

  • …marm/kvmarm into HEAD

    KVM/arm updates for 4.21

    - Large PUD support for HugeTLB
    - Single-stepping fixes
    - Improved tracing
    - Various timer and vgic fixups

    Paolo Bonzini
     
  • 32 and 64bit use different symbols to identify the traps.
    32bit has a fine grained approach (prefetch abort, data abort and HVC),
    while 64bit is pretty happy with just "trap".

    This has been fine so far, except that we now need to decode some
    of that in tracepoints that are common to both architectures.

    Introduce ARM_EXCEPTION_IS_TRAP which abstracts the trap symbols
    and make the tracepoint use it.
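
    A plausible shape for such an abstraction (sketch, one definition per
    architecture, built on the existing trap constants):

    /* arm64: everything of interest is already reported as one "trap" */
    #define ARM_EXCEPTION_IS_TRAP(x)  ((x) == ARM_EXCEPTION_TRAP)

    /* arm (32-bit): the fine-grained symbols all count as traps */
    #define ARM_EXCEPTION_IS_TRAP(x)                  \
            ((x) == ARM_EXCEPTION_PREF_ABORT ||       \
             (x) == ARM_EXCEPTION_DATA_ABORT ||       \
             (x) == ARM_EXCEPTION_HVC)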

    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • There are two things we need to take care of when we create block
    mappings in the stage 2 page tables:

    (1) The alignment within a PMD between the host address range and the
    guest IPA range must be the same, since otherwise we end up mapping
    pages with the wrong offset.

    (2) The head and tail of a memory slot may not cover a full block
    size, and we have to take care to not map those with block
    descriptors, since we could expose memory to the guest that the host
    did not intend to expose.

    So far, we have been taking care of (1), but not (2), and our commentary
    describing (1) was somewhat confusing.

    This commit attempts to factor out the checks of both into a common
    function, and if we don't pass the check, we won't attempt PMD
    mappings for either hugetlbfs or THP.

    Note that we used to only check the alignment for THP, not for
    hugetlbfs, but as far as I can tell the check needs to be applied to
    both scenarios.
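
    A simplified sketch of the combined check (names and types are
    illustrative, not the kernel's exact helper):

    #include <stdbool.h>

    /*
     * A block mapping is only safe if the host VA and the guest IPA share
     * the same offset within the block, and the candidate block lies
     * entirely within the memory slot.  block_size is a power of two.
     */
    static bool supports_block_mapping(unsigned long hva, unsigned long ipa,
                                       unsigned long slot_start_hva,
                                       unsigned long slot_end_hva,
                                       unsigned long block_size)
    {
            unsigned long block_start = hva & ~(block_size - 1);

            /* (1) same offset within the block on both sides */
            if ((hva & (block_size - 1)) != (ipa & (block_size - 1)))
                    return false;

            /* (2) the head and tail of the slot must not be block-mapped */
            if (block_start < slot_start_hva ||
                block_start + block_size > slot_end_hva)
                    return false;

            return true;
    }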

    Cc: Ralph Palutke
    Cc: Lukas Braun
    Reported-by: Lukas Braun
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently only halt the guest when a vCPU messes with the active
    state of an SPI. This is perfectly fine for GICv2, but isn't enough
    for GICv3, where all vCPUs can access the state of any other vCPU.

    Let's broaden the condition to include any GICv3 interrupt that
    has an active state (i.e. all but LPIs).

    Cc: stable@vger.kernel.org
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • kvm_timer_vcpu_terminate can only be called in two scenarios:

    1. As part of cleanup during a failed VCPU create
    2. As part of freeing the whole VM (struct kvm refcount == 0)

    In the first case, we cannot have programmed any timers or mapped any
    IRQs, and therefore we do not have to cancel anything or unmap anything.

    In the second case, the VCPU will have gone through kvm_timer_vcpu_put,
    which will have canceled the emulated physical timer's hrtimer, and we
    do not need to do that here as well. We also do not care if the irq is
    recorded as mapped or not in the VGIC data structure, because the whole
    VM is going away. That leaves us only with having to ensure that we
    cancel the bg_timer if we were blocking the last time we called
    kvm_timer_vcpu_put().

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The use of a work queue in the hrtimer expire function for the bg_timer
    is a leftover from the time when we would inject interrupts when the
    bg_timer expired.

    Since we are no longer doing that, we can instead call
    kvm_vcpu_wake_up() directly from the hrtimer function and remove all
    workqueue functionality from the arch timer code.
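
    A simplified sketch of the resulting expiry path (the real handler also
    deals with re-arming; names follow the existing timer code):

    static enum hrtimer_restart kvm_bg_timer_expire(struct hrtimer *hrt)
    {
            struct arch_timer_cpu *timer;
            struct kvm_vcpu *vcpu;

            timer = container_of(hrt, struct arch_timer_cpu, bg_timer);
            vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);

            /* wake the vcpu directly, no work queue needed */
            kvm_vcpu_wake_up(vcpu);
            return HRTIMER_NORESTART;
    }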

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The kvm_exit tracepoint strangely always reported exits as being IRQs.
    This seems to be because either the __print_symbolic or the tracepoint
    macros use a variable named idx.

    Take this chance to update the fields in the tracepoint to reflect the
    concepts in the arm64 architecture that we pass to the tracepoint and
    move the exception type table to the same location and header files as
    the exits code.

    We also clear out the exception code to 0 for IRQ exits (which
    translates to UNKNOWN in text) to make it slightly less confusing to
    parse the trace output.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When checking if there are any pending IRQs for the VM, consider the
    active state and priority of the IRQs as well.

    Otherwise we could be continuously scheduling a guest hypervisor without
    it seeing an IRQ.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When using the nospec API, it should be taken into account that:

    "...if the CPU speculates past the bounds check then
    * array_index_nospec() will clamp the index within the range of [0,
    * size)."

    The above is part of the header for macro array_index_nospec() in
    linux/nospec.h

    Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
    or to exactly VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
    returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
    of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:

    /* SGIs and PPIs */
    if (intid <= VGIC_MAX_PRIVATE)
        return &vcpu->arch.vgic_cpu.private_irqs[intid];

    /* SPIs */
    if (intid <= VGIC_MAX_SPI)
        return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

    are valid values for intid.

    Fix this by calling the array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
    and VGIC_MAX_SPI + 1 as arguments for its size parameter.
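
    In code terms (illustrative):

    /* pass the bound + 1 as the size so the bound itself stays reachable */
    intid = array_index_nospec(intid, VGIC_MAX_PRIVATE + 1);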

    Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    [dropped the SPI part which was fixed separately]
    Signed-off-by: Marc Zyngier

    Gustavo A. R. Silva
     

19 Dec, 2018

1 commit

  • If you register a kvm_coalesced_mmio_zone with '.pio = 0' but then
    unregister it with '.pio = 1', KVM_UNREGISTER_COALESCED_MMIO will try to
    unregister it from KVM_PIO_BUS rather than KVM_MMIO_BUS, which is a
    no-op. But it frees the kvm_coalesced_mmio_dev anyway, causing a
    use-after-free.

    Fix it by only unregistering and freeing the zone if the correct value
    of 'pio' is provided.
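
    A simplified sketch of the corrected unregister loop (based on the
    existing coalesced MMIO helpers; not an exact quote of the fix):

    list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
            /* only tear down zones whose 'pio' flag matches the request */
            if (zone->pio != dev->zone.pio)
                    continue;
            if (!coalesced_mmio_in_range(dev, zone->addr, zone->size))
                    continue;

            kvm_io_bus_unregister_dev(kvm,
                    zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
            kvm_iodevice_destructor(&dev->dev);
    }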

    Reported-by: syzbot+f87f60bb6f13f39b54e3@syzkaller.appspotmail.com
    Fixes: 0804c849f1df ("kvm/x86 : add coalesced pio support")
    Signed-off-by: Eric Biggers
    Signed-off-by: Paolo Bonzini

    Eric Biggers
     

18 Dec, 2018

15 commits

  • SPIs should be checked against the VM's specific configuration, and
    not the architectural maximum.

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In attempting to re-construct the logic for our stage 2 page table
    layout I found the reasoning in the comment explaining how we calculate
    the number of levels used for stage 2 page tables a bit backwards.

    This commit attempts to clarify the comment, to make it slightly easier
    to read without having the Arm ARM open on the right page.

    While we're at it, fixup a typo in a comment that was recently changed.

    Reviewed-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • To change the active state of an interrupt via MMIO, halt is requested for all vcpus of
    the affected guest before modifying the IRQ state. This is done by calling
    cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
    disabled at this point and we cannot reschedule a vcpu.

    We actually don't need any of this, as kvm_arm_halt_guest ensures that
    all the other vcpus are out of the guest. Let's just drop that useless
    code.

    Signed-off-by: Julien Thierry
    Suggested-by: Christoffer Dall
    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Julien Thierry
     
  • KVM only supports PMD hugepages at stage 2. Now that the various page
    handling routines are updated, extend the stage 2 fault handling to
    map in PUD hugepages.

    Addition of PUD hugepage support enables additional page sizes (e.g.,
    1G with 4K granule) which can be useful on cores that support mapping
    larger block sizes in the TLB entries.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replace BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, add support
    to the age handling notifiers for PUD hugepages when encountered.

    Provide trivial helpers for arm32 to allow sharing code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, extend the
    access fault handling at Stage 2 to support PUD hugepages when
    encountered.

    Provide trivial helpers for arm32 to allow sharing of code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    detecting execute permissions on PUD page table entries. Faults due to
    lack of execute permissions on page table entries are used to perform
    i-cache invalidation on first execute.

    Provide trivial implementations of arm32 helpers to allow sharing of
    code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    write protecting PUD hugepages when they are encountered. Write
    protecting guest tables is used to track dirty pages when migrating
    VMs.

    Also, provide trivial implementations of required kvm_s2pud_* helpers
    to allow sharing of code with arm32.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON() in arm32 pud helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Introduce helpers to abstract architectural handling of the conversion
    of pfn to page table entries and marking a PMD page table entry as a
    block entry.

    The helpers are introduced in preparation for supporting PUD hugepages
    at stage 2 - which are supported on arm64 but do not exist on arm.
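
    On arm64 these end up as thin wrappers along these lines (sketch):

    static inline pmd_t kvm_pfn_pmd(kvm_pfn_t pfn, pgprot_t prot)
    {
            return pfn_pmd(pfn, prot);
    }

    static inline pmd_t kvm_pmd_mkhuge(pmd_t pmd)
    {
            return pmd_mkhuge(pmd);
    }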

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Acked-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Stage 2 fault handler marks a page as executable if it is handling an
    execution fault or if it was a permission fault in which case the
    executable bit needs to be preserved.

    The logic to decide if the page should be marked executable is
    duplicated for PMD and PTE entries. To avoid creating another copy
    when support for PUD hugepages is introduced refactor the code to
    share the checks needed to mark a page table entry as executable.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • The code for operations such as marking the pfn as dirty, and
    dcache/icache maintenance during stage 2 fault handling is duplicated
    between normal pages and PMD hugepages.

    Instead of creating another copy of the operations when we introduce
    PUD hugepages, let's share them across the different pagesizes.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • When restoring the active state from userspace, we don't know which CPU
    was the source for the active state, and this is not architecturally
    exposed in any of the register state.

    Set the active_source to 0 in this case. In the future, we can expand
    on this and expose it as additional information to userspace for GICv2
    if anyone cares.

    Cc: stable@vger.kernel.org
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We recently addressed a VMID generation race by introducing a read/write
    lock around accesses and updates to the vmid generation values.

    However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
    does so without taking the read lock.

    As far as I can tell, this can lead to the same kind of race:

    VM 0, VCPU 0                            VM 0, VCPU 1
    ------------                            ------------
    update_vttbr (vmid 254)
                                            update_vttbr (vmid 1) // roll over
                                            read_lock(kvm_vmid_lock);
                                            force_vm_exit()
    local_irq_disable
    need_new_vmid_gen == false //because vmid gen matches

    enter_guest (vmid 254)
                                            kvm_arch.vttbr = <pgd>:<vmid 1>
                                            read_unlock(kvm_vmid_lock);

                                            enter_guest (vmid 1)

    This results in two VCPUs of the same VM running with different VMIDs,
    and (even worse) other VCPUs from other VMs could now be allocated the
    clashing VMID 254 from the new generation as long as VCPU 0 has not exited.

    Attempt to solve this by making sure vttbr is updated before another CPU
    can observe the updated VMID generation.

    Cc: stable@vger.kernel.org
    Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race"
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When we emulate a guest instruction, we don't advance the hardware
    singlestep state machine, and thus the guest will receive a software
    step exception after a next instruction which is not emulated by the
    host.

    We bodge around this in an ad-hoc fashion. Sometimes we explicitly check
    whether userspace requested a single step, and fake a debug exception
    from within the kernel. Other times, we advance the HW singlestep state and
    rely on the HW to generate the exception for us. Thus, the observed step
    behaviour differs for host and guest.

    Let's make this simpler and consistent by always advancing the HW
    singlestep state machine when we skip an instruction. Thus we can rely
    on the hardware to generate the singlestep exception for us, and never
    need to explicitly check for an active-pending step, nor do we need to
    fake a debug exception from the guest.
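
    Conceptually (a simplified sketch of the arm64 path; the helper name here
    is illustrative):

    /*
     * Whenever the hypervisor skips an instruction, also advance the
     * hardware single-step state machine by clearing PSTATE.SS, so that
     * a pending step exception fires on the next instruction.
     */
    static void skip_instr_and_step(struct kvm_vcpu *vcpu)
    {
            *vcpu_pc(vcpu) += 4;                  /* skip the emulated insn */
            *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;     /* advance the SS machine */
    }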

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     
  • When we emulate an MMIO instruction, we advance the CPU state within
    decode_hsr(), before emulating the instruction effects.

    Having this logic in decode_hsr() is opaque, and advancing the state
    before emulation is problematic. It gets in the way of applying
    consistent single-step logic, and it prevents us from being able to fail
    an MMIO instruction with a synchronous exception.

    Clean this up by only advancing the CPU state *after* the effects of the
    instruction are emulated.

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     

14 Dec, 2018

2 commits

  • There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
    it can take kvm->mmu_lock for an extended period of time. Second, its user
    can actually see many false positives in some cases. The latter is due
    to a benign race like this:

    1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
    them.
    2. The guest modifies the pages, causing them to be marked dirty.
    3. Userspace actually copies the pages.
    4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
    they were not written to since (3).

    This is especially a problem for large guests, where the time between
    (1) and (3) can be substantial. This patch introduces a new
    capability which, when enabled, makes KVM_GET_DIRTY_LOG not
    write-protect the pages it returns. Instead, userspace has to
    explicitly clear the dirty log bits just before using the content
    of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
    64-page granularity rather than requiring a full memslot to be synced;
    this way, the mmu_lock is taken for small amounts of time, and
    only a small amount of time will pass between write protection
    of pages and the sending of their content.
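
    From userspace, usage is expected to look roughly like this (a sketch;
    see the KVM API documentation for the exact structure layout and
    alignment rules):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /*
     * With the capability enabled, KVM_GET_DIRTY_LOG no longer
     * write-protects pages; userspace copies the pages out and then clears
     * the handled dirty bits explicitly, in 64-page granules.
     */
    static int clear_dirty_range(int vm_fd, __u32 slot, __u64 first_page,
                                 __u32 num_pages, void *bitmap)
    {
            struct kvm_clear_dirty_log clr = {
                    .slot         = slot,
                    .first_page   = first_page,   /* 64-page aligned */
                    .num_pages    = num_pages,    /* multiple of 64 */
                    .dirty_bitmap = bitmap,       /* bits to clear */
            };

            return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);
    }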

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Once manual dirty log reprotection is enabled, kvm_get_dirty_log_protect's
    pointer argument will always be false on exit, because no TLB flush is needed
    until the manual re-protection operation. Rename it from "is_dirty" to "flush",
    which more accurately tells the caller what they have to do with it.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini