05 Sep, 2018

2 commits

  • commit 976d34e2dab10ece5ea8fe7090b7692913f89084 upstream.

    When there is contention on faulting in a particular page table entry
    at stage 2, the break-before-make requirement of the architecture can
    lead to additional refaulting due to TLB invalidation.

    Avoid this by skipping a page table update if the new value of the PTE
    matches the previous value.
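
    The early return can be sketched in userspace C as below. This is only
    an illustrative model, not the kernel change itself: pte_t,
    flush_tlb_entry() and set_stage2_pte() are hypothetical stand-ins, and
    the TLB invalidation is simulated with a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_t;          /* hypothetical stand-in for a stage 2 PTE */

static unsigned int tlb_flushes; /* counts simulated TLB invalidations */

/* Simulated TLB invalidation; real code would issue TLBI instructions. */
static void flush_tlb_entry(void)
{
    tlb_flushes++;
}

/*
 * Install new_pte at *ptep and report whether the table was modified.
 * If the entry already holds new_pte, return early so that no
 * break-before-make sequence (and no TLB invalidation) is performed.
 */
static bool set_stage2_pte(pte_t *ptep, pte_t new_pte)
{
    assert(ptep != NULL);

    if (*ptep == new_pte)
        return false;          /* no change: skip the update entirely */

    if (*ptep != 0) {
        *ptep = 0;             /* break: clear the old valid entry */
        flush_tlb_entry();     /* invalidate stale translations */
    }
    *ptep = new_pte;           /* make: install the new entry */
    return true;
}
```

    With this shape, a second vcpu faulting on the same entry with the same
    value returns without clearing the entry, so other vcpus are not made
    to refault.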

    Cc: stable@vger.kernel.org
    Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     
  • commit 86658b819cd0a9aa584cd84453ed268a6f013770 upstream.

    Contention on updating a PMD entry by a large number of vcpus can lead
    to duplicate work when handling stage 2 page faults. As the page table
    update follows the break-before-make requirement of the architecture,
    it can lead to repeated refaults due to clearing the entry and
    flushing the tlbs.

    This problem is more likely when:

    * there are a large number of vcpus
    * the mapping is a large block mapping

    such as when using PMD hugepages (512MB) with 64k pages.

    Fix this by skipping the page table update if there is no change in
    the entry being updated.
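
    The 512MB figure quoted above follows from the table geometry. The
    sketch below works through the arithmetic, assuming 8-byte descriptors
    and one table per page; pmd_block_size() and DESC_SIZE are names made
    up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_SIZE 8ULL  /* assumed bytes per translation table descriptor */

/*
 * Bytes mapped by one PMD-level block entry for a given page size.
 * A table fills one page, so it holds page_size / DESC_SIZE entries,
 * and one PMD entry covers a whole last-level table worth of pages.
 */
static uint64_t pmd_block_size(uint64_t page_size)
{
    uint64_t entries_per_table;

    assert(page_size >= DESC_SIZE);
    entries_per_table = page_size / DESC_SIZE;
    return entries_per_table * page_size;
}
```

    For 64k pages this gives 8192 entries * 64KB = 512MB, and for 4k pages
    512 entries * 4KB = 2MB, so a single contended entry covers far more
    guest memory with the larger page size.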

    Cc: stable@vger.kernel.org
    Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

21 Mar, 2018

1 commit

  • commit 76600428c3677659e3c3633bb4f2ea302220a275 upstream.

    On my GICv3 system, the following is printed to the kernel log at boot:

    kvm [1]: 8-bit VMID
    kvm [1]: IDMAP page: d20e35000
    kvm [1]: HYP VA range: 800000000000:ffffffffffff
    kvm [1]: vgic-v2@2c020000
    kvm [1]: GIC system register CPU interface enabled
    kvm [1]: vgic interrupt IRQ1
    kvm [1]: virtual timer IRQ4
    kvm [1]: Hyp mode initialized successfully

    The KVM IDMAP is a mapping of a statically allocated kernel structure,
    and so printing its physical address leaks the physical placement of
    the kernel when physical KASLR is in effect. So change the kvm_info() to
    kvm_debug() to remove it from the log output.

    While at it, trim the output a bit more: IRQ numbers can be found in
    /proc/interrupts, and the HYP VA and vgic-v2 lines are not highly
    informational either.

    Cc:
    Acked-by: Will Deacon
    Acked-by: Christoffer Dall
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     

24 Jan, 2018

1 commit

  • commit c507babf10ead4d5c8cca704539b170752a8ac84 upstream.

    KVM only supports PMD hugepages at stage 2 but doesn't actually check
    that the provided hugepage memory pagesize is PMD_SIZE before populating
    stage 2 entries.

    In cases where the backing hugepage size is smaller than PMD_SIZE (such
    as when using contiguous hugepages), KVM can end up creating stage 2
    mappings that extend beyond the supplied memory.

    Fix this by checking the page size of the userspace vma before creating
    a PMD hugepage at stage 2.
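
    A minimal model of the check, assuming a 4k page configuration in which
    a contiguous-PTE hugepage is 16 pages (64KB) and PMD_SIZE is 2MB. The
    macro names and can_use_stage2_pmd_block() are hypothetical and exist
    only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K   (1ULL << 12)
#define PMD_SIZE_4K    (1ULL << 21)          /* 2MB block, 4k pages (assumed) */
#define CONT_PTE_SIZE  (16 * PAGE_SIZE_4K)   /* 64KB contiguous hugepage (assumed) */

/*
 * Only allow a stage 2 PMD block mapping when the backing hugepage is
 * exactly PMD-sized. A smaller backing page would let the block mapping
 * extend beyond the memory that was actually supplied.
 */
static bool can_use_stage2_pmd_block(uint64_t vma_pagesize, uint64_t pmd_size)
{
    assert(pmd_size != 0);
    return vma_pagesize == pmd_size;
}
```

    In this model a 64KB contiguous hugepage is rejected, so the fault
    would fall back to mapping at page granularity instead.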

    Fixes: 66b3923a1a0f77a ("arm64: hugetlb: add support for PTE contiguous bit")
    Signed-off-by: Punit Agrawal
    Cc: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

30 Dec, 2017

1 commit

  • commit 7839c672e58bf62da8f2f0197fefb442c02ba1dd upstream.

    When we unmap the HYP memory, we try to be clever and unmap one
    PGD at a time. If we start with a non-PGD aligned address and try
    to unmap a whole PGD, things go horribly wrong in unmap_hyp_range
    (addr and end can never match, and it all goes really badly as we
    keep incrementing pgd and parse random memory as page tables...).

    The obvious fix is to let unmap_hyp_range do what it does best,
    which is to iterate over a range.

    The size of the linear mapping, which begins at PAGE_OFFSET, can be
    easily calculated by subtracting PAGE_OFFSET from high_memory, because
    high_memory is defined as the linear map address of the last byte of
    DRAM, plus one.

    The size of the vmalloc region is given trivially by VMALLOC_END -
    VMALLOC_START.
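
    Both points can be modelled in a few lines of userspace C. This is a
    sketch under assumptions: the PGDIR_SHIFT value is hypothetical,
    region_size() and count_pgd_steps() are invented for illustration, and
    pgd_addr_end() only mirrors the shape of the kernel helper of the same
    name.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 30                      /* hypothetical: 1GB per PGD entry */
#define PGDIR_SIZE  (1ULL << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* Size of [start, end): e.g. high_memory - PAGE_OFFSET for the linear map. */
static uint64_t region_size(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}

/* Next PGD boundary after addr, clamped to end. */
static uint64_t pgd_addr_end(uint64_t addr, uint64_t end)
{
    uint64_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;

    return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Iterate over [start, start + size) one PGD entry at a time and count
 * the steps. Because each step is clamped to the range end, the loop
 * terminates even when start is not PGD aligned.
 */
static unsigned int count_pgd_steps(uint64_t start, uint64_t size)
{
    uint64_t addr = start, end = start + size;
    unsigned int steps = 0;

    assert(size > 0);
    do {
        addr = pgd_addr_end(addr, end);
        steps++;
    } while (addr != end);
    return steps;
}
```

    An unaligned start with a PGD-sized length touches two PGD entries and
    the walk still stops exactly at the end of the range.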

    Reported-by: Andre Przywara
    Tested-by: Andre Przywara
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

05 Sep, 2017

1 commit

  • The ARM-ARM has two bits in the ESR/HSR relevant to external aborts.
    A range of {I,D}FSC values (of which bit 5 is always set) and bit 9 'EA'
    which provides:
    > an IMPLEMENTATION DEFINED classification of External Aborts.

    This bit is in addition to the {I,D}FSC range, and has an implementation
    defined meaning. KVM should always ignore this bit when handling external
    aborts from a guest.

    Remove the ESR_ELx_EA definition and rewrite its helper
    kvm_vcpu_dabt_isextabt() to check the {I,D}FSC range. This merges
    kvm_vcpu_dabt_isextabt() and the recently added is_abort_sea() helper.
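
    A sketch of classifying by the {I,D}FSC range alone is shown below.
    The fault status code values are assumptions based on the
    architectural encoding (0x10 for a synchronous external abort, 0x14 to
    0x17 for one on a table walk, 0x18 and 0x1c to 0x1f for the parity/ECC
    variants) and should be checked against the ARM-ARM; the macro and
    function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FSC_MASK  0x3fU        /* fault status code, ESR bits [5:0] */
#define SKETCH_EA_BIT    (1U << 9)    /* IMPLEMENTATION DEFINED 'EA' bit */

/* Assumed FSC encodings for synchronous external aborts. */
enum {
    FSC_SEA       = 0x10,
    FSC_SEA_TTW0  = 0x14, FSC_SEA_TTW1  = 0x15,
    FSC_SEA_TTW2  = 0x16, FSC_SEA_TTW3  = 0x17,
    FSC_SECC      = 0x18,
    FSC_SECC_TTW0 = 0x1c, FSC_SECC_TTW1 = 0x1d,
    FSC_SECC_TTW2 = 0x1e, FSC_SECC_TTW3 = 0x1f,
};

/* Decide on the FSC range only; the EA bit is deliberately ignored. */
static bool esr_is_external_abort(uint32_t esr)
{
    switch (esr & SKETCH_FSC_MASK) {
    case FSC_SEA:
    case FSC_SEA_TTW0: case FSC_SEA_TTW1:
    case FSC_SEA_TTW2: case FSC_SEA_TTW3:
    case FSC_SECC:
    case FSC_SECC_TTW0: case FSC_SECC_TTW1:
    case FSC_SECC_TTW2: case FSC_SECC_TTW3:
        return true;
    default:
        return false;
    }
}
```

    Note that a translation fault with the EA bit set is not treated as an
    external abort here, which is the behaviour the message asks for.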

    CC: Tyler Baicar
    Reported-by: gengdongjiu
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    James Morse
     

25 Jul, 2017

1 commit

  • The mmu_notifier_release() callback of KVM triggers cleaning up
    the stage2 page table on kvm-arm. However, other notifier callbacks
    could run in parallel with mmu_notifier_release(), which could
    cause those callbacks to end up operating on an empty stage2
    page table. Make sure we check for it in all the notifier callbacks.

    Cc: stable@vger.kernel.org
    Fixes: commit 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
    Reported-by: Alex Graf
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     

07 Jul, 2017

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "PPC:
    - Better machine check handling for HV KVM
    - Ability to support guests with threads=2, 4 or 8 on POWER9
    - Fix for a race that could cause delayed recognition of signals
    - Fix for a bug where POWER9 guests could sleep with interrupts pending.

    ARM:
    - VCPU request overhaul
    - allow timer and PMU to have their interrupt number selected from userspace
    - workaround for Cavium erratum 30115
    - handling of memory poisoning
    - the usual crop of fixes and cleanups

    s390:
    - initial machine check forwarding
    - migration support for the CMMA page hinting information
    - cleanups and fixes

    x86:
    - nested VMX bugfixes and improvements
    - more reliable NMI window detection on AMD
    - APIC timer optimizations

    Generic:
    - VCPU request overhaul + documentation of common code patterns
    - kvm_stat improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
    Update my email address
    kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
    x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12
    kvm: x86: mmu: allow A/D bits to be disabled in an mmu
    x86: kvm: mmu: make spte mmio mask more explicit
    x86: kvm: mmu: dead code thanks to access tracking
    KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code
    KVM: PPC: Book3S HV: Close race with testing for signals on guest entry
    KVM: PPC: Book3S HV: Simplify dynamic micro-threading code
    KVM: x86: remove ignored type attribute
    KVM: LAPIC: Fix lapic timer injection delay
    KVM: lapic: reorganize restart_apic_timer
    KVM: lapic: reorganize start_hv_timer
    kvm: nVMX: Check memory operand to INVVPID
    KVM: s390: Inject machine check into the nested guest
    KVM: s390: Inject machine check into the guest
    tools/kvm_stat: add new interactive command 'b'
    tools/kvm_stat: add new command line switch '-i'
    tools/kvm_stat: fix error on interactive command 'g'
    KVM: SVM: suppress unnecessary NMI singlestep on GIF=0 and nested exit
    ...

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • Pull arm64 updates from Will Deacon:

    - RAS reporting via GHES/APEI (ACPI)

    - Indirect ftrace trampolines for modules

    - Improvements to kernel fault reporting

    - Page poisoning

    - Sigframe cleanups and preparation for SVE context

    - Core dump fixes

    - Sparse fixes (mainly relating to endianness)

    - xgene SoC PMU v3 driver

    - Misc cleanups and non-critical fixes

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (75 commits)
    arm64: fix endianness annotation for 'struct jit_ctx' and friends
    arm64: cpuinfo: constify attribute_group structures.
    arm64: ptrace: Fix incorrect get_user() use in compat_vfp_set()
    arm64: ptrace: Remove redundant overrun check from compat_vfp_set()
    arm64: ptrace: Avoid setting compat FP[SC]R to garbage if get_user fails
    arm64: fix endianness annotation for __apply_alternatives()/get_alt_insn()
    arm64: fix endianness annotation in get_kaslr_seed()
    arm64: add missing conversion to __wsum in ip_fast_csum()
    arm64: fix endianness annotation in acpi_parking_protocol.c
    arm64: use readq() instead of readl() to read 64bit entry_point
    arm64: fix endianness annotation for reloc_insn_movw() & reloc_insn_imm()
    arm64: fix endianness annotation for aarch64_insn_write()
    arm64: fix endianness annotation in aarch64_insn_read()
    arm64: fix endianness annotation in call_undef_hook()
    arm64: fix endianness annotation for debug-monitors.c
    ras: mark stub functions as 'inline'
    arm64: pass endianness info to sparse
    arm64: ftrace: fix !CONFIG_ARM64_MODULE_PLTS kernels
    arm64: signal: Allow expansion of the signal frame
    acpi: apei: check for pending errors when probing GHES entries
    ...

    Linus Torvalds
     

23 Jun, 2017

2 commits

  • Currently external aborts are unsupported by the guest abort
    handling. Add handling for SEAs so that the host kernel reports
    SEAs which occur in the guest kernel.

    When an SEA occurs in the guest kernel, the guest exits and is
    routed to kvm_handle_guest_abort(). Prior to this patch, a message
    about an unsupported FSC would be printed and nothing else
    would happen. With this patch, the code gets routed to the APEI
    handling of SEAs in the host kernel to report the SEA information.

    Signed-off-by: Tyler Baicar
    Acked-by: Catalin Marinas
    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Will Deacon

    Tyler Baicar
     
  • Once we enable ARCH_SUPPORTS_MEMORY_FAILURE on arm64, notifications for
    broken memory can call memory_failure() in mm/memory-failure.c to offline
    pages of memory, possibly signalling user space processes and notifying all
    the in-kernel users.

    memory_failure() has two modes, early and late. Early is used by
    machine-managers like Qemu to receive a notification when a memory error
    is reported to the host. The error can then be relayed to the guest
    before the affected page is accessed. To enable this, the process must set
    PR_MCE_KILL_EARLY in PR_MCE_KILL_SET using the prctl() syscall.

    Once the early notification has been handled, nothing stops the
    machine-manager or guest from accessing the affected page. If the
    machine-manager does this the page will fail to be mapped and SIGBUS will
    be sent. This patch adds the equivalent path for when the guest accesses
    the page, sending SIGBUS to the machine-manager.

    These two signals can be distinguished by the machine-manager using their
    si_code: BUS_MCEERR_AO for 'action optional' early notifications, and
    BUS_MCEERR_AR for 'action required' synchronous/late notifications.

    Do as x86 does, and deliver the SIGBUS when we discover pfn ==
    KVM_PFN_ERR_HWPOISON. Use the hugepage size as si_addr_lsb if this vma was
    allocated as a hugepage. Transparent hugepages will be split by
    memory_failure() before we see them here.
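
    From the machine-manager's side, the two signals could be told apart
    roughly as sketched below. This is Linux-specific userspace C and only
    an illustration: enum mce_kind, classify_sigbus(), affected_bytes()
    and request_early_mce_notification() are hypothetical helpers, and
    si_addr_lsb is assumed to hold the log2 of the affected size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

enum mce_kind { MCE_NONE, MCE_ACTION_OPTIONAL, MCE_ACTION_REQUIRED };

/* Opt in to early (action optional) notifications for this process. */
static int request_early_mce_notification(void)
{
    return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
}

/* Map a SIGBUS siginfo to the kind of memory-error notification. */
static enum mce_kind classify_sigbus(const siginfo_t *info)
{
    assert(info != NULL);
    if (info->si_signo != SIGBUS)
        return MCE_NONE;

    switch (info->si_code) {
    case BUS_MCEERR_AO:
        return MCE_ACTION_OPTIONAL;   /* early: page not yet accessed */
    case BUS_MCEERR_AR:
        return MCE_ACTION_REQUIRED;   /* late: poisoned page was accessed */
    default:
        return MCE_NONE;              /* some other kind of SIGBUS */
    }
}

/* Extent of the poisoned region, derived from si_addr_lsb. */
static unsigned long affected_bytes(const siginfo_t *info)
{
    assert(info != NULL);
    return 1UL << info->si_addr_lsb;
}
```

    A handler installed with SA_SIGINFO would pass its siginfo_t to
    classify_sigbus() and, for a hugepage-backed mapping, could expect
    affected_bytes() to report the hugepage size rather than a single page.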

    Cc: Punit Agrawal
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    James Morse
     

06 Jun, 2017

1 commit

  • Under memory pressure, we start ageing pages, which amounts to parsing
    the page tables. Since we don't want to allocate any extra level,
    we pass NULL for our private allocation cache, which means that
    stage2_get_pud() is allowed to fail. This results in the following
    splat:

    [ 1520.409577] Unable to handle kernel NULL pointer dereference at virtual address 00000008
    [ 1520.417741] pgd = ffff810f52fef000
    [ 1520.421201] [00000008] *pgd=0000010f636c5003, *pud=0000010f56f48003, *pmd=0000000000000000
    [ 1520.429546] Internal error: Oops: 96000006 [#1] PREEMPT SMP
    [ 1520.435156] Modules linked in:
    [ 1520.438246] CPU: 15 PID: 53550 Comm: qemu-system-aar Tainted: G W 4.12.0-rc4-00027-g1885c397eaec #7205
    [ 1520.448705] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB12A 10/26/2016
    [ 1520.463726] task: ffff800ac5fb4e00 task.stack: ffff800ce04e0000
    [ 1520.469666] PC is at stage2_get_pmd+0x34/0x110
    [ 1520.474119] LR is at kvm_age_hva_handler+0x44/0xf0
    [ 1520.478917] pc : [] lr : [] pstate: 40000145
    [ 1520.486325] sp : ffff800ce04e33d0
    [ 1520.489644] x29: ffff800ce04e33d0 x28: 0000000ffff40064
    [ 1520.494967] x27: 0000ffff27e00000 x26: 0000000000000000
    [ 1520.500289] x25: ffff81051ba65008 x24: 0000ffff40065000
    [ 1520.505618] x23: 0000ffff40064000 x22: 0000000000000000
    [ 1520.510947] x21: ffff810f52b20000 x20: 0000000000000000
    [ 1520.516274] x19: 0000000058264000 x18: 0000000000000000
    [ 1520.521603] x17: 0000ffffa6fe7438 x16: ffff000008278b70
    [ 1520.526940] x15: 000028ccd8000000 x14: 0000000000000008
    [ 1520.532264] x13: ffff7e0018298000 x12: 0000000000000002
    [ 1520.537582] x11: ffff000009241b93 x10: 0000000000000940
    [ 1520.542908] x9 : ffff0000092ef800 x8 : 0000000000000200
    [ 1520.548229] x7 : ffff800ce04e36a8 x6 : 0000000000000000
    [ 1520.553552] x5 : 0000000000000001 x4 : 0000000000000000
    [ 1520.558873] x3 : 0000000000000000 x2 : 0000000000000008
    [ 1520.571696] x1 : ffff000008fd5000 x0 : ffff0000080b149c
    [ 1520.577039] Process qemu-system-aar (pid: 53550, stack limit = 0xffff800ce04e0000)
    [...]
    [ 1521.510735] [] stage2_get_pmd+0x34/0x110
    [ 1521.516221] [] kvm_age_hva_handler+0x44/0xf0
    [ 1521.522054] [] handle_hva_to_gpa+0xb8/0xe8
    [ 1521.527716] [] kvm_age_hva+0x44/0xf0
    [ 1521.532854] [] kvm_mmu_notifier_clear_flush_young+0x70/0xc0
    [ 1521.539992] [] __mmu_notifier_clear_flush_young+0x88/0xd0
    [ 1521.546958] [] page_referenced_one+0xf0/0x188
    [ 1521.552881] [] rmap_walk_anon+0xec/0x250
    [ 1521.558370] [] rmap_walk+0x78/0xa0
    [ 1521.563337] [] page_referenced+0x164/0x180
    [ 1521.569002] [] shrink_active_list+0x178/0x3b8
    [ 1521.574922] [] shrink_node_memcg+0x328/0x600
    [ 1521.580758] [] shrink_node+0xc4/0x328
    [ 1521.585986] [] do_try_to_free_pages+0xc0/0x340
    [ 1521.592000] [] try_to_free_pages+0xcc/0x240
    [...]

    The trivial fix is to handle this NULL pud value early, rather than
    dereferencing it blindly.
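
    The shape of such an early check is sketched below with a toy
    two-level table. struct pud, lookup_pud() and lookup_pmd() are
    hypothetical and only model a walk in which an upper level may be
    missing because nothing may be allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy upper-level entry: points at a lower-level table, or NULL. */
struct pud {
    uint64_t *pmd_table;
};

/* Lookup without allocation: a missing or out-of-range entry yields NULL. */
static struct pud *lookup_pud(struct pud *pud_table, size_t nr, size_t idx)
{
    if (pud_table == NULL || idx >= nr)
        return NULL;
    return &pud_table[idx];
}

/*
 * Return the lower-level entry, or NULL if any upper level is absent.
 * The NULL pud is handled before it is dereferenced.
 */
static uint64_t *lookup_pmd(struct pud *pud_table, size_t nr,
                            size_t pud_idx, size_t pmd_idx)
{
    struct pud *pud = lookup_pud(pud_table, nr, pud_idx);

    if (pud == NULL || pud->pmd_table == NULL)
        return NULL;
    return &pud->pmd_table[pmd_idx];
}
```

    A caller that only ages pages can then treat a NULL result as "nothing
    mapped here" and move on.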

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

16 May, 2017

2 commits

  • We yield the kvm->mmu_lock occasionally while performing an operation
    (e.g., unmap or permission changes) on a large area of stage2 mappings.
    However, this could allow another thread to clear and free up
    the stage2 page tables while we are waiting to regain the lock, and
    thus the original thread could end up accessing memory that was
    freed. This patch fixes the problem by making sure that the stage2
    pagetable is still valid after we regain the lock. The fact that
    mmu_notifier->release() could be called twice (via __mmu_notifier_release
    and mmu_notifier_unregister) increases the possibility of hitting
    this race where there are two threads trying to unmap the entire guest
    shadow pages.

    While at it, clean up the redundant checks around cond_resched_lock in
    stage2_wp_range(), as cond_resched_lock already does the same checks.
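
    The revalidation after a yield can be modelled single-threaded, with a
    callback standing in for whatever another thread does while the lock
    is dropped. Everything here (struct vm, unmap_range(),
    teardown_during_yield()) is hypothetical and only illustrates the
    pattern of re-reading the PGD once the lock is held again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vm {
    uint64_t *pgd;       /* NULL once the tables have been torn down */
    size_t nr_entries;
};

typedef void (*yield_fn)(struct vm *vm);

/*
 * Clear all entries, yielding every `chunk` entries. The loop condition
 * re-reads vm->pgd after each yield, so the walk stops instead of
 * touching tables that were freed in the meantime.
 * Returns the number of entries cleared.
 */
static size_t unmap_range(struct vm *vm, size_t chunk, yield_fn yield)
{
    size_t i = 0, cleared = 0;

    assert(vm != NULL && chunk > 0);
    while (vm->pgd != NULL && i < vm->nr_entries) {
        vm->pgd[i++] = 0;
        cleared++;
        if (i % chunk == 0 && yield != NULL)
            yield(vm);   /* real code drops and retakes the lock here */
    }
    return cleared;
}

/* Models a second thread freeing the tables during the yield. */
static void teardown_during_yield(struct vm *vm)
{
    vm->pgd = NULL;
    vm->nr_entries = 0;
}
```

    Without the re-check in the loop condition, the walk in this model
    would carry on writing through a stale table pointer after the
    teardown.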

    Cc: Mark Rutland
    Cc: Radim Krčmář
    Cc: andreyknvl@google.com
    Cc: Paolo Bonzini
    Cc: stable@vger.kernel.org
    Acked-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
     
  • Make sure we don't use a cached value of the KVM stage2 PGD while
    resetting the PGD.

    Cc: Marc Zyngier
    Cc: stable@vger.kernel.org
    Signed-off-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
     

15 May, 2017

1 commit

  • In kvm_free_stage2_pgd() we check the stage2 PGD before taking
    the lock and proceed to take the lock if it is valid. We then unmap
    the page tables and release the lock. We reset the PGD
    only after dropping this lock, which could cause a race condition
    where another thread, waiting on or even holding the lock, could
    see that the PGD is still valid, proceed to perform
    a stage2 operation, and later encounter a NULL PGD.

    [223090.242280] Unable to handle kernel NULL pointer dereference at virtual address 00000040
    [223090.262330] PC is at unmap_stage2_range+0x8c/0x428
    [223090.262332] LR is at kvm_unmap_hva_handler+0x2c/0x3c
    [223090.262531] Call trace:
    [223090.262533] [] unmap_stage2_range+0x8c/0x428
    [223090.262535] [] kvm_unmap_hva_handler+0x2c/0x3c
    [223090.262537] [] handle_hva_to_gpa+0xb0/0x104
    [223090.262539] [] kvm_unmap_hva+0x5c/0xbc
    [223090.262543] [] kvm_mmu_notifier_invalidate_page+0x50/0x8c
    [223090.262547] [] __mmu_notifier_invalidate_page+0x5c/0x84
    [223090.262551] [] try_to_unmap_one+0x1d0/0x4a0
    [223090.262553] [] rmap_walk+0x1cc/0x2e0
    [223090.262555] [] try_to_unmap+0x74/0xa4
    [223090.262557] [] migrate_pages+0x31c/0x5ac
    [223090.262561] [] compact_zone+0x3fc/0x7ac
    [223090.262563] [] compact_zone_order+0x94/0xb0
    [223090.262564] [] try_to_compact_pages+0x108/0x290
    [223090.262569] [] __alloc_pages_direct_compact+0x70/0x1ac
    [223090.262571] [] __alloc_pages_nodemask+0x434/0x9f4
    [223090.262572] [] alloc_pages_vma+0x230/0x254
    [223090.262574] [] do_huge_pmd_anonymous_page+0x114/0x538
    [223090.262576] [] handle_mm_fault+0xd40/0x17a4
    [223090.262577] [] __get_user_pages+0x12c/0x36c
    [223090.262578] [] get_user_pages_unlocked+0xa4/0x1b8
    [223090.262579] [] __gfn_to_pfn_memslot+0x280/0x31c
    [223090.262580] [] gfn_to_pfn_prot+0x4c/0x5c
    [223090.262582] [] kvm_handle_guest_abort+0x240/0x774
    [223090.262584] [] handle_exit+0x11c/0x1ac
    [223090.262586] [] kvm_arch_vcpu_ioctl_run+0x31c/0x648
    [223090.262587] [] kvm_vcpu_ioctl+0x378/0x768
    [223090.262590] [] do_vfs_ioctl+0x324/0x5a4
    [223090.262591] [] SyS_ioctl+0x90/0xa4
    [223090.262595] [] el0_svc_naked+0x38/0x3c

    This patch moves the stage2 PGD manipulation under the lock.

    Reported-by: Alexander Graf
    Cc: Mark Rutland
    Cc: Marc Zyngier
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Reviewed-by: Christoffer Dall
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall

    Suzuki K Poulose
     

09 May, 2017

1 commit


04 May, 2017

1 commit

  • For some time now we have been having a lot of shared functionality
    between the arm and arm64 KVM support in arch/arm, which not only
    required a horrible inter-arch reference from the Makefile in
    arch/arm64/kvm, but also created confusion for newcomers to the code
    base, as was recently seen on the mailing list.

    Further, it causes confusion for things like cscope, which needs special
    attention to index specific shared files for arm64 from the arm tree.

    Move the shared files into virt/kvm/arm and move the trace points along
    with it. When moving the tracepoints we have to modify the way the vgic
    creates definitions of the trace points, so we take the chance to
    include the VGIC tracepoints in its very own special vgic trace.h file.

    Signed-off-by: Christoffer Dall

    Christoffer Dall