17 Feb, 2020

1 commit

  • With VHE, running a vCPU always requires the sequence:

    1. kvm_arm_vhe_guest_enter();
    2. kvm_vcpu_run_vhe();
    3. kvm_arm_vhe_guest_exit()

    ... and as we invoke this from the shared arm/arm64 KVM code, 32-bit arm
    has to provide stubs for all three functions.

    To simplify the common code, and make it easier to make further
    modifications to the arm64-specific portions in the near future, let's
    fold kvm_arm_vhe_guest_enter() and kvm_arm_vhe_guest_exit() into
    kvm_vcpu_run_vhe().

    The 32-bit stubs for kvm_arm_vhe_guest_enter() and
    kvm_arm_vhe_guest_exit() are removed, as they are no longer used. The
    32-bit stub for kvm_vcpu_run_vhe() is left as-is.

    There should be no functional change as a result of this patch.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200210114757.2889-1-mark.rutland@arm.com

    Mark Rutland
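
    A minimal sketch of the folded arrangement described above. The DAIF
    helpers are the arm64 interrupt-masking primitives; the exact body of
    kvm_vcpu_run_vhe() and the inner __kvm_vcpu_run_vhe() call are
    assumptions for illustration, not the verbatim patch:

    /* Sketch: enter/exit handling folded into the VHE run path. */
    int kvm_vcpu_run_vhe(struct kvm_vcpu *vcpu)
    {
        int ret;

        local_daif_mask();              /* was kvm_arm_vhe_guest_enter() */

        ret = __kvm_vcpu_run_vhe(vcpu); /* existing low-level run loop */

        local_daif_restore(DAIF_PROCCTX_NOIRQ); /* was kvm_arm_vhe_guest_exit() */
        isb();

        return ret;
    }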
     

12 Feb, 2020

1 commit

  • Accessing a per-cpu variable only makes sense when preemption is
    disabled (and the kernel does check this when the right debug options
    are switched on).

    For kvm_get_running_vcpu(), it is fine to return the value after
    re-enabling preemption, as the preempt notifiers will make sure that
    this is kept consistent across task migration (the comment above the
    function hints at it, but lacks the crucial preemption management).

    While we're at it, move the comment from the ARM code, which explains
    why the whole thing works.

    Fixes: 7495e22bb165 ("KVM: Move running VCPU from ARM to common code")
    Cc: Paolo Bonzini
    Reported-by: Zenghui Yu
    Tested-by: Zenghui Yu
    Reviewed-by: Peter Xu
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/318984f6-bc36-33a3-abc6-bf2295974b06@huawei.com
    Signed-off-by: Paolo Bonzini

    Marc Zyngier
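
    A sketch of the accessor with the preemption management described
    above; the per-cpu variable name comes from the "running vCPU"
    tracking patch listed under 28 Jan below, so treat the exact
    identifiers as assumptions:

    struct kvm_vcpu *kvm_get_running_vcpu(void)
    {
        struct kvm_vcpu *vcpu;

        /* Reading the per-cpu value is only meaningful with preemption off. */
        preempt_disable();
        vcpu = __this_cpu_read(kvm_running_vcpu);
        preempt_enable();

        /*
         * Returning the value after re-enabling preemption is fine: the
         * preempt notifiers keep kvm_running_vcpu consistent across task
         * migration.
         */
        return vcpu;
    }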
     

05 Feb, 2020

2 commits

  • We are testing virtual machines with KSM on a v5.4-rc2 kernel,
    and found a zero_page refcount overflow.
    The refcount is incremented in try_async_pf()
    (via get_user_pages()) without being decremented in mmu_set_spte()
    while handling an EPT violation.
    In kvm_release_pfn_clean(), only an unreserved page will have
    put_page() called on it; however, the zero page is reserved.
    So, as VMs are repeatedly created and destroyed, the refcount of the
    zero page keeps increasing until it overflows. (A simplified sketch
    of the put path follows this entry.)

    step1:
    echo 10000 > /sys/kernel/mm/ksm/pages_to_scan
    echo 1 > /sys/kernel/mm/ksm/run
    echo 1 > /sys/kernel/mm/ksm/use_zero_pages

    step2:
    create several normal qemu/kvm VMs, destroy them after 10s,
    and repeat this cycle continuously.

    After a long period of time, all domains hang because the
    refcount of the zero page has overflowed.

    QEMU prints an error log as follows:
    …
    error: kvm run failed Bad address
    EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
    ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
    EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
    ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
    SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
    TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
    GDT= 000f7070 00000037
    IDT= 000f70ae 00000000
    CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
    DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
    DR6=00000000ffff0ff0 DR7=0000000000000400
    EFER=0000000000000000
    Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
    …

    Meanwhile, a kernel warning is reported:

    [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
    [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G OE 5.2.0-rc2 #5
    [40914.836415] RIP: 0010:try_get_page+0x1f/0x30
    [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0
    0 00 00 00 48 8b 47 08 a8
    [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
    [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
    [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
    [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
    [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
    [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
    [40914.836423] FS: 00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
    [40914.836425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
    [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [40914.836428] Call Trace:
    [40914.836433] follow_page_pte+0x302/0x47b
    [40914.836437] __get_user_pages+0xf1/0x7d0
    [40914.836441] ? irq_work_queue+0x9/0x70
    [40914.836443] get_user_pages_unlocked+0x13f/0x1e0
    [40914.836469] __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
    [40914.836486] try_async_pf+0x87/0x240 [kvm]
    [40914.836503] tdp_page_fault+0x139/0x270 [kvm]
    [40914.836523] kvm_mmu_page_fault+0x76/0x5e0 [kvm]
    [40914.836588] vcpu_enter_guest+0xb45/0x1570 [kvm]
    [40914.836632] kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
    [40914.836645] kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
    [40914.836650] do_vfs_ioctl+0xa9/0x620
    [40914.836653] ksys_ioctl+0x60/0x90
    [40914.836654] __x64_sys_ioctl+0x16/0x20
    [40914.836658] do_syscall_64+0x5b/0x180
    [40914.836664] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [40914.836666] RIP: 0033:0x7fb61cb6bfc7

    Signed-off-by: LinFeng
    Signed-off-by: Zhuang Yanying
    Signed-off-by: Paolo Bonzini

    Zhuang Yanying
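
    A simplified sketch of the put path described above, showing why the
    reference taken by get_user_pages() is never dropped for the
    (reserved) zero page; this is an illustration of the imbalance, not
    the fix itself:

    void kvm_release_pfn_clean(kvm_pfn_t pfn)
    {
        /* Reserved pages (including the zero page) are skipped, so the
         * reference taken in try_async_pf() is never dropped and the
         * refcount keeps growing.
         */
        if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn))
            put_page(pfn_to_page(pfn)); /* never reached for the zero page */
    }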
     
  • Fedora kernel builds on armv7hl began failing recently because
    kvm_arm_exception_type and kvm_arm_exception_class were undeclared in
    trace.h. Add the missing include.

    Fixes: 0e20f5e25556 ("KVM: arm/arm64: Cleanup MMIO handling")
    Signed-off-by: Jeremy Cline
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200205134146.82678-1-jcline@redhat.com

    Jeremy Cline
     

31 Jan, 2020

4 commits

  • From Boris Ostrovsky:

    The KVM hypervisor may provide a guest with the ability to defer a remote
    TLB flush when the remote vCPU is not running. When this feature is used,
    the TLB flush will happen only when the remote vCPU is scheduled to run
    again. This avoids unnecessary (and expensive) IPIs.

    Under certain circumstances, when a guest initiates such a deferred action,
    the hypervisor may miss the request. It is also possible that the guest
    may mistakenly assume that it has already marked a remote vCPU as needing
    a flush when in fact that request had already been processed by the
    hypervisor. In both cases this will result in an invalid translation
    being present in a vCPU, potentially allowing accesses to memory locations
    in that guest's address space that should not be accessible. (A simplified
    guest-side sketch of the deferral mechanism follows this entry.)

    Note that only intra-guest memory is vulnerable.

    The five patches address both of these problems:
    1. The first patch makes sure the hypervisor doesn't accidentally clear
    a guest's remote flush request
    2. The rest of the patches prevent the race between the hypervisor
    acknowledging a remote flush request and the guest issuing a new one.

    Conflicts:
    arch/x86/kvm/x86.c [move from kvm_arch_vcpu_free to kvm_arch_vcpu_destroy]

    Paolo Bonzini
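
    A simplified guest-side sketch of the deferral idea described above.
    The KVM_VCPU_PREEMPTED and KVM_VCPU_FLUSH_TLB flags are part of the
    x86 PV steal-time ABI; the surrounding variables (src, cpu, flushmask)
    are illustrative fragments of the guest flush path, not the exact
    upstream code:

    /* src points at the target CPU's struct kvm_steal_time. */
    state = READ_ONCE(src->preempted);
    if (state & KVM_VCPU_PREEMPTED) {
        /* vCPU is not running: ask the hypervisor to flush its TLB
         * when it is next scheduled in, instead of sending an IPI.
         */
        if (try_cmpxchg(&src->preempted, &state,
                        state | KVM_VCPU_FLUSH_TLB))
            __cpumask_clear_cpu(cpu, flushmask);
    }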
     
  • __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is:
    * relatively expensive
    * not callable in certain cases (such as from atomic context)

    Stashing the gfn-to-pfn mapping should help with both cases.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Boris Ostrovsky
     
  • kvm_vcpu_(un)map operates on gfns from any current address space.
    In certain cases we want to make sure we are not mapping SMRAM
    and for that we can use kvm_(un)map_gfn() that we are introducing
    in this patch.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Boris Ostrovsky
     
  • KVM/arm updates for Linux 5.6

    - Fix MMIO sign extension
    - Fix HYP VA tagging on tag space exhaustion
    - Fix PSTATE/CPSR handling when generating exception
    - Fix MMU notifier's advertising of young pages
    - Fix poisoned page handling
    - Fix PMU SW event handling
    - Fix TVAL register access
    - Fix AArch32 external abort injection
    - Fix ITS unmapped collection handling
    - Various cleanups

    Paolo Bonzini
     

28 Jan, 2020

23 commits

  • According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
    RES0 [1]. When reading the register, the value is truncated to the least
    significant 32 bits [2], and on writes, TimerValue is treated as a signed
    32-bit integer [1, 2].

    When the guest behaves correctly and writes 32-bit values, treating TVAL
    as an unsigned 64-bit register works as expected. However, things start
    to break down when the guest writes larger values, because
    (u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1: the
    former will cause the timer interrupt to be asserted in the future, while
    the latter will cause it to be asserted now. Let's treat TVAL as a
    signed 32-bit register on writes, to match the behaviour described in
    the architecture, and the behaviour experimentally exhibited by the
    virtual timer on a non-VHE host. (A worked example follows this entry.)

    [1] Arm DDI 0487E.a, section D13.8.18
    [2] Arm DDI 0487E.a, section D11.2.4

    Signed-off-by: Alexandru Elisei
    [maz: replaced the read-side mask with lower_32_bits]
    Signed-off-by: Marc Zyngier
    Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
    Link: https://lore.kernel.org/r/20200127103652.2326-1-alexandru.elisei@arm.com

    Alexandru Elisei
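
    A worked example of the cast difference described above. This is a
    standalone illustration of the arithmetic only, not KVM code:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t written = 0x1ffffffffULL;  /* guest writes a >32-bit value */

        /* Treated as an unsigned 64-bit value: fires far in the future. */
        uint64_t as_u64 = written;          /* 8589934591 */

        /* Treated as a signed 32-bit value (low 32 bits): fires now. */
        int64_t as_s32 = (int32_t)(written & 0xffffffffULL);   /* -1 */

        printf("u64: %llu, s32: %lld\n",
               (unsigned long long)as_u64, (long long)as_s32);
        return 0;
    }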
     
  • Let the code never use unsupported event counters. Change
    kvm_pmu_handle_pmcr() to only reset supported counters and
    kvm_pmu_vcpu_reset() to only stop supported counters.

    Other actions are filtered on the supported counters in
    kvm/sysregs.c

    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-5-eric.auger@redhat.com

    Eric Auger
     
  • At the moment a SW_INCR counter always overflows at the 32-bit
    boundary, independently of whether the n+1th counter is
    programmed as CHAIN.

    Check whether the SW_INCR counter is a 64-bit counter and, if so,
    implement the 64-bit logic. (A simplified sketch follows this entry.)

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-4-eric.auger@redhat.com

    Eric Auger
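
    A simplified sketch of the 64-bit logic described above: increment the
    low 32 bits and, on wrap, either carry into the high (chained) counter
    or raise the overflow bit. counter[], chained() and overflow_set are
    illustrative stand-ins, not the upstream identifiers:

    /* Fragment from inside the SW_INCR handler, for counter i. */
    low = lower_32_bits(counter[i] + 1);
    counter[i] = low;
    if (low == 0) {                     /* the low 32 bits wrapped */
        if (chained(i)) {               /* 64-bit (chained): carry into i + 1 */
            high = lower_32_bits(counter[i + 1] + 1);
            counter[i + 1] = high;
            if (high == 0)
                overflow_set |= BIT(i + 1);
        } else {                        /* plain 32-bit counter: overflow now */
            overflow_set |= BIT(i);
        }
    }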
     
  • At the moment we update the chain bitmap on type setting. This
    does not take into account the enable state of the odd register.

    Let's make sure a counter is never considered as chained if
    the high counter is disabled.

    We recompute the chain state on enable/disable and type changes.

    Also let create_perf_event() use the chain bitmap rather than
    kvm_pmu_idx_has_chain_evtype().

    Suggested-by: Marc Zyngier
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-3-eric.auger@redhat.com

    Eric Auger
     
  • The specification says PMSWINC increments PMEVCNTR<n>_EL0 by 1
    if event counter n is enabled and configured to count SW_INCR.

    For PMEVCNTR<n>_EL0 to be enabled, PMCNTENSET must be set for the
    corresponding event counter, but the PMCR.E bit must also be set.
    (A minimal sketch of the extra check follows this entry.)

    Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Andrew Murray
    Acked-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com

    Eric Auger
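
    A minimal sketch of the extra global-enable check described above;
    the exact placement inside the SW_INCR handler is an assumption:

    /* Bail out of the SW_INCR handler when the PMU is globally disabled,
     * in addition to the per-counter PMCNTENSET check.
     */
    if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
        return;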
     
  • Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
    on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
    this allows x86 to create large mappings for read-only memslots that
    are backed by HugeTLB mappings.

    Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
    support") states "If the largepage contains write-protected pages, a
    large pte is not used.", but "write-protected" refers to pages that are
    temporarily read-only, e.g. read-only memslots didn't even exist at the
    time.

    Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Add a helper, is_transparent_hugepage(), to explicitly check whether a
    compound page is a THP and use it when populating KVM's secondary MMU.
    The explicit check fixes a bug where a remapped compound page, e.g. for
    an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
    which results in KVM incorrectly creating a huge page in its secondary
    MMU.

    Fixes: 936a5fe6e6148 ("thp: kvm mmu transparent hugepage support")
    Reported-by: syzbot+c9d1fb51ac9d0d10c39d@syzkaller.appspotmail.com
    Cc: Andrea Arcangeli
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Check the result of __kvm_gfn_to_hva_cache_init() and return immediately
    instead of relying on the kvm_is_error_hva() check to detect errors so
    that it's abundantly clear KVM intends to immediately bail on an error.

    Note, the hva check is still mandatory to handle errors on subsequent
    calls with the same generation. Similarly, always return -EFAULT on
    error so that multiple (bad) calls for a given generation will get the
    same result, e.g. on an illegal gfn wrap, propagating the return from
    __kvm_gfn_to_hva_cache_init() would cause the initial call to return
    -EINVAL and subsequent calls to return -EFAULT.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
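
    A sketch of the shape described above: check the init result and
    return -EFAULT directly rather than relying solely on the later hva
    check. The surrounding cached-access context is illustrative, not the
    exact upstream diff:

    /* In the cached read/write path: re-initialize the cache when the
     * memslot generation changed, and bail out immediately on failure.
     */
    if (slots->generation != ghc->generation) {
        if (__kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len))
            return -EFAULT;
    }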
     
  • Barret reported a (technically benign) bug where nr_pages_avail can be
    accessed without being initialized if gfn_to_hva_many() fails.

    virt/kvm/kvm_main.c:2193:13: warning: 'nr_pages_avail' may be
    used uninitialized in this function [-Wmaybe-uninitialized]

    Rather than simply squashing the warning by initializing nr_pages_avail,
    fix the underlying issues by reworking __kvm_gfn_to_hva_cache_init() to
    return immediately instead of continuing on. Now that all callers check
    the result and/or bail immediately on a bad hva, there's no need to
    explicitly nullify the memslot on error.

    Reported-by: Barret Rhoden
    Fixes: f1b9dd5eb86c ("kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init")
    Cc: Jim Mattson
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • When reading/writing using the guest/host cache, check for a bad hva
    before checking for a NULL memslot, which triggers the slow path for
    handling cross-page accesses. Because the memslot is nullified on error
    by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
    crossing into a new page, then the kvm_{read,write}_guest() slow path
    could potentially write/access the first chunk prior to detecting the
    bad hva.

    Arguably, performing a partial access is semantically correct from an
    architectural perspective, but that behavior is certainly not intended.
    In the original implementation, memslot was not explicitly nullified
    and therefore the partial access behavior varied based on whether the
    memslot itself was null, or if the hva was simply bad. The current
    behavior was introduced as a seemingly unintentional side effect in
    commit f1b9dd5eb86c ("kvm: Disallow wraparound in
    kvm_gfn_to_hva_cache_init"), which justified the change with "since some
    callers don't check the return code from this function, it sit seems
    prudent to clear ghc->memslot in the event of an error".

    Regardless of intent, the partial access is dependent on _not_ checking
    the result of the cache initialization, which is arguably a bug in its
    own right, at best simply weird.

    Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
    Cc: Jim Mattson
    Cc: Andrew Honig
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
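
    A sketch of the reordered checks described above (the write path is
    shown, the read path is symmetric; the fragment's local variables are
    those of the cached-write helper and are illustrative):

    /* Check for a bad hva first, so a cached access that crossed into a
     * new page and failed re-init never performs a partial write via the
     * slow path.
     */
    if (kvm_is_error_hva(ghc->hva))
        return -EFAULT;

    /* Only then fall back to the slow path for cross-page accesses. */
    if (unlikely(!ghc->memslot))
        return kvm_write_guest(kvm, gpa, data, len);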
     
  • For ring-based dirty log tracking, it will be more efficient to account
    writes during schedule-out or schedule-in to the currently running VCPU.
    We would like to do it even if the write doesn't use the current VCPU's
    address space, as is the case for cached writes (see commit 4e335d9e7ddb,
    "Revert "KVM: Support vCPU-based gfn->hva cache"", 2017-05-02).

    Therefore, add a mechanism to track the currently-loaded kvm_vcpu struct.
    There is already something similar in KVM/ARM; one important difference
    is that kvm_arch_vcpu_{load,put} have two callers in virt/kvm/kvm_main.c:
    we have to update both the architecture-independent vcpu_{load,put} and
    the preempt notifiers.

    Another change made in the process is to allow using kvm_get_running_vcpu()
    in preemptible code. This is allowed because preempt notifiers ensure
    that the value does not change even after the VCPU thread is migrated.

    Signed-off-by: Paolo Bonzini
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
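
    A sketch of the tracking mechanism described above: a per-cpu pointer
    updated from the preempt notifiers (and, symmetrically, from
    vcpu_load()/vcpu_put(), not shown). The exact naming is an assumption:

    static DEFINE_PER_CPU(struct kvm_vcpu *, kvm_running_vcpu);

    static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
    {
        struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

        __this_cpu_write(kvm_running_vcpu, vcpu);   /* vCPU now loaded here */
        kvm_arch_vcpu_load(vcpu, cpu);
    }

    static void kvm_sched_out(struct preempt_notifier *pn,
                              struct task_struct *next)
    {
        struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

        kvm_arch_vcpu_put(vcpu);
        __this_cpu_write(kvm_running_vcpu, NULL);   /* no vCPU on this CPU */
    }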
     
  • It's already going to reach 2400 bytes (which is over half of the page
    size on 4K page archs), so maybe it's good to have this build-time
    check in case it overflows when adding new fields. (A sketch of such a
    check follows this entry.)

    Signed-off-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Peter Xu
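
    A sketch of the kind of build-time check described above; the commit
    body shown here does not name the structure, so struct kvm_run is an
    assumption:

    /* Assumption: the structure being guarded is struct kvm_run, which is
     * mmap()ed to userspace as a single page and so must never exceed it.
     */
    BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);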
     
  • Remove kvm_read_guest_atomic() because it's not used anywhere.

    Signed-off-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • Open code the allocation and freeing of the vcpu->run page in
    kvm_vm_ioctl_create_vcpu() and kvm_vcpu_destroy() respectively. Doing
    so allows kvm_vcpu_init() to be a pure init function and eliminates
    kvm_vcpu_uninit() entirely.

    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
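
    A sketch of the open-coded allocation and freeing described above;
    error labels and surrounding context are illustrative:

    /* In kvm_vm_ioctl_create_vcpu(): give the vcpu its kvm_run page. */
    page = alloc_page(GFP_KERNEL | __GFP_ZERO);
    if (!page) {
        r = -ENOMEM;
        goto vcpu_free;
    }
    vcpu->run = page_address(page);

    /* ... and in kvm_vcpu_destroy(): */
    free_page((unsigned long)vcpu->run);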
     
  • Move the putting of vcpu->pid to kvm_vcpu_destroy(). vcpu->pid is
    guaranteed to be NULL when kvm_vcpu_uninit() is called in the error path
    of kvm_vm_ioctl_create_vcpu(), e.g. it is explicitly nullified by
    kvm_vcpu_init() and is only changed by KVM_RUN.

    No functional change intended.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Remove kvm_arch_vcpu_init() and kvm_arch_vcpu_uninit() now that all
    arch specific implementations are nops.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Add an arm specific hook to free the arm64-only sve_state. Doing so
    eliminates the last functional code from kvm_arch_vcpu_uninit() across
    all architectures and paves the way for removing kvm_arch_vcpu_init()
    and kvm_arch_vcpu_uninit() entirely.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Fold init() into create() now that the two are called back-to-back by
    common KVM code (kvm_vcpu_init() calls kvm_arch_vcpu_init() as its last
    action, and kvm_vm_ioctl_create_vcpu() calls kvm_arch_vcpu_create()
    immediately thereafter). This paves the way for removing
    kvm_arch_vcpu_{un}init() entirely.

    Note, there is no associated unwinding in kvm_arch_vcpu_uninit() that
    needs to be relocated (to kvm_arch_vcpu_destroy()).

    No functional change intended.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Remove kvm_arch_vcpu_setup() now that all arch specific implementations
    are nops.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Initialize the preempt notifier immediately in kvm_vcpu_init() to pave
    the way for removing kvm_arch_vcpu_setup(), i.e. to allow arch specific
    code to call vcpu_load() during kvm_arch_vcpu_create().

    Back when preemption support was added, the location of the call to init
    the preempt notifier was perfectly sane. The overall vCPU creation flow
    featured a single arch specific hook and the preempt notifier was used
    immediately after its initialization (by vcpu_load()). E.g.:

    vcpu = kvm_arch_ops->vcpu_create(kvm, n);
    if (IS_ERR(vcpu))
    return PTR_ERR(vcpu);

    preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);

    vcpu_load(vcpu);
    r = kvm_mmu_setup(vcpu);
    vcpu_put(vcpu);
    if (r < 0)
    goto free_vcpu;

    Today, the call to preempt_notifier_init() is sandwiched between two
    arch specific calls, kvm_arch_vcpu_create() and kvm_arch_vcpu_setup(),
    which needlessly forces x86 (and possibly others?) to split its vCPU
    creation flow. Init the preempt notifier prior to any arch specific
    call so that each arch can independently decide how best to organize
    its creation flow.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Unexport kvm_vcpu_cache and kvm_vcpu_{un}init() and make them static
    now that they are referenced only in kvm_main.c.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Now that all architectures tightly couple vcpu allocation/free with the
    mandatory calls to kvm_vcpu_{un}init(), move the sequences verbatim to
    common KVM code.

    Move both allocation and initialization in a single patch to eliminate
    thrash in arch specific code. The bisection benefits of moving the two
    pieces in separate patches are marginal at best, whereas the odds of
    introducing a transient arch specific bug are non-zero.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

24 Jan, 2020

3 commits

  • Add kvm_vcpu_destroy() and wire up all architectures to call the common
    function instead of their arch specific implementation. The common
    destruction function will be used by future patches to move allocation
    and initialization of vCPUs to common KVM code, i.e. to free resources
    that are allocated by arch agnostic code.

    No functional change intended.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Add a pre-allocation arch hook to handle checks that are currently done
    by arch specific code prior to allocating the vCPU object. This paves
    the way for moving the allocation to common KVM code.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Remove the superfluous kvm_arch_vcpu_free() as it is no longer called
    from common KVM code. Note, kvm_arch_vcpu_destroy() *is* called from
    common code, i.e. choosing which function to whack is not completely
    arbitrary.

    Acked-by: Christoffer Dall
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

23 Jan, 2020

5 commits

  • KVM's inject_abt64() injects an external-abort into an aarch64 guest.
    The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but
    for an aarch32 guest inject_abt32() injects an implementation-defined
    exception, 'Lockdown fault'.

    Change this to external abort. For non-LPAE we now get the documented:
    | Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00
    and for LPAE:
    | Unhandled fault: synchronous external abort (0x210) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com

    James Morse
     
  • Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected
    exception to a non-LPAE aarch32 guest.

    The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault
    (Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on
    instruction cache maintenance". This fault is hooked by
    do_translation_fault() since ARMv6, which goes on to silently 'handle'
    the exception, and restart the faulting instruction.

    It turns out, when TTBCR.EAE is clear DFSR is split, and FS[4] has
    to shuffle up to DFSR[10].

    As KVM only does this in one place, fix up the static values. We
    now get the expected:
    | Unhandled fault: lock abort (0x404) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com

    James Morse
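
    A worked example of the short-descriptor (TTBCR.EAE == 0) encoding
    described above: FS is a 5-bit field split across DFSR, with FS[3:0]
    in DFSR[3:0] and FS[4] in DFSR[10]. The snippet is illustrative only:

    /* Pack FS = 0x14 ("IMPLEMENTATION DEFINED fault, Lockdown") into a
     * short-descriptor DFSR value.
     */
    u32 fs   = 0x14;
    u32 dfsr = (fs & 0xf) | ((fs >> 4) << 10);  /* = 0x404, "lock abort" */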
     
  • kvm_test_age_hva() is called upon mmu_notifier_test_young(), but the
    wrong address range has been passed to handle_hva_to_gpa(). With the
    wrong address range, no young bits will be checked in
    handle_hva_to_gpa(), meaning that zero is always returned from
    mmu_notifier_test_young().

    Fix the issue by passing the correct address range to the underlying
    function handle_hva_to_gpa(), so that the hardware young (access) bit
    will be visited. (A sketch of the corrected call follows this entry.)

    Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging")
    Signed-off-by: Gavin Shan
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com

    Gavin Shan
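
    A sketch of the corrected call described above; the one-page range is
    inferred from the description and should be treated as an assumption:

    /* Pass a non-empty range so handle_hva_to_gpa() actually visits the
     * page and checks its young/access bit.
     */
    return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
                             kvm_test_age_hva_handler, NULL);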
     
  • Our MMIO handling is a bit odd, in the sense that it uses an
    intermediate per-vcpu structure to store the decoded information
    that describes the access.

    But the same information is readily available in the HSR/ESR_EL2
    field, and we actually use this field to populate the structure.

    Let's simplify the whole thing by getting rid of the superfluous
    structure, saving a (tiny) bit of space in the vcpu structure.

    [32bit fix courtesy of Olof Johansson ]
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • The wrappers make it less clear that the position of the call
    to kvm_arch_async_page_present depends on the architecture, and
    that only one of the two call sites will actually be active.
    Remove them.

    Cc: Andy Lutomirski
    Cc: Christian Borntraeger
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini