08 Mar, 2020

1 commit

  • Merge Linux stable release v5.4.24 into imx_5.4.y

    * tag 'v5.4.24': (3306 commits)
    Linux 5.4.24
    blktrace: Protect q->blk_trace with RCU
    kvm: nVMX: VMWRITE checks unsupported field before read-only field
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6sll-evk.dts
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
    drivers/clk/imx/clk-composite-8m.c
    drivers/gpio/gpio-mxc.c
    drivers/irqchip/Kconfig
    drivers/mmc/host/sdhci-of-esdhc.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/can/flexcan.c
    drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
    drivers/net/ethernet/mscc/ocelot.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/realtek.c
    drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/tee/optee/shm_pool.c
    drivers/usb/cdns3/gadget.c
    kernel/sched/cpufreq.c
    net/core/xdp.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c
    sound/soc/sof/core.c
    sound/soc/sof/imx/Kconfig
    sound/soc/sof/loader.c

    Jason Liu
     

05 Mar, 2020

1 commit

  • commit fcfbc617547fc6d9552cb6c1c563b6a90ee98085 upstream.

    When reading/writing using the guest/host cache, check for a bad hva
    before checking for a NULL memslot, which triggers the slow path for
    handling cross-page accesses. Because the memslot is nullified on error
    by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
    crossing into a new page, then the kvm_{read,write}_guest() slow path
    could potentially write/access the first chunk prior to detecting the
    bad hva.

    Arguably, performing a partial access is semantically correct from an
    architectural perspective, but that behavior is certainly not intended.
    In the original implementation, memslot was not explicitly nullified
    and therefore the partial access behavior varied based on whether the
    memslot itself was null, or if the hva was simply bad. The current
    behavior was introduced as a seemingly unintentional side effect in
    commit f1b9dd5eb86c ("kvm: Disallow wraparound in
    kvm_gfn_to_hva_cache_init"), which justified the change with "since some
    callers don't check the return code from this function, it sit seems
    prudent to clear ghc->memslot in the event of an error".

    Regardless of intent, the partial access is dependent on _not_ checking
    the result of the cache initialization, which is arguably a bug in its
    own right, at best simply weird.

    Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
    Cc: Jim Mattson
    Cc: Andrew Honig
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
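
    A minimal sketch of the check ordering described above, modelled on the
    kvm_write_guest_offset_cached() flow (illustrative only, not the exact
    upstream diff):

      /* Check for a bad hva before checking for a NULL memslot: the memslot is
       * nullified on error by __kvm_gfn_to_hva_cache_init(), so testing it
       * first would send bad-hva accesses down the cross-page slow path. */
      if (kvm_is_error_hva(ghc->hva))
              return -EFAULT;

      if (unlikely(!ghc->memslot))
              return kvm_write_guest(kvm, gpa, data, len); /* cross-page slow path */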
     

15 Feb, 2020

7 commits

  • commit 4a267aa707953a9a73d1f5dc7f894dd9024a92be upstream.

    According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
    RES0 [1]. When reading the register, the value is truncated to the least
    significant 32 bits [2], and on writes, TimerValue is treated as a signed
    32-bit integer [1, 2].

    When the guest behaves correctly and writes 32-bit values, treating TVAL
    as an unsigned 64-bit register works as expected. However, things start
    to break down when the guest writes larger values, because
    (u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1, and the
    former will cause the timer interrupt to be asserted in the future, but
    the latter will cause it to be asserted now. Let's treat TVAL as a
    signed 32-bit register on writes, to match the behaviour described in
    the architecture, and the behaviour experimentally exhibited by the
    virtual timer on a non-vhe host.

    [1] Arm DDI 0487E.a, section D13.8.18
    [2] Arm DDI 0487E.a, section D11.2.4

    Signed-off-by: Alexandru Elisei
    [maz: replaced the read-side mask with lower_32_bits]
    Signed-off-by: Marc Zyngier
    Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
    Link: https://lore.kernel.org/r/20200127103652.2326-1-alexandru.elisei@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Alexandru Elisei
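
    A rough write/read sketch of the behaviour described above (variable names
    are illustrative; the actual code lives in the arch timer emulation):

      /* TVAL write: sign-extend the low 32 bits, so 0x1_ffff_ffff acts as -1
       * and the timer fires now, as the architecture requires. */
      u64 now  = kvm_phys_timer_read();
      cnt_cval = now + (s64)(s32)lower_32_bits(val);

      /* TVAL read: truncate to the least significant 32 bits. */
      tval = lower_32_bits(cnt_cval - now);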
     
  • commit aa76829171e98bd75a0cc00b6248eca269ac7f4f upstream.

    At the moment a SW_INCR counter always overflows on a 32-bit
    boundary, independently of whether the (n+1)th counter is
    programmed as CHAIN.

    Check whether the SW_INCR counter is a 64b counter and if so,
    implement the 64b logic.

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-4-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
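
    A simplified sketch of the 64b handling described above (helper names are
    assumptions; the real kvm_pmu_software_increment() also has to update the
    chained high counter on overflow):

      /* Increment the SW_INCR counter, wrapping at 32 bits only when it is
       * not configured as a 64-bit (chained) counter. */
      reg = __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) + 1;
      if (!kvm_pmu_idx_is_64bit(vcpu, i))
              reg = lower_32_bits(reg);
      __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = reg;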
     
  • commit 3837407c1aa1101ed5e214c7d6041e7a23335c6e upstream.

    The specification says PMSWINC increments PMEVCNTR_EL1 by 1
    if PMEVCNTR_EL0 is enabled and configured to count SW_INCR.

    For PMEVCNTR_EL0 to be enabled, PMCNTENSET must be set for
    the corresponding event counter, and the PMCR.E bit must be
    set as well.

    Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Andrew Murray
    Acked-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
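
    A minimal sketch of the extra check described above (approximate, not the
    exact upstream hunk):

      static void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u64 val)
      {
              /* SW_INCR has no effect while the PMU as a whole is disabled. */
              if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
                      return;
              /* ... per-counter PMCNTENSET / SW_INCR filtering as before ... */
      }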
     
  • commit 21aecdbd7f3ab02c9b82597dc733ee759fb8b274 upstream.

    KVM's inject_abt64() injects an external-abort into an aarch64 guest.
    The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but
    for an aarch32 guest inject_abt32() injects an implementation-defined
    exception, 'Lockdown fault'.

    Change this to external abort. For non-LPAE we now get the documented:
    | Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00
    and for LPAE:
    | Unhandled fault: synchronous external abort (0x210) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit 018f22f95e8a6c3e27188b7317ef2c70a34cb2cd upstream.

    Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected
    exception to a non-LPAE aarch32 guest.

    The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault
    (Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on
    instruction cache maintenance". This fault is hooked by
    do_translation_fault() since ARMv6, which goes on to silently 'handle'
    the exception, and restart the faulting instruction.

    It turns out, when TTBCR.EAE is clear DFSR is split, and FS[4] has
    to shuffle up to DFSR[10].

    As KVM only does this in one place, fix up the static values. We
    now get the expected:
    | Unhandled fault: lock abort (0x404) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
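
    Worked example of the short-descriptor split described above (illustrative C):

      u32 fs   = 0x14;                            /* IMPLEMENTATION DEFINED (Lockdown fault) */
      u32 dfsr = (fs & 0xf) | ((fs >> 4) << 10);  /* FS[3:0] -> DFSR[3:0], FS[4] -> DFSR[10] */
      /* dfsr == 0x404, matching the "lock abort (0x404)" report above */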
     
  • commit cf2d23e0bac9f6b5cd1cba8898f5f05ead40e530 upstream.

    kvm_test_age_hva() is called from mmu_notifier_test_young(), but the wrong
    address range has been passed to handle_hva_to_gpa(). With the wrong
    address range, no young bits will be checked in handle_hva_to_gpa(),
    which means zero is always returned from mmu_notifier_test_young().

    This fixes the issue by passing the correct address range to the underlying
    function handle_hva_to_gpa(), so that the hardware young (access) bit
    will be visited.

    Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging")
    Signed-off-by: Gavin Shan
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
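
    A minimal sketch of the fix described above (the exact function body may
    differ slightly):

      int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
      {
              if (!kvm->arch.pgd)
                      return 0;
              /* Pass a full page rather than a zero-length range, so that
               * handle_hva_to_gpa() actually visits the young/access bit. */
              return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
                                       kvm_test_age_hva_handler, NULL);
      }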
     
  • commit 8c58be34494b7f1b2adb446e2d8beeb90e5de65b upstream.

    Saving/restoring an unmapped collection is a valid scenario. For
    example this happens if a MAPTI command was sent, featuring an
    unmapped collection. At the moment the CTE fails to be restored.
    Only compare against the number of online vcpus if the rdist
    base is set.

    Fixes: ea1ad53e1e31a ("KVM: arm64: vgic-its: Collection table save/restore")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Zenghui Yu
    Link: https://lore.kernel.org/r/20191213094237.19627-1-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     

11 Feb, 2020

8 commits

  • [ Upstream commit 42cde48b2d39772dba47e680781a32a6c4b7dc33 ]

    Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
    on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
    this allows x86 to create large mappings for read-only memslots that
    are backed by HugeTLB mappings.

    Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
    support") states "If the largepage contains write-protected pages, a
    large pte is not used.", but "write-protected" refers to pages that are
    temporarily read-only, e.g. read-only memslots didn't even exist at the
    time.

    Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]

    Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit 736c291c9f36b07f8889c61764c28edce20e715d ]

    Convert a plethora of parameters and variables in the MMU and page fault
    flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

    Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
    addresses. When TDP is enabled, the fault address is a guest physical
    address and thus can be a 64-bit value, even when both KVM and its guest
    are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
    64-bit field, not a natural width field.

    Using a gva_t for the fault address means KVM will incorrectly drop the
    upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
    translate L2 GPAs to L1 GPAs.

    Opportunistically rename variables and parameters to better reflect the
    dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
    "addr" instead of "vaddr" when the address may be either a GVA or an L2
    GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
    a confusing "gpa_t gva" declaration; this also sets the stage for a
    future patch to combine nonpaging_page_fault() and tdp_page_fault() with
    minimal churn.

    Sprinkle in a few comments to document flows where an address is known
    to be a GVA and thus can be safely truncated to a 32-bit value. Add
    WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
    document such cases and detect bugs.

    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
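
    For context, the two address types involved (paraphrased from
    include/linux/kvm_types.h):

      typedef unsigned long gva_t;  /* guest virtual address: 32 bits on a 32-bit kernel */
      typedef u64           gpa_t;  /* guest physical address: always 64 bits */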
     
  • commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream.

    __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
    * relatively expensive
    * in certain cases (such as when done from atomic context) cannot be called

    Stashing gfn-to-pfn mapping should help with both cases.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream.

    kvm_vcpu_(un)map operates on gfns from any current address space.
    In certain cases we want to make sure we are not mapping SMRAM
    and for that we can use kvm_(un)map_gfn() that we are introducing
    in this patch.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit b6ae256afd32f96bec0117175b329d0dd617655e upstream.

    On AArch64 you can do a sign-extended load to either a 32-bit or 64-bit
    register, and we should only sign extend the register up to the width of
    the register as specified in the operation (by using the 32-bit Wn or
    64-bit Xn register specifier).

    As it turns out, the architecture provides this decoding information in
    the SF ("Sixty-Four" -- how cute...) bit.

    Let's take advantage of this with the usual 32-bit/64-bit header file
    dance and do the right thing on AArch64 hosts.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191212195055.5541-1-christoffer.dall@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
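
    A rough sketch of the decode described above (helper and macro names are
    assumptions based on the surrounding MMIO handling code):

      if (kvm_vcpu_dabt_issext(vcpu) && len < sizeof(unsigned long)) {
              mask = 1U << ((len * 8) - 1);
              data = (data ^ mask) - mask;      /* sign-extend from the access size */
      }

      /* If the destination is a 32-bit Wn register (SF clear), don't let the
       * sign extension spill into bits [63:32]. */
      if (!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF))
              data = lower_32_bits(data);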
     
  • commit 1cfbb484de158e378e8971ac40f3082e53ecca55 upstream.

    Confusingly, there are three SPSR layouts that a kernel may need to deal
    with:

    (1) An AArch64 SPSR_ELx view of an AArch64 pstate
    (2) An AArch64 SPSR_ELx view of an AArch32 pstate
    (3) An AArch32 SPSR_* view of an AArch32 pstate

    When the KVM AArch32 support code deals with SPSR_{EL2,HYP}, it's either
    dealing with #2 or #3 consistently. On arm64 the PSR_AA32_* definitions
    match the AArch64 SPSR_ELx view, and on arm the PSR_AA32_* definitions
    match the AArch32 SPSR_* view.

    However, when we inject an exception into an AArch32 guest, we have to
    synthesize the AArch32 SPSR_* that the guest will see. Thus, an AArch64
    host needs to synthesize layout #3 from layout #2.

    This patch adds a new host_spsr_to_spsr32() helper for this, and makes
    use of it in the KVM AArch32 support code. For arm64 we need to shuffle
    the DIT bit around, and remove the SS bit, while for arm we can use the
    value as-is.

    I've open-coded the bit manipulation for now to avoid having to rework
    the existing PSR_* definitions into PSR64_AA32_* and PSR32_AA32_*
    definitions. I hope to perform a more thorough refactoring in future so
    that we can handle pstate view manipulation more consistently across the
    kernel tree.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-4-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit 3c2483f15499b877ccb53250d88addb8c91da147 upstream.

    When KVM injects an exception into a guest, it generates the CPSR value
    from scratch, configuring CPSR.{M,A,I,T,E}, and setting all other
    bits to zero.

    This isn't correct, as the architecture specifies that some CPSR bits
    are (conditionally) cleared or set upon an exception, and others are
    unchanged from the original context.

    This patch adds logic to match the architectural behaviour. To make this
    simple to follow/audit/extend, documentation references are provided,
    and bits are configured in order of their layout in SPSR_EL2. This
    layout can be seen in the diagram on ARM DDI 0487E.a page C5-426.

    Note that this code is used by both arm and arm64, and is intended to
    function with the SPSR_EL2 and SPSR_HYP layouts.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-3-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

31 Dec, 2019

1 commit

  • commit 6d674e28f642e3ff676fbae2d8d1b872814d32b6 upstream.

    A device mapping is normally always mapped at Stage-2, since there
    is very little gain in having it faulted in.

    Nonetheless, it is possible to end up in a situation where the device
    mapping has been removed from Stage-2 (userspace munmapped the VFIO
    region, and the MMU notifier did its job), but is still present in a
    userspace mapping (userspace has mapped it back at the same address). In
    such a situation, the device mapping will be demand-paged as the guest
    performs memory accesses.

    This requires us to be careful when dealing with mapping size and cache
    management, and to handle potential execution of a device mapping.

    Reported-by: Alexandru Elisei
    Signed-off-by: Marc Zyngier
    Tested-by: Alexandru Elisei
    Reviewed-by: James Morse
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191211165651.7889-2-maz@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

16 Dec, 2019

1 commit

  • This is the 5.4.3 stable release

    Conflicts:
    drivers/cpufreq/imx-cpufreq-dt.c
    drivers/spi/spi-fsl-qspi.c

    The conflicts are very minor and were fixed up during the merge. The imx-cpufreq-dt.c
    conflict is just a one-line code-style change; the upstream version was taken, with
    no functional change.

    The spi-fsl-qspi.c file had a minor conflict when merging the upstream fix c69b17da53b2
    ("spi: spi-fsl-qspi: Clear TDH bits in FLSHCR register").

    After the merge, a basic boot sanity test and a basic QSPI test were done on i.MX.

    Signed-off-by: Jason Liu

    Jason Liu
     

13 Dec, 2019

1 commit

  • commit ca185b260951d3b55108c0b95e188682d8a507b7 upstream.

    It's possible that two LPIs are located at the same "byte_offset" but target
    two different vcpus, so their pending status is indicated by two
    different pending tables. In such a scenario, the last_byte_offset
    optimization will lead KVM to rely on the wrong pending table entry.
    Let's use last_ptr instead, which can be treated as a byte index into
    a pending table and can also be vcpu specific.

    Fixes: 280771252c1b ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES")
    Cc: stable@vger.kernel.org
    Signed-off-by: Zenghui Yu
    Signed-off-by: Marc Zyngier
    Acked-by: Eric Auger
    Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Zenghui Yu
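
    A simplified sketch of the change described above (illustrative; the real
    code is in vgic_v3_save_pending_tables()):

      gpa_t pendbase = GICR_PENDBASER_ADDRESS(vcpu->arch.vgic_cpu.pendbaser);
      gpa_t ptr      = pendbase + byte_offset;

      /* Key the one-byte read cache on the full guest address (vcpu specific),
       * not on byte_offset alone, which can collide across pending tables. */
      if (ptr != last_ptr) {
              ret = kvm_read_guest_lock(kvm, ptr, &val, 1);
              if (ret)
                      return ret;
              last_ptr = ptr;
      }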
     

25 Nov, 2019

3 commits

  • FSL-MC bus devices use device-ids from 0x10000 to 0x20000.
    So, to support MSI interrupts for mc-bus devices, we need a
    vgic-ITS device-id table of size 2^17 to cover the device-id
    range from 0x10000 to 0x20000.

    Signed-off-by: Bharat Bhushan

    Bharat Bhushan
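
    (For reference, 0x20000 = 2^17 = 131072, which is why a 2^17-entry
    device-id table is needed to cover IDs up to 0x20000.)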
     
  • Instead of hardcoding checks for qman cacheable
    mmio region physical addresses, extract the mapping
    information from the user-space mapping.
    This involves several steps:
    - get access to a pte that is part of the user-space mapping
    by using the get_locked_pte() / pte_unmap_unlock() APIs
    - extract the memtype (normal / device) and shareability from
    the pte
    - convert them to S2 translation bits in the newly added
    function stage1_to_stage2_pgprot()
    - finish making the S2 translation with the obtained bits

    Another explored option was using vm_area_struct::vm_page_prot
    which is set in vfio-mc mmap code to the correct page bits.
    However, experiments show that these bits are later altered
    in the generic mmap code (e.g. the shareability bit is always
    set on arm64).
    The only place where the original bits can still be found
    is the user-space mapping, using the method described above.

    Signed-off-by: Laurentiu Tudor
    [Bharat - Fixed mem_type check issue]
    [changed "ifdef ARM64" to CONFIG_ARM64]
    Signed-off-by: Bharat Bhushan
    [Ioana - added a sanity check for hugepages]
    Signed-off-by: Ioana Ciornei
    [Fixed format issues]
    Signed-off-by: Diana Craciun

    Laurentiu Tudor
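
    A heavily simplified sketch of the flow described above;
    stage1_to_stage2_pgprot() is the new helper named in the message, while the
    other details (signature, attribute handling) are assumptions:

      pte_t *ptep;
      spinlock_t *ptl;

      /* Look up the pte backing the user-space mapping of the region. */
      ptep = get_locked_pte(current->mm, hva, &ptl);
      if (ptep) {
              /* Derive memtype/shareability from the stage-1 pte and convert
               * them to stage-2 attribute bits. */
              pgprot_t s2_prot = stage1_to_stage2_pgprot(__pgprot(pte_val(*ptep)));

              pte_unmap_unlock(ptep, ptl);
              /* ... finish building the stage-2 translation with s2_prot ... */
      }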
     
  • Add a parameter that allows specifying the S2 page table
    protection and type bits, and update the callers
    accordingly.
    The parameter will be used in a forthcoming patch.

    Signed-off-by: Laurentiu Tudor

    Laurentiu Tudor
     

15 Nov, 2019

1 commit


14 Nov, 2019

1 commit

  • On a system without KVM_COMPAT, we prevent IOCTLs from being issued
    by a compat task. Although this prevents most silly things from
    happening, it can still confuse a 32bit userspace that is able
    to open the kvm device (the qemu test suite seems to be pretty
    mad with this behaviour).

    Take a more radical approach and return a -ENODEV to the compat
    task.

    Reported-by: Peter Maydell
    Signed-off-by: Marc Zyngier
    Signed-off-by: Paolo Bonzini

    Marc Zyngier
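
    A minimal sketch of the approach (approximate; the real patch wires this up
    through the kvm file_operations):

      #ifndef CONFIG_KVM_COMPAT
      static int kvm_no_compat_open(struct inode *inode, struct file *file)
      {
              /* Refuse to hand /dev/kvm to a 32bit task at open time, instead
               * of letting it open the device and then failing every ioctl. */
              return is_compat_task() ? -ENODEV : 0;
      }
      #endif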
     

13 Nov, 2019

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Fix unwinding of KVM_CREATE_VM failure, VT-d posted interrupts,
    DAX/ZONE_DEVICE, and module unload/reload"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved
    KVM: VMX: Introduce pi_is_pir_empty() helper
    KVM: VMX: Do not change PID.NDST when loading a blocked vCPU
    KVM: VMX: Consider PID.PIR to determine if vCPU has pending interrupts
    KVM: VMX: Fix comment to specify PID.ON instead of PIR.ON
    KVM: X86: Fix initialization of MSR lists
    KVM: fix placement of refcount initialization
    KVM: Fix NULL-ptr deref after kvm_create_vm fails

    Linus Torvalds
     

12 Nov, 2019

1 commit

  • Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
    instead manually handle ZONE_DEVICE on a case-by-case basis. For things
    like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
    pages, e.g. put pages grabbed via gup(). But for flows such as setting
    A/D bits or shifting refcounts for transparent huge pages, KVM needs
    to avoid processing ZONE_DEVICE pages as the flows in question lack the
    underlying machinery for proper handling of ZONE_DEVICE pages.

    This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
    when running a KVM guest backed with /dev/dax memory, as KVM straight up
    doesn't put any references to ZONE_DEVICE pages acquired by gup().

    Note, Dan Williams proposed an alternative solution of doing put_page()
    on ZONE_DEVICE pages immediately after gup() in order to simplify the
    auditing needed to ensure is_zone_device_page() is called if and only if
    the backing device is pinned (via gup()). But that approach would break
    kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
    unmap() when accessing guest memory, unlike KVM's secondary MMU, which
    coordinates with mmu_notifier invalidations to avoid creating stale
    page references, i.e. doesn't rely on pages being pinned.

    [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

    Reported-by: Adam Borowski
    Analyzed-by: David Hildenbrand
    Acked-by: Dan Williams
    Cc: stable@vger.kernel.org
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
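
    A sketch of the case-by-case helper described above (close to, but not
    necessarily identical to, the upstream version):

      bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
      {
              /* is_zone_device_page() is only trustworthy once the page has
               * been pinned (e.g. via gup()), hence the refcount sanity check. */
              if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
                      return false;

              return is_zone_device_page(pfn_to_page(pfn));
      }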
     

11 Nov, 2019

2 commits

  • Reported by syzkaller:

    =============================
    WARNING: suspicious RCU usage
    -----------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    no locks held by repro_11/12688.

    stack backtrace:
    Call Trace:
    dump_stack+0x7d/0xc5
    lockdep_rcu_suspicious+0x123/0x170
    kvm_dev_ioctl+0x9a9/0x1260 [kvm]
    do_vfs_ioctl+0x1a1/0xfb0
    ksys_ioctl+0x6d/0x80
    __x64_sys_ioctl+0x73/0xb0
    do_syscall_64+0x108/0xaa0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Commit a97b0e773e4 (kvm: call kvm_arch_destroy_vm if vm creation fails)
    sets users_count to 1 before kvm_arch_init_vm(); however, if kvm_arch_init_vm()
    fails, we need to decrease this count. By moving it earlier, we can push
    the decrease to out_err_no_arch_destroy_vm without introducing yet another
    error label.

    syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000

    Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
    Fixes: a97b0e773e49 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
    Cc: Jim Mattson
    Analyzed-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Reported by syzkaller:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
    RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
    Call Trace:
    kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
    kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
    vfs_ioctl fs/ioctl.c:46 [inline]
    file_ioctl fs/ioctl.c:509 [inline]
    do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
    ksys_ioctl+0x62/0x90 fs/ioctl.c:713
    __do_sys_ioctl fs/ioctl.c:720 [inline]
    __se_sys_ioctl fs/ioctl.c:718 [inline]
    __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
    do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Commit 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
    moves the memslots and buses allocations around; however, if kvm->srcu/irq_srcu fails
    initialization, NULL will be returned instead of an error code. NULL will not be intercepted
    in kvm_dev_ioctl_create_vm() and will be dereferenced by kvm_coalesced_mmio_init(). This
    patch fixes it.

    Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
    was also reported by syzkaller:

    wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
    __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
    synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
    synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
    kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
    kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
    kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
    kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]

    so do it.

    Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
    Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
    Fixes: 9121923c457d ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
    Cc: Jim Mattson
    Cc: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
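
    A minimal sketch of the error handling described above (labels are
    illustrative):

      r = -ENOMEM;
      if (init_srcu_struct(&kvm->srcu))
              goto out_err_no_srcu;
      if (init_srcu_struct(&kvm->irq_srcu))
              goto out_err_no_irq_srcu;
      /* ... rest of kvm_create_vm() ... */

      out_err_no_irq_srcu:
              cleanup_srcu_struct(&kvm->srcu);
      out_err_no_srcu:
              /* ... */
              return ERR_PTR(r);   /* an error code, never NULL */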
     

05 Nov, 2019

1 commit

  • The page table pages corresponding to broken down large pages are zapped in
    FIFO order, so that the large page can potentially be recovered, if it is
    no longer being used for execution. This removes the performance penalty
    for walking deeper EPT page tables.

    By default, one large page will last about one hour once the guest
    reaches a steady state.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Thomas Gleixner

    Junaid Shahid
     

04 Nov, 2019

1 commit


31 Oct, 2019

1 commit

  • In kvm_create_vm(), if we've successfully called kvm_arch_init_vm(), but
    then fail later in the function, we need to call kvm_arch_destroy_vm()
    so that it can do any necessary cleanup (like freeing memory).

    Fixes: 44a95dae1d229a ("KVM: x86: Detect and Initialize AVIC support")

    Signed-off-by: John Sperbeck
    Signed-off-by: Jim Mattson
    Reviewed-by: Junaid Shahid
    [Remove dependency on "kvm: Don't clear reference count on
    kvm_create_vm() error path" which was not committed. - Paolo]
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

25 Oct, 2019

1 commit


22 Oct, 2019

2 commits


20 Oct, 2019

3 commits

  • The PMU emulation code uses the perf event sample period to trigger
    the overflow detection. This works fine for the *first* overflow
    handling, but results in a huge number of interrupts on the host,
    unrelated to the number of interrupts handled in the guest (a x20
    factor is pretty common for the cycle counter). On a slow system
    (such as a SW model), this can result in the guest only making
    forward progress at a glacial pace.

    It turns out that the clue is in the name. The sample period is
    exactly that: a period. And once an overflow has occurred,
    the following period should be the full width of the associated
    counter, instead of whatever the guest had initially programmed.

    Reset the sample period to the architected value in the overflow
    handler, which now results in a number of host interrupts that is
    much closer to the number of interrupts in the guest.

    Fixes: b02386eb7dac ("arm64: KVM: Add PMU overflow interrupt routing")
    Reviewed-by: Andrew Murray
    Signed-off-by: Marc Zyngier

    Marc Zyngier
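
    A condensed sketch of the overflow handler change described above (helper
    names assumed; simplified from the description):

      static void kvm_pmu_perf_overflow(struct perf_event *perf_event,
                                        struct perf_sample_data *data,
                                        struct pt_regs *regs)
      {
              struct kvm_pmc *pmc = perf_event->overflow_handler_context;
              struct kvm_vcpu *vcpu = kvm_pmc_to_vcpu(pmc);
              u64 period;

              /* Reset the sample period to the architected counter width, i.e.
               * the distance to the next overflow, rather than whatever the
               * guest programmed initially. */
              period = -(local64_read(&perf_event->count));
              if (!kvm_pmu_idx_is_64bit(vcpu, pmc->idx))
                      period &= GENMASK(31, 0);

              local64_set(&perf_event->hw.period_left, 0);
              perf_event->attr.sample_period = period;
              perf_event->hw.sample_period = period;

              /* ... mark the overflow and kick the vcpu as before ... */
      }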
     
  • The current convention for KVM to request a chained event from the
    host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).

    But as it turns out, this bit gets set *after* we create the kernel
    event that backs our virtual counter, meaning that we never get
    a 64bit counter.

    Moving the setting to an earlier point solves the problem.

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Reviewed-by: Andrew Murray
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • When a counter is disabled, its value is sampled before the event
    is disabled, and the value is written back to the shadow register.

    In that process, the value gets truncated to 32bit, which is adequate
    for any counter but the cycle counter (defined as a 64bit counter).

    This obviously results in a corrupted counter, and things like
    "perf record -e cycles" not working at all when run in a guest...
    A similar, but less critical bug exists in kvm_pmu_get_counter_value.

    Make the truncation conditional on the counter not being the cycle
    counter, which results in a minor code reorganisation.

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Reviewed-by: Andrew Murray
    Reported-by: Julien Thierry
    Signed-off-by: Marc Zyngier

    Marc Zyngier
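
    A minimal sketch of the conditional truncation described above
    (illustrative; based on the kvm_pmu_stop_counter() write-back path):

      counter = kvm_pmu_get_pair_counter_value(vcpu, pmc);

      /* Event counters are 32 bits wide, but the cycle counter is 64 bits
       * and must not be truncated when written back to its shadow register. */
      if (pmc->idx == ARMV8_PMU_CYCLE_IDX) {
              reg = PMCCNTR_EL0;
              val = counter;
      } else {
              reg = PMEVCNTR0_EL0 + pmc->idx;
              val = lower_32_bits(counter);
      }
      __vcpu_sys_reg(vcpu, reg) = val;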
     

03 Oct, 2019

1 commit


01 Oct, 2019

1 commit