01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     


29 Sep, 2018

1 commit

  • Michael writes:
    "powerpc fixes for 4.19 #3

    A reasonably big batch of fixes due to me being away for a few weeks.

    A fix for the TM emulation support on Power9, which could result in
    corrupting the guest r11 when running under KVM.

    Two fixes to the TM code which could lead to userspace GPR corruption
    if we take an SLB miss at exactly the wrong time.

    Our dynamic patching code had a bug that meant we could patch freed
    __init text, which could lead to corrupting userspace memory.

    csum_ipv6_magic() didn't work on little endian platforms since we
    optimised it recently.

    A fix for an endian bug when reading a device tree property telling
    us how many storage keys the machine has available.

    Fix a crash seen on some configurations of PowerVM when migrating the
    partition from one machine to another.

    A fix for a regression in the setup of our CPU to NUMA node mapping
    in KVM guests.

    A fix to our selftest Makefiles to make them work since a recent
    change to the shared Makefile logic."

    * tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    selftests/powerpc: Fix Makefiles for headers_install change
    powerpc/numa: Use associativity if VPHN hcall is successful
    powerpc/tm: Avoid possible userspace r1 corruption on reclaim
    powerpc/tm: Fix userspace r13 corruption
    powerpc/pseries: Fix uninitialized timer reset on migration
    powerpc/pkeys: Fix reading of ibm,processor-storage-keys property
    powerpc: fix csum_ipv6_magic() on little endian platforms
    powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
    powerpc: Avoid code patching freed init sections
    KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds

    Greg Kroah-Hartman
     

28 Sep, 2018

2 commits

  • Update device_add_disk() to take a 'groups' argument so that
    individual drivers can register a device with additional sysfs
    attributes.
    This avoids the race condition the driver would otherwise have if
    these groups were created with sysfs_create_groups() after the
    device was made visible.

    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • Commit

    1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active")

    can occasionally cause system resets when kexec-ing a second kernel even
    if SEV is not active.

    That's because get_sev_encryption_bit() uses 32-bit rIP-relative
    addressing to read the value of enc_bit - a variable which caches a
    previously detected encryption bit position - but kexec may allocate
    the early boot code to a higher location, beyond the 32-bit addressing
    limit.

    In this case, garbage will be read and get_sev_encryption_bit() will
    return the wrong value, leading to accessing memory with the wrong
    encryption setting.

    Therefore, remove enc_bit, and thus get rid of the need to do 32-bit
    rIP-relative addressing in the first place.

    [ bp: massage commit message heavily. ]

    Fixes: 1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active")
    Suggested-by: Borislav Petkov
    Signed-off-by: Kairui Song
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tom Lendacky
    Cc: linux-kernel@vger.kernel.org
    Cc: tglx@linutronix.de
    Cc: mingo@redhat.com
    Cc: hpa@zytor.com
    Cc: brijesh.singh@amd.com
    Cc: kexec@lists.infradead.org
    Cc: dyoung@redhat.com
    Cc: bhe@redhat.com
    Cc: ghook@redhat.com
    Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com

    Kairui Song
     


25 Sep, 2018

6 commits

  • Currently associativity is used to look up the node id even if the
    preceding VPHN hcall failed. However, this can cause a CPU to be made
    part of the wrong node (most likely node 0), because VPHN is not
    enabled on KVM guests.

    With 2ea6263 ("powerpc/topology: Get topology for shared processors at
    boot"), the associativity is used to set the wrong node, so KVM
    guest topology is broken.

    For example, a 4-node KVM guest would previously have reported:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3
    node 0 size: 1746 MB
    node 0 free: 1604 MB
    node 1 cpus: 4 5 6 7
    node 1 size: 2044 MB
    node 1 free: 1765 MB
    node 2 cpus: 8 9 10 11
    node 2 size: 2044 MB
    node 2 free: 1837 MB
    node 3 cpus: 12 13 14 15
    node 3 size: 2044 MB
    node 3 free: 1903 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    Would now report:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 1746 MB
    node 0 free: 1244 MB
    node 1 cpus:
    node 1 size: 2044 MB
    node 1 free: 2032 MB
    node 2 cpus: 1
    node 2 size: 2044 MB
    node 2 free: 2028 MB
    node 3 cpus:
    node 3 size: 2044 MB
    node 3 free: 2032 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    Fix this by skipping associativity lookup if the VPHN hcall failed.

    Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman

    Srikar Dronamraju
     
  • Currently we store the userspace r1 to PACATMSCRATCH before finally
    saving it to the thread struct.

    In theory an exception could be taken here (like a machine check or
    SLB miss) that could write PACATMSCRATCH and hence corrupt the
    userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
    others do.

    We've never actually seen this happen but it's theoretically
    possible. Either way, the code is fragile as it is.

    This patch saves r1 to the kernel stack (which can't fault) before we
    turn MSR[RI] back on. PACATMSCRATCH is still used but only with
    MSR[RI] off. We then copy r1 from the kernel stack to the thread
    struct once we have MSR[RI] back on.

    Suggested-by: Breno Leitao
    Signed-off-by: Michael Neuling
    Signed-off-by: Michael Ellerman

    Michael Neuling
     
  • When we treclaim we store the userspace checkpointed r13 to a scratch
    SPR and then later save the scratch SPR to the user thread struct.

    Unfortunately, this doesn't work as accessing the user thread struct
    can take an SLB fault and the SLB fault handler will write the same
    scratch SPRG that now contains the userspace r13.

    To fix this, we store r13 to the kernel stack (which can't fault)
    before we access the user thread struct.

    Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
    as a random userspace segfault with r13 looking like a kernel address.

    Signed-off-by: Michael Neuling
    Reviewed-by: Breno Leitao
    Signed-off-by: Michael Ellerman

    Michael Neuling
     
  • Building a riscv kernel with CONFIG_FUNCTION_TRACER and
    CONFIG_MODVERSIONS enabled results in these two warnings:

    MODPOST vmlinux.o
    WARNING: EXPORT symbol "return_to_handler" [vmlinux] version generation failed, symbol will not be versioned.
    WARNING: EXPORT symbol "_mcount" [vmlinux] version generation failed, symbol will not be versioned.

    When exporting symbols from an assembly file, the MODVERSIONS code
    requires their prototypes to be defined in asm-prototypes.h (see
    scripts/Makefile.build). Since both of these symbols have prototypes
    defined in linux/ftrace.h, include this header from RISC-V's
    asm-prototypes.h.

    Reported-by: Karsten Merker
    Signed-off-by: James Cowgill
    Signed-off-by: Palmer Dabbelt

    James Cowgill
     
  • We only use it in biovec_phys_mergeable and an m68k paravirt driver,
    so just open-code it there. Also remove the pointless unsigned long
    cast for the offset in the open-coded instances.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Turn the macro into an inline, move it to blk.h and simplify the
    arch hooks a bit.

    Also rename the function to biovec_phys_mergeable as there is no need
    to shout.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2018

1 commit

  • After migration of a powerpc LPAR, the kernel executes code to
    update the system state to reflect new platform characteristics.

    Such changes include modifications to device tree properties provided
    to the system by PHYP. Property notifications received by the
    post_mobility_fixup() code are passed along to the kernel in general
    through a call to of_update_property() which in turn passes such
    events back to all modules through entries like the '.notifier_call'
    function within the NUMA module.

    When the NUMA module updates its state, it resets its event timer. If
    this occurs after a previous call to stop_topology_update() or on a
    system without VPHN enabled, the code runs into an uninitialized timer
    structure and crashes. This patch adds a safety check along this path
    toward the problem code.

    An example crash log is as follows.

    ibmvscsi 30000081: Re-enabling adapter!
    ------------[ cut here ]------------
    kernel BUG at kernel/time/timer.c:958!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
    ...
    NIP mod_timer+0x4c/0x400
    LR reset_topology_timer+0x40/0x60
    Call Trace:
    0xc0000003f9407830 (unreliable)
    reset_topology_timer+0x40/0x60
    dt_update_callback+0x100/0x120
    notifier_call_chain+0x90/0x100
    __blocking_notifier_call_chain+0x60/0x90
    of_property_notify+0x90/0xd0
    of_update_property+0x104/0x150
    update_dt_property+0xdc/0x1f0
    pseries_devicetree_update+0x2d0/0x510
    post_mobility_fixup+0x7c/0xf0
    migration_store+0xa4/0xc0
    kobj_attr_store+0x30/0x60
    sysfs_kf_write+0x64/0xa0
    kernfs_fop_write+0x16c/0x240
    __vfs_write+0x40/0x200
    vfs_write+0xc8/0x240
    ksys_write+0x5c/0x100
    system_call+0x58/0x6c

    Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
    Cc: stable@vger.kernel.org # v3.10+
    Signed-off-by: Michael Bringmann
    Signed-off-by: Michael Ellerman

    Michael Bringmann
     

23 Sep, 2018

2 commits

  • Juergen writes:
    "xen:
    Two small fixes for xen drivers."

    * tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen: issue warning message when out of grant maptrack entries
    xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code

    Greg Kroah-Hartman
     
  • Thomas writes:
    "A set of fixes for x86:

    - Resolve the kvmclock regression on AMD systems with memory
    encryption enabled. The rework of the kvmclock memory allocation
    during early boot results in encrypted storage, which is not
    shareable with the hypervisor. Create a new section for this data
    which is mapped unencrypted, and take care that the later
    allocations for shared kvmclock memory are unencrypted as well.

    - Fix the build regression in the paravirt code introduced by the
    recent spectre v2 updates.

    - Ensure that the initial static page tables cover the fixmap space
    correctly so early console always works. This worked so far by
    chance, but recent modifications to the fixmap layout can -
    depending on kernel configuration - move the relevant entries to a
    different place which is not covered by the initial static page
    tables.

    - Address the regressions and issues which got introduced with the
    recent extensions to the Intel Resource Director Technology code.

    - Update maintainer entries to document reality"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Expand static page table for fixmap space
    MAINTAINERS: Add X86 MM entry
    x86/intel_rdt: Add Reinette as co-maintainer for RDT
    MAINTAINERS: Add Borislav to the x86 maintainers
    x86/paravirt: Fix some warning messages
    x86/intel_rdt: Fix incorrect loop end condition
    x86/intel_rdt: Fix exclusive mode handling of MBA resource
    x86/intel_rdt: Fix incorrect loop end condition
    x86/intel_rdt: Do not allow pseudo-locking of MBA resource
    x86/intel_rdt: Fix unchecked MSR access
    x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
    x86/intel_rdt: Global closid helper to support future fixes
    x86/intel_rdt: Fix size reporting of MBA resource
    x86/intel_rdt: Fix data type in parsing callbacks
    x86/kvm: Use __bss_decrypted attribute in shared variables
    x86/mm: Add .bss..decrypted section to hold shared variables

    Greg Kroah-Hartman
     

21 Sep, 2018

3 commits

  • Paolo writes:
    "It's mostly small bugfixes and cleanups, mostly around x86 nested
    virtualization. One important change, not related to nested
    virtualization, is that the ability for the guest kernel to trap
    CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is
    now masked by default. This is because the feature is detected
    through an MSR; a very bad idea that Intel seems to like more and
    more. Some applications choke if the other fields of that MSR are
    not initialized as on real hardware, hence we have to disable the
    whole MSR by default, as was the case before Linux 4.12."

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits)
    KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs
    kvm: selftests: Add platform_info_test
    KVM: x86: Control guest reads of MSR_PLATFORM_INFO
    KVM: x86: Turbo bits in MSR_PLATFORM_INFO
    nVMX x86: Check VPID value on vmentry of L2 guests
    nVMX x86: check posted-interrupt descriptor address on vmentry of L2
    KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv
    KVM: VMX: check nested state and CR4.VMXE against SMM
    kvm: x86: make kvm_{load|put}_guest_fpu() static
    x86/hyper-v: rename ipi_arg_{ex,non_ex} structures
    KVM: VMX: use preemption timer to force immediate VMExit
    KVM: VMX: modify preemption timer bit only when arming timer
    KVM: VMX: immediately mark preemption timer expired only for zero value
    KVM: SVM: Switch to bitmap_zalloc()
    KVM/MMU: Fix comment in walk_shadow_page_lockless_end()
    kvm: selftests: use -pthread instead of -lpthread
    KVM: x86: don't reset root in kvm_mmu_setup()
    kvm: mmu: Don't read PDPTEs when paging is not enabled
    x86/kvm/lapic: always disable MMIO interface in x2APIC mode
    KVM: s390: Make huge pages unavailable in ucontrol VMs
    ...

    Greg Kroah-Hartman
     
  • We met a kernel panic when enabling earlycon, because the fixmap
    address of earlycon is not statically set up.

    Currently the static fixmap setup in head_64.S only covers 2M virtual
    address space, while it actually could be in 4M space with different
    kernel configurations, e.g. when VSYSCALL emulation is disabled.

    So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
    and add a build time check to ensure that the fixmap is covered by the
    initial static page tables.

    Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable")
    Suggested-by: Thomas Gleixner
    Signed-off-by: Feng Tang
    Signed-off-by: Thomas Gleixner
    Tested-by: kernel test robot
    Reviewed-by: Juergen Gross (Xen parts)
    Cc: H Peter Anvin
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Cc: Yinghai Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Andy Lutomirsky
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com

    Feng Tang
     
  • The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set
    their return value in the "r" local variable and break out of the
    switch block when they encounter an error.
    This is because vcpu_load() is called before the switch block, which
    has a matching vcpu_put() cleanup afterwards.

    However, the KVM_{GET,SET}_NESTED_STATE IOCTL handlers just return
    immediately on error without performing the above-mentioned cleanup.

    Thus, change these handlers to behave as expected.

    Fixes: 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")

    Reviewed-by: Mark Kanda
    Reviewed-by: Patrick Colp
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     

20 Sep, 2018

19 commits

  • scan_pkey_feature() uses of_property_read_u32_array() to read the
    ibm,processor-storage-keys property and calls be32_to_cpu() on the
    value it gets. The problem is that of_property_read_u32_array() already
    returns the value converted to the CPU byte order.

    The value of pkeys_total ends up more or less sane because there's a min()
    call in pkey_initialize() which reduces pkeys_total to 32. So in practice
    the kernel ignores the fact that the hypervisor reserved one key for
    itself (the device tree advertises 31 keys in my test VM).

    This is wrong, but the effect in practice is that when a process
    tries to allocate the 32nd key, it gets an -EINVAL error instead of
    -ENOSPC, which would indicate that there aren't any keys available.

    Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
    Cc: stable@vger.kernel.org # v4.16+
    Signed-off-by: Thiago Jung Bauermann
    Signed-off-by: Michael Ellerman

    Thiago Jung Bauermann
     
  • On little endian platforms, csum_ipv6_magic() keeps len and proto in
    CPU byte order. This generates bad results, leading to ICMPv6
    packets from other hosts being dropped by powerpc64le platforms.

    In order to fix this, len and proto should be converted to network
    byte order, i.e. big endian byte order. However, checksumming
    0x12345678 and 0x56341278 provides the exact same result, so it is
    enough to rotate the sum of len and proto by 1 byte.

    PPC32 only supports big endian, so the fix is needed for PPC64 only.

    Fixes: e9c4943a107b ("powerpc: Implement csum_ipv6_magic in assembly")
    Reported-by: Jianlin Shi
    Reported-by: Xin Long
    Cc: # 4.18+
    Signed-off-by: Christophe Leroy
    Tested-by: Xin Long
    Signed-off-by: Michael Ellerman

    Christophe Leroy
     
  • mpe: This was fixed originally in commit d3d4ffaae439
    ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but
    contrary to what the merge commit says was inadvertently lost by me in
    commit ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next") which
    brought in changes that moved the code to a new file. So reapply it to
    the new file.

    Original commit message follows:

    We use PHB in mode1, which uses bit 59 to select a correct DMA
    window. However there is mode2, which uses bits 59:55 and allows up
    to 32 DMA windows per PE.

    Even though the documentation does not clearly specify that, it
    seems that the actual hardware does not support bits 59:55 even in
    mode1; in other words we can create a window as big as 1<
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     
  • Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
    to reads of MSR_PLATFORM_INFO.

    Disabling access to reads of this MSR gives userspace the control to "expose"
    this platform-dependent information to guests in a clear way. As it exists
    today, guests that read this MSR would get unpopulated information if userspace
    hadn't already set it (and prior to this patch series, only the CPUID faulting
    information could have been populated). This existing interface could be
    confusing if guests don't handle the potential for incorrect/incomplete
    information gracefully (e.g. zero reported for base frequency).

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
     
  • Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously,
    only the CPUID faulting bit was settable; now any bit in
    MSR_PLATFORM_INFO is settable. This can be used, for example, to
    convey frequency information about the platform on which the guest is
    running.

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
     
  • According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
    following check needs to be enforced on vmentry of L2 guests:

    If the 'enable VPID' VM-execution control is 1, the value of the
    VPID VM-execution control field must not be 0000H.

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Mark Kanda
    Reviewed-by: Liran Alon
    Reviewed-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • According to section "Checks on VMX Controls" in Intel SDM vol 3C,
    the following check needs to be enforced on vmentry of L2 guests:

    - Bits 5:0 of the posted-interrupt descriptor address are all 0.
    - The posted-interrupt descriptor address does not set any bits
    beyond the processor's physical-address width.

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Mark Kanda
    Reviewed-by: Liran Alon
    Reviewed-by: Darren Kenny
    Reviewed-by: Karl Heubaum
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • In case L1 does not intercept L2 HLT or enters L2 in HLT
    activity-state, it is possible for a vCPU to be blocked while it is
    in guest-mode.

    According to Intel SDM 26.6.5 Interrupt-Window Exiting and
    Virtual-Interrupt Delivery: "These events wake the logical processor
    if it just entered the HLT state because of a VM entry".
    Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
    deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
    should be woken from the HLT state and injected with the interrupt.

    In addition, if while the vCPU is blocked (while it is in guest-mode),
    it receives a nested posted-interrupt, then the vCPU should also be
    woken and injected with the posted interrupt.

    To handle these cases, this patch enhances kvm_vcpu_has_events() to also
    check if there is a pending interrupt in L2 virtual APICv provided by
    L1. That is, it evaluates if there is a pending virtual interrupt for L2
    by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
    Evaluation of Pending Interrupts.

    Note that this also handles the case of nested posted-interrupt by the
    fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
    called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
    kvm_vcpu_running() -> vmx_check_nested_events() ->
    vmx_complete_nested_posted_interrupt().

    Reviewed-by: Nikita Leshenko
    Reviewed-by: Darren Kenny
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     
  • VMX cannot be enabled under SMM, check it when CR4 is set and when nested
    virtualization state is restored.

    This should fix some WARNs reported by syzkaller, mostly around
    alloc_shadow_vmcs.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The functions
    kvm_load_guest_fpu()
    kvm_put_guest_fpu()

    are only used locally, so make them static. This also requires
    moving both functions, because they are used before their
    definitions. Those functions were exported (via EXPORT_SYMBOL)
    before commit e5bb40251a920 ("KVM: Drop kvm_{load,put}_guest_fpu()
    exports").

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Paolo Bonzini

    Sebastian Andrzej Siewior
     
  • These structures are going to be used from KVM code so let's make
    their names reflect their Hyper-V origin.

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Roman Kagan
    Acked-by: K. Y. Srinivasan
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • A VMX preemption timer value of '0' is guaranteed to cause a VMExit
    prior to the CPU executing any instructions in the guest. Use the
    preemption timer (if it's supported) to trigger immediate VMExit
    in place of the current method of sending a self-IPI. This ensures
    that pending VMExit injection to L1 occurs prior to executing any
    instructions in the guest (regardless of nesting level).

    When deferring VMExit injection, KVM generates an immediate VMExit
    from the (possibly nested) guest by sending itself an IPI. Because
    hardware interrupts are blocked prior to VMEnter and are unblocked
    (in hardware) after VMEnter, this results in taking a VMExit(INTR)
    before any guest instruction is executed. But, as this approach
    relies on the IPI being received before VMEnter executes, it only
    works as intended when KVM is running as L0. Because there are no
    architectural guarantees regarding when IPIs are delivered, when
    running nested the INTR may "arrive" long after L2 is running e.g.
    L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

    For the most part, this unintended delay is not an issue since the
    events being injected to L1 also do not have architectural guarantees
    regarding their timing. The notable exception is the VMX preemption
    timer[1], which is architecturally guaranteed to cause a VMExit prior
    to executing any instructions in the guest if the timer value is '0'
    at VMEnter. Specifically, the delay in injecting the VMExit causes
    the preemption timer KVM unit test to fail when run in a nested guest.

    Note: this approach is viable even on CPUs with a broken preemption
    timer, as broken in this context only means the timer counts at the
    wrong rate. There are no known errata affecting timer value of '0'.

    [1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.

    Signed-off-by: Sean Christopherson
    [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Provide a singular location where the VMX preemption timer bit is
    set/cleared so that future usages of the preemption timer can ensure
    the VMCS bit is up-to-date without having to modify unrelated code
    paths. For example, the preemption timer can be used to force an
    immediate VMExit. Cache the status of the timer to avoid redundant
    VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
    VMEnters/VMExits.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • A VMX preemption timer value of '0' at the time of VMEnter is
    architecturally guaranteed to cause a VMExit prior to the CPU
    executing any instructions in the guest. This architectural
    definition is in place to ensure that a previously expired timer
    is correctly recognized by the CPU as it is possible for the timer
    to reach zero and not trigger a VMexit due to a higher priority
    VMExit being signalled instead, e.g. a pending #DB that morphs into
    a VMExit.

    Whether by design or coincidence, commit f4124500c2c1 ("KVM: nVMX:
    Fully emulate preemption timer") special cased timer values of '0'
    and '1' to ensure prompt delivery of the VMExit. Unlike '0', a
    timer value of '1' has no architectural guarantees regarding
    when it is delivered.

    Modify the timer emulation to trigger immediate VMExit if and only
    if the timer value is '0', and document precisely why '0' is special.
    Do this even if calibration of the virtual TSC failed, i.e. VMExit
    will occur immediately regardless of the frequency of the timer.
    Making only '0' a special case gives KVM leeway to be more aggressive
    in ensuring the VMExit is injected prior to executing instructions in
    the nested guest, and also eliminates any ambiguity as to why '1' is
    a special case, e.g. why wasn't the threshold for a "short timeout"
    set to 10, 100, 1000, etc...

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Switch to bitmap_zalloc() to show clearly what we are allocating.
    Besides that, it returns a pointer of bitmap type instead of an
    opaque void *.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Paolo Bonzini

    Andy Shevchenko
     
  • kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page().
    This patch fixes the comment accordingly.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Tianyu Lan
     
  • Here is the code path which shows that kvm_mmu_setup() is invoked
    after kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in
    this code path, the root_hpa and prev_roots are guaranteed to be
    invalid, and it is not necessary to reset them again.

    kvm_vm_ioctl_create_vcpu()
    kvm_arch_vcpu_create()
    vmx_create_vcpu()
    kvm_vcpu_init()
    kvm_arch_vcpu_init()
    kvm_mmu_create()
    kvm_arch_vcpu_setup()
    kvm_mmu_setup()
    kvm_init_mmu()

    This patch sets reset_roots to false in kvm_mmu_setup().

    Fixes: 50c28f21d045dde8c52548f8482d456b3f0956f5
    Signed-off-by: Wei Yang
    Reviewed-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Wei Yang
     
  • kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
    CR4.PAE = 1.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini

    Junaid Shahid
     
  • When VMX is used with flexpriority disabled (because of no support
    or if disabled with a module parameter), the MMIO interface to the
    lAPIC is still available in x2APIC mode while it shouldn't be
    (kvm-unit-tests):

    PASS: apic_disable: Local apic enabled in x2APIC mode
    PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
    FAIL: apic_disable: *0xfee00030: 50014

    The issue appears because we do basically nothing while switching to
    x2APIC mode when the APIC access page is not used.
    apic_mmio_{read,write} only check whether the lAPIC is disabled
    before proceeding to the actual access.

    When APIC access is virtualized we correctly manipulate the VMX
    controls in vmx_set_virtual_apic_mode() and we don't get vmexits
    from memory writes in x2APIC mode, so there's no issue.

    Disabling MMIO interface seems to be easy. The question is: what do we
    do with these reads and writes? If we add apic_x2apic_mode() check to
    apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
    go to userspace. When lAPIC is in kernel, Qemu uses this interface to
    inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
    somehow works with disabled lAPIC but when we're in xAPIC mode we will
    get a real injected MSI from every write to lAPIC. Not good.

    The simplest solution seems to be to just ignore writes to the region
    and return ~0 for all reads when we're in x2APIC mode. This is what this
    patch does. However, this approach is inconsistent with what currently
    happens when flexpriority is enabled: we allocate APIC access page and
    create KVM memory region so in x2APIC modes all reads and writes go to
    this pre-allocated page which is, btw, the same for all vCPUs.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov