08 Oct, 2020

1 commit

  • Merge Linux stable release v5.4.70 into imx_5.4.y

    * tag 'v5.4.70': (3051 commits)
    Linux 5.4.70
    netfilter: ctnetlink: add a range check for l3/l4 protonum
    ep_create_wakeup_source(): dentry name can change under you...
    ...

    Conflicts:
    arch/arm/mach-imx/pm-imx6.c
    arch/arm64/boot/dts/freescale/imx8mm-evk.dts
    arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
    drivers/crypto/caam/caamalg.c
    drivers/gpu/drm/imx/dw_hdmi-imx.c
    drivers/gpu/drm/imx/imx-ldb.c
    drivers/gpu/drm/imx/ipuv3/ipuv3-crtc.c
    drivers/mmc/host/sdhci-esdhc-imx.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/net/ethernet/freescale/enetc/enetc.c
    drivers/net/ethernet/freescale/enetc/enetc_pf.c
    drivers/thermal/imx_thermal.c
    drivers/usb/cdns3/ep0.c
    drivers/xen/swiotlb-xen.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c

    Signed-off-by: Jason Liu

    Jason Liu
     

01 Oct, 2020

5 commits

  • commit c4ad98e4b72cb5be30ea282fce935248f2300e62 upstream.

    KVM currently assumes that an instruction abort can never be a write.
    This is in general true, except when the abort is triggered by
    an S1PTW on an instruction fetch that tries to update the S1 page
    tables (to set AF, for example).

    This can happen if the page tables have been paged out and brought
    back in without seeing a direct write to them (they are thus marked
    read only), and the fault handling code will make the PT executable(!)
    instead of writable. The guest gets stuck forever.

    In these conditions, the permission fault must be considered as
    a write so that the Stage-1 update can take place. This is essentially
    the I-side equivalent of the problem fixed by 60e21a0ef54c ("arm64: KVM:
    Take S1 walks into account when determining S2 write faults").

    Update kvm_is_write_fault() to return true on IABT+S1PTW, and
    introduce kvm_vcpu_trap_is_exec_fault(), which only returns true
    when not faulting on an S1 page-table walk. Additionally,
    kvm_vcpu_dabt_iss1tw() is renamed to kvm_vcpu_abt_iss1tw(), as the
    above makes it plain that it isn't specific to data aborts.
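
    For reference, the resulting predicate looks roughly like this
    (condensed from the upstream patch; the S1PTW check must come first):

        static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
        {
                if (kvm_vcpu_abt_iss1tw(vcpu))
                        return true;    /* an S1PTW is always a write */

                if (kvm_vcpu_trap_is_iabt(vcpu))
                        return false;   /* a pure instruction abort */

                return kvm_vcpu_dabt_iswrite(vcpu);
        }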

    Signed-off-by: Marc Zyngier
    Reviewed-by: Will Deacon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200915104218.1284701-2-maz@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • [ Upstream commit 57bdb436ce869a45881d8aa4bc5dac8e072dd2b6 ]

    If we're going to fail out of vgic_add_lpi(), let's make sure the
    allocated vgic_irq memory is also freed, though it seems that both
    cases are unlikely to fail.
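
    The change itself is small: drop the reference on the error paths so
    the allocation is released. A sketch of the fixed tail of
    vgic_add_lpi():

        ret = update_lpi_config(kvm, irq, NULL, false);
        if (ret) {
                vgic_put_irq(kvm, irq);         /* frees irq on last put */
                return ERR_PTR(ret);
        }

        ret = vgic_v3_lpi_sync_pending_status(kvm, irq);
        if (ret) {
                vgic_put_irq(kvm, irq);
                return ERR_PTR(ret);
        }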

    Signed-off-by: Zenghui Yu
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200414030349.625-3-yuzenghui@huawei.com
    Signed-off-by: Sasha Levin

    Zenghui Yu
     
  • [ Upstream commit 969ce8b5260d8ec01e6f1949d2927a86419663ce ]

    It's likely that the vcpu fails to handle all virtual interrupts if
    userspace decides to destroy it, leaving the pending ones in the
    ap_list. If an unhandled one is an LPI, its vgic_irq structure will
    eventually be leaked because of an extra refcount increment in
    vgic_queue_irq_unlock().

    This was detected by kmemleak on almost every guest destroy; the
    backtrace is as follows:

    unreferenced object 0xffff80725aed5500 (size 128):
    comm "CPU 5/KVM", pid 40711, jiffies 4298024754 (age 166366.512s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 08 01 a9 73 6d 80 ff ff ...........sm...
    c8 61 ee a9 00 20 ff ff 28 1e 55 81 6c 80 ff ff .a... ..(.U.l...
    backtrace:
    [] kmem_cache_alloc_trace+0x2dc/0x418
    [] vgic_add_lpi+0x88/0x418
    [] vgic_its_cmd_handle_mapi+0x4dc/0x588
    [] vgic_its_process_commands.part.5+0x484/0x1198
    [] vgic_its_process_commands+0x50/0x80
    [] vgic_mmio_write_its_cwriter+0xac/0x108
    [] dispatch_mmio_write+0xd0/0x188
    [] __kvm_io_bus_write+0x134/0x240
    [] kvm_io_bus_write+0xe0/0x150
    [] io_mem_abort+0x484/0x7b8
    [] kvm_handle_guest_abort+0x4cc/0xa58
    [] handle_exit+0x24c/0x770
    [] kvm_arch_vcpu_ioctl_run+0x460/0x1988
    [] kvm_vcpu_ioctl+0x4f8/0xee0
    [] do_vfs_ioctl+0x160/0xcd8
    [] ksys_ioctl+0x98/0xd8

    Fix it by retiring all pending LPIs in the ap_list on the destroy path.

    p.s. I can also reproduce it on a normal guest shutdown. This is
    because userspace still sends LPIs to the vcpu (through the
    KVM_SIGNAL_MSI ioctl) while the guest is being shut down and is
    unable to handle them. A little strange, though, and I haven't dug
    further...
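
    Condensed, the new helper called from the vcpu destroy path looks
    roughly like this:

        /* Drop the refcount taken in vgic_queue_irq_unlock() for every
         * LPI still sitting in the ap_list of a dying vcpu. */
        void vgic_flush_pending_lpis(struct kvm_vcpu *vcpu)
        {
                struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
                struct vgic_irq *irq, *tmp;
                unsigned long flags;

                raw_spin_lock_irqsave(&vgic_cpu->ap_list_lock, flags);

                list_for_each_entry_safe(irq, tmp, &vgic_cpu->ap_list_head, ap_list) {
                        if (irq->intid >= VGIC_MIN_LPI) {
                                raw_spin_lock(&irq->irq_lock);
                                irq->pending_latch = false;
                                list_del(&irq->ap_list);
                                irq->vcpu = NULL;
                                raw_spin_unlock(&irq->irq_lock);

                                vgic_put_irq(vcpu->kvm, irq);
                        }
                }

                raw_spin_unlock_irqrestore(&vgic_cpu->ap_list_lock, flags);
        }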

    Reviewed-by: James Morse
    Signed-off-by: Zenghui Yu
    [maz: moved the distributor deallocation down to avoid an UAF splat]
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200414030349.625-2-yuzenghui@huawei.com
    Signed-off-by: Sasha Levin

    Zenghui Yu
     
  • [ Upstream commit 7df003c85218b5f5b10a7f6418208f31e813f38f ]

    While testing virtual machines with KSM on a v5.4-rc2 kernel,
    we found a refcount overflow of the zero page. The refcount is
    incremented in try_async_pf() (get_user_pages) without being
    decremented in mmu_set_spte() while handling an EPT violation.
    In kvm_release_pfn_clean(), only unreserved pages get a put_page();
    the zero page, however, is reserved. So, as VMs are created and
    destroyed, the refcount of the zero page keeps increasing until it
    overflows.

    step1:
    echo 10000 > /sys/kernel/mm/ksm/pages_to_scan
    echo 1 > /sys/kernel/mm/ksm/run
    echo 1 > /sys/kernel/mm/ksm/use_zero_pages

    step2:
    just create several normal QEMU KVM VMs, destroy each after 10s,
    and repeat this all the time.

    After a long period of time, all domains hang because the refcount
    of the zero page has overflowed.

    QEMU prints an error log as follows:
    …
    error: kvm run failed Bad address
    EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
    ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
    EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
    ES =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
    SS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    DS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    FS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    GS =0010 00000000 ffffffff 00c09300 DPL=0 DS [-WA]
    LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
    TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
    GDT= 000f7070 00000037
    IDT= 000f70ae 00000000
    CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
    DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
    DR6=00000000ffff0ff0 DR7=0000000000000400
    EFER=0000000000000000
    Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
    …

    Meanwhile, a kernel warning is reported:

    [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
    [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G OE 5.2.0-rc2 #5
    [40914.836415] RIP: 0010:try_get_page+0x1f/0x30
    [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 0
    0 00 00 00 48 8b 47 08 a8
    [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
    [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
    [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
    [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
    [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
    [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
    [40914.836423] FS: 00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
    [40914.836425] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
    [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [40914.836428] Call Trace:
    [40914.836433] follow_page_pte+0x302/0x47b
    [40914.836437] __get_user_pages+0xf1/0x7d0
    [40914.836441] ? irq_work_queue+0x9/0x70
    [40914.836443] get_user_pages_unlocked+0x13f/0x1e0
    [40914.836469] __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
    [40914.836486] try_async_pf+0x87/0x240 [kvm]
    [40914.836503] tdp_page_fault+0x139/0x270 [kvm]
    [40914.836523] kvm_mmu_page_fault+0x76/0x5e0 [kvm]
    [40914.836588] vcpu_enter_guest+0xb45/0x1570 [kvm]
    [40914.836632] kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
    [40914.836645] kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
    [40914.836650] do_vfs_ioctl+0xa9/0x620
    [40914.836653] ksys_ioctl+0x60/0x90
    [40914.836654] __x64_sys_ioctl+0x16/0x20
    [40914.836658] do_syscall_64+0x5b/0x180
    [40914.836664] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [40914.836666] RIP: 0033:0x7fb61cb6bfc7
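
    The fix stops treating the zero page as reserved, so that
    kvm_release_pfn_clean() drops the reference taken by GUP; roughly:

        bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
        {
                if (pfn_valid(pfn))
                        return PageReserved(pfn_to_page(pfn)) &&
                               !is_zero_pfn(pfn);

                return true;
        }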

    Signed-off-by: LinFeng
    Signed-off-by: Zhuang Yanying
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Zhuang Yanying
     
  • [ Upstream commit 0bda9498dd45280e334bfe88b815ebf519602cc3 ]

    In kvm_vgic_dist_init(), called from kvm_vgic_map_resources(), if
    dist->vgic_model is invalid, dist->spis is freed without setting
    dist->spis = NULL. In the vgicv2 resource cleanup path,
    __kvm_vgic_destroy() is then called to free the allocated resources,
    and because we forgot to set dist->spis = NULL on the
    kvm_vgic_dist_init() failure path, dist->spis is freed again further
    down the cleanup chain. So a double free would happen.
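
    The fix is a single line in the error path of kvm_vgic_dist_init(),
    roughly:

        default:
                kfree(dist->spis);
                dist->spis = NULL;      /* avoid double free in __kvm_vgic_destroy() */
                return -EINVAL;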

    Signed-off-by: Miaohe Lin
    Signed-off-by: Marc Zyngier
    Reviewed-by: Eric Auger
    Link: https://lore.kernel.org/r/1574923128-19956-1-git-send-email-linmiaohe@huawei.com
    Signed-off-by: Sasha Levin

    Miaohe Lin
     

17 Sep, 2020

2 commits

  • commit f65886606c2d3b562716de030706dfe1bea4ed5e upstream.

    When kmalloc() fails in kvm_io_bus_unregister_dev(), before removing
    the bus we should iterate over all other devices linked to it and
    call kvm_iodevice_destructor() for them.
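
    A condensed sketch of the resulting fallback path:

        new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1),
                          GFP_KERNEL);
        if (!new_bus) {
                pr_err("kvm: failed to shrink bus, removing it completely\n");
                for (j = 0; j < bus->dev_count; j++) {
                        if (j == i)
                                continue;
                        kvm_iodevice_destructor(bus->range[j].dev);
                }
        }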

    Fixes: 90db10434b16 ("KVM: kvm_io_bus_unregister_dev() should never fail")
    Cc: stable@vger.kernel.org
    Reported-and-tested-by: syzbot+f196caa45793d6374707@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=f196caa45793d6374707
    Signed-off-by: Rustam Kovhaev
    Reviewed-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Rustam Kovhaev
     
  • commit 3fb884ffe921c99483a84b0175f3c03f048e9069 upstream.

    For the obscure cases where PMD and PUD are the same size
    (64kB pages with 42bit VA, for example, which results in only
    two levels of page tables), we can't map anything as a PUD,
    because there is... erm... no PUD to speak of. Everything is
    either a PMD or a PTE.

    So let's only try and map a PUD when its size is different from
    that of a PMD.
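
    One way to express the guard (illustrative only; the exact upstream
    diff may be shaped differently):

        /* PUD_SIZE == PMD_SIZE means the PUD level is folded away, so
         * never try to install a stage-2 mapping at the PUD level. */
        if (vma_pagesize == PUD_SIZE && PUD_SIZE == PMD_SIZE)
                vma_pagesize = PMD_SIZE;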

    Cc: stable@vger.kernel.org
    Fixes: b8e0ba7c8bea ("KVM: arm64: Add support for creating PUD hugepages at stage 2")
    Reported-by: Gavin Shan
    Reported-by: Eric Auger
    Reviewed-by: Alexandru Elisei
    Reviewed-by: Gavin Shan
    Tested-by: Gavin Shan
    Tested-by: Eric Auger
    Tested-by: Alexandru Elisei
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

26 Aug, 2020

2 commits

  • commit b5331379bc62611d1026173a09c73573384201d9 upstream.

    When an MMU notifier call results in unmapping a range that spans multiple
    PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
    since this avoids running into RCU stalls during VM teardown. Unfortunately,
    if the VM is destroyed as a result of OOM, then blocking is not permitted
    and the call to the scheduler triggers the following BUG():

    | BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
    | in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
    | INFO: lockdep is turned off.
    | CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
    | Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
    | Call trace:
    | dump_backtrace+0x0/0x284
    | show_stack+0x1c/0x28
    | dump_stack+0xf0/0x1a4
    | ___might_sleep+0x2bc/0x2cc
    | unmap_stage2_range+0x160/0x1ac
    | kvm_unmap_hva_range+0x1a0/0x1c8
    | kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
    | __mmu_notifier_invalidate_range_start+0x218/0x31c
    | mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
    | __oom_reap_task_mm+0x128/0x268
    | oom_reap_task+0xac/0x298
    | oom_reaper+0x178/0x17c
    | kthread+0x1e4/0x1fc
    | ret_from_fork+0x10/0x30

    Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
    only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
    flags.
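
    Condensed, the unmap path now looks roughly like this:

        static int kvm_unmap_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size,
                                         void *data)
        {
                unsigned flags = *(unsigned *)data;
                bool may_block = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;

                __unmap_stage2_range(kvm, gpa, size, may_block);
                return 0;
        }

    with __unmap_stage2_range() only calling cond_resched_lock() on the
    mmu_lock when may_block is true.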

    Cc:
    Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
    Cc: Marc Zyngier
    Cc: Suzuki K Poulose
    Cc: James Morse
    Signed-off-by: Will Deacon
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit fdfe7cbd58806522e799e2a50a15aee7f2cbb7b6 upstream.

    The 'flags' field of 'struct mmu_notifier_range' is used to indicate
    whether invalidate_range_{start,end}() are permitted to block. In the
    case of kvm_mmu_notifier_invalidate_range_start(), this field is not
    forwarded on to the architecture-specific implementation of
    kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
    whether or not to block.

    Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
    architectures are aware as to whether or not they are permitted to block.
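
    The interface change itself is mechanical; each architecture's
    prototype gains a flags argument (sketch):

        int kvm_unmap_hva_range(struct kvm *kvm,
                                unsigned long start, unsigned long end,
                                unsigned flags);

    with kvm_mmu_notifier_invalidate_range_start() forwarding
    range->flags from the mmu_notifier_range.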

    Cc:
    Cc: Marc Zyngier
    Cc: Suzuki K Poulose
    Cc: James Morse
    Signed-off-by: Will Deacon
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Will Deacon
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

05 Aug, 2020

1 commit

  • commit b757b47a2fcba584d4a32fd7ee68faca510ab96f upstream.

    If a stage-2 page-table contains an executable, read-only mapping at the
    pte level (e.g. due to dirty logging being enabled), a subsequent write
    fault to the same page which tries to install a larger block mapping
    (e.g. due to dirty logging having been disabled) will erroneously inherit
    the exec permission and consequently skip I-cache invalidation for the
    rest of the block.

    Ensure that exec permission is only inherited by write faults when the
    new mapping is of the same size as the existing one. A subsequent
    instruction abort will result in I-cache invalidation for the entire
    block mapping.
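
    Roughly, the exec permission is now only inherited when the
    granularity matches (sketch of the user_mem_abort() change):

        needs_exec = exec_fault ||
                     (fault_status == FSC_PERM &&
                      stage2_is_exec(kvm, fault_ipa, vma_pagesize));

    where stage2_is_exec() gains a size argument and only returns true
    if the existing mapping is of the same size.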

    Signed-off-by: Will Deacon
    Signed-off-by: Marc Zyngier
    Tested-by: Quentin Perret
    Reviewed-by: Quentin Perret
    Cc: Marc Zyngier
    Cc:
    Link: https://lore.kernel.org/r/20200723101714.15873-1-will@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

19 Jun, 2020

1 commit

  • Merge Linux stable release v5.4.47 into imx_5.4.y

    * tag 'v5.4.47': (2193 commits)
    Linux 5.4.47
    KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
    KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
    ...

    Conflicts:
    arch/arm/boot/dts/imx6qdl.dtsi
    arch/arm/mach-imx/Kconfig
    arch/arm/mach-imx/common.h
    arch/arm/mach-imx/suspend-imx6.S
    arch/arm64/boot/dts/freescale/imx8qxp-mek.dts
    arch/powerpc/include/asm/cacheflush.h
    drivers/cpufreq/imx6q-cpufreq.c
    drivers/dma/imx-sdma.c
    drivers/edac/synopsys_edac.c
    drivers/firmware/imx/imx-scu.c
    drivers/net/ethernet/freescale/fec.h
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/phy_device.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/usb/cdns3/gadget.c
    drivers/usb/dwc3/gadget.c
    include/uapi/linux/dma-buf.h

    Signed-off-by: Jason Liu

    Jason Liu
     

17 Jun, 2020

3 commits

  • commit ef3e40a7ea8dbe2abd0a345032cd7d5023b9684f upstream.

    When using the PtrAuth feature in a guest, we need to save the host's
    keys before allowing the guest to program them. For that, we dump
    them in a per-CPU data structure (the so-called host context).

    But both call sites that do this are in preemptible context,
    which may end up in disaster should the vcpu thread get preempted
    before reentering the guest.

    Instead, save the keys eagerly on each vcpu_load(). This has an
    increased overhead, but is at least safe.
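
    Condensed from the patch, the eager save on vcpu_load():

        #define __ptrauth_save_key(regs, key)                                      \
        ({                                                                         \
                regs[key ## KEYLO_EL1] = read_sysreg_s(SYS_ ## key ## KEYLO_EL1);  \
                regs[key ## KEYHI_EL1] = read_sysreg_s(SYS_ ## key ## KEYHI_EL1);  \
        })

        /* in kvm_arch_vcpu_load(): */
        if (vcpu_has_ptrauth(vcpu)) {
                struct kvm_cpu_context *ctxt = vcpu->arch.host_cpu_context;

                __ptrauth_save_key(ctxt->sys_regs, APIA);
                __ptrauth_save_key(ctxt->sys_regs, APIB);
                __ptrauth_save_key(ctxt->sys_regs, APDA);
                __ptrauth_save_key(ctxt->sys_regs, APDB);
                __ptrauth_save_key(ctxt->sys_regs, APGA);

                vcpu_ptrauth_disable(vcpu);
        }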

    Cc: stable@vger.kernel.org
    Reviewed-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 0370964dd3ff7d3d406f292cb443a927952cbd05 upstream.

    On a VHE system, the EL1 state is left in the CPU most of the time,
    and only synchronized back to memory when vcpu_put() is called (most
    of the time on preemption).

    Which means that when injecting an exception, we'd better have a way
    to either:
    (1) write directly to the EL1 sysregs
    (2) synchronize the state back to memory, and do the changes there

    For an AArch64 guest, we already do (1), so we are safe. Unfortunately,
    doing the same thing for AArch32 would be pretty invasive. Instead,
    we can easily implement (2) by calling the put/load architectural
    backends while keeping preemption disabled, and then reload the
    state back into EL1.
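
    The shape of the fix, as a sketch (sysregs_loaded_on_cpu tracks the
    VHE-resident EL1 state):

        bool loaded;

        preempt_disable();
        loaded = vcpu->arch.sysregs_loaded_on_cpu;
        if (loaded)
                kvm_arch_vcpu_put(vcpu);        /* sync EL1 state to memory */

        /* ... modify the in-memory AArch32 exception state ... */

        if (loaded)
                kvm_arch_vcpu_load(vcpu, smp_processor_id());
        preempt_enable();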

    Cc: stable@vger.kernel.org
    Reported-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit e649b3f0188f8fd34dd0dde8d43fd3312b902fb2 upstream.

    Commit b1394e745b94 ("KVM: x86: fix APIC page invalidation") tried
    to fix inappropriate APIC page invalidation by re-introducing arch
    specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
    kvm_mmu_notifier_invalidate_range_start. However, the patch left a
    possible race where the VMCS APIC address cache is updated *before*
    it is unmapped:

    (Invalidator) kvm_mmu_notifier_invalidate_range_start()
    (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
    (KVM VCPU) vcpu_enter_guest()
    (KVM VCPU) kvm_vcpu_reload_apic_access_page()
    (Invalidator) actually unmap page

    Because of the above race, there can be a mismatch between the
    host physical address stored in the APIC_ACCESS_PAGE VMCS field and
    the host physical address stored in the EPT entry for the APIC GPA
    (0xfee00000). When this happens, the processor will not trap APIC
    accesses, and will instead show the raw contents of the APIC-access page.
    Because Windows OS periodically checks for unexpected modifications to
    the LAPIC register, this will show up as a BSOD crash with BugCheck
    CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in
    https://bugzilla.redhat.com/show_bug.cgi?id=1751017.

    The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
    cannot guarantee that no additional references are taken to the pages in
    the range before kvm_mmu_notifier_invalidate_range_end(). Fortunately,
    this case is supported by the MMU notifier API, as documented in
    include/linux/mmu_notifier.h:

    * If the subsystem
    * can't guarantee that no additional references are taken to
    * the pages in the range, it has to implement the
    * invalidate_range() notifier to remove any references taken
    * after invalidate_range_start().

    The fix therefore is to reload the APIC-access page field in the VMCS
    from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
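
    The new notifier callback is a thin wrapper around the existing
    arch hook, roughly:

        static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
                                                      struct mm_struct *mm,
                                                      unsigned long start,
                                                      unsigned long end)
        {
                struct kvm *kvm = mmu_notifier_to_kvm(mn);
                int idx;

                idx = srcu_read_lock(&kvm->srcu);
                kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
                srcu_read_unlock(&kvm->srcu, idx);
        }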

    Cc: stable@vger.kernel.org
    Fixes: b1394e745b94 ("KVM: x86: fix APIC page invalidation")
    Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
    Signed-off-by: Eiichi Tsukata
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Eiichi Tsukata
     

20 May, 2020

1 commit

  • [ Upstream commit 9a50ebbffa9862db7604345f5fd763122b0f6fed ]

    When a guest tries to read the active state of its interrupts,
    we currently just return whatever state we have in memory. This
    means that if such an interrupt lives in a List Register on another
    CPU, we fail to observe the latest active state for this interrupt.

    In order to remedy this, stop all the other vcpus so that they exit
    and we can observe the most recent value for the state. This is
    similar to what we are doing for the write side of the same
    registers, and results in new MMIO handlers for userspace (which
    do not need to stop the guest, as it is supposed to be stopped
    already).
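
    The guest-facing read handler now mirrors the write side, roughly:

        unsigned long vgic_mmio_read_active(struct kvm_vcpu *vcpu,
                                            gpa_t addr, unsigned int len)
        {
                u32 intid = VGIC_ADDR_TO_INTID(addr, 1);
                u32 val;

                mutex_lock(&vcpu->kvm->lock);
                vgic_access_active_prepare(vcpu, intid);   /* kvm_arm_halt_guest() */

                val = __vgic_mmio_read_active(vcpu, addr, len);

                vgic_access_active_finish(vcpu, intid);    /* kvm_arm_resume_guest() */
                mutex_unlock(&vcpu->kvm->lock);

                return val;
        }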

    Reported-by: Julien Grall
    Reviewed-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin

    Marc Zyngier
     

14 May, 2020

2 commits

  • commit 0225fd5e0a6a32af7af0aefac45c8ebf19dc5183 upstream.

    In the unlikely event that a 32bit vcpu traps into the hypervisor
    on an instruction that is located right at the end of the 32bit
    range, the emulation of that instruction is going to increment
    PC past the 32bit range. This isn't great, as userspace can then
    observe this value and get a bit confused.

    Conversely, userspace can do things like (in the context of a 64bit
    guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0,
    set PC to a 64bit value, change PSTATE to AArch32-USR, and observe
    that PC hasn't been truncated. More confusion.

    Fix both by:
    - truncating PC increments for 32bit guests
    - sanitizing all 32bit regs every time a core reg is changed by
    userspace, and that PSTATE indicates a 32bit mode.
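
    Illustratively, the first item amounts to:

        /* after emulating/skipping an instruction: */
        *vcpu_pc(vcpu) += 4;
        if (vcpu_mode_is_32bit(vcpu))
                *vcpu_pc(vcpu) = lower_32_bits(*vcpu_pc(vcpu));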

    Cc: stable@vger.kernel.org
    Acked-by: Will Deacon
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 1c32ca5dc6d00012f0c964e5fdd7042fcc71efb1 upstream.

    When deciding whether a guest has to be stopped we check whether this
    is a private interrupt or not. Unfortunately, there's an off-by-one bug
    here, and we fail to recognize a whole range of interrupts as being
    global (GICv2 SPIs 32-63).

    Fix the condition from > to >=.
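
    Sketch of the corrected check in vgic_change_active_prepare():

        /* SPIs (intid >= 32) are global and require stopping the guest */
        if (vcpu->kvm->arch.vgic.vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3 ||
            intid >= VGIC_NR_PRIVATE_IRQS)
                kvm_arm_halt_guest(vcpu->kvm);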

    Cc: stable@vger.kernel.org
    Fixes: abd7229626b93 ("KVM: arm/arm64: Simplify active_change_prepare and plug race")
    Reported-by: André Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

08 Mar, 2020

1 commit

  • Merge Linux stable release v5.4.24 into imx_5.4.y

    * tag 'v5.4.24': (3306 commits)
    Linux 5.4.24
    blktrace: Protect q->blk_trace with RCU
    kvm: nVMX: VMWRITE checks unsupported field before read-only field
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6sll-evk.dts
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
    drivers/clk/imx/clk-composite-8m.c
    drivers/gpio/gpio-mxc.c
    drivers/irqchip/Kconfig
    drivers/mmc/host/sdhci-of-esdhc.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/can/flexcan.c
    drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
    drivers/net/ethernet/mscc/ocelot.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/realtek.c
    drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/tee/optee/shm_pool.c
    drivers/usb/cdns3/gadget.c
    kernel/sched/cpufreq.c
    net/core/xdp.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c
    sound/soc/sof/core.c
    sound/soc/sof/imx/Kconfig
    sound/soc/sof/loader.c

    Jason Liu
     

05 Mar, 2020

1 commit

  • commit fcfbc617547fc6d9552cb6c1c563b6a90ee98085 upstream.

    When reading/writing using the guest/host cache, check for a bad hva
    before checking for a NULL memslot, which triggers the slow path for
    handing cross-page accesses. Because the memslot is nullified on error
    by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
    crossing into a new page, then the kvm_{read,write}_guest() slow path
    could potentially write/access the first chunk prior to detecting the
    bad hva.

    Arguably, performing a partial access is semantically correct from an
    architectural perspective, but that behavior is certainly not intended.
    In the original implementation, memslot was not explicitly nullified
    and therefore the partial access behavior varied based on whether the
    memslot itself was null, or if the hva was simply bad. The current
    behavior was introduced as a seemingly unintentional side effect in
    commit f1b9dd5eb86c ("kvm: Disallow wraparound in
    kvm_gfn_to_hva_cache_init"), which justified the change with "since some
    callers don't check the return code from this function, it sit seems
    prudent to clear ghc->memslot in the event of an error".

    Regardless of intent, the partial access is dependent on _not_ checking
    the result of the cache initialization, which is arguably a bug in its
    own right, at best simply weird.
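
    Reordered, the cached access now validates the hva first; a sketch
    from kvm_write_guest_offset_cached():

        if (kvm_is_error_hva(ghc->hva))
                return -EFAULT;

        /* fall back to the cross-page slow path only with a sane hva */
        if (unlikely(!ghc->memslot))
                return kvm_write_guest(kvm, gpa, data, len);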

    Fixes: 8f964525a121 ("KVM: Allow cross page reads and writes from cached translations.")
    Cc: Jim Mattson
    Cc: Andrew Honig
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

15 Feb, 2020

7 commits

  • commit 4a267aa707953a9a73d1f5dc7f894dd9024a92be upstream.

    According to the ARM ARM, registers CNT{P,V}_TVAL_EL0 have bits [63:32]
    RES0 [1]. When reading the register, the value is truncated to the least
    significant 32 bits [2], and on writes, TimerValue is treated as a signed
    32-bit integer [1, 2].

    When the guest behaves correctly and writes 32-bit values, treating TVAL
    as an unsigned 64 bit register works as expected. However, things start
    to break down when the guest writes larger values, because
    (u64)0x1_ffff_ffff = 8589934591, but (s32)0x1_ffff_ffff = -1, and the
    former will cause the timer interrupt to be asserted in the future, but
    the latter will cause it to be asserted now. Let's treat TVAL as a
    signed 32-bit register on writes, to match the behaviour described in
    the architecture, and the behaviour experimentally exhibited by the
    virtual timer on a non-vhe host.

    [1] Arm DDI 0487E.a, section D13.8.18
    [2] Arm DDI 0487E.a, section D11.2.4
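
    Sketch of the TVAL emulation after the change:

        /* read side (kvm_arm_timer_read()): truncate to 32 bits */
        case TIMER_REG_TVAL:
                val = lower_32_bits(timer->cnt_cval - kvm_phys_timer_read() +
                                    timer->cntvoff);
                break;

        /* write side (kvm_arm_timer_write()): TimerValue is signed 32-bit */
        case TIMER_REG_TVAL:
                timer->cnt_cval = kvm_phys_timer_read() - timer->cntvoff + (s32)val;
                break;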

    Signed-off-by: Alexandru Elisei
    [maz: replaced the read-side mask with lower_32_bits]
    Signed-off-by: Marc Zyngier
    Fixes: 8fa761624871 ("KVM: arm/arm64: arch_timer: Fix CNTP_TVAL calculation")
    Link: https://lore.kernel.org/r/20200127103652.2326-1-alexandru.elisei@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Alexandru Elisei
     
  • commit aa76829171e98bd75a0cc00b6248eca269ac7f4f upstream.

    At the moment a SW_INCR counter always overflows on a 32-bit
    boundary, independently of whether the n+1th counter is
    programmed as CHAIN.

    Check whether the SW_INCR counter is a 64b counter and if so,
    implement the 64b logic.
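
    A condensed sketch of the resulting SW_INCR loop body:

        /* increment the low counter; on 32-bit wrap, either carry into
         * the high counter (CHAIN) or raise the overflow bit */
        reg = lower_32_bits(__vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) + 1);
        __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = reg;
        if (reg)                /* no overflow, move on */
                continue;

        if (kvm_pmu_pmc_is_chained(&pmu->pmc[i])) {
                reg = lower_32_bits(__vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i + 1) + 1);
                __vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i + 1) = reg;
                if (!reg)       /* 64b overflow: flag the odd counter */
                        __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i + 1);
        } else {
                __vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(i);
        }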

    Fixes: 80f393a23be6 ("KVM: arm/arm64: Support chained PMU counters")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-4-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     
  • commit 3837407c1aa1101ed5e214c7d6041e7a23335c6e upstream.

    The specification says PMSWINC increments PMEVCNTR<n>_EL0 by 1
    if the counter is enabled and configured to count SW_INCR.

    For PMEVCNTR<n>_EL0 to be enabled, we need PMCNTENSET to be set
    for the corresponding event counter, but we also need the PMCR.E
    bit to be set.
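
    The guard itself is a one-liner at the top of
    kvm_pmu_software_increment(), roughly:

        if (!(__vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
                return;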

    Fixes: 7a0adc7064b8 ("arm64: KVM: Add access handler for PMSWINC register")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Andrew Murray
    Acked-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200124142535.29386-2-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     
  • commit 21aecdbd7f3ab02c9b82597dc733ee759fb8b274 upstream.

    KVM's inject_abt64() injects an external-abort into an aarch64 guest.
    The KVM_CAP_ARM_INJECT_EXT_DABT is intended to do exactly this, but
    for an aarch32 guest inject_abt32() injects an implementation-defined
    exception, 'Lockdown fault'.

    Change this to external abort. For non-LPAE we now get the documented:
    | Unhandled fault: external abort on non-linefetch (0x008) at 0x9c800f00
    and for LPAE:
    | Unhandled fault: synchronous external abort (0x210) at 0x9c800f00

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-3-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit 018f22f95e8a6c3e27188b7317ef2c70a34cb2cd upstream.

    Beata reports that KVM_SET_VCPU_EVENTS doesn't inject the expected
    exception to a non-LPAE aarch32 guest.

    The host intends to inject DFSR.FS=0x14 "IMPLEMENTATION DEFINED fault
    (Lockdown fault)", but the guest receives DFSR.FS=0x04 "Fault on
    instruction cache maintenance". This fault is hooked by
    do_translation_fault() since ARMv6, which goes on to silently 'handle'
    the exception, and restart the faulting instruction.

    It turns out that when TTBCR.EAE is clear, DFSR is split, and FS[4]
    has to shuffle up to DFSR[10].

    As KVM only does this in one place, fix up the static values. We
    now get the expected:
    | Unhandled fault: lock abort (0x404) at 0x9c800f00
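
    The bit shuffle is worth spelling out: FS is a 5-bit field, and in
    the short-descriptor DFSR format FS[3:0] sits at bits [3:0] while
    FS[4] sits at bit [10]. For the lockdown fault (illustrative
    arithmetic, not the upstream code):

        u32 fs = 0x14;                                  /* 0b10100 */
        u32 dfsr = (fs & 0xf) | ((fs & 0x10) << 6);     /* = 0x404 */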

    Fixes: 74a64a981662a ("KVM: arm/arm64: Unify 32bit fault injection")
    Reported-by: Beata Michalska
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121123356.203000-2-james.morse@arm.com
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     
  • commit cf2d23e0bac9f6b5cd1cba8898f5f05ead40e530 upstream.

    kvm_test_age_hva() is called upon mmu_notifier_test_young(), but the
    wrong address range has been passed to handle_hva_to_gpa(). With the
    wrong address range, no young bits will be checked in
    handle_hva_to_gpa(), meaning zero is always returned from
    mmu_notifier_test_young().

    This fixes the issue by passing the correct address range to the
    underlying function handle_hva_to_gpa(), so that the hardware young
    (access) bit will be visited.
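
    The fix is a one-liner in kvm_test_age_hva(), roughly:

        /* the notifier tests a single page: pass a non-empty range */
        return handle_hva_to_gpa(kvm, hva, hva + PAGE_SIZE,
                                 kvm_test_age_hva_handler, NULL);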

    Fixes: 35307b9a5f7e ("arm/arm64: KVM: Implement Stage-2 page aging")
    Signed-off-by: Gavin Shan
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20200121055659.19560-1-gshan@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit 8c58be34494b7f1b2adb446e2d8beeb90e5de65b upstream.

    Saving/restoring an unmapped collection is a valid scenario. For
    example this happens if a MAPTI command was sent, featuring an
    unmapped collection. At the moment the CTE fails to be restored.
    Only compare against the number of online vcpus if the rdist
    base is set.
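
    Sketch of the relaxed check in vgic_its_restore_cte():

        /* an unmapped collection carries no valid target address */
        if (target_addr != COLLECTION_NOT_MAPPED &&
            target_addr >= atomic_read(&kvm->online_vcpus))
                return -EINVAL;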

    Fixes: ea1ad53e1e31a ("KVM: arm64: vgic-its: Collection table save/restore")
    Signed-off-by: Eric Auger
    Signed-off-by: Marc Zyngier
    Reviewed-by: Zenghui Yu
    Link: https://lore.kernel.org/r/20191213094237.19627-1-eric.auger@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Eric Auger
     

11 Feb, 2020

8 commits

  • [ Upstream commit 42cde48b2d39772dba47e680781a32a6c4b7dc33 ]

    Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
    on read-only memslots due to gfn_to_hva() assuming writes. Functionally,
    this allows x86 to create large mappings for read-only memslots that
    are backed by HugeTLB mappings.

    Note, the changelog for commit 05da45583de9 ("KVM: MMU: large page
    support") states "If the largepage contains write-protected pages, a
    large pte is not used.", but "write-protected" refers to pages that are
    temporarily read-only, e.g. read-only memslots didn't even exist at the
    time.

    Fixes: 4d8b81abc47b ("KVM: introduce readonly memslot")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ]

    Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
    correct set of memslots is used when handling x86 page faults in SMM.

    Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit 736c291c9f36b07f8889c61764c28edce20e715d ]

    Convert a plethora of parameters and variables in the MMU and page fault
    flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

    Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
    addresses. When TDP is enabled, the fault address is a guest physical
    address and thus can be a 64-bit value, even when both KVM and its guest
    are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
    64-bit field, not a natural width field.

    Using a gva_t for the fault address means KVM will incorrectly drop the
    upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
    translate L2 GPAs to L1 GPAs.

    Opportunistically rename variables and parameters to better reflect the
    dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
    "addr" instead of "vaddr" when the address may be either a GVA or an L2
    GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
    a confusing "gpa_t gva" declaration; this also sets the stage for a
    future patch to combine nonpaging_page_fault() and tdp_page_fault() with
    minimal churn.

    Sprinkle in a few comments to document flows where an address is known
    to be a GVA and thus can be safely truncated to a 32-bit value. Add
    WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
    document such cases and detect bugs.
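
    The representative signature change, as merged:

        int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                               u64 error_code, void *insn, int insn_len);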

    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • commit 917248144db5d7320655dbb41d3af0b8a0f3d589 upstream.

    __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
    * relatively expensive
    * in certain cases (such as when done from atomic context) not
    callable at all

    Stashing the gfn-to-pfn mapping should help with both cases.

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit 1eff70a9abd46f175defafd29bc17ad456f398a7 upstream.

    kvm_vcpu_(un)map operates on gfns from any current address space.
    In certain cases we want to make sure we are not mapping SMRAM,
    and for that we can use the kvm_(un)map_gfn() functions that we are
    introducing in this patch.
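
    A sketch of the new helper (condensed; __kvm_map_gfn() is the
    internal mapper factored out of kvm_vcpu_map()):

        /* Unlike kvm_vcpu_map(), this always uses the default address
         * space, so it can never end up mapping SMRAM. */
        int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn,
                        struct kvm_host_map *map)
        {
                return __kvm_map_gfn(kvm_memslots(vcpu->kvm), gfn, map);
        }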

    This is part of CVE-2019-3016.

    Signed-off-by: Boris Ostrovsky
    Reviewed-by: Joao Martins
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Boris Ostrovsky
     
  • commit b6ae256afd32f96bec0117175b329d0dd617655e upstream.

    On AArch64 you can do a sign-extended load to either a 32-bit or 64-bit
    register, and we should only sign extend the register up to the width of
    the register as specified in the operation (by using the 32-bit Wn or
    64-bit Xn register specifier).

    As it turns out, the architecture provides this decoding information in
    the SF ("Sixty-Four" -- how cute...) bit.

    Let's take advantage of this with the usual 32-bit/64-bit header file
    dance and do the right thing on AArch64 hosts.
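
    A sketch of the decode/completion changes (the SF bit is ESR_ELx
    bit 15; helper and field names as in the upstream patch):

        /* at decode time, remember whether the target was Xn or Wn */
        vcpu->arch.mmio_decode.sixty_four = kvm_vcpu_dabt_issf(vcpu);

        /* when completing the MMIO read, after any sign extension: */
        if (!vcpu->arch.mmio_decode.sixty_four)
                data = data & 0xffffffff;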

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191212195055.5541-1-christoffer.dall@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     
  • commit 1cfbb484de158e378e8971ac40f3082e53ecca55 upstream.

    Confusingly, there are three SPSR layouts that a kernel may need to deal
    with:

    (1) An AArch64 SPSR_ELx view of an AArch64 pstate
    (2) An AArch64 SPSR_ELx view of an AArch32 pstate
    (3) An AArch32 SPSR_* view of an AArch32 pstate

    When the KVM AArch32 support code deals with SPSR_{EL2,HYP}, it's either
    dealing with #2 or #3 consistently. On arm64 the PSR_AA32_* definitions
    match the AArch64 SPSR_ELx view, and on arm the PSR_AA32_* definitions
    match the AArch32 SPSR_* view.

    However, when we inject an exception into an AArch32 guest, we have to
    synthesize the AArch32 SPSR_* that the guest will see. Thus, an AArch64
    host needs to synthesize layout #3 from layout #2.

    This patch adds a new host_spsr_to_spsr32() helper for this, and makes
    use of it in the KVM AArch32 support code. For arm64 we need to shuffle
    the DIT bit around, and remove the SS bit, while for arm we can use the
    value as-is.

    I've open-coded the bit manipulation for now to avoid having to rework
    the existing PSR_* definitions into PSR64_AA32_* and PSR32_AA32_*
    definitions. I hope to perform a more thorough refactoring in future so
    that we can handle pstate view manipulation more consistently across the
    kernel tree.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-4-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit 3c2483f15499b877ccb53250d88addb8c91da147 upstream.

    When KVM injects an exception into a guest, it generates the CPSR value
    from scratch, configuring CPSR.{M,A,I,T,E}, and setting all other
    bits to zero.

    This isn't correct, as the architecture specifies that some CPSR bits
    are (conditionally) cleared or set upon an exception, and others are
    unchanged from the original context.

    This patch adds logic to match the architectural behaviour. To make this
    simple to follow/audit/extend, documentation references are provided,
    and bits are configured in order of their layout in SPSR_EL2. This
    layout can be seen in the diagram on ARM DDI 0487E.a page C5-426.

    Note that this code is used by both arm and arm64, and is intended to
    function with the SPSR_EL2 and SPSR_HYP layouts.

    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Reviewed-by: Alexandru Elisei
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200108134324.46500-3-mark.rutland@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     

31 Dec, 2019

1 commit

  • commit 6d674e28f642e3ff676fbae2d8d1b872814d32b6 upstream.

    A device mapping is normally always mapped at Stage-2, since there
    is very little gain in having it faulted in.

    Nonetheless, it is possible to end up in a situation where the device
    mapping has been removed from Stage-2 (userspace munmapped the VFIO
    region, and the MMU notifier did its job), but is still present in a
    userspace mapping (userspace has mapped it back at the same address).
    In such a situation, the device mapping will be demand-paged as the
    guest performs memory accesses.

    This requires us to be careful when dealing with mapping size and
    cache management, and to handle potential execution of a device
    mapping.

    Reported-by: Alexandru Elisei
    Signed-off-by: Marc Zyngier
    Tested-by: Alexandru Elisei
    Reviewed-by: James Morse
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191211165651.7889-2-maz@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

16 Dec, 2019

1 commit

  • This is the 5.4.3 stable release

    Conflicts:
    drivers/cpufreq/imx-cpufreq-dt.c
    drivers/spi/spi-fsl-qspi.c

    The conflicts are very minor and were fixed up during the merge. The
    imx-cpufreq-dt.c conflict is just a one-line code-style change; the
    upstream version is used, with no functional change.

    The spi-fsl-qspi.c file has minor conflicts when merging the
    upstream fix c69b17da53b2 ("spi: spi-fsl-qspi: Clear TDH bits in
    FLSHCR register").

    After the merge, a basic boot sanity test and a basic qspi test
    were done on i.MX.

    Signed-off-by: Jason Liu

    Jason Liu
     

13 Dec, 2019

1 commit

  • commit ca185b260951d3b55108c0b95e188682d8a507b7 upstream.

    It's possible that two LPIs reside in the same "byte_offset" but
    target two different vcpus, where their pending status is indicated
    by two different pending tables. In such a scenario, the
    last_byte_offset optimization will lead KVM to rely on the wrong
    pending-table entry. Let's use last_ptr instead, which can be
    treated as a byte index into a pending table and is also vcpu
    specific.
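
    Condensed, the save loop now caches the full byte address:

        ptr = pendbase + byte_offset;   /* pendbase is vcpu specific */

        if (ptr != last_ptr) {
                ret = kvm_read_guest_lock(kvm, ptr, &val, 1);
                if (ret)
                        return ret;
                last_ptr = ptr;
        }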

    Fixes: 280771252c1b ("KVM: arm64: vgic-v3: KVM_DEV_ARM_VGIC_SAVE_PENDING_TABLES")
    Cc: stable@vger.kernel.org
    Signed-off-by: Zenghui Yu
    Signed-off-by: Marc Zyngier
    Acked-by: Eric Auger
    Link: https://lore.kernel.org/r/20191029071919.177-4-yuzenghui@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Zenghui Yu
     

25 Nov, 2019

2 commits

    FSL-MC bus devices use device-ids from 0x10000 to 0x20000, so to
    support MSI interrupts for mc-bus devices we need a vgic-ITS
    device-id table of size 2^17, covering the device-id range from
    0x10000 to 0x20000.

    Signed-off-by: Bharat Bhushan

    Bharat Bhushan
     
    Instead of hardcoding checks for the physical addresses of the qman
    cacheable mmio region, extract the mapping information from the
    user-space mapping. This involves several steps:
    - get access to a pte that is part of the user-space mapping by
    using the get_locked_pte() / pte_unmap_unlock() APIs
    - extract the memtype (normal / device) and shareability from
    the pte
    - convert them to S2 translation bits in the newly added
    function stage1_to_stage2_pgprot()
    - finish making the S2 translation with the obtained bits

    Another option explored was using vm_area_struct::vm_page_prot,
    which is set in the vfio-mc mmap code to the correct page bits.
    However, experiments show that these bits are later altered in the
    generic mmap code (e.g. the shareability bit is always set on
    arm64). The only place where the original bits can still be found
    is the user-space mapping, using the method described above.
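
    A hypothetical sketch of the pte-to-S2 conversion described above
    (using the standard arm64 attribute macros; the actual downstream
    implementation may differ):

        static pgprot_t stage1_to_stage2_pgprot(pte_t pte)
        {
                /* normal cacheable memory at stage 1 maps to normal at
                 * stage 2; anything else is treated as device */
                if ((pte_val(pte) & PTE_ATTRINDX_MASK) ==
                    PTE_ATTRINDX(MT_NORMAL))
                        return PAGE_S2;

                return PAGE_S2_DEVICE;
        }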

    Signed-off-by: Laurentiu Tudor
    [Bharat - Fixed mem_type check issue]
    [changed "ifdef ARM64" to CONFIG_ARM64]
    Signed-off-by: Bharat Bhushan
    [Ioana - added a sanity check for hugepages]
    Signed-off-by: Ioana Ciornei
    [Fixed format issues]
    Signed-off-by: Diana Craciun

    Laurentiu Tudor