01 Dec, 2020

1 commit

  • Commit 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP block size
    configurable") updated kvmppc_xive_vcpu_id_valid() in a way that
    allows userspace to trigger an assertion in skiboot and crash the host:

    [ 696.186248988,3] XIVE[ IC 08 ] eq_blk != vp_blk (0 vs. 1) for target 0x4300008c/0
    [ 696.186314757,0] Assert fail: hw/xive.c:2370:0
    [ 696.186342458,0] Aborting!
    xive-kvCPU 0043 Backtrace:
    S: 0000000031e2b8f0 R: 0000000030013840 .backtrace+0x48
    S: 0000000031e2b990 R: 000000003001b2d0 ._abort+0x4c
    S: 0000000031e2ba10 R: 000000003001b34c .assert_fail+0x34
    S: 0000000031e2ba90 R: 0000000030058984 .xive_eq_for_target.part.20+0xb0
    S: 0000000031e2bb40 R: 0000000030059fdc .xive_setup_silent_gather+0x2c
    S: 0000000031e2bc20 R: 000000003005a334 .opal_xive_set_vp_info+0x124
    S: 0000000031e2bd20 R: 00000000300051a4 opal_entry+0x134
    --- OPAL call token: 0x8a caller R1: 0xc000001f28563850 ---

    XIVE maintains the interrupt context state of non-dispatched vCPUs in
    an internal VP structure. We allocate a bunch of those on startup to
    accommodate all possible vCPUs. Each VP has an id that we derive from
    the vCPU id for efficiency:

    static inline u32 kvmppc_xive_vp(struct kvmppc_xive *xive, u32 server)
    {
            return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
    }

    The KVM XIVE device used to allocate KVM_MAX_VCPUS VPs. This was
    limiting the number of concurrent VMs because the VP space is
    limited on the HW. Since most of the time VMs run with far fewer
    vCPUs, commit 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP
    block size configurable") gave the possibility for userspace to
    tune the size of the VP block through the KVM_DEV_XIVE_NR_SERVERS
    attribute.

    The check in kvmppc_pack_vcpu_id() was changed from

    cpu < KVM_MAX_VCPUS * xive->kvm->arch.emul_smt_mode

    to

    cpu < xive->nr_servers * xive->kvm->arch.emul_smt_mode

    The previous check was based on the fact that the VP block had
    KVM_MAX_VCPUS entries and that kvmppc_pack_vcpu_id() guarantees
    that packed vCPU ids are below KVM_MAX_VCPUS. We've changed the
    size of the VP block, but kvmppc_pack_vcpu_id() has nothing to
    do with it and it certainly doesn't ensure that the packed vCPU
    ids are below xive->nr_servers. kvmppc_xive_vcpu_id_valid() might
    thus return true when the VM was configured with a non-standard
    VSMT mode, even if the packed vCPU id is higher than what we
    expect. We end up using an unallocated VP id, which confuses
    OPAL. The assert in OPAL is probably abusive and should be
    converted to a regular error that the kernel can handle, but
    we shouldn't really use broken VP ids in the first place.

    Fix kvmppc_xive_vcpu_id_valid() so that it checks the packed
    vCPU id is below xive->nr_servers, which is explicitly what we
    want.
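
    A minimal sketch of the corrected check described above, assuming the
    helper keeps its current name and arguments (close to, but not
    necessarily identical to, the actual patch):

    static bool kvmppc_xive_vcpu_id_valid(struct kvmppc_xive *xive, u32 cpu)
    {
            /* We have a block of xive->nr_servers VPs. A vCPU id is only
             * usable if its packed form fits in that block. */
            return kvmppc_pack_vcpu_id(xive->kvm, cpu) < xive->nr_servers;
    }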

    Fixes: 062cfab7069f ("KVM: PPC: Book3S HV: XIVE: Make VP block size configurable")
    Cc: stable@vger.kernel.org # v5.5+
    Signed-off-by: Greg Kurz
    Reviewed-by: Cédric Le Goater
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/160673876747.695514.1809676603724514920.stgit@bahia.lan

    Greg Kurz
     

16 Nov, 2020

1 commit

  • When accessing the ESB page of a source interrupt, the fault handler
    will retrieve the page address from the XIVE interrupt 'xive_irq_data'
    structure. If the associated KVM XIVE interrupt is not valid, that is,
    not allocated at the HW level for some reason, the fault handler will
    dereference a NULL pointer, leading to the oops below:

    WARNING: CPU: 40 PID: 59101 at arch/powerpc/kvm/book3s_xive_native.c:259 xive_native_esb_fault+0xe4/0x240 [kvm]
    CPU: 40 PID: 59101 Comm: qemu-system-ppc Kdump: loaded Tainted: G W --------- - - 4.18.0-240.el8.ppc64le #1
    NIP: c00800000e949fac LR: c00000000044b164 CTR: c00800000e949ec8
    REGS: c000001f69617840 TRAP: 0700 Tainted: G W --------- - - (4.18.0-240.el8.ppc64le)
    MSR: 9000000000029033 CR: 44044282 XER: 00000000
    CFAR: c00000000044b160 IRQMASK: 0
    GPR00: c00000000044b164 c000001f69617ac0 c00800000e96e000 c000001f69617c10
    GPR04: 05faa2b21e000080 0000000000000000 0000000000000005 ffffffffffffffff
    GPR08: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
    GPR12: c00800000e949ec8 c000001ffffd3400 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 0000000000000000 0000000000000000 c000001f5c065160 c000000001c76f90
    GPR24: c000001f06f20000 c000001f5c065100 0000000000000008 c000001f0eb98c78
    GPR28: c000001dcab40000 c000001dcab403d8 c000001f69617c10 0000000000000011
    NIP [c00800000e949fac] xive_native_esb_fault+0xe4/0x240 [kvm]
    LR [c00000000044b164] __do_fault+0x64/0x220
    Call Trace:
    [c000001f69617ac0] [0000000137a5dc20] 0x137a5dc20 (unreliable)
    [c000001f69617b50] [c00000000044b164] __do_fault+0x64/0x220
    [c000001f69617b90] [c000000000453838] do_fault+0x218/0x930
    [c000001f69617bf0] [c000000000456f50] __handle_mm_fault+0x350/0xdf0
    [c000001f69617cd0] [c000000000457b1c] handle_mm_fault+0x12c/0x310
    [c000001f69617d10] [c00000000007ef44] __do_page_fault+0x264/0xbb0
    [c000001f69617df0] [c00000000007f8c8] do_page_fault+0x38/0xd0
    [c000001f69617e30] [c00000000000a714] handle_page_fault+0x18/0x38
    Instruction dump:
    40c2fff0 7c2004ac 2fa90000 409e0118 73e90001 41820080 e8bd0008 7c2004ac
    7ca90074 39400000 915c0000 7929d182 2fa50000 419e0080 e89e0018
    ---[ end trace 66c6ff034c53f64f ]---
    xive-kvm: xive_native_esb_fault: accessing invalid ESB page for source 8 !

    Fix that by checking the validity of the KVM XIVE interrupt structure.
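
    A hedged sketch of such a check early in xive_native_esb_fault(); the
    local variable names are assumptions, not a quote of the patch:

    struct kvmppc_xive_irq_state *state = &sb->irq_state[src];

    if (!state->valid)
            return VM_FAULT_SIGBUS;   /* no HW interrupt behind this source */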

    Fixes: 6520ca64cde7 ("KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages")
    Cc: stable@vger.kernel.org # v5.2+
    Reported-by: Greg Kurz
    Signed-off-by: Cédric Le Goater
    Tested-by: Greg Kurz
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20201105134713.656160-1-clg@kaod.org

    Cédric Le Goater
     

26 Oct, 2020

1 commit

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl
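
    For illustration, a before/after of one hypothetical declaration (the two
    forms cannot coexist in the same file; this only shows the conversion):

    /* before: unquoted, relying on stringification inside the macro */
    static int boot_ready __section(.init.data);

    /* after: the section name is a plain string literal */
    static int boot_ready __section(".init.data");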

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
     

24 Oct, 2020

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "For x86, there is a new alternative and (in the future) more scalable
    implementation of extended page tables that does not need a reverse
    map from guest physical addresses to host physical addresses.

    For now it is disabled by default because it is still lacking a few of
    the existing MMU's bells and whistles. However it is a very solid
    piece of work and it is already available for people to hammer on it.

    Other updates:

    ARM:
    - New page table code for both hypervisor and guest stage-2
    - Introduction of a new EL2-private host context
    - Allow EL2 to have its own private per-CPU variables
    - Support of PMU event filtering
    - Complete rework of the Spectre mitigation

    PPC:
    - Fix for running nested guests with in-kernel IRQ chip
    - Fix race condition causing occasional host hard lockup
    - Minor cleanups and bugfixes

    x86:
    - allow trapping unknown MSRs to userspace
    - allow userspace to force #GP on specific MSRs
    - INVPCID support on AMD
    - nested AMD cleanup, on demand allocation of nested SVM state
    - hide PV MSRs and hypercalls for features not enabled in CPUID
    - new test for MSR_IA32_TSC writes from host and guest
    - cleanups: MMU, CPUID, shared MSRs
    - LAPIC latency optimizations and bugfixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
    kvm: x86/mmu: NX largepage recovery for TDP MMU
    kvm: x86/mmu: Don't clear write flooding count for direct roots
    kvm: x86/mmu: Support MMIO in the TDP MMU
    kvm: x86/mmu: Support write protection for nesting in tdp MMU
    kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
    kvm: x86/mmu: Support dirty logging for the TDP MMU
    kvm: x86/mmu: Support changed pte notifier in tdp MMU
    kvm: x86/mmu: Add access tracking for tdp_mmu
    kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
    kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
    kvm: x86/mmu: Add TDP MMU PF handler
    kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
    kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
    KVM: Cache as_id in kvm_memory_slot
    kvm: x86/mmu: Add functions to handle changed TDP SPTEs
    kvm: x86/mmu: Allocate and free TDP MMU roots
    kvm: x86/mmu: Init / Uninit the TDP MMU
    kvm: x86/mmu: Introduce tdp_iter
    KVM: mmu: extract spte.h and spte.c
    KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
    ...

    Linus Torvalds
     

22 Oct, 2020

1 commit


17 Oct, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
    it for powerpc, as well as a related fix for sparc.

    - Remove support for PowerPC 601.

    - Some fixes for watchpoints & addition of a new ptrace flag for
    detecting ISA v3.1 (Power10) watchpoint features.

    - A fix for kernels using 4K pages and the hash MMU on bare metal
    Power9 systems with > 16TB of RAM, or RAM on the 2nd node.

    - A basic idle driver for shallow stop states on Power10.

    - Tweaks to our sched domains code to better inform the scheduler about
    the hardware topology on Power9/10, where two SMT4 cores can be
    presented by firmware as an SMT8 core.

    - A series doing further reworks & cleanups of our EEH code.

    - Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
    to prevent root from overwriting kernel memory.

    - Other smaller features, fixes & cleanups.

    Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
    Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
    Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
    Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
    Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
    Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
    Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
    Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
    Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
    Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
    Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
    Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
    Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
    Yingliang, zhengbin.

    * tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
    Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
    selftests/powerpc: Fix eeh-basic.sh exit codes
    cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
    powerpc/time: Make get_tb() common to PPC32 and PPC64
    powerpc/time: Make get_tbl() common to PPC32 and PPC64
    powerpc/time: Remove get_tbu()
    powerpc/time: Avoid using get_tbl() and get_tbu() internally
    powerpc/time: Make mftb() common to PPC32 and PPC64
    powerpc/time: Rename mftbl() to mftb()
    powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
    powerpc/32s: Rename head_32.S to head_book3s_32.S
    powerpc/32s: Setup the early hash table at all time.
    powerpc/time: Remove ifdef in get_dec() and set_dec()
    powerpc: Remove get_tb_or_rtc()
    powerpc: Remove __USE_RTC()
    powerpc: Tidy up a bit after removal of PowerPC 601.
    powerpc: Remove support for PowerPC 601
    powerpc: Remove PowerPC 601
    powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
    powerpc: Remove CONFIG_PPC601_SYNC_FIX
    ...

    Linus Torvalds
     

14 Oct, 2020

3 commits

  • Patch series "memblock: seasonal cleaning^w cleanup", v3.

    These patches simplify several uses of memblock iterators and hide some of
    the memblock implementation details from the rest of the system.

    This patch (of 17):

    The memory size calculation in kvm_cma_reserve() traverses memblock.memory
    rather than simply calling memblock_phys_mem_size(). The comment in that
    function suggests that at some point there should have been a call to
    memblock_analyze() before memblock_phys_mem_size() could be used. As of
    now, there is no memblock_analyze() at all and memblock_phys_mem_size()
    can be used as soon as cold-plug memory is registered with memblock.

    Replace the loop over memblock.memory with a call to memblock_phys_mem_size().
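
    As a sketch, the size computation in kvm_cma_reserve() then reduces to
    roughly the following (kvm_cma_resv_ratio is the existing reservation
    percentage in that file; the exact expression in the patch may differ):

    selected_size = PAGE_ALIGN(memblock_phys_mem_size() *
                               kvm_cma_resv_ratio / 100);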

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Ingo Molnar
    Cc: Hari Bathini
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Miguel Ojeda
    Cc: Thomas Bogendoerfer
    Link: https://lkml.kernel.org/r/20200818151634.14343-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20200818151634.14343-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • In support of device-dax growing the ability to front physically
    discontiguous ranges of memory, update devm_memremap_pages() to track
    multiple ranges with a single reference counter and devm instance.

    Convert all [devm_]memremap_pages() users to specify the number of ranges
    they are mapping in their 'struct dev_pagemap' instance.
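
    A sketch of what a single-range caller looks like after the conversion,
    using the field names given above (the surrounding driver code and the
    'res' resource are hypothetical):

    pgmap->range.start = res->start;
    pgmap->range.end = res->end;
    pgmap->nr_range = 1;                    /* this pgmap covers one range */
    addr = devm_memremap_pages(dev, pgmap);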

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Ard Biesheuvel
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The 'struct resource' in 'struct dev_pagemap' is only used for holding
    resource span information. The other fields, 'name', 'flags', 'desc',
    'parent', 'sibling', and 'child' are all unused wasted space.

    This is in preparation for introducing a multi-range extension of
    devm_memremap_pages().

    The bulk of this change is unwinding all the places internal to libnvdimm
    that used 'struct resource' unnecessarily, and replacing instances of
    'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

    P2PDMA had a minor usage of the resource flags field, but only to report
    failures with "%pR". That is replaced with an open coded print of the
    range.
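
    A sketch of the resulting structure (other fields omitted); 'struct range'
    from linux/range.h carries just the start/end pair that was actually used:

    struct dev_pagemap {
            struct range range;     /* was: struct resource res */
            /* ... remaining fields unchanged ... */
    };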

    [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
    Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam

    Signed-off-by: Dan Williams
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky [xen]
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

22 Sep, 2020

3 commits

  • Building the kernel with `C=2` (sparse) gives the following warnings:
    arch/powerpc/kvm/book3s_hv_nested.c:572:25: warning: symbol
    'kvmhv_alloc_nested' was not declared. Should it be static?
    arch/powerpc/kvm/book3s_64_mmu_radix.c:350:6: warning: symbol
    'kvmppc_radix_set_pte_at' was not declared. Should it be static?
    arch/powerpc/kvm/book3s_hv.c:3568:5: warning: symbol
    'kvmhv_p9_guest_entry' was not declared. Should it be static?
    arch/powerpc/kvm/book3s_hv_rm_xics.c:767:15: warning: symbol 'eoi_rc'
    was not declared. Should it be static?
    arch/powerpc/kvm/book3s_64_vio_hv.c:240:13: warning: symbol
    'iommu_tce_kill_rm' was not declared. Should it be static?
    arch/powerpc/kvm/book3s_64_vio.c:492:6: warning: symbol
    'kvmppc_tce_iommu_do_map' was not declared. Should it be static?
    arch/powerpc/kvm/book3s_pr.c:572:6: warning: symbol 'kvmppc_set_pvr_pr'
    was not declared. Should it be static?

    Those symbols are used only in the files that define them so make them
    static to fix the warnings.
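
    For example, the change for one of the symbols above could look roughly
    like this (signature recalled from memory, so treat it as illustrative):

    /* arch/powerpc/kvm/book3s_pr.c */
    static void kvmppc_set_pvr_pr(struct kvm_vcpu *vcpu, u32 pvr)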

    Signed-off-by: Wang Wensheng
    Signed-off-by: Paul Mackerras

    Wang Wensheng
     
  • The variable ret is initialized with '-ENOMEM', but the initialization
    is meaningless, so remove it.

    Signed-off-by: Jing Xiangfeng
    Reviewed-by: Fabiano Rosas
    Signed-off-by: Paul Mackerras

    Jing Xiangfeng
     
  • Use the DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
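
    For reference, a minimal sketch of the pattern (names hypothetical, not
    the actual XIVE debugfs code): DEFINE_SHOW_ATTRIBUTE() generates the open
    routine and the file_operations from a single seq_file show function.

    #include <linux/debugfs.h>
    #include <linux/seq_file.h>

    static int foo_debug_show(struct seq_file *m, void *private)
    {
            seq_puts(m, "hello\n");
            return 0;
    }
    DEFINE_SHOW_ATTRIBUTE(foo_debug);   /* provides foo_debug_fops */

    /* usage: debugfs_create_file("foo", 0444, parent, NULL, &foo_debug_fops); */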

    Signed-off-by: Qinglang Miao
    Reviewed-by: Cédric Le Goater
    Signed-off-by: Paul Mackerras

    Qinglang Miao
     

17 Sep, 2020

3 commits

  • POWER8 and POWER9 machines have a hardware deviation where generation
    of a hypervisor decrementer exception is suppressed if the HDICE bit
    in the LPCR register is 0 at the time when the HDEC register
    decrements from 0 to -1. When entering a guest, KVM first writes the
    HDEC register with the time until it wants the CPU to exit the guest,
    and then writes the LPCR with the guest value, which includes
    HDICE = 1. If HDEC decrements from 0 to -1 during the interval
    between those two events, it is possible that we can enter the guest
    with HDEC already negative but no HDEC exception pending, meaning that
    no HDEC interrupt will occur while the CPU is in the guest, or at
    least not until HDEC wraps around. Thus it is possible for the CPU to
    keep executing in the guest for a long time; up to about 4 seconds on
    POWER8, or about 4.46 years on POWER9 (except that the host kernel
    hard lockup detector will fire first).

    To fix this, we set the LPCR[HDICE] bit before writing HDEC on guest
    entry.
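
    In pseudo-C, the ordering described above amounts to the following sketch
    (variable names are illustrative; the real code lives in the guest entry
    path):

    mtspr(SPRN_LPCR, guest_lpcr);   /* guest LPCR already has HDICE = 1 */
    mtspr(SPRN_HDEC, hdec_ticks);   /* a 0 -> -1 wrap now latches an exception */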

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     
  • The current nested KVM code does not support HPT guests. This is
    informed/enforced in some ways:

    - Hosts < P9 will not be able to enable the nested HV feature;

    - The nested hypervisor MMU capabilities will not contain
    KVM_CAP_PPC_MMU_HASH_V3;

    - QEMU reflects the MMU capabilities in the
    'ibm,arch-vec-5-platform-support' device-tree property;

    - The nested guest, at 'prom_parse_mmu_model' ignores the
    'disable_radix' kernel command line option if HPT is not supported;

    - The KVM_PPC_CONFIGURE_V3_MMU ioctl will fail if trying to use HPT.

    There is, however, still a way to start an HPT guest by using
    max-cpu-compat=power8 in the QEMU machine options. This leads to the
    guest being set to use hash after QEMU calls the KVM_PPC_ALLOCATE_HTAB
    ioctl.

    With the guest set to hash, the nested hypervisor goes through the
    entry path that has no knowledge of nesting (kvmppc_run_vcpu) and
    crashes when it tries to execute a hypervisor-privileged instruction
    (mtspr HDEC) at __kvmppc_vcore_entry:

    root@L1:~ $ qemu-system-ppc64 -machine pseries,max-cpu-compat=power8 ...

    [ 538.543303] CPU: 83 PID: 25185 Comm: CPU 0/KVM Not tainted 5.9.0-rc4 #1
    [ 538.543355] NIP: c00800000753f388 LR: c00800000753f368 CTR: c0000000001e5ec0
    [ 538.543417] REGS: c0000013e91e33b0 TRAP: 0700 Not tainted (5.9.0-rc4)
    [ 538.543470] MSR: 8000000002843033 CR: 22422882 XER: 20040000
    [ 538.543546] CFAR: c00800000753f4b0 IRQMASK: 3
    GPR00: c0080000075397a0 c0000013e91e3640 c00800000755e600 0000000080000000
    GPR04: 0000000000000000 c0000013eab19800 c000001394de0000 00000043a054db72
    GPR08: 00000000003b1652 0000000000000000 0000000000000000 c0080000075502e0
    GPR12: c0000000001e5ec0 c0000007ffa74200 c0000013eab19800 0000000000000008
    GPR16: 0000000000000000 c00000139676c6c0 c000000001d23948 c0000013e91e38b8
    GPR20: 0000000000000053 0000000000000000 0000000000000001 0000000000000000
    GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000001
    GPR28: 0000000000000001 0000000000000053 c0000013eab19800 0000000000000001
    [ 538.544067] NIP [c00800000753f388] __kvmppc_vcore_entry+0x90/0x104 [kvm_hv]
    [ 538.544121] LR [c00800000753f368] __kvmppc_vcore_entry+0x70/0x104 [kvm_hv]
    [ 538.544173] Call Trace:
    [ 538.544196] [c0000013e91e3640] [c0000013e91e3680] 0xc0000013e91e3680 (unreliable)
    [ 538.544260] [c0000013e91e3820] [c0080000075397a0] kvmppc_run_core+0xbc8/0x19d0 [kvm_hv]
    [ 538.544325] [c0000013e91e39e0] [c00800000753d99c] kvmppc_vcpu_run_hv+0x404/0xc00 [kvm_hv]
    [ 538.544394] [c0000013e91e3ad0] [c0080000072da4fc] kvmppc_vcpu_run+0x34/0x48 [kvm]
    [ 538.544472] [c0000013e91e3af0] [c0080000072d61b8] kvm_arch_vcpu_ioctl_run+0x310/0x420 [kvm]
    [ 538.544539] [c0000013e91e3b80] [c0080000072c7450] kvm_vcpu_ioctl+0x298/0x778 [kvm]
    [ 538.544605] [c0000013e91e3ce0] [c0000000004b8c2c] sys_ioctl+0x1dc/0xc90
    [ 538.544662] [c0000013e91e3dc0] [c00000000002f9a4] system_call_exception+0xe4/0x1c0
    [ 538.544726] [c0000013e91e3e20] [c00000000000d140] system_call_common+0xf0/0x27c
    [ 538.544787] Instruction dump:
    [ 538.544821] f86d1098 60000000 60000000 48000099 e8ad0fe8 e8c500a0 e9264140 75290002
    [ 538.544886] 7d1602a6 7cec42a6 40820008 7d0807b4 7d083a14 f90d10a0 480104fd
    [ 538.544953] ---[ end trace 74423e2b948c2e0c ]---

    This patch makes the KVM_PPC_ALLOCATE_HTAB ioctl fail when running in
    the nested hypervisor, causing QEMU to abort.
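
    A sketch of the kind of guard this adds in the KVM_PPC_ALLOCATE_HTAB ioctl
    path, assuming kvmhv_on_pseries() is the "running as a nested hypervisor"
    test (the actual patch may place or phrase the check differently):

    if (kvmhv_on_pseries()) {
            r = -EINVAL;    /* HPT guests are not supported under a nested HV */
            break;
    }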

    Reported-by: Satheesh Rajendran
    Signed-off-by: Fabiano Rosas
    Reviewed-by: Greg Kurz
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Fabiano Rosas
     
  • ENOTSUPP is a linux only thingy, the value of which is unknown to
    userspace, not to be confused with ENOTSUP which linux maps to
    EOPNOTSUPP, as permitted by POSIX [1]:

    [EOPNOTSUPP]
    Operation not supported on socket. The type of socket (address family
    or protocol) does not support the requested operation. A conforming
    implementation may assign the same values for [EOPNOTSUPP] and [ENOTSUP].

    Return -EOPNOTSUPP instead of -ENOTSUPP for the following ioctls:
    - KVM_GET_FPU for Book3s and BookE
    - KVM_SET_FPU for Book3s and BookE
    - KVM_GET_DIRTY_LOG for BookE

    This doesn't affect QEMU which doesn't call the KVM_GET_FPU and
    KVM_SET_FPU ioctls on POWER anyway since they are not supported,
    and _buggily_ ignores anything but -EPERM for KVM_GET_DIRTY_LOG.

    [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html

    Signed-off-by: Greg Kurz
    Acked-by: Thadeu Lima de Souza Cascardo
    Signed-off-by: Paul Mackerras

    Greg Kurz
     

08 Sep, 2020

1 commit

  • In ISA v3.1, the copy-paste facility has new memory move functionality
    which allows the copy buffer to be pasted to domestic memory (RAM) as
    opposed to foreign memory (accelerator).

    This means the POWER9 trick of avoiding the cp_abort on context switch if
    the process had not mapped foreign memory does not work on POWER10. Do the
    cp_abort unconditionally there.

    KVM must also cp_abort on guest exit to prevent copy buffer state leaking
    between contexts.

    Signed-off-by: Nicholas Piggin
    Acked-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200825075535.224536-1-npiggin@gmail.com

    Nicholas Piggin
     

03 Sep, 2020

1 commit

  • Similarly to what was done with XICS-on-XIVE and XIVE native KVM devices
    with commit 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy'
    method by a 'release' method"), convert the historical XICS KVM device to
    implement the 'release' method. This is needed to run nested guests with
    an in-kernel IRQ chip. A typical POWER9 guest can select XICS or XIVE
    during boot, which requires to be able to destroy and to re-create the
    KVM device. Only the historical XICS KVM device is available under pseries
    at the current time and it still uses the legacy 'destroy' method.

    Switching to 'release' means that vCPUs might still be running when the
    device is destroyed. In order to avoid potential use-after-free, the
    kvmppc_xics structure is allocated on first usage and kept around until
    the VM exits. The same pointer is used each time a KVM XICS device is
    being created, but this is okay since we only have one per VM.

    Clear the ICP of each vCPU with vcpu->mutex held. This ensures that the
    next time the vCPU resumes execution, it won't be going into the XICS
    code anymore.

    Signed-off-by: Greg Kurz
    Reviewed-by: Cédric Le Goater
    Tested-by: Cédric Le Goater
    Signed-off-by: Paul Mackerras

    Greg Kurz
     

22 Aug, 2020

1 commit

  • The 'flags' field of 'struct mmu_notifier_range' is used to indicate
    whether invalidate_range_{start,end}() are permitted to block. In the
    case of kvm_mmu_notifier_invalidate_range_start(), this field is not
    forwarded on to the architecture-specific implementation of
    kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
    whether or not to block.

    Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
    architectures are aware as to whether or not they are permitted to block.
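
    The resulting per-architecture hook then looks like this (a sketch of the
    signature described above):

    int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                            unsigned long end, unsigned flags);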

    Cc:
    Cc: Marc Zyngier
    Cc: Suzuki K Poulose
    Cc: James Morse
    Signed-off-by: Will Deacon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Will Deacon
     

10 Aug, 2020

1 commit


08 Aug, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - Add support for (optionally) using queued spinlocks & rwlocks.

    - Support for a new faster system call ABI using the scv instruction on
    Power9 or later.

    - Drop support for the PROT_SAO mmap/mprotect flag as it will be
    unsupported on Power10 and future processors, leaving us with no way
    to implement the functionality it requests. This risks breaking
    userspace, though we believe it is unused in practice.

    - A bug fix for, and then the removal of, our custom stack expansion
    checking. We now allow stack expansion up to the rlimit, like other
    architectures.

    - Remove the remnants of our (previously disabled) topology update
    code, which tried to react to NUMA layout changes on virtualised
    systems, but was prone to crashes and other problems.

    - Add PMU support for Power10 CPUs.

    - A change to our signal trampoline so that we don't unbalance the link
    stack (branch return predictor) in the signal delivery path.

    - Lots of other cleanups, refactorings, smaller features and so on as
    usual.

    Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey
    Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju
    T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan
    S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris
    Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan
    Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn
    Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand,
    Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel
    Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh
    Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan
    Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton
    Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan
    Lynch, Naveen N. Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran,
    Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud,
    Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy
    Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh
    Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar
    Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza
    Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov,
    Wei Yongjun, Wen Xiong, YueHaibing.

    * tag 'powerpc-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (337 commits)
    selftests/powerpc: Fix pkey syscall redefinitions
    powerpc: Fix circular dependency between percpu.h and mmu.h
    powerpc/powernv/sriov: Fix use of uninitialised variable
    selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs
    powerpc/40x: Fix assembler warning about r0
    powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric
    powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
    cpuidle: pseries: Fixup exit latency for CEDE(0)
    cpuidle: pseries: Add function to parse extended CEDE records
    cpuidle: pseries: Set the latency-hint before entering CEDE
    selftests/powerpc: Fix online CPU selection
    powerpc/perf: Consolidate perf_callchain_user_[64|32]()
    powerpc/pseries/hotplug-cpu: Remove double free in error path
    powerpc/pseries/mobility: Add pr_debug() for device tree changes
    powerpc/pseries/mobility: Set pr_fmt()
    powerpc/cacheinfo: Warn if cache object chain becomes unordered
    powerpc/cacheinfo: Improve diagnostics about malformed cache lists
    powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages
    powerpc/cacheinfo: Set pr_fmt()
    powerpc: fix function annotations to avoid section mismatch warnings with gcc-10
    ...

    Linus Torvalds
     

06 Aug, 2020

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "Ralph has been working on nouveau's use of hmm_range_fault() and
    migrate_vma() which resulted in this small series. It adds reporting
    of the page table order from hmm_range_fault() and some optimization
    of migrate_vma():

    - Report the size of the page table mapping out of hmm_range_fault().

    This makes it easier to establish a large/huge/etc mapping in the
    device's page table.

    - Allow devices to ignore the invalidations during migration in cases
    where the migration is not going to change pages.

    For instance migrating pages to a device does not require the
    device to invalidate pages already in the device.

    - Update nouveau and hmm_tests to use the above"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm/test: use the new migration invalidation
    nouveau/svm: use the new migration invalidation
    mm/notifier: add migration invalidation type
    mm/migrate: add a flags parameter to migrate_vma
    nouveau: fix storing invalid ptes
    nouveau/hmm: support mapping large sysmem pages
    nouveau: fix mapping 2MB sysmem pages
    nouveau/hmm: fault one page at a time
    mm/hmm: add tests for hmm_pfn_to_map_order()
    mm/hmm: provide the page mapping order in hmm_range_fault()

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

29 Jul, 2020

4 commits

  • With the proposed change in percpu bootmem allocator to use page
    mapping [1], the percpu first chunk memory area can come from vmalloc
    ranges. This makes the HMI (Hypervisor Maintenance Interrupt) handler
    crash the kernel whenever a percpu variable is accessed in real mode.
    This patch fixes the issue by moving the HMI IRQ stat inside the paca
    for safe access in real mode.

    [1] https://lore.kernel.org/linuxppc-dev/20200608070904.387440-1-aneesh.kumar@linux.ibm.com/
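
    Conceptually, the change is a sketch like the following (the field name is
    an assumption based on the description above): the counter moves from a
    percpu irq_stat field to the paca, which is always mapped and therefore
    safe to touch in real mode.

    /* in struct paca_struct */
    u32 hmi_irqs;                   /* HMI count, bumped by the real-mode handler */

    /* in the HMI handler */
    local_paca->hmi_irqs++;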

    Suggested-by: Aneesh Kumar K.V
    Signed-off-by: Mahesh Salgaonkar
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/159290806973.3642154.5244613424529764050.stgit@jupiter

    Mahesh Salgaonkar
     
  • Current kernel gives:

    [ 0.000000] cma: Reserved 26224 MiB at 0x0000007959000000
    [ 0.000000] hugetlb_cma: reserve 65536 MiB, up to 16384 MiB per node
    [ 0.000000] cma: Reserved 16384 MiB at 0x0000001800000000

    With the fix

    [ 0.000000] kvm_cma_reserve: reserving 26214 MiB for global area
    [ 0.000000] cma: Reserved 26224 MiB at 0x0000007959000000
    [ 0.000000] hugetlb_cma: reserve 65536 MiB, up to 16384 MiB per node
    [ 0.000000] cma: Reserved 16384 MiB at 0x0000001800000000

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200713150749.25245-2-aneesh.kumar@linux.ibm.com

    Aneesh Kumar K.V
     
  • This comment refers to the non-existent CONFIG_PPC_BOOK3S_XX, which
    confuses scripts/checkkconfigsymbols.py.

    Change it to use the correct symbol.

    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200724131728.1643966-8-mpe@ellerman.id.au

    Michael Ellerman
     
  • The src_owner field in struct migrate_vma is being used for two purposes,
    it acts as a selection filter for which types of pages are to be migrated
    and it identifies device private pages owned by the caller.

    Split this into separate parameters so the src_owner field can be used
    just to identify device private pages owned by the caller of
    migrate_vma_setup().

    Rename the src_owner field to pgmap_owner to reflect it is now used only
    to identify which device private pages to migrate.
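
    A hedged sketch of a migrate_vma_setup() caller after the split;
    'pgmap_owner' comes from the description above, while the 'flags' field,
    the MIGRATE_VMA_SELECT_DEVICE_PRIVATE selector and the surrounding driver
    variables are assumptions:

    struct migrate_vma args = {
            .vma            = vma,
            .start          = start,
            .end            = end,
            .src            = src_pfns,
            .dst            = dst_pfns,
            .pgmap_owner    = drvdata,      /* our device private pages */
            .flags          = MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
    };

    if (migrate_vma_setup(&args))
            return -EFAULT;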

    Link: https://lore.kernel.org/r/20200723223004.9586-3-rcampbell@nvidia.com
    Signed-off-by: Ralph Campbell
    Reviewed-by: Bharata B Rao
    Signed-off-by: Jason Gunthorpe

    Ralph Campbell
     

28 Jul, 2020

7 commits

  • When a secure memslot is dropped, all the pages backed in the secure
    device (aka really backed by secure memory by the Ultravisor)
    should be paged out to a normal page. Previously, this was
    achieved by triggering the page fault mechanism, which calls
    kvmppc_svm_page_out() on each page.

    This can't work when hot unplugging a memory slot because the memory
    slot is flagged as invalid and gfn_to_pfn() then does not try to access
    the page, so the page fault mechanism is not triggered.

    Since the final goal is to make a call to kvmppc_svm_page_out(), it seems
    simpler to call it directly instead of triggering such a mechanism. This
    way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
    memslot.

    Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
    the call to __kvmppc_svm_page_out() is made. As
    __kvmppc_svm_page_out needs the vma pointer to migrate the pages,
    the VMA is fetched lazily, to avoid calling find_vma() repeatedly.
    In addition, the mmap_sem is held in read mode during that time, not
    in write mode, since the virtual memory layout is not impacted, and
    kvm->arch.uvmem_lock prevents concurrent operation
    on the secure device.

    Reviewed-by: Bharata B Rao
    Signed-off-by: Laurent Dufour
    [modified check on the VMA in kvmppc_uvmem_drop_pages]
    Signed-off-by: Ram Pai
    [modified the changelog description]
    Signed-off-by: Paul Mackerras

    Laurent Dufour
     
  • kvmppc_svm_page_out() will need to be called by kvmppc_uvmem_drop_pages()
    so move it up earlier in this file.

    Furthermore, it will be useful to call this function while already
    holding the kvm->arch.uvmem_lock, so prefix the original function with
    __ and remove the locking in it, and introduce a wrapper which calls it
    with the lock held.

    There is no functional change.
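
    A minimal sketch of that split (parameters simplified and hypothetical;
    kvm->arch.uvmem_lock is the mutex mentioned above):

    static int __kvmppc_svm_page_out(struct kvm *kvm, unsigned long gpa)
    {
            lockdep_assert_held(&kvm->arch.uvmem_lock);
            /* ... actual page-out work ... */
            return 0;
    }

    static int kvmppc_svm_page_out(struct kvm *kvm, unsigned long gpa)
    {
            int ret;

            mutex_lock(&kvm->arch.uvmem_lock);
            ret = __kvmppc_svm_page_out(kvm, gpa);
            mutex_unlock(&kvm->arch.uvmem_lock);
            return ret;
    }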

    Reviewed-by: Bharata B Rao
    Signed-off-by: Laurent Dufour
    Signed-off-by: Ram Pai
    Signed-off-by: Paul Mackerras

    Laurent Dufour
     
  • When a memory slot is hot plugged to a SVM, PFNs associated with the
    GFNs in that slot must be migrated to the secure-PFNs, aka device-PFNs.

    Call kvmppc_uv_migrate_mem_slot() to accomplish this.
    Disable page-merge for all pages in the memory slot.

    Reviewed-by: Bharata B Rao
    Signed-off-by: Ram Pai
    [rearranged the code, and modified the commit log]
    Signed-off-by: Laurent Dufour
    Signed-off-by: Paul Mackerras

    Laurent Dufour
     
  • The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the
    pages of the SVM before calling H_SVM_INIT_DONE. This causes a huge
    delay in transitioning the VM to SVM. The Ultravisor is only interested
    in the pages that contain the kernel, initrd and other important data
    structures. The rest contain throw-away content.

    However if not all pages are requested by the Ultravisor, the Hypervisor
    continues to consider the GFNs corresponding to the non-requested pages
    as normal GFNs. This can lead to data-corruption and undefined behavior.

    In H_SVM_INIT_DONE handler, move all the PFNs associated with the SVM's
    GFNs to secure-PFNs. Skip the GFNs that are already Paged-in or Shared
    or Paged-in followed by a Paged-out.

    Reviewed-by: Bharata B Rao
    Signed-off-by: Ram Pai
    Signed-off-by: Paul Mackerras

    Ram Pai
     
  • During the life of SVM, its GFNs transition through normal, secure and
    shared states. Since the kernel does not track GFNs that are shared, it
    is not possible to disambiguate a shared GFN from a GFN whose PFN has
    not yet been migrated to a secure-PFN. Also it is not possible to
    disambiguate a secure-GFN from a GFN whose page has been paged out of
    the ultravisor.

    The ability to identify the state of a GFN is needed to skip migration
    of its PFN to secure-PFN during ESM transition.

    The code is re-organized to track the states of a GFN as explained
    below.

    ************************************************************************
    1. States of a GFN
    ---------------
    The GFN can be in one of the following states.

    (a) Secure - The GFN is secure. The GFN is associated with
    a Secure VM, the contents of the GFN is not accessible
    to the Hypervisor. This GFN can be backed by a secure-PFN,
    or can be backed by a normal-PFN with contents encrypted.
    The former is true when the GFN is paged-in into the
    ultravisor. The latter is true when the GFN is paged-out
    of the ultravisor.

    (b) Shared - The GFN is shared. The GFN is associated with a
    secure VM. The contents of the GFN is accessible to the
    Hypervisor. This GFN is backed by a normal-PFN and its
    content is un-encrypted.

    (c) Normal - The GFN is normal. The GFN is associated with
    a normal VM. The contents of the GFN is accessible to
    the Hypervisor. Its content is never encrypted.

    2. States of a VM.
    ---------------

    (a) Normal VM: A VM whose contents are always accessible to
    the hypervisor. All its GFNs are normal-GFNs.

    (b) Secure VM: A VM whose contents are not accessible to the
    hypervisor without the VM's consent. Its GFNs are
    either Shared-GFN or Secure-GFNs.

    (c) Transient VM: A Normal VM that is transitioning to secure VM.
    The transition starts on successful return of
    H_SVM_INIT_START, and ends on successful return
    of H_SVM_INIT_DONE. This transient VM can have GFNs
    in any of the three states; i.e Secure-GFN, Shared-GFN,
    and Normal-GFN. The VM never executes in this state
    in supervisor-mode.

    3. Memory slot State.
    ------------------
    The state of a memory slot mirrors the state of the
    VM the memory slot is associated with.

    4. VM State transition.
    --------------------

    A VM always starts in Normal Mode.

    H_SVM_INIT_START moves the VM into transient state. During this
    time the Ultravisor may request some of its GFNs to be shared or
    secured. So its GFNs can be in one of the three GFN states.

    H_SVM_INIT_DONE moves the VM entirely from transient state to
    secure-state. At this point any left-over normal-GFNs are
    transitioned to Secure-GFN.

    H_SVM_INIT_ABORT moves the transient VM back to normal VM.
    All its GFNs are moved to Normal-GFNs.

    UV_TERMINATE transitions the secure-VM back to normal-VM. All
    the secure-GFN and shared-GFNs are transitioned to normal-GFN.
    Note: The contents of the normal-GFN is undefined at this point.

    5. GFN state implementation:
    -------------------------

    Secure GFN is associated with a secure-PFN; also called uvmem_pfn,
    when the GFN is paged-in. Its pfn[] has KVMPPC_GFN_UVMEM_PFN flag
    set, and contains the value of the secure-PFN.
    It is associated with a normal-PFN; also called mem_pfn, when
    the GFN is paged out. Its pfn[] has KVMPPC_GFN_MEM_PFN flag set.
    The value of the normal-PFN is not tracked.

    Shared GFN is associated with a normal-PFN. Its pfn[] has
    KVMPPC_UVMEM_SHARED_PFN flag set. The value of the normal-PFN
    is not tracked.

    Normal GFN is associated with normal-PFN. Its pfn[] has
    no flag set. The value of the normal-PFN is not tracked.

    6. Life cycle of a GFN
    --------------------
    ----------------------------------------------------------------
    |        |   Share   |  Unshare  |    SVM    | H_SVM_INIT_DONE |
    |        | operation | operation |  abort /  |                 |
    |        |           |           | terminate |                 |
    ----------------------------------------------------------------
    | Secure |  Shared   |  Secure   |  Normal   |     Secure      |
    | Shared |  Shared   |  Secure   |  Normal   |     Shared      |
    | Normal |  Shared   |  Secure   |  Normal   |     Secure      |
    ----------------------------------------------------------------

    7. Life cycle of a VM
    --------------------
    ---------------------------------------------------------------------------
    |           | start  | H_SVM_     | H_SVM_    | H_SVM_     | UV_SVM_   |
    |           | VM     | INIT_START | INIT_DONE | INIT_ABORT | TERMINATE |
    ---------------------------------------------------------------------------
    | Normal    | Normal | Transient  | Error     | Error      | Normal    |
    | Secure    | Error  | Error      | Error     | Error      | Normal    |
    | Transient | N/A    | Error      | Secure    | Normal     | Normal    |
    ---------------------------------------------------------------------------

    ************************************************************************

    Reviewed-by: Bharata B Rao
    Reviewed-by: Thiago Jung Bauermann
    Signed-off-by: Ram Pai
    Signed-off-by: Paul Mackerras

    Ram Pai
     
  • Page-merging of pages in memory-slots associated with a Secure VM
    is disabled in H_SVM_PAGE_IN handler.

    This operation should have been done much earlier, at the moment the VM
    is initiated for secure transition. Delaying this operation increases
    the probability for those pages to acquire new references, making it
    impossible to migrate them in the H_SVM_PAGE_IN handler.

    Disable page-merging in H_SVM_INIT_START handling instead.

    Reviewed-by: Bharata B Rao
    Signed-off-by: Ram Pai
    Signed-off-by: Paul Mackerras

    Ram Pai
     
  • Without this fix, git is confused. It generates wrong
    function context for code changes in subsequent patches.
    Weird, but true.

    Signed-off-by: Ram Pai
    Signed-off-by: Paul Mackerras

    Ram Pai
     

26 Jul, 2020

1 commit


23 Jul, 2020

2 commits

  • On PAPR+ the hcall() on 0x1B0 is called H_DISABLE_AND_GET, but got
    defined as H_DISABLE_AND_GETC instead.

    This define was introduced with a typo in commit
    ("[PATCH] powerpc: Extends HCALL interface for InfiniBand usage"), and was
    later used without having the typo noticed.

    Signed-off-by: Leonardo Bras
    Acked-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20200707004812.190765-1-leobras.c@gmail.com

    Leonardo Bras
     
  • In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
    structure. For historical reasons, many kvm-related function parameters
    retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
    patch does a unified cleanup of these remaining redundant parameters.

    [paulus@ozlabs.org - Fixed places that were missed in book3s_interrupts.S]
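
    One example of the kind of signature change involved (recalled from
    memory, shown only to illustrate the pattern; the series touches many
    such helpers):

    /* before */
    int kvmppc_handle_load(struct kvm_run *run, struct kvm_vcpu *vcpu,
                           unsigned int rt, unsigned int bytes,
                           int is_default_endian);

    /* after: the run structure is reached through vcpu->run internally */
    int kvmppc_handle_load(struct kvm_vcpu *vcpu, unsigned int rt,
                           unsigned int bytes, int is_default_endian);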

    Signed-off-by: Tianjia Zhang
    Signed-off-by: Paul Mackerras

    Tianjia Zhang
     

22 Jul, 2020

2 commits

  • Power ISA v3.1 has added new performance monitoring unit (PMU) special
    purpose registers (SPRs). They are:

    Monitor Mode Control Register 3 (MMCR3)
    Sampled Instruction Event Register A (SIER2)
    Sampled Instruction Event Register B (SIER3)

    Add support to save/restore these new SPRs while entering/exiting
    guest. Also include changes to support KVM_REG_PPC_MMCR3/SIER2/SIER3.
    Add new SPRs to KVM API documentation.

    Signed-off-by: Athira Rajeev
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/1594996707-3727-6-git-send-email-atrajeev@linux.vnet.ibm.com

    Athira Rajeev
     
  • Currently `kvm_vcpu_arch` stores all Monitor Mode Control registers
    in a flat array, in order: mmcr0, mmcr1, mmcra, mmcr2, mmcrs.
    Split this to give mmcra and mmcrs their own entries in the vcpu and
    use a flat array for mmcr0 to mmcr2. This patch implements this
    cleanup to make code easier to read.

    Signed-off-by: Athira Rajeev
    [mpe: Fix MMCRA/MMCR2 uapi breakage as noted by paulus]
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/1594996707-3727-3-git-send-email-atrajeev@linux.vnet.ibm.com

    Athira Rajeev
     

21 Jul, 2020

2 commits

  • The kvm_vcpu_read_guest/kvm_vcpu_write_guest helpers used for nested guests
    eventually call srcu_dereference_check to dereference a memslot and
    lockdep produces a warning as neither kvm->slots_lock nor
    kvm->srcu lock is held and kvm->users_count is above zero (>100 in fact).

    This wraps the mentioned VCPU read/write helpers in srcu read lock/unlock,
    as is done in other places, using vcpu->srcu_idx when possible.

    These helpers are only used for nested KVM, which may explain why
    we did not see these warnings before.

    Here is an example of a warning:

    =============================
    WARNING: suspicious RCU usage
    5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897 Not tainted
    -----------------------------
    include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by qemu-system-ppc/2752:
    #0: c000200359016be0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x144/0xd80 [kvm]

    stack backtrace:
    CPU: 80 PID: 2752 Comm: qemu-system-ppc Not tainted 5.7.0-rc3-le_dma-bypass.3.2_a+fstn1 #897
    Call Trace:
    [c0002003591ab240] [c000000000b23ab4] dump_stack+0x190/0x25c (unreliable)
    [c0002003591ab2b0] [c00000000023f954] lockdep_rcu_suspicious+0x140/0x164
    [c0002003591ab330] [c008000004a445f8] kvm_vcpu_gfn_to_memslot+0x4c0/0x510 [kvm]
    [c0002003591ab3a0] [c008000004a44c18] kvm_vcpu_read_guest+0xa0/0x180 [kvm]
    [c0002003591ab410] [c008000004ff9bd8] kvmhv_enter_nested_guest+0x90/0xb80 [kvm_hv]
    [c0002003591ab980] [c008000004fe07bc] kvmppc_pseries_do_hcall+0x7b4/0x1c30 [kvm_hv]
    [c0002003591aba10] [c008000004fe5d30] kvmppc_vcpu_run_hv+0x10a8/0x1a30 [kvm_hv]
    [c0002003591abae0] [c008000004a5d954] kvmppc_vcpu_run+0x4c/0x70 [kvm]
    [c0002003591abb10] [c008000004a56e54] kvm_arch_vcpu_ioctl_run+0x56c/0x7c0 [kvm]
    [c0002003591abba0] [c008000004a3ddc4] kvm_vcpu_ioctl+0x4ac/0xd80 [kvm]
    [c0002003591abd20] [c0000000006ebb58] ksys_ioctl+0x188/0x210
    [c0002003591abd70] [c0000000006ebc28] sys_ioctl+0x48/0xb0
    [c0002003591abdb0] [c000000000042764] system_call_exception+0x1d4/0x2e0
    [c0002003591abe20] [c00000000000cce8] system_call_common+0xe8/0x214
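
    A minimal sketch of the wrapping described above (the helper name is
    hypothetical; the real change adjusts the existing call sites):

    static int read_guest_locked(struct kvm_vcpu *vcpu, gpa_t gpa,
                                 void *buf, unsigned long len)
    {
            int srcu_idx, ret;

            srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
            ret = kvm_vcpu_read_guest(vcpu, gpa, buf, len);
            srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
            return ret;
    }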

    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Paul Mackerras

    Alexey Kardashevskiy
     
  • POWER8 and POWER9 have 12-bit LPIDs. Change LPID_RSVD to support up to
    (4096 - 2) guests on these processors. POWER7 is kept the same with a
    limitation of (1024 - 2), but it might be time to drop KVM support for
    POWER7.

    Tested with 2048 guests * 4 vCPUs on a witherspoon system with 512G
    RAM and a bit of swap.
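
    The likely shape of the change, given the limits quoted above (macro names
    and exact values are an assumption, not a quote of the patch):

    #define LPID_RSVD_POWER7    1023    /* 10-bit LPID space: (1024 - 2) guests */
    #define LPID_RSVD           4095    /* 12-bit LPID space on POWER8/POWER9 */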

    Signed-off-by: Cédric Le Goater
    Signed-off-by: Paul Mackerras

    Cédric Le Goater