16 Mar, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - some cleanups
    - direct physical timer assignment
    - cache sanitization for 32-bit guests

    s390:
    - interrupt cleanup
    - introduction of the Guest Information Block
    - preparation for processor subfunctions in cpu models

    PPC:
    - bug fixes and improvements, especially related to machine checks
    and protection keys

    x86:
    - many, many cleanups, including removing a bunch of MMU code for
    unnecessary optimizations
    - AVIC fixes

    Generic:
    - memcg accounting"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
    kvm: vmx: fix formatting of a comment
    KVM: doc: Document the life cycle of a VM and its resources
    MAINTAINERS: Add KVM selftests to existing KVM entry
    Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
    KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
    KVM: PPC: Fix compilation when KVM is not enabled
    KVM: Minor cleanups for kvm_main.c
    KVM: s390: add debug logging for cpu model subfunctions
    KVM: s390: implement subfunction processor calls
    arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
    KVM: arm/arm64: Remove unused timer variable
    KVM: PPC: Book3S: Improve KVM reference counting
    KVM: PPC: Book3S HV: Fix build failure without IOMMU support
    Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
    x86: kvmguest: use TSC clocksource if invariant TSC is exposed
    KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
    KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
    KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
    KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
    KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
    ...

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Pull ACPI updates from Rafael Wysocki:
    "These are ACPICA updates including ACPI 6.3 support among other
    things, APEI updates including the ARM Software Delegated Exception
    Interface (SDEI) support, ACPI EC driver fixes and cleanups and other
    assorted improvements.

    Specifics:

    - Update the ACPICA code in the kernel to upstream revision 20190215
    including ACPI 6.3 support and more:
    * New predefined methods: _NBS, _NCH, _NIC, _NIH, and _NIG (Erik
    Schmauss).
    * Update of the PCC Identifier structure in PDTT (Erik Schmauss).
    * Support for new Generic Affinity Structure subtable in SRAT
    (Erik Schmauss).
    * New PCC operation region support (Erik Schmauss).
    * Support for GICC statistical profiling for MADT (Erik Schmauss).
    * New Error Disconnect Recover notification support (Erik
    Schmauss).
    * New PPTT Processor Structure Flags fields support (Erik
    Schmauss).
    * ACPI 6.3 HMAT updates (Erik Schmauss).
    * GTDT Revision 3 support (Erik Schmauss).
    * Legacy module-level code (MLC) support removal (Erik Schmauss).
    * Update/clarification of messages for control method failures
    (Bob Moore).
    * Warning on creation of a zero-length opregion (Bob Moore).
    * acpiexec option to dump extra info for memory leaks (Bob Moore).
    * More ACPI error to firmware error conversions (Bob Moore).
    * Debugger fix (Bob Moore).
    * Copyrights update (Bob Moore)

    - Clean up sleep states support code in ACPICA (Christoph Hellwig)

    - Rework in_nmi() handling in the APEI code and add support for the
    ARM Software Delegated Exception Interface (SDEI) to it (James
    Morse)

    - Fix possible out-of-bounds accesses in BERT-related code (Ross
    Lagerwall)

    - Fix the APEI code parsing HEST that includes a Deferred Machine
    Check subtable (Yazen Ghannam)

    - Use DEFINE_DEBUGFS_ATTRIBUTE for APEI-related debugfs files
    (YueHaibing)

    - Switch the APEI ERST code to the new generic UUID API (Andy
    Shevchenko)

    - Update the MAINTAINERS entry for APEI (Borislav Petkov)

    - Fix and clean up the ACPI EC driver (Rafael Wysocki, Zhang Rui)

    - Fix DMI checks handling in the ACPI backlight driver and add the
    "Lunch Box" chassis-type check to it (Hans de Goede)

    - Add support for using ACPI table overrides included in built-in
    initrd images (Shunyong Yang)

    - Update ACPI device enumeration to treat the PWM2 device as "always
    present" on Lenovo Yoga Book (Yauhen Kharuzhy)

    - Fix up the enumeration of device objects with the PRP0001 device ID
    (Andy Shevchenko)

    - Clean up PPTT parsing error messages (John Garry)

    - Clean up debugfs files creation handling (Greg Kroah-Hartman,
    Rafael Wysocki)

    - Clean up the ACPI DPTF Makefile (Masahiro Yamada)"

    * tag 'acpi-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    ACPI / bus: Respect PRP0001 when retrieving device match data
    ACPICA: Update version to 20190215
    ACPI/ACPICA: Trivial: fix spelling mistakes and fix whitespace formatting
    ACPICA: ACPI 6.3: add GTDT Revision 3 support
    ACPICA: ACPI 6.3: HMAT updates
    ACPICA: ACPI 6.3: PPTT add additional fields in Processor Structure Flags
    ACPICA: ACPI 6.3: add Error Disconnect Recover Notification value
    ACPICA: ACPI 6.3: MADT: add support for statistical profiling in GICC
    ACPICA: ACPI 6.3: add PCC operation region support for AML interpreter
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    ACPICA: ACPI 6.3: SRAT: add Generic Affinity Structure subtable
    ACPICA: ACPI 6.3: Add Trigger order to PCC Identifier structure in PDTT
    ACPICA: ACPI 6.3: Adding predefined methods _NBS, _NCH, _NIC, _NIH, and _NIG
    ACPICA: Update/clarify messages for control method failures
    ACPICA: Debugger: Fix possible fault with the "test objects" command
    ACPICA: Interpreter: Emit warning for creation of a zero-length op region
    ACPICA: Remove legacy module-level code support
    ACPI / x86: Make PWM2 device always present at Lenovo Yoga Book
    ACPI / video: Extend chassis-type detection with a "Lunch Box" check
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main RCU related changes in this cycle were:

    - Additional cleanups after RCU flavor consolidation

    - Grace-period forward-progress cleanups and improvements

    - Documentation updates

    - Miscellaneous fixes

    - spin_is_locked() conversions to lockdep

    - SPDX changes to RCU source and header files

    - SRCU updates

    - Torture-test updates, including nolibc updates and moving nolibc to
    tools/include"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    locking/locktorture: Convert to SPDX license identifier
    linux/torture: Convert to SPDX license identifier
    torture: Convert to SPDX license identifier
    linux/srcu: Convert to SPDX license identifier
    linux/rcutree: Convert to SPDX license identifier
    linux/rcutiny: Convert to SPDX license identifier
    linux/rcu_sync: Convert to SPDX license identifier
    linux/rcu_segcblist: Convert to SPDX license identifier
    linux/rcupdate: Convert to SPDX license identifier
    linux/rcu_node_tree: Convert to SPDX license identifier
    rcu/update: Convert to SPDX license identifier
    rcu/tree: Convert to SPDX license identifier
    rcu/tiny: Convert to SPDX license identifier
    rcu/sync: Convert to SPDX license identifier
    rcu/srcu: Convert to SPDX license identifier
    rcu/rcutorture: Convert to SPDX license identifier
    rcu/rcu_segcblist: Convert to SPDX license identifier
    rcu/rcuperf: Convert to SPDX license identifier
    rcu/rcu.h: Convert to SPDX license identifier
    RCU/torture.txt: Remove section MODULE PARAMETERS
    ...

    Linus Torvalds
     

04 Mar, 2019

1 commit

  • * acpi-apei: (29 commits)
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    MAINTAINERS: Add James Morse to the list of APEI reviewers
    ACPI / APEI: Add support for the SDEI GHES Notification type
    firmware: arm_sdei: Add ACPI GHES registration helper
    ACPI / APEI: Use separate fixmap pages for arm64 NMI-like notifications
    ACPI / APEI: Only use queued estatus entry during in_nmi_queue_one_entry()
    ACPI / APEI: Split ghes_read_estatus() to allow a peek at the CPER length
    ACPI / APEI: Make GHES estatus header validation more user friendly
    ACPI / APEI: Pass ghes and estatus separately to avoid a later copy
    ACPI / APEI: Let the notification helper specify the fixmap slot
    ACPI / APEI: Move locking to the notification helper
    arm64: KVM/mm: Move SEA handling behind a single 'claim' interface
    KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing
    ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue
    ACPI / APEI: Move NOTIFY_SEA between the estatus-queue and NOTIFY_NMI
    ACPI / APEI: Don't allow ghes_ack_error() to mask earlier errors
    ACPI / APEI: Generalise the estatus queue's notify code
    ACPI / APEI: Don't update struct ghes' flags in read/clear estatus
    ACPI / APEI: Remove spurious GHES_TO_CLEAR check
    ...

    Rafael J. Wysocki
     

01 Mar, 2019

1 commit

  • debugfs can now report an error code if something went wrong instead
    of just NULL. So if the return value is to be used as a "real"
    dentry, it needs to be checked for an error before being
    dereferenced.

    This is now happening because of ff9fb72bc077 ("debugfs: return error
    values, not NULL"). syzbot has found a way to trigger the creation of
    duplicate debugfs files; the creation fails, and the resulting error
    code then gets passed to dentry_path_raw(), which obviously does not
    like it.

    Reported-by: Eric Biggers
    Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
    Cc: "Radim Krčmář"
    Cc: kvm@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

23 Feb, 2019

2 commits


22 Feb, 2019

1 commit

  • The 'timer' local variable became unused after commit bee038a67487
    ("KVM: arm/arm64: Rework the timer code to use a timer_map").
    Remove it to avoid a [-Wunused-but-set-variable] warning.

    Cc: Christoffer Dall
    Cc: James Morse
    Cc: Suzuki K Pouloze
    Reviewed-by: Julien Thierry
    Signed-off-by: Shaokun Zhang
    Signed-off-by: Marc Zyngier

    Shaokun Zhang
     

21 Feb, 2019

10 commits

  • The value of "dirty_bitmap[i]" is already checked before it is
    assigned to "mask", so the subsequent check of "mask" is redundant.
    That check was introduced by commit 58d2930f4ee3 ("KVM: Eliminate
    extra function calls in kvm_get_dirty_log_protect()"); revert it.
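
    For context, a minimal sketch of the resulting loop shape (variable
    and helper names assumed from the 2019-era
    kvm_get_dirty_log_protect(), not quoted from the patch):

        for (i = 0; i < n / sizeof(long); i++) {
                if (!dirty_bitmap[i])
                        continue;

                flush = true;
                mask = xchg(&dirty_bitmap[i], 0);
                dirty_bitmap_buffer[i] = mask;

                /* "mask" needs no second test before this call */
                kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                i * BITS_PER_LONG, mask);
        }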

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • grow_halt_poll_ns() has a strange behaviour in case
    (vcpu->halt_poll_ns != 0) &&
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start).

    In this case, vcpu->halt_poll_ns is multiplied by the grow factor
    (halt_poll_ns_grow), which requires several grow iterations in order
    to reach a value bigger than halt_poll_ns_grow_start.
    This means that growing vcpu->halt_poll_ns from a value of 0 is
    faster than growing it from a positive value less than
    halt_poll_ns_grow_start, which is misleading and inaccurate.

    Fix the issue by changing grow_halt_poll_ns() to set
    vcpu->halt_poll_ns to halt_poll_ns_grow_start in any case where
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start), regardless of
    whether vcpu->halt_poll_ns is 0.

    Use READ_ONCE() to get a consistent value in all cases.
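
    Taken together with the two entries below, the resulting logic is
    roughly the following (a sketch, not the verbatim upstream code):

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                unsigned int val, grow, grow_start;

                grow_start = READ_ONCE(halt_poll_ns_grow_start);
                grow = READ_ONCE(halt_poll_ns_grow);
                if (!grow)              /* a grow factor of 0 must not shrink */
                        return;

                val = vcpu->halt_poll_ns * grow;
                if (val < grow_start)   /* covers halt_poll_ns == 0 too */
                        val = grow_start;

                vcpu->halt_poll_ns = val;
        }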

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • The hard-coded value 10000 in grow_halt_poll_ns() is the initial
    start value used when raising vcpu->halt_poll_ns; it effectively
    sets the timeout of the first polling session.
    This value has a significant effect on how tolerant we are to
    outliers. In the common case a higher value is better: we spend more
    time in the polling busy loop, handle events/interrupts faster, and
    get better performance. But on outliers it puts us in a busy loop
    that does nothing; even if the shrink factor is zero, we still waste
    time on the first iteration.
    The optimal value differs between workloads, depending on the
    outlier rate and the length of polling sessions.
    As this value has a significant effect on the dynamic halt-polling
    algorithm, it should be configurable and exposed.
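
    A sketch of the exposure (the variable name follows this entry; the
    permission bits are an assumption):

        /* the former hard-coded 10000 ns start value, now tunable */
        unsigned int halt_poll_ns_grow_start = 10000;
        module_param(halt_poll_ns_grow_start, uint, 0644);
        EXPORT_SYMBOL_GPL(halt_poll_ns_grow_start);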

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • grow_halt_poll_ns() has a strange behavior in case
    (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

    In this case, vcpu->halt_poll_ns is set to zero, which results in
    shrinking instead of growing.

    Fix the issue by changing grow_halt_poll_ns() to not modify
    vcpu->halt_poll_ns when halt_poll_ns_grow is zero.

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Suggested-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • Move the "update in-progress" flag to bit 63, now that KVM won't
    explode by moving it out of bit 0. Using bit 63 eliminates the need
    to jump over bit 0, e.g. when calculating a new memslots generation
    or when propagating the memslots generation to an MMIO spte.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • x86 captures a subset of the memslot generation (19 bits) in its MMIO
    sptes so that it can expedite emulated MMIO handling by checking only
    the relevant spte, i.e. without doing a full page fault walk.

    Because the MMIO sptes capture only 19 bits (due to limited space in
    the sptes), there is a non-zero probability that the MMIO generation
    could wrap, e.g. after 500k memslot updates. Since normal usage is
    extremely unlikely to result in 500k memslot updates, a hack was added
    by commit 69c9ea93eaea ("KVM: MMU: init kvm generation close to mmio
    wrap-around value") to offset the MMIO generation in order to trigger
    a wraparound, e.g. after 150 memslot updates.

    When separate memslot generation sequences were assigned to each
    address space, commit 00f034a12fdd ("KVM: do not bias the generation
    number in kvm_current_mmio_generation") moved the offset logic into the
    initialization of the memslot generation itself so that the per-address
    space bit(s) were not dropped/corrupted by the MMIO shenanigans.

    Remove the offset hack for three reasons:

    - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
    wrapping the generation doesn't actually test the interesting case
    of having stale MMIO sptes with the new generation number, e.g. old
    sptes with a generation number of 0.

    - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
    performance rather important since the probability of invalidating
    MMIO sptes jumps from "effectively never" to "fairly likely". This
    limits what can be done in future patches, e.g. to simplify the
    invalidation code, as doing so without proper caution could lead to
    a noticeable performance regression.

    - Forcing the memslots generation, which is a 64-bit number, to wrap
    prevents KVM from assuming the memslots generation will never wrap.
    This in turn prevents KVM from using an arbitrary bit for the
    "update in-progress" flag, e.g. using bit 63 would immediately
    collide with using a large value as the starting generation number.
    The "update in-progress" flag is effectively forced into bit 0 so
    that it's (subtly) taken into account when incrementing the
    generation.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • KVM uses bit 0 of the memslots generation as an "update in-progress"
    flag, which is used by x86 to prevent caching MMIO access while the
    memslots are changing. Although the intended behavior is flag-like,
    e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
    caching data from in-flux memslots, the implementation oftentimes treats
    the bit as part of the generation number itself, e.g. the generation
    is incremented twice per update, once to set the flag and once to
    clear it.

    Prior to commit 4bd518f1598d ("KVM: use separate generations for
    each address space"), incorporating the "update in-progress" bit into
    the generation number largely made sense, e.g. "real" generations are
    even, "bogus" generations are odd, most code doesn't need to be aware of
    the bit, etc...

    Now that unique memslots generation numbers are assigned to each address
    space, stealthing the in-progress status into the generation number
    results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
    over bit 0 when initializing the memslots generation without any hint as
    to why.

    Explicitly define the flag and convert as much code as possible (which
    isn't much) to actually treat it like a flag. This paves the way for
    eventually using a different bit for "update in-progress" so that it
    can be a flag in truth instead of an awkward extension to the
    generation number.
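
    A sketch of the flag-like treatment this describes (the macro name is
    an assumption; details simplified):

        #define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS      BIT_ULL(0)

        /* set while the memslots update is in flight ... */
        slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;

        /* ... then dropped, with the generation bumped, once it's done */
        slots->generation = (gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS) + 1;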

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • kvm_arch_memslots_updated() is at this point in time an x86-specific
    hook for handling MMIO generation wraparound. x86 stashes 19 bits of
    the memslots generation number in its MMIO sptes in order to avoid
    full page fault walks for repeat faults on emulated MMIO addresses.
    Because only 19 bits are used, wrapping the MMIO generation number is
    possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
    the generation has changed so that it can invalidate all MMIO sptes in
    case the effective MMIO generation has wrapped so as to avoid using a
    stale spte, e.g. a (very) old spte that was created with generation==0.

    Given that the purpose of kvm_arch_memslots_updated() is to prevent
    consuming stale entries, it needs to be called before the new generation
    is propagated to memslots. Invalidating the MMIO sptes after updating
    memslots means that there is a window where a vCPU could dereference
    the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
    spte that was created with (pre-wrap) generation==0.
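
    In sketch form, the fix reorders the update path so the arch hook
    runs before the new generation is published (helper placement
    simplified):

        /*
         * Zap stale MMIO sptes *before* vCPUs can observe the new
         * generation, so nothing can match against a stale entry.
         */
        kvm_arch_memslots_updated(kvm, gen);
        slots->generation = gen;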

    Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • There are many KVM kernel memory allocations which are tied to the life of
    the VM process and should be charged to the VM process's cgroup. If the
    allocations aren't tied to the process, the OOM killer will not know
    that killing the process will free the associated kernel memory.
    Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
    charged to the VM process's cgroup.
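
    The change itself is mechanical; a representative (illustrative, not
    quoted) call site looks like this:

        /* GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT, charging the
         * allocation to the current process's memory cgroup */
        vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);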

    Tested:
    Ran all kvm-unit-tests on a 64-bit Haswell machine; the patch
    introduced no new failures.
    Ran a kernel memory accounting test which creates a VM to touch
    memory and then checks that the kernel memory allocated for the
    process is within certain bounds.
    With this patch we account for much more of the vmalloc and slab memory
    allocated for the VM.

    There remain a few allocations which should be charged to the VM's
    cgroup but are not. They include:

        vcpu->run
        kvm->coalesced_mmio_ring

    These allocations are unaccounted in this patch because they are
    mapped to userspace, and accounting them to a cgroup causes problems.
    This should be addressed in a future patch.

    Signed-off-by: Ben Gardon
    Reviewed-by: Shakeel Butt
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

        struct foo {
                int stuff;
                void *entry[];
        };

        instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

        instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
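
    Besides being clearer, struct_size() checks the arithmetic for
    overflow and saturates to SIZE_MAX, so a huge "count" makes the
    allocation fail cleanly instead of wrapping into a too-small size.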

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Paolo Bonzini

    Gustavo A. R. Silva
     

20 Feb, 2019

14 commits

  • The 'gpa_end' local variable is never used, so let's remove it.

    Cc: Christoffer Dall
    Signed-off-by: Shaokun Zhang
    Signed-off-by: Marc Zyngier

    Shaokun Zhang
     
  • There is a spelling mistake in a kvm_err error message. Fix it.

    Signed-off-by: Colin Ian King
    Signed-off-by: Marc Zyngier

    Colin Ian King
     
  • As the comment block in include/trace/define_trace.h says,
    TRACE_INCLUDE_PATH should be a path relative to define_trace.h.

    ../../virt/kvm/arm is the correct relative path.

    ../../../virt/kvm/arm is working by coincidence because the top
    Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
    search path, but we should not rely on it.
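
    The resulting definition, assuming the arm KVM trace header is where
    this lives:

        /* resolved relative to include/trace/define_trace.h */
        #undef TRACE_INCLUDE_PATH
        #define TRACE_INCLUDE_PATH ../../virt/kvm/arm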

    Acked-by: Christoffer Dall
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Marc Zyngier

    Masahiro Yamada
     
  • When a guest gets scheduled, KVM performs a "load" operation,
    which for the timer includes evaluating the virtual "active" state
    of the interrupt, and replicating it on the physical side. This
    ensures that the deactivation in the guest will also take place
    in the physical GIC distributor.

    If the interrupt is not yet active, we flag it as inactive on the
    physical side. This means that on restoring the timer registers,
    if the timer has expired, we'll immediately take an interrupt.
    That's absolutely fine, as the interrupt will then be flagged as
    active on the physical side. What this assumes though is that we'll
    enter the guest right after having taken the interrupt, and that
    the guest will quickly ACK the interrupt, making it active on the
    virtual side.

    It turns out that quite often, this assumption doesn't really hold.
    The guest may be preempted on the back of this interrupt, either
    from kernel space or whilst running at EL1 when a host interrupt
    fires. When this happens, we repeat the whole sequence on the
    next load (interrupt marked as inactive, timer registers restored,
    interrupt fires). And if it takes a really long time for a guest
    to activate the interrupt (as it does with nested virt), we end up
    with many such events in quick succession, leading to the guest only
    making very slow progress.

    This can also be seen in the number of virtual timer interrupts on
    the host being far greater than the corresponding number in the
    guest.

    An easy way to fix this is to evaluate the timer state when performing
    the "load" operation, just like we do when the interrupt actually fires.
    If the timer has a pending virtual interrupt at this stage, then we
    can safely flag the physical interrupt as being active, which prevents
    spurious exits.
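
    A condensed sketch of the idea (helper names assumed from the
    surrounding timer code; details simplified):

        /* On vcpu load: if the timer would already fire, mark the line
         * active on the physical distributor, exactly as if the interrupt
         * had been taken and ACKed, avoiding the spurious-exit loop. */
        bool phys_active = kvm_timer_should_fire(ctx) ||
                           kvm_vgic_map_is_active(vcpu, ctx->irq.irq);

        set_timer_irq_phys_active(ctx, phys_active);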

    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Move this little function to the header files for arm/arm64 so other
    code can make use of it directly.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We are currently emulating two timers in two different ways. When we
    add support for nested virtualization in the future, we are going to
    be emulating either two timers in two different ways, or four timers
    in a single way.

    We need a unified data structure to keep track of how we map virtual
    state to physical state, and we need to clean up some of the timer
    code to operate more independently on a struct arch_timer_context
    instead of trying to consider the global state of the VCPU and
    recomputing all state.
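
    A plausible shape for such a mapping structure (the field names are
    assumptions based on the description above):

        struct timer_map {
                struct arch_timer_context *direct_vtimer; /* hw-backed */
                struct arch_timer_context *direct_ptimer; /* hw-backed */
                struct arch_timer_context *emul_ptimer;   /* emulated */
        };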

    Co-written with Marc Zyngier

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • VHE systems don't have to emulate the physical timer; we can simply
    assign the EL1 physical timer directly to the VM, as the host always
    uses the EL2 timers.

    In order to minimize the amount of cruft, AArch32 gets definitions
    for the physical timer too, but it should generally be unused on this
    architecture.

    Co-written with Marc Zyngier

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Prepare for having 4 timer data structures (2 for now).

    Move the 'loaded' field to the cpu data structure rather than the
    individual timer structure, in preparation for assigning the EL1
    phys timer as well.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • At the moment we have separate system register emulation handlers for
    each timer register. They are actually quite similar, and we rely on
    kvm_arm_timer_[gs]et_reg() for the actual emulation anyway, so let's
    just merge all of those handlers into one function, which marshals
    the arguments and then hands off to a set of common accessors.
    This makes extending the emulation to include EL2 timers much easier.

    Signed-off-by: Andre Przywara
    [Fixed 32-bit VM breakage and reduced to reworking existing code]
    Signed-off-by: Christoffer Dall
    [Fixed 32bit host, general cleanup]
    Signed-off-by: Marc Zyngier

    Andre Przywara
     
  • We previously incorrectly named the define for this system register.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • Instead of calling into kvm_timer_[un]schedule from the main kvm
    blocking path, test if the VCPU is on the wait queue from the load/put
    path and perform the background timer setup/cancel in this path.

    This has the distinct advantage that we no longer race between
    load/put and schedule/unschedule, and that programming and canceling
    of the bg_timer always happens when the timer state is not loaded.

    Note that we must now remove the checks in kvm_timer_blocking that do
    not schedule a background timer if one of the timers can fire, because
    we no longer have a guarantee that kvm_vcpu_check_block() will be called
    before kvm_timer_blocking.

    Reported-by: Andre Przywara
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • In preparation for nested virtualization, where we are going to have
    more than a single VMID per VM, let's factor out the VMID data into
    a separate VMID data structure and change the VMID allocator to
    operate on this new structure instead of using a struct kvm.

    This also means that update_vttbr now becomes update_vmid, and that
    the vttbr itself is generated on the fly based on the stage 2 page
    table base address and the vmid.

    We cache the physical address of the pgd when allocating the pgd to
    avoid doing the calculation on every entry to the guest and to avoid
    calling into potentially non-hyp-mapped code from hyp/EL2.

    If we wanted to merge the VMID allocator with the arm64 ASID allocator
    at some point in the future, it should actually become easier to do that
    after this patch.

    Note that to avoid mapping the kvm_vmid_bits variable into hyp, we
    simply forego the masking of the vmid value in kvm_get_vttbr and rely on
    update_vmid to always assign a valid vmid value (within the supported
    range).
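
    A sketch of the on-the-fly VTTBR construction this describes (field
    and macro names are assumptions):

        struct kvm_vmid {
                u64 vmid_gen;   /* generation of the VMID allocator */
                u32 vmid;
        };

        static inline u64 kvm_get_vttbr(struct kvm *kvm)
        {
                struct kvm_vmid *vmid = &kvm->arch.vmid;

                /* no masking with kvm_vmid_bits here; update_vmid()
                 * guarantees the stored value is already in range */
                return kvm->arch.pgd_phys |
                       ((u64)vmid->vmid << VTTBR_VMID_SHIFT);
        }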

    Reviewed-by: Marc Zyngier
    [maz: minor cleanups]
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently eagerly save/restore MPIDR. It turns out to be
    slightly pointless:
    - On the host, this value is known as soon as we're scheduled on a
    physical CPU
    - In the guest, this value cannot change, as it is set by KVM
    (and this is a read-only register)

    The result of the above is that we can perfectly avoid the eager
    saving of MPIDR_EL1 and only keep the restore. We just have to set
    up the host contexts appropriately at boot time.

    Signed-off-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • Until now, we haven't differentiated between HYP calls that have a
    return value and those that don't. As we're about to change this,
    introduce kvm_call_hyp_ret(), and convert all call sites that
    actually make use of a return value.
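
    The conversion at call sites is mechanical; for example (the callee
    is illustrative):

        /* before: nothing marks this call as producing a value */
        val = kvm_call_hyp(__vgic_v3_get_ich_vtr_el2);

        /* after: the intent is explicit at the call site */
        val = kvm_call_hyp_ret(__vgic_v3_get_ich_vtr_el2);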

    Signed-off-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

14 Feb, 2019

1 commit


13 Feb, 2019

1 commit

  • …/linux-rcu into core/rcu

    Pull the latest RCU tree from Paul E. McKenney:

    - Additional cleanups after RCU flavor consolidation
    - Grace-period forward-progress cleanups and improvements
    - Documentation updates
    - Miscellaneous fixes
    - spin_is_locked() conversions to lockdep
    - SPDX changes to RCU source and header files
    - SRCU updates
    - Torture-test updates, including nolibc updates and moving
    nolibc to tools/include

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

08 Feb, 2019

2 commits

  • To split up APEI's in_nmi() path, the caller needs to always be
    in_nmi(). KVM shouldn't have to know about this; pull the RAS
    plumbing out into a header file.

    Currently guest synchronous external aborts are claimed as RAS
    notifications by handle_guest_sea(), which is hidden in the arch
    code's mm/fault.c. 32bit gets a dummy declaration in system_misc.h.

    There is going to be more of this in the future if/when the kernel
    supports the SError-based firmware-first notification mechanism and/or
    kernel-first notifications for both synchronous external abort and
    SError. Each of these will come with some Kconfig symbols and a
    handful of header files.

    Create a header file for all this.

    This patch gives handle_guest_sea() a 'kvm_' prefix, and moves the
    declarations to kvm_ras.h as preparation for a future patch that moves
    the ACPI-specific RAS code out of mm/fault.c.

    Signed-off-by: James Morse
    Reviewed-by: Punit Agrawal
    Acked-by: Marc Zyngier
    Tested-by: Tyler Baicar
    Acked-by: Catalin Marinas
    Signed-off-by: Rafael J. Wysocki

    James Morse
     
  • kvm_ioctl_create_device() does the following:

    1. creates a device that holds a reference to the VM object (with a borrowed
    reference, the VM's refcount has not been bumped yet)
    2. initializes the device
    3. transfers the reference to the device to the caller's file descriptor table
    4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
    reference

    The ownership transfer in step 3 must not happen before the reference to the VM
    becomes a proper, non-borrowed reference, which only happens in step 4.
    After step 3, an attacker can close the file descriptor and drop the borrowed
    reference, which can cause the refcount of the kvm object to drop to zero.

    This means that we need to grab a reference for the device before
    anon_inode_getfd(), otherwise the VM can disappear from under us.
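
    A sketch of the corrected ordering (error-path details abridged):

        /* take the VM reference *before* publishing the fd ... */
        kvm_get_kvm(kvm);
        ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev,
                               O_RDWR | O_CLOEXEC);
        if (ret < 0) {
                kvm_put_kvm(kvm);       /* ... and drop it on failure */
                ops->destroy(dev);
                return ret;
        }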

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc: stable@kernel.org
    Signed-off-by: Jann Horn
    Signed-off-by: Paolo Bonzini

    Jann Horn
     

07 Feb, 2019

3 commits

  • We restrict mapping PUD huge pages in stage2 to only when the stage2
    has a 4-level page table, leaving the feature unused with the
    default IPA size. But we could use it even with a 3-level page
    table, i.e., when the PUD level is folded into the PGD, just like at
    stage1. Relax the condition to allow using PUD huge page mappings at
    stage2 whenever possible.

    Cc: Christoffer Dall
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     
  • We currently initialize the group of private IRQs during
    kvm_vgic_vcpu_init, and the value of the group depends on the GIC
    model we are emulating. However, VCPUs created before creating (and
    initializing) the VGIC might end up with the wrong group if the VGIC
    is created as a GICv3 later.

    Since we have no enforced ordering between creating the VGIC and
    creating VCPUs, we can end up with some of the VCPUs being properly
    initialized and the remaining ones incorrectly initialized. That
    also means that we have no single place to do the per-cpu data
    structure initialization which depends on knowing the emulated GIC
    model (which is only the group field).

    This patch removes the incorrect comment from kvm_vgic_vcpu_init and
    initializes the group of all previously created VCPUs's private
    interrupts in vgic_init in addition to the existing initialization in
    kvm_vgic_vcpu_init.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The current kvm_psci_vcpu_on implementation will directly try to
    manipulate the state of the VCPU to reset it. However, since this is
    not done on the thread that runs the VCPU, we can end up in a strangely
    corrupted state when the source and target VCPUs are running at the same
    time.

    Fix this by factoring out all reset logic from the PSCI implementation
    and forwarding the required information along with a request to the
    target VCPU.
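
    In sketch form, the PSCI side now only records the request and kicks
    the target (the request name and field names are assumptions):

        /* record reset parameters for the target, then ask it to reset
         * itself on its own thread instead of poking its state remotely */
        target->arch.reset_state.pc = entry_point;
        target->arch.reset_state.r0 = context_id;
        target->arch.reset_state.reset = true;

        kvm_make_request(KVM_REQ_VCPU_RESET, target);
        kvm_vcpu_kick(target);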

    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

26 Jan, 2019

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().
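
    The shape of the conversion (the lock is illustrative):

        /* before: passes if *any* CPU holds the lock */
        BUG_ON(!spin_is_locked(&kvm->mmu_lock));

        /* after: asserts the *current* thread holds it, and compiles
         * away when lockdep is disabled */
        lockdep_assert_held(&kvm->mmu_lock);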

    Signed-off-by: Paul E. McKenney
    Cc: Paolo Bonzini
    Cc: "Radim Krčmář"
    Cc:
    Acked-by: Paolo Bonzini

    Paul E. McKenney