16 Mar, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - some cleanups
    - direct physical timer assignment
    - cache sanitization for 32-bit guests

    s390:
    - interrupt cleanup
    - introduction of the Guest Information Block
    - preparation for processor subfunctions in cpu models

    PPC:
    - bug fixes and improvements, especially related to machine checks
    and protection keys

    x86:
    - many, many cleanups, including removing a bunch of MMU code for
    unnecessary optimizations
    - AVIC fixes

    Generic:
    - memcg accounting"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
    kvm: vmx: fix formatting of a comment
    KVM: doc: Document the life cycle of a VM and its resources
    MAINTAINERS: Add KVM selftests to existing KVM entry
    Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
    KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
    KVM: PPC: Fix compilation when KVM is not enabled
    KVM: Minor cleanups for kvm_main.c
    KVM: s390: add debug logging for cpu model subfunctions
    KVM: s390: implement subfunction processor calls
    arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
    KVM: arm/arm64: Remove unused timer variable
    KVM: PPC: Book3S: Improve KVM reference counting
    KVM: PPC: Book3S HV: Fix build failure without IOMMU support
    Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
    x86: kvmguest: use TSC clocksource if invariant TSC is exposed
    KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
    KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
    KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
    KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
    KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
    ...

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Pull ACPI updates from Rafael Wysocki:
    "These are ACPICA updates including ACPI 6.3 support among other
    things, APEI updates including the ARM Software Delegated Exception
    Interface (SDEI) support, ACPI EC driver fixes and cleanups and other
    assorted improvements.

    Specifics:

    - Update the ACPICA code in the kernel to upstream revision 20190215
    including ACPI 6.3 support and more:
    * New predefined methods: _NBS, _NCH, _NIC, _NIH, and _NIG (Erik
    Schmauss).
    * Update of the PCC Identifier structure in PDTT (Erik Schmauss).
    * Support for new Generic Affinity Structure subtable in SRAT
    (Erik Schmauss).
    * New PCC operation region support (Erik Schmauss).
    * Support for GICC statistical profiling for MADT (Erik Schmauss).
    * New Error Disconnect Recover notification support (Erik
    Schmauss).
    * New PPTT Processor Structure Flags fields support (Erik
    Schmauss).
    * ACPI 6.3 HMAT updates (Erik Schmauss).
    * GTDT Revision 3 support (Erik Schmauss).
    * Legacy module-level code (MLC) support removal (Erik Schmauss).
    * Update/clarification of messages for control method failures
    (Bob Moore).
    * Warning on creation of a zero-length opregion (Bob Moore).
    * acpiexec option to dump extra info for memory leaks (Bob Moore).
    * More ACPI error to firmware error conversions (Bob Moore).
    * Debugger fix (Bob Moore).
    * Copyrights update (Bob Moore)

    - Clean up sleep states support code in ACPICA (Christoph Hellwig)

    - Rework in_nmi() handling in the APEI code and add support for the
    ARM Software Delegated Exception Interface (SDEI) to it (James
    Morse)

    - Fix possible out-of-bounds accesses in BERT-related code (Ross
    Lagerwall)

    - Fix the APEI code parsing HEST that includes a Deferred Machine
    Check subtable (Yazen Ghannam)

    - Use DEFINE_DEBUGFS_ATTRIBUTE for APEI-related debugfs files
    (YueHaibing)

    - Switch the APEI ERST code to the new generic UUID API (Andy
    Shevchenko)

    - Update the MAINTAINERS entry for APEI (Borislav Petkov)

    - Fix and clean up the ACPI EC driver (Rafael Wysocki, Zhang Rui)

    - Fix DMI checks handling in the ACPI backlight driver and add the
    "Lunch Box" chassis-type check to it (Hans de Goede)

    - Add support for using ACPI table overrides included in built-in
    initrd images (Shunyong Yang)

    - Update ACPI device enumeration to treat the PWM2 device as "always
    present" on Lenovo Yoga Book (Yauhen Kharuzhy)

    - Fix up the enumeration of device objects with the PRP0001 device ID
    (Andy Shevchenko)

    - Clean up PPTT parsing error messages (John Garry)

    - Clean up debugfs files creation handling (Greg Kroah-Hartman,
    Rafael Wysocki)

    - Clean up the ACPI DPTF Makefile (Masahiro Yamada)"

    * tag 'acpi-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    ACPI / bus: Respect PRP0001 when retrieving device match data
    ACPICA: Update version to 20190215
    ACPI/ACPICA: Trivial: fix spelling mistakes and fix whitespace formatting
    ACPICA: ACPI 6.3: add GTDT Revision 3 support
    ACPICA: ACPI 6.3: HMAT updates
    ACPICA: ACPI 6.3: PPTT add additional fields in Processor Structure Flags
    ACPICA: ACPI 6.3: add Error Disconnect Recover Notification value
    ACPICA: ACPI 6.3: MADT: add support for statistical profiling in GICC
    ACPICA: ACPI 6.3: add PCC operation region support for AML interpreter
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    ACPICA: ACPI 6.3: SRAT: add Generic Affinity Structure subtable
    ACPICA: ACPI 6.3: Add Trigger order to PCC Identifier structure in PDTT
    ACPICA: ACPI 6.3: Adding predefined methods _NBS, _NCH, _NIC, _NIH, and _NIG
    ACPICA: Update/clarify messages for control method failures
    ACPICA: Debugger: Fix possible fault with the "test objects" command
    ACPICA: Interpreter: Emit warning for creation of a zero-length op region
    ACPICA: Remove legacy module-level code support
    ACPI / x86: Make PWM2 device always present at Lenovo Yoga Book
    ACPI / video: Extend chassis-type detection with a "Lunch Box" check
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main RCU related changes in this cycle were:

    - Additional cleanups after RCU flavor consolidation

    - Grace-period forward-progress cleanups and improvements

    - Documentation updates

    - Miscellaneous fixes

    - spin_is_locked() conversions to lockdep

    - SPDX changes to RCU source and header files

    - SRCU updates

    - Torture-test updates, including nolibc updates and moving nolibc to
    tools/include"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    locking/locktorture: Convert to SPDX license identifier
    linux/torture: Convert to SPDX license identifier
    torture: Convert to SPDX license identifier
    linux/srcu: Convert to SPDX license identifier
    linux/rcutree: Convert to SPDX license identifier
    linux/rcutiny: Convert to SPDX license identifier
    linux/rcu_sync: Convert to SPDX license identifier
    linux/rcu_segcblist: Convert to SPDX license identifier
    linux/rcupdate: Convert to SPDX license identifier
    linux/rcu_node_tree: Convert to SPDX license identifier
    rcu/update: Convert to SPDX license identifier
    rcu/tree: Convert to SPDX license identifier
    rcu/tiny: Convert to SPDX license identifier
    rcu/sync: Convert to SPDX license identifier
    rcu/srcu: Convert to SPDX license identifier
    rcu/rcutorture: Convert to SPDX license identifier
    rcu/rcu_segcblist: Convert to SPDX license identifier
    rcu/rcuperf: Convert to SPDX license identifier
    rcu/rcu.h: Convert to SPDX license identifier
    RCU/torture.txt: Remove section MODULE PARAMETERS
    ...

    Linus Torvalds
     

04 Mar, 2019

1 commit

  • * acpi-apei: (29 commits)
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    MAINTAINERS: Add James Morse to the list of APEI reviewers
    ACPI / APEI: Add support for the SDEI GHES Notification type
    firmware: arm_sdei: Add ACPI GHES registration helper
    ACPI / APEI: Use separate fixmap pages for arm64 NMI-like notifications
    ACPI / APEI: Only use queued estatus entry during in_nmi_queue_one_entry()
    ACPI / APEI: Split ghes_read_estatus() to allow a peek at the CPER length
    ACPI / APEI: Make GHES estatus header validation more user friendly
    ACPI / APEI: Pass ghes and estatus separately to avoid a later copy
    ACPI / APEI: Let the notification helper specify the fixmap slot
    ACPI / APEI: Move locking to the notification helper
    arm64: KVM/mm: Move SEA handling behind a single 'claim' interface
    KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing
    ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue
    ACPI / APEI: Move NOTIFY_SEA between the estatus-queue and NOTIFY_NMI
    ACPI / APEI: Don't allow ghes_ack_error() to mask earlier errors
    ACPI / APEI: Generalise the estatus queue's notify code
    ACPI / APEI: Don't update struct ghes' flags in read/clear estatus
    ACPI / APEI: Remove spurious GHES_TO_CLEAR check
    ...

    Rafael J. Wysocki
     

01 Mar, 2019

1 commit

  • debugfs can now report an error code if something went wrong instead
    of just NULL. So if the return value is to be used as a "real"
    dentry, it needs to be checked for an error before being
    dereferenced.

    This is now happening because of ff9fb72bc077 ("debugfs: return error
    values, not NULL"). syzbot has found a way to trigger the creation of
    duplicate debugfs files; the creation fails, and the resulting error
    code then gets passed to dentry_path_raw(), which obviously does not
    like it.

    Reported-by: Eric Biggers
    Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
    Cc: "Radim Krčmář"
    Cc: kvm@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

23 Feb, 2019

2 commits


22 Feb, 2019

1 commit

  • The 'timer' local variable became unused after commit bee038a67487
    ("KVM: arm/arm64: Rework the timer code to use a timer_map").
    Remove it to avoid a [-Wunused-but-set-variable] warning.

    Cc: Christoffer Dall
    Cc: James Morse
    Cc: Suzuki K Pouloze
    Reviewed-by: Julien Thierry
    Signed-off-by: Shaokun Zhang
    Signed-off-by: Marc Zyngier

    Shaokun Zhang
     

21 Feb, 2019

10 commits

  • The value of "dirty_bitmap[i]" is already checked before it is
    assigned to "mask", so the subsequent check of "mask" is redundant.
    That check was introduced by commit 58d2930f4ee3 ("KVM: Eliminate
    extra function calls in kvm_get_dirty_log_protect()"); revert it.
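
    For context, a minimal sketch of the resulting loop shape (variable
    and helper names assumed from the 2019-era
    kvm_get_dirty_log_protect(), not quoted from the patch):

        for (i = 0; i < n / sizeof(long); i++) {
                if (!dirty_bitmap[i])
                        continue;

                flush = true;
                mask = xchg(&dirty_bitmap[i], 0);
                dirty_bitmap_buffer[i] = mask;

                /* "mask" needs no second test before this call */
                kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
                                                i * BITS_PER_LONG, mask);
        }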

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • grow_halt_poll_ns() has a strange behaviour in case
    (vcpu->halt_poll_ns != 0) &&
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start).

    In this case, vcpu->halt_poll_ns is multiplied by the grow factor
    (halt_poll_ns_grow), which requires several grow iterations in order
    to reach a value bigger than halt_poll_ns_grow_start.
    This means that growing vcpu->halt_poll_ns from a value of 0 is
    faster than growing it from a positive value less than
    halt_poll_ns_grow_start, which is misleading and inaccurate.

    Fix the issue by changing grow_halt_poll_ns() to set
    vcpu->halt_poll_ns to halt_poll_ns_grow_start in any case where
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start), regardless of
    whether vcpu->halt_poll_ns is 0.

    Use READ_ONCE() to get a consistent value in all cases.
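
    Taken together with the two entries below, the resulting logic is
    roughly the following (a sketch, not the verbatim upstream code):

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                unsigned int val, grow, grow_start;

                grow_start = READ_ONCE(halt_poll_ns_grow_start);
                grow = READ_ONCE(halt_poll_ns_grow);
                if (!grow)              /* a grow factor of 0 must not shrink */
                        return;

                val = vcpu->halt_poll_ns * grow;
                if (val < grow_start)   /* covers halt_poll_ns == 0 too */
                        val = grow_start;

                vcpu->halt_poll_ns = val;
        }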

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • The hard-coded value 10000 in grow_halt_poll_ns() is the initial
    start value used when raising vcpu->halt_poll_ns; it effectively
    sets the timeout of the first polling session.
    This value has a significant effect on how tolerant we are to
    outliers. In the common case a higher value is better: we spend more
    time in the polling busy loop, handle events/interrupts faster, and
    get better performance. But on outliers it puts us in a busy loop
    that does nothing; even if the shrink factor is zero, we still waste
    time on the first iteration.
    The optimal value differs between workloads, depending on the
    outlier rate and the length of polling sessions.
    As this value has a significant effect on the dynamic halt-polling
    algorithm, it should be configurable and exposed.
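
    A sketch of the exposure (the variable name follows this entry; the
    permission bits are an assumption):

        /* the former hard-coded 10000 ns start value, now tunable */
        unsigned int halt_poll_ns_grow_start = 10000;
        module_param(halt_poll_ns_grow_start, uint, 0644);
        EXPORT_SYMBOL_GPL(halt_poll_ns_grow_start);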

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • grow_halt_poll_ns() has a strange behavior in case
    (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

    In this case, vcpu->halt_poll_ns is set to zero, which results in
    shrinking instead of growing.

    Fix the issue by changing grow_halt_poll_ns() to not modify
    vcpu->halt_poll_ns when halt_poll_ns_grow is zero.

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Suggested-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • Move the "update in-progress" flag to bit 63, now that KVM won't
    explode by moving it out of bit 0. Using bit 63 eliminates the need
    to jump over bit 0, e.g. when calculating a new memslots generation
    or when propagating the memslots generation to an MMIO spte.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • x86 captures a subset of the memslot generation (19 bits) in its MMIO
    sptes so that it can expedite emulated MMIO handling by checking only
    the relevant spte, i.e. without doing a full page fault walk.

    Because the MMIO sptes capture only 19 bits (due to limited space in
    the sptes), there is a non-zero probability that the MMIO generation
    could wrap, e.g. after 500k memslot updates. Since normal usage is
    extremely unlikely to result in 500k memslot updates, a hack was added
    by commit 69c9ea93eaea ("KVM: MMU: init kvm generation close to mmio
    wrap-around value") to offset the MMIO generation in order to trigger
    a wraparound, e.g. after 150 memslot updates.

    When separate memslot generation sequences were assigned to each
    address space, commit 00f034a12fdd ("KVM: do not bias the generation
    number in kvm_current_mmio_generation") moved the offset logic into the
    initialization of the memslot generation itself so that the per-address
    space bit(s) were not dropped/corrupted by the MMIO shenanigans.

    Remove the offset hack for three reasons:

    - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
    wrapping the generation doesn't actually test the interesting case
    of having stale MMIO sptes with the new generation number, e.g. old
    sptes with a generation number of 0.

    - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
    performance rather important since the probability of invalidating
    MMIO sptes jumps from "effectively never" to "fairly likely". This
    limits what can be done in future patches, e.g. to simplify the
    invalidation code, as doing so without proper caution could lead to
    a noticeable performance regression.

    - Forcing the memslots generation, which is a 64-bit number, to wrap
    prevents KVM from assuming the memslots generation will never wrap.
    This in turn prevents KVM from using an arbitrary bit for the
    "update in-progress" flag, e.g. using bit 63 would immediately
    collide with using a large value as the starting generation number.
    The "update in-progress" flag is effectively forced into bit 0 so
    that it's (subtly) taken into account when incrementing the
    generation.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • KVM uses bit 0 of the memslots generation as an "update in-progress"
    flag, which is used by x86 to prevent caching MMIO access while the
    memslots are changing. Although the intended behavior is flag-like,
    e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
    caching data from in-flux memslots, the implementation oftentimes treats
    the bit as part of the generation number itself, e.g. the generation
    is incremented twice per update, once to set the flag and once to
    clear it.

    Prior to commit 4bd518f1598d ("KVM: use separate generations for
    each address space"), incorporating the "update in-progress" bit into
    the generation number largely made sense, e.g. "real" generations are
    even, "bogus" generations are odd, most code doesn't need to be aware of
    the bit, etc...

    Now that unique memslots generation numbers are assigned to each address
    space, stealthing the in-progress status into the generation number
    results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
    over bit 0 when initializing the memslots generation without any hint as
    to why.

    Explicitly define the flag and convert as much code as possible (which
    isn't much) to actually treat it like a flag. This paves the way for
    eventually using a different bit for "update in-progress" so that it
    can be a flag in truth instead of an awkward extension to the
    generation number.
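
    A sketch of the flag-like treatment this describes (the macro name is
    an assumption; details simplified):

        #define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS      BIT_ULL(0)

        /* set while the memslots update is in flight ... */
        slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;

        /* ... then dropped, with the generation bumped, once it's done */
        slots->generation = (gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS) + 1;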

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • kvm_arch_memslots_updated() is at this point in time an x86-specific
    hook for handling MMIO generation wraparound. x86 stashes 19 bits of
    the memslots generation number in its MMIO sptes in order to avoid
    full page fault walks for repeat faults on emulated MMIO addresses.
    Because only 19 bits are used, wrapping the MMIO generation number is
    possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
    the generation has changed so that it can invalidate all MMIO sptes in
    case the effective MMIO generation has wrapped so as to avoid using a
    stale spte, e.g. a (very) old spte that was created with generation==0.

    Given that the purpose of kvm_arch_memslots_updated() is to prevent
    consuming stale entries, it needs to be called before the new generation
    is propagated to memslots. Invalidating the MMIO sptes after updating
    memslots means that there is a window where a vCPU could dereference
    the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
    spte that was created with (pre-wrap) generation==0.
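
    In sketch form, the fix reorders the update path so the arch hook
    runs before the new generation is published (helper placement
    simplified):

        /*
         * Zap stale MMIO sptes *before* vCPUs can observe the new
         * generation, so nothing can match against a stale entry.
         */
        kvm_arch_memslots_updated(kvm, gen);
        slots->generation = gen;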

    Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • There are many KVM kernel memory allocations which are tied to the life of
    the VM process and should be charged to the VM process's cgroup. If the
    allocations aren't tied to the process, the OOM killer will not know
    that killing the process will free the associated kernel memory.
    Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
    charged to the VM process's cgroup.
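
    The change itself is mechanical; a representative (illustrative, not
    quoted) call site looks like this:

        /* GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT, charging the
         * allocation to the current process's memory cgroup */
        vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);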

    Tested:
    Ran all kvm-unit-tests on a 64-bit Haswell machine; the patch
    introduced no new failures.
    Ran a kernel memory accounting test which creates a VM to touch
    memory and then checks that the kernel memory allocated for the
    process is within certain bounds.
    With this patch we account for much more of the vmalloc and slab memory
    allocated for the VM.

    There remain a few allocations which should be charged to the VM's
    cgroup but are not. They include:

        vcpu->run
        kvm->coalesced_mmio_ring

    These allocations are unaccounted in this patch because they are
    mapped to userspace, and accounting them to a cgroup causes problems.
    This should be addressed in a future patch.

    Signed-off-by: Ben Gardon
    Reviewed-by: Shakeel Butt
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

        struct foo {
                int stuff;
                void *entry[];
        };

        instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

        instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
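
    Besides being clearer, struct_size() checks the arithmetic for
    overflow and saturates to SIZE_MAX, so a huge "count" makes the
    allocation fail cleanly instead of wrapping into a too-small size.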

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Paolo Bonzini

    Gustavo A. R. Silva
     

20 Feb, 2019

14 commits

  • The 'gpa_end' local variable is never used, so let's remove it.

    Cc: Christoffer Dall
    Signed-off-by: Shaokun Zhang
    Signed-off-by: Marc Zyngier

    Shaokun Zhang
     
  • There is a spelling mistake in a kvm_err error message. Fix it.

    Signed-off-by: Colin Ian King
    Signed-off-by: Marc Zyngier

    Colin Ian King
     
  • As the comment block in include/trace/define_trace.h says,
    TRACE_INCLUDE_PATH should be a path relative to define_trace.h.

    ../../virt/kvm/arm is the correct relative path.

    ../../../virt/kvm/arm is working by coincidence because the top
    Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header
    search path, but we should not rely on it.
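
    The resulting definition, assuming the arm KVM trace header is where
    this lives:

        /* resolved relative to include/trace/define_trace.h */
        #undef TRACE_INCLUDE_PATH
        #define TRACE_INCLUDE_PATH ../../virt/kvm/arm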

    Acked-by: Christoffer Dall
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Marc Zyngier

    Masahiro Yamada
     
  • When a guest gets scheduled, KVM performs a "load" operation,
    which for the timer includes evaluating the virtual "active" state
    of the interrupt, and replicating it on the physical side. This
    ensures that the deactivation in the guest will also take place
    in the physical GIC distributor.

    If the interrupt is not yet active, we flag it as inactive on the
    physical side. This means that on restoring the timer registers,
    if the timer has expired, we'll immediately take an interrupt.
    That's absolutely fine, as the interrupt will then be flagged as
    active on the physical side. What this assumes though is that we'll
    enter the guest right after having taken the interrupt, and that
    the guest will quickly ACK the interrupt, making it active on the
    virtual side.

    It turns out that quite often, this assumption doesn't really hold.
    The guest may be preempted on the back of this interrupt, either
    from kernel space or whilst running at EL1 when a host interrupt
    fires. When this happens, we repeat the whole sequence on the
    next load (interrupt marked as inactive, timer registers restored,
    interrupt fires). And if it takes a really long time for a guest
    to activate the interrupt (as it does with nested virt), we end up
    with many such events in quick succession, leading to the guest only
    making very slow progress.

    This can also be seen in the number of virtual timer interrupts on
    the host being far greater than the corresponding number in the
    guest.

    An easy way to fix this is to evaluate the timer state when performing
    the "load" operation, just like we do when the interrupt actually fires.
    If the timer has a pending virtual interrupt at this stage, then we
    can safely flag the physical interrupt as being active, which prevents
    spurious exits.
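
    A condensed sketch of the idea (helper names assumed from the
    surrounding timer code; details simplified):

        /* On vcpu load: if the timer would already fire, mark the line
         * active on the physical distributor, exactly as if the interrupt
         * had been taken and ACKed, avoiding the spurious-exit loop. */
        bool phys_active = kvm_timer_should_fire(ctx) ||
                           kvm_vgic_map_is_active(vcpu, ctx->irq.irq);

        set_timer_irq_phys_active(ctx, phys_active);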

    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Move this little function to the header files for arm/arm64 so other
    code can make use of it directly.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We are currently emulating two timers in two different ways. When we
    add support for nested virtualization in the future, we are going to
    be emulating either two timers in two different ways, or four timers
    in a single way.

    We need a unified data structure to keep track of how we map virtual
    state to physical state, and we need to clean up some of the timer
    code to operate more independently on a struct arch_timer_context
    instead of trying to consider the global state of the VCPU and
    recomputing all state.
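
    A plausible shape for such a mapping structure (the field names are
    assumptions based on the description above):

        struct timer_map {
                struct arch_timer_context *direct_vtimer; /* hw-backed */
                struct arch_timer_context *direct_ptimer; /* hw-backed */
                struct arch_timer_context *emul_ptimer;   /* emulated */
        };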

    Co-written with Marc Zyngier

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • VHE systems don't have to emulate the physical timer; we can simply
    assign the EL1 physical timer directly to the VM, as the host always
    uses the EL2 timers.

    In order to minimize the amount of cruft, AArch32 gets definitions
    for the physical timer too, but it should generally be unused on this
    architecture.

    Co-written with Marc Zyngier

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Prepare for having 4 timer data structures (2 for now).

    Move the 'loaded' field to the cpu data structure rather than the
    individual timer structure, in preparation for assigning the EL1
    phys timer as well.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • At the moment we have separate system register emulation handlers for
    each timer register. They are actually quite similar, and we rely on
    kvm_arm_timer_[gs]et_reg() for the actual emulation anyway, so let's
    just merge all of those handlers into one function, which marshals
    the arguments and then hands off to a set of common accessors.
    This makes extending the emulation to include EL2 timers much easier.

    Signed-off-by: Andre Przywara
    [Fixed 32-bit VM breakage and reduced to reworking existing code]
    Signed-off-by: Christoffer Dall
    [Fixed 32bit host, general cleanup]
    Signed-off-by: Marc Zyngier

    Andre Przywara
     
  • We previously incorrectly named the define for this system register.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • Instead of calling into kvm_timer_[un]schedule from the main kvm
    blocking path, test if the VCPU is on the wait queue from the load/put
    path and perform the background timer setup/cancel in this path.

    This has the distinct advantage that we no longer race between
    load/put and schedule/unschedule, and that programming and canceling
    of the bg_timer always happens when the timer state is not loaded.

    Note that we must now remove the checks in kvm_timer_blocking that do
    not schedule a background timer if one of the timers can fire, because
    we no longer have a guarantee that kvm_vcpu_check_block() will be called
    before kvm_timer_blocking.

    Reported-by: Andre Przywara
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • In preparation for nested virtualization, where we are going to have
    more than a single VMID per VM, let's factor out the VMID data into
    a separate VMID data structure and change the VMID allocator to
    operate on this new structure instead of using a struct kvm.

    This also means that update_vttbr now becomes update_vmid, and that
    the vttbr itself is generated on the fly based on the stage 2 page
    table base address and the vmid.

    We cache the physical address of the pgd when allocating the pgd to
    avoid doing the calculation on every entry to the guest and to avoid
    calling into potentially non-hyp-mapped code from hyp/EL2.

    If we wanted to merge the VMID allocator with the arm64 ASID allocator
    at some point in the future, it should actually become easier to do that
    after this patch.

    Note that to avoid mapping the kvm_vmid_bits variable into hyp, we
    simply forego the masking of the vmid value in kvm_get_vttbr and rely on
    update_vmid to always assign a valid vmid value (within the supported
    range).
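
    A sketch of the on-the-fly VTTBR construction this describes (field
    and macro names are assumptions):

        struct kvm_vmid {
                u64 vmid_gen;   /* generation of the VMID allocator */
                u32 vmid;
        };

        static inline u64 kvm_get_vttbr(struct kvm *kvm)
        {
                struct kvm_vmid *vmid = &kvm->arch.vmid;

                /* no masking with kvm_vmid_bits here; update_vmid()
                 * guarantees the stored value is already in range */
                return kvm->arch.pgd_phys |
                       ((u64)vmid->vmid << VTTBR_VMID_SHIFT);
        }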

    Reviewed-by: Marc Zyngier
    [maz: minor cleanups]
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently eagerly save/restore MPIDR. It turns out to be
    slightly pointless:
    - On the host, this value is known as soon as we're scheduled on a
    physical CPU
    - In the guest, this value cannot change, as it is set by KVM
    (and this is a read-only register)

    The result of the above is that we can perfectly avoid the eager
    saving of MPIDR_EL1 and only keep the restore. We just have to set
    up the host contexts appropriately at boot time.

    Signed-off-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • Until now, we haven't differentiated between HYP calls that have a
    return value and those that don't. As we're about to change this,
    introduce kvm_call_hyp_ret(), and convert all call sites that
    actually make use of a return value.
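
    The conversion at call sites is mechanical; for example (the callee
    is illustrative):

        /* before: nothing marks this call as producing a value */
        val = kvm_call_hyp(__vgic_v3_get_ich_vtr_el2);

        /* after: the intent is explicit at the call site */
        val = kvm_call_hyp_ret(__vgic_v3_get_ich_vtr_el2);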

    Signed-off-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

14 Feb, 2019

1 commit


13 Feb, 2019

1 commit

  • …/linux-rcu into core/rcu

    Pull the latest RCU tree from Paul E. McKenney:

    - Additional cleanups after RCU flavor consolidation
    - Grace-period forward-progress cleanups and improvements
    - Documentation updates
    - Miscellaneous fixes
    - spin_is_locked() conversions to lockdep
    - SPDX changes to RCU source and header files
    - SRCU updates
    - Torture-test updates, including nolibc updates and moving
    nolibc to tools/include

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

08 Feb, 2019

2 commits

  • To split up APEI's in_nmi() path, the caller needs to always be
    in_nmi(). KVM shouldn't have to know about this; pull the RAS
    plumbing out into a header file.

    Currently guest synchronous external aborts are claimed as RAS
    notifications by handle_guest_sea(), which is hidden in the arch
    code's mm/fault.c. 32bit gets a dummy declaration in system_misc.h.

    There is going to be more of this in the future if/when the kernel
    supports the SError-based firmware-first notification mechanism and/or
    kernel-first notifications for both synchronous external abort and
    SError. Each of these will come with some Kconfig symbols and a
    handful of header files.

    Create a header file for all this.

    This patch gives handle_guest_sea() a 'kvm_' prefix, and moves the
    declarations to kvm_ras.h as preparation for a future patch that moves
    the ACPI-specific RAS code out of mm/fault.c.

    Signed-off-by: James Morse
    Reviewed-by: Punit Agrawal
    Acked-by: Marc Zyngier
    Tested-by: Tyler Baicar
    Acked-by: Catalin Marinas
    Signed-off-by: Rafael J. Wysocki

    James Morse
     
  • kvm_ioctl_create_device() does the following:

    1. creates a device that holds a reference to the VM object (with a borrowed
    reference, the VM's refcount has not been bumped yet)
    2. initializes the device
    3. transfers the reference to the device to the caller's file descriptor table
    4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
    reference

    The ownership transfer in step 3 must not happen before the reference to the VM
    becomes a proper, non-borrowed reference, which only happens in step 4.
    After step 3, an attacker can close the file descriptor and drop the borrowed
    reference, which can cause the refcount of the kvm object to drop to zero.

    This means that we need to grab a reference for the device before
    anon_inode_getfd(), otherwise the VM can disappear from under us.
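
    A sketch of the corrected ordering (error-path details abridged):

        /* take the VM reference *before* publishing the fd ... */
        kvm_get_kvm(kvm);
        ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev,
                               O_RDWR | O_CLOEXEC);
        if (ret < 0) {
                kvm_put_kvm(kvm);       /* ... and drop it on failure */
                ops->destroy(dev);
                return ret;
        }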

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc: stable@kernel.org
    Signed-off-by: Jann Horn
    Signed-off-by: Paolo Bonzini

    Jann Horn
     

07 Feb, 2019

3 commits

  • We restrict mapping PUD huge pages in stage2 to only when the stage2
    has a 4-level page table, leaving the feature unused with the
    default IPA size. But we could use it even with a 3-level page
    table, i.e., when the PUD level is folded into the PGD, just like at
    stage1. Relax the condition to allow using PUD huge page mappings at
    stage2 whenever possible.

    Cc: Christoffer Dall
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     
  • We currently initialize the group of private IRQs during
    kvm_vgic_vcpu_init, and the value of the group depends on the GIC
    model we are emulating. However, VCPUs created before creating (and
    initializing) the VGIC might end up with the wrong group if the VGIC
    is created as a GICv3 later.

    Since we have no enforced ordering between creating the VGIC and
    creating VCPUs, we can end up with some of the VCPUs being properly
    initialized and the remaining ones incorrectly initialized. That
    also means that we have no single place to do the per-cpu data
    structure initialization which depends on knowing the emulated GIC
    model (which is only the group field).

    This patch removes the incorrect comment from kvm_vgic_vcpu_init and
    initializes the group of all previously created VCPUs's private
    interrupts in vgic_init in addition to the existing initialization in
    kvm_vgic_vcpu_init.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The current kvm_psci_vcpu_on implementation will directly try to
    manipulate the state of the VCPU to reset it. However, since this is
    not done on the thread that runs the VCPU, we can end up in a strangely
    corrupted state when the source and target VCPUs are running at the same
    time.

    Fix this by factoring out all reset logic from the PSCI implementation
    and forwarding the required information along with a request to the
    target VCPU.
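
    In sketch form, the PSCI side now only records the request and kicks
    the target (the request name and field names are assumptions):

        /* record reset parameters for the target, then ask it to reset
         * itself on its own thread instead of poking its state remotely */
        target->arch.reset_state.pc = entry_point;
        target->arch.reset_state.r0 = context_id;
        target->arch.reset_state.reset = true;

        kvm_make_request(KVM_REQ_VCPU_RESET, target);
        kvm_vcpu_kick(target);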

    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

26 Jan, 2019

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().
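
    The shape of the conversion (the lock is illustrative):

        /* before: passes if *any* CPU holds the lock */
        BUG_ON(!spin_is_locked(&kvm->mmu_lock));

        /* after: asserts the *current* thread holds it, and compiles
         * away when lockdep is disabled */
        lockdep_assert_held(&kvm->mmu_lock);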

    Signed-off-by: Paul E. McKenney
    Cc: Paolo Bonzini
    Cc: "Radim Krčmář"
    Cc:
    Acked-by: Paolo Bonzini

    Paul E. McKenney