12 Jan, 2019

1 commit

  • The function at issue does not fully validate the content of the
    structure pointed to by the log parameter, even though that content has
    just been copied from userspace and therefore needs validation. Fix that.

    Moreover, change the type of n to unsigned long as that is the type
    returned by kvm_dirty_bitmap_bytes().
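
    As a hedged illustration of the kind of checking involved, here is a
    minimal, self-contained C sketch; the structure, field and helper names
    are stand-ins, not the kernel's exact ones:

    #include <errno.h>
    #include <stdint.h>

    /* Illustrative stand-in for the request copied in from userspace. */
    struct clear_dirty_log_req {
            uint32_t slot;
            uint64_t first_page;
            uint64_t num_pages;
    };

    /* one dirty bit per page, rounded up to whole bytes */
    static unsigned long dirty_bitmap_bytes(unsigned long npages)
    {
            return (npages + 7) / 8;
    }

    static int validate_clear_req(const struct clear_dirty_log_req *log,
                                  unsigned long slot_npages,
                                  unsigned long *bitmap_bytes)
    {
            /* the requested range must lie inside the memslot, no wraparound */
            if (log->first_page > slot_npages ||
                log->num_pages > slot_npages - log->first_page)
                    return -EINVAL;

            /* unsigned long, matching what the bitmap-size helper returns */
            *bitmap_bytes = dirty_bitmap_bytes(slot_npages);
            return 0;
    }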

    Signed-off-by: Tomas Bortoli
    Reported-by: syzbot+028366e52c9ace67deb3@syzkaller.appspotmail.com
    [Squashed the fix from Paolo. - Radim.]
    Signed-off-by: Radim Krčmář

    Tomas Bortoli
     

06 Jan, 2019

1 commit

  • Merge more updates from Andrew Morton:

    - procfs updates

    - various misc bits

    - lib/ updates

    - epoll updates

    - autofs

    - fatfs

    - a few more MM bits

    * emailed patches from Andrew Morton : (58 commits)
    mm/page_io.c: fix polled swap page in
    checkpatch: add Co-developed-by to signature tags
    docs: fix Co-Developed-by docs
    drivers/base/platform.c: kmemleak ignore a known leak
    fs: don't open code lru_to_page()
    fs/: remove caller signal_pending branch predictions
    mm/: remove caller signal_pending branch predictions
    arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
    kernel/sched/: remove caller signal_pending branch predictions
    kernel/locking/mutex.c: remove caller signal_pending branch predictions
    mm: select HAVE_MOVE_PMD on x86 for faster mremap
    mm: speed up mremap by 20x on large regions
    mm: treewide: remove unused address argument from pte_alloc functions
    initramfs: cleanup incomplete rootfs
    scripts/gdb: fix lx-version string output
    kernel/kcov.c: mark write_comp_data() as notrace
    kernel/sysctl: add panic_print into sysctl
    panic: add options to print system info when panic happens
    bfs: extra sanity checking and static inode bitmap
    exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
    ...

    Linus Torvalds
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is a concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle and architecture-related in the future that could make the scheme
    not work. We also find that there is no point in passing the 'address' to
    pte_alloc since it is unused. This patch therefore removes the argument
    tree-wide, resulting in a nice negative diff, while also ensuring along
    the way that the enabled architectures do not do anything funky with the
    'address' argument that would go unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    The following fix-ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )
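
    For illustration, the net effect of the script on a typical prototype is
    simply to drop the trailing address parameter (simplified; exact
    signatures vary per architecture):

    /* before */
    pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);

    /* after */
    pte_t *pte_alloc_one_kernel(struct mm_struct *mm);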

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
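
    A typical conversion therefore looks like this (illustrative):

    /* before */
    if (!access_ok(VERIFY_WRITE, buf, count))
            return -EFAULT;

    /* after: the VERIFY_xyz argument is simply dropped */
    if (!access_ok(buf, count))
            return -EFAULT;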

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Dec, 2018

1 commit

  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, i.e. why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users of
    mmu notifiers that wish to maintain their own data structures without
    having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback is
    happening, and thus the device driver has to assume that it is always an
    munmap(). This is inefficient, as it means the driver needs to re-allocate
    the device page table on the next page fault and rebuild the whole device
    driver data structure for the range.

    Other use cases besides munmap() also exist; for instance, it is pointless
    for the device driver to invalidate the device page table when the
    invalidation is only for soft-dirty tracking. Or the device driver can
    optimize away an mprotect() that changes the page table access permissions
    for the range.

    This patchset enables all these optimizations for device drivers. I do not
    include any of them in this series, but another patchset I am posting will
    leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want to
    add a parameter, use a structure to group all the parameters for the
    mmu_notifier invalidate_range_start/end callbacks. No functional changes
    with this patch.
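
    As a rough sketch of the idea (field names are illustrative; the real
    structure carries more context than shown here):

    struct mm_struct;

    /* all invalidate_range_start/end arguments travel in one place */
    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
            /* later patches add the contextual event information here */
    };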

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

27 Dec, 2018

2 commits

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions with
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
     
  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - selftests improvements
    - large PUD support for HugeTLB
    - single-stepping fixes
    - improved tracing
    - various timer and vGIC fixes

    x86:
    - Processor Tracing virtualization
    - STIBP support
    - some correctness fixes
    - refactorings and splitting of vmx.c
    - use the Hyper-V range TLB flush hypercall
    - reduce order of vcpu struct
    - WBNOINVD support
    - do not use -ftrace for __noclone functions
    - nested guest support for PAUSE filtering on AMD
    - more Hyper-V enlightenments (direct mode for synthetic timers)

    PPC:
    - nested VFIO

    s390:
    - bugfixes only this time"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
    KVM: x86: Add CPUID support for new instruction WBNOINVD
    kvm: selftests: ucall: fix exit mmio address guessing
    Revert "compiler-gcc: disable -ftracer for __noclone functions"
    KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
    KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
    KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
    MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
    KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
    KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
    KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
    KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
    KVM: Make kvm_set_spte_hva() return int
    KVM: Replace old tlb flush function with new one to flush a specified range.
    KVM/MMU: Add tlb flush with range helper function
    KVM/VMX: Add hv tlb range flush support
    x86/hyper-v: Add HvFlushGuestAddressList hypercall support
    KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
    KVM: x86: Disable Intel PT when VMXON in L1 guest
    KVM: x86: Set intercept for Intel PT MSRs read/write
    KVM: x86: Implement Intel PT MSRs read/write emulation
    ...

    Linus Torvalds
     

26 Dec, 2018

1 commit

  • Pull arm64 festive updates from Will Deacon:
    "In the end, we ended up with quite a lot more than I expected:

    - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
    kernel-side support to come later)

    - Support for per-thread stack canaries, pending an update to GCC
    that is currently undergoing review

    - Support for kexec_file_load(), which permits secure boot of a kexec
    payload but also happens to improve the performance of kexec
    dramatically because we can avoid the sucky purgatory code from
    userspace. Kdump will come later (requires updates to libfdt).

    - Optimisation of our dynamic CPU feature framework, so that all
    detected features are enabled via a single stop_machine()
    invocation

    - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
    they can benefit from global TLB entries when KASLR is not in use

    - 52-bit virtual addressing for userspace (kernel remains 48-bit)

    - Patch in LSE atomics for per-cpu atomic operations

    - Custom preempt.h implementation to avoid unconditional calls to
    preempt_schedule() from preempt_enable()

    - Support for the new 'SB' Speculation Barrier instruction

    - Vectorised implementation of XOR checksumming and CRC32
    optimisations

    - Workaround for Cortex-A76 erratum #1165522

    - Improved compatibility with Clang/LLD

    - Support for TX2 system PMUS for profiling the L3 cache and DMC

    - Reflect read-only permissions in the linear map by default

    - Ensure MMIO reads are ordered with subsequent calls to Xdelay()

    - Initial support for memory hotplug

    - Tweak the threshold when we invalidate the TLB by-ASID, so that
    mremap() performance is improved for ranges spanning multiple PMDs.

    - Minor refactoring and cleanups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits)
    arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset()
    arm64: sysreg: Use _BITUL() when defining register bits
    arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches
    arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4
    arm64: docs: document pointer authentication
    arm64: ptr auth: Move per-thread keys from thread_info to thread_struct
    arm64: enable pointer authentication
    arm64: add prctl control for resetting ptrauth keys
    arm64: perf: strip PAC when unwinding userspace
    arm64: expose user PAC bit positions via ptrace
    arm64: add basic pointer authentication support
    arm64/cpufeature: detect pointer authentication
    arm64: Don't trap host pointer auth use to EL2
    arm64/kvm: hide ptrauth from guests
    arm64/kvm: consistently handle host HCR_EL2 flags
    arm64: add pointer authentication register bits
    arm64: add comments about EC exception levels
    arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned
    arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
    arm64: enable per-task stack canaries
    ...

    Linus Torvalds
     

21 Dec, 2018

5 commits

  • This patch moves the TLB flush from kvm_set_pte_rmapp() to
    kvm_mmu_notifier_change_pte() in order to avoid a redundant TLB flush.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • This patch makes kvm_set_spte_hva() return int so that the caller can
    check the return value to determine whether a TLB flush is needed.
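
    In effect (simplified; the per-architecture signatures differ slightly):

    /* before */
    void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

    /* after: the return value lets the caller decide whether to flush the TLB */
    int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);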

    Signed-off-by: Lan Tianyu
    Acked-by: Paul Mackerras
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • Signed-off-by: Wei Yang
    [Preserved the iff and a probably intentional weird bracket notation.
    Also dropped the style change to make a single-purpose patch. - Radim]
    Signed-off-by: Radim Krčmář

    Wei Yang
     
  • Since the offset is added directly to the hva from the
    gfn_to_hva_cache, a negative offset could result in an out of bounds
    write. The existing BUG_ON only checks for addresses beyond the end of
    the gfn_to_hva_cache, not for addresses before the start of the
    gfn_to_hva_cache.

    Note that all current call sites have non-negative offsets.
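
    Illustratively, the bound check has to cover both ends of the cached
    range, not just the upper one (a self-contained sketch with simplified
    names, not the kernel's exact code):

    #include <errno.h>

    /* Reject writes before the start as well as past the end of the cache. */
    static int check_cached_range(long offset, unsigned long len,
                                  unsigned long cache_len)
    {
            if (offset < 0)
                    return -EINVAL;                     /* before the start */
            if ((unsigned long)offset > cache_len ||
                len > cache_len - (unsigned long)offset)
                    return -EINVAL;                     /* past the end */
            return 0;
    }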

    Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Peter Shier
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Sean Christopherson
    Signed-off-by: Radim Krčmář

    Jim Mattson
     
  • Previously, in the case where (gpa + len) wrapped around, the entire
    region was not validated, contrary to what the comment claimed. It doesn't
    actually seem that wraparound should be allowed here at all.

    Furthermore, since some callers don't check the return code from this
    function, it seems prudent to clear ghc->memslot in the event of an
    error.
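
    A minimal sketch of the wraparound check involved (simplified names, not
    the kernel's exact code):

    #include <errno.h>
    #include <stdint.h>

    /* Refuse a (gpa, len) range that wraps around the address space. */
    static int check_no_wraparound(uint64_t gpa, uint64_t len)
    {
            if (len && gpa + len - 1 < gpa)
                    return -EINVAL;
            return 0;
    }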

    Fixes: 8f964525a121f ("KVM: Allow cross page reads and writes from cached translations.")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Marc Orr
    Cc: Andrew Honig
    Signed-off-by: Radim Krčmář

    Jim Mattson
     

20 Dec, 2018

9 commits

  • …marm/kvmarm into HEAD

    KVM/arm updates for 4.21

    - Large PUD support for HugeTLB
    - Single-stepping fixes
    - Improved tracing
    - Various timer and vgic fixups

    Paolo Bonzini
     
  • 32 and 64bit use different symbols to identify the traps.
    32bit has a fine grained approach (prefetch abort, data abort and HVC),
    while 64bit is pretty happy with just "trap".

    This has been fine so far, except that we now need to decode some
    of that in tracepoints that are common to both architectures.

    Introduce ARM_EXCEPTION_IS_TRAP which abstracts the trap symbols
    and make the tracepoint use it.
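
    A plausible shape for such an abstraction (sketch, one definition per
    architecture, built on the existing trap constants):

    /* arm64: everything of interest is already reported as one "trap" */
    #define ARM_EXCEPTION_IS_TRAP(x)  ((x) == ARM_EXCEPTION_TRAP)

    /* arm (32-bit): the fine-grained symbols all count as traps */
    #define ARM_EXCEPTION_IS_TRAP(x)                  \
            ((x) == ARM_EXCEPTION_PREF_ABORT ||       \
             (x) == ARM_EXCEPTION_DATA_ABORT ||       \
             (x) == ARM_EXCEPTION_HVC)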

    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • There are two things we need to take care of when we create block
    mappings in the stage 2 page tables:

    (1) The alignment within a PMD between the host address range and the
    guest IPA range must be the same, since otherwise we end up mapping
    pages with the wrong offset.

    (2) The head and tail of a memory slot may not cover a full block
    size, and we have to take care to not map those with block
    descriptors, since we could expose memory to the guest that the host
    did not intend to expose.

    So far, we have been taking care of (1), but not (2), and our commentary
    describing (1) was somewhat confusing.

    This commit attempts to factor out the checks of both into a common
    function, and if we don't pass the check, we won't attempt PMD
    mappings for either hugetlbfs or THP.

    Note that we used to only check the alignment for THP, not for
    hugetlbfs, but as far as I can tell the check needs to be applied to
    both scenarios.
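
    A simplified sketch of the combined check (names and types are
    illustrative, not the kernel's exact helper):

    #include <stdbool.h>

    /*
     * A block mapping is only safe if the host VA and the guest IPA share
     * the same offset within the block, and the candidate block lies
     * entirely within the memory slot.  block_size is a power of two.
     */
    static bool supports_block_mapping(unsigned long hva, unsigned long ipa,
                                       unsigned long slot_start_hva,
                                       unsigned long slot_end_hva,
                                       unsigned long block_size)
    {
            unsigned long block_start = hva & ~(block_size - 1);

            /* (1) same offset within the block on both sides */
            if ((hva & (block_size - 1)) != (ipa & (block_size - 1)))
                    return false;

            /* (2) the head and tail of the slot must not be block-mapped */
            if (block_start < slot_start_hva ||
                block_start + block_size > slot_end_hva)
                    return false;

            return true;
    }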

    Cc: Ralph Palutke
    Cc: Lukas Braun
    Reported-by: Lukas Braun
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently only halt the guest when a vCPU messes with the active
    state of an SPI. This is perfectly fine for GICv2, but isn't enough
    for GICv3, where all vCPUs can access the state of any other vCPU.

    Let's broaden the condition to include any GICv3 interrupt that
    has an active state (i.e. all but LPIs).

    Cc: stable@vger.kernel.org
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • kvm_timer_vcpu_terminate can only be called in two scenarios:

    1. As part of cleanup during a failed VCPU create
    2. As part of freeing the whole VM (struct kvm refcount == 0)

    In the first case, we cannot have programmed any timers or mapped any
    IRQs, and therefore we do not have to cancel anything or unmap anything.

    In the second case, the VCPU will have gone through kvm_timer_vcpu_put,
    which will have canceled the emulated physical timer's hrtimer, and we
    do not need to do that here as well. We also do not care if the irq is
    recorded as mapped or not in the VGIC data structure, because the whole
    VM is going away. That leaves us only with having to ensure that we
    cancel the bg_timer if we were blocking the last time we called
    kvm_timer_vcpu_put().

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The use of a work queue in the hrtimer expire function for the bg_timer
    is a leftover from the time when we would inject interrupts when the
    bg_timer expired.

    Since we are no longer doing that, we can instead call
    kvm_vcpu_wake_up() directly from the hrtimer function and remove all
    workqueue functionality from the arch timer code.
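
    A simplified sketch of the resulting expiry path (the real handler also
    deals with re-arming; names follow the existing timer code):

    static enum hrtimer_restart kvm_bg_timer_expire(struct hrtimer *hrt)
    {
            struct arch_timer_cpu *timer;
            struct kvm_vcpu *vcpu;

            timer = container_of(hrt, struct arch_timer_cpu, bg_timer);
            vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);

            /* wake the vcpu directly, no work queue needed */
            kvm_vcpu_wake_up(vcpu);
            return HRTIMER_NORESTART;
    }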

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The kvm_exit tracepoint strangely always reported exits as being IRQs.
    This seems to be because either the __print_symbolic or the tracepoint
    macros use a variable named idx.

    Take this chance to update the fields in the tracepoint to reflect the
    concepts in the arm64 architecture that we pass to the tracepoint and
    move the exception type table to the same location and header files as
    the exits code.

    We also clear out the exception code to 0 for IRQ exits (which
    translates to UNKNOWN in text) to make it slightly less confusing to
    parse the trace output.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When checking if there are any pending IRQs for the VM, consider the
    active state and priority of the IRQs as well.

    Otherwise we could be continuously scheduling a guest hypervisor without
    it seeing an IRQ.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When using the nospec API, it should be taken into account that:

    "...if the CPU speculates past the bounds check then
    * array_index_nospec() will clamp the index within the range of [0,
    * size)."

    The above is part of the header for macro array_index_nospec() in
    linux/nospec.h

    Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
    or to exactly VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
    returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
    of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:

    /* SGIs and PPIs */
    if (intid <= VGIC_MAX_PRIVATE)
        return &vcpu->arch.vgic_cpu.private_irqs[intid];

    /* SPIs */
    if (intid <= VGIC_MAX_SPI)
        return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

    are valid values for intid.

    Fix this by calling the array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
    and VGIC_MAX_SPI + 1 as arguments for its size parameter.
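
    In code terms (illustrative):

    /* pass the bound + 1 as the size so the bound itself stays reachable */
    intid = array_index_nospec(intid, VGIC_MAX_PRIVATE + 1);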

    Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    [dropped the SPI part which was fixed separately]
    Signed-off-by: Marc Zyngier

    Gustavo A. R. Silva
     

19 Dec, 2018

1 commit

  • If you register a kvm_coalesced_mmio_zone with '.pio = 0' but then
    unregister it with '.pio = 1', KVM_UNREGISTER_COALESCED_MMIO will try to
    unregister it from KVM_PIO_BUS rather than KVM_MMIO_BUS, which is a
    no-op. But it frees the kvm_coalesced_mmio_dev anyway, causing a
    use-after-free.

    Fix it by only unregistering and freeing the zone if the correct value
    of 'pio' is provided.
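
    A simplified sketch of the corrected unregister loop (based on the
    existing coalesced MMIO helpers; not an exact quote of the fix):

    list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
            /* only tear down zones whose 'pio' flag matches the request */
            if (zone->pio != dev->zone.pio)
                    continue;
            if (!coalesced_mmio_in_range(dev, zone->addr, zone->size))
                    continue;

            kvm_io_bus_unregister_dev(kvm,
                    zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
            kvm_iodevice_destructor(&dev->dev);
    }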

    Reported-by: syzbot+f87f60bb6f13f39b54e3@syzkaller.appspotmail.com
    Fixes: 0804c849f1df ("kvm/x86 : add coalesced pio support")
    Signed-off-by: Eric Biggers
    Signed-off-by: Paolo Bonzini

    Eric Biggers
     

18 Dec, 2018

15 commits

  • SPIs should be checked against the VM's specific configuration, and
    not the architectural maximum.

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In attempting to re-construct the logic for our stage 2 page table
    layout I found the reasoning in the comment explaining how we calculate
    the number of levels used for stage 2 page tables a bit backwards.

    This commit attempts to clarify the comment, to make it slightly easier
    to read without having the Arm ARM open on the right page.

    While we're at it, fixup a typo in a comment that was recently changed.

    Reviewed-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • To change the active state of an interrupt via MMIO, halt is requested for all vcpus of
    the affected guest before modifying the IRQ state. This is done by calling
    cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
    disabled at this point and we cannot reschedule a vcpu.

    We actually don't need any of this, as kvm_arm_halt_guest ensures that
    all the other vcpus are out of the guest. Let's just drop that useless
    code.

    Signed-off-by: Julien Thierry
    Suggested-by: Christoffer Dall
    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Julien Thierry
     
  • KVM only supports PMD hugepages at stage 2. Now that the various page
    handling routines are updated, extend the stage 2 fault handling to
    map in PUD hugepages.

    Addition of PUD hugepage support enables additional page sizes (e.g.,
    1G with 4K granule) which can be useful on cores that support mapping
    larger block sizes in the TLB entries.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replace BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, add support
    to the age handling notifiers for PUD hugepages when encountered.

    Provide trivial helpers for arm32 to allow sharing code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, extend the
    access fault handling at Stage 2 to support PUD hugepages when
    encountered.

    Provide trivial helpers for arm32 to allow sharing of code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    detecting execute permissions on PUD page table entries. Faults due to
    lack of execute permissions on page table entries are used to perform
    i-cache invalidation on first execute.

    Provide trivial implementations of arm32 helpers to allow sharing of
    code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    write protecting PUD hugepages when they are encountered. Write
    protecting guest tables is used to track dirty pages when migrating
    VMs.

    Also, provide trivial implementations of required kvm_s2pud_* helpers
    to allow sharing of code with arm32.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON() in arm32 pud helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Introduce helpers to abstract architectural handling of the conversion
    of pfn to page table entries and marking a PMD page table entry as a
    block entry.

    The helpers are introduced in preparation for supporting PUD hugepages
    at stage 2 - which are supported on arm64 but do not exist on arm.
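
    On arm64 these end up as thin wrappers along these lines (sketch):

    static inline pmd_t kvm_pfn_pmd(kvm_pfn_t pfn, pgprot_t prot)
    {
            return pfn_pmd(pfn, prot);
    }

    static inline pmd_t kvm_pmd_mkhuge(pmd_t pmd)
    {
            return pmd_mkhuge(pmd);
    }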

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Acked-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Stage 2 fault handler marks a page as executable if it is handling an
    execution fault or if it was a permission fault in which case the
    executable bit needs to be preserved.

    The logic to decide if the page should be marked executable is
    duplicated for PMD and PTE entries. To avoid creating another copy
    when support for PUD hugepages is introduced refactor the code to
    share the checks needed to mark a page table entry as executable.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • The code for operations such as marking the pfn as dirty, and
    dcache/icache maintenance during stage 2 fault handling is duplicated
    between normal pages and PMD hugepages.

    Instead of creating another copy of the operations when we introduce
    PUD hugepages, let's share them across the different pagesizes.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • When restoring the active state from userspace, we don't know which CPU
    was the source for the active state, and this is not architecturally
    exposed in any of the register state.

    Set the active_source to 0 in this case. In the future, we can expand
    on this and expose it as additional information to userspace for GICv2
    if anyone cares.

    Cc: stable@vger.kernel.org
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We recently addressed a VMID generation race by introducing a read/write
    lock around accesses and updates to the vmid generation values.

    However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
    does so without taking the read lock.

    As far as I can tell, this can lead to the same kind of race:

    VM 0, VCPU 0                            VM 0, VCPU 1
    ------------                            ------------
    update_vttbr (vmid 254)
                                            update_vttbr (vmid 1) // roll over
                                            read_lock(kvm_vmid_lock);
                                            force_vm_exit()
    local_irq_disable
    need_new_vmid_gen == false //because vmid gen matches

    enter_guest (vmid 254)
                                            kvm_arch.vttbr = <pgd>:<vmid 1>
                                            read_unlock(kvm_vmid_lock);

                                            enter_guest (vmid 1)

    This results in two VCPUs of the same VM running with different VMIDs,
    and (even worse) other VCPUs from other VMs could now be allocated the
    clashing VMID 254 from the new generation as long as VCPU 0 has not exited.

    Attempt to solve this by making sure vttbr is updated before another CPU
    can observe the updated VMID generation.

    Cc: stable@vger.kernel.org
    Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race"
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When we emulate a guest instruction, we don't advance the hardware
    singlestep state machine, and thus the guest will receive a software
    step exception after a next instruction which is not emulated by the
    host.

    We bodge around this in an ad-hoc fashion. Sometimes we explicitly check
    whether userspace requested a single step, and fake a debug exception
    from within the kernel. Other times, we advance the HW singlestep state and
    rely on the HW to generate the exception for us. Thus, the observed step
    behaviour differs for host and guest.

    Let's make this simpler and consistent by always advancing the HW
    singlestep state machine when we skip an instruction. Thus we can rely
    on the hardware to generate the singlestep exception for us, and never
    need to explicitly check for an active-pending step, nor do we need to
    fake a debug exception from the guest.
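
    Conceptually (a simplified sketch of the arm64 path; the helper name here
    is illustrative):

    /*
     * Whenever the hypervisor skips an instruction, also advance the
     * hardware single-step state machine by clearing PSTATE.SS, so that
     * a pending step exception fires on the next instruction.
     */
    static void skip_instr_and_step(struct kvm_vcpu *vcpu)
    {
            *vcpu_pc(vcpu) += 4;                  /* skip the emulated insn */
            *vcpu_cpsr(vcpu) &= ~DBG_SPSR_SS;     /* advance the SS machine */
    }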

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     
  • When we emulate an MMIO instruction, we advance the CPU state within
    decode_hsr(), before emulating the instruction effects.

    Having this logic in decode_hsr() is opaque, and advancing the state
    before emulation is problematic. It gets in the way of applying
    consistent single-step logic, and it prevents us from being able to fail
    an MMIO instruction with a synchronous exception.

    Clean this up by only advancing the CPU state *after* the effects of the
    instruction are emulated.

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     

14 Dec, 2018

2 commits

  • There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
    it can take kvm->mmu_lock for an extended period of time. Second, its user
    can actually see many false positives in some cases. The latter is due
    to a benign race like this:

    1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
    them.
    2. The guest modifies the pages, causing them to be marked dirty.
    3. Userspace actually copies the pages.
    4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
    they were not written to since (3).

    This is especially a problem for large guests, where the time between
    (1) and (3) can be substantial. This patch introduces a new
    capability which, when enabled, makes KVM_GET_DIRTY_LOG not
    write-protect the pages it returns. Instead, userspace has to
    explicitly clear the dirty log bits just before using the content
    of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
    64-page granularity rather than requiring a full memslot to be synced;
    this way, the mmu_lock is taken for small amounts of time, and
    only a small amount of time will pass between write protection
    of pages and the sending of their content.
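
    From userspace, usage is expected to look roughly like this (a sketch;
    see the KVM API documentation for the exact structure layout and
    alignment rules):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /*
     * With the capability enabled, KVM_GET_DIRTY_LOG no longer
     * write-protects pages; userspace copies the pages out and then clears
     * the handled dirty bits explicitly, in 64-page granules.
     */
    static int clear_dirty_range(int vm_fd, __u32 slot, __u64 first_page,
                                 __u32 num_pages, void *bitmap)
    {
            struct kvm_clear_dirty_log clr = {
                    .slot         = slot,
                    .first_page   = first_page,   /* 64-page aligned */
                    .num_pages    = num_pages,    /* multiple of 64 */
                    .dirty_bitmap = bitmap,       /* bits to clear */
            };

            return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clr);
    }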

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Once manual dirty log reprotection is enabled, kvm_get_dirty_log_protect's
    pointer argument will always be false on exit, because no TLB flush is needed
    until the manual re-protection operation. Rename it from "is_dirty" to "flush",
    which more accurately tells the caller what they have to do with it.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini