01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     


29 Sep, 2018

1 commit

  • Michael writes:
    "powerpc fixes for 4.19 #3

    A reasonably big batch of fixes due to me being away for a few weeks.

    A fix for the TM emulation support on Power9, which could result in
    corrupting the guest r11 when running under KVM.

    Two fixes to the TM code which could lead to userspace GPR corruption
    if we take an SLB miss at exactly the wrong time.

    Our dynamic patching code had a bug that meant we could patch freed
    __init text, which could lead to corrupting userspace memory.

    csum_ipv6_magic() didn't work on little endian platforms since we
    optimised it recently.

    A fix for an endian bug when reading a device tree property telling
    us how many storage keys the machine has available.

    Fix a crash seen on some configurations of PowerVM when migrating the
    partition from one machine to another.

    A fix for a regression in the setup of our CPU to NUMA node mapping
    in KVM guests.

    A fix to our selftest Makefiles to make them work since a recent
    change to the shared Makefile logic."

    * tag 'powerpc-4.19-3' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    selftests/powerpc: Fix Makefiles for headers_install change
    powerpc/numa: Use associativity if VPHN hcall is successful
    powerpc/tm: Avoid possible userspace r1 corruption on reclaim
    powerpc/tm: Fix userspace r13 corruption
    powerpc/pseries: Fix uninitialized timer reset on migration
    powerpc/pkeys: Fix reading of ibm,processor-storage-keys property
    powerpc: fix csum_ipv6_magic() on little endian platforms
    powerpc/powernv/ioda2: Reduce upper limit for DMA window size (again)
    powerpc: Avoid code patching freed init sections
    KVM: PPC: Book3S HV: Fix guest r11 corruption with POWER9 TM workarounds

    Greg Kroah-Hartman
     

28 Sep, 2018

2 commits

  • Update device_add_disk() to take a 'groups' argument so that
    individual drivers can register a device with additional sysfs
    attributes.
    This avoids the race condition the driver would otherwise have if
    these groups were created with sysfs_create_groups() after the
    device was made visible.

    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • Commit

    1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active")

    can occasionally cause system resets when kexec-ing a second kernel even
    if SEV is not active.

    That's because get_sev_encryption_bit() uses 32-bit rIP-relative
    addressing to read the value of enc_bit - a variable which caches a
    previously detected encryption bit position - but kexec may allocate
    the early boot code to a higher location, beyond the 32-bit addressing
    limit.

    In this case, garbage will be read and get_sev_encryption_bit() will
    return the wrong value, leading to accessing memory with the wrong
    encryption setting.

    Therefore, remove enc_bit, and thus get rid of the need to do 32-bit
    rIP-relative addressing in the first place.

    [ bp: massage commit message heavily. ]

    Fixes: 1958b5fc4010 ("x86/boot: Add early boot support when running with SEV active")
    Suggested-by: Borislav Petkov
    Signed-off-by: Kairui Song
    Signed-off-by: Borislav Petkov
    Reviewed-by: Tom Lendacky
    Cc: linux-kernel@vger.kernel.org
    Cc: tglx@linutronix.de
    Cc: mingo@redhat.com
    Cc: hpa@zytor.com
    Cc: brijesh.singh@amd.com
    Cc: kexec@lists.infradead.org
    Cc: dyoung@redhat.com
    Cc: bhe@redhat.com
    Cc: ghook@redhat.com
    Link: https://lkml.kernel.org/r/20180927123845.32052-1-kasong@redhat.com

    Kairui Song
     


25 Sep, 2018

6 commits

  • Currently associativity is used to look up the node id even if the
    preceding VPHN hcall failed. However, this can cause a CPU to be made
    part of the wrong node (most likely node 0), because VPHN is not
    enabled on KVM guests.

    With 2ea6263 ("powerpc/topology: Get topology for shared processors at
    boot"), the associativity is used to set the wrong node, so KVM
    guest topology is broken.

    For example, a 4-node KVM guest would previously have reported:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3
    node 0 size: 1746 MB
    node 0 free: 1604 MB
    node 1 cpus: 4 5 6 7
    node 1 size: 2044 MB
    node 1 free: 1765 MB
    node 2 cpus: 8 9 10 11
    node 2 size: 2044 MB
    node 2 free: 1837 MB
    node 3 cpus: 12 13 14 15
    node 3 size: 2044 MB
    node 3 free: 1903 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    Would now report:

    [root@localhost ~]# numactl -H
    available: 4 nodes (0-3)
    node 0 cpus: 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 1746 MB
    node 0 free: 1244 MB
    node 1 cpus:
    node 1 size: 2044 MB
    node 1 free: 2032 MB
    node 2 cpus: 1
    node 2 size: 2044 MB
    node 2 free: 2028 MB
    node 3 cpus:
    node 3 size: 2044 MB
    node 3 free: 2032 MB
    node distances:
    node 0 1 2 3
    0: 10 40 40 40
    1: 40 10 40 40
    2: 40 40 10 40
    3: 40 40 40 10

    Fix this by skipping associativity lookup if the VPHN hcall failed.

    Fixes: 2ea626306810 ("powerpc/topology: Get topology for shared processors at boot")
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Michael Ellerman

    Srikar Dronamraju
     
  • Currently we store the userspace r1 to PACATMSCRATCH before finally
    saving it to the thread struct.

    In theory an exception could be taken here (like a machine check or
    SLB miss) that could write PACATMSCRATCH and hence corrupt the
    userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but
    others do.

    We've never actually seen this happen but it's theoretically
    possible. Either way, the code is fragile as it is.

    This patch saves r1 to the kernel stack (which can't fault) before we
    turn MSR[RI] back on. PACATMSCRATCH is still used but only with
    MSR[RI] off. We then copy r1 from the kernel stack to the thread
    struct once we have MSR[RI] back on.

    Suggested-by: Breno Leitao
    Signed-off-by: Michael Neuling
    Signed-off-by: Michael Ellerman

    Michael Neuling
     
  • When we treclaim we store the userspace checkpointed r13 to a scratch
    SPR and then later save the scratch SPR to the user thread struct.

    Unfortunately, this doesn't work as accessing the user thread struct
    can take an SLB fault and the SLB fault handler will write the same
    scratch SPRG that now contains the userspace r13.

    To fix this, we store r13 to the kernel stack (which can't fault)
    before we access the user thread struct.

    Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen
    as a random userspace segfault with r13 looking like a kernel address.

    Signed-off-by: Michael Neuling
    Reviewed-by: Breno Leitao
    Signed-off-by: Michael Ellerman

    Michael Neuling
     
  • Building a riscv kernel with CONFIG_FUNCTION_TRACER and
    CONFIG_MODVERSIONS enabled results in these two warnings:

    MODPOST vmlinux.o
    WARNING: EXPORT symbol "return_to_handler" [vmlinux] version generation failed, symbol will not be versioned.
    WARNING: EXPORT symbol "_mcount" [vmlinux] version generation failed, symbol will not be versioned.

    When exporting symbols from an assembly file, the MODVERSIONS code
    requires their prototypes to be defined in asm-prototypes.h (see
    scripts/Makefile.build). Since both of these symbols have prototypes
    defined in linux/ftrace.h, include this header from RISC-V's
    asm-prototypes.h.

    Reported-by: Karsten Merker
    Signed-off-by: James Cowgill
    Signed-off-by: Palmer Dabbelt

    James Cowgill
     
  • We only use it in biovec_phys_mergeable and an m68k paravirt driver,
    so just open-code it there. Also remove the pointless unsigned long
    cast for the offset in the open-coded instances.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Turn the macro into an inline, move it to blk.h and simplify the
    arch hooks a bit.

    Also rename the function to biovec_phys_mergeable as there is no need
    to shout.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2018

1 commit

  • After migration of a powerpc LPAR, the kernel executes code to
    update the system state to reflect new platform characteristics.

    Such changes include modifications to device tree properties provided
    to the system by PHYP. Property notifications received by the
    post_mobility_fixup() code are passed along to the kernel in general
    through a call to of_update_property() which in turn passes such
    events back to all modules through entries like the '.notifier_call'
    function within the NUMA module.

    When the NUMA module updates its state, it resets its event timer. If
    this occurs after a previous call to stop_topology_update() or on a
    system without VPHN enabled, the code runs into an uninitialized timer
    structure and crashes. This patch adds a safety check along this path
    toward the problem code.

    An example crash log is as follows.

    ibmvscsi 30000081: Re-enabling adapter!
    ------------[ cut here ]------------
    kernel BUG at kernel/time/timer.c:958!
    Oops: Exception in kernel mode, sig: 5 [#1]
    LE SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: nfsv3 nfs_acl nfs tcp_diag udp_diag inet_diag lockd unix_diag af_packet_diag netlink_diag grace fscache sunrpc xts vmx_crypto pseries_rng sg binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 11 PID: 3067 Comm: drmgr Not tainted 4.17.0+ #179
    ...
    NIP mod_timer+0x4c/0x400
    LR reset_topology_timer+0x40/0x60
    Call Trace:
    0xc0000003f9407830 (unreliable)
    reset_topology_timer+0x40/0x60
    dt_update_callback+0x100/0x120
    notifier_call_chain+0x90/0x100
    __blocking_notifier_call_chain+0x60/0x90
    of_property_notify+0x90/0xd0
    of_update_property+0x104/0x150
    update_dt_property+0xdc/0x1f0
    pseries_devicetree_update+0x2d0/0x510
    post_mobility_fixup+0x7c/0xf0
    migration_store+0xa4/0xc0
    kobj_attr_store+0x30/0x60
    sysfs_kf_write+0x64/0xa0
    kernfs_fop_write+0x16c/0x240
    __vfs_write+0x40/0x200
    vfs_write+0xc8/0x240
    ksys_write+0x5c/0x100
    system_call+0x58/0x6c

    Fixes: 5d88aa85c00b ("powerpc/pseries: Update CPU maps when device tree is updated")
    Cc: stable@vger.kernel.org # v3.10+
    Signed-off-by: Michael Bringmann
    Signed-off-by: Michael Ellerman

    Michael Bringmann
     

23 Sep, 2018

2 commits

  • Juergen writes:
    "xen:
    Two small fixes for xen drivers."

    * tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
    xen: issue warning message when out of grant maptrack entries
    xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code

    Greg Kroah-Hartman
     
  • Thomas writes:
    "A set of fixes for x86:

    - Resolve the kvmclock regression on AMD systems with memory
    encryption enabled. The rework of the kvmclock memory allocation
    during early boot results in encrypted storage, which is not
    shareable with the hypervisor. Create a new section for this data
    which is mapped unencrypted, and take care that the later
    allocations for shared kvmclock memory are unencrypted as well.

    - Fix the build regression in the paravirt code introduced by the
    recent spectre v2 updates.

    - Ensure that the initial static page tables cover the fixmap space
    correctly so early console always works. This worked so far by
    chance, but recent modifications to the fixmap layout can -
    depending on kernel configuration - move the relevant entries to a
    different place which is not covered by the initial static page
    tables.

    - Address the regressions and issues which got introduced with the
    recent extensions to the Intel Resource Director Technology code.

    - Update maintainer entries to document reality"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mm: Expand static page table for fixmap space
    MAINTAINERS: Add X86 MM entry
    x86/intel_rdt: Add Reinette as co-maintainer for RDT
    MAINTAINERS: Add Borislav to the x86 maintainers
    x86/paravirt: Fix some warning messages
    x86/intel_rdt: Fix incorrect loop end condition
    x86/intel_rdt: Fix exclusive mode handling of MBA resource
    x86/intel_rdt: Fix incorrect loop end condition
    x86/intel_rdt: Do not allow pseudo-locking of MBA resource
    x86/intel_rdt: Fix unchecked MSR access
    x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
    x86/intel_rdt: Global closid helper to support future fixes
    x86/intel_rdt: Fix size reporting of MBA resource
    x86/intel_rdt: Fix data type in parsing callbacks
    x86/kvm: Use __bss_decrypted attribute in shared variables
    x86/mm: Add .bss..decrypted section to hold shared variables

    Greg Kroah-Hartman
     

21 Sep, 2018

3 commits

  • Paolo writes:
    "It's mostly small bugfixes and cleanups, mostly around x86 nested
    virtualization. One important change, not related to nested
    virtualization, is that the ability for the guest kernel to trap
    CPUID instructions (in Linux that's the ARCH_SET_CPUID arch_prctl) is
    now masked by default. This is because the feature is detected
    through an MSR; a very bad idea that Intel seems to like more and
    more. Some applications choke if the other fields of that MSR are
    not initialized as on real hardware, hence we have to disable the
    whole MSR by default, as was the case before Linux 4.12."

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (23 commits)
    KVM: nVMX: Fix bad cleanup on error of get/set nested state IOCTLs
    kvm: selftests: Add platform_info_test
    KVM: x86: Control guest reads of MSR_PLATFORM_INFO
    KVM: x86: Turbo bits in MSR_PLATFORM_INFO
    nVMX x86: Check VPID value on vmentry of L2 guests
    nVMX x86: check posted-interrupt descriptor address on vmentry of L2
    KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv
    KVM: VMX: check nested state and CR4.VMXE against SMM
    kvm: x86: make kvm_{load|put}_guest_fpu() static
    x86/hyper-v: rename ipi_arg_{ex,non_ex} structures
    KVM: VMX: use preemption timer to force immediate VMExit
    KVM: VMX: modify preemption timer bit only when arming timer
    KVM: VMX: immediately mark preemption timer expired only for zero value
    KVM: SVM: Switch to bitmap_zalloc()
    KVM/MMU: Fix comment in walk_shadow_page_lockless_end()
    kvm: selftests: use -pthread instead of -lpthread
    KVM: x86: don't reset root in kvm_mmu_setup()
    kvm: mmu: Don't read PDPTEs when paging is not enabled
    x86/kvm/lapic: always disable MMIO interface in x2APIC mode
    KVM: s390: Make huge pages unavailable in ucontrol VMs
    ...

    Greg Kroah-Hartman
     
  • We met a kernel panic when enabling earlycon, because the fixmap
    address of earlycon is not statically set up.

    Currently the static fixmap setup in head_64.S only covers 2M virtual
    address space, while it actually could be in 4M space with different
    kernel configurations, e.g. when VSYSCALL emulation is disabled.

    So increase the static space to 4M for now by defining FIXMAP_PMD_NUM to 2,
    and add a build time check to ensure that the fixmap is covered by the
    initial static page tables.

    Fixes: 1ad83c858c7d ("x86_64,vsyscall: Make vsyscall emulation configurable")
    Suggested-by: Thomas Gleixner
    Signed-off-by: Feng Tang
    Signed-off-by: Thomas Gleixner
    Tested-by: kernel test robot
    Reviewed-by: Juergen Gross (Xen parts)
    Cc: H Peter Anvin
    Cc: Peter Zijlstra
    Cc: Michal Hocko
    Cc: Yinghai Lu
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Andy Lutomirsky
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180920025828.23699-1-feng.tang@intel.com

    Feng Tang
     
  • The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set
    their return value in the "r" local variable and break out of the
    switch block when they encounter an error.
    This is because vcpu_load() is called before the switch block, which
    has a matching vcpu_put() cleanup afterwards.

    However, the KVM_{GET,SET}_NESTED_STATE IOCTL handlers just return
    immediately on error without performing the above-mentioned cleanup.

    Thus, change these handlers to behave as expected.

    Fixes: 8fcc4b5923af ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")

    Reviewed-by: Mark Kanda
    Reviewed-by: Patrick Colp
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     

20 Sep, 2018

19 commits

  • scan_pkey_feature() uses of_property_read_u32_array() to read the
    ibm,processor-storage-keys property and calls be32_to_cpu() on the
    value it gets. The problem is that of_property_read_u32_array() already
    returns the value converted to the CPU byte order.

    The value of pkeys_total ends up more or less sane because there's a min()
    call in pkey_initialize() which reduces pkeys_total to 32. So in practice
    the kernel ignores the fact that the hypervisor reserved one key for
    itself (the device tree advertises 31 keys in my test VM).

    This is wrong, but the effect in practice is that when a process
    tries to allocate the 32nd key, it gets an -EINVAL error instead of
    -ENOSPC, which would indicate that there aren't any keys available.

    Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
    Cc: stable@vger.kernel.org # v4.16+
    Signed-off-by: Thiago Jung Bauermann
    Signed-off-by: Michael Ellerman

    Thiago Jung Bauermann
     
  • On little endian platforms, csum_ipv6_magic() keeps len and proto in
    CPU byte order. This generates bad results, leading to ICMPv6
    packets from other hosts being dropped by powerpc64le platforms.

    In order to fix this, len and proto should be converted to network
    byte order, i.e. big endian byte order. However, checksumming
    0x12345678 and 0x56341278 provides the exact same result, so it is
    enough to rotate the sum of len and proto by 1 byte.

    PPC32 only supports big endian, so the fix is needed for PPC64 only.

    Fixes: e9c4943a107b ("powerpc: Implement csum_ipv6_magic in assembly")
    Reported-by: Jianlin Shi
    Reported-by: Xin Long
    Cc: # 4.18+
    Signed-off-by: Christophe Leroy
    Tested-by: Xin Long
    Signed-off-by: Michael Ellerman

    Christophe Leroy
     
  • mpe: This was fixed originally in commit d3d4ffaae439
    ("powerpc/powernv/ioda2: Reduce upper limit for DMA window size"), but
    contrary to what the merge commit says was inadvertently lost by me in
    commit ce57c6610cc2 ("Merge branch 'topic/ppc-kvm' into next") which
    brought in changes that moved the code to a new file. So reapply it to
    the new file.

    Original commit message follows:

    We use PHB in mode1, which uses bit 59 to select a correct DMA
    window. However there is mode2, which uses bits 59:55 and allows up
    to 32 DMA windows per PE.

    Even though the documentation does not clearly specify that, it
    seems that the actual hardware does not support bits 59:55 even in
    mode1; in other words we can create a window as big as 1<
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     
  • Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
    to reads of MSR_PLATFORM_INFO.

    Disabling access to reads of this MSR gives userspace the control to "expose"
    this platform-dependent information to guests in a clear way. As it exists
    today, guests that read this MSR would get unpopulated information if userspace
    hadn't already set it (and prior to this patch series, only the CPUID faulting
    information could have been populated). This existing interface could be
    confusing if guests don't handle the potential for incorrect/incomplete
    information gracefully (e.g. zero reported for base frequency).

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
     
  • Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously,
    only the CPUID faulting bit was settable; now any bit in
    MSR_PLATFORM_INFO is settable. This can be used, for example, to
    convey frequency information about the platform on which the guest is
    running.

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
     
  • According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
    following check needs to be enforced on vmentry of L2 guests:

    If the 'enable VPID' VM-execution control is 1, the value of the
    VPID VM-execution control field must not be 0000H.

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Mark Kanda
    Reviewed-by: Liran Alon
    Reviewed-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • According to section "Checks on VMX Controls" in Intel SDM vol 3C,
    the following check needs to be enforced on vmentry of L2 guests:

    - Bits 5:0 of the posted-interrupt descriptor address are all 0.
    - The posted-interrupt descriptor address does not set any bits
    beyond the processor's physical-address width.

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Mark Kanda
    Reviewed-by: Liran Alon
    Reviewed-by: Darren Kenny
    Reviewed-by: Karl Heubaum
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • In case L1 does not intercept L2 HLT or enters L2 in HLT
    activity-state, it is possible for a vCPU to be blocked while it is
    in guest-mode.

    According to Intel SDM 26.6.5 Interrupt-Window Exiting and
    Virtual-Interrupt Delivery: "These events wake the logical processor
    if it just entered the HLT state because of a VM entry".
    Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
    deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
    should be woken from the HLT state and injected with the interrupt.

    In addition, if while the vCPU is blocked (while it is in guest-mode),
    it receives a nested posted-interrupt, then the vCPU should also be
    woken and injected with the posted interrupt.

    To handle these cases, this patch enhances kvm_vcpu_has_events() to also
    check if there is a pending interrupt in L2 virtual APICv provided by
    L1. That is, it evaluates if there is a pending virtual interrupt for L2
    by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
    Evaluation of Pending Interrupts.

    Note that this also handles the case of nested posted-interrupt by the
    fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
    called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
    kvm_vcpu_running() -> vmx_check_nested_events() ->
    vmx_complete_nested_posted_interrupt().

    Reviewed-by: Nikita Leshenko
    Reviewed-by: Darren Kenny
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     
  • VMX cannot be enabled under SMM, check it when CR4 is set and when nested
    virtualization state is restored.

    This should fix some WARNs reported by syzkaller, mostly around
    alloc_shadow_vmcs.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The functions
    kvm_load_guest_fpu()
    kvm_put_guest_fpu()

    are only used locally, so make them static. This also requires
    moving both functions, because they are used before their
    definitions. Those functions were exported (via EXPORT_SYMBOL)
    before commit e5bb40251a920 ("KVM: Drop kvm_{load,put}_guest_fpu()
    exports").

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Paolo Bonzini

    Sebastian Andrzej Siewior
     
  • These structures are going to be used from KVM code so let's make
    their names reflect their Hyper-V origin.

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Roman Kagan
    Acked-by: K. Y. Srinivasan
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • A VMX preemption timer value of '0' is guaranteed to cause a VMExit
    prior to the CPU executing any instructions in the guest. Use the
    preemption timer (if it's supported) to trigger immediate VMExit
    in place of the current method of sending a self-IPI. This ensures
    that pending VMExit injection to L1 occurs prior to executing any
    instructions in the guest (regardless of nesting level).

    When deferring VMExit injection, KVM generates an immediate VMExit
    from the (possibly nested) guest by sending itself an IPI. Because
    hardware interrupts are blocked prior to VMEnter and are unblocked
    (in hardware) after VMEnter, this results in taking a VMExit(INTR)
    before any guest instruction is executed. But, as this approach
    relies on the IPI being received before VMEnter executes, it only
    works as intended when KVM is running as L0. Because there are no
    architectural guarantees regarding when IPIs are delivered, when
    running nested the INTR may "arrive" long after L2 is running e.g.
    L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

    For the most part, this unintended delay is not an issue since the
    events being injected to L1 also do not have architectural guarantees
    regarding their timing. The notable exception is the VMX preemption
    timer[1], which is architecturally guaranteed to cause a VMExit prior
    to executing any instructions in the guest if the timer value is '0'
    at VMEnter. Specifically, the delay in injecting the VMExit causes
    the preemption timer KVM unit test to fail when run in a nested guest.

    Note: this approach is viable even on CPUs with a broken preemption
    timer, as broken in this context only means the timer counts at the
    wrong rate. There are no known errata affecting timer value of '0'.

    [1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.

    Signed-off-by: Sean Christopherson
    [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Provide a singular location where the VMX preemption timer bit is
    set/cleared so that future usages of the preemption timer can ensure
    the VMCS bit is up-to-date without having to modify unrelated code
    paths. For example, the preemption timer can be used to force an
    immediate VMExit. Cache the status of the timer to avoid redundant
    VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
    VMEnters/VMExits.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • A VMX preemption timer value of '0' at the time of VMEnter is
    architecturally guaranteed to cause a VMExit prior to the CPU
    executing any instructions in the guest. This architectural
    definition is in place to ensure that a previously expired timer
    is correctly recognized by the CPU as it is possible for the timer
    to reach zero and not trigger a VMexit due to a higher priority
    VMExit being signalled instead, e.g. a pending #DB that morphs into
    a VMExit.

    Whether by design or coincidence, commit f4124500c2c1 ("KVM: nVMX:
    Fully emulate preemption timer") special cased timer values of '0'
    and '1' to ensure prompt delivery of the VMExit. Unlike '0', a
    timer value of '1' has no architectural guarantees regarding
    when it is delivered.

    Modify the timer emulation to trigger immediate VMExit if and only
    if the timer value is '0', and document precisely why '0' is special.
    Do this even if calibration of the virtual TSC failed, i.e. VMExit
    will occur immediately regardless of the frequency of the timer.
    Making only '0' a special case gives KVM leeway to be more aggressive
    in ensuring the VMExit is injected prior to executing instructions in
    the nested guest, and also eliminates any ambiguity as to why '1' is
    a special case, e.g. why wasn't the threshold for a "short timeout"
    set to 10, 100, 1000, etc...

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Switch to bitmap_zalloc() to show clearly what we are allocating.
    Besides that, it returns a pointer of bitmap type instead of an
    opaque void *.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Paolo Bonzini

    Andy Shevchenko
     
  • kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page().
    This patch fixes the comment accordingly.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Tianyu Lan
     
  • Here is the code path which shows that kvm_mmu_setup() is invoked
    after kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in
    this code path, the root_hpa and prev_roots are guaranteed to be
    invalid, and it is not necessary to reset them again.

    kvm_vm_ioctl_create_vcpu()
    kvm_arch_vcpu_create()
    vmx_create_vcpu()
    kvm_vcpu_init()
    kvm_arch_vcpu_init()
    kvm_mmu_create()
    kvm_arch_vcpu_setup()
    kvm_mmu_setup()
    kvm_init_mmu()

    This patch sets reset_roots to false in kvm_mmu_setup().

    Fixes: 50c28f21d045dde8c52548f8482d456b3f0956f5
    Signed-off-by: Wei Yang
    Reviewed-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Wei Yang
     
  • kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
    CR4.PAE = 1.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini

    Junaid Shahid
     
  • When VMX is used with flexpriority disabled (because of no support
    or if disabled with a module parameter), the MMIO interface to the
    lAPIC is still available in x2APIC mode while it shouldn't be
    (kvm-unit-tests):

    PASS: apic_disable: Local apic enabled in x2APIC mode
    PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
    FAIL: apic_disable: *0xfee00030: 50014

    The issue appears because we do basically nothing while switching to
    x2APIC mode when the APIC access page is not used.
    apic_mmio_{read,write} only check whether the lAPIC is disabled
    before proceeding to the actual access.

    When APIC access is virtualized we correctly manipulate the VMX
    controls in vmx_set_virtual_apic_mode() and we don't get vmexits
    from memory writes in x2APIC mode, so there's no issue.

    Disabling MMIO interface seems to be easy. The question is: what do we
    do with these reads and writes? If we add apic_x2apic_mode() check to
    apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
    go to userspace. When lAPIC is in kernel, Qemu uses this interface to
    inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
    somehow works with disabled lAPIC but when we're in xAPIC mode we will
    get a real injected MSI from every write to lAPIC. Not good.

    The simplest solution seems to be to just ignore writes to the region
    and return ~0 for all reads when we're in x2APIC mode. This is what this
    patch does. However, this approach is inconsistent with what currently
    happens when flexpriority is enabled: we allocate APIC access page and
    create KVM memory region so in x2APIC modes all reads and writes go to
    this pre-allocated page which is, btw, the same for all vCPUs.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov