15 Sep, 2017

1 commit


02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.
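
    For reference, a minimal sketch of the placeholder header described
    above (contents illustrative; the real file carries the usual
    boilerplate):

    /* include/linux/sched/mm.h -- trivial placeholder: for now it simply
     * maps back to <linux/sched.h>, which keeps this step obviously
     * correct and bisectable. */
    #ifndef _LINUX_SCHED_MM_H
    #define _LINUX_SCHED_MM_H

    #include <linux/sched.h>

    #endif /* _LINUX_SCHED_MM_H */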

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
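
    The helper itself is tiny; a minimal sketch (kerneldoc omitted):

    /* Take a reference on a live address space by bumping mm_users,
     * mirroring what callers used to open-code with atomic_inc(). */
    static inline void mmget(struct mm_struct *mm)
    {
            atomic_inc(&mm->mm_users);
    }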

    Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

15 Dec, 2016

1 commit

  • Unexport the low-level __get_user_pages_unlocked() function and replace
    invocations with calls to more appropriate higher-level functions.

    In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
    with get_user_pages_unlocked() since we can now pass gup_flags.

    In async_pf_execute() and process_vm_rw_single_vec() we need to pass
    different tsk, mm arguments so get_user_pages_remote() is the sane
    replacement in these cases (having added manual acquisition and release
    of mmap_sem.)

    Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
    flag. However, this flag was originally silently dropped by commit
    1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
    appears to have been unintentional and reintroducing it is therefore not
    an issue.
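
    A hedged sketch of the tsk/mm replacement pattern described above
    (argument names and the exact get_user_pages_remote() signature of that
    era are assumptions, shown for illustration only):

    /* Illustrative: replace __get_user_pages_unlocked(tsk, mm, ...) with
     * get_user_pages_remote() plus explicit mmap_sem handling. */
    down_read(&mm->mmap_sem);
    ret = get_user_pages_remote(tsk, mm, addr, 1 /* nr_pages */,
                                gup_flags, &page, NULL /* vmas */);
    up_read(&mm->mmap_sem);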

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

20 Nov, 2016

1 commit

  • This was reported by syzkaller:

    [ INFO: possible recursive locking detected ]
    4.9.0-rc4+ #49 Not tainted
    ---------------------------------------------
    kworker/2:1/5658 is trying to acquire lock:
    ([ 1644.769018] (&work->work)
    [< inline >] list_empty include/linux/compiler.h:243
    [] flush_work+0x0/0x660 kernel/workqueue.c:1511

    but task is already holding lock:
    ([ 1644.769018] (&work->work)
    [] process_one_work+0x94b/0x1900 kernel/workqueue.c:2093

    stack backtrace:
    CPU: 2 PID: 5658 Comm: kworker/2:1 Not tainted 4.9.0-rc4+ #49
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: events async_pf_execute
    ffff8800676ff630 ffffffff81c2e46b ffffffff8485b930 ffff88006b1fc480
    0000000000000000 ffffffff8485b930 ffff8800676ff7e0 ffffffff81339b27
    ffff8800676ff7e8 0000000000000046 ffff88006b1fcce8 ffff88006b1fccf0
    Call Trace:
    ...
    [] flush_work+0x93/0x660 kernel/workqueue.c:2846
    [] __cancel_work_timer+0x17a/0x410 kernel/workqueue.c:2916
    [] cancel_work_sync+0x17/0x20 kernel/workqueue.c:2951
    [] kvm_clear_async_pf_completion_queue+0xd7/0x400 virt/kvm/async_pf.c:126
    [< inline >] kvm_free_vcpus arch/x86/kvm/x86.c:7841
    [] kvm_arch_destroy_vm+0x23d/0x620 arch/x86/kvm/x86.c:7946
    [< inline >] kvm_destroy_vm virt/kvm/kvm_main.c:731
    [] kvm_put_kvm+0x40e/0x790 virt/kvm/kvm_main.c:752
    [] async_pf_execute+0x23d/0x4f0 virt/kvm/async_pf.c:111
    [] process_one_work+0x9fc/0x1900 kernel/workqueue.c:2096
    [] worker_thread+0xef/0x1480 kernel/workqueue.c:2230
    [] kthread+0x244/0x2d0 kernel/kthread.c:209
    [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

    The reason is that kvm_put_kvm is causing the destruction of the VM, but
    the page fault is still on the ->queue list. The ->queue list is owned
    by the VCPU, not by the work items, so we cannot just add list_del to
    the work item.

    Instead, use work->vcpu to note async page faults that have been resolved
    and will be processed through the done list. There is no need to flush
    those.
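
    A hedged sketch of the cleanup this implies (assuming work->vcpu is
    cleared once a fault has been resolved and queued on the done list; the
    exact upstream change may differ):

    /* Illustrative: when tearing down ->queue, skip items that already
     * completed -- they will be freed via ->done, so flushing them here
     * would deadlock on the work item we are running from. */
    while (!list_empty(&vcpu->async_pf.queue)) {
            struct kvm_async_pf *work =
                    list_first_entry(&vcpu->async_pf.queue, typeof(*work), queue);

            list_del(&work->queue);
            if (!work->vcpu)        /* already resolved, handled via ->done */
                    continue;
            flush_work(&work->work);
    }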

    Cc: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     

19 Oct, 2016

1 commit

  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.
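
    A minimal sketch of the caller-side effect (the parameter order of the
    updated __get_user_pages_unlocked() is an assumption, for illustration
    only):

    /* The intent is now spelled out at the call site instead of being
     * implied by separate write/force ints. */
    unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;

    npages = __get_user_pages_unlocked(tsk, mm, addr, 1, &page, gup_flags);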

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
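
    A rough user-space illustration of the PROT_EXEC-only case quoted above
    (purely illustrative, error handling omitted):

    #include <sys/mman.h>

    int main(void)
    {
            /* Executable-only mapping: on pkeys-capable CPUs the kernel
             * assigns an execute-only protection key, so data reads and
             * writes to this range fault while instruction fetches work. */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            return p == MAP_FAILED;
    }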

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "One of the largest releases for KVM... Hardly any generic
    changes, but lots of architecture-specific updates.

    ARM:
    - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
    - PMU support for guests
    - 32bit world switch rewritten in C
    - various optimizations to the vgic save/restore code.

    PPC:
    - enabled KVM-VFIO integration ("VFIO device")
    - optimizations to speed up IPIs between vcpus
    - in-kernel handling of IOMMU hypercalls
    - support for dynamic DMA windows (DDW).

    s390:
    - provide the floating point registers via sync regs;
    - separated instruction vs. data accesses
    - dirty log improvements for huge guests
    - bugfixes and documentation improvements.

    x86:
    - Hyper-V VMBus hypercall userspace exit
    - alternative implementation of lowest-priority interrupts using
    vector hashing (for better VT-d posted interrupt support)
    - fixed guest debugging with nested virtualizations
    - improved interrupt tracking in the in-kernel IOAPIC
    - generic infrastructure for tracking writes to guest
    memory - currently its only use is to speedup the legacy shadow
    paging (pre-EPT) case, but in the future it will be used for
    virtual GPUs as well
    - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
    KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
    KVM: x86: disable MPX if host did not enable MPX XSAVE features
    arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
    arm64: KVM: vgic-v3: Reset LRs at boot time
    arm64: KVM: vgic-v3: Do not save an LR known to be empty
    arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
    arm64: KVM: vgic-v3: Avoid accessing ICH registers
    KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
    KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
    KVM: arm/arm64: vgic-v2: Reset LRs at boot time
    KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
    KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
    KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
    KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
    KVM: s390: allocate only one DMA page per VM
    KVM: s390: enable STFLE interpretation only if enabled for the guest
    KVM: s390: wake up when the VCPU cpu timer expires
    KVM: s390: step the VCPU timer while in enabled wait
    KVM: s390: protect VCPU cpu timer with a seqcount
    KVM: s390: step VCPU cpu timer during kvm_run ioctl
    ...

    Linus Torvalds
     

09 Mar, 2016

1 commit


29 Feb, 2016

1 commit


25 Feb, 2016

1 commit

  • The problem:

    On -rt, an emulated LAPIC timer instance has the following path:

    1) hard interrupt
    2) ksoftirqd is scheduled
    3) ksoftirqd wakes up vcpu thread
    4) vcpu thread is scheduled

    This extra context switch introduces unnecessary latency in the
    LAPIC path for a KVM guest.

    The solution:

    Allow waking up vcpu thread from hardirq context,
    thus avoiding the need for ksoftirqd to be scheduled.

    Normal waitqueues make use of spinlocks, which on -RT
    are sleepable locks. Therefore, waking up a waitqueue
    waiter involves locking a sleeping lock, which
    is not allowed from hard interrupt context.
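
    A minimal sketch of the simple-waitqueue wake-up this enables (API names
    as they existed at the time; swake_up() was later renamed swake_up_one()):

    #include <linux/swait.h>

    static DECLARE_SWAIT_QUEUE_HEAD(wq);

    /* Safe from hard interrupt context even on -rt: the swait queue is
     * protected by a raw spinlock rather than a sleepable lock. */
    static void wake_from_hardirq(void)
    {
            if (swait_active(&wq))
                    swake_up(&wq);
    }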

    cyclictest command line:

    This patch reduces the average latency in my tests from 14us to 11us.

    Daniel writes:
    Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
    benchmark on mainline. The test was run 1000 times on
    tip/sched/core 4.4.0-rc8-01134-g0905f04:

    ./x86-run x86/tscdeadline_latency.flat -cpu host

    with idle=poll.

    The test does not seem to deliver really stable numbers, though most of
    them are smaller. Paolo writes:

    "Anything above ~10000 cycles means that the host went to C1 or
    lower---the number means more or less nothing in that case.

    The mean shows an improvement indeed."

    Before:

    min max mean std
    count 1000.000000 1000.000000 1000.000000 1000.000000
    mean 5162.596000 2019270.084000 5824.491541 20681.645558
    std 75.431231 622607.723969 89.575700 6492.272062
    min 4466.000000 23928.000000 5537.926500 585.864966
    25% 5163.000000 1613252.750000 5790.132275 16683.745433
    50% 5175.000000 2281919.000000 5834.654000 23151.990026
    75% 5190.000000 2382865.750000 5861.412950 24148.206168
    max 5228.000000 4175158.000000 6254.827300 46481.048691

    After:
    min max mean std
    count 1000.000000 1000.00000 1000.000000 1000.000000
    mean 5143.511000 2076886.10300 5813.312474 21207.357565
    std 77.668322 610413.09583 86.541500 6331.915127
    min 4427.000000 25103.00000 5529.756600 559.187707
    25% 5148.000000 1691272.75000 5784.889825 17473.518244
    50% 5160.000000 2308328.50000 5832.025000 23464.837068
    75% 5172.000000 2393037.75000 5853.177675 24223.969976
    max 5222.000000 3922458.00000 6186.720500 42520.379830

    [Patch was originally based on the swait implementation found in the -rt
    tree. Daniel ported it to mainline's version and gathered the
    benchmark numbers for the tscdeadline_latency test.]

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Marcelo Tosatti
     

24 Feb, 2016

1 commit

  • In async_pf we try to allocate with NOWAIT to get an element quickly
    or fail. This code also handles failures gracefully. Let's silence
    the potential page allocation failure warnings under load.

    qemu-system-s39: page allocation failure: order:0,mode:0x2200000
    [...]
    Call Trace:
    ([] show_trace+0xf8/0x148)
    [] show_stack+0x62/0xe8
    [] dump_stack+0x70/0x98
    [] warn_alloc_failed+0xd2/0x148
    [] __alloc_pages_nodemask+0x94e/0xb38
    [] new_slab+0x382/0x400
    [] ___slab_alloc.constprop.30+0x2dc/0x378
    [] kmem_cache_alloc+0x160/0x1d0
    [] kvm_setup_async_pf+0x6c/0x198
    [] kvm_arch_vcpu_ioctl_run+0xd48/0xd58
    [] kvm_vcpu_ioctl+0x372/0x690
    [] do_vfs_ioctl+0x3be/0x510
    [] SyS_ioctl+0xa4/0xb8
    [] system_call+0xd6/0x264
    [] 0x3ffa24fa06a
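
    A minimal sketch of the kind of change this implies (assuming the
    element is allocated with kmem_cache_zalloc() under GFP_NOWAIT):

    /* Keep the opportunistic NOWAIT allocation, but add __GFP_NOWARN so an
     * expected failure under load no longer triggers the warning above. */
    work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT | __GFP_NOWARN);
    if (!work)
            return 0;       /* caller falls back to the synchronous path */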

    Cc: stable@vger.kernel.org
    Signed-off-by: Christian Borntraeger
    Reviewed-by: Dominik Dingel
    Signed-off-by: Paolo Bonzini

    Christian Borntraeger
     

23 Feb, 2016

1 commit


16 Feb, 2016

1 commit

  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.
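
    For reference, a sketch of the new variant's prototype as introduced
    here (the exact signature is an assumption; the write/force ints were
    converted to explicit gup flags by the later commits listed above):

    long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long start, unsigned long nr_pages,
                               int write, int force, struct page **pages,
                               struct vm_area_struct **vmas);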

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

26 Nov, 2015

1 commit


14 Oct, 2015

1 commit

  • async_pf_execute() seems to be missing a memory barrier which might
    cause the waker to not notice the waiter and miss sending a wake_up as
    in the following figure.

    async_pf_execute                           kvm_vcpu_block
    ------------------------------------------------------------------------
    spin_lock(&vcpu->async_pf.lock);
    if (waitqueue_active(&vcpu->wq))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                               prepare_to_wait(&vcpu->wq, &wait,
                                                 TASK_INTERRUPTIBLE);
                                               /*if (kvm_vcpu_check_block(vcpu) < 0) */
                                               /*if (kvm_arch_vcpu_runnable(vcpu)) { */
                                               ...
                                               return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                                                   !vcpu->arch.apf.halted)
                                                 || !list_empty_careful(&vcpu->async_pf.done)
                                               ...
                                               return 0;
    list_add_tail(&apf->link,
      &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);
                                               waited = true;
                                               schedule();
    ------------------------------------------------------------------------

    The attached patch adds the missing memory barrier.
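
    A hedged sketch of where the barrier goes on the waker side (assuming the
    wake-up path looks roughly like this):

    list_add_tail(&apf->link, &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);

    smp_mb();       /* order the list update before the lockless check below,
                       pairing with the barrier on the waiter side */
    if (waitqueue_active(&vcpu->wq))
            wake_up_interruptible(&vcpu->wq);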

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Paolo Bonzini

    Kosuke Tatsukawa
     

12 Feb, 2015

1 commit

  • Use the more generic get_user_pages_unlocked which has the additional
    benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
    (which allows the first page fault in an unmapped area to be always able
    to block indefinitely by being allowed to release the mmap_sem).
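
    A hedged sketch of the simplification (the get_user_pages_unlocked()
    signature of the time is an assumption, shown for illustration only):

    /* The unlocked variant takes and releases mmap_sem itself and passes
     * FAULT_FLAG_ALLOW_RETRY on the first attempt, so the very first fault
     * in an unmapped area may block with mmap_sem dropped. */
    npages = get_user_pages_unlocked(tsk, mm, addr, 1, write, 0 /* force */,
                                     &page);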

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Kirill A. Shutemov
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

24 Sep, 2014

1 commit

  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

04 Jun, 2014

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "At over 200 commits, covering almost all supported architectures, this
    was a pretty active cycle for KVM. Changes include:

    - a lot of s390 changes: optimizations, support for migration, GDB
    support and more

    - ARM changes are pretty small: support for the PSCI 0.2 hypercall
    interface on both the guest and the host (the latter acked by
    Catalin)

    - initial POWER8 and little-endian host support

    - support for running u-boot on embedded POWER targets

    - pretty large changes to MIPS too, completing the userspace
    interface and improving the handling of virtualized timer hardware

    - for x86, a larger set of changes is scheduled for 3.17. Still, we
    have a few emulator bugfixes and support for running nested
    fully-virtualized Xen guests (para-virtualized Xen guests have
    always worked). And some optimizations too.

    The only missing architecture here is ia64. It's not a coincidence
    that support for KVM on ia64 is scheduled for removal in 3.17"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
    KVM: add missing cleanup_srcu_struct
    KVM: PPC: Book3S PR: Rework SLB switching code
    KVM: PPC: Book3S PR: Use SLB entry 0
    KVM: PPC: Book3S HV: Fix machine check delivery to guest
    KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
    KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
    KVM: PPC: Book3S HV: Fix dirty map for hugepages
    KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
    KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
    KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
    KVM: PPC: Book3S: Add ONE_REG register names that were missed
    KVM: PPC: Add CAP to indicate hcall fixes
    KVM: PPC: MPIC: Reset IRQ source private members
    KVM: PPC: Graciously fail broken LE hypercalls
    PPC: ePAPR: Fix hypercall on LE guest
    KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
    KVM: PPC: BOOK3S: Always use the saved DAR value
    PPC: KVM: Make NX bit available with magic page
    KVM: PPC: Disable NX for old magic page using guests
    KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
    ...

    Linus Torvalds
     

28 Apr, 2014

3 commits

  • async_pf_execute() passes tsk == current to gup(); this doesn't
    hurt but is unnecessary and misleading. "tsk" is only used to account
    the number of faults, and current is the random workqueue thread.

    Signed-off-by: Oleg Nesterov
    Suggested-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • async_pf_execute() has no reason to adopt apf->mm; gup(current, mm)
    should work just fine even if current has another or NULL ->mm.

    Recently kvm_async_page_present_sync() was added inside the "use_mm"
    section, but it seems that it doesn't need current->mm either.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc
    was already called (or is in progress), mm->mm_count can only pin mm->pgd
    and mm_struct itself.

    Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users.

    kvm_create_vm/kvm_destroy_vm play with ->mm_count too but this case looks
    fine at first glance, it seems that this ->mm is only used to verify that
    current->mm == kvm->mm.
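
    A minimal sketch of the reference-counting change described above
    (illustrative; placement follows the description, not the exact diff):

    /* kvm_setup_async_pf(): pin the whole address space, not just the
     * mm_struct, so get_user_pages() stays valid while the work runs. */
    work->mm = current->mm;
    atomic_inc(&work->mm->mm_users);

    /* async_pf_execute() and the cancel path: drop that reference. */
    mmput(work->mm);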

    Signed-off-by: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     

04 Feb, 2014

1 commit


30 Jan, 2014

2 commits

  • On s390 we are not able to cancel work. Instead we will flush the work and wait for
    completion.

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     
  • By setting a Kconfig option, the architecture can control when
    guest notifications will be presented by the apf backend.
    There is the default batch mechanism, working as before, where the vcpu
    thread should pull in this information.
    Opposed to this, there is now a direct mechanism that will push the
    information to the guest.
    This way s390 can use an already existing architecture interface.

    Still, the vcpu thread should call check_completion to clean up leftovers.
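
    A hedged sketch of the two mechanisms (assuming the option is named
    CONFIG_KVM_ASYNC_PF_SYNC; the exact hook placement is illustrative):

    #ifdef CONFIG_KVM_ASYNC_PF_SYNC
            /* direct: the worker pushes the notification to the guest itself */
            kvm_arch_async_page_present(vcpu, apf);
    #endif
    /* Without the option, the default batch mechanism applies: the vcpu
     * thread later pulls completed items off the done list, as before. */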

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     

15 Oct, 2013

1 commit

  • Page pinning is not mandatory in kvm async page fault processing, since
    after the async page fault event is delivered to a guest it accesses the
    page once again and does its own GUP. Drop the FOLL_GET flag in GUP in
    async_pf code, and simplify the check/clear processing a bit.
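
    A minimal sketch of the GUP call without pinning (the argument order of
    the get_user_pages() of that era is an assumption):

    /* Passing NULL for the pages array means no reference is taken
     * (FOLL_GET is not used); the fault is simply satisfied, and the guest
     * does its own GUP after the async page fault event is delivered. */
    get_user_pages(current, mm, addr, 1, 1 /* write */, 0 /* force */,
                   NULL /* pages: no pin */, NULL /* vmas */);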

    Suggested-by: Gleb Natapov
    Signed-off-by: Gu zheng
    Signed-off-by: chai wen
    Signed-off-by: Gleb Natapov

    chai wen
     

25 Sep, 2013

1 commit

  • '.done' is used to mark the completion of 'async_pf_execute()', but
    'cancel_work_sync()' returns true when the work was canceled, so we
    use it instead.
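
    A minimal sketch of the check (illustrative):

    /* cancel_work_sync() returns true only if the work item was still
     * pending, i.e. async_pf_execute() never ran and its resources must be
     * released here. */
    if (cancel_work_sync(&work->work))
            kmem_cache_free(async_pf_cache, work);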

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     

17 Sep, 2013

1 commit

  • When we cancel 'async_pf_execute()', we should behave as if the work was
    never scheduled in 'kvm_setup_async_pf()'.
    Fixes a bug where we can't unload the module because the vm wasn't destroyed.

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     

06 Aug, 2012

2 commits


26 Jul, 2012

2 commits

  • Currently, kvm allocates some pages and uses them as error indicators,
    which wastes memory and is not good for scalability.

    Based on Avi's suggestion, we use error codes instead of these pages
    to indicate the error conditions.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • kvm_async_pf_wakeup_all uses bad_page to generate a broadcast wakeup,
    and uses put_page to release bad_page; this works only because
    bad_page is a normal page. But we will use an error code instead of
    bad_page, so use kvm_release_page_clean to release the page, which will
    handle the error code properly.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     

12 Jan, 2011

6 commits

  • In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
    without holding vcpu->async_pf.lock, which will break if we are handling
    apfs at the same time.

    Also use 'list_empty_careful()' instead of 'list_empty()'
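
    A minimal sketch of the locking this asks for (illustrative):

    /* Queue the dummy wakeup entry under async_pf.lock so concurrent apf
     * handling never sees a half-updated done list. */
    spin_lock(&vcpu->async_pf.lock);
    list_add_tail(&work->link, &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);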

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • If there is no need to inject an async #PF into the PV guest, we can
    handle more completed apfs at one time, so the guest #PF can be retried
    as early as possible.

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Send async page fault to a PV guest if it accesses swapped out memory.
    Guest will choose another task to run upon receiving the fault.

    Allow async page fault injection only when guest is in user mode since
    otherwise guest may be in non-sleepable context and will not be able
    to reschedule.

    Vcpu will be halted if guest will fault on the same page again or if
    vcpu executes kernel code.

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • Guest enables async PF vcpu functionality using this MSR.

    Reviewed-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • When a page is swapped in, it is mapped into guest memory only after the
    guest tries to access it again and generates another fault. To save this fault
    we can map it immediately since we know that guest is going to access
    the page. Do it only when tdp is enabled for now. Shadow paging case is
    more complicated. CR[034] and EFER registers should be switched before
    doing mapping and then switched back.

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • If a guest accesses swapped out memory, do not swap it in from the vcpu
    thread context. Schedule work to do the swapping and put the vcpu into a
    halted state instead.

    Interrupts will still be delivered to the guest, and if an interrupt
    causes a reschedule the guest will continue to run another task.

    [avi: remove call to get_user_pages_noio(), nacked by Linus; this
    makes everything synchronous again]

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov