15 Sep, 2017

1 commit


02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.
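
    For reference, a minimal sketch of the placeholder header described
    above (contents illustrative; the real file carries the usual
    boilerplate):

    /* include/linux/sched/mm.h -- trivial placeholder: for now it simply
     * maps back to <linux/sched.h>, which keeps this step obviously
     * correct and bisectable. */
    #ifndef _LINUX_SCHED_MM_H
    #define _LINUX_SCHED_MM_H

    #include <linux/sched.h>

    #endif /* _LINUX_SCHED_MM_H */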

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_users);/mmget\(\1\);/'
    git grep -l 'atomic_inc.*mm_users' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_users);/mmget\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
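
    The helper itself is tiny; a minimal sketch (kerneldoc omitted):

    /* Take a reference on a live address space by bumping mm_users,
     * mirroring what callers used to open-code with atomic_inc(). */
    static inline void mmget(struct mm_struct *mm)
    {
            atomic_inc(&mm->mm_users);
    }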

    Link: http://lkml.kernel.org/r/20161218123229.22952-2-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

15 Dec, 2016

1 commit

  • Unexport the low-level __get_user_pages_unlocked() function and replace
    invocations with calls to more appropriate higher-level functions.

    In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
    with get_user_pages_unlocked() since we can now pass gup_flags.

    In async_pf_execute() and process_vm_rw_single_vec() we need to pass
    different tsk, mm arguments so get_user_pages_remote() is the sane
    replacement in these cases (having added manual acquisition and release
    of mmap_sem.)

    Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
    flag. However, this flag was originally silently dropped by commit
    1e9877902dc7 ("mm/gup: Introduce get_user_pages_remote()"), so this
    appears to have been unintentional and reintroducing it is therefore not
    an issue.
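
    A hedged sketch of the tsk/mm replacement pattern described above
    (argument names and the exact get_user_pages_remote() signature of that
    era are assumptions, shown for illustration only):

    /* Illustrative: replace __get_user_pages_unlocked(tsk, mm, ...) with
     * get_user_pages_remote() plus explicit mmap_sem handling. */
    down_read(&mm->mmap_sem);
    ret = get_user_pages_remote(tsk, mm, addr, 1 /* nr_pages */,
                                gup_flags, &page, NULL /* vmas */);
    up_read(&mm->mmap_sem);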

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

20 Nov, 2016

1 commit

  • This was reported by syzkaller:

    [ INFO: possible recursive locking detected ]
    4.9.0-rc4+ #49 Not tainted
    ---------------------------------------------
    kworker/2:1/5658 is trying to acquire lock:
    ([ 1644.769018] (&work->work)
    [< inline >] list_empty include/linux/compiler.h:243
    [] flush_work+0x0/0x660 kernel/workqueue.c:1511

    but task is already holding lock:
    ([ 1644.769018] (&work->work)
    [] process_one_work+0x94b/0x1900 kernel/workqueue.c:2093

    stack backtrace:
    CPU: 2 PID: 5658 Comm: kworker/2:1 Not tainted 4.9.0-rc4+ #49
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: events async_pf_execute
    ffff8800676ff630 ffffffff81c2e46b ffffffff8485b930 ffff88006b1fc480
    0000000000000000 ffffffff8485b930 ffff8800676ff7e0 ffffffff81339b27
    ffff8800676ff7e8 0000000000000046 ffff88006b1fcce8 ffff88006b1fccf0
    Call Trace:
    ...
    [] flush_work+0x93/0x660 kernel/workqueue.c:2846
    [] __cancel_work_timer+0x17a/0x410 kernel/workqueue.c:2916
    [] cancel_work_sync+0x17/0x20 kernel/workqueue.c:2951
    [] kvm_clear_async_pf_completion_queue+0xd7/0x400 virt/kvm/async_pf.c:126
    [< inline >] kvm_free_vcpus arch/x86/kvm/x86.c:7841
    [] kvm_arch_destroy_vm+0x23d/0x620 arch/x86/kvm/x86.c:7946
    [< inline >] kvm_destroy_vm virt/kvm/kvm_main.c:731
    [] kvm_put_kvm+0x40e/0x790 virt/kvm/kvm_main.c:752
    [] async_pf_execute+0x23d/0x4f0 virt/kvm/async_pf.c:111
    [] process_one_work+0x9fc/0x1900 kernel/workqueue.c:2096
    [] worker_thread+0xef/0x1480 kernel/workqueue.c:2230
    [] kthread+0x244/0x2d0 kernel/kthread.c:209
    [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

    The reason is that kvm_put_kvm is causing the destruction of the VM, but
    the page fault is still on the ->queue list. The ->queue list is owned
    by the VCPU, not by the work items, so we cannot just add list_del to
    the work item.

    Instead, use work->vcpu to note async page faults that have been resolved
    and will be processed through the done list. There is no need to flush
    those.
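
    A hedged sketch of the cleanup this implies (assuming work->vcpu is
    cleared once a fault has been resolved and queued on the done list; the
    exact upstream change may differ):

    /* Illustrative: when tearing down ->queue, skip items that already
     * completed -- they will be freed via ->done, so flushing them here
     * would deadlock on the work item we are running from. */
    while (!list_empty(&vcpu->async_pf.queue)) {
            struct kvm_async_pf *work =
                    list_first_entry(&vcpu->async_pf.queue, typeof(*work), queue);

            list_del(&work->queue);
            if (!work->vcpu)        /* already resolved, handled via ->done */
                    continue;
            flush_work(&work->work);
    }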

    Cc: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     

19 Oct, 2016

1 commit

  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.
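
    A minimal sketch of the caller-side effect (the parameter order of the
    updated __get_user_pages_unlocked() is an assumption, for illustration
    only):

    /* The intent is now spelled out at the call site instead of being
     * implied by separate write/force ints. */
    unsigned int gup_flags = FOLL_WRITE | FOLL_FORCE;

    npages = __get_user_pages_unlocked(tsk, mm, addr, 1, &page, gup_flags);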

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
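
    A rough user-space illustration of the PROT_EXEC-only case quoted above
    (purely illustrative, error handling omitted):

    #include <sys/mman.h>

    int main(void)
    {
            /* Executable-only mapping: on pkeys-capable CPUs the kernel
             * assigns an execute-only protection key, so data reads and
             * writes to this range fault while instruction fetches work. */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            return p == MAP_FAILED;
    }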

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "One of the largest releases for KVM... Hardly any generic
    changes, but lots of architecture-specific updates.

    ARM:
    - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
    - PMU support for guests
    - 32bit world switch rewritten in C
    - various optimizations to the vgic save/restore code.

    PPC:
    - enabled KVM-VFIO integration ("VFIO device")
    - optimizations to speed up IPIs between vcpus
    - in-kernel handling of IOMMU hypercalls
    - support for dynamic DMA windows (DDW).

    s390:
    - provide the floating point registers via sync regs;
    - separated instruction vs. data accesses
    - dirty log improvements for huge guests
    - bugfixes and documentation improvements.

    x86:
    - Hyper-V VMBus hypercall userspace exit
    - alternative implementation of lowest-priority interrupts using
    vector hashing (for better VT-d posted interrupt support)
    - fixed guest debugging with nested virtualizations
    - improved interrupt tracking in the in-kernel IOAPIC
    - generic infrastructure for tracking writes to guest
    memory - currently its only use is to speedup the legacy shadow
    paging (pre-EPT) case, but in the future it will be used for
    virtual GPUs as well
    - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
    KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
    KVM: x86: disable MPX if host did not enable MPX XSAVE features
    arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
    arm64: KVM: vgic-v3: Reset LRs at boot time
    arm64: KVM: vgic-v3: Do not save an LR known to be empty
    arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
    arm64: KVM: vgic-v3: Avoid accessing ICH registers
    KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
    KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
    KVM: arm/arm64: vgic-v2: Reset LRs at boot time
    KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
    KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
    KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
    KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
    KVM: s390: allocate only one DMA page per VM
    KVM: s390: enable STFLE interpretation only if enabled for the guest
    KVM: s390: wake up when the VCPU cpu timer expires
    KVM: s390: step the VCPU timer while in enabled wait
    KVM: s390: protect VCPU cpu timer with a seqcount
    KVM: s390: step VCPU cpu timer during kvm_run ioctl
    ...

    Linus Torvalds
     

09 Mar, 2016

1 commit


29 Feb, 2016

1 commit


25 Feb, 2016

1 commit

  • The problem:

    On -rt, an emulated LAPIC timer instance has the following path:

    1) hard interrupt
    2) ksoftirqd is scheduled
    3) ksoftirqd wakes up vcpu thread
    4) vcpu thread is scheduled

    This extra context switch introduces unnecessary latency in the
    LAPIC path for a KVM guest.

    The solution:

    Allow waking up vcpu thread from hardirq context,
    thus avoiding the need for ksoftirqd to be scheduled.

    Normal waitqueues make use of spinlocks, which on -RT
    are sleepable locks. Therefore, waking up a waitqueue
    waiter involves locking a sleeping lock, which
    is not allowed from hard interrupt context.
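
    A minimal sketch of the simple-waitqueue wake-up this enables (API names
    as they existed at the time; swake_up() was later renamed swake_up_one()):

    #include <linux/swait.h>

    static DECLARE_SWAIT_QUEUE_HEAD(wq);

    /* Safe from hard interrupt context even on -rt: the swait queue is
     * protected by a raw spinlock rather than a sleepable lock. */
    static void wake_from_hardirq(void)
    {
            if (swait_active(&wq))
                    swake_up(&wq);
    }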

    cyclictest command line:

    This patch reduces the average latency in my tests from 14us to 11us.

    Daniel writes:
    Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
    benchmark on mainline. The test was run 1000 times on
    tip/sched/core 4.4.0-rc8-01134-g0905f04:

    ./x86-run x86/tscdeadline_latency.flat -cpu host

    with idle=poll.

    The test does not seem to deliver really stable numbers, though most of
    them are smaller. Paolo writes:

    "Anything above ~10000 cycles means that the host went to C1 or
    lower---the number means more or less nothing in that case.

    The mean shows an improvement indeed."

    Before:

    min max mean std
    count 1000.000000 1000.000000 1000.000000 1000.000000
    mean 5162.596000 2019270.084000 5824.491541 20681.645558
    std 75.431231 622607.723969 89.575700 6492.272062
    min 4466.000000 23928.000000 5537.926500 585.864966
    25% 5163.000000 1613252.750000 5790.132275 16683.745433
    50% 5175.000000 2281919.000000 5834.654000 23151.990026
    75% 5190.000000 2382865.750000 5861.412950 24148.206168
    max 5228.000000 4175158.000000 6254.827300 46481.048691

    After:
    min max mean std
    count 1000.000000 1000.00000 1000.000000 1000.000000
    mean 5143.511000 2076886.10300 5813.312474 21207.357565
    std 77.668322 610413.09583 86.541500 6331.915127
    min 4427.000000 25103.00000 5529.756600 559.187707
    25% 5148.000000 1691272.75000 5784.889825 17473.518244
    50% 5160.000000 2308328.50000 5832.025000 23464.837068
    75% 5172.000000 2393037.75000 5853.177675 24223.969976
    max 5222.000000 3922458.00000 6186.720500 42520.379830

    [Patch was originally based on the swait implementation found in the -rt
    tree. Daniel ported it to mainline's version and gathered the
    benchmark numbers for the tscdeadline_latency test.]

    Signed-off-by: Daniel Wagner
    Acked-by: Peter Zijlstra (Intel)
    Cc: linux-rt-users@vger.kernel.org
    Cc: Boqun Feng
    Cc: Marcelo Tosatti
    Cc: Steven Rostedt
    Cc: Paul Gortmaker
    Cc: Paolo Bonzini
    Cc: "Paul E. McKenney"
    Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
    Signed-off-by: Thomas Gleixner

    Marcelo Tosatti
     

24 Feb, 2016

1 commit

  • In async_pf we try to allocate with NOWAIT to get an element quickly
    or fail. This code also handles failures gracefully. Let's silence
    the potential page allocation failure warnings under load.

    qemu-system-s39: page allocation failure: order:0,mode:0x2200000
    [...]
    Call Trace:
    ([] show_trace+0xf8/0x148)
    [] show_stack+0x62/0xe8
    [] dump_stack+0x70/0x98
    [] warn_alloc_failed+0xd2/0x148
    [] __alloc_pages_nodemask+0x94e/0xb38
    [] new_slab+0x382/0x400
    [] ___slab_alloc.constprop.30+0x2dc/0x378
    [] kmem_cache_alloc+0x160/0x1d0
    [] kvm_setup_async_pf+0x6c/0x198
    [] kvm_arch_vcpu_ioctl_run+0xd48/0xd58
    [] kvm_vcpu_ioctl+0x372/0x690
    [] do_vfs_ioctl+0x3be/0x510
    [] SyS_ioctl+0xa4/0xb8
    [] system_call+0xd6/0x264
    [] 0x3ffa24fa06a
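
    A minimal sketch of the kind of change this implies (assuming the
    element is allocated with kmem_cache_zalloc() under GFP_NOWAIT):

    /* Keep the opportunistic NOWAIT allocation, but add __GFP_NOWARN so an
     * expected failure under load no longer triggers the warning above. */
    work = kmem_cache_zalloc(async_pf_cache, GFP_NOWAIT | __GFP_NOWARN);
    if (!work)
            return 0;       /* caller falls back to the synchronous path */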

    Cc: stable@vger.kernel.org
    Signed-off-by: Christian Borntraeger
    Reviewed-by: Dominik Dingel
    Signed-off-by: Paolo Bonzini

    Christian Borntraeger
     

23 Feb, 2016

1 commit


16 Feb, 2016

1 commit

  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.
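
    For reference, a sketch of the new variant's prototype as introduced
    here (the exact signature is an assumption; the write/force ints were
    converted to explicit gup flags by the later commits listed above):

    long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long start, unsigned long nr_pages,
                               int write, int force, struct page **pages,
                               struct vm_area_struct **vmas);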

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

26 Nov, 2015

1 commit


14 Oct, 2015

1 commit

  • async_pf_execute() seems to be missing a memory barrier which might
    cause the waker to not notice the waiter and miss sending a wake_up as
    in the following figure.

    async_pf_execute                           kvm_vcpu_block
    ------------------------------------------------------------------------
    spin_lock(&vcpu->async_pf.lock);
    if (waitqueue_active(&vcpu->wq))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                               prepare_to_wait(&vcpu->wq, &wait,
                                                 TASK_INTERRUPTIBLE);
                                               /*if (kvm_vcpu_check_block(vcpu) < 0) */
                                               /*if (kvm_arch_vcpu_runnable(vcpu)) { */
                                               ...
                                               return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                                                   !vcpu->arch.apf.halted)
                                                 || !list_empty_careful(&vcpu->async_pf.done)
                                               ...
                                               return 0;
    list_add_tail(&apf->link,
      &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);
                                               waited = true;
                                               schedule();
    ------------------------------------------------------------------------

    The attached patch adds the missing memory barrier.
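
    A hedged sketch of where the barrier goes on the waker side (assuming the
    wake-up path looks roughly like this):

    list_add_tail(&apf->link, &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);

    smp_mb();       /* order the list update before the lockless check below,
                       pairing with the barrier on the waiter side */
    if (waitqueue_active(&vcpu->wq))
            wake_up_interruptible(&vcpu->wq);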

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Paolo Bonzini

    Kosuke Tatsukawa
     

12 Feb, 2015

1 commit

  • Use the more generic get_user_pages_unlocked which has the additional
    benefit of passing FAULT_FLAG_ALLOW_RETRY at the very first page fault
    (which allows the first page fault in an unmapped area to be always able
    to block indefinitely by being allowed to release the mmap_sem).
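
    A hedged sketch of the simplification (the get_user_pages_unlocked()
    signature of the time is an assumption, shown for illustration only):

    /* The unlocked variant takes and releases mmap_sem itself and passes
     * FAULT_FLAG_ALLOW_RETRY on the first attempt, so the very first fault
     * in an unmapped area may block with mmap_sem dropped. */
    npages = get_user_pages_unlocked(tsk, mm, addr, 1, write, 0 /* force */,
                                     &page);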

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Kirill A. Shutemov
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

24 Sep, 2014

1 commit

  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

04 Jun, 2014

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "At over 200 commits, covering almost all supported architectures, this
    was a pretty active cycle for KVM. Changes include:

    - a lot of s390 changes: optimizations, support for migration, GDB
    support and more

    - ARM changes are pretty small: support for the PSCI 0.2 hypercall
    interface on both the guest and the host (the latter acked by
    Catalin)

    - initial POWER8 and little-endian host support

    - support for running u-boot on embedded POWER targets

    - pretty large changes to MIPS too, completing the userspace
    interface and improving the handling of virtualized timer hardware

    - for x86, a larger set of changes is scheduled for 3.17. Still, we
    have a few emulator bugfixes and support for running nested
    fully-virtualized Xen guests (para-virtualized Xen guests have
    always worked). And some optimizations too.

    The only missing architecture here is ia64. It's not a coincidence
    that support for KVM on ia64 is scheduled for removal in 3.17"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
    KVM: add missing cleanup_srcu_struct
    KVM: PPC: Book3S PR: Rework SLB switching code
    KVM: PPC: Book3S PR: Use SLB entry 0
    KVM: PPC: Book3S HV: Fix machine check delivery to guest
    KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
    KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
    KVM: PPC: Book3S HV: Fix dirty map for hugepages
    KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
    KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
    KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
    KVM: PPC: Book3S: Add ONE_REG register names that were missed
    KVM: PPC: Add CAP to indicate hcall fixes
    KVM: PPC: MPIC: Reset IRQ source private members
    KVM: PPC: Graciously fail broken LE hypercalls
    PPC: ePAPR: Fix hypercall on LE guest
    KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
    KVM: PPC: BOOK3S: Always use the saved DAR value
    PPC: KVM: Make NX bit available with magic page
    KVM: PPC: Disable NX for old magic page using guests
    KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
    ...

    Linus Torvalds
     

28 Apr, 2014

3 commits

  • async_pf_execute() passes tsk == current to gup(); this doesn't
    hurt but is unnecessary and misleading. "tsk" is only used to account
    the number of faults, and current is the random workqueue thread.

    Signed-off-by: Oleg Nesterov
    Suggested-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • async_pf_execute() has no reason to adopt apf->mm; gup(current, mm)
    should work just fine even if current has another or NULL ->mm.

    Recently kvm_async_page_present_sync() was added inside the "use_mm"
    section, but it seems that it doesn't need current->mm either.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     
  • get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc
    was already called (or is in progress), mm->mm_count can only pin mm->pgd
    and mm_struct itself.

    Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users.

    kvm_create_vm/kvm_destroy_vm play with ->mm_count too but this case looks
    fine at first glance, it seems that this ->mm is only used to verify that
    current->mm == kvm->mm.
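
    A minimal sketch of the reference-counting change described above
    (illustrative; placement follows the description, not the exact diff):

    /* kvm_setup_async_pf(): pin the whole address space, not just the
     * mm_struct, so get_user_pages() stays valid while the work runs. */
    work->mm = current->mm;
    atomic_inc(&work->mm->mm_users);

    /* async_pf_execute() and the cancel path: drop that reference. */
    mmput(work->mm);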

    Signed-off-by: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Oleg Nesterov
     

04 Feb, 2014

1 commit


30 Jan, 2014

2 commits

  • On s390 we are not able to cancel work. Instead we will flush the work and wait for
    completion.

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     
  • By setting a Kconfig option, the architecture can control when
    guest notifications will be presented by the apf backend.
    There is the default batch mechanism, working as before, where the vcpu
    thread should pull in this information.
    Opposed to this, there is now a direct mechanism that will push the
    information to the guest.
    This way s390 can use an already existing architecture interface.

    Still, the vcpu thread should call check_completion to clean up leftovers.
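
    A hedged sketch of the two mechanisms (assuming the option is named
    CONFIG_KVM_ASYNC_PF_SYNC; the exact hook placement is illustrative):

    #ifdef CONFIG_KVM_ASYNC_PF_SYNC
            /* direct: the worker pushes the notification to the guest itself */
            kvm_arch_async_page_present(vcpu, apf);
    #endif
    /* Without the option, the default batch mechanism applies: the vcpu
     * thread later pulls completed items off the done list, as before. */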

    Signed-off-by: Dominik Dingel
    Signed-off-by: Christian Borntraeger

    Dominik Dingel
     

15 Oct, 2013

1 commit

  • Page pinning is not mandatory in kvm async page fault processing, since
    after the async page fault event is delivered to a guest it accesses the
    page once again and does its own GUP. Drop the FOLL_GET flag in GUP in
    async_pf code, and simplify the check/clear processing a bit.
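
    A minimal sketch of the GUP call without pinning (the argument order of
    the get_user_pages() of that era is an assumption):

    /* Passing NULL for the pages array means no reference is taken
     * (FOLL_GET is not used); the fault is simply satisfied, and the guest
     * does its own GUP after the async page fault event is delivered. */
    get_user_pages(current, mm, addr, 1, 1 /* write */, 0 /* force */,
                   NULL /* pages: no pin */, NULL /* vmas */);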

    Suggested-by: Gleb Natapov
    Signed-off-by: Gu zheng
    Signed-off-by: chai wen
    Signed-off-by: Gleb Natapov

    chai wen
     

25 Sep, 2013

1 commit

  • '.done' is used to mark the completion of 'async_pf_execute()', but
    'cancel_work_sync()' returns true when the work was canceled, so we
    use it instead.
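
    A minimal sketch of the check (illustrative):

    /* cancel_work_sync() returns true only if the work item was still
     * pending, i.e. async_pf_execute() never ran and its resources must be
     * released here. */
    if (cancel_work_sync(&work->work))
            kmem_cache_free(async_pf_cache, work);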

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     

17 Sep, 2013

1 commit

  • When we cancel 'async_pf_execute()', we should behave as if the work was
    never scheduled in 'kvm_setup_async_pf()'.
    Fixes a bug where we can't unload the module because the vm wasn't destroyed.

    Signed-off-by: Radim Krčmář
    Reviewed-by: Paolo Bonzini
    Reviewed-by: Gleb Natapov
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     

06 Aug, 2012

2 commits


26 Jul, 2012

2 commits

  • Currently, kvm allocates some pages and uses them as error indicators,
    which wastes memory and is not good for scalability.

    Based on Avi's suggestion, we use error codes instead of these pages
    to indicate the error conditions.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     
  • kvm_async_pf_wakeup_all uses bad_page to generate a broadcast wakeup,
    and uses put_page to release bad_page; this works only because
    bad_page is a normal page. But we will use an error code instead of
    bad_page, so use kvm_release_page_clean to release the page, which will
    handle the error code properly.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Avi Kivity

    Xiao Guangrong
     

12 Jan, 2011

6 commits

  • In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
    without holding vcpu->async_pf.lock, which will break if we are handling
    apfs at the same time.

    Also use 'list_empty_careful()' instead of 'list_empty()'
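
    A minimal sketch of the locking this asks for (illustrative):

    /* Queue the dummy wakeup entry under async_pf.lock so concurrent apf
     * handling never sees a half-updated done list. */
    spin_lock(&vcpu->async_pf.lock);
    list_add_tail(&work->link, &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);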

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • If there is no need to inject an async #PF into the PV guest, we can
    handle more completed apfs at one time, so the guest #PF can be retried
    as early as possible.

    Signed-off-by: Xiao Guangrong
    Acked-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Send async page fault to a PV guest if it accesses swapped out memory.
    Guest will choose another task to run upon receiving the fault.

    Allow async page fault injection only when guest is in user mode since
    otherwise guest may be in non-sleepable context and will not be able
    to reschedule.

    Vcpu will be halted if guest will fault on the same page again or if
    vcpu executes kernel code.

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • Guest enables async PF vcpu functionality using this MSR.

    Reviewed-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • When a page is swapped in, it is mapped into guest memory only after the
    guest tries to access it again and generates another fault. To save this fault
    we can map it immediately since we know that guest is going to access
    the page. Do it only when tdp is enabled for now. Shadow paging case is
    more complicated. CR[034] and EFER registers should be switched before
    doing mapping and then switched back.

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • If a guest accesses swapped out memory, do not swap it in from the vcpu
    thread context. Schedule work to do the swapping and put the vcpu into a
    halted state instead.

    Interrupts will still be delivered to the guest, and if an interrupt
    causes a reschedule the guest will continue to run another task.

    [avi: remove call to get_user_pages_noio(), nacked by Linus; this
    makes everything synchronous again]

    Acked-by: Rik van Riel
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov