15 Apr, 2013

1 commit


11 Apr, 2013

1 commit

  • Invoking arch_flush_lazy_mmu_mode() results in calls to
    preempt_enable()/preempt_disable(), which may have a performance impact.

    Since lazy MMU mode is not used on bare metal, we can patch out
    arch_flush_lazy_mmu_mode() so that it is never called in such an
    environment.
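    The mechanism can be sketched in plain C as a pvops-style indirect
    call whose target is swapped for a no-op at boot. The names below
    (patch_for_baremetal, the flush counter) are illustrative stand-ins,
    not the kernel's actual paravirt-patching implementation:

```c
#include <assert.h>

static int flush_count;

/* PV implementation: does real work (in the kernel it also pairs
 * preempt_disable()/preempt_enable() around the flush). */
static void lazy_mmu_flush_pv(void) { flush_count++; }

/* Bare-metal replacement: a pure no-op. */
static void lazy_mmu_flush_nop(void) { }

/* pvops-style indirect call site; defaults to the PV variant. */
static void (*arch_flush_lazy_mmu_mode)(void) = lazy_mmu_flush_pv;

/* Boot-time "patching": on bare metal, point the op at the no-op so
 * the call never does any work. */
static void patch_for_baremetal(int running_on_xen)
{
    if (!running_on_xen)
        arch_flush_lazy_mmu_mode = lazy_mmu_flush_nop;
}
```

    In the kernel the patching happens at the binary level (the call
    site is rewritten to nops), but the observable effect is the same:
    on bare metal the call costs nothing.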

    [ hpa: the previous patch "Fix vmalloc_fault oops during lazy MMU
    updates" may cause a minor performance regression on
    bare metal. This patch resolves that performance regression. It is
    somewhat unclear to me if this is a good -stable candidate. ]

    Signed-off-by: Boris Ostrovsky
    Link: http://lkml.kernel.org/r/1364045796-10720-2-git-send-email-konrad.wilk@oracle.com
    Tested-by: Josh Boyer
    Tested-by: Konrad Rzeszutek Wilk
    Acked-by: Borislav Petkov
    Signed-off-by: Konrad Rzeszutek Wilk
    Signed-off-by: H. Peter Anvin
    Cc: SEE NOTE ABOVE

    Boris Ostrovsky
     

03 Apr, 2013

1 commit

  • Occasionally, on a DL380 G4, the guest would crash quite early with this:

    (XEN) d244:v0: unhandled page fault (ec=0003)
    (XEN) Pagetable walk from ffffffff84dc7000:
    (XEN) L4[0x1ff] = 00000000c3f18067 0000000000001789
    (XEN) L3[0x1fe] = 00000000c3f14067 000000000000178d
    (XEN) L2[0x026] = 00000000dc8b2067 0000000000004def
    (XEN) L1[0x1c7] = 00100000dc8da067 0000000000004dc7
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 244 (vcpu#0) crashed on cpu#3:
    (XEN) ----[ Xen-4.1.3OVM x86_64 debug=n Not tainted ]----
    (XEN) CPU: 3
    (XEN) RIP: e033:[]
    (XEN) RFLAGS: 0000000000000216 EM: 1 CONTEXT: pv guest
    (XEN) rax: 0000000000000000 rbx: ffffffff81785f88 rcx: 000000000000003f
    (XEN) rdx: 0000000000000000 rsi: 00000000dc8da063 rdi: ffffffff84dc7000

    The offending code shows it to be a loop writing the value zero
    (%rax) to the address in %rdi (the L4 provided by Xen):

    0: 44 00 00 add %r8b,(%rax)
    3: 31 c0 xor %eax,%eax
    5: b9 40 00 00 00 mov $0x40,%ecx
    a: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
    11: 00 00
    13: ff c9 dec %ecx
    15:* 48 89 07 mov %rax,(%rdi)
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     

28 Mar, 2013

1 commit


23 Feb, 2013

1 commit

  • With commit 8170e6bed465 ("x86, 64bit: Use a #PF handler to materialize
    early mappings on demand") we started hitting an early bootup crash
    where the Xen hypervisor would inform us that:

    (XEN) d7:v0: unhandled page fault (ec=0000)
    (XEN) Pagetable walk from ffffea000005b2d0:
    (XEN) L4[0x1d4] = 0000000000000000 ffffffffffffffff
    (XEN) domain_crash_sync called from entry.S
    (XEN) Domain 7 (vcpu#0) crashed on cpu#3:
    (XEN) ----[ Xen-4.2.0 x86_64 debug=n Not tainted ]----

    .. that Xen was unable to context switch back to dom0.

    Looking at the calling stack we find:

    [] xen_get_user_pgd+0x5a
    [] xen_write_cr3+0x77
    [] init_mem_mapping+0x1f9
    [] setup_arch+0x742
    [] printk+0x48

    We are trying to figure out whether we need to update the user PGD as
    well. Please keep in mind that under 64-bit PV guests we have a limited
    amount of rings: 0 for the Hypervisor, and 1 for both the Linux kernel
    and user-space. As such the Linux pvops'fied version of write_cr3
    checks if it has to update the user-space cr3 as well.

    That clearly is not needed during early bootup. The recent changes (see
    above git commit) streamline the x86 page table allocation to be much
    simpler (And also incidentally the #PF handler ends up in spirit being
    similar to how the Xen toolstack sets up the initial page-tables).

    The fix is to have an early-bootup version of cr3 that just loads the
    kernel %cr3. The later version, which also handles user-page
    modifications, will be used after the initial page tables have been
    set up.
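    A minimal sketch of the two-stage approach, with hypothetical names
    standing in for the real pvops plumbing and the user-PGD lookup:

```c
#include <assert.h>

static unsigned long kernel_cr3_loaded, user_cr3_loaded;

/* Early-boot variant: load only the kernel %cr3; the user PGD
 * machinery does not exist yet, so touching it would fault. */
static void xen_write_cr3_init(unsigned long cr3)
{
    kernel_cr3_loaded = cr3;
}

/* Full variant, installed once the initial page tables are set up:
 * 64-bit PV guests keep a second, user-space cr3 in sync as well. */
static void xen_write_cr3_full(unsigned long cr3)
{
    kernel_cr3_loaded = cr3;
    user_cr3_loaded = cr3 + 1;   /* stand-in for the user-PGD lookup */
}

/* pvops-style indirect call site, pointing at the early version. */
static void (*pv_write_cr3)(unsigned long) = xen_write_cr3_init;

static void finish_pagetable_setup(void)
{
    pv_write_cr3 = xen_write_cr3_full;   /* switch to the full version */
}
```

    Since nothing calls the early version after the switch-over, it can
    be marked __init in the kernel, as the hpa note above mentions.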

    [ hpa: removed a redundant #ifdef and made the new function __init.
    Also note that x86-32 already has such an early xen_write_cr3. ]

    Tested-by: "H. Peter Anvin"
    Reported-by: Konrad Rzeszutek Wilk
    Signed-off-by: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/1361579812-23709-1-git-send-email-konrad.wilk@oracle.com
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Linus Torvalds

    Konrad Rzeszutek Wilk
     

30 Jan, 2013

1 commit


14 Dec, 2012

1 commit

  • Pull Xen updates from Konrad Rzeszutek Wilk:
    - Add necessary infrastructure to make balloon driver work under ARM.
    - Add /dev/xen/privcmd interfaces to work with ARM and PVH.
    - Improve Xen PCIBack wild-card parsing.
    - Add Xen ACPI PAD (Processor Aggregator) support - so can offline/
    online sockets depending on the power consumption.
    - PVHVM + kexec = use an E820_RESV region for the shared region so we
    don't overwrite said region during kexec reboot.
    - Cleanups, compile fixes.

    Fix up some trivial conflicts due to the balloon driver now working on
    ARM, and there were changes next to the previous work-arounds that are
    now gone.

    * tag 'stable/for-linus-3.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/PVonHVM: fix compile warning in init_hvm_pv_info
    xen: arm: implement remap interfaces needed for privcmd mappings.
    xen: correctly use xen_pfn_t in remap_domain_mfn_range.
    xen: arm: enable balloon driver
    xen: balloon: allow PVMMU interfaces to be compiled out
    xen: privcmd: support autotranslated physmap guests.
    xen: add pages parameter to xen_remap_domain_mfn_range
    xen/acpi: Move the xen_running_on_version_or_later function.
    xen/xenbus: Remove duplicate inclusion of asm/xen/hypervisor.h
    xen/acpi: Fix compile error by missing decleration for xen_domain.
    xen/acpi: revert pad config check in xen_check_mwait
    xen/acpi: ACPI PAD driver
    xen-pciback: reject out of range inputs
    xen-pciback: simplify and tighten parsing of device IDs
    xen PVonHVM: use E820_Reserved area for shared_info

    Linus Torvalds
     

29 Nov, 2012

2 commits


18 Nov, 2012

1 commit

  • Page table areas are now pre-mapped after:
    x86, mm: setup page table in top-down
    x86, mm: Remove early_memremap workaround for page table accessing on 64bit

    mapping_pagetable_reserve is not used anymore, so remove it.

    Also remove the operation in mask_rw_pte(), as the modified
    alloc_low_page() always returns pages that are already mapped;
    moreover xen_alloc_pte_init, xen_alloc_pmd_init, etc. will mark the
    page RO before hooking it into the pagetable automatically.

    -v2: add changelog about mask_rw_pte() from Stefano.

    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/1353123563-3103-27-git-send-email-yinghai@kernel.org
    Cc: Stefano Stabellini
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

01 Nov, 2012

1 commit

  • As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
    hypervisor to do a TLB flush on all active vCPUs. If instead
    we were using the generic one (which ends up being xen_flush_tlb)
    we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
    before we make that hypercall the kernel will IPI all of the
    vCPUs (even those that were asleep from the hypervisor
    perspective). The end result is that we needlessly wake them
    up and do a TLB flush when we can just let the hypervisor
    do it correctly.

    This patch gives around a 50% speed improvement when migrating
    idle guests from one host to another.
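    The effect can be modeled in a few lines of C; the vCPU states and
    counters below are a toy model, not Xen's scheduler:

```c
#include <assert.h>

enum { NR_VCPUS = 4 };
static int asleep[NR_VCPUS] = {0, 1, 1, 1};  /* idle guest: most vCPUs halted */
static int wakeups, flushes;

/* Xen path: one MMUEXT_TLB_FLUSH_ALL hypercall; the hypervisor flushes
 * only the vCPUs that are actually running and leaves halted ones alone. */
static void flush_tlb_via_hypercall(void)
{
    for (int v = 0; v < NR_VCPUS; v++)
        if (!asleep[v])
            flushes++;
}

/* Generic path: the kernel IPIs every vCPU, waking halted ones, and
 * each then makes its own MMUEXT_TLB_FLUSH_LOCAL hypercall. */
static void flush_tlb_via_ipi(void)
{
    for (int v = 0; v < NR_VCPUS; v++) {
        if (asleep[v]) { asleep[v] = 0; wakeups++; }
        flushes++;
    }
}
```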

    Oracle-bug: 14630170

    CC: stable@vger.kernel.org
    Tested-by: Jingjie Jiang
    Suggested-by: Mukesh Rathor
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     

12 Oct, 2012

1 commit

  • Pull Xen fixes from Konrad Rzeszutek Wilk:
    "This has four bug-fixes and one tiny feature that I forgot to put
    initially in my tree due to oversight.

    The feature is for kdump kernels to speed up the /proc/vmcore reading.
    There is a pfn_is_ram helper function that the different platforms
    can register for. We are now doing that.

    The bug-fixes cover some embarrassing struct pv_cpu_ops variables
    being set to NULL on Xen (but not baremetal). We had a similar issue
    in the past with {write|read}_msr_safe and this fills the three
    missing ones. The other bug-fix is to make the console output (hvc)
    be capable of dealing with misbehaving backends and not fall flat on
    its face. Lastly, a quirk for older XenBus implementations that came
    with an ancient v3.4 hypervisor (so RHEL5 based) - reading of certain
    non-existent attributes just hangs the guest during bootup - so we
    take precaution of not doing that on such older installations.

    Feature:
    - Register a pfn_is_ram helper to speed up reading of /proc/vmcore.
    Bug-fixes:
    - Three pvops call for Xen were undefined causing BUG_ONs.
    - Add a quirk so that the shutdown watches (used by kdump) are not
    used with older Xen (3.4).
    - Fix ungraceful state transition for the HVC console."

    * tag 'stable/for-linus-3.7-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pv-on-hvm kexec: add quirk for Xen 3.4 and shutdown watches.
    xen/bootup: allow {read|write}_cr8 pvops call.
    xen/bootup: allow read_tscp call for Xen PV guests.
    xen pv-on-hvm: add pfn_is_ram helper for kdump
    xen/hvc: handle backend CLOSED without CLOSING

    Linus Torvalds
     

09 Oct, 2012

1 commit

  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off a
    VMA; it has since lost its original meaning but still has some effects:

    | effect | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. It seems
    nobody cares about it: it is not exported to userspace directly, it
    only reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.
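    The replacement rule, and the table's effects 2 and 3, can be
    expressed as a small C sketch (the flag values and helper names are
    illustrative, not the kernel's actual definitions):

```c
#include <assert.h>

#define VM_IO         0x1UL
#define VM_DONTEXPAND 0x2UL
#define VM_DONTDUMP   0x4UL

/* Old VM_RESERVED users fall into two camps: device mappings take
 * VM_IO; plain "leave this alone" mappings take the pair
 * VM_DONTEXPAND | VM_DONTDUMP. */
static unsigned long replace_vm_reserved(int is_device_mapping)
{
    return is_device_mapping ? VM_IO : (VM_DONTEXPAND | VM_DONTDUMP);
}

/* Effect 2 from the table: skipped in core dumps. */
static int skipped_in_core_dump(unsigned long flags)
{
    return (flags & (VM_IO | VM_DONTDUMP)) != 0;
}

/* Effect 3 from the table: not merged or expanded. */
static int may_expand(unsigned long flags)
{
    return !(flags & (VM_IO | VM_DONTEXPAND));
}
```

    Either replacement preserves the dump and expand behaviour the old
    flag provided, which is why the counter can go away.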

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

04 Oct, 2012

1 commit

  • Register a pfn_is_ram helper to speed up reading /proc/vmcore in the kdump
    kernel. See commit message of 997c136f518c ("fs/proc/vmcore.c: add hook
    to read_from_oldmem() to check for non-ram pages") for details.

    It makes use of a new hvmop HVMOP_get_mem_type which was introduced in
    xen 4.2 (23298:26413986e6e0) and backported to 4.1.1.

    The new function is currently only enabled for reading /proc/vmcore.
    Later it will also be used for the kexec kernel. Since that requires
    more changes in the generic kernel, make it static for the time being.

    Signed-off-by: Olaf Hering
    Signed-off-by: Konrad Rzeszutek Wilk

    Olaf Hering
     

12 Sep, 2012

6 commits

  • * stable/128gb.v5.1:
    xen/mmu: If the revector fails, don't attempt to revector anything else.
    xen/p2m: When revectoring deal with holes in the P2M array.
    xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
    xen/mmu: Remove from __ka space PMD entries for pagetables.
    xen/mmu: Copy and revector the P2M tree.
    xen/p2m: Add logic to revector a P2M tree to use __va leafs.
    xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
    xen/mmu: For 64-bit do not call xen_map_identity_early
    xen/mmu: use copy_page instead of memcpy.
    xen/mmu: Provide comments describing the _ka and _va aliasing issue
    xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
    Revert "xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain." and "xen/x86: Use memblock_reserve for sensitive areas."
    xen/x86: Workaround 64-bit hypervisor and 32-bit initial domain.
    xen/x86: Use memblock_reserve for sensitive areas.
    xen/p2m: Fix the comment describing the P2M tree.

    Conflicts:
    arch/x86/xen/mmu.c

    pagetable_init is the old xen_pagetable_setup_done and
    xen_pagetable_setup_start rolled into one.

    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • …/tip into stable/for-linus-3.7

    * 'x86/platform' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (9690 commits)
    x86: Document x86_init.paging.pagetable_init()
    x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
    x86: Move paging_init() call to x86_init.paging.pagetable_init()
    x86: Rename pagetable_setup_start() to pagetable_init()
    x86: Remove base argument from x86_init.paging.pagetable_setup_start
    Linux 3.6-rc5
    HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
    Remove user-triggerable BUG from mpol_to_str
    xen/pciback: Fix proper FLR steps.
    uml: fix compile error in deliver_alarm()
    dj: memory scribble in logi_dj
    Fix order of arguments to compat_put_time[spec|val]
    xen: Use correct masking in xen_swiotlb_alloc_coherent.
    xen: fix logical error in tlb flushing
    xen/p2m: Fix one-off error in checking the P2M tree directory.
    powerpc: Don't use __put_user() in patch_instruction
    powerpc: Make sure IPI handlers see data written by IPI senders
    powerpc: Restore correct DSCR in context switch
    powerpc: Fix DSCR inheritance in copy_thread()
    powerpc: Keep thread.dscr and thread.dscr_inherit in sync
    ...

    Konrad Rzeszutek Wilk
     
  • At this stage x86_init.paging.pagetable_setup_done is only used in the
    XEN case. Move its content in the x86_init.paging.pagetable_init setup
    function and remove the now unused x86_init.paging.pagetable_setup_done
    remaining infrastructure.

    Signed-off-by: Attilio Rao
    Acked-by:
    Cc:
    Cc:
    Cc:
    Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • Move the paging_init() call to the platform specific pagetable_init()
    function, so we can get rid of the extra pagetable_setup_done()
    function pointer.

    Signed-off-by: Attilio Rao
    Acked-by:
    Cc:
    Cc:
    Cc:
    Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • In preparation for unifying the pagetable_setup_start() and
    pagetable_setup_done() setup functions, rename appropriately all the
    infrastructure related to pagetable_setup_start().

    Signed-off-by: Attilio Rao
    Acked-by:
    Cc:
    Cc:
    Cc:
    Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • We either use swapper_pg_dir or the argument is unused. Preparatory
    patch to simplify platform pagetable setup further.

    Signed-off-by: Attilio Rao
    Acked-by:
    Cc:
    Cc:
    Cc:
    Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     

06 Sep, 2012

1 commit

  • Callers of xen_remap_domain_range() need to know if the remap failed
    because the frame is currently paged out, so that they can retry the
    remap later on. Return -ENOENT in this case.

    This assumes that the error codes returned by Xen are a subset of
    those used by the kernel. It is unclear if this is defined as part of
    the hypercall ABI.
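    A caller-side sketch of the intended retry pattern, with
    remap_frame() as a hypothetical stand-in for the real
    hypercall-backed remap:

```c
#include <assert.h>
#include <errno.h>

static int attempts_needed = 3, calls;

/* Stand-in for the remap: fails with -ENOENT while the backing frame
 * is paged out, succeeds once it has been paged back in. */
static int remap_frame(void)
{
    calls++;
    return (calls < attempts_needed) ? -ENOENT : 0;
}

/* With a distinct -ENOENT return, callers can separate "paged out,
 * try again" from hard errors that should not be retried. */
static int remap_with_retry(int max_tries)
{
    int rc = -ENOENT;
    for (int i = 0; i < max_tries && rc == -ENOENT; i++)
        rc = remap_frame();
    return rc;
}
```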

    Acked-by: Andres Lagar-Cavilla
    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     

05 Sep, 2012

1 commit

  • While TLB_FLUSH_ALL gets passed as the 'end' argument to
    flush_tlb_others(), the Xen code was made to check its 'start'
    parameter. That may select an incorrect op, MMUEXT_INVLPG_MULTI
    instead of MMUEXT_TLB_FLUSH_MULTI, causing some pages to not be
    flushed from the TLB.

    This patch fixes the issue.
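    The bug boils down to testing the wrong parameter; a minimal model
    (the constants are simplified from the kernel's):

```c
#include <assert.h>

#define TLB_FLUSH_ALL (~0UL)
enum { INVLPG_MULTI, TLB_FLUSH_MULTI };

/* Buggy version: tested 'start', so a full-flush request
 * (end == TLB_FLUSH_ALL, start == some VA) picked INVLPG_MULTI. */
static int pick_op_buggy(unsigned long start, unsigned long end)
{
    (void)end;
    return (start == TLB_FLUSH_ALL) ? TLB_FLUSH_MULTI : INVLPG_MULTI;
}

/* Fixed version: TLB_FLUSH_ALL is passed in 'end'. */
static int pick_op_fixed(unsigned long start, unsigned long end)
{
    (void)start;
    return (end == TLB_FLUSH_ALL) ? TLB_FLUSH_MULTI : INVLPG_MULTI;
}
```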

    Reported-by: Jan Beulich
    Signed-off-by: Alex Shi
    Acked-by: Jan Beulich
    Tested-by: Yongjie Ren
    Signed-off-by: Konrad Rzeszutek Wilk

    Alex Shi
     

23 Aug, 2012

10 commits

  • If the P2M revectoring fails, we would try to continue on by
    cleaning the PMD for the L1 (PTE) page-tables. The xen_cleanhighmap
    is greedy and erases the PMD on both boundaries. Since the P2M
    array can share a PMD, we would wipe out the part of the __ka
    space that the P2M tree still uses to point to P2M leafs.

    This fixes it by bypassing the cleanup and continuing on. If the
    revector fails, a WARN is printed so we can still troubleshoot
    this.

    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • We call memblock_reserve for [start of mfn list] -> [PMD aligned end
    of mfn list] instead of ->

    Konrad Rzeszutek Wilk
     
  • Please first read the description in "xen/mmu: Copy and revector the
    P2M tree."

    At this stage, the __ka address space (which is what the old
    P2M tree was using) is partially disassembled. The cleanup_highmap
    has removed the PMD entries from 0-16MB and anything past _brk_end
    up to the max_pfn_mapped (which is the end of the ramdisk).

    The xen_remove_p2m_tree and code around has ripped out the __ka for
    the old P2M array.

    Here we continue on doing it to where the Xen page-tables were.
    It is safe to do it, as the page-tables are addressed using __va.
    For good measure we delete anything that is within MODULES_VADDR
    and up to the end of the PMD.

    At this point the __ka only contains PMD entries for the start
    of the kernel up to __brk.

    [v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • Please first read the description in "xen/p2m: Add logic to revector a
    P2M tree to use __va leafs" patch.

    The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
    copies the contents of the old one into it, and returns the new one.

    At this stage, the __ka address space (which is what the old
    P2M tree was using) is partially disassembled. The cleanup_highmap
    has removed the PMD entries from 0-16MB and anything past _brk_end
    up to the max_pfn_mapped (which is the end of the ramdisk).

    We have revectored the P2M tree (and the one for save/restore as well)
    to use new shiny __va address to new MFNs. The xen_start_info
    has been taken care of already in 'xen_setup_kernel_pagetable()' and
    xen_start_info->shared_info in 'xen_setup_shared_info()', so
    we are free to roam and delete PMD entries - which is exactly what
    we are going to do. We rip out the __ka for the old P2M array.

    [v1: Fix smatch warnings]
    [v2: memset was doing 0 instead of 0xff]
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • As we are not using them. We end up only using the L1 pagetables
    and grafting those to our page-tables.

    [v1: Per Stefano's suggestion squashed two commits]
    [v2: Per Stefano's suggestion simplified loop]
    [v3: Fix smatch warnings]
    [v4: Add more comments]
    Acked-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • Because we do not need it. During startup Xen provides us with all
    of the initial memory mapped that we need to function.

    The initially mapped memory extends up to the bootstack, which means
    we can reference it using __ka up to 4.f):

    (from xen/interface/xen.h):

    4. This is the order of bootstrap elements in the initial virtual region:
    a. relocated kernel image
    b. initial ram disk [mod_start, mod_len]
    c. list of allocated page frames [mfn_list, nr_pages]
    d. start_info_t structure [register ESI (x86)]
    e. bootstrap page tables [pt_base, CR3 (x86)]
    f. bootstrap stack [register ESP (x86)]

    (the initial ram disk may be omitted).

    [v1: More comments in git commit]
    Acked-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • After all, this is what it is there for.

    Acked-by: Jan Beulich
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • Which is that the level2_kernel_pgt (__ka virtual addresses)
    and level2_ident_pgt (__va virtual address) contain the same
    PMD entries. So if you modify a PTE in __ka, it will be reflected
    in __va (and vice-versa).
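    The aliasing can be illustrated with two pointers to one array, a
    loose model of two PMD entries referencing the same PTE page (the
    helper names are illustrative only):

```c
#include <assert.h>

/* One PTE page (one page frame holding 512 entries). */
static unsigned long pte_page[512];

/* Both PMDs hold an entry pointing at the same PTE page: one reached
 * via the kernel text mapping (__ka, level2_kernel_pgt), one via the
 * direct map (__va, level2_ident_pgt). */
static unsigned long *const ka_view = pte_page;
static unsigned long *const va_view = pte_page;

static void set_pte_via_ka(int idx, unsigned long val) { ka_view[idx] = val; }
static unsigned long read_pte_via_va(int idx) { return va_view[idx]; }
```

    Because both views reach the same frame, any store through one
    alias is immediately visible through the other, which is exactly
    the hazard the comments in the commit document.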

    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • We don't need to return the new PGD - as we do not use it.

    Acked-by: Stefano Stabellini
    Signed-off-by: Konrad Rzeszutek Wilk

    Konrad Rzeszutek Wilk
     
  • This patch removes the "return -ENOSYS" for auto_translated_physmap
    guests from privcmd_mmap, thus it allows ARM guests to issue privcmd
    mmap calls. However privcmd mmap calls are still going to fail for HVM
    and hybrid guests on x86 because the xen_remap_domain_mfn_range
    implementation is currently PV only.

    Changes in v2:

    - better commit message;
    - return -EINVAL from xen_remap_domain_mfn_range if
    auto_translated_physmap.

    Signed-off-by: Stefano Stabellini
    Acked-by: Konrad Rzeszutek Wilk
    Signed-off-by: Konrad Rzeszutek Wilk

    Stefano Stabellini
     

27 Jul, 2012

1 commit

  • Pull x86/mm changes from Peter Anvin:
    "The big change here is the patchset by Alex Shi to use INVLPG to flush
    only the affected pages when we only need to flush a small page range.

    It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
    vectors!) and replace it with an ordinary IPI function call."

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
    to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix build warning and crash when building for !SMP
    x86/tlb: do flush_tlb_kernel_range by 'invlpg'
    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
    x86/tlb: enable tlb flush range support for x86
    mm/mmu_gather: enable tlb flush range in generic mmu_gather
    x86/tlb: add tlb_flushall_shift knob into debugfs
    x86/tlb: add tlb_flushall_shift for specific CPU
    x86/tlb: fall back to flush all when meet a THP large page
    x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
    x86/tlb_info: get last level TLB entry number of CPU
    x86: Add read_mostly declaration/definition to variables from smp.h
    x86: Define early read-mostly per-cpu macros

    Linus Torvalds
     

20 Jul, 2012

2 commits

  • When constructing the initial page tables, if the MFN for a usable PFN
    is missing in the p2m then that frame is initially ballooned out. In
    this case, zero the PTE (as in decrease_reservation() in
    drivers/xen/balloon.c).

    This is obviously safer than having a valid PTE with an MFN of
    INVALID_P2M_ENTRY (~0).
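    A sketch of the rule, with a toy p2m() and a simplified PTE layout
    (frame number shifted left, bit 0 as the present bit):

```c
#include <assert.h>

#define INVALID_P2M_ENTRY (~0UL)

/* Toy p2m lookup: MFN backing each PFN; PFN 5 is ballooned out. */
static unsigned long p2m(unsigned long pfn)
{
    return (pfn == 5) ? INVALID_P2M_ENTRY : 0x1000UL + pfn;
}

/* Build the initial PTE for a PFN: if the MFN is missing, emit a zero
 * (non-present) PTE instead of a "valid" PTE pointing at MFN ~0. */
static unsigned long make_initial_pte(unsigned long pfn)
{
    unsigned long mfn = p2m(pfn);
    if (mfn == INVALID_P2M_ENTRY)
        return 0;
    return (mfn << 12) | 1;   /* frame | present bit */
}
```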

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     
  • In xen_set_pte() if batching is unavailable (because the caller is in
    an interrupt context such as handling a page fault) it would fall back
    to using native_set_pte() and trapping and emulating the PTE write.

    On 32-bit guests this requires two traps for each PTE write (one for
    each dword of the PTE). Instead, do one mmu_update hypercall
    directly.

    During construction of the initial page tables, continue to use
    native_set_pte() because most of the PTEs being set are in writable
    and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
    using a hypercall for this is very expensive.

    This significantly improves page fault performance in 32-bit PV
    guests.

    lmbench3 test    Before     After      Improvement
    --------------------------------------------------
    lat_pagefault    3.18 us    2.32 us    27%
    lat_proc fork    356 us     313.3 us   11%
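    The trap-count arithmetic behind the win can be sketched as a toy
    model (counters only; no real PTEs are written):

```c
#include <assert.h>

static int traps, hypercalls;

/* 32-bit PAE PTE is two dwords; trap-and-emulate hits one fault per
 * dword written. */
static void set_pte_trapped(unsigned long long pte)
{
    (void)pte;
    traps += 2;
}

/* One mmu_update hypercall updates the whole 64-bit entry at once. */
static void set_pte_mmu_update(unsigned long long pte)
{
    (void)pte;
    hypercalls += 1;
}
```

    Two traps versus one hypercall per PTE write is the 2:1 ratio that
    shows up as the page-fault latency improvement in the table above.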

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel
     

28 Jun, 2012

1 commit

  • x86 has no instruction-level support for flushing a range of TLB
    entries. Currently flush_tlb_range is simply implemented by flushing
    the whole TLB, which is not the best solution for all scenarios. If we
    instead use 'invlpg' to flush just a few lines, we keep the remaining
    TLB entries and gain performance on later accesses to them.

    But the 'invlpg' instruction itself costs a lot of time; its execution
    time rivals a cr3 rewrite, and even slightly exceeds it on SNB CPUs.

    So, on a CPU with 512 4KB TLB entries, the balance point is at:
    (512 - X) * 100ns(assumed TLB refill cost) =
    X(TLB flush entries) * 100ns(assumed invlpg cost)

    Here, X is 256, that is 1/2 of 512 entries.

    But with the CPU's prefetcher and page-miss handler unit, the actual
    TLB refill cost is far lower than 100ns for sequential accesses. And
    two HT siblings in one core make memory access faster still when they
    touch the same memory. So in this patch I only do the range flush when
    the number of target entries is less than 1/16 of the whole active TLB
    entries. I have no data supporting the exact '1/16' ratio, so any
    suggestions are welcome.
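    The arithmetic above, plus the conservative threshold actually used,
    in a few lines of C (the equal-cost assumption mirrors the text's
    100ns figures; the function names are illustrative):

```c
#include <assert.h>

/* Naive balance point: flushing X entries with invlpg beats a full
 * flush while X * invlpg_cost < (total - X) * refill_cost. With both
 * costs assumed equal (~100ns), X crosses over at total / 2, i.e.
 * 256 of 512 entries. */
static unsigned long balance_point(unsigned long total_entries)
{
    return total_entries / 2;
}

/* The patch is far more conservative: only use invlpg when the range
 * covers less than 1/16 of the active TLB entries, because real
 * refill costs are much lower than the assumed 100ns. */
static int use_invlpg(unsigned long range_entries, unsigned long active_entries)
{
    return range_entries < active_entries / 16;
}
```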

    As to hugetlb, presumably due to the smaller page tables and fewer
    active TLB entries, I saw no benefit in my benchmark, so it is not
    optimized for now.

    My micro-benchmark shows that in ideal scenarios read performance
    improves by 70 percent, while in the worst scenario reading/writing
    performance is similar to the unpatched 3.4-rc4 kernel.

    Here is the reading data on my 2P * 4-core * HT NHM EP machine, with
    THP set to 'always':

    Multi-thread testing, the '-t' parameter is the thread count:

                           with patch   unpatched 3.4-rc4
    ./mprotect -t 1        14ns         24ns
    ./mprotect -t 2        13ns         22ns
    ./mprotect -t 4        12ns         19ns
    ./mprotect -t 8        14ns         16ns
    ./mprotect -t 16       28ns         26ns
    ./mprotect -t 32       54ns         51ns
    ./mprotect -t 128      200ns        199ns

    Single process with sequential flushing and memory accessing:

                                        with patch   unpatched 3.4-rc4
    ./mprotect                          7ns          11ns
    ./mprotect -p 4096 -l 8 -n 10240    21ns         21ns

    [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
    has additional performance numbers. ]

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     

25 May, 2012

1 commit

  • Pull Xen updates from Konrad Rzeszutek Wilk:
    "Features:
    * Extend the APIC ops implementation and add IRQ_WORKER vector
    support so that 'perf' can work properly.
    * Fix self-ballooning code, and balloon logic when booting as initial
    domain.
    * Move array printing code to generic debugfs
    * Support XenBus domains.
    * Lazily free grants when a domain is dead/non-existent.
    * In M2P code use batching calls
    Bug-fixes:
    * Fix NULL dereference in allocation failure path (hvc_xen)
    * Fix unbinding of IRQ_WORKER vector during vCPU hot-unplug
    * Fix HVM guest resume - we would leak an PIRQ value instead of
    reusing the existing one."

    Fix up add-add conflicts in arch/x86/xen/enlighten.c due to the addition
    of the apic ipi interface next to the new apic_id functions.

    * tag 'stable/for-linus-3.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen: do not map the same GSI twice in PVHVM guests.
    hvc_xen: NULL dereference on allocation failure
    xen: Add selfballoning memory reservation tunable.
    xenbus: Add support for xenbus backend in stub domain
    xen/smp: unbind irqworkX when unplugging vCPUs.
    xen: enter/exit lazy_mmu_mode around m2p_override calls
    xen/acpi/sleep: Enable ACPI sleep via the __acpi_os_prepare_sleep
    xen: implement IRQ_WORK_VECTOR handler
    xen: implement apic ipi interface
    xen/setup: update VA mapping when releasing memory during setup
    xen/setup: Combine the two hypercall functions - since they are quite similar.
    xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
    xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
    xen/gnttab: add deferred freeing logic
    debugfs: Add support to print u32 array in debugfs
    xen/p2m: An early bootup variant of set_phys_to_machine
    xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
    xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
    xen/p2m: Move code around to allow for better re-usage.

    Linus Torvalds
     

23 May, 2012

1 commit

  • Pull x86/apic changes from Ingo Molnar:
    "Most of the changes are about helping virtualized guest kernels
    achieve better performance."

    Fix up trivial conflicts with the iommu updates to arch/x86/kernel/apic/io_apic.c

    * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/apic: Implement EIO micro-optimization
    x86/apic: Add apic->eoi_write() callback
    x86/apic: Use symbolic APIC_EOI_ACK
    x86/apic: Fix typo EIO_ACK -> EOI_ACK and document it
    x86/xen/apic: Add missing #include
    x86/apic: Only compile local function if used with !CONFIG_GENERIC_PENDING_IRQ
    x86/apic: Fix UP boot crash
    x86: Conditionally update time when ack-ing pending irqs
    xen/apic: implement io apic read with hypercall
    Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
    xen/x86: Implement x86_apic_ops
    x86/apic: Replace io_apic_ops with x86_io_apic_ops.

    Linus Torvalds
     

08 May, 2012

2 commits

  • * stable/autoballoon.v5.2:
    xen/setup: update VA mapping when releasing memory during setup
    xen/setup: Combine the two hypercall functions - since they are quite similar.
    xen/setup: Populate freed MFNs from non-RAM E820 entries and gaps to E820 RAM
    xen/setup: Only print "Freeing XXX-YYY pfn range: Z pages freed" if Z > 0
    xen/p2m: An early bootup variant of set_phys_to_machine
    xen/p2m: Collapse early_alloc_p2m_middle redundant checks.
    xen/p2m: Allow alloc_p2m_middle to call reserve_brk depending on argument
    xen/p2m: Move code around to allow for better re-usage.

    Konrad Rzeszutek Wilk
     
  • In xen_memory_setup(), if a page that is being released has a VA
    mapping, this must also be updated. Otherwise the page will not be
    released completely: it will still be referenced in Xen and won't be
    freed until the mapping is removed, and this prevents it from being
    reallocated at a different PFN.

    This was already being done for the ISA memory region in
    xen_ident_map_ISA(), but on many systems a few pages were still
    missed, since many systems mark a few pages below the ISA memory
    region as reserved in the e820 map.

    This fixes errors such as:

    (XEN) page_alloc.c:1148:d0 Over-allocation for domain 0: 2097153 > 2097152
    (XEN) memory.c:133:d0 Could not allocate order=0 extent: id=0 memflags=0 (0 of 17)

    Signed-off-by: David Vrabel
    Signed-off-by: Konrad Rzeszutek Wilk

    David Vrabel