09 Oct, 2012

8 commits

  • .fault can now retry, and the retry can break .fault's state machine. In
    filemap_fault, if the page is missing, ra->mmap_miss is increased. In the
    second try, since the page is now in the page cache, ra->mmap_miss is
    decreased. Both happen within a single fault, so we can't detect random
    mmap file access.

    Add a new flag to indicate that .fault has already been tried once. In the
    second try, skip the ra->mmap_miss decrease. The filemap_fault state
    machine is fine with it.

    I only tested x86 and didn't test other archs, but the change for other
    archs looks obvious, but who knows :)
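
    A minimal sketch of the idea (not the exact upstream diff), assuming the
    new flag is a bit such as FAULT_FLAG_TRIED in vmf->flags:

        #include <linux/fs.h>
        #include <linux/mm.h>

        /*
         * Sketch: on a retried fault the page is already in the page cache,
         * so decreasing ra->mmap_miss again would cancel out the increase
         * recorded by the first attempt and hide random access patterns.
         */
        static void mmap_miss_hit_sketch(struct vm_fault *vmf,
                                         struct file_ra_state *ra)
        {
                if (vmf->flags & FAULT_FLAG_TRIED)
                        return;                 /* second try: skip decrease */

                if (ra->mmap_miss > 0)
                        ra->mmap_miss--;        /* first try: page-cache hit */
        }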

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Provide rb_insert_augmented() and rb_erase_augmented() through a new
    rbtree_augmented.h include file. rb_erase_augmented() is defined there as
    an __always_inline function, in order to allow inlining of augmented
    rbtree callbacks into it. Since this generates a relatively large
    function, each augmented rbtree user should make sure to have a single
    call site.
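
    A minimal usage sketch (struct my_node and my_callbacks are illustrative,
    not taken from the patch), assuming the callbacks-struct interface of the
    new header:

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
                unsigned long key;
                unsigned long subtree_max;      /* augmented data */
        };

        /* callbacks instance, e.g. generated by a helper macro */
        extern const struct rb_augment_callbacks my_callbacks;

        /*
         * Keep one erase call site per augmented rbtree user:
         * rb_erase_augmented() is __always_inline, so every call site
         * instantiates another copy of its fairly large body.
         */
        static void my_tree_erase(struct my_node *node, struct rb_root *root)
        {
                rb_erase_augmented(&node->rb, root, &my_callbacks);
        }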

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
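
    A sketch of what the preprocessor template buys, assuming a generator
    macro along the lines of INTERVAL_TREE_DEFINE() that takes the node type,
    rb field, endpoint accessors and a name prefix (struct my_range and the
    accessor macros are illustrative):

        #include <linux/interval_tree_generic.h>

        struct my_range {
                struct rb_node rb;
                unsigned long start, last;      /* inclusive endpoints */
                unsigned long subtree_last;     /* augmented: max 'last' below */
        };

        #define MY_START(n)     ((n)->start)
        #define MY_LAST(n)      ((n)->last)

        /*
         * Expands to my_range_tree_insert()/remove()/iter_first()/iter_next()
         * specialized for struct my_range; the VMA case instead supplies
         * accessors that derive the endpoints from vm_pgoff and the size of
         * the mapping, since they are not stored in the VMA itself.
         */
        INTERVAL_TREE_DEFINE(struct my_range, rb, unsigned long, subtree_last,
                             MY_START, MY_LAST, static, my_range_tree)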

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As proposed by Peter Zijlstra, this makes it easier to define the augmented
    rbtree callbacks.
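
    A sketch of the boilerplate the helper removes, assuming a macro of the
    form RB_DECLARE_CALLBACKS(static, name, struct, rbfield, type, field,
    compute); my_node and my_compute_max are illustrative and match the
    sketch two entries above:

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
                unsigned long value;
                unsigned long subtree_max;      /* max 'value' in subtree */
        };

        static unsigned long my_compute_max(struct my_node *node)
        {
                unsigned long max = node->value, child;

                if (node->rb.rb_left) {
                        child = rb_entry(node->rb.rb_left,
                                         struct my_node, rb)->subtree_max;
                        if (child > max)
                                max = child;
                }
                if (node->rb.rb_right) {
                        child = rb_entry(node->rb.rb_right,
                                         struct my_node, rb)->subtree_max;
                        if (child > max)
                                max = child;
                }
                return max;
        }

        /* Generates the propagate/copy/rotate callbacks plus 'my_callbacks'. */
        RB_DECLARE_CALLBACKS(static, my_callbacks, struct my_node, rb,
                             unsigned long, subtree_max, my_compute_max)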

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree API
    and remove the old augmented rbtree implementation.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.

    We can pass the mapping address from remap_pfn_range() into
    track_pfn_vma_new(), and collect all PAT-related logic together in
    arch/x86/.

    This patch also restores the original frustration-free is_cow_mapping()
    check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
    ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

    The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
    because that case is already handled by VM_PFNMAP in the VM_NO_THP
    bit-mask.

    [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
    attribute and uses it. The expectation is that the driver reserves the
    memory attributes for the pfn before calling vm_insert_pfn().

    remap_pfn_range() (when called for the whole vma) will set up a new
    attribute (based on the prot argument) for the specified pfn range.
    This addresses the legacy usage which typically calls remap_pfn_range()
    with a desired memory attribute. For ranges smaller than the vma size
    (which is typically not the case), remap_pfn_range() will use the
    existing memory attribute for the pfn range.

    Expose two different APIs for these different behaviors:
    track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn(),
    and track_pfn_remap() for remap_pfn_range().

    This cleanup also prepares the ground for the track/untrack pfn vma
    routines to take over the ownership of setting PAT specific vm_flag in
    the 'vma'.
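
    Roughly, the split looks like the following (prototypes approximate, not
    copied from the patch):

        #include <linux/mm.h>

        /*
         * remap_pfn_range() path: reserve/set a memory attribute for the
         * whole pfn range based on 'prot' (or, for a partial-vma range,
         * look up and reuse the existing attribute).
         */
        int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
                            unsigned long pfn, unsigned long addr,
                            unsigned long size);

        /*
         * vm_insert_pfn() path: only look up the attribute the driver has
         * already reserved for this pfn and fold it into 'prot'.
         */
        int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
                             unsigned long pfn);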

    [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
    [akpm@linux-foundation.org: tweak a few comments]
    Signed-off-by: Suresh Siddha
    Signed-off-by: Konstantin Khlebnikov
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Konstantin Khlebnikov
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
     
  • The 'pfn' argument for track_pfn_vma_new() can be used for reserving the
    attribute for the pfn range, so there is no need to depend on 'vm_pgoff'.

    Similarly, untrack_pfn_vma() can depend on the 'pfn' argument if it is
    non-zero, or can use follow_phys() to get the starting value of the pfn
    range.

    Also, the non-zero 'size' argument can be used instead of recomputing it
    from the vma.

    This cleanup also prepares the ground for the track/untrack pfn vma
    routines to take over the ownership of setting PAT specific vm_flag in the
    'vma'.

    [khlebnikov@openvz.org: Clear pfn to paddr conversion]
    Signed-off-by: Suresh Siddha
    Signed-off-by: Konstantin Khlebnikov
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
     

02 Oct, 2012

3 commits

  • Pull x86/smap support from Ingo Molnar:
    "This adds support for the SMAP (Supervisor Mode Access Prevention) CPU
    feature on Intel CPUs: a hardware feature that prevents unintended
    user-space data access from kernel privileged code.

    It's turned on automatically when possible.

    This, in combination with SMEP, makes it even harder to exploit kernel
    bugs such as NULL pointer dereferences."

    Fix up trivial conflict in arch/x86/kernel/entry_64.S due to newly added
    includes right next to each other.

    * 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, smep, smap: Make the switching functions one-way
    x86, suspend: On wakeup always initialize cr4 and EFER
    x86-32: Start out eflags and cr4 clean
    x86, smap: Do not abuse the [f][x]rstor_checking() functions for user space
    x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry
    x86, smap: Reduce the SMAP overhead for signal handling
    x86, smap: A page fault due to SMAP is an oops
    x86, smap: Turn on Supervisor Mode Access Prevention
    x86, smap: Add STAC and CLAC instructions to control user space access
    x86, uaccess: Merge prototypes for clear_user/__clear_user
    x86, smap: Add a header file with macros for STAC/CLAC
    x86, alternative: Add header guards to
    x86, alternative: Use .pushsection/.popsection
    x86, smap: Add CR4 bit for SMAP
    x86-32, mm: The WP test should be done on a kernel page
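
    As a rough illustration of what the STAC/CLAC plumbing boils down to,
    assuming the stac()/clac() inline helpers added by the series (the
    wrapper itself is hypothetical):

        #include <asm/smap.h>
        #include <linux/uaccess.h>

        /*
         * With SMAP turned on, any deliberate access to user memory from
         * kernel code must be bracketed by STAC (set EFLAGS.AC, permit user
         * access) and CLAC (clear it again); the regular uaccess routines
         * gain exactly this kind of bracketing.
         */
        static unsigned long copy_from_user_sketch(void *dst,
                                                   const void __user *src,
                                                   unsigned long n)
        {
                unsigned long left;

                stac();                 /* allow user-space access */
                left = __copy_from_user_inatomic(dst, src, n);
                clac();                 /* forbid it again */

                return left;
        }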

    Linus Torvalds
     
  • Pull x86/platform changes from Ingo Molnar:
    "This cleans up some Xen-induced pagetable init code uglies, by
    generalizing new platform callbacks and state: x86_init.paging.*"

    * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Document x86_init.paging.pagetable_init()
    x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
    x86: Move paging_init() call to x86_init.paging.pagetable_init()
    x86: Rename pagetable_setup_start() to pagetable_init()
    x86: Remove base argument from x86_init.paging.pagetable_setup_start

    Linus Torvalds
     
  • Pull x86/mm changes from Ingo Molnar:
    "The biggest change is new TLB partial flushing code for AMD CPUs.
    (The v3.6 kernel had the Intel CPU side code, see commits
    e0ba94f14f74..effee4b9b3b.)

    There's also various other refinements around the TLB flush code"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Distinguish TLB shootdown interrupts from other functions call interrupts
    x86/mm: Fix range check in tlbflush debugfs interface
    x86, cpu: Preset default tlb_flushall_shift on AMD
    x86, cpu: Add AMD TLB size detection
    x86, cpu: Push TLB detection CPUID check down
    x86, cpu: Fixup tlb_flushall_shift formatting

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • As TLB shootdown requests to other CPU cores now use function call
    interrupts, the TLB shootdowns entry in /proc/interrupts is always shown
    as 0.

    This behavior change was introduced by commit 52aec3308db8 ("x86/tlb:
    replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR").

    This patch makes the TLB shootdowns entry in /proc/interrupts once again
    count TLB shootdowns separately from the other function call interrupts.

    Signed-off-by: Tomoki Sekiyama
    Link: http://lkml.kernel.org/r/20120926021128.22212.20440.stgit@hpxw
    Acked-by: Alex Shi
    Signed-off-by: H. Peter Anvin

    Tomoki Sekiyama
     

26 Sep, 2012

1 commit

  • Add the necessary hooks to x86 exceptions for userspace
    RCU extended quiescent state support.

    This includes traps, page faults, debug exceptions, etc.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

13 Sep, 2012

1 commit

  • Fix an off-by-one error in devmem_is_allowed(), which allowed accesses
    to physical addresses 0x100000-0x100fff, an extra page past 1MB.
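
    The fix amounts to tightening the low-memory check by one page; a sketch
    of the corrected function (not the verbatim diff):

        #include <linux/ioport.h>
        #include <linux/mm.h>

        int devmem_is_allowed(unsigned long pagenr)
        {
                /* Pages 0..255 cover the first 1MB; "<= 256" let the page at
                 * 0x100000 (pagenr 256) through as well. */
                if (pagenr < 256)
                        return 1;
                if (iomem_is_exclusive(pagenr << PAGE_SHIFT))
                        return 0;
                if (!page_is_ram(pagenr))
                        return 1;
                return 0;
        }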

    Signed-off-by: T Makphaibulchoke
    Acked-by: H. Peter Anvin
    Cc: yinghai@kernel.org
    Cc: tiwai@suse.de
    Cc: dhowells@redhat.com
    Link: http://lkml.kernel.org/r/1346210503-14276-1-git-send-email-tmac@hp.com
    Signed-off-by: Ingo Molnar

    T Makphaibulchoke
     

12 Sep, 2012

4 commits

  • At this stage x86_init.paging.pagetable_setup_done is only used in the
    XEN case. Move its content into the x86_init.paging.pagetable_init setup
    function and remove the now unused x86_init.paging.pagetable_setup_done
    remaining infrastructure.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • Move the paging_init() call to the platform specific pagetable_init()
    function, so we can get rid of the extra pagetable_setup_done()
    function pointer.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • In preparation for unifying the pagetable_setup_start() and
    pagetable_setup_done() setup functions, rename appropriately all the
    infrastructure related to pagetable_setup_start().

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • We either use swapper_pg_dir or the argument is unused. Preparatory
    patch to simplify platform pagetable setup further.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     

07 Sep, 2012

1 commit

  • Since the shift count settable there is used for shifting values
    of type "unsigned long", its value must not match or exceed
    BITS_PER_LONG (otherwise the shift operations are undefined).

    Similarly, the value must not be negative (but -1 must be
    permitted, as that's the value used to distinguish the case of
    the fine grained flushing being disabled).
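
    A sketch of the intended validation in the debugfs write handler (the
    handler shape and variable are illustrative; only the bounds come from
    the description above):

        #include <linux/bitops.h>
        #include <linux/errno.h>

        static long tlb_flushall_shift = -1;    /* illustrative stand-in */

        static int set_tlb_flushall_shift(long shift)
        {
                /* -1 means fine-grained flushing is disabled; any other value
                 * is used to shift an unsigned long, so it must not be
                 * negative and must stay below BITS_PER_LONG. */
                if (shift < -1 || shift >= BITS_PER_LONG)
                        return -EINVAL;

                tlb_flushall_shift = shift;
                return 0;
        }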

    Signed-off-by: Jan Beulich
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/5049B65C020000780009990C@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

22 Aug, 2012

1 commit

  • Each page mapped in a process's address space must be correctly
    accounted for in _mapcount. Normally the rules for this are
    straightforward but hugetlbfs page table sharing is different. The page
    table pages at the PMD level are reference counted while the mapcount
    remains the same.

    If this accounting is wrong, it causes bugs like this one reported by
    Larry Woodman:

    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#1] SMP
    CPU 22
    Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
    Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
    RIP: 0010:[] [] __delete_from_page_cache+0x15d/0x170
    Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
    Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

    During fork(), copy_hugetlb_page_range() detects whether huge_pte_alloc()
    shared page tables with the check dst_pte == src_pte. The logic is that if
    the PMD page is the same, they must be shared. This assumes that the
    sharing is between the parent and child. However, if the sharing is with a
    different process entirely, then this check fails, as in this diagram:

        parent
          |
          ------------>pmd
          src_pte----------> data page
                                  ^
        other--------->pmd--------|
                        ^
        child-----------|
          dst_pte

    For this situation to occur, it must be possible for Parent and Other to
    have faulted and failed to share page tables with each other. This is
    possible due to the following style of race.

    PROC A                                  PROC B
    copy_hugetlb_page_range                 copy_hugetlb_page_range
      src_pte == huge_pte_offset              src_pte == huge_pte_offset
      !src_pte so no sharing                  !src_pte so no sharing

    (time passes)

    hugetlb_fault                           hugetlb_fault
      huge_pte_alloc                          huge_pte_alloc
        huge_pmd_share                          huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                                LOCK(i_mmap_mutex)
                                                find nothing, no sharing
                                                UNLOCK(i_mmap_mutex)
        pmd_alloc                             pmd_alloc
          LOCK(instantiation_mutex)
          fault
          UNLOCK(instantiation_mutex)
                                              LOCK(instantiation_mutex)
                                              fault
                                              UNLOCK(instantiation_mutex)

    These two processes are now pointing to the same data page but are not
    sharing page tables because the opportunity was missed. When either
    process later forks, the src_pte == dst_pte check is potentially
    insufficient. As the check falls through, the wrong PTE information is
    copied in (harmless but wrong) and the mapcount is bumped for a page
    mapped by a shared page table, leading to the BUG_ON.

    This patch addresses the issue by moving pmd_alloc into huge_pmd_share,
    which guarantees that the shared pud is populated in the same critical
    section as the pmd. This also means that the huge_pte_offset test in
    huge_pmd_share is now serialized correctly, which in turn means that the
    chances of the sharing succeeding are higher, as the racing tasks see the
    pud and pmd populated together.

    Race identified and changelog written mostly by Mel Gorman.

    [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
    Reported-by: Larry Woodman
    Tested-by: Larry Woodman
    Reviewed-by: Mel Gorman
    Signed-off-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Cong Wang
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2012

1 commit

  • This reverts commit bacef661acdb634170a8faddbc1cf28e8f8b9eee.

    This commit has been found to cause serious regressions on a number of
    ASUS machines at the least. We probably need to provide a 1:1 map in
    addition to the EFI virtual memory map in order for this to work.

    Signed-off-by: H. Peter Anvin
    Reported-and-bisected-by: Jérôme Carretero
    Cc: Jan Beulich
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu

    H. Peter Anvin
     

03 Aug, 2012

1 commit

  • Otherwise you could run into a WARN_ON in numa_register_memblks(),
    because node_possible_map is zero.

    References: https://bugzilla.novell.com/show_bug.cgi?id=757888

    On this machine (ProLiant ML570 G3) the SRAT table contains:
    - No processor affinities
    - One memory affinity structure (which is set disabled)

    CC: Per Jessen
    CC: Andi Kleen
    Signed-off-by: Thomas Renninger
    Signed-off-by: Len Brown

    Thomas Renninger
     

27 Jul, 2012

2 commits

  • Pull x86/mm changes from Peter Anvin:
    "The big change here is the patchset by Alex Shi to use INVLPG to flush
    only the affected pages when we only need to flush a small page range.

    It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
    vectors!) and replaces them with an ordinary IPI function call."

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
    to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix build warning and crash when building for !SMP
    x86/tlb: do flush_tlb_kernel_range by 'invlpg'
    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
    x86/tlb: enable tlb flush range support for x86
    mm/mmu_gather: enable tlb flush range in generic mmu_gather
    x86/tlb: add tlb_flushall_shift knob into debugfs
    x86/tlb: add tlb_flushall_shift for specific CPU
    x86/tlb: fall back to flush all when meet a THP large page
    x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
    x86/tlb_info: get last level TLB entry number of CPU
    x86: Add read_mostly declaration/definition to variables from smp.h
    x86: Define early read-mostly per-cpu macros

    Linus Torvalds
     
  • Pull x86/efi changes from Ingo Molnar:
    "This tree adds an EFI bootloader handover protocol, which, once
    supported on the bootloader side, will make bootup faster and might
    result in simpler bootloaders.

    The other change activates the EFI wall clock time accessors on x86-64
    as well, instead of the legacy RTC readout."

    * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, efi: Handover Protocol
    x86-64/efi: Use EFI to deal with platform wall clock

    Linus Torvalds
     

28 Jun, 2012

7 commits

  • This patch does flush_tlb_kernel_range by 'invlpg'. The performance cost
    and gain were analyzed in a previous patch
    (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).

    In the testing: http://lkml.org/lkml/2012/6/21/10

    The cost is mostly covered by the long kernel path, but the gain is still
    quite clear: memory access in a user application can speed up by 30+%
    when the kernel executes this function.

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • There are currently 32 INVALIDATE_TLB_VECTORs in the kernel, which is
    quite a large number of vectors in the IDT. Yet it is still not enough,
    since modern x86 servers have even more CPUs, and that still causes heavy
    lock contention in TLB flushing.

    The patch uses the generic SMP call function to replace them. That frees
    32 vector numbers in the IDT and resolves the lock contention in TLB
    flushing on large systems.

    On an NHM-EX machine with 4P * 8 cores * HT = 64 CPUs, hackbench pthread
    shows a 3% performance increase.
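
    In outline, the remote flush becomes an ordinary SMP function call
    instead of a dedicated vector (the names here are illustrative;
    smp_call_function_many() is the generic kernel primitive):

        #include <linux/smp.h>
        #include <linux/mm_types.h>

        struct flush_tlb_info_sketch {
                struct mm_struct *mm;
                unsigned long start, end;
        };

        /* Runs on every target CPU from the function-call IPI handler. */
        static void flush_tlb_func_sketch(void *data)
        {
                struct flush_tlb_info_sketch *info = data;

                /* ... flush the local TLB for info->mm / [start, end) ... */
                (void)info;
        }

        static void flush_tlb_others_sketch(const struct cpumask *mask,
                                            struct flush_tlb_info_sketch *info)
        {
                /* One generic IPI replaces the 32 INVALIDATE_TLB_VECTOR slots. */
                smp_call_function_many(mask, flush_tlb_func_sketch, info, 1);
        }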

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • Not every tlb_flush invocation really needs to evacuate all TLB entries;
    in munmap, for example, a few 'invlpg' instructions are better for
    whole-process performance, since they leave most of the TLB entries in
    place for later accesses.

    This patch also rewrites flush_tlb_range for two purposes:
    1. Split it out to get a flush_tlb_mm_range function.
    2. Clean it up to reduce line breaking; thanks to Borislav for the input.

    My micro benchmark 'munmap' (http://lkml.org/lkml/2012/5/17/59)
    shows that random memory access on another CPU gets a 0-50% speedup
    on a 2P * 4cores * HT NHM EP machine while doing 'munmap'.

    Thanks Yongjie's testing on this patch:
    -------------
    I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
    kernel.
    After running two benchmarks in Xen HVM guest, I found his patches
    brought about 1%~3% performance gain in 'kernel build' and 'netperf'
    testing, though the performance gain was not very stable in 'kernel
    build' testing.

    Some detailed testing results are below.

    Testing Environment:
    Hardware: Romley-EP platform
    Xen version: latest upstream
    Linux kernel: 3.4-RC6
    Guest vCPU number: 8
    NIC: Intel 82599 (10GB bandwidth)

    In 'kernel build' testing in guest:
    Command line | performance gain
    make -j 4 | 3.81%
    make -j 8 | 0.37%
    make -j 16 | -0.52%

    In 'netperf' testing, we tested TCP_STREAM with default socket size
    16384 byte as large packet and 64 byte as small packet.
    I used several clients to add networking pressure, and the 'netperf'
    server automatically generated several threads to respond to them.
    I also used large-size packet and small-size packet in the testing.
    Packet size | Thread number | performance gain
    16384 bytes | 4 | 0.02%
    16384 bytes | 8 | 2.21%
    16384 bytes | 16 | 2.04%
    64 bytes | 4 | 1.07%
    64 bytes | 8 | 3.31%
    64 bytes | 16 | 0.71%

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
    Tested-by: Ren, Yongjie
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • The kernel will replace a cr3 rewrite with invlpg when the number of
    entries to flush is small enough relative to the tlb_flushall_shift
    setting exposed through this debugfs knob (a value of -1 disables the
    replacement entirely).
    Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • Testing shows that different CPU types (microarchitectures and NUMA
    modes) have different balance points between flushing the whole TLB and
    issuing multiple invlpg instructions. There are also cases where the TLB
    flush change does not help at all.

    This patch provides an interface to let x86 vendor developers set a
    different shift for each CPU type.

    For example, on some machines at hand the balance point is 16 entries on
    Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile CPU,
    while on a model 15 Core2 Xeon using invlpg does not help at all.

    For untested machines, apply a conservative optimization, the same as for
    NHM CPUs.

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • We don't need to flush large pages in PAGE_SIZE steps; that just wastes
    time. And actually, large pages don't need the 'invlpg' optimization
    according to our micro benchmark. So just flushing the whole TLB is
    enough for them.

    The following results were measured on a 2CPU * 4cores * 2HT NHM EP
    machine, with the THP 'always' setting.

    Multi-thread testing, the '-t' parameter is the thread number:
                         without this patch   with this patch
    ./mprotect -t 1      14ns                 13ns
    ./mprotect -t 2      13ns                 13ns
    ./mprotect -t 4      12ns                 11ns
    ./mprotect -t 8      14ns                 10ns
    ./mprotect -t 16     28ns                 28ns
    ./mprotect -t 32     54ns                 52ns
    ./mprotect -t 128    200ns                200ns

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • x86 has no flush_tlb_range support at the instruction level. Currently
    flush_tlb_range is just implemented by flushing the whole TLB. That is
    not the best solution for all scenarios. In fact, if we just use 'invlpg'
    to flush a few lines from the TLB, we can get a performance gain from the
    remaining TLB lines being accessed later.

    But the 'invlpg' instruction costs a lot of time. Its execution time can
    compete with a cr3 rewrite, and is even a bit more on SNB CPUs.

    So, on a CPU with 512 4KB TLB entries, the balance point is at:
    (512 - X) * 100ns (assumed TLB refill cost) =
    X (TLB flush entries) * 100ns (assumed invlpg cost)

    Here, X is 256, that is, 1/2 of the 512 entries.

    But with the mysterious CPU prefetcher and page miss handler unit, the
    assumed TLB refill cost is far lower than 100ns in sequential access. And
    2 HT siblings in one core make memory access even faster if they are
    accessing the same memory. So, in the patch, I just do the change when
    the number of target entries is less than 1/16 of the active TLB entries.
    Actually, I have no data to support the '1/16' percentage, so any
    suggestions are welcome.
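
    Put together as code, the decision ends up looking roughly like this
    (sketch only; act_entries and the shift would come from the CPU's
    detected TLB size and the per-CPU tlb_flushall_shift):

        #include <asm/tlbflush.h>

        static void flush_range_sketch(unsigned long start, unsigned long end,
                                       unsigned long act_entries, int shift)
        {
                unsigned long nr = (end - start) >> PAGE_SHIFT;
                unsigned long addr;

                /* Too many pages relative to the active TLB entries (or the
                 * feature is disabled with shift == -1): a full flush is
                 * cheaper. */
                if (shift < 0 || nr > (act_entries >> shift)) {
                        local_flush_tlb();
                        return;
                }

                /* Otherwise invalidate just the affected lines and keep the
                 * rest of the TLB warm for later accesses. */
                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_single(addr);
        }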

    As for hugetlb, presumably due to the smaller page tables and fewer
    active TLB entries, I didn't see a benefit in my benchmark, so there is
    no optimization for it now.

    My micro benchmark shows that in ideal scenarios the read performance
    improves by 70 percent. And in the worst scenario, the read/write
    performance is similar to the unpatched 3.4-rc4 kernel.

    Here is the read data on my 2P * 4cores * HT NHM EP machine, with THP
    'always':

    Multi-thread testing, the '-t' parameter is the thread number:
                         with patch   unpatched 3.4-rc4
    ./mprotect -t 1      14ns         24ns
    ./mprotect -t 2      13ns         22ns
    ./mprotect -t 4      12ns         19ns
    ./mprotect -t 8      14ns         16ns
    ./mprotect -t 16     28ns         26ns
    ./mprotect -t 32     54ns         51ns
    ./mprotect -t 128    200ns        199ns

    Single process with sequential flushing and memory accessing:
                                      with patch   unpatched 3.4-rc4
    ./mprotect                        7ns          11ns
    ./mprotect -p 4096 -l 8 -n 10240  21ns         21ns

    [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
    has additional performance numbers. ]

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     

11 Jun, 2012

1 commit

  • Fix kernel-doc warnings in arch/x86/mm/ioremap.c and
    arch/x86/mm/pageattr.c, just like this one:

    Warning(arch/x86/mm/ioremap.c:204):
    No description found for parameter 'phys_addr'
    Warning(arch/x86/mm/ioremap.c:204):
    Excess function parameter 'offset' description in 'ioremap_nocache'
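
    For reference, such warnings go away when the kernel-doc parameter lines
    match the actual signature; a hypothetical example:

        /**
         * ioremap_example - map a physical range into kernel virtual space
         * @phys_addr:  bus address of the start of the range
         * @size:       number of bytes to map
         *
         * Every @name line must correspond to a real parameter; a stale name
         * left over from a rename (such as 'offset' above) produces the
         * "Excess function parameter" warning, and a missing line produces
         * "No description found for parameter".
         */
        void __iomem *ioremap_example(resource_size_t phys_addr,
                                      unsigned long size);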

    Signed-off-by: Wanpeng Li
    Cc: Gavin Shan
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1339296652-2935-1-git-send-email-liwp.linux@gmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

08 Jun, 2012

1 commit

  • …ion early page table space

    Robin found this regression:

    | I just tried to boot an 8TB system. It fails very early in boot with:
    | Kernel panic - not syncing: Cannot find space for the kernel page tables

    git bisect points to commit 722bc6b16771ed80871e1fd81c86d3627dda2ac8.

    A git revert of that commit does boot past that point on the 8TB
    configuration.

    That commit adds extra pages for all memory ranges, even above 4g.

    Try to limit that extra page count addition to the first entry only.

    Bisected-by: Robin Holt <holt@sgi.com>
    Tested-by: Robin Holt <holt@sgi.com>
    Signed-off-by: Yinghai Lu <yinghai@kernel.org>
    Cc: WANG Cong <xiyou.wangcong@gmail.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/CAE9FiQUj3wyzQxtq9yzBNc9u220p8JZ1FYHG7t%3DMOzJ%3D9BZMYA@mail.gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Yinghai Lu
     

06 Jun, 2012

2 commits

  • When hot-adding a CPU, the system outputs the following messages,
    since node_to_cpumask_map[2] was not allocated memory.

    Booting Node 2 Processor 32 APIC 0xc0
    node_to_cpumask_map[2] NULL
    Pid: 0, comm: swapper/32 Tainted: G A 3.3.5-acd #21
    Call Trace:
    [] debug_cpumask_set_cpu+0x155/0x160
    [] ? add_timer_on+0xaa/0x120
    [] numa_add_cpu+0x1e/0x22
    [] identify_cpu+0x1df/0x1e4
    [] identify_econdary_cpu+0x16/0x1d
    [] smp_store_cpu_info+0x3c/0x3e
    [] smp_callin+0x139/0x1be
    [] start_secondary+0x13/0xeb

    The reason is that the bit for node 2 was not set in
    numa_nodes_parsed. numa_nodes_parsed is set only by
    acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init. Thus, even if hot-added memory
    in the same PXM as the hot-added CPU is described in the ACPI
    SRAT table, numa_nodes_parsed is not set when the hot-added CPU
    itself is not described there.

    But according to ACPI Spec Rev 5.0, it says about ACPI SRAT
    table as follows: This optional table provides information that
    allows OSPM to associate processors and memory ranges, including
    ranges of memory provided by hot-added memory devices, with
    system localities / proximity domains and clock domains.

    It means that the ACPI SRAT table only provides information for
    CPUs present at boot time and for memory, including hot-added
    memory. So hot-added memory is described in the ACPI SRAT table,
    but a hot-added CPU is not. Thus numa_nodes_parsed should be set
    not only by acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init but also by
    acpi_numa_memory_affinity_init for this case.

    Additionally, if the system has a CPU-less memory node,
    acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init cannot set numa_nodes_parsed,
    since these functions cannot find a CPU description for the
    node. In this case, numa_nodes_parsed needs to be set by
    acpi_numa_memory_affinity_init.
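
    The essence of the fix is a one-line addition; a sketch of where it
    would sit (surrounding code abbreviated, helper names approximate):

        #include <linux/acpi.h>
        #include <asm/numa.h>

        static int __init
        srat_memory_affinity_sketch(struct acpi_srat_mem_affinity *ma)
        {
                int node = acpi_map_pxm_to_node(ma->proximity_domain);

                /* ... existing validity checks and numa_add_memblk() ... */

                /* New: memory-only (or not-yet-populated) nodes are recorded
                 * too, so node_to_cpumask_map[] is allocated for them and a
                 * later CPU hot-add finds the node already parsed. */
                node_set(node, numa_nodes_parsed);

                return 0;
        }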

    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: liuj97@gmail.com
    Cc: kosaki.motohiro@gmail.com
    Link: http://lkml.kernel.org/r/4FCC2098.4030007@jp.fujitsu.com
    [ merged it ]
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlike ix86, x86-64 on EFI so far didn't set the
    {g,s}et_wallclock accessors to the EFI routines, thus
    incorrectly using raw RTC accesses instead.

    Simply removing the #ifdef around the respective code isn't
    enough, however: while so far early get-time calls were done in
    physical mode, this doesn't work properly for x86-64, as virtual
    addresses would still need to be set up for all runtime regions
    (which wasn't the case on the system I have access to), so
    instead the patch moves the call to efi_enter_virtual_mode()
    ahead (which in turn allows dropping all code related to calling
    efi-get-time in physical mode).

    Additionally, the earlier calling of efi_set_executable()
    requires the CPA code to cope, i.e. during early boot it must
    avoid calling cpa_flush_array(), as the first thing this
    function does is a BUG_ON(irqs_disabled()).

    Also make the two EFI functions in question here static -
    they're not being referenced elsewhere.

    Signed-off-by: Jan Beulich
    Tested-by: Matt Fleming
    Acked-by: Matthew Garrett
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FBFBF5F020000780008637F@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

30 May, 2012

1 commit

  • Function pat_pagerange_is_ram() scales poorly to large address
    ranges, because it probes the resource tree for each page.

    On a 2.6 GHz Opteron, this function consumes 34 ms for a 1 GB range.

    It is called twice during untrack_pfn_vma(), slowing process
    cleanup and handicapping the OOM killer.

    This replacement consumes less than 1ms, under the same conditions.

    Signed-off-by: John Dykstra on behalf of Cray Inc.
    Acked-by: Suresh Siddha
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337980366.1979.6.camel@redwood
    [ Small stylistic cleanups and renames ]
    Signed-off-by: Ingo Molnar

    John Dykstra