09 Oct, 2012

8 commits

  • .fault can now retry, and the retry can break .fault's state machine. In
    filemap_fault, if the page is missing, ra->mmap_miss is increased. In the
    second try, since the page is now in the page cache, ra->mmap_miss is
    decreased. Both happen within a single fault, so we can't detect random
    mmap file access.

    Add a new flag to indicate that .fault has already been tried once. In the
    second try, skip the ra->mmap_miss decrease. The filemap_fault state
    machine is fine with it.

    I only tested x86 and didn't test other archs, but the change for other
    archs looks obvious, but who knows :)
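
    A minimal sketch of the idea (not the exact upstream diff), assuming the
    new flag is a bit such as FAULT_FLAG_TRIED in vmf->flags:

        #include <linux/fs.h>
        #include <linux/mm.h>

        /*
         * Sketch: on a retried fault the page is already in the page cache,
         * so decreasing ra->mmap_miss again would cancel out the increase
         * recorded by the first attempt and hide random access patterns.
         */
        static void mmap_miss_hit_sketch(struct vm_fault *vmf,
                                         struct file_ra_state *ra)
        {
                if (vmf->flags & FAULT_FLAG_TRIED)
                        return;                 /* second try: skip decrease */

                if (ra->mmap_miss > 0)
                        ra->mmap_miss--;        /* first try: page-cache hit */
        }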

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Provide rb_insert_augmented() and rb_erase_augmented() through a new
    rbtree_augmented.h include file. rb_erase_augmented() is defined there as
    an __always_inline function, in order to allow inlining of augmented
    rbtree callbacks into it. Since this generates a relatively large
    function, each augmented rbtree user should make sure to have a single
    call site.
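
    A minimal usage sketch (struct my_node and my_callbacks are illustrative,
    not taken from the patch), assuming the callbacks-struct interface of the
    new header:

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
                unsigned long key;
                unsigned long subtree_max;      /* augmented data */
        };

        /* callbacks instance, e.g. generated by a helper macro */
        extern const struct rb_augment_callbacks my_callbacks;

        /*
         * Keep one erase call site per augmented rbtree user:
         * rb_erase_augmented() is __always_inline, so every call site
         * instantiates another copy of its fairly large body.
         */
        static void my_tree_erase(struct my_node *node, struct rb_root *root)
        {
                rb_erase_augmented(&node->rb, root, &my_callbacks);
        }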

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
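
    A sketch of what the preprocessor template buys, assuming a generator
    macro along the lines of INTERVAL_TREE_DEFINE() that takes the node type,
    rb field, endpoint accessors and a name prefix (struct my_range and the
    accessor macros are illustrative):

        #include <linux/interval_tree_generic.h>

        struct my_range {
                struct rb_node rb;
                unsigned long start, last;      /* inclusive endpoints */
                unsigned long subtree_last;     /* augmented: max 'last' below */
        };

        #define MY_START(n)     ((n)->start)
        #define MY_LAST(n)      ((n)->last)

        /*
         * Expands to my_range_tree_insert()/remove()/iter_first()/iter_next()
         * specialized for struct my_range; the VMA case instead supplies
         * accessors that derive the endpoints from vm_pgoff and the size of
         * the mapping, since they are not stored in the VMA itself.
         */
        INTERVAL_TREE_DEFINE(struct my_range, rb, unsigned long, subtree_last,
                             MY_START, MY_LAST, static, my_range_tree)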

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As proposed by Peter Zijlstra, this makes it easier to define the augmented
    rbtree callbacks.
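
    A sketch of the boilerplate the helper removes, assuming a macro of the
    form RB_DECLARE_CALLBACKS(static, name, struct, rbfield, type, field,
    compute); my_node and my_compute_max are illustrative and match the
    sketch two entries above:

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
                unsigned long value;
                unsigned long subtree_max;      /* max 'value' in subtree */
        };

        static unsigned long my_compute_max(struct my_node *node)
        {
                unsigned long max = node->value, child;

                if (node->rb.rb_left) {
                        child = rb_entry(node->rb.rb_left,
                                         struct my_node, rb)->subtree_max;
                        if (child > max)
                                max = child;
                }
                if (node->rb.rb_right) {
                        child = rb_entry(node->rb.rb_right,
                                         struct my_node, rb)->subtree_max;
                        if (child > max)
                                max = child;
                }
                return max;
        }

        /* Generates the propagate/copy/rotate callbacks plus 'my_callbacks'. */
        RB_DECLARE_CALLBACKS(static, my_callbacks, struct my_node, rb,
                             unsigned long, subtree_max, my_compute_max)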

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree API
    and remove the old augmented rbtree implementation.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.

    We can pass the mapping address from remap_pfn_range() into
    track_pfn_vma_new(), and collect all PAT-related logic together in
    arch/x86/.

    This patch also restores the original frustration-free is_cow_mapping()
    check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
    ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

    The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
    because that case is already handled by VM_PFNMAP in the VM_NO_THP
    bit-mask.

    [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With PAT enabled, vm_insert_pfn() looks up the existing pfn memory
    attribute and uses it. The expectation is that the driver reserves the
    memory attributes for the pfn before calling vm_insert_pfn().

    remap_pfn_range() (when called for the whole vma) will set up a new
    attribute (based on the prot argument) for the specified pfn range.
    This addresses the legacy usage which typically calls remap_pfn_range()
    with a desired memory attribute. For ranges smaller than the vma size
    (which is typically not the case), remap_pfn_range() will use the
    existing memory attribute for the pfn range.

    Expose two different APIs for these different behaviors:
    track_pfn_insert() for tracking the pfn attribute set by vm_insert_pfn(),
    and track_pfn_remap() for remap_pfn_range().

    This cleanup also prepares the ground for the track/untrack pfn vma
    routines to take over the ownership of setting PAT specific vm_flag in
    the 'vma'.
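
    Roughly, the split looks like the following (prototypes approximate, not
    copied from the patch):

        #include <linux/mm.h>

        /*
         * remap_pfn_range() path: reserve/set a memory attribute for the
         * whole pfn range based on 'prot' (or, for a partial-vma range,
         * look up and reuse the existing attribute).
         */
        int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
                            unsigned long pfn, unsigned long addr,
                            unsigned long size);

        /*
         * vm_insert_pfn() path: only look up the attribute the driver has
         * already reserved for this pfn and fold it into 'prot'.
         */
        int track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot,
                             unsigned long pfn);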

    [khlebnikov@openvz.org: Clear checks in track_pfn_remap()]
    [akpm@linux-foundation.org: tweak a few comments]
    Signed-off-by: Suresh Siddha
    Signed-off-by: Konstantin Khlebnikov
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Konstantin Khlebnikov
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
     
  • The 'pfn' argument for track_pfn_vma_new() can be used for reserving the
    attribute for the pfn range, so there is no need to depend on 'vm_pgoff'.

    Similarly, untrack_pfn_vma() can depend on the 'pfn' argument if it is
    non-zero, or can use follow_phys() to get the starting value of the pfn
    range.

    Also, the non-zero 'size' argument can be used instead of recomputing it
    from the vma.

    This cleanup also prepares the ground for the track/untrack pfn vma
    routines to take over the ownership of setting PAT specific vm_flag in the
    'vma'.

    [khlebnikov@openvz.org: Clear pfn to paddr conversion]
    Signed-off-by: Suresh Siddha
    Signed-off-by: Konstantin Khlebnikov
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suresh Siddha
     

02 Oct, 2012

3 commits

  • Pull x86/smap support from Ingo Molnar:
    "This adds support for the SMAP (Supervisor Mode Access Prevention) CPU
    feature on Intel CPUs: a hardware feature that prevents unintended
    user-space data access from kernel privileged code.

    It's turned on automatically when possible.

    This, in combination with SMEP, makes it even harder to exploit kernel
    bugs such as NULL pointer dereferences."

    Fix up trivial conflict in arch/x86/kernel/entry_64.S due to newly added
    includes right next to each other.

    * 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, smep, smap: Make the switching functions one-way
    x86, suspend: On wakeup always initialize cr4 and EFER
    x86-32: Start out eflags and cr4 clean
    x86, smap: Do not abuse the [f][x]rstor_checking() functions for user space
    x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry
    x86, smap: Reduce the SMAP overhead for signal handling
    x86, smap: A page fault due to SMAP is an oops
    x86, smap: Turn on Supervisor Mode Access Prevention
    x86, smap: Add STAC and CLAC instructions to control user space access
    x86, uaccess: Merge prototypes for clear_user/__clear_user
    x86, smap: Add a header file with macros for STAC/CLAC
    x86, alternative: Add header guards to
    x86, alternative: Use .pushsection/.popsection
    x86, smap: Add CR4 bit for SMAP
    x86-32, mm: The WP test should be done on a kernel page
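
    As a rough illustration of what the STAC/CLAC plumbing boils down to,
    assuming the stac()/clac() inline helpers added by the series (the
    wrapper itself is hypothetical):

        #include <asm/smap.h>
        #include <linux/uaccess.h>

        /*
         * With SMAP turned on, any deliberate access to user memory from
         * kernel code must be bracketed by STAC (set EFLAGS.AC, permit user
         * access) and CLAC (clear it again); the regular uaccess routines
         * gain exactly this kind of bracketing.
         */
        static unsigned long copy_from_user_sketch(void *dst,
                                                   const void __user *src,
                                                   unsigned long n)
        {
                unsigned long left;

                stac();                 /* allow user-space access */
                left = __copy_from_user_inatomic(dst, src, n);
                clac();                 /* forbid it again */

                return left;
        }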

    Linus Torvalds
     
  • Pull x86/platform changes from Ingo Molnar:
    "This cleans up some Xen-induced pagetable init code uglies, by
    generalizing new platform callbacks and state: x86_init.paging.*"

    * 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Document x86_init.paging.pagetable_init()
    x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
    x86: Move paging_init() call to x86_init.paging.pagetable_init()
    x86: Rename pagetable_setup_start() to pagetable_init()
    x86: Remove base argument from x86_init.paging.pagetable_setup_start

    Linus Torvalds
     
  • Pull x86/mm changes from Ingo Molnar:
    "The biggest change is new TLB partial flushing code for AMD CPUs.
    (The v3.6 kernel had the Intel CPU side code, see commits
    e0ba94f14f74..effee4b9b3b.)

    There's also various other refinements around the TLB flush code"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Distinguish TLB shootdown interrupts from other functions call interrupts
    x86/mm: Fix range check in tlbflush debugfs interface
    x86, cpu: Preset default tlb_flushall_shift on AMD
    x86, cpu: Add AMD TLB size detection
    x86, cpu: Push TLB detection CPUID check down
    x86, cpu: Fixup tlb_flushall_shift formatting

    Linus Torvalds
     

28 Sep, 2012

1 commit

  • As TLB shootdown requests to other CPU cores now use function call
    interrupts, the TLB shootdowns entry in /proc/interrupts is always shown
    as 0.

    This behavior change was introduced by commit 52aec3308db8 ("x86/tlb:
    replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR").

    This patch makes the TLB shootdowns entry in /proc/interrupts once again
    count TLB shootdowns separately from the other function call interrupts.

    Signed-off-by: Tomoki Sekiyama
    Link: http://lkml.kernel.org/r/20120926021128.22212.20440.stgit@hpxw
    Acked-by: Alex Shi
    Signed-off-by: H. Peter Anvin

    Tomoki Sekiyama
     

26 Sep, 2012

1 commit

  • Add the necessary hooks to x86 exceptions for userspace
    RCU extended quiescent state support.

    This includes traps, page faults, debug exceptions, etc.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney

    Frederic Weisbecker
     

13 Sep, 2012

1 commit

  • Fix an off-by-one error in devmem_is_allowed(), which allowed accesses
    to physical addresses 0x100000-0x100fff, an extra page past 1MB.
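
    The fix amounts to tightening the low-memory check by one page; a sketch
    of the corrected function (not the verbatim diff):

        #include <linux/ioport.h>
        #include <linux/mm.h>

        int devmem_is_allowed(unsigned long pagenr)
        {
                /* Pages 0..255 cover the first 1MB; "<= 256" let the page at
                 * 0x100000 (pagenr 256) through as well. */
                if (pagenr < 256)
                        return 1;
                if (iomem_is_exclusive(pagenr << PAGE_SHIFT))
                        return 0;
                if (!page_is_ram(pagenr))
                        return 1;
                return 0;
        }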

    Signed-off-by: T Makphaibulchoke
    Acked-by: H. Peter Anvin
    Cc: yinghai@kernel.org
    Cc: tiwai@suse.de
    Cc: dhowells@redhat.com
    Link: http://lkml.kernel.org/r/1346210503-14276-1-git-send-email-tmac@hp.com
    Signed-off-by: Ingo Molnar

    T Makphaibulchoke
     

12 Sep, 2012

4 commits

  • At this stage x86_init.paging.pagetable_setup_done is only used in the
    XEN case. Move its content into the x86_init.paging.pagetable_init setup
    function and remove the now unused x86_init.paging.pagetable_setup_done
    remaining infrastructure.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • Move the paging_init() call to the platform specific pagetable_init()
    function, so we can get rid of the extra pagetable_setup_done()
    function pointer.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • In preparation for unifying the pagetable_setup_start() and
    pagetable_setup_done() setup functions, rename appropriately all the
    infrastructure related to pagetable_setup_start().

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     
  • We either use swapper_pg_dir or the argument is unused. Preparatory
    patch to simplify platform pagetable setup further.

    Signed-off-by: Attilio Rao
    Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
    Signed-off-by: Thomas Gleixner

    Attilio Rao
     

07 Sep, 2012

1 commit

  • Since the shift count settable there is used for shifting values
    of type "unsigned long", its value must not match or exceed
    BITS_PER_LONG (otherwise the shift operations are undefined).

    Similarly, the value must not be negative (but -1 must be
    permitted, as that's the value used to distinguish the case of
    the fine grained flushing being disabled).
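
    A sketch of the intended validation in the debugfs write handler (the
    handler shape and variable are illustrative; only the bounds come from
    the description above):

        #include <linux/bitops.h>
        #include <linux/errno.h>

        static long tlb_flushall_shift = -1;    /* illustrative stand-in */

        static int set_tlb_flushall_shift(long shift)
        {
                /* -1 means fine-grained flushing is disabled; any other value
                 * is used to shift an unsigned long, so it must not be
                 * negative and must stay below BITS_PER_LONG. */
                if (shift < -1 || shift >= BITS_PER_LONG)
                        return -EINVAL;

                tlb_flushall_shift = shift;
                return 0;
        }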

    Signed-off-by: Jan Beulich
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/5049B65C020000780009990C@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

22 Aug, 2012

1 commit

  • Each page mapped in a process's address space must be correctly
    accounted for in _mapcount. Normally the rules for this are
    straightforward but hugetlbfs page table sharing is different. The page
    table pages at the PMD level are reference counted while the mapcount
    remains the same.

    If this accounting is wrong, it causes bugs like this one reported by
    Larry Woodman:

    kernel BUG at mm/filemap.c:135!
    invalid opcode: 0000 [#1] SMP
    CPU 22
    Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
    Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
    RIP: 0010:[] [] __delete_from_page_cache+0x15d/0x170
    Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
    Call Trace:
    delete_from_page_cache+0x40/0x80
    truncate_hugepages+0x115/0x1f0
    hugetlbfs_evict_inode+0x18/0x30
    evict+0x9f/0x1b0
    iput_final+0xe3/0x1e0
    iput+0x3e/0x50
    d_kill+0xf8/0x110
    dput+0xe2/0x1b0
    __fput+0x162/0x240

    During fork(), copy_hugetlb_page_range() detects whether huge_pte_alloc()
    shared page tables with the check dst_pte == src_pte. The logic is that if
    the PMD page is the same, they must be shared. This assumes that the
    sharing is between the parent and child. However, if the sharing is with a
    different process entirely, then this check fails, as in this diagram:

        parent
          |
          ------------>pmd
          src_pte----------> data page
                                  ^
        other--------->pmd--------|
                        ^
        child-----------|
          dst_pte

    For this situation to occur, it must be possible for Parent and Other to
    have faulted and failed to share page tables with each other. This is
    possible due to the following style of race.

    PROC A                                  PROC B
    copy_hugetlb_page_range                 copy_hugetlb_page_range
      src_pte == huge_pte_offset              src_pte == huge_pte_offset
      !src_pte so no sharing                  !src_pte so no sharing

    (time passes)

    hugetlb_fault                           hugetlb_fault
      huge_pte_alloc                          huge_pte_alloc
        huge_pmd_share                          huge_pmd_share
          LOCK(i_mmap_mutex)
          find nothing, no sharing
          UNLOCK(i_mmap_mutex)
                                                LOCK(i_mmap_mutex)
                                                find nothing, no sharing
                                                UNLOCK(i_mmap_mutex)
        pmd_alloc                             pmd_alloc
          LOCK(instantiation_mutex)
          fault
          UNLOCK(instantiation_mutex)
                                              LOCK(instantiation_mutex)
                                              fault
                                              UNLOCK(instantiation_mutex)

    These two processes are now pointing to the same data page but are not
    sharing page tables because the opportunity was missed. When either
    process later forks, the src_pte == dst_pte check is potentially
    insufficient. As the check falls through, the wrong PTE information is
    copied in (harmless but wrong) and the mapcount is bumped for a page
    mapped by a shared page table, leading to the BUG_ON.

    This patch addresses the issue by moving pmd_alloc into huge_pmd_share,
    which guarantees that the shared pud is populated in the same critical
    section as the pmd. This also means that the huge_pte_offset test in
    huge_pmd_share is now serialized correctly, which in turn means that the
    chances of the sharing succeeding are higher, as the racing tasks see the
    pud and pmd populated together.

    Race identified and changelog written mostly by Mel Gorman.

    [akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
    Reported-by: Larry Woodman
    Tested-by: Larry Woodman
    Reviewed-by: Mel Gorman
    Signed-off-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Cong Wang
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Aug, 2012

1 commit

  • This reverts commit bacef661acdb634170a8faddbc1cf28e8f8b9eee.

    This commit has been found to cause serious regressions on a number of
    ASUS machines at the least. We probably need to provide a 1:1 map in
    addition to the EFI virtual memory map in order for this to work.

    Signed-off-by: H. Peter Anvin
    Reported-and-bisected-by: Jérôme Carretero
    Cc: Jan Beulich
    Cc: Matt Fleming
    Cc: Matthew Garrett
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu

    H. Peter Anvin
     

03 Aug, 2012

1 commit

  • Otherwise you could run into a WARN_ON in numa_register_memblks(),
    because node_possible_map is zero.

    References: https://bugzilla.novell.com/show_bug.cgi?id=757888

    On this machine (ProLiant ML570 G3) the SRAT table contains:
    - No processor affinities
    - One memory affinity structure (which is set disabled)

    CC: Per Jessen
    CC: Andi Kleen
    Signed-off-by: Thomas Renninger
    Signed-off-by: Len Brown

    Thomas Renninger
     

27 Jul, 2012

2 commits

  • Pull x86/mm changes from Peter Anvin:
    "The big change here is the patchset by Alex Shi to use INVLPG to flush
    only the affected pages when we only need to flush a small page range.

    It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
    vectors!) and replaces them with an ordinary IPI function call."

    Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
    to changed line)

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix build warning and crash when building for !SMP
    x86/tlb: do flush_tlb_kernel_range by 'invlpg'
    x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
    x86/tlb: enable tlb flush range support for x86
    mm/mmu_gather: enable tlb flush range in generic mmu_gather
    x86/tlb: add tlb_flushall_shift knob into debugfs
    x86/tlb: add tlb_flushall_shift for specific CPU
    x86/tlb: fall back to flush all when meet a THP large page
    x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
    x86/tlb_info: get last level TLB entry number of CPU
    x86: Add read_mostly declaration/definition to variables from smp.h
    x86: Define early read-mostly per-cpu macros

    Linus Torvalds
     
  • Pull x86/efi changes from Ingo Molnar:
    "This tree adds an EFI bootloader handover protocol, which, once
    supported on the bootloader side, will make bootup faster and might
    result in simpler bootloaders.

    The other change activates the EFI wall clock time accessors on x86-64
    as well, instead of the legacy RTC readout."

    * 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, efi: Handover Protocol
    x86-64/efi: Use EFI to deal with platform wall clock

    Linus Torvalds
     

28 Jun, 2012

7 commits

  • This patch does flush_tlb_kernel_range by 'invlpg'. The performance cost
    and gain were analyzed in a previous patch
    (x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range).

    In the testing: http://lkml.org/lkml/2012/6/21/10

    The cost is mostly covered by the long kernel path, but the gain is still
    quite clear: memory access in a user application can speed up by 30+%
    when the kernel executes this function.

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-10-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • There are currently 32 INVALIDATE_TLB_VECTORs in the kernel, which is
    quite a large number of vectors in the IDT. Yet it is still not enough,
    since modern x86 servers have even more CPUs, and that still causes heavy
    lock contention in TLB flushing.

    The patch uses the generic SMP call function to replace them. That frees
    32 vector numbers in the IDT and resolves the lock contention in TLB
    flushing on large systems.

    On an NHM-EX machine with 4P * 8 cores * HT = 64 CPUs, hackbench pthread
    shows a 3% performance increase.
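
    In outline, the remote flush becomes an ordinary SMP function call
    instead of a dedicated vector (the names here are illustrative;
    smp_call_function_many() is the generic kernel primitive):

        #include <linux/smp.h>
        #include <linux/mm_types.h>

        struct flush_tlb_info_sketch {
                struct mm_struct *mm;
                unsigned long start, end;
        };

        /* Runs on every target CPU from the function-call IPI handler. */
        static void flush_tlb_func_sketch(void *data)
        {
                struct flush_tlb_info_sketch *info = data;

                /* ... flush the local TLB for info->mm / [start, end) ... */
                (void)info;
        }

        static void flush_tlb_others_sketch(const struct cpumask *mask,
                                            struct flush_tlb_info_sketch *info)
        {
                /* One generic IPI replaces the 32 INVALIDATE_TLB_VECTOR slots. */
                smp_call_function_many(mask, flush_tlb_func_sketch, info, 1);
        }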

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • Not every tlb_flush invocation really needs to evacuate all TLB entries;
    in munmap, for example, a few 'invlpg' instructions are better for
    whole-process performance, since they leave most of the TLB entries in
    place for later accesses.

    This patch also rewrites flush_tlb_range for two purposes:
    1. Split it out to get a flush_tlb_mm_range function.
    2. Clean it up to reduce line breaking; thanks to Borislav for the input.

    My micro benchmark 'munmap' (http://lkml.org/lkml/2012/5/17/59)
    shows that random memory access on another CPU gets a 0-50% speedup
    on a 2P * 4cores * HT NHM EP machine while doing 'munmap'.

    Thanks Yongjie's testing on this patch:
    -------------
    I used Linux 3.4-RC6 w/ and w/o his patches as Xen dom0 and guest
    kernel.
    After running two benchmarks in Xen HVM guest, I found his patches
    brought about 1%~3% performance gain in 'kernel build' and 'netperf'
    testing, though the performance gain was not very stable in 'kernel
    build' testing.

    Some detailed testing results are below.

    Testing Environment:
    Hardware: Romley-EP platform
    Xen version: latest upstream
    Linux kernel: 3.4-RC6
    Guest vCPU number: 8
    NIC: Intel 82599 (10GB bandwidth)

    In 'kernel build' testing in guest:
    Command line | performance gain
    make -j 4 | 3.81%
    make -j 8 | 0.37%
    make -j 16 | -0.52%

    In 'netperf' testing, we tested TCP_STREAM with default socket size
    16384 byte as large packet and 64 byte as small packet.
    I used several clients to add networking pressure, and the 'netperf'
    server automatically generated several threads to respond to them.
    I also used large-size packet and small-size packet in the testing.
    Packet size | Thread number | performance gain
    16384 bytes | 4 | 0.02%
    16384 bytes | 8 | 2.21%
    16384 bytes | 16 | 2.04%
    64 bytes | 4 | 1.07%
    64 bytes | 8 | 3.31%
    64 bytes | 16 | 0.71%

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-8-git-send-email-alex.shi@intel.com
    Tested-by: Ren, Yongjie
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • The kernel will replace a cr3 rewrite with invlpg when the number of
    entries to flush is small enough relative to the tlb_flushall_shift
    setting exposed through this debugfs knob (a value of -1 disables the
    replacement entirely).
    Link: http://lkml.kernel.org/r/1340845344-27557-6-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • Testing shows that different CPU types (microarchitectures and NUMA
    modes) have different balance points between flushing the whole TLB and
    issuing multiple invlpg instructions. There are also cases where the TLB
    flush change does not help at all.

    This patch provides an interface to let x86 vendor developers set a
    different shift for each CPU type.

    For example, on some machines at hand the balance point is 16 entries on
    Romley-EP, 8 entries on Bloomfield NHM-EP, and 256 on an IVB mobile CPU,
    while on a model 15 Core2 Xeon using invlpg does not help at all.

    For untested machines, apply a conservative optimization, the same as for
    NHM CPUs.

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • We don't need to flush large pages in PAGE_SIZE steps; that just wastes
    time. And actually, large pages don't need the 'invlpg' optimization
    according to our micro benchmark. So just flushing the whole TLB is
    enough for them.

    The following results were measured on a 2CPU * 4cores * 2HT NHM EP
    machine, with the THP 'always' setting.

    Multi-thread testing, the '-t' parameter is the thread number:
                         without this patch   with this patch
    ./mprotect -t 1      14ns                 13ns
    ./mprotect -t 2      13ns                 13ns
    ./mprotect -t 4      12ns                 11ns
    ./mprotect -t 8      14ns                 10ns
    ./mprotect -t 16     28ns                 28ns
    ./mprotect -t 32     54ns                 52ns
    ./mprotect -t 128    200ns                200ns

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-4-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     
  • x86 has no flush_tlb_range support at the instruction level. Currently
    flush_tlb_range is just implemented by flushing the whole TLB. That is
    not the best solution for all scenarios. In fact, if we just use 'invlpg'
    to flush a few lines from the TLB, we can get a performance gain from the
    remaining TLB lines being accessed later.

    But the 'invlpg' instruction costs a lot of time. Its execution time can
    compete with a cr3 rewrite, and is even a bit more on SNB CPUs.

    So, on a CPU with 512 4KB TLB entries, the balance point is at:
    (512 - X) * 100ns (assumed TLB refill cost) =
    X (TLB flush entries) * 100ns (assumed invlpg cost)

    Here, X is 256, that is, 1/2 of the 512 entries.

    But with the mysterious CPU prefetcher and page miss handler unit, the
    assumed TLB refill cost is far lower than 100ns in sequential access. And
    2 HT siblings in one core make memory access even faster if they are
    accessing the same memory. So, in the patch, I just do the change when
    the number of target entries is less than 1/16 of the active TLB entries.
    Actually, I have no data to support the '1/16' percentage, so any
    suggestions are welcome.
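
    Put together as code, the decision ends up looking roughly like this
    (sketch only; act_entries and the shift would come from the CPU's
    detected TLB size and the per-CPU tlb_flushall_shift):

        #include <asm/tlbflush.h>

        static void flush_range_sketch(unsigned long start, unsigned long end,
                                       unsigned long act_entries, int shift)
        {
                unsigned long nr = (end - start) >> PAGE_SHIFT;
                unsigned long addr;

                /* Too many pages relative to the active TLB entries (or the
                 * feature is disabled with shift == -1): a full flush is
                 * cheaper. */
                if (shift < 0 || nr > (act_entries >> shift)) {
                        local_flush_tlb();
                        return;
                }

                /* Otherwise invalidate just the affected lines and keep the
                 * rest of the TLB warm for later accesses. */
                for (addr = start; addr < end; addr += PAGE_SIZE)
                        __flush_tlb_single(addr);
        }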

    As for hugetlb, presumably due to the smaller page tables and fewer
    active TLB entries, I didn't see a benefit in my benchmark, so there is
    no optimization for it now.

    My micro benchmark shows that in ideal scenarios the read performance
    improves by 70 percent. And in the worst scenario, the read/write
    performance is similar to the unpatched 3.4-rc4 kernel.

    Here is the read data on my 2P * 4cores * HT NHM EP machine, with THP
    'always':

    Multi-thread testing, the '-t' parameter is the thread number:
                         with patch   unpatched 3.4-rc4
    ./mprotect -t 1      14ns         24ns
    ./mprotect -t 2      13ns         22ns
    ./mprotect -t 4      12ns         19ns
    ./mprotect -t 8      14ns         16ns
    ./mprotect -t 16     28ns         26ns
    ./mprotect -t 32     54ns         51ns
    ./mprotect -t 128    200ns        199ns

    Single process with sequential flushing and memory accessing:
                                      with patch   unpatched 3.4-rc4
    ./mprotect                        7ns          11ns
    ./mprotect -p 4096 -l 8 -n 10240  21ns         21ns

    [ hpa: http://lkml.kernel.org/r/1B4B44D9196EFF41AE41FDA404FC0A100BFF94@SHSMSX101.ccr.corp.intel.com
    has additional performance numbers. ]

    Signed-off-by: Alex Shi
    Link: http://lkml.kernel.org/r/1340845344-27557-3-git-send-email-alex.shi@intel.com
    Signed-off-by: H. Peter Anvin

    Alex Shi
     

11 Jun, 2012

1 commit

  • Fix kernel-doc warnings in arch/x86/mm/ioremap.c and
    arch/x86/mm/pageattr.c, just like this one:

    Warning(arch/x86/mm/ioremap.c:204):
    No description found for parameter 'phys_addr'
    Warning(arch/x86/mm/ioremap.c:204):
    Excess function parameter 'offset' description in 'ioremap_nocache'
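
    For reference, such warnings go away when the kernel-doc parameter lines
    match the actual signature; a hypothetical example:

        /**
         * ioremap_example - map a physical range into kernel virtual space
         * @phys_addr:  bus address of the start of the range
         * @size:       number of bytes to map
         *
         * Every @name line must correspond to a real parameter; a stale name
         * left over from a rename (such as 'offset' above) produces the
         * "Excess function parameter" warning, and a missing line produces
         * "No description found for parameter".
         */
        void __iomem *ioremap_example(resource_size_t phys_addr,
                                      unsigned long size);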

    Signed-off-by: Wanpeng Li
    Cc: Gavin Shan
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1339296652-2935-1-git-send-email-liwp.linux@gmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

08 Jun, 2012

1 commit

  • …ion early page table space

    Robin found this regression:

    | I just tried to boot an 8TB system. It fails very early in boot with:
    | Kernel panic - not syncing: Cannot find space for the kernel page tables

    git bisect points to commit 722bc6b16771ed80871e1fd81c86d3627dda2ac8.

    A git revert of that commit does boot past that point on the 8TB
    configuration.

    That commit adds extra pages for all memory ranges, even above 4g.

    Try to limit that extra page count addition to the first entry only.

    Bisected-by: Robin Holt <holt@sgi.com>
    Tested-by: Robin Holt <holt@sgi.com>
    Signed-off-by: Yinghai Lu <yinghai@kernel.org>
    Cc: WANG Cong <xiyou.wangcong@gmail.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Link: http://lkml.kernel.org/r/CAE9FiQUj3wyzQxtq9yzBNc9u220p8JZ1FYHG7t%3DMOzJ%3D9BZMYA@mail.gmail.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Yinghai Lu
     

06 Jun, 2012

2 commits

  • When hot-adding a CPU, the system outputs the following messages,
    since node_to_cpumask_map[2] was not allocated memory.

    Booting Node 2 Processor 32 APIC 0xc0
    node_to_cpumask_map[2] NULL
    Pid: 0, comm: swapper/32 Tainted: G A 3.3.5-acd #21
    Call Trace:
    [] debug_cpumask_set_cpu+0x155/0x160
    [] ? add_timer_on+0xaa/0x120
    [] numa_add_cpu+0x1e/0x22
    [] identify_cpu+0x1df/0x1e4
    [] identify_econdary_cpu+0x16/0x1d
    [] smp_store_cpu_info+0x3c/0x3e
    [] smp_callin+0x139/0x1be
    [] start_secondary+0x13/0xeb

    The reason is that the bit for node 2 was not set in
    numa_nodes_parsed. numa_nodes_parsed is set only by
    acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init. Thus, even if hot-added memory
    in the same PXM as the hot-added CPU is described in the ACPI
    SRAT table, numa_nodes_parsed is not set when the hot-added CPU
    itself is not described there.

    But according to ACPI Spec Rev 5.0, it says about ACPI SRAT
    table as follows: This optional table provides information that
    allows OSPM to associate processors and memory ranges, including
    ranges of memory provided by hot-added memory devices, with
    system localities / proximity domains and clock domains.

    It means that the ACPI SRAT table only provides information for
    CPUs present at boot time and for memory, including hot-added
    memory. So hot-added memory is described in the ACPI SRAT table,
    but a hot-added CPU is not. Thus numa_nodes_parsed should be set
    not only by acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init but also by
    acpi_numa_memory_affinity_init for this case.

    Additionally, if the system has a CPU-less memory node,
    acpi_numa_processor_affinity_init /
    acpi_numa_x2apic_affinity_init cannot set numa_nodes_parsed,
    since these functions cannot find a CPU description for the
    node. In this case, numa_nodes_parsed needs to be set by
    acpi_numa_memory_affinity_init.
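
    The essence of the fix is a one-line addition; a sketch of where it
    would sit (surrounding code abbreviated, helper names approximate):

        #include <linux/acpi.h>
        #include <asm/numa.h>

        static int __init
        srat_memory_affinity_sketch(struct acpi_srat_mem_affinity *ma)
        {
                int node = acpi_map_pxm_to_node(ma->proximity_domain);

                /* ... existing validity checks and numa_add_memblk() ... */

                /* New: memory-only (or not-yet-populated) nodes are recorded
                 * too, so node_to_cpumask_map[] is allocated for them and a
                 * later CPU hot-add finds the node already parsed. */
                node_set(node, numa_nodes_parsed);

                return 0;
        }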

    Signed-off-by: Yasuaki Ishimatsu
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: liuj97@gmail.com
    Cc: kosaki.motohiro@gmail.com
    Link: http://lkml.kernel.org/r/4FCC2098.4030007@jp.fujitsu.com
    [ merged it ]
    Signed-off-by: Ingo Molnar

    Yasuaki Ishimatsu
     
  • Unlike ix86, x86-64 on EFI so far didn't set the
    {g,s}et_wallclock accessors to the EFI routines, thus
    incorrectly using raw RTC accesses instead.

    Simply removing the #ifdef around the respective code isn't
    enough, however: while so far early get-time calls were done in
    physical mode, this doesn't work properly for x86-64, as virtual
    addresses would still need to be set up for all runtime regions
    (which wasn't the case on the system I have access to), so
    instead the patch moves the call to efi_enter_virtual_mode()
    ahead (which in turn allows dropping all code related to calling
    efi-get-time in physical mode).

    Additionally, the earlier calling of efi_set_executable()
    requires the CPA code to cope, i.e. during early boot it must
    avoid calling cpa_flush_array(), as the first thing this
    function does is a BUG_ON(irqs_disabled()).

    Also make the two EFI functions in question here static -
    they're not being referenced elsewhere.

    Signed-off-by: Jan Beulich
    Tested-by: Matt Fleming
    Acked-by: Matthew Garrett
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FBFBF5F020000780008637F@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

30 May, 2012

1 commit

  • Function pat_pagerange_is_ram() scales poorly to large address
    ranges, because it probes the resource tree for each page.

    On a 2.6 GHz Opteron, this function consumes 34 ms for a 1 GB range.

    It is called twice during untrack_pfn_vma(), slowing process
    cleanup and handicapping the OOM killer.

    This replacement consumes less than 1ms, under the same conditions.

    Signed-off-by: John Dykstra on behalf of Cray Inc.
    Acked-by: Suresh Siddha
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337980366.1979.6.camel@redwood
    [ Small stylistic cleanups and renames ]
    Signed-off-by: Ingo Molnar

    John Dykstra