13 Jan, 2012

6 commits

  • This patch started off as a cleanup: __split_huge_page_refcount() has to
    cope with two scenarios, when the hugepage being split is already on the
    LRU and when it is not; but why does it have to split that accounting
    across three different sites? Consolidate it in lru_add_page_tail(),
    handling evictable and unevictable alike, and use the standard
    add_page_to_lru_list() when accounting is needed (when the head is not yet
    on the LRU).

    But a recent regression in -next (I guess the removal of the
    PageCgroupAcctLRU test from mem_cgroup_split_huge_fixup()) makes this a
    necessary fix now: under load, the MEM_CGROUP_ZSTAT count was wrapping to
    a huge number, messing up reclaim calculations and causing a freeze at
    rmdir of the cgroup.

    Add a VM_BUG_ON to mem_cgroup_lru_del_list() when we're about to wrap that
    count - this has not been the only such incident. Document that
    lru_add_page_tail() is for Transparent HugePages by wrapping it in an
    #ifdef. (A small illustration of the kind of wrap being guarded against
    follows this entry.)

    Signed-off-by: Hugh Dickins
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
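
    To illustrate the kind of wrap the VM_BUG_ON guards against: decrementing
    an unsigned per-lru counter below zero makes it wrap to a huge value,
    which is exactly what the MEM_CGROUP_ZSTAT freeze looked like. A minimal
    userspace sketch (not the kernel code; the names are only illustrative):

    ===
    #include <assert.h>
    #include <stdio.h>

    /* stand-in for a per-memcg, per-lru page count */
    static unsigned long lru_size;

    static void lru_del_pages(unsigned long nr)
    {
        /* analogous to the VM_BUG_ON added by the patch: catch the wrap */
        assert((long)(lru_size - nr) >= 0);
        lru_size -= nr;
    }

    int main(void)
    {
        lru_size = 1;
        lru_del_pages(1);   /* fine: 1 -> 0 */
        lru_del_pages(1);   /* would wrap 0 -> ULONG_MAX; the assert fires */
        printf("lru_size = %lu\n", lru_size);
        return 0;
    }
    ===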
     
  • When an isolated hugepage is split, put its tail subpages at the head of
    the lru reclaim list, since they should presumably be isolated next as
    well.

    For non-isolated hugepages under splitting, queue the subpages in physical
    order on the lru. That might provide some theoretical cache benefit to the
    buddy allocator later.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • We have tlb_remove_tlb_entry() to indicate that a pte entry should be
    flushed from the TLB, but no corresponding API for a pmd entry. This
    isn't a problem so far because THP is currently x86-only and tlb_flush()
    on x86 flushes the entire TLB. But it is confusing and could be missed if
    THP is ported to another arch. (A sketch of the shape such a pmd variant
    takes follows this entry.)

    Also convert the tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page(), as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry(), so we can catch any misuse.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
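
    By analogy with the existing tlb_remove_tlb_entry() for ptes, a pmd
    variant only has to mark the gather structure as needing a flush before
    invoking the arch hook. A compile-and-run sketch with stand-in types (the
    real definitions live in the tlb headers; everything below is
    illustrative, not the actual kernel code):

    ===
    #include <stddef.h>
    #include <stdio.h>

    struct mmu_gather { int need_flush; };  /* stand-in for the real struct */

    /* arch hook that would record/flush the pmd entry; a no-op here */
    #define __tlb_remove_pmd_tlb_entry(tlb, pmdp, address)  ((void)0)

    /* mirrors how tlb_remove_tlb_entry() works for pte entries */
    #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)            \
        do {                                                        \
            (tlb)->need_flush = 1;                                  \
            __tlb_remove_pmd_tlb_entry(tlb, pmdp, address);         \
        } while (0)

    int main(void)
    {
        struct mmu_gather tlb = { .need_flush = 0 };

        tlb_remove_pmd_tlb_entry(&tlb, NULL, 0x200000UL);
        /* __tlb_remove_page() can now VM_BUG_ON(!tlb->need_flush) */
        printf("need_flush = %d\n", tlb.need_flush);
        return 0;
    }
    ===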
     
  • change_protection() will do a TLB flush later; there is no need for a
    duplicate flush here.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Improve the error code path; for example, delete the sysfs file that
    would otherwise be left behind on failure. Also remove the #ifdef xxx to
    make the code cleaner.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifications. It takes move_lock_page_cgroup(), modifies the
    page_cgroup and does the LRU accounting, and is called HPAGE_PMD_NR - 1
    times.

    But thinking again:
    - compound_lock() is held at move_account()... so it's not necessary to
      take move_lock_page_cgroup().
    - The LRU is locked, and all tail pages will go onto the same LRU the
      head is now on.
    - page_cgroup is contiguous across the huge page range.

    This patch changes mem_cgroup_split_huge_fixup() to be called once per
    hugepage, reducing the cost of splitting. (A stand-alone sketch of the
    new shape follows this entry.)

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
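
    The shape of the change, as a stand-alone sketch: instead of invoking the
    fixup once per tail page, split_huge_page() invokes it once and the fixup
    walks the contiguous page_cgroup range itself (the structures and fields
    below are placeholders, not the real memcg code):

    ===
    #include <stdio.h>

    #define HPAGE_PMD_NR 512        /* 4k subpages per 2M huge page on x86 */

    struct page_cgroup { void *mem_cgroup; int used; };

    /* contiguous page_cgroup range backing one huge page */
    static struct page_cgroup pc[HPAGE_PMD_NR];

    /* called once per hugepage; propagates the head's state to every tail */
    static void split_huge_fixup(struct page_cgroup *head_pc)
    {
        for (int i = 1; i < HPAGE_PMD_NR; i++) {
            head_pc[i].mem_cgroup = head_pc->mem_cgroup;
            head_pc[i].used = head_pc->used;
        }
    }

    int main(void)
    {
        int dummy_cgroup;

        pc[0].mem_cgroup = &dummy_cgroup;
        pc[0].used = 1;
        split_huge_fixup(&pc[0]);   /* one call instead of 511 */
        printf("tail 511: used=%d\n", pc[HPAGE_PMD_NR - 1].used);
        return 0;
    }
    ===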
     

09 Dec, 2011

1 commit

  • khugepaged can sometimes cause suspend to fail, requiring that the user
    retry the suspend operation.

    Use wait_event_freezable_timeout() instead of
    schedule_timeout_interruptible() to avoid missing freezer wakeups. A
    try_to_freeze() would also have been needed in the khugepaged_alloc_hugepage
    tight loop, in case the allocation fails repeatedly, and
    wait_event_freezable_timeout() provides that too.

    khugepaged would still freeze just fine by trying again the next minute,
    but it's better if it freezes immediately.

    Reported-by: Jiri Slaby
    Signed-off-by: Andrea Arcangeli
    Tested-by: Jiri Slaby
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Srivatsa S. Bhat"
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Nov, 2011

1 commit

  • Michel, while working on the working set estimation code, noticed that
    calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't
    safe if the pfn ended up being a tail page of a transparent hugepage
    under splitting by __split_huge_page_refcount().

    He then found that the problem could also theoretically materialize with
    page_cache_get_speculative() during the speculative radix tree lookups
    that use get_page_unless_zero() on SMP, if the radix tree page is freed
    and reallocated and get_user_pages() is called on it before
    page_cache_get_speculative() has a chance to call get_page_unless_zero().

    So the best way to fix the problem is to keep page_tail->_count zero at
    all times. This will guarantee that get_page_unless_zero() can never
    succeed on any tail page. page_tail->_mapcount is guaranteed zero and
    is unused for all tail pages of a compound page, so we can simply
    account the tail page references there and transfer them to
    tail_page->_count in __split_huge_page_refcount() (in addition to the
    head_page->_mapcount).

    While debugging this s/_count/_mapcount/ change I also noticed that
    get_page() is called by direct-io.c on pages returned by
    get_user_pages(). That wasn't entirely safe because the two atomic_inc()s
    in get_page() weren't atomic with respect to each other. Other
    get_user_pages() users, like the secondary-MMU page fault that
    establishes the shadow pagetables, would never call any superfluous
    get_page() after get_user_pages() returns. It's safer to make get_page()
    universally safe for tail pages and to use get_page_foll() within
    follow_page() (inside get_user_pages()). get_page_foll() can safely do
    the refcounting for tail pages without taking any locks because it runs
    within PT-lock-protected critical sections (the PT lock for ptes and
    page_table_lock for pmd_trans_huge).

    The standard get_page(), as invoked by direct-io, will now take the
    compound_lock, but still only for tail pages. The direct-io paths are
    usually I/O bound and the compound_lock is per THP, so very fine-grained;
    there's no risk of scalability issues with it. A simple direct-io
    benchmark with all the lockdep prove-locking and spinlock debugging
    infrastructure enabled shows identical performance and no overhead. So
    it's worth it. Ideally direct-io should stop calling get_page() on pages
    returned by get_user_pages(). The spinlock in get_page() is already
    optimized away for non-THP builds, but doing get_page() on tail pages
    returned by GUP is generally a rare operation and usually only run in I/O
    paths.

    This new refcounting in page_tail->_mapcount, in addition to avoiding new
    RCU critical sections, will also allow the working set estimation code to
    work without any further complexity associated with tail page refcounting
    under THP. (A userspace sketch of the "increment only if non-zero"
    primitive that makes zero tail counts safe follows this entry.)

    Signed-off-by: Andrea Arcangeli
    Reported-by: Michel Lespinasse
    Reviewed-by: Michel Lespinasse
    Reviewed-by: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
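
    The property the fix relies on is that get_page_unless_zero() only takes
    a reference if the count is currently non-zero, so keeping tail
    page->_count at zero makes it fail deterministically on tail pages. A
    userspace sketch of that primitive (the kernel uses atomic_inc_not_zero();
    this is only an illustration):

    ===
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* take a reference only if *count is non-zero, cf. get_page_unless_zero() */
    static bool get_ref_unless_zero(atomic_int *count)
    {
        int old = atomic_load(count);

        while (old != 0) {
            if (atomic_compare_exchange_weak(count, &old, old + 1))
                return true;    /* reference taken */
        }
        return false;           /* count was zero: not safe to pin */
    }

    int main(void)
    {
        atomic_int head_count = 1, tail_count = 0;

        printf("head: %d\n", get_ref_unless_zero(&head_count));  /* 1 */
        printf("tail: %d\n", get_ref_unless_zero(&tail_count));  /* 0 */
        return 0;
    }
    ===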
     

01 Nov, 2011

4 commits

  • There are three calls to update_mmu_cache() in the file, and the one in
    collapse_huge_page() has a typo in its last parameter, which is corrected
    here based on the other two call sites.

    Because of how update_mmu_cache() is defined on x86, currently the only
    arch that implements THP, the change has no practical effect, but it
    could save a minute or two of effort for archs that are likely to support
    THP in the future.

    Signed-off-by: Hillf Danton
    Cc: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • The THP copy-on-write handler falls back to regular-sized pages for a
    huge page replacement upon allocation failure, or if THP has been
    individually disabled in the target VMA. The loop responsible for copying
    page-sized chunks accidentally uses multiples of PAGE_SHIFT instead of
    PAGE_SIZE as the virtual address argument for copy_user_highpage(). (A
    trivial illustration of the difference follows this entry.)

    Signed-off-by: Hillf Danton
    Acked-by: Johannes Weiner
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
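
    The bug and the fix, reduced to the address arithmetic only (constants
    assumed for x86; this is just an illustration, not the kernel loop):

    ===
    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)  /* 4096 */

    int main(void)
    {
        unsigned long haddr = 0x200000;     /* hypothetical huge-page-aligned address */

        for (int i = 0; i < 3; i++) {
            /* buggy: advances by 12 bytes per iteration */
            printf("wrong: %#lx\n", haddr + i * PAGE_SHIFT);
            /* fixed: advances by one page (4096 bytes) per iteration */
            printf("right: %#lx\n", haddr + i * PAGE_SIZE);
        }
        return 0;
    }
    ===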
     
  • Quiet the sparse noise:

    warning: symbol 'khugepaged_scan' was not declared. Should it be static?
    warning: context imbalance in 'khugepaged_scan_mm_slot' - unexpected unlock

    Signed-off-by: H Hartley Sweeten
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     
  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (5UL*1024*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);
        if (p == MAP_FAILED || p4 != p3)
            //if (p == MAP_FAILED)
            perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
            printf("mremap bug\n"), exit(1);
        printf("ok\n");

        return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    The measurement of only the mremap time (so excluding the three long
    memsets and the final 10GB memory-accessing memcmp):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel, without the mremap optimizations (the patch below
    optimizes the non-THP version too):

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess with a threaded program that sends more IPIs on a large SMP
    system it would create an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only catch is that for a native 2M mremap, as happens above, both the
    source and destination addresses must be 2M aligned or the huge pmd can't
    be moved without a split; but that is a hardware limitation.

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Jun, 2011

1 commit

  • Johannes noticed that the vmstat update is already taken care of
    internally by khugepaged_alloc_hugepage(). The only places required to
    update the vmstat are the callers of alloc_hugepage() (callers of
    khugepaged_alloc_hugepage() aren't).

    Signed-off-by: Andrea Arcangeli
    Reported-by: Johannes Weiner
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

2 commits

  • We don't need to hold the mmap_sem through mem_cgroup_newpage_charge();
    the mmap_sem is only held to keep the vma stable, and we no longer need
    the vma to be stable after we return from alloc_hugepage_vma().

    Signed-off-by: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Apr, 2011

1 commit

  • The huge_memory.c THP page fault was allowed to run if vm_ops was NULL
    (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
    set up a special vma->vm_ops and it would fall back to regular anonymous
    memory), but other THP logic wasn't fully activated for vmas with a
    non-NULL vm_file (/dev/zero has a non-NULL vma->vm_file).

    So this removes the vm_file checks so that /dev/zero can also safely use
    THP (the other, albeit safer, approach to fixing this bug would have been
    to prevent the initial THP page fault from running if vm_file was set).

    After removing the vm_file checks, this also makes huge_memory.c stricter
    in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
    check with an is_pfn_mapping() check (but it keeps checking for VM_PFNMAP
    under VM_BUG_ON) because, for an is_cow_mapping() mapping, VM_PFNMAP
    should only be allowed to exist before the first page fault, and in turn
    only while vma->anon_vma is still NULL (which prevents khugepaged
    registration). So I tend to think the previous comment - saying that if
    vm_file was set, VM_PFNMAP might have been set too and we could still be
    registered in khugepaged (despite anon_vma having to be non-NULL for
    khugepaged registration) - was too paranoid. The is_linear_pfn_mapping()
    check is also, I think, superfluous (as described by the comment), but
    under DEBUG_VM it is safe to keep.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682

    Signed-off-by: Andrea Arcangeli
    Reported-by: Caspar Zhang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2011

2 commits

  • The conventional format for boolean attributes in sysfs is numeric ("0"
    or "1", followed by a newline). Any boolean attribute can then be read
    and written using a generic function. Using the strings "yes [no]" /
    "[yes] no" (read) and "yes" / "no" (write) frustrates this. (A userland
    sketch of parsing the numeric form follows this entry.)

    [akpm@linux-foundation.org: use kstrtoul()]
    [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
    Signed-off-by: Ben Hutchings
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Tested-by: David Rientjes
    Cc: NeilBrown
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
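
    For reference, the numeric convention means a store handler only has to
    accept "0" or "1". A userland sketch of the parsing (the kernel side uses
    kstrtoul(); the function below is purely illustrative):

    ===
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* parse a boolean sysfs-style write: "0" or "1", optional trailing newline */
    static int parse_bool_attr(const char *buf, int *value)
    {
        char *end;
        unsigned long v;

        errno = 0;
        v = strtoul(buf, &end, 10);
        if (errno || end == buf || (*end && strcmp(end, "\n")) || v > 1)
            return -EINVAL;
        *value = (int)v;
        return 0;
    }

    int main(void)
    {
        int flag;

        printf("%d\n", parse_bool_attr("1\n", &flag));    /* 0: ok, flag == 1 */
        printf("%d\n", parse_bool_attr("yes\n", &flag));  /* -EINVAL */
        return 0;
    }
    ===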
     
  • I found it difficult to make sense of transparent huge pages without
    having any counters for their actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

23 Mar, 2011

1 commit

  • Pass __GFP_OTHER_NODE for transparent hugepage NUMA allocations done by
    the khugepaged daemon. This way the low-level accounting of local versus
    remote pages works correctly.

    Contains improvements from Andrea Arcangeli

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Mar, 2011

1 commit

  • THP's collapse_huge_page() has an understandable but ugly difference
    in when its huge page is allocated: inside if NUMA but outside if not.
    It's hardly surprising that the memcg failure path forgot that, freeing
    the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page()
    (or even worse, using the freed page).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Mar, 2011

3 commits

  • Pass down the correct node for a transparent hugepage allocation. Most
    callers continue to use the current node; however, the khugepaged daemon
    now uses the node that the first page to be collapsed already resides on
    instead. This ensures that khugepaged does not mess up local memory for
    an existing process which uses local policy.

    The choice of node is somewhat primitive currently: it just uses the
    node of the first page in the pmd range. An alternative would be to
    look at multiple pages and use the most popular node. I used the
    simplest variant for now which should work well enough for the case of
    all pages being on the same node.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This makes a difference for LOCAL policy, where the node cannot be
    determined from the policy itself, but has to be gotten from the original
    page.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Currently alloc_pages_vma() always uses the local node as policy node for
    the LOCAL policy. Pass this node down as an argument instead.

    No behaviour change from this patch, but it will be needed for follow-on
    patches.

    Acked-by: Andrea Arcangeli
    Signed-off-by: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

16 Feb, 2011

1 commit

  • Transparent hugepages can only be created if rmap is fully functional.
    So we must prevent hugepages from being created while
    is_vma_temporary_stack() is true.

    This also optimizes away some harmless but unnecessary setting of
    khugepaged_scan.address, and it switches some BUG_ONs to VM_BUG_ONs.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

12 Feb, 2011

1 commit

  • mem_cgroup_uncharge_page() should be called in all failure cases after
    mem_cgroup_newpage_charge() is called in huge_memory.c::collapse_huge_page().

    [ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
    [ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
    [ 4209.078674] page flags: 0x40000000004000(head)
    [ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
    [ 4209.082177] (/A)
    [ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
    [ 4209.083412] Call Trace:
    [ 4209.083678] [] ? bad_page+0xe4/0x140
    [ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
    [ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
    [ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
    [ 4209.086110] [] ? free_compound_page+0x1b/0x20
    [ 4209.086699] [] ? __put_compound_page+0x1c/0x30
    [ 4209.087333] [] ? put_compound_page+0x4d/0x200
    [ 4209.087935] [] ? put_page+0x45/0x50
    [ 4209.097361] [] ? khugepaged+0x9e9/0x1430
    [ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
    [ 4209.099121] [] ? khugepaged+0x0/0x1430
    [ 4209.099780] [] ? kthread+0x96/0xa0
    [ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
    [ 4209.101214] [] ? kthread+0x0/0xa0
    [ 4209.101842] [] ? kernel_thread_helper+0x0/0x10

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

03 Feb, 2011

1 commit

  • When a tail page of a THP is poisoned, the head page is poisoned too, and
    the wrong address - that of the head page - is always reported with the
    SIGBUS.

    So when the poisoned page is used by a guest OS running on KVM, after the
    address translation (hva->gpa) by qemu, the wrong process in the guest OS
    is killed by SIGBUS.

    What we expect is that the process using the poisoned tail page is killed
    in the guest OS, not the process using the healthy head page.

    Since it is not good to poison a healthy page, avoid poisoning any page
    other than the one that is really poisoned. (While we poison all pages in
    a huge page in the hugetlb case, we can do this for THP thanks to
    split_huge_page().)

    Here we fix two parts:
    1. Isolate only the poisoned page, to make sure the reported address is
       the address of the poisoned page.
    2. Make the poisoned page behave like a poisoned regular page.

    [akpm@linux-foundation.org: fix spello in comment]
    Signed-off-by: Jin Dongming
    Reviewed-by: Hidetoshi Seto
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jin Dongming
     

21 Jan, 2011

2 commits

  • Now, under THP:

    at charge:
    - the PageCgroupUsed bit is set on every page_cgroup of a hugepage,
      i.e. on all 512 pages.
    at uncharge:
    - the PageCgroupUsed bit is unset only on the head page.

    So some pages are left with the "Used" bit set.

    This patch fixes that by setting the Used bit only on the head page; Used
    bits for tail pages will be set at splitting, if necessary.

    This patch adds this lock order:
    compound_lock() -> page_cgroup_move_lock().

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Two users reported THP-related crashes on 32-bit x86 machines. Their oops
    reports indicated an invalid pte, and subsequent code inspection showed
    that the highpte is actually used after unmap.

    The fix is to unmap the pte only after all operations against it are
    finished.

    Signed-off-by: Johannes Weiner
    Reported-by: Ilya Dryomov
    Reported-by: werner
    Cc: Andrea Arcangeli
    Tested-by: Ilya Dryomov
    Tested-by: Steven Rostedt
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

11 commits

  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it takes little effort to make madvise more dynamic at runtime on
    an existing mapping by making khugepaged aware of madvise.

    MADV_HUGEPAGE: register in khugepaged immediately, without waiting for a
    page fault (which may never happen if all pages are already mapped and
    the "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged, to stop
    collapsing pages where it is not wanted. (A minimal example of the
    madvise() calls follows this entry.)

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
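
    From the application side, the runtime registration is just a madvise()
    call on an existing anonymous mapping. A minimal example (assumes a
    kernel built with THP and a libc that exposes MADV_HUGEPAGE and
    MADV_NOHUGEPAGE):

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define LEN (16UL * 1024 * 1024)

    int main(void)
    {
        /* an existing anonymous mapping, possibly already faulted in */
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            perror("mmap"), exit(1);

        /* register with khugepaged: collapse this range into hugepages */
        if (madvise(p, LEN, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");

        /* ... later, opt the range back out */
        if (madvise(p, LEN, MADV_NOHUGEPAGE))
            perror("madvise(MADV_NOHUGEPAGE)");

        return 0;
    }
    ===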
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
    statistics, so the Active(anon) and Inactive(anon) statistics in
    /proc/meminfo are correct.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On small systems, the extra memory used by the anti-fragmentation memory
    reserve and simply because huge pages are smaller than large pages can
    easily outweigh the benefits of fewer TLB misses.

    A less obvious concern is if run on a NUMA machine with asymmetric node
    sizes and one of them is very small. The reserve could make the node
    unusable.

    In case of the crashdump kernel, OOMs have been observed due to the
    anti-fragmentation memory reserve taking up a large fraction of the
    crashdump image.

    This patch disables transparent hugepages on systems with less than 1GB of
    RAM, but the hugepage subsystem is fully initialized so administrators can
    enable THP through /sys if desired.

    Signed-off-by: Rik van Riel
    Acked-by: Avi Kivity
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • It's unclear why scheduler-friendly kernel threads can't simply be taken
    off the CPU by the scheduler itself. It's safer to stop them, as they can
    trigger memory allocations; if kswapd also freezes itself to avoid
    generating I/O, they have to as well.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • For GRU and EPT, we need gup-fast to set the referenced bit too (this is
    why it's correct to return 0 when shadow_access_mask is zero: it requires
    gup-fast to set the referenced bit). qemu-kvm access already sets the
    young bit in the pte if it isn't zero-copy; if it is zero-copy, or it's a
    shadow-paging/EPT minor fault, we rely on gup-fast to signal that the
    page is in use...

    We also need to check the young bits on the secondary pagetables for NPT
    and not nested shadow mmu as the data may never get accessed again by the
    primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that have not even
    a single subpage in use.

    ->test_young is fully backwards compatible with GRU and other usages that
    don't have young bits in pagetables set by the hardware and that should
    nuke the secondary mmu mappings when ->clear_flush_young runs, just like
    EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page entirely probably wouldn't be so bad
    either, but I thought it was worth keeping, and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Archs implementing Transparent Hugepage Support must implement a function
    called has_transparent_hugepage to be sure the virtual or physical CPU
    supports Transparent Hugepages.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • A huge pmd can only be mapped if the corresponding 2M virtual range is
    fully contained in the vma. At times the VM calls split_vma() twice: if
    the first split_vma() succeeds and the second fails, the first remains in
    effect and is not rolled back. For split_vma() or vma_adjust() to fail,
    an allocation failure is needed, so it's a very unlikely event (the
    out-of-memory killer would normally fire before any allocation failure is
    visible to kernel and userland, and if an out-of-memory condition happens
    it's unlikely to happen exactly here). Nevertheless it's safer to ensure
    that no huge pmd can be left around if the vma is adjusted in a way that
    can no longer fit hugepages at the new vm_start/vm_end addresses.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Allow choosing between the always|madvise default for page faults and
    khugepaged at config time. madvise guarantees zero risk of a higher
    memory footprint for applications (applications using
    madvise(MADV_HUGEPAGE) won't risk using any more memory by backing their
    virtual regions with hugepages).

    Initially set the default to N and don't depend on EMBEDDED.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This tries to be more friendly to filesystems in userland, whose userland
    backends allocate memory in the I/O paths and could deadlock if
    khugepaged held the userland backend's mmap_sem in write mode while
    allocating memory. A memory allocation may wait for writeback I/O
    completion from the daemon, which may be blocked on the mmap_sem in read
    mode if a page fault happens and the daemon wasn't using mlock for the
    memory required for I/O submission and completion.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's mostly a matter of replacing alloc_pages with alloc_pages_vma after
    introducing alloc_pages_vma. khugepaged needs special handling, as the
    allocation has to happen inside collapse_huge_page() where the vma is
    known, and an error has to be returned to the outer loop so it can sleep
    for alloc_sleep_millisecs in case of failure. But it retains the more
    efficient logic of handling allocation failures in khugepaged in the
    CONFIG_NUMA=n case.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli