19 Apr, 2014

1 commit

  • The soft lockup when freeing gigantic hugepages, fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed."), can also happen in
    return_unused_surplus_pages(), so fix it there as well.

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     

08 Apr, 2014

5 commits

  • When I decrease the value of nr_hugepages in procfs by a lot, a
    softlockup happens. This is because there is no chance for a context
    switch during the freeing process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, so a softlockup doesn't happen during
    that process. Hence it is necessary to add a context switch to the
    freeing process, just as in the allocating process, to avoid the
    softlockup.

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare a machine equipped with 12TB of memory right now.
    However, I confirmed that the required time is directly proportional to
    the number of hugepages being decreased.

    I measured the required times on a smaller machine. It showed that
    130-145 hugepages were decreased per millisecond.

    Hugepages decreased       Required time (msec)   Decreasing rate (pages/msec)
    ---------------------------------------------------------------------------
    10,000 pages == 20GB             70 -  74               135 - 142
    30,000 pages == 60GB            208 - 229               131 - 144

    At this rate, decreasing 6TB worth of hugepages will trigger a softlockup
    with the default threshold of 20 seconds.
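
    A minimal sketch of the kind of change described above, assuming the
    freeing loop shape of set_max_huge_pages() in mm/hugetlb.c (this is an
    illustration, not the literal upstream diff): a voluntary reschedule
    point inside the loop lets other tasks run while millions of hugepages
    are handed back.

    /*
     * hugetlb_lock is a spinlock held across the loop, so
     * cond_resched_lock() is used: it drops the lock, reschedules if
     * needed, and re-acquires it before the next iteration.
     */
    while (min_count < persistent_huge_pages(h)) {
            if (!free_pool_huge_page(h, nodes_allowed, 0))
                    break;
            cond_resched_lock(&hugetlb_lock);
    }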

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • Signed-off-by: Choi Gi-yong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Choi Gi-yong
     
  • To increase compiler portability there is <linux/compiler.h>, which
    provides convenience macros for various gcc constructs, e.g. __weak for
    __attribute__((weak)). I've replaced all instances of gcc attributes with
    the right macro in the memory management (/mm) subsystem.
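
    For illustration, the kind of substitution this covers (the function
    name below is hypothetical):

    #include <linux/compiler.h>

    /* Before: raw gcc attribute syntax. */
    void __attribute__((weak)) my_arch_hook(void);

    /* After: the convenience macro provided by <linux/compiler.h>. */
    void __weak my_arch_hook(void);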

    [akpm@linux-foundation.org: while-we're-there consistency tweaks]
    Signed-off-by: Gideon Israel Dsouza
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gideon Israel Dsouza
     
  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.
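
    A simplified sketch of the idea, assuming a change_pmd_range()-style
    walk (not the exact upstream code): the invalidate_range_start
    notification is deferred until the first populated PMD is found, and
    the matching range_end is only issued if a start was ever sent.

    unsigned long mni_start = 0;    /* address of first present mapping */

    do {
            next = pmd_addr_end(addr, end);
            if (pmd_none_or_clear_bad(pmd))
                    continue;       /* empty: skip without notifying */

            if (!mni_start) {
                    mni_start = addr;
                    mmu_notifier_invalidate_range_start(mm, mni_start, end);
            }
            /* ... update protections under this pmd ... */
    } while (pmd++, addr = next, addr != end);

    if (mni_start)
            mmu_notifier_invalidate_range_end(mm, mni_start, end);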

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • huge_pte_offset() could return NULL, so we need a NULL check to avoid
    potential NULL pointer dereferences.
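
    A minimal sketch of the pattern (the surrounding caller and return
    value here are illustrative, not the specific sites patched):

    pte_t *ptep;

    ptep = huge_pte_offset(mm, address & huge_page_mask(h));
    if (!ptep)              /* no page table for this huge address */
            return 0;       /* bail out instead of dereferencing NULL */
    entry = huge_ptep_get(ptep);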

    Signed-off-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Sasha Levin
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Apr, 2014

8 commits

  • Both prep_compound_huge_page() and prep_compound_gigantic_page() are
    only called at bootstrap and can be marked as __init.

    The __SetPageTail(page) in prep_compound_gigantic_page() happening
    before page->first_page is initialized is not concerning since this is
    bootstrap.
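
    For reference, the annotation boils down to this (a sketch of the
    declaration only, with the body elided):

    /*
     * __init places the function in .init.text, which is discarded once
     * boot completes -- safe because gigantic pages are only prepared
     * from the early "hugepages=" boot-time setup.
     */
    static void __init prep_compound_gigantic_page(struct page *page,
                                                   unsigned long order)
    {
            /* ... mark the head page and chain the tail pages ... */
    }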

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Joonsoo Kim
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The kernel can currently only handle a single hugetlb page fault at a
    time. This is due to a single mutex that serializes the entire path.
    This lock protects from spurious OOM errors under conditions of low
    availability of free hugepages. This problem is specific to hugepages,
    because it is normal to want to use every single hugepage in the system
    - with normal pages we simply assume there will always be a few spare
    pages which can be used temporarily until the race is resolved.

    Address this problem by using a table of mutexes, allowing a better
    chance of parallelization, where each hugepage is individually
    serialized. The hash key is selected depending on the mapping type: for
    shared mappings it consists of the address space and the file offset
    being faulted on, while for private mappings the mm and the virtual
    address are used. The size of the table is selected based on a
    compromise between collisions and memory footprint, evaluated over a
    series of database workloads.

    Large database workloads that make heavy use of hugepages can be
    particularly exposed to this issue, causing start-up times to be
    painfully slow. This patch reduces the startup time of a 10 Gb Oracle
    DB (with ~5000 faults) from 37.5 secs to 25.7 secs. Larger workloads
    will naturally benefit even more.

    NOTE:
    The only downside to this patch, detected by Joonsoo Kim, is that a
    small race is possible in private mappings: a child process (with its
    own mm, after cow) can instantiate a page that is already being handled
    by the parent in a cow fault. When low on pages, this can trigger
    spurious OOMs. I have not been able to think of an efficient way of
    handling this... but do we really care about such a tiny window? We
    already maintain another theoretical race with normal pages. If not, one
    possible way is to maintain a single hash for private mappings -- any
    workloads that *really* suffer from this scaling problem should
    already use shared mappings.
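
    A sketch of the hash-key selection described above (the table and
    helper names follow the changelog's wording and should be read as an
    illustration rather than the exact upstream code):

    static struct mutex *htlb_fault_mutex_table;
    static int num_fault_mutexes;   /* table size, a power of two */

    static u32 fault_mutex_hash(struct hstate *h, struct mm_struct *mm,
                                struct vm_area_struct *vma,
                                struct address_space *mapping,
                                pgoff_t idx, unsigned long address)
    {
            unsigned long key[2];

            if (vma->vm_flags & VM_SHARED) {
                    key[0] = (unsigned long)mapping;        /* address space */
                    key[1] = idx;                           /* file offset   */
            } else {
                    key[0] = (unsigned long)mm;             /* private: mm   */
                    key[1] = address >> huge_page_shift(h); /* virtual addr  */
            }

            return jhash2((u32 *)&key, sizeof(key) / sizeof(u32), 0) &
                   (num_fault_mutexes - 1);
    }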

    [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
    Signed-off-by: Davidlohr Bueso
    Cc: Aneesh Kumar K.V
    Cc: David Gibson
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Until now, we get a resv_map in two different ways depending on the
    mapping type. This makes the code messy and unreadable. Unify it.

    [davidlohr@hp.com: code cleanups]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is a preparation patch to unify the use of vma_resv_map()
    regardless of the map type. This patch prepares it by removing
    resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
    for all resv_maps.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a race condition if we map the same file in different
    processes. Region tracking is protected by mmap_sem and
    hugetlb_instantiation_mutex. When we mmap, we don't grab the
    hugetlb_instantiation_mutex, only mmap_sem (exclusively). This doesn't
    prevent other tasks from modifying the region structure, so it can be
    modified by two processes concurrently.

    To solve this, introduce a spinlock to resv_map and make the region
    manipulation functions grab it before they do the actual work.
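
    A sketch of the locking change (structure layout and function shape as
    described above; simplified, not the full region-tracking code):

    struct resv_map {
            struct kref refs;
            spinlock_t lock;                /* protects the regions list */
            struct list_head regions;
    };

    /* Region manipulation now takes the resv_map and grabs its lock. */
    static long region_count(struct resv_map *resv, long f, long t)
    {
            long chg = 0;

            spin_lock(&resv->lock);
            /* ... walk resv->regions and count the pages in [f, t) ... */
            spin_unlock(&resv->lock);

            return chg;
    }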

    [davidlohr@hp.com: updated changelog]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Joonsoo Kim
    Suggested-by: Joonsoo Kim
    Acked-by: David Gibson
    Cc: David Gibson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • To change the protection method for region tracking to a fine grained
    one, we pass the resv_map, instead of the list_head, to the region
    manipulation functions.

    This doesn't introduce any functional change; it just prepares for the
    next step.

    [davidlohr@hp.com: update changelog]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, to track reserved and allocated regions, we use two different
    mechanisms depending on the mapping. For MAP_SHARED, we use the
    address_mapping's private_list, while for MAP_PRIVATE we use a
    resv_map.

    Now, we are preparing to change the coarse grained lock which protects
    the region structure to a fine grained lock, and this difference hinders
    that. So, before changing it, unify region structure handling by
    consistently using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads, and comparisons on some
    relatively fast paths.

    Since the naming get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
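
    A sketch of how the renamed interface reads on the allocator side (loop
    shape only; the allocation step is a hypothetical placeholder):

    unsigned int cpuset_mems_cookie;
    struct page *page;

    retry_cpuset:
    cpuset_mems_cookie = read_mems_allowed_begin();

    page = try_alloc_within_mems_allowed();  /* hypothetical allocation step */

    /*
     * The seqcount re-check is only paid on failure, and the return value
     * is inverted relative to put_mems_allowed(): "retry" is true when
     * mems_allowed changed underneath us.
     */
    if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
            goto retry_cpuset;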

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    I've recently noticed, based on requests to add a small piece of code
    that dumps the page to various VM_BUG_ON sites, that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
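
    A sketch of what the new assertion boils down to when CONFIG_DEBUG_VM
    is enabled (simplified):

    #define VM_BUG_ON_PAGE(cond, page)                                  \
            do {                                                        \
                    if (unlikely(cond)) {                               \
                            dump_page(page); /* flags, count, mapcount */ \
                            BUG();                                      \
                    }                                                   \
            } while (0)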

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

5 commits

  • Switch to memblock interfaces for the early memory allocator instead of
    the bootmem allocator. There is no functional change in behavior from
    the bootmem users' point of view.

    Archs already converted to NO_BOOTMEM now directly use memblock
    interfaces instead of bootmem wrappers built on top of memblock. For the
    archs which still use bootmem, these new APIs simply fall back to the
    existing bootmem APIs.

    Signed-off-by: Grygorii Strashko
    Signed-off-by: Santosh Shilimkar
    Cc: "Rafael J. Wysocki"
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michal Hocko
    Cc: Paul Walmsley
    Cc: Pavel Machek
    Cc: Russell King
    Cc: Tejun Heo
    Cc: Tony Lindgren
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Grygorii Strashko
     
  • When copy_hugetlb_page_range() is called to copy a range of hugetlb
    mappings, the secondary MMUs are not notified if there is a protection
    downgrade, which breaks COW semantics in KVM.

    This patch adds the necessary MMU notifier calls.

    Signed-off-by: Andreas Sandberg
    Acked-by: Steve Capper
    Acked-by: Marc Zyngier
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Sandberg
     
  • There is no actual need for it, so keep it internal.

    Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Signed-off-by: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • get_page_foll() is more optimal and is always safe to use under the PT
    lock. More so for hugetlbfs as there's no risk of race conditions with
    split_huge_page regardless of the PT lock.

    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

22 Nov, 2013

2 commits

  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    Also it is repeating compound_head twice for hugetlbfs and it is running
    compound_head+compound_trans_head for THP when a single one is needed in
    both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
            copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
            struct hstate *h = page_hstate(src);

            if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
                    ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")
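
    A simplified sketch of the THP-aware copy helper described above (the
    gigantic-page special case is omitted; treat this as an illustration of
    the shape of the fix, not the exact upstream code):

    static void copy_huge_page(struct page *dst, struct page *src)
    {
            int i, nr_pages;

            if (PageHuge(src)) {
                    /* hugetlbfs page: page_hstate() is valid here */
                    struct hstate *h = page_hstate(src);

                    nr_pages = pages_per_huge_page(h);
            } else {
                    /* transparent huge page: no hstate involved at all */
                    VM_BUG_ON(!PageTransHuge(src));
                    nr_pages = hpage_nr_pages(src);
            }

            for (i = 0; i < nr_pages; i++) {
                    cond_resched();
                    copy_highpage(dst + i, src + i);
            }
    }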

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Nov, 2013

1 commit

  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.
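
    A sketch of the resulting lock selection (split lock for PMD-sized
    pages, the per-mm page_table_lock otherwise; an illustration, not
    necessarily the exact upstream helper):

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm,
                                               pte_t *pte)
    {
            if (huge_page_size(h) == PMD_SIZE)
                    return pmd_lockptr(mm, (pmd_t *)pte);   /* split ptl */

            VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
            return &mm->page_table_lock;                    /* PUD and up */
    }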

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Oct, 2013

2 commits

  • Commit 11feeb498086 ("kvm: optimize away THP checks in
    kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
    compound pages.

    That commit depends on the assumption that PG_reserved is identical for
    all head and tail pages of a compound page. So that if get_user_pages
    returns a tail page, we don't need to check the head page in order to
    know if we deal with a reserved page that requires different
    refcounting.

    The assumption that PG_reserved is the same for head and tail pages is
    certainly correct for THP and regular hugepages, but gigantic hugepages
    allocated through bootmem don't clear the PG_reserved on the tail pages
    (the clearing of PG_reserved is done later only if the gigantic hugepage
    is freed).

    This patch corrects the gigantic compound page initialization so that we
    can retain the optimization in 11feeb498086. The cacheline was already
    modified in order to set PG_tail so this won't affect the boot time of
    large memory systems.

    [akpm@linux-foundation.org: tweak comment layout and grammar]
    Signed-off-by: Andrea Arcangeli
    Reported-by: andy123
    Acked-by: Rik van Riel
    Cc: Gleb Natapov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We should clear the page's private flag when returning the page to the
    hugepage pool. Otherwise, the marked hugepage can be allocated to a user
    who tries to allocate a non-reserved hugepage. If this user fails to
    map the hugepage, they will try to return the page to the hugepage pool.
    Since this page has the private flag set, resv_huge_pages would
    mistakenly increase. This patch fixes that situation.

    Signed-off-by: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

12 Sep, 2013

15 commits

  • Now that hugepage migration is enabled, although restricted to pmd-based
    hugepages for now (due to lack of testing), we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes the GFP flags in hugepage allocation dependent on
    migration support, not only on the value of hugepages_treat_as_movable.
    It introduces no behavioral change for architectures which do not
    support hugepage migration.
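
    A sketch of the resulting GFP selection (the helper names are
    assumptions based on the changelog, not verified against the exact
    tree):

    static inline gfp_t htlb_alloc_mask(struct hstate *h)
    {
            /* Movable only if the sysctl asks for it or migration works. */
            if (hugepages_treat_as_movable || hugepage_migration_support(h))
                    return GFP_HIGHUSER_MOVABLE;
            else
                    return GFP_HIGHUSER;
    }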

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Until now we can't offline memory blocks which contain hugepages because
    a hugepage is considered an unmovable page. But now, with this patch
    series, a hugepage has become movable, so by using hugepage migration we
    can offline such memory blocks.

    What's different from other users of hugepage migration is that we need
    to decompose all the hugepages inside the target memory block into free
    buddy pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with memory offlining. For this
    reason we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, what this patch does is straightforward: it adds
    hugepage migration code, that is, hugepage handling in the functions
    which scan over pfns and collect pages to be migrated, and a hugepage
    allocation function in alloc_migrate_target().

    As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
    over them because they are larger than a memory block. So for now we
    simply let that case fail as it does today.
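
    A sketch of what dissolving a single free hugepage amounts to
    (simplified from the description above; not the exact upstream code):

    /* Give a free hugepage back to the buddy allocator. */
    static void dissolve_free_huge_page(struct page *page)
    {
            spin_lock(&hugetlb_lock);
            if (PageHuge(page) && !page_count(page)) {
                    struct hstate *h = page_hstate(page);
                    int nid = page_to_nid(page);

                    list_del(&page->lru);
                    h->free_huge_pages--;
                    h->free_huge_pages_node[nid]--;
                    update_and_free_page(h, page);  /* back to buddy */
            }
            spin_unlock(&hugetlb_lock);
    }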

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
    migrate hugepage with mbind(2) after applying the enablement patch which
    comes later in this series.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently hugepage migration is available only for soft offlining, but
    it's also useful for some other users of page migration (clearly because
    users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
    So this patchset tries to extend such users to support hugepage migration.

    The target of this patchset is to enable hugepage migration for NUMA
    related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
    memory hotplug.

    This patchset does not add hugepage migration for memory compaction,
    because users of memory compaction mainly expect to construct thp by
    arranging raw pages, and there's little or no need to compact hugepages.
    CMA, another user of page migration, can have benefit from hugepage
    migration, but is not enabled to support it for now (just because of lack
    of testing and expertise in CMA.)

    Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
    x86_64, or hugepages in architectures like ia64) is not enabled for now
    (again, because of lack of testing.)

    As for how these are achieved, I extended the API (migrate_pages()) to
    handle hugepages (with patches 1 and 2) and adjusted the code of each
    caller to check for and collect movable hugepages (with patches 3-7).
    The remaining 2 patches are miscellaneous ones to avoid unexpected
    behavior. Patch 8 is about making sure that we only migrate pmd-based
    hugepages, and patch 9 is about choosing the appropriate zone for
    hugepage allocation.

    My test is mainly functional one, simply kicking hugepage migration via
    each entry point and confirm that migration is done correctly. Test code
    is available here:

    git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

    And I always run libhugetlbfs test when changing hugetlbfs's code. With
    this patchset, no regression was found in the test.

    This patch (of 9):

    Before enabling each user of page migration to support hugepage,
    this patch enables the list of pages for migration to link not only
    LRU pages, but also hugepages. As a result, putback_movable_pages()
    and migrate_pages() can handle both of LRU pages and hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If we fail with a reserved page, just calling put_page() is not
    sufficient, because put_page() invokes free_huge_page() as its last
    step, which doesn't know whether the page comes from a reserved pool or
    not, so it doesn't do anything related to the reserve count. This leaves
    the reserve count lower than it should be, because the reserve count was
    already decreased in dequeue_huge_page_vma(). This patch fixes that
    situation.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We don't need to hold the page_table_lock when we release a page. So,
    defer grabbing the page_table_lock.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Aneesh Kumar
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • is_vma_resv_set(vma, HPAGE_RESV_OWNER) implies that the mapping is
    private, so we don't need to check whether the mapping is shared or
    not.

    This patch is just for clean-up.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If we allocate a hugepage with avoid_reserve, we don't dequeue a
    reserved one, so we should also check the subpool counter when
    avoid_reserve is set. This patch implements that.

    Signed-off-by: Joonsoo Kim
    Cc: Aneesh Kumar
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • 'reservations' is too long a name for a variable, and we use 'resv_map'
    to represent 'struct resv_map' elsewhere. To reduce confusion and
    improve readability, change it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Don't use the reserve pool when soft offlining a hugepage. Check that we
    have free pages outside the reserve pool before we dequeue the huge
    page. Otherwise, we can steal another task's reserved page.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar
    Cc: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Gibson
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If a vma with VM_NORESERVE allocates a new page for the page cache, we
    should check whether this area is reserved or not. If this address is
    already reserved by another process (the chg == 0 case), we should
    decrement the reserve count, because this allocated page will go into
    the page cache and currently there is no way to know, when releasing the
    inode, whether the page came from the reserved pool or not. This can
    introduce an over-counting problem for the reserve count. With the
    following example code, you can easily reproduce this situation.

    Assume 2MB, nr_hugepages = 100

    size = 20 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }

    flag = MAP_SHARED | MAP_NORESERVE;
    q = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (q == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    q[0] = 'c';

    After the program finishes, run 'cat /proc/meminfo'. You will see the
    result below.

    HugePages_Free: 100
    HugePages_Rsvd: 1

    To fix this, we should check our mapping type and the tracked region. If
    the mapping is VM_NORESERVE and VM_MAYSHARE and chg is 0, this implies
    that the currently allocated page will go into the page cache, which was
    already reserved when the mapping was created. In this case, we should
    decrease the reserve count. This patch implements the check described
    above and thereby solves the problem.

    [akpm@linux-foundation.org: fix spelling in comment]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, the checking conditions of decrement_hugepage_resv_vma() and
    vma_has_reserves() are the same, so we can clean this up using
    vma_has_reserves(). Additionally, decrement_hugepage_resv_vma() has only
    one call site, so we can remove the function and embed it into
    dequeue_huge_page_vma() directly. This patch implements that.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If we map a region with MAP_NORESERVE and MAP_SHARED, we skip the
    reserve counting check, and eventually we cannot be sure that a huge
    page can be allocated at fault time. With the following example code,
    you can easily reproduce this situation.

    Assume 2MB, nr_hugepages = 100

    fd = hugetlbfs_unlinked_fd();
    if (fd < 0)
            return 1;

    size = 200 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }

    size = 2 * MB;
    flag = MAP_ANONYMOUS | MAP_SHARED | MAP_HUGETLB | MAP_NORESERVE;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, -1, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    p[0] = '0';
    sleep(10);

    During executing sleep(10), run 'cat /proc/meminfo' on another process.

    HugePages_Free: 99
    HugePages_Rsvd: 100

    The number of free pages should be higher than or equal to the number of
    reserved pages, but here it isn't. This shows that a non-reserved shared
    mapping stole a reserved page. A non-reserved shared mapping should not
    eat into reserve space.

    If we consider VM_NORESERVE in vma_has_reserves() and return 0, meaning
    that we don't have a reserved page, then we check that we have enough
    free pages in dequeue_huge_page_vma(). This prevents stealing a reserved
    page.

    With this change, the above test generates a SIGBUS, which is correct,
    because all free pages are reserved and a non-reserved shared mapping
    can't get a free page.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Currently, we use a page in the page cache with a map count of 1 for the
    cow optimization: if we find this condition, we don't allocate a new
    page and copy the contents, but instead map this page directly. This can
    introduce a problem where writing to a private mapping overwrites the
    hugetlb file directly. You can reproduce this situation with the
    following code.

    size = 20 * MB;
    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }
    p[0] = 's';
    fprintf(stdout, "BEFORE STEAL PRIVATE WRITE: %c\n", p[0]);
    munmap(p, size);

    flag = MAP_PRIVATE;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
    }
    p[0] = 'c';
    munmap(p, size);

    flag = MAP_SHARED;
    p = mmap(NULL, size, PROT_READ|PROT_WRITE, flag, fd, 0);
    if (p == MAP_FAILED) {
            fprintf(stderr, "mmap() failed: %s\n", strerror(errno));
            return -1;
    }
    fprintf(stdout, "AFTER STEAL PRIVATE WRITE: %c\n", p[0]);
    munmap(p, size);

    We can see that "AFTER STEAL PRIVATE WRITE: c", not "AFTER STEAL PRIVATE
    WRITE: s". If we turn off this optimization to a page in page cache, the
    problem is disappeared.

    So, I change the trigger condition of optimization. If this page is not
    AnonPage, we don't do optimization. This makes this optimization turning
    off for a page cache.

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Hocko
    Reviewed-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If the list is empty, list_for_each_entry_safe() doesn't do anything.
    So, this check is redundant. Remove it.
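
    For illustration (the list name and the work done per entry are
    hypothetical):

    struct page *page, *next;
    LIST_HEAD(pages);

    /* Before: the guard adds nothing ... */
    if (!list_empty(&pages))
            list_for_each_entry_safe(page, next, &pages, lru)
                    put_page(page);

    /* ... because iterating an empty list already does zero work: */
    list_for_each_entry_safe(page, next, &pages, lru)
            put_page(page);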

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Hocko
    Reviewed-by: Wanpeng Li
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Wanpeng Li
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim