05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion about whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    A global switch to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've run spatch on them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
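
    For illustration, a typical call-site conversion ends up looking like
    this (hypothetical fragment, not taken from the patch):

        /* before */
        index  = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        ...
        page_cache_release(page);

        /* after */
        index  = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        get_page(page);
        ...
        put_page(page);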

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • There is a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments
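
    For example (the message text here is made up; only the shape
    matters), a split-format pr_warning call becomes a single-format
    pr_warn call with realigned arguments:

        - pr_warning("%s: allocation failed "
        -            "on node %d\n", __func__, nid);
        + pr_warn("%s: allocation failed on node %d\n",
        +         __func__, nid);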

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

10 Mar, 2016

2 commits

  • Replace ENOTSUPP with EOPNOTSUPP. If hugepages are not supported, this
    value is propagated to userspace. EOPNOTSUPP is part of the uapi and
    is widely supported by libc implementations.

    It gives the user a nicer message, rather than:

    # cat /proc/sys/vm/nr_hugepages
    cat: /proc/sys/vm/nr_hugepages: Unknown error 524

    And also LTP's proc01 test was failing because this ret code (524)
    was unexpected:

    proc01 1 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
    proc01 2 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
    proc01 3 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524
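
    The change at each affected handler is essentially a one-liner
    (sketch, not the exact diff):

        if (!hugepages_supported())
                return -EOPNOTSUPP;     /* was -ENOTSUPP (524); 95 is a uapi errno */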

    Signed-off-by: Jan Stancek
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
  • The warning message "killed due to inadequate hugepage pool" simply
    indicates that SIGBUS was sent, not that the process was forcibly
    killed. If the process has a signal handler installed that does not
    fix the problem, this message can rapidly spam the kernel log.

    On my amd64 dev machine, which does not have hugepages configured, I
    can reproduce the repeated warnings easily by setting
    vm.nr_hugepages=2 (i.e., 4 megabytes of huge pages) and running
    something that sets a signal handler and forks, like:

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    sig_atomic_t counter = 10;
    void handler(int signal)
    {
            if (counter-- == 0)
                    exit(0);
    }

    int main(void)
    {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {perror("mmap"); return 1;}
            *addr = 'x';
            switch (fork()) {
            case -1:
                    perror("fork"); return 1;
            case 0:
                    signal(SIGBUS, handler);
                    *addr = 'x';
                    break;
            default:
                    *addr = 'x';
                    wait(&status);
                    if (WIFSIGNALED(status)) {
                            psignal(WTERMSIG(status), "child");
                    }
                    break;
            }
    }

    Signed-off-by: Geoffrey Thomas
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoffrey Thomas
     

19 Feb, 2016

1 commit

  • An incorrect default hugepage pool size is currently reported by
    /proc/sys/vm/nr_hugepages when the number of pages for the default
    huge page size is specified twice.

    When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
    indicates the current number of pre-allocated huge pages of the
    default size. Basically, /proc/sys/vm/nr_hugepages displays
    default_hstate->max_huge_pages and, after boot-time pre-allocation,
    max_huge_pages should equal the number of pre-allocated pages
    (nr_hugepages).

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with the command line option 'default_hugepagesz=1G
    hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'.
    After boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep
    hugepages' return the value X. However, dmesg output shows that Z
    huge pages were pre-allocated.

    So, the root cause of the problem is that the global variable
    default_hstate_max_huge_pages is set if a default huge page size is
    specified (directly or indirectly) on the command line. After the
    command line processing in hugetlb_init, if
    default_hstate_max_huge_pages is set, its value is assigned to
    default_hstate.max_huge_pages. However, default_hstate.max_huge_pages
    may have already been set based on the number of pre-allocated huge
    pages of default_hstate size.

    The solution is that if hstate->max_huge_pages is already set, it
    should not be overwritten by the global max_huge_pages value.
    Basically, if the variable hugepages is set multiple times on the
    command line for a specific supported hugepagesize, then the proc
    layer should report the last specified value.
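
    The shape of the fix in hugetlb_init() is roughly (sketch):

        /* only apply the command-line default count if no count was
         * already recorded for the default-sized hstate */
        if (default_hstate_max_huge_pages) {
                if (!default_hstate.max_huge_pages)
                        default_hstate.max_huge_pages =
                                        default_hstate_max_huge_pages;
        }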

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     

06 Feb, 2016

2 commits

  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page
    allocation at runtime") added runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when
    CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA
    pageblocks and the associated infrastructure, it is possible with a
    few simple adjustments to require only CONFIG_MEMORY_ISOLATION
    instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Attempting to preallocate 1G gigantic huge pages at boot time with
    "hugepagesz=1G hugepages=1" on the kernel command line will prevent
    booting with the following:

    kernel BUG at mm/hugetlb.c:1218!

    When mapcount accounting was reworked, the setting of
    compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
    a result, the validation of mapcount in free_huge_page fails.
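
    The missing piece is essentially one line at the end of
    prep_compound_gigantic_page(), mirroring prep_compound_page()
    (sketch):

        /* initialise the compound mapcount, as prep_compound_page() does */
        atomic_set(compound_mapcount_ptr(page), -1);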

    The "BUG_ON" checks in free_huge_page were also changed to
    "VM_BUG_ON_PAGE" to assist with debugging.

    Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Tested-by: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Jerome Marchand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

16 Jan, 2016

4 commits

  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we need to track the mapcount on a per-small-page
    basis.

    A straightforward approach is to use ->_mapcount in all subpages to
    track how many times the subpage is mapped with PMDs or PTEs
    combined. But this is rather expensive: mapping or unmapping a THP
    page with a PMD would require HPAGE_PMD_NR atomic operations instead
    of the single one we have now.

    The idea is to store separately how many times the page was mapped as
    a whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track the PTE mapcount.

    We use the same approach as with the compound page destructor and
    compound order to store compound_mapcount: use space in the first
    tail page, ->mapping this time.

    Any time we map/unmap a whole compound page (THP or hugetlb) we
    increment/decrement compound_mapcount. When we map part of a compound
    page with a PTE, we operate on ->_mapcount of the subpage.

    page_mapcount() counts both PTE and PMD mappings of the page.

    Basically, the mapcount of a subpage is spread over two counters,
    which makes it tricky to detect when the last mapcount for a page
    goes away.

    We introduce PageDoubleMap() for this. When we split a THP PMD for
    the first time and there is another PMD mapping left, we offset
    ->_mapcount in all subpages by one and set PG_double_map on the
    compound page. These additional references go away with the last
    compound_mapcount.

    This approach provides a way to detect when the last mapcount goes
    away on a per-small-page basis, without introducing new overhead for
    the most common cases.
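
    In code terms, the counters end up looking roughly like this
    (simplified sketch of the helpers):

        static inline atomic_t *compound_mapcount_ptr(struct page *page)
        {
                /* stored in the first tail page, overlaying ->mapping */
                return &page[1].compound_mapcount;
        }

        /* total mapcount of a subpage: its own PTE maps plus the PMD maps
         * of the compound page, minus the extra reference added when
         * PG_double_map was set */
        static inline int page_mapcount(struct page *page)
        {
                int ret = atomic_read(&page->_mapcount) + 1;

                if (likely(!PageCompound(page)))
                        return ret;

                page = compound_head(page);
                ret += atomic_read(compound_mapcount_ptr(page)) + 1;
                if (PageDoubleMap(page))
                        ret--;
                return ret;
        }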

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tail page refcounting is utterly complicated and painful to support.

    It uses ->_mapcount on tail pages to store how many times the page is
    pinned. get_page() bumps ->_mapcount on the tail page in addition to
    ->_count on the head. This information is required by
    split_huge_page() to be able to distribute pins from the head of the
    compound page to the tails during the split.

    We will need ->_mapcount to account for PTE mappings of subpages of
    the compound page. We eliminate the need for the current meaning of
    ->_mapcount in tail pages by forbidding the split entirely if the
    page is pinned.

    The only user of tail page refcounting is THP, which is marked BROKEN
    for now.

    Let's drop all this mess. It makes get_page() and put_page() much
    simpler.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we cannot rely on the PageTransHuge() check to
    decide whether to map/unmap a small page or a THP.

    The patch adds a new argument to the rmap functions to indicate
    whether we want to operate on the whole compound page or only the
    small page.

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As far as I can see there are no users of PG_reserved on compound
    pages. Let's use PF_NO_COMPOUND here.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
    bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that
    when reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits,
    simply because of how I happened to detect these non-modular use
    cases. But that could introduce regressions if the patch merge
    ordering puts the fs part first -- as 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.
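
    Concretely, the registration lines become roughly (sketch):

        /* fs/hugetlbfs/inode.c -- was module_init(init_hugetlbfs_fs) */
        fs_initcall(init_hugetlbfs_fs);

        /* mm/hugetlb.c -- was module_init(hugetlb_init) */
        subsys_initcall(hugetlb_init);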

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

13 Dec, 2015

3 commits

  • Dmitry Vyukov reported the following memory leak

    unreferenced object 0xffff88002eaafd88 (size 32):
    comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
    hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

    Dmitry identified a potential memory leak in the routine region_chg,
    where a region descriptor is not freed on an error path.

    However, the root cause of the above memory leak is in region_del.
    In this specific case, a "placeholder" entry is created in region_chg.
    The associated page allocation fails, and the placeholder entry is
    left in the reserve map. This is "by design", as the entry should be
    deleted when the map is released. The bug is in the region_del
    routine, which is used to delete entries within a specific range (and
    when the map is released). region_del did not handle the case where a
    placeholder entry exactly matched the start of the range to be
    deleted. In this case, the entry would not be deleted and would be
    leaked. The fix is to take these special placeholder entries into
    account in region_del.

    The region_chg error path leak is also fixed.
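
    A sketch of the relevant check in region_del()'s scan (condition
    shown approximately): the loop must not skip a zero-length
    placeholder sitting exactly at the start of the range:

        list_for_each_entry_safe(rg, trg, head, link) {
                /*
                 * Skip regions entirely before the range -- but not a
                 * placeholder (rg->from == rg->to) located exactly at f,
                 * which must be removed here or it is leaked.
                 */
                if (rg->to <= f && (rg->to != rg->from || rg->to != f))
                        continue;
                ...
        }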

    Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently, at the beginning of hugetlb_fault(), we call
    huge_pte_offset() and check whether the obtained *ptep is a
    migration/hwpoison entry. If it is not, we then call
    huge_pte_alloc(). This is racy because *ptep could turn into a
    migration/hwpoison entry after the huge_pte_offset() check. This race
    results in a BUG_ON in huge_pte_alloc().

    We don't have to call huge_pte_alloc() when huge_pte_offset() returns
    non-NULL, so let's fix this bug by moving the code into the else
    block.

    Note that *ptep could still turn into a migration/hwpoison entry
    after this block, but that's not a problem because we have another
    !pte_present check later (we never go into hugetlb_no_page() in that
    case).
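
    The reordered flow at the top of hugetlb_fault() looks roughly like
    this (sketch):

        ptep = huge_pte_offset(mm, address);
        if (ptep) {
                entry = huge_ptep_get(ptep);
                if (unlikely(is_hugetlb_entry_migration(entry))) {
                        migration_entry_wait_huge(vma, mm, ptep);
                        return 0;
                } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
                        return VM_FAULT_HWPOISON_LARGE |
                                VM_FAULT_SET_HINDEX(hstate_index(h));
        } else {
                /* only allocate a page table when no entry exists yet */
                ptep = huge_pte_alloc(mm, address, huge_page_size(h));
                if (!ptep)
                        return VM_FAULT_OOM;
        }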

    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Mike Kravetz
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
    alloc_buddy_huge_page() to directly create a hugepage from the buddy
    allocator.

    In that case, however, if alloc_buddy_huge_page() succeeds we don't
    decrement h->resv_huge_pages, which means that a successful
    hugetlb_fault() returns without releasing the reserve count. As a
    result, a subsequent hugetlb_fault() might fail even though there are
    still free hugepages.

    This patch simply adds decrementing code on that code path.
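
    Roughly, in alloc_huge_page() (sketch):

        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
        if (!page) {
                page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
                if (!page)
                        goto out_uncharge_cgroup;
                /* new: a consumed reservation must be released here too */
                if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
                        SetPagePrivate(page);
                        h->resv_huge_pages--;
                }
                ...
        }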

    I reproduced this problem when testing the v4.3 kernel in the
    following situation:
    - the test machine/VM is a NUMA system,
    - hugepage overcommitting is enabled,
    - most hugepages are allocated and there's only one free hugepage,
      which is on node 0 (for example),
    - another program, which calls set_mempolicy(MPOL_BIND) to bind
      itself to node 1, tries to allocate a hugepage,
    - the allocation should fail, but the reserve count is still held.

    Signed-off-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Nov, 2015

1 commit


07 Nov, 2015

3 commits

  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that a compound_head() call can be unsafe in
    some contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in
    one shot.

    The patch introduces page->compound_head in the third double word
    block, in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail() and the remaining bits are a pointer to the head page if
    bit zero is set.
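
    The encoding makes the flag and the pointer a single word that can be
    read and written atomically; roughly (sketch):

        static inline void set_compound_head(struct page *page,
                                             struct page *head)
        {
                WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
        }

        static inline int PageTail(struct page *page)
        {
                return READ_ONCE(page->compound_head) & 1;
        }

        static inline struct page *compound_head(struct page *page)
        {
                unsigned long head = READ_ONCE(page->compound_head);

                if (unlikely(head & 1))
                        return (struct page *)(head - 1);
                return page;
        }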

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free
    since ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the
    first tail page to store a struct hugetlb_cgroup pointer. But that's
    out of scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and
    we could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and
    compound_order in struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for
    destructor lookup and store its index in struct page instead of a
    direct pointer to the destructor. It shouldn't be much trouble to
    maintain the table: we currently have only two destructors and NULL.
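
    Roughly, the table and the lookup become (sketch; CONFIG_HUGETLB_PAGE
    conditionals omitted):

        enum compound_dtor_id {
                NULL_COMPOUND_DTOR,
                COMPOUND_PAGE_DTOR,
                HUGETLB_PAGE_DTOR,
                NR_COMPOUND_DTORS,
        };

        compound_page_dtor * const compound_page_dtors[] = {
                NULL,
                free_compound_page,
                free_huge_page,
        };

        static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
        {
                VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
                return compound_page_dtors[page[1].compound_dtor];
        }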

    This patch frees up one word in tail pages for reuse. This is
    preparation for the next patch.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

5 commits

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be
    used, this can incur a high penalty for locking.

    For the large-file example, this is the usage pattern for a large
    statistical language model (it probably applies to other statistical
    or graphical models as well). For the security example, think of any
    application transacting in data that cannot be swapped out (credit
    card data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are
    finally faulted in. The VM_LOCKONFAULT flag is used together with
    VM_LOCKED and has no effect when set without VM_LOCKED. Setting
    VM_LOCKONFAULT for a VMA causes pages faulted into that VMA to be
    added to the unevictable LRU when they are faulted in (or immediately
    if they are already present), but it will not cause any missing pages
    to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
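
    From userspace this state is reached via mlock2(..., MLOCK_ONFAULT),
    added in the same series. A minimal usage sketch (with fallback
    defines for older headers; the syscall number below is the x86_64
    one):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef MLOCK_ONFAULT
        #define MLOCK_ONFAULT 0x01
        #endif
        #ifndef __NR_mlock2
        #define __NR_mlock2 325         /* x86_64; adjust for other arches */
        #endif

        int main(void)
        {
                size_t len = 64UL << 20;        /* large mapping, only partly used */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                /* nothing is pre-faulted; pages are mlocked as they are touched */
                if (syscall(__NR_mlock2, p, len, MLOCK_ONFAULT)) {
                        perror("mlock2");
                        return 1;
                }
                p[0] = 1;       /* this page is now resident and locked */
                return 0;
        }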

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • My recent patch "mm, hugetlb: use memory policy when available" added some
    bloat to hugetlb.o. This patch aims to get some of the bloat back,
    especially when NUMA is not in play.

    It does this with an implicit #ifdef and by marking some things static
    that should have been static in my first patch. It also downgrades the
    warnings to VM_WARN_ON()s; they were responsible for a pretty big
    chunk of the bloat.

    Doing this gets our NUMA=n text size back to a wee bit _below_ where we
    started before the original patch.

    It also shaves a bit of space off the NUMA=y case, but not much.
    Enforcing the mempolicy definitely takes some text and it's hard to avoid.

    size(1) output:

    text data bss dec hex filename
    30745 3433 2492 36670 8f3e hugetlb.o.nonuma.baseline
    31305 3755 2492 37552 92b0 hugetlb.o.nonuma.patch1
    30713 3433 2492 36638 8f1e hugetlb.o.nonuma.patch2 (this patch)
    25235 473 41276 66984 105a8 hugetlb.o.numa.baseline
    25715 475 41276 67466 1078a hugetlb.o.numa.patch1
    25491 473 41276 67240 106a8 hugetlb.o.numa.patch2 (this patch)

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I have a hugetlbfs user which never explicitly allocates huge pages
    with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then
    let the pages be allocated from the buddy allocator at fault time.

    This works, but they noticed that mbind() was not doing them any good and
    the pages were being allocated without respect for the policy they
    specified.

    The code in question is this:

    > struct page *alloc_huge_page(struct vm_area_struct *vma,
    ...
    > page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
    > if (!page) {
    > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);

    dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
    But, it only grabs _existing_ huge pages from the huge page pool. If the
    pool is empty, we fall back to alloc_buddy_huge_page() which obviously
    can't do anything with the VMA's policy because it isn't even passed the
    VMA.

    Almost everybody preallocates huge pages. That's probably why nobody has
    ever noticed this. Looking back at the git history, I don't think this
    _ever_ worked from when alloc_buddy_huge_page() was introduced in
    7893d1d5, 8 years ago.

    The fix is to pass vma/addr down into the places where we actually
    call into the buddy allocator. It's fairly straightforward plumbing.
    This has been lightly tested.
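
    A minimal sketch of the idea (the helper name is illustrative): hand
    the buddy fallback the vma/addr so it can honour the mempolicy
    instead of always using NUMA_NO_NODE:

        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
        if (!page) {
                /*
                 * Like alloc_buddy_huge_page(), but consults the VMA's
                 * mempolicy (via the vma-aware allocator entry points)
                 * when vma/addr are available.
                 */
                page = __alloc_buddy_huge_page_with_mpol(h, vma, addr);
                ...
        }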

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • There are no users of the node_hstates array outside of mm/hugetlb.c,
    so let's make it static.

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Currently there's no easy way to get per-process usage of hugetlb pages,
    which is inconvenient because userspace applications which use hugetlb
    typically want to control their processes on the basis of how much memory
    (including hugetlb) they use. So this patch simply provides easy access
    to the info via /proc/PID/status.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Joern Engel
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

02 Oct, 2015

1 commit

  • SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I find a linux bug, I have the test cases is constructed. I
    can stable recurring problems in fedora22(4.0.4) kernel version,
    arch for x86_64. I construct transparent huge page, when the parent
    and child process with MAP_SHARE, MAP_PRIVATE way to access the same
    huge page area, it has the opportunity to lead to huge page copy on
    write failure, and then it will munmap the child corresponding mmap
    area, but then the child mmap area with VM_MAYSHARE attributes, child
    process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
    functions (vma - > vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Sep, 2015

9 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    Providing a true convenience function for allocations restricted to a
    node was also considered, but the prevailing opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.
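
    After the rename the pair looks roughly like this (sketch):

        /* 'nid' must be a valid node; no fallback to the current node */
        static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order,
                                     node_zonelist(nid, gfp_mask));
        }

        /* general-purpose variant: nid < 0 (including NUMA_NO_NODE) means
         * "allocate near the current node" */
        static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                    unsigned int order)
        {
                if (nid < 0)
                        nid = numa_mem_id();

                return __alloc_pages_node(nid, gfp_mask, order);
        }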

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This is based on the shmem version, but it has diverged quite a bit.
    We have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.
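
    Typical usage from an application (the path and sizes are made up;
    assumes a 2 MB huge page size):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>

        int main(void)
        {
                int fd = open("/mnt/huge/model.dat", O_CREAT | O_RDWR, 0600);

                if (fd < 0) { perror("open"); return 1; }
                /* preallocate 1 GB of huge pages backing the file */
                if (fallocate(fd, 0, 0, 1UL << 30))
                        perror("fallocate(preallocate)");
                /* later: return one 2 MB huge page in the middle to the pool */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              512UL << 20, 2UL << 20))
                        perror("fallocate(punch hole)");
                return 0;
        }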

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently, there is only a single place where hugetlbfs pages are
    added to the page cache. The new fallocate code will be adding a
    second one, so break the functionality out into its own helper.

    Signed-off-by: Dave Hansen
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Areas hole punched by fallocate will not have entries in the
    region/reserve map. However, shared mappings with min_size subpool
    reservations may still have reserved pages. alloc_huge_page needs to
    handle this special case and do the proper accounting.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • In vma_has_reserves(), the current assumption is that reserves are
    always present for shared mappings. However, this will not be the case
    with fallocate hole punch. When punching a hole, the present page will
    be deleted as well as the region/reserve map entry (and hence any
    reservation). vma_has_reserves is passed "chg" which indicates whether
    or not a region/reserve map is present. Use this to determine if
    reserves are actually present or were removed via hole punch.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle it accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages,
    and to detect an error from region_del and pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlb page faults are currently synchronized by the table of mutexes
    (htlb_fault_mutex_table). fallocate code will need to synchronize with
    the page fault code when it allocates or deletes pages. Expose
    interfaces so that fallocate operations can be synchronized with page
    faults. Minor name changes to be more consistent with other global
    hugetlb symbols.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to remove a specific range of pages. The
    existing region_truncate() routine deletes all region/reserve map
    entries after a specified offset. region_del() will provide this same
    functionality if the end of region is specified as LONG_MAX. Hence,
    region_del() can replace region_truncate().

    Unlike region_truncate(), region_del() can return an error in the rare
    case where it cannot allocate memory for a region descriptor. This
    ONLY happens when an existing region must be split.
    Current callers passing LONG_MAX as end of range will never experience
    this error and do not need to deal with error handling. Future callers
    of region_del() (such as fallocate hole punch) will need to handle this
    error.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs is used today by applications that want a high degree of
    control over huge page usage. Often, large hugetlbfs files are used
    to map a large number of huge pages into the application processes. The
    applications know when page ranges within these large files will no
    longer be used, and ideally would like to release them back to the
    subpool or global pools for other uses. The fallocate() system call
    provides an interface for preallocation and hole punching within files.
    This patch set adds fallocate functionality to hugetlbfs.

    fallocate hole punch will want to remove a specific range of pages.
    When pages are removed, their associated entries in the region/reserve
    map will also be removed. This will break an assumption in the
    region_chg/region_add calling sequence. If a new region descriptor must
    be allocated, it is done as part of the region_chg processing. In this
    way, region_add can not fail because it does not need to attempt an
    allocation.

    To prepare for fallocate hole punch, create a "cache" of descriptors
    that can be used by region_add if necessary. region_chg will ensure
    there are sufficient entries in the cache. It will be necessary to
    track the number of in progress add operations to know a sufficient
    number of descriptors reside in the cache. A new routine region_abort
    is added to adjust this in progress count when add operations are
    aborted. vma_abort_reservation is also added for callers creating
    reservations with vma_needs_reservation/vma_commit_reservation.

    [akpm@linux-foundation.org: fix typo in comment, use more cols]
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Sep, 2015

2 commits


26 Jun, 2015

1 commit


25 Jun, 2015

3 commits

  • alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
    number of pages which will be added to the reserve map. Subpool and
    global reserve counts are adjusted based on the output of region_chg.
    Before the pages are actually added to the reserve map, these routines
    could race and add fewer pages than expected. If this happens, the
    subpool and global reserve counts are not correct.

    Compare the number of pages actually added (region_add) to those
    expected to be added (region_chg). If fewer pages are actually added,
    this indicates a race; adjust the counters accordingly.
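
    In hugetlb_reserve_pages() the adjustment looks roughly like this
    (sketch):

        chg = region_chg(resv_map, from, to);
        /* ... subpool and global counts are charged based on 'chg' ... */
        add = region_add(resv_map, from, to);

        if (unlikely(chg > add)) {
                /*
                 * Entries for part of the range were added by a racing
                 * thread between region_chg() and region_add(); return
                 * the excess reservation charged above.
                 */
                long rsv_adjust = hugepage_subpool_put_pages(spool,
                                                             chg - add);

                hugetlb_acct_memory(h, -rsv_adjust);
        }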

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify region_add() to keep track of the regions (pages) added to the
    reserve map and return this value. The return value can be compared
    to the return value of region_chg() to determine if the map was
    modified between calls.

    Make vma_commit_reservation() also pass along the return value of
    region_add(). In the normal case, we want vma_commit_reservation to
    return the same value as the preceding call to vma_needs_reservation.
    Create a common __vma_reservation_common routine to help keep the
    special case return values in sync.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While working on hugetlbfs fallocate support, I noticed the following race
    in the existing code. It is unlikely that this race is hit very often in
    the current code. However, if more functionality to add and remove pages
    to hugetlbfs mappings (such as fallocate) is added the likelihood of
    hitting this race will increase.

    alloc_huge_page and hugetlb_reserve_pages use information from the reserve
    map to determine if there are enough available huge pages to complete the
    operation, as well as adjust global reserve and subpool usage counts. The
    order of operations is as follows:

    - call region_chg() to determine the expected change based on reserve map
    - determine if enough resources are available for this operation
    - adjust global counts based on the expected change
    - call region_add() to update the reserve map

    The issue is that reserve map could change between the call to region_chg
    and region_add. In this case, the counters which were adjusted based on
    the output of region_chg will not be correct.

    In order to hit this race today, there must be an existing shared
    hugetlb mmap created with the MAP_NORESERVE flag. A page fault to
    allocate a huge page via this mapping must occur at the same time
    another task is mapping the same region without the MAP_NORESERVE
    flag.

    The patch set does not prevent the race from happening. Rather, it adds
    simple functionality to detect when the race has occurred. If a race is
    detected, then the incorrect counts are adjusted.

    Review comments pointed out the need for documentation of the existing
    region/reserve map routines. This patch set also adds documentation in
    this area.

    This patch (of 3):

    This is a documentation only patch and does not modify any code.
    Descriptions of the routines used for reserve map/region tracking are
    added.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz