05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion about whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    A global switch to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files; I've run spatch on them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
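
    For illustration, a typical call-site conversion ends up looking like
    this (hypothetical fragment, not taken from the patch):

        /* before */
        index  = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        ...
        page_cache_release(page);

        /* after */
        index  = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        get_page(page);
        ...
        put_page(page);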

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • There is a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments
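
    For example (the message text here is made up; only the shape
    matters), a split-format pr_warning call becomes a single-format
    pr_warn call with realigned arguments:

        - pr_warning("%s: allocation failed "
        -            "on node %d\n", __func__, nid);
        + pr_warn("%s: allocation failed on node %d\n",
        +         __func__, nid);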

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

10 Mar, 2016

2 commits

  • Replace ENOTSUPP with EOPNOTSUPP. If hugepages are not supported, this
    value is propagated to userspace. EOPNOTSUPP is part of the uapi and
    is widely supported by libc implementations.

    It gives the user a nicer message, rather than:

    # cat /proc/sys/vm/nr_hugepages
    cat: /proc/sys/vm/nr_hugepages: Unknown error 524

    And also LTP's proc01 test was failing because this ret code (524)
    was unexpected:

    proc01 1 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
    proc01 2 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
    proc01 3 TFAIL : proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524
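
    The change at each affected handler is essentially a one-liner
    (sketch, not the exact diff):

        if (!hugepages_supported())
                return -EOPNOTSUPP;     /* was -ENOTSUPP (524); 95 is a uapi errno */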

    Signed-off-by: Jan Stancek
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Cc: Mike Kravetz
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Stancek
     
  • The warning message "killed due to inadequate hugepage pool" simply
    indicates that SIGBUS was sent, not that the process was forcibly
    killed. If the process has a signal handler installed that does not
    fix the problem, this message can rapidly spam the kernel log.

    On my amd64 dev machine, which does not have hugepages configured, I
    can reproduce the repeated warnings easily by setting
    vm.nr_hugepages=2 (i.e., 4 megabytes of huge pages) and running
    something that sets a signal handler and forks, like:

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    sig_atomic_t counter = 10;
    void handler(int signal)
    {
            if (counter-- == 0)
                    exit(0);
    }

    int main(void)
    {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {perror("mmap"); return 1;}
            *addr = 'x';
            switch (fork()) {
            case -1:
                    perror("fork"); return 1;
            case 0:
                    signal(SIGBUS, handler);
                    *addr = 'x';
                    break;
            default:
                    *addr = 'x';
                    wait(&status);
                    if (WIFSIGNALED(status)) {
                            psignal(WTERMSIG(status), "child");
                    }
                    break;
            }
    }

    Signed-off-by: Geoffrey Thomas
    Cc: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoffrey Thomas
     

19 Feb, 2016

1 commit

  • An incorrect default hugepage pool size is currently reported by
    /proc/sys/vm/nr_hugepages when the number of pages for the default
    huge page size is specified twice.

    When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
    indicates the current number of pre-allocated huge pages of the
    default size. Basically, /proc/sys/vm/nr_hugepages displays
    default_hstate->max_huge_pages and, after boot-time pre-allocation,
    max_huge_pages should equal the number of pre-allocated pages
    (nr_hugepages).

    Test case:

    Note that this is specific to x86 architecture.

    Boot the kernel with the command line option 'default_hugepagesz=1G
    hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'.
    After boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep
    hugepages' return the value X. However, dmesg output shows that Z
    huge pages were pre-allocated.

    So, the root cause of the problem is that the global variable
    default_hstate_max_huge_pages is set if a default huge page size is
    specified (directly or indirectly) on the command line. After the
    command line processing in hugetlb_init, if
    default_hstate_max_huge_pages is set, its value is assigned to
    default_hstate.max_huge_pages. However, default_hstate.max_huge_pages
    may have already been set based on the number of pre-allocated huge
    pages of default_hstate size.

    The solution is that if hstate->max_huge_pages is already set, it
    should not be overwritten by the global max_huge_pages value.
    Basically, if the variable hugepages is set multiple times on the
    command line for a specific supported hugepagesize, then the proc
    layer should report the last specified value.
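
    The shape of the fix in hugetlb_init() is roughly (sketch):

        /* only apply the command-line default count if no count was
         * already recorded for the default-sized hstate */
        if (default_hstate_max_huge_pages) {
                if (!default_hstate.max_huge_pages)
                        default_hstate.max_huge_pages =
                                        default_hstate_max_huge_pages;
        }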

    Signed-off-by: Vaishali Thakkar
    Reviewed-by: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: Kirill A. Shutemov
    Cc: Dave Hansen
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vaishali Thakkar
     

06 Feb, 2016

2 commits

  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page
    allocation at runtime") added runtime gigantic page allocation via
    alloc_contig_range(), making this support available only when
    CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA
    pageblocks and the associated infrastructure, it is possible with a
    few simple adjustments to require only CONFIG_MEMORY_ISOLATION
    instead of full CONFIG_CMA.

    After this patch, alloc_contig_range() and related functions are
    available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
    enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
    supporting runtime gigantic pages without the CMA-specific checks in
    page allocator fastpaths.

    Signed-off-by: Vlastimil Babka
    Cc: Luiz Capitulino
    Cc: Kirill A. Shutemov
    Cc: Zhang Yanfei
    Cc: Yasuaki Ishimatsu
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Davidlohr Bueso
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Attempting to preallocate 1G gigantic huge pages at boot time with
    "hugepagesz=1G hugepages=1" on the kernel command line will prevent
    booting with the following:

    kernel BUG at mm/hugetlb.c:1218!

    When mapcount accounting was reworked, the setting of
    compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
    a result, the validation of mapcount in free_huge_page fails.
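
    The missing piece is essentially one line at the end of
    prep_compound_gigantic_page(), mirroring prep_compound_page()
    (sketch):

        /* initialise the compound mapcount, as prep_compound_page() does */
        atomic_set(compound_mapcount_ptr(page), -1);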

    The "BUG_ON" checks in free_huge_page were also changed to
    "VM_BUG_ON_PAGE" to assist with debugging.

    Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Acked-by: David Rientjes
    Tested-by: Vlastimil Babka
    Cc: "Aneesh Kumar K.V"
    Cc: Jerome Marchand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

16 Jan, 2016

4 commits

  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we need to track the mapcount on a per-small-page
    basis.

    A straightforward approach is to use ->_mapcount in all subpages to
    track how many times the subpage is mapped with PMDs or PTEs
    combined. But this is rather expensive: mapping or unmapping a THP
    page with a PMD would require HPAGE_PMD_NR atomic operations instead
    of the single one we have now.

    The idea is to store separately how many times the page was mapped as
    a whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track the PTE mapcount.

    We use the same approach as with the compound page destructor and
    compound order to store compound_mapcount: use space in the first
    tail page, ->mapping this time.

    Any time we map/unmap a whole compound page (THP or hugetlb) we
    increment/decrement compound_mapcount. When we map part of a compound
    page with a PTE, we operate on ->_mapcount of the subpage.

    page_mapcount() counts both PTE and PMD mappings of the page.

    Basically, the mapcount of a subpage is spread over two counters,
    which makes it tricky to detect when the last mapcount for a page
    goes away.

    We introduce PageDoubleMap() for this. When we split a THP PMD for
    the first time and there is another PMD mapping left, we offset
    ->_mapcount in all subpages by one and set PG_double_map on the
    compound page. These additional references go away with the last
    compound_mapcount.

    This approach provides a way to detect when the last mapcount goes
    away on a per-small-page basis, without introducing new overhead for
    the most common cases.
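
    In code terms, the counters end up looking roughly like this
    (simplified sketch of the helpers):

        static inline atomic_t *compound_mapcount_ptr(struct page *page)
        {
                /* stored in the first tail page, overlaying ->mapping */
                return &page[1].compound_mapcount;
        }

        /* total mapcount of a subpage: its own PTE maps plus the PMD maps
         * of the compound page, minus the extra reference added when
         * PG_double_map was set */
        static inline int page_mapcount(struct page *page)
        {
                int ret = atomic_read(&page->_mapcount) + 1;

                if (likely(!PageCompound(page)))
                        return ret;

                page = compound_head(page);
                ret += atomic_read(compound_mapcount_ptr(page)) + 1;
                if (PageDoubleMap(page))
                        ret--;
                return ret;
        }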

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tail page refcounting is utterly complicated and painful to support.

    It uses ->_mapcount on tail pages to store how many times the page is
    pinned. get_page() bumps ->_mapcount on the tail page in addition to
    ->_count on the head. This information is required by
    split_huge_page() to be able to distribute pins from the head of the
    compound page to the tails during the split.

    We will need ->_mapcount to account for PTE mappings of subpages of
    the compound page. We eliminate the need for the current meaning of
    ->_mapcount in tail pages by forbidding the split entirely if the
    page is pinned.

    The only user of tail page refcounting is THP, which is marked BROKEN
    for now.

    Let's drop all this mess. It makes get_page() and put_page() much
    simpler.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we cannot rely on the PageTransHuge() check to
    decide whether to map/unmap a small page or a THP.

    The patch adds a new argument to the rmap functions to indicate
    whether we want to operate on the whole compound page or only the
    small page.

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As far as I can see there are no users of PG_reserved on compound
    pages. Let's use PF_NO_COMPOUND here.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • The Kconfig currently controlling compilation of this code is:

    config HUGETLBFS
    bool "HugeTLB file system support"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the modular code that is essentially orphaned, so that
    when reading the driver there is no doubt it is builtin-only.

    Since module_init translates to device_initcall in the non-modular case,
    the init ordering gets moved to earlier levels when we use the more
    appropriate initcalls here.

    Originally I had the fs part and the mm part as separate commits,
    simply because of how I happened to detect these non-modular use
    cases. But that could introduce regressions if the patch merge
    ordering puts the fs part first -- as 0-day testing reported a splat
    at mount time.

    Investigating with "initcall_debug" showed that the delta was
    init_hugetlbfs_fs being called _before_ hugetlb_init instead of after. So
    both the fs change and the mm change are here together.

    In addition, it worked before due to luck of link order, since they were
    both in the same initcall category. So we now have the fs part using
    fs_initcall, and the mm part using subsys_initcall, which puts it one
    bucket earlier. It now passes the basic sanity test that failed in
    earlier 0-day testing.
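
    Concretely, the registration lines become roughly (sketch):

        /* fs/hugetlbfs/inode.c -- was module_init(init_hugetlbfs_fs) */
        fs_initcall(init_hugetlbfs_fs);

        /* mm/hugetlb.c -- was module_init(hugetlb_init) */
        subsys_initcall(hugetlb_init);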

    We delete the MODULE_LICENSE tag and capture that information at the top
    of the file alongside author comments, etc.

    We don't replace module.h with init.h since the file already has that.
    Also note that MODULE_ALIAS is a no-op for non-modular code.

    Signed-off-by: Paul Gortmaker
    Reported-by: kernel test robot
    Cc: Nadia Yvette Chambers
    Cc: Alexander Viro
    Cc: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Hillf Danton
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

13 Dec, 2015

3 commits

  • Dmitry Vyukov reported the following memory leak

    unreferenced object 0xffff88002eaafd88 (size 32):
    comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
    hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

    Dmitry identified a potential memory leak in the routine region_chg,
    where a region descriptor is not freed on an error path.

    However, the root cause of the above memory leak is in region_del.
    In this specific case, a "placeholder" entry is created in region_chg.
    The associated page allocation fails, and the placeholder entry is
    left in the reserve map. This is "by design", as the entry should be
    deleted when the map is released. The bug is in the region_del
    routine, which is used to delete entries within a specific range (and
    when the map is released). region_del did not handle the case where a
    placeholder entry exactly matched the start of the range to be
    deleted. In this case, the entry would not be deleted and would be
    leaked. The fix is to take these special placeholder entries into
    account in region_del.

    The region_chg error path leak is also fixed.
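
    A sketch of the relevant check in region_del()'s scan (condition
    shown approximately): the loop must not skip a zero-length
    placeholder sitting exactly at the start of the range:

        list_for_each_entry_safe(rg, trg, head, link) {
                /*
                 * Skip regions entirely before the range -- but not a
                 * placeholder (rg->from == rg->to) located exactly at f,
                 * which must be removed here or it is leaked.
                 */
                if (rg->to <= f && (rg->to != rg->from || rg->to != f))
                        continue;
                ...
        }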

    Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently, at the beginning of hugetlb_fault(), we call
    huge_pte_offset() and check whether the obtained *ptep is a
    migration/hwpoison entry. If it is not, we then call
    huge_pte_alloc(). This is racy because *ptep could turn into a
    migration/hwpoison entry after the huge_pte_offset() check. This race
    results in a BUG_ON in huge_pte_alloc().

    We don't have to call huge_pte_alloc() when huge_pte_offset() returns
    non-NULL, so let's fix this bug by moving the code into the else
    block.

    Note that *ptep could still turn into a migration/hwpoison entry
    after this block, but that's not a problem because we have another
    !pte_present check later (we never go into hugetlb_no_page() in that
    case).
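
    The reordered flow at the top of hugetlb_fault() looks roughly like
    this (sketch):

        ptep = huge_pte_offset(mm, address);
        if (ptep) {
                entry = huge_ptep_get(ptep);
                if (unlikely(is_hugetlb_entry_migration(entry))) {
                        migration_entry_wait_huge(vma, mm, ptep);
                        return 0;
                } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
                        return VM_FAULT_HWPOISON_LARGE |
                                VM_FAULT_SET_HINDEX(hstate_index(h));
        } else {
                /* only allocate a page table when no entry exists yet */
                ptep = huge_pte_alloc(mm, address, huge_page_size(h));
                if (!ptep)
                        return VM_FAULT_OOM;
        }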

    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Mike Kravetz
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
    alloc_buddy_huge_page() to directly create a hugepage from the buddy
    allocator.

    In that case, however, if alloc_buddy_huge_page() succeeds we don't
    decrement h->resv_huge_pages, which means that a successful
    hugetlb_fault() returns without releasing the reserve count. As a
    result, a subsequent hugetlb_fault() might fail even though there are
    still free hugepages.

    This patch simply adds decrementing code on that code path.
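
    Roughly, in alloc_huge_page() (sketch):

        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
        if (!page) {
                page = alloc_buddy_huge_page(h, NUMA_NO_NODE);
                if (!page)
                        goto out_uncharge_cgroup;
                /* new: a consumed reservation must be released here too */
                if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
                        SetPagePrivate(page);
                        h->resv_huge_pages--;
                }
                ...
        }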

    I reproduced this problem when testing the v4.3 kernel in the
    following situation:
    - the test machine/VM is a NUMA system,
    - hugepage overcommitting is enabled,
    - most hugepages are allocated and there's only one free hugepage,
      which is on node 0 (for example),
    - another program, which calls set_mempolicy(MPOL_BIND) to bind
      itself to node 1, tries to allocate a hugepage,
    - the allocation should fail, but the reserve count is still held.

    Signed-off-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Nov, 2015

1 commit


07 Nov, 2015

3 commits

  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that a compound_head() call can be unsafe in
    some contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in
    one shot.

    The patch introduces page->compound_head in the third double word
    block, in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail() and the remaining bits are a pointer to the head page if
    bit zero is set.
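
    The encoding makes the flag and the pointer a single word that can be
    read and written atomically; roughly (sketch):

        static inline void set_compound_head(struct page *page,
                                             struct page *head)
        {
                WRITE_ONCE(page->compound_head, (unsigned long)head + 1);
        }

        static inline int PageTail(struct page *page)
        {
                return READ_ONCE(page->compound_head) & 1;
        }

        static inline struct page *compound_head(struct page *page)
        {
                unsigned long head = READ_ONCE(page->compound_head);

                if (unlikely(head & 1))
                        return (struct page *)(head - 1);
                return page;
        }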

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free
    since ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the
    first tail page to store a struct hugetlb_cgroup pointer. But that's
    out of scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it makes use of the bit and
    we could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and
    compound_order in struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for
    destructor lookup and store its index in struct page instead of a
    direct pointer to the destructor. It shouldn't be much trouble to
    maintain the table: we currently have only two destructors and NULL.
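
    Roughly, the table and the lookup become (sketch; CONFIG_HUGETLB_PAGE
    conditionals omitted):

        enum compound_dtor_id {
                NULL_COMPOUND_DTOR,
                COMPOUND_PAGE_DTOR,
                HUGETLB_PAGE_DTOR,
                NR_COMPOUND_DTORS,
        };

        compound_page_dtor * const compound_page_dtors[] = {
                NULL,
                free_compound_page,
                free_huge_page,
        };

        static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
        {
                VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page);
                return compound_page_dtors[page[1].compound_dtor];
        }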

    This patch frees up one word in tail pages for reuse. This is
    preparation for the next patch.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

5 commits

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be
    used, this can incur a high penalty for locking.

    For the large-file example, this is the usage pattern for a large
    statistical language model (it probably applies to other statistical
    or graphical models as well). For the security example, think of any
    application transacting in data that cannot be swapped out (credit
    card data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are
    finally faulted in. The VM_LOCKONFAULT flag is used together with
    VM_LOCKED and has no effect when set without VM_LOCKED. Setting
    VM_LOCKONFAULT for a VMA causes pages faulted into that VMA to be
    added to the unevictable LRU when they are faulted in (or immediately
    if they are already present), but it will not cause any missing pages
    to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
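
    From userspace this state is reached via mlock2(..., MLOCK_ONFAULT),
    added in the same series. A minimal usage sketch (with fallback
    defines for older headers; the syscall number below is the x86_64
    one):

        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        #ifndef MLOCK_ONFAULT
        #define MLOCK_ONFAULT 0x01
        #endif
        #ifndef __NR_mlock2
        #define __NR_mlock2 325         /* x86_64; adjust for other arches */
        #endif

        int main(void)
        {
                size_t len = 64UL << 20;        /* large mapping, only partly used */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                /* nothing is pre-faulted; pages are mlocked as they are touched */
                if (syscall(__NR_mlock2, p, len, MLOCK_ONFAULT)) {
                        perror("mlock2");
                        return 1;
                }
                p[0] = 1;       /* this page is now resident and locked */
                return 0;
        }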

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • My recent patch "mm, hugetlb: use memory policy when available" added some
    bloat to hugetlb.o. This patch aims to get some of the bloat back,
    especially when NUMA is not in play.

    It does this with an implicit #ifdef and by marking some things static
    that should have been static in my first patch. It also downgrades the
    warnings to VM_WARN_ON()s; they were responsible for a pretty big
    chunk of the bloat.

    Doing this gets our NUMA=n text size back to a wee bit _below_ where we
    started before the original patch.

    It also shaves a bit of space off the NUMA=y case, but not much.
    Enforcing the mempolicy definitely takes some text and it's hard to avoid.

    size(1) output:

    text data bss dec hex filename
    30745 3433 2492 36670 8f3e hugetlb.o.nonuma.baseline
    31305 3755 2492 37552 92b0 hugetlb.o.nonuma.patch1
    30713 3433 2492 36638 8f1e hugetlb.o.nonuma.patch2 (this patch)
    25235 473 41276 66984 105a8 hugetlb.o.numa.baseline
    25715 475 41276 67466 1078a hugetlb.o.numa.patch1
    25491 473 41276 67240 106a8 hugetlb.o.numa.patch2 (this patch)

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I have a hugetlbfs user which never explicitly allocates huge pages
    with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then
    let the pages be allocated from the buddy allocator at fault time.

    This works, but they noticed that mbind() was not doing them any good and
    the pages were being allocated without respect for the policy they
    specified.

    The code in question is this:

    > struct page *alloc_huge_page(struct vm_area_struct *vma,
    ...
    > page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
    > if (!page) {
    > page = alloc_buddy_huge_page(h, NUMA_NO_NODE);

    dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
    But, it only grabs _existing_ huge pages from the huge page pool. If the
    pool is empty, we fall back to alloc_buddy_huge_page() which obviously
    can't do anything with the VMA's policy because it isn't even passed the
    VMA.

    Almost everybody preallocates huge pages. That's probably why nobody has
    ever noticed this. Looking back at the git history, I don't think this
    _ever_ worked from when alloc_buddy_huge_page() was introduced in
    7893d1d5, 8 years ago.

    The fix is to pass vma/addr down into the places where we actually
    call into the buddy allocator. It's fairly straightforward plumbing.
    This has been lightly tested.
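
    A minimal sketch of the idea (the helper name is illustrative): hand
    the buddy fallback the vma/addr so it can honour the mempolicy
    instead of always using NUMA_NO_NODE:

        page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
        if (!page) {
                /*
                 * Like alloc_buddy_huge_page(), but consults the VMA's
                 * mempolicy (via the vma-aware allocator entry points)
                 * when vma/addr are available.
                 */
                page = __alloc_buddy_huge_page_with_mpol(h, vma, addr);
                ...
        }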

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • There are no users of the node_hstates array outside of mm/hugetlb.c,
    so let's make it static.

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Currently there's no easy way to get per-process usage of hugetlb pages,
    which is inconvenient because userspace applications which use hugetlb
    typically want to control their processes on the basis of how much memory
    (including hugetlb) they use. So this patch simply provides easy access
    to the info via /proc/PID/status.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Joern Engel
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

02 Oct, 2015

1 commit

  • SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I find a linux bug, I have the test cases is constructed. I
    can stable recurring problems in fedora22(4.0.4) kernel version,
    arch for x86_64. I construct transparent huge page, when the parent
    and child process with MAP_SHARE, MAP_PRIVATE way to access the same
    huge page area, it has the opportunity to lead to huge page copy on
    write failure, and then it will munmap the child corresponding mmap
    area, but then the child mmap area with VM_MAYSHARE attributes, child
    process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
    functions (vma - > vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Sep, 2015

9 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    Providing a true convenience function for allocations restricted to a
    node was also considered, but the prevailing opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.
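
    After the rename the pair looks roughly like this (sketch):

        /* 'nid' must be a valid node; no fallback to the current node */
        static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order,
                                     node_zonelist(nid, gfp_mask));
        }

        /* general-purpose variant: nid < 0 (including NUMA_NO_NODE) means
         * "allocate near the current node" */
        static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                    unsigned int order)
        {
                if (nid < 0)
                        nid = numa_mem_id();

                return __alloc_pages_node(nid, gfp_mask, order);
        }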

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This is based on the shmem version, but it has diverged quite a bit.
    We have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.
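
    Typical usage from an application (the path and sizes are made up;
    assumes a 2 MB huge page size):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <stdio.h>

        int main(void)
        {
                int fd = open("/mnt/huge/model.dat", O_CREAT | O_RDWR, 0600);

                if (fd < 0) { perror("open"); return 1; }
                /* preallocate 1 GB of huge pages backing the file */
                if (fallocate(fd, 0, 0, 1UL << 30))
                        perror("fallocate(preallocate)");
                /* later: return one 2 MB huge page in the middle to the pool */
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              512UL << 20, 2UL << 20))
                        perror("fallocate(punch hole)");
                return 0;
        }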

    Based on code by Dave Hansen.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently, there is only a single place where hugetlbfs pages are
    added to the page cache. The new fallocate code will be adding a
    second one, so break the functionality out into its own helper.

    Signed-off-by: Dave Hansen
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Areas hole punched by fallocate will not have entries in the
    region/reserve map. However, shared mappings with min_size subpool
    reservations may still have reserved pages. alloc_huge_page needs to
    handle this special case and do the proper accounting.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • In vma_has_reserves(), the current assumption is that reserves are
    always present for shared mappings. However, this will not be the case
    with fallocate hole punch. When punching a hole, the present page will
    be deleted as well as the region/reserve map entry (and hence any
    reservation). vma_has_reserves is passed "chg" which indicates whether
    or not a region/reserve map is present. Use this to determine if
    reserves are actually present or were removed via hole punch.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle it accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages,
    and to detect an error from region_del and pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlb page faults are currently synchronized by the table of mutexes
    (htlb_fault_mutex_table). fallocate code will need to synchronize with
    the page fault code when it allocates or deletes pages. Expose
    interfaces so that fallocate operations can be synchronized with page
    faults. Minor name changes to be more consistent with other global
    hugetlb symbols.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to remove a specific range of pages. The
    existing region_truncate() routine deletes all region/reserve map
    entries after a specified offset. region_del() will provide this same
    functionality if the end of region is specified as LONG_MAX. Hence,
    region_del() can replace region_truncate().

    Unlike region_truncate(), region_del() can return an error in the rare
    case where it cannot allocate memory for a region descriptor. This
    ONLY happens when an existing region must be split.
    Current callers passing LONG_MAX as end of range will never experience
    this error and do not need to deal with error handling. Future callers
    of region_del() (such as fallocate hole punch) will need to handle this
    error.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs is used today by applications that want a high degree of
    control over huge page usage. Often, large hugetlbfs files are used
    to map a large number of huge pages into the application processes. The
    applications know when page ranges within these large files will no
    longer be used, and ideally would like to release them back to the
    subpool or global pools for other uses. The fallocate() system call
    provides an interface for preallocation and hole punching within files.
    This patch set adds fallocate functionality to hugetlbfs.

    fallocate hole punch will want to remove a specific range of pages.
    When pages are removed, their associated entries in the region/reserve
    map will also be removed. This will break an assumption in the
    region_chg/region_add calling sequence. If a new region descriptor must
    be allocated, it is done as part of the region_chg processing. In this
    way, region_add can not fail because it does not need to attempt an
    allocation.

    To prepare for fallocate hole punch, create a "cache" of descriptors
    that can be used by region_add if necessary. region_chg will ensure
    there are sufficient entries in the cache. It will be necessary to
    track the number of in progress add operations to know a sufficient
    number of descriptors reside in the cache. A new routine region_abort
    is added to adjust this in progress count when add operations are
    aborted. vma_abort_reservation is also added for callers creating
    reservations with vma_needs_reservation/vma_commit_reservation.

    [akpm@linux-foundation.org: fix typo in comment, use more cols]
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

05 Sep, 2015

2 commits


26 Jun, 2015

1 commit


25 Jun, 2015

3 commits

  • alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
    number of pages which will be added to the reserve map. Subpool and
    global reserve counts are adjusted based on the output of region_chg.
    Before the pages are actually added to the reserve map, these routines
    could race and add fewer pages than expected. If this happens, the
    subpool and global reserve counts are not correct.

    Compare the number of pages actually added (region_add) to those
    expected to be added (region_chg). If fewer pages are actually added,
    this indicates a race; adjust the counters accordingly.
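
    In hugetlb_reserve_pages() the adjustment looks roughly like this
    (sketch):

        chg = region_chg(resv_map, from, to);
        /* ... subpool and global counts are charged based on 'chg' ... */
        add = region_add(resv_map, from, to);

        if (unlikely(chg > add)) {
                /*
                 * Entries for part of the range were added by a racing
                 * thread between region_chg() and region_add(); return
                 * the excess reservation charged above.
                 */
                long rsv_adjust = hugepage_subpool_put_pages(spool,
                                                             chg - add);

                hugetlb_acct_memory(h, -rsv_adjust);
        }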

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify region_add() to keep track of the regions (pages) added to the
    reserve map and return this value. The return value can be compared
    to the return value of region_chg() to determine if the map was
    modified between calls.

    Make vma_commit_reservation() also pass along the return value of
    region_add(). In the normal case, we want vma_commit_reservation to
    return the same value as the preceding call to vma_needs_reservation.
    Create a common __vma_reservation_common routine to help keep the
    special case return values in sync.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While working on hugetlbfs fallocate support, I noticed the following race
    in the existing code. It is unlikely that this race is hit very often in
    the current code. However, if more functionality to add and remove pages
    to hugetlbfs mappings (such as fallocate) is added the likelihood of
    hitting this race will increase.

    alloc_huge_page and hugetlb_reserve_pages use information from the reserve
    map to determine if there are enough available huge pages to complete the
    operation, as well as adjust global reserve and subpool usage counts. The
    order of operations is as follows:

    - call region_chg() to determine the expected change based on reserve map
    - determine if enough resources are available for this operation
    - adjust global counts based on the expected change
    - call region_add() to update the reserve map

    The issue is that reserve map could change between the call to region_chg
    and region_add. In this case, the counters which were adjusted based on
    the output of region_chg will not be correct.

    In order to hit this race today, there must be an existing shared
    hugetlb mmap created with the MAP_NORESERVE flag. A page fault to
    allocate a huge page via this mapping must occur at the same time
    another task is mapping the same region without the MAP_NORESERVE
    flag.

    The patch set does not prevent the race from happening. Rather, it adds
    simple functionality to detect when the race has occurred. If a race is
    detected, then the incorrect counts are adjusted.

    Review comments pointed out the need for documentation of the existing
    region/reserve map routines. This patch set also adds documentation in
    this area.

    This patch (of 3):

    This is a documentation only patch and does not modify any code.
    Descriptions of the routines used for reserve map/region tracking are
    added.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz