13 Dec, 2015

3 commits

  • Dmitry Vyukov reported the following memory leak

    unreferenced object 0xffff88002eaafd88 (size 32):
    comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
    hex dump (first 32 bytes):
    28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff (.Nc....(.Nc....
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    kmalloc include/linux/slab.h:458
    region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
    __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
    vma_needs_reservation mm/hugetlb.c:1813
    alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
    hugetlb_no_page mm/hugetlb.c:3543
    hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
    follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
    __get_user_pages+0x542/0xf30 mm/gup.c:497
    populate_vma_page_range+0xde/0x110 mm/gup.c:919
    __mm_populate+0x1c7/0x310 mm/gup.c:969
    do_mlock+0x291/0x360 mm/mlock.c:637
    SYSC_mlock2 mm/mlock.c:658
    SyS_mlock2+0x4b/0x70 mm/mlock.c:648

    Dmitry identified a potential memory leak in the routine region_chg,
    where a region descriptor is not freed on an error path.

    However, the root cause for the above memory leak resides in region_del.
    In this specific case, a "placeholder" entry is created in region_chg.
    The associated page allocation fails, and the placeholder entry is left
    in the reserve map. This is "by design" as the entry should be deleted
    when the map is released. The bug is in the region_del routine which is
    used to delete entries within a specific range (and when the map is
    released). region_del did not handle the case where a placeholder entry
    exactly matched the start of the range to be deleted. In this case, the
    entry would not be deleted and was leaked. The fix is to take
    these special placeholder entries into account in region_del.

    The region_chg error path leak is also fixed.
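
    A toy userspace model of the bug (illustrative only: the real reserve
    map is a list of regions in mm/hugetlb.c, while the array, the in_use
    flag and the handle_placeholders switch below are inventions for the
    demo):

    #include <stdio.h>

    struct region { long from, to; int in_use; };  /* in_use == 0: free slot */

    #define MAX_REGIONS 16
    static struct region map[MAX_REGIONS];

    static void region_put(long from, long to)
    {
            for (int i = 0; i < MAX_REGIONS; i++)
                    if (!map[i].in_use) {
                            map[i] = (struct region){ from, to, 1 };
                            return;
                    }
    }

    /*
     * Delete all entries inside [f, t).  The buggy variant only matches
     * entries with from < to; a zero-length placeholder whose from == to
     * == f is skipped and leaks when the map is torn down.
     */
    static void region_del(long f, long t, int handle_placeholders)
    {
            for (int i = 0; i < MAX_REGIONS; i++) {
                    struct region *rg = &map[i];

                    if (!rg->in_use)
                            continue;
                    if (rg->from >= f && rg->to <= t && rg->from < rg->to)
                            rg->in_use = 0;         /* normal entry */
                    else if (handle_placeholders && rg->from == rg->to &&
                             rg->from >= f && rg->from < t)
                            rg->in_use = 0;         /* placeholder entry */
            }
    }

    static int entries(void)
    {
            int n = 0;

            for (int i = 0; i < MAX_REGIONS; i++)
                    n += map[i].in_use;
            return n;
    }

    int main(void)
    {
            region_put(0, 0);       /* placeholder left by the failed fault */
            region_del(0, 1, 0);
            printf("old behaviour: %d entries leaked\n", entries());

            region_put(0, 0);
            region_del(0, 1, 1);
            printf("fixed behaviour: %d entries leaked\n", entries());
            return 0;
    }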

    Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries")
    Signed-off-by: Mike Kravetz
    Reported-by: Dmitry Vyukov
    Acked-by: Hillf Danton
    Cc: [4.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
    and check whether the obtained *ptep is a migration/hwpoison entry or
    not. And if not, then we get to call huge_pte_alloc(). This is racy
    because the *ptep could turn into migration/hwpoison entry after the
    huge_pte_offset() check. This race results in BUG_ON in
    huge_pte_alloc().

    We don't have to call huge_pte_alloc() when huge_pte_offset() returns
    non-NULL, so let's fix this bug by moving that code into the else
    block.

    Note that the *ptep could turn into a migration/hwpoison entry after
    this block, but that's not a problem because we have another
    !pte_present check later (we never go into hugetlb_no_page() in that
    case.)

    Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: David Rientjes
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Mike Kravetz
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
    alloc_buddy_huge_page() to directly create a hugepage from the buddy
    allocator.

    In that case, however, if alloc_buddy_huge_page() succeeds we don't
    decrement h->resv_huge_pages, which means that a successful
    hugetlb_fault() returns without releasing the reserve count. As a
    result, a subsequent hugetlb_fault() might fail even though there are
    still free hugepages.

    This patch simply adds decrementing code on that code path.

    I reproduced this problem when testing the v4.3 kernel in the following
    situation (a rough reproducer sketch follows the list):
    - the test machine/VM is a NUMA system,
    - hugepage overcommitting is enabled,
    - most hugepages are allocated and there is only one free hugepage
      left, which is on node 0 (for example),
    - another program, which calls set_mempolicy(MPOL_BIND) to bind itself
      to node 1, tries to allocate a hugepage,
    - the allocation should fail, but the reserve count is still held.
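
    A rough reproducer sketch along those lines (hedged: the node numbers,
    the 2 MB huge page size and the need to first enable overcommit via
    nr_overcommit_hugepages are assumptions about the test setup;
    set_mempolicy() needs -lnuma):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <numaif.h>              /* set_mempolicy(), MPOL_BIND */

    #define HPAGE_SIZE (2UL << 20)   /* assumes 2 MB huge pages */

    int main(void)
    {
            unsigned long nodemask = 1UL << 1;   /* bind to node 1 only */
            char *p;

            if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
                    perror("set_mempolicy");

            p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }

            /* With the only free hugepage sitting on node 0, this fault is
             * expected to fail; before the fix the reserve count taken in
             * alloc_huge_page() stayed held either way. */
            memset(p, 0, HPAGE_SIZE);
            printf("fault succeeded\n");
            return 0;
    }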

    Signed-off-by: Naoya Horiguchi
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Cc: [3.16+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Nov, 2015

3 commits

  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Hugh has pointed out that a compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger
    it in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head into the third double word
    block in front of compound_dtor and compound_order. Bit 0 encodes
    PageTail() and the remaining bits are a pointer to the head page if
    bit 0 is set.

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the
    first tail page to store the struct hugetlb_cgroup pointer. But that's
    out of scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we
    use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a
    future call_rcu_lazy() is not allowed, as it would make use of the bit
    and we could get a false positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com
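
    A small userspace model of the encoding (illustrative only; struct
    fake_page and the helper names below are made up, not the kernel's
    definitions):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    struct fake_page {
            /* One word: bit 0 set => tail page, other bits => head pointer. */
            uintptr_t compound_head;
    };

    static void set_compound_head(struct fake_page *tail, struct fake_page *head)
    {
            /* A single store publishes "is a tail" and "who is my head"
             * together, which is the point of the encoding. */
            tail->compound_head = (uintptr_t)head | 1;
    }

    static int page_tail(const struct fake_page *p)
    {
            return p->compound_head & 1;
    }

    static struct fake_page *compound_head(struct fake_page *p)
    {
            uintptr_t v = p->compound_head;

            return (v & 1) ? (struct fake_page *)(v - 1) : p;
    }

    int main(void)
    {
            struct fake_page head = { 0 }, tail = { 0 };

            set_compound_head(&tail, &head);
            assert(page_tail(&tail));
            assert(compound_head(&tail) == &head);
            assert(compound_head(&head) == &head);
            printf("head %p, compound_head(tail) %p\n",
                   (void *)&head, (void *)compound_head(&tail));
            return 0;
    }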

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves space occupied by compound_dtor and compound_order in
    struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for
    destructor lookup and store its index in struct page instead of a
    direct pointer to the destructor. It shouldn't be much trouble to
    maintain the table: we currently have only two destructors and NULL.

    This patch frees up one word in tail pages for reuse, in preparation
    for the next patch.
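
    A userspace sketch of the lookup-table idea (illustrative; the struct,
    enum and function names here are stand-ins, not the kernel's):

    #include <stdio.h>

    struct fake_page;
    typedef void (*compound_dtor_fn)(struct fake_page *);

    static void free_normal(struct fake_page *p)  { printf("normal free %p\n", (void *)p); }
    static void free_hugetlb(struct fake_page *p) { printf("hugetlb free %p\n", (void *)p); }

    /* Hardcoded destructor table: the page stores only a small index. */
    enum dtor_id { NO_DTOR, NORMAL_DTOR, HUGETLB_DTOR, NR_DTORS };

    static compound_dtor_fn const dtor_table[NR_DTORS] = {
            [NO_DTOR]      = NULL,
            [NORMAL_DTOR]  = free_normal,
            [HUGETLB_DTOR] = free_hugetlb,
    };

    struct fake_page {
            unsigned short compound_dtor;    /* index, not a pointer */
            unsigned short compound_order;
    };

    int main(void)
    {
            struct fake_page p = { .compound_dtor = HUGETLB_DTOR,
                                   .compound_order = 9 };
            compound_dtor_fn dtor = dtor_table[p.compound_dtor];

            if (dtor)
                    dtor(&p);
            return 0;
    }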

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

5 commits

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (probably applies to other statistical or
    graphical models as well). For the security example, any application
    transacting in data that cannot be swapped out (credit card data,
    medical records, etc.).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
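
    The user-visible side of this, from the same patch series, is mlock2()
    with the MLOCK_ONFAULT flag. A hedged usage sketch (assumes a libc that
    provides the mlock2() wrapper; the 1 GiB mapping size is arbitrary):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01       /* value from the same patch series */
    #endif

    int main(void)
    {
            size_t len = 1UL << 30;  /* large, sparsely used mapping */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Lock on fault: nothing is pre-faulted here; pages go to the
             * unevictable LRU only as they are actually touched. */
            if (mlock2(p, len, MLOCK_ONFAULT)) {
                    perror("mlock2");
                    return 1;
            }

            memset(p, 0, 4096);      /* only this page is faulted and locked */
            return 0;
    }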

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • My recent patch "mm, hugetlb: use memory policy when available" added some
    bloat to hugetlb.o. This patch aims to get some of the bloat back,
    especially when NUMA is not in play.

    It does this with an implicit #ifdef and marking some things static that
    should have been static in my first patch. It also makes the warnings
    only VM_WARN_ON()s. They were responsible for a pretty big chunk of the
    bloat.

    Doing this gets our NUMA=n text size back to a wee bit _below_ where we
    started before the original patch.

    It also shaves a bit of space off the NUMA=y case, but not much.
    Enforcing the mempolicy definitely takes some text and it's hard to avoid.

    size(1) output:

     text  data   bss    dec    hex  filename
    30745  3433  2492  36670   8f3e  hugetlb.o.nonuma.baseline
    31305  3755  2492  37552   92b0  hugetlb.o.nonuma.patch1
    30713  3433  2492  36638   8f1e  hugetlb.o.nonuma.patch2 (this patch)
    25235   473 41276  66984  105a8  hugetlb.o.numa.baseline
    25715   475 41276  67466  1078a  hugetlb.o.numa.patch1
    25491   473 41276  67240  106a8  hugetlb.o.numa.patch2 (this patch)

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I have a hugetlbfs user which is never explicitly allocating huge pages
    with 'nr_hugepages'. They only set 'nr_overcommit_hugepages' and then let
    the pages be allocated from the buddy allocator at fault time.

    This works, but they noticed that mbind() was not doing them any good and
    the pages were being allocated without respect for the policy they
    specified.

    The code in question is this:

    > struct page *alloc_huge_page(struct vm_area_struct *vma,
    ...
    >         page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
    >         if (!page) {
    >                 page = alloc_buddy_huge_page(h, NUMA_NO_NODE);

    dequeue_huge_page_vma() is smart and will respect the VMA's memory policy.
    But, it only grabs _existing_ huge pages from the huge page pool. If the
    pool is empty, we fall back to alloc_buddy_huge_page() which obviously
    can't do anything with the VMA's policy because it isn't even passed the
    VMA.

    Almost everybody preallocates huge pages. That's probably why nobody has
    ever noticed this. Looking back at the git history, I don't think this
    _ever_ worked from when alloc_buddy_huge_page() was introduced in
    7893d1d5, 8 years ago.

    The fix is to pass vma/addr down in to the places where we actually call
    in to the buddy allocator. It's fairly straightforward plumbing. This
    has been lightly tested.
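
    A sketch of the usage pattern the fix is meant to honor: no
    pre-allocated pool, nr_overcommit_hugepages set, and an mbind()ed
    MAP_HUGETLB mapping (the node number and 2 MB huge page size are
    assumptions; mbind() needs -lnuma):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <numaif.h>              /* mbind(), MPOL_BIND */

    #define HPAGE_SIZE (2UL << 20)   /* assumes 2 MB huge pages */

    int main(void)
    {
            unsigned long nodemask = 1UL << 0;   /* policy: node 0 only */
            char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }
            if (mbind(p, HPAGE_SIZE, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, 0)) {
                    perror("mbind");
                    return 1;
            }

            /* With nr_hugepages == 0 and nr_overcommit_hugepages > 0, this
             * fault goes through alloc_buddy_huge_page(); before the fix
             * the VMA's policy was ignored on that path. */
            memset(p, 0, HPAGE_SIZE);
            return 0;
    }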

    Signed-off-by: Dave Hansen
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Cc: Hillf Danton
    Cc: David Rientjes
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • There are no users of the node_hstates array outside of the
    mm/hugetlb.c. So let's make it static.

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Currently there's no easy way to get per-process usage of hugetlb pages,
    which is inconvenient because userspace applications which use hugetlb
    typically want to control their processes on the basis of how much memory
    (including hugetlb) they use. So this patch simply provides easy access
    to the info via /proc/PID/status.
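
    A minimal way to read the new field from userspace (assuming the kernel
    reports it as a "HugetlbPages:" line, as in mainline):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/status", "r");

            if (!f) {
                    perror("/proc/self/status");
                    return 1;
            }
            while (fgets(line, sizeof(line), f)) {
                    if (!strncmp(line, "HugetlbPages:", 13)) {
                            fputs(line, stdout);  /* e.g. "HugetlbPages:  2048 kB" */
                            break;
                    }
            }
            fclose(f);
            return 0;
    }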

    Signed-off-by: Naoya Horiguchi
    Acked-by: Joern Engel
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

02 Oct, 2015

1 commit

  • SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I find a linux bug, I have the test cases is constructed. I
    can stable recurring problems in fedora22(4.0.4) kernel version,
    arch for x86_64. I construct transparent huge page, when the parent
    and child process with MAP_SHARE, MAP_PRIVATE way to access the same
    huge page area, it has the opportunity to lead to huge page copy on
    write failure, and then it will munmap the child corresponding mmap
    area, but then the child mmap area with VM_MAYSHARE attributes, child
    process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
    functions (vma->vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Sep, 2015

9 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    This is based on the shmem version, but it has diverged quite a bit. We
    have no swap to worry about, nor the new file sealing. Add
    synchronization via the fault mutex table to coordinate page faults,
    fallocate allocation and fallocate hole punch.

    What this allows us to do is move physical memory in and out of a
    hugetlbfs file without having it mapped. This also gives us the ability
    to support MADV_REMOVE since it is currently implemented using
    fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of
    a hugetlbfs file, which wasn't possible before.

    hugetlbfs fallocate only operates on whole huge pages.

    Based on code by Dave Hansen.
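
    A hedged usage sketch of both halves, preallocation and hole punch (the
    /dev/hugepages mount point and the 2 MB huge page size are assumptions
    about the system):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define HPAGE_SIZE (2UL << 20)   /* assumes 2 MB huge pages */

    int main(void)
    {
            /* Path assumes a hugetlbfs mount at /dev/hugepages. */
            int fd = open("/dev/hugepages/example", O_CREAT | O_RDWR, 0600);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* Preallocate four huge pages without mapping the file... */
            if (fallocate(fd, 0, 0, 4 * HPAGE_SIZE))
                    perror("fallocate(preallocate)");

            /* ...then punch out the second one; offset and length must be
             * huge page aligned since hugetlbfs only operates on whole
             * huge pages. */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          HPAGE_SIZE, HPAGE_SIZE))
                    perror("fallocate(punch hole)");

            close(fd);
            return 0;
    }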

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Currently, there is only a single place where hugetlbfs pages are added
    to the page cache. The new fallocate code will be adding a second one,
    so break the functionality out into its own helper.

    Signed-off-by: Dave Hansen
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Areas hole punched by fallocate will not have entries in the
    region/reserve map. However, shared mappings with min_size subpool
    reservations may still have reserved pages. alloc_huge_page needs to
    handle this special case and do the proper accounting.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • In vma_has_reserves(), the current assumption is that reserves are
    always present for shared mappings. However, this will not be the case
    with fallocate hole punch. When punching a hole, the present page will
    be deleted as well as the region/reserve map entry (and hence any
    reservation). vma_has_reserves is passed "chg" which indicates whether
    or not a region/reserve map is present. Use this to determine if
    reserves are actually present or were removed via hole punch.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Modify truncate_hugepages() to take a range of pages (start, end)
    instead of simply start. If an end value of LLONG_MAX is passed, the
    current "truncate" functionality is maintained. Existing callers are
    modified to pass LLONG_MAX as end of range. By keying off end ==
    LLONG_MAX, the routine behaves differently for truncate and hole punch.
    Page removal is now synchronized with page allocation via faults by
    using the fault mutex table. The hole punch case can experience the
    rare region_del error and must handle it accordingly.

    Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
    the case where region_del returns an error.

    Since the routine handles more than just the truncate case, it is
    renamed to remove_inode_hugepages(). To be consistent, the routine
    truncate_huge_page() is renamed remove_huge_page().

    Downstream of remove_inode_hugepages(), the routine
    hugetlb_unreserve_pages() is also modified to take a range of pages.
    hugetlb_unreserve_pages is modified to detect an error from region_del and
    pass it back to the caller.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlb page faults are currently synchronized by the table of mutexes
    (htlb_fault_mutex_table). fallocate code will need to synchronize with
    the page fault code when it allocates or deletes pages. Expose
    interfaces so that fallocate operations can be synchronized with page
    faults. Minor name changes to be more consistent with other global
    hugetlb symbols.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • fallocate hole punch will want to remove a specific range of pages. The
    existing region_truncate() routine deletes all region/reserve map
    entries after a specified offset. region_del() will provide this same
    functionality if the end of region is specified as LONG_MAX. Hence,
    region_del() can replace region_truncate().

    Unlike region_truncate(), region_del() can return an error in the rare
    case where it can not allocate memory for a region descriptor. This
    ONLY happens in the case where an existing region must be split.
    Current callers passing LONG_MAX as end of range will never experience
    this error and do not need to deal with error handling. Future callers
    of region_del() (such as fallocate hole punch) will need to handle this
    error.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs is used today by applications that want a high degree of
    control over huge page usage. Often, large hugetlbfs files are used to
    map a large number of huge pages into the application processes. The
    applications know when page ranges within these large files will no
    longer be used, and ideally would like to release them back to the
    subpool or global pools for other uses. The fallocate() system call
    provides an interface for preallocation and hole punching within files.
    This patch set adds fallocate functionality to hugetlbfs.

    fallocate hole punch will want to remove a specific range of pages.
    When pages are removed, their associated entries in the region/reserve
    map will also be removed. This will break an assumption in the
    region_chg/region_add calling sequence. If a new region descriptor must
    be allocated, it is done as part of the region_chg processing. In this
    way, region_add can not fail because it does not need to attempt an
    allocation.

    To prepare for fallocate hole punch, create a "cache" of descriptors
    that can be used by region_add if necessary. region_chg will ensure
    there are sufficient entries in the cache. It will be necessary to
    track the number of in-progress add operations to know that a
    sufficient number of descriptors resides in the cache. A new routine region_abort
    is added to adjust this in progress count when add operations are
    aborted. vma_abort_reservation is also added for callers creating
    reservations with vma_needs_reservation/vma_commit_reservation.

    [akpm@linux-foundation.org: fix typo in comment, use more cols]
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

25 Jun, 2015

5 commits

  • alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
    number of pages which will be added to the reserve map. Subpool and
    global reserve counts are adjusted based on the output of region_chg.
    Before the pages are actually added to the reserve map, these routines
    could race and add fewer pages than expected. If this happens, the
    subpool and global reserve counts are not correct.

    Compare the number of pages actually added (region_add) to those
    expected to be added (region_chg). If fewer pages are actually added,
    this indicates a race; adjust the counters accordingly.
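
    A minimal model of the detection scheme (illustrative only; the reserve
    map is reduced to two hard-coded return values and the model_* names
    are made up):

    #include <stdio.h>

    static long global_reserve;          /* stand-in for the global count */

    /* Expected change computed from the reserve map (kernel: region_chg). */
    static long model_region_chg(void)   { return 4; }

    /* Pages actually inserted later (kernel: region_add); another task may
     * have added overlapping regions in between, so this can be smaller. */
    static long model_region_add(void)   { return 3; }

    int main(void)
    {
            long chg = model_region_chg();

            global_reserve += chg;       /* counts are adjusted up front */

            long add = model_region_add();
            if (add < chg) {
                    /* Race detected: give back the excess reservation. */
                    global_reserve -= chg - add;
                    printf("race: returned %ld excess reservation(s)\n",
                           chg - add);
            }
            printf("global_reserve = %ld\n", global_reserve);
            return 0;
    }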

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Modify region_add() to keep track of regions (pages) added to the
    reserve map and return this value. The return value can be compared to
    the return value of region_chg() to determine if the map was modified
    between calls.

    Make vma_commit_reservation() also pass along the return value of
    region_add(). In the normal case, we want vma_commit_reservation() to
    return the same value as the preceding call to vma_needs_reservation().
    Create a common __vma_reservation_common() routine to help keep the
    special case return values in sync.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While working on hugetlbfs fallocate support, I noticed the following race
    in the existing code. It is unlikely that this race is hit very often in
    the current code. However, if more functionality to add and remove pages
    to hugetlbfs mappings (such as fallocate) is added the likelihood of
    hitting this race will increase.

    alloc_huge_page and hugetlb_reserve_pages use information from the reserve
    map to determine if there are enough available huge pages to complete the
    operation, as well as adjust global reserve and subpool usage counts. The
    order of operations is as follows:

    - call region_chg() to determine the expected change based on reserve map
    - determine if enough resources are available for this operation
    - adjust global counts based on the expected change
    - call region_add() to update the reserve map

    The issue is that reserve map could change between the call to region_chg
    and region_add. In this case, the counters which were adjusted based on
    the output of region_chg will not be correct.

    In order to hit this race today, there must be an existing shared hugetlb
    mmap created with the MAP_NORESERVE flag. A page fault to allocate a huge
    page via this mapping must occur at the same time another task is mapping the
    same region without the MAP_NORESERVE flag.

    The patch set does not prevent the race from happening. Rather, it adds
    simple functionality to detect when the race has occurred. If a race is
    detected, then the incorrect counts are adjusted.

    Review comments pointed out the need for documentation of the existing
    region/reserve map routines. This patch set also adds documentation in
    this area.

    This patch (of 3):

    This is a documentation only patch and does not modify any code.
    Descriptions of the routines used for reserve map/region tracking are
    added.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Currently the initial value of order in dissolve_free_huge_page is 64
    or 32, which leads to the following warning from a static checker:

    mm/hugetlb.c:1203 dissolve_free_huge_pages()
    warn: potential right shift more than type allows '9,18,64'

    This is a potential infinite loop, because 1 << order (== 0) is used as
    the increment in a for-loop like this:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
    ...

    So this patch fixes it by using global minimum_order calculated at boot time.

    text data bss dec hex filename
    28313 469 84236 113018 1b97a mm/hugetlb.o
    28256 473 84236 112965 1b945 mm/hugetlb.o (patched)

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Reported-by: Dan Carpenter
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Currently we have many duplicate definitions of huge_pmd_unshare. On
    all architectures this function just returns 0 when
    CONFIG_ARCH_WANT_HUGE_PMD_SHARE is N.

    This patch puts the default implementation in mm/hugetlb.c and lets these
    architectures use the common code.

    Signed-off-by: Zhang Zhen
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Tony Luck
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Chris Metcalf
    Cc: David Rientjes
    Cc: James Yang
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Zhen
     

16 Apr, 2015

5 commits

    Now that we have easy access to hugepages' activeness, the existing
    helpers for getting that information can be cleaned up.

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Calling isolate_huge_page() on a hugepage concurrently is not safe; it
    can put the victim hugepage into an invalid state, which results in
    BUG_ON().

    The root problem of this is that we don't have any information on struct
    page (so easily accessible) about hugepages' activeness. Note that
    hugepages' activeness means just being linked to
    hstate->hugepage_activelist, which is not the same as normal pages'
    activeness represented by PageActive flag.

    Normal pages are isolated by isolate_lru_page(), which prechecks
    PageLRU before isolation, so let's do similarly for hugetlb with a new
    page_huge_active().

    set/clear_page_huge_active() should be called within hugetlb_lock. But
    hugetlb_cow() and hugetlb_no_page() don't do this, being justified because
    in these functions set_page_huge_active() is called right after the
    hugepage is allocated and no other thread tries to isolate it.

    [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
    [fengguang.wu@intel.com: set_page_huge_active() can be static]
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Make 'min_size=' an option when mounting a hugetlbfs filesystem. This
    option takes the same value as the 'size' option. min_size can be
    specified without specifying size. If both are specified, min_size must
    be less than or equal to size, else the mount will fail. If min_size is
    specified, then at mount time an attempt is made to reserve min_size
    pages. If the reservation fails, the mount fails. At umount time, the
    reserved pages are released.
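
    A hedged example of the resulting mount interface (requires
    CAP_SYS_ADMIN and an existing mount point; the path and sizes are
    arbitrary):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* Reserve 512M of huge pages for this filesystem at mount time;
             * the mount fails if that many free huge pages are not available. */
            if (mount("none", "/mnt/huge", "hugetlbfs", 0,
                      "size=1G,min_size=512M")) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }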

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    The same routines that perform subpool maximum size accounting,
    hugepage_subpool_get/put_pages(), are modified to also perform minimum size
    accounting. When a delta value is passed to these routines, calculate how
    global reservations must be adjusted to maintain the subpool minimum size.
    The routines now return this global reserve count adjustment. This
    global reserve count adjustment is then passed to the global accounting
    routine hugetlb_acct_memory().
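
    A deliberately simplified userspace model of the idea; it ignores the
    max-size and in-use bookkeeping of the real subpool, keeps only the
    min_hpages/rsv_hpages fields described in this series, and the function
    names are stand-ins:

    #include <stdio.h>

    struct subpool {
            long min_hpages;     /* minimum pages this subpool keeps claim to */
            long rsv_hpages;     /* pages currently held back for that minimum */
    };

    /* Take 'delta' pages for the subpool; return how much the global
     * reserve must be adjusted (pages already held for the minimum are
     * consumed first). */
    static long subpool_get_pages(struct subpool *sp, long delta)
    {
            long from_rsv = delta < sp->rsv_hpages ? delta : sp->rsv_hpages;

            sp->rsv_hpages -= from_rsv;
            return delta - from_rsv;
    }

    /* Return 'delta' pages; keep enough of them to refill the minimum and
     * report only the remainder as a global release. */
    static long subpool_put_pages(struct subpool *sp, long delta)
    {
            long keep = sp->min_hpages - sp->rsv_hpages;

            if (keep > delta)
                    keep = delta;
            if (keep < 0)
                    keep = 0;
            sp->rsv_hpages += keep;
            return delta - keep;
    }

    int main(void)
    {
            struct subpool sp = { .min_hpages = 8, .rsv_hpages = 8 };

            printf("get 3  -> global +%ld\n", subpool_get_pages(&sp, 3));  /* 0 */
            printf("get 7  -> global +%ld\n", subpool_get_pages(&sp, 7));  /* 2 */
            printf("put 10 -> global -%ld\n", subpool_put_pages(&sp, 10)); /* 2 */
            return 0;
    }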

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • hugetlbfs allocates huge pages from the global pool as needed. Even if
    the global pool contains a sufficient number of pages for the filesystem size
    at mount time, those global pages could be grabbed for some other use. As
    a result, filesystem huge page allocations may fail due to lack of pages.

    Applications such as a database want to use huge pages for performance
    reasons. hugetlbfs filesystem semantics with ownership and modes work
    well to manage access to a pool of huge pages. However, the application
    would like some reasonable assurance that allocations will not fail due to
    a lack of huge pages. At application startup time, the application would
    like to configure itself to use a specific number of huge pages. Before
    starting, the application can check to make sure that enough huge pages
    exist in the system global pools. However, there are no guarantees that
    those pages will be available when needed by the application. What the
    application wants is exclusive use of a subset of huge pages.

    Add a new hugetlbfs mount option 'min_size=' to indicate that the
    specified number of pages will be available for use by the filesystem. At
    mount time, this number of huge pages will be reserved for exclusive use
    of the filesystem. If there is not a sufficient number of free pages, the
    mount will fail. As pages are allocated to and freed from the
    filesystem, the number of reserved pages is adjusted so that the specified
    minimum is maintained.

    This patch (of 4):

    Add a field to the subpool structure to indicate the minimum number of
    huge pages to always be used by this subpool. This minimum count includes
    allocated pages as well as reserved pages. If the minimum number of pages
    for the subpool have not been allocated, pages are reserved up to this
    minimum. An additional field (rsv_hpages) is used to track the number of
    pages reserved to meet this minimum size. The hstate pointer in the
    subpool is convenient to have when reserving and unreserving the pages.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

15 Apr, 2015

2 commits

  • If __get_user_pages() is faulting a significant number of hugetlb pages,
    usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
    very large amount of memory.

    If the process has been oom killed, this will cause a lot of memory to
    potentially deplete memory reserves.

    In the same way that commit 4779280d1ea4 ("mm: make get_user_pages()
    interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
    memory, based on the premise of commit 462e00cc7151 ("oom: stop
    allocating user memory if TIF_MEMDIE is set"), hugetlb page faults now
    terminate when the process has been oom killed.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Acked-by: Greg Thelen
    Cc: Naoya Horiguchi
    Acked-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around
    follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte
    layout differ and using pte_page() on a huge pmd will return wrong
    results. Using pmd_page() instead fixes this.

    All architectures that were touched by that commit have pmd_page()
    defined, so this should not break anything on other architectures.

    Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*"
    Signed-off-by: Gerald Schaefer
    Acked-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: Michal Hocko , Andrea Arcangeli
    Cc: Martin Schwidefsky
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

13 Mar, 2015

1 commit

  • Now that gigantic pages are dynamically allocatable, care must be taken to
    ensure that p->first_page is valid before setting PageTail.

    If this isn't done, then it is possible to race and have compound_head()
    return NULL.

    Signed-off-by: David Rientjes
    Acked-by: Davidlohr Bueso
    Cc: Luiz Capitulino
    Cc: Joonsoo Kim
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Feb, 2015

2 commits

    Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD
    page tables. The Linux kernel doesn't account PMD tables to the
    process, only PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            prctl(PR_SET_THP_DISABLE);
            for (i = 0; i < NR_PUD ; i++) {
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    *addr = 'x';
                    munmap(addr, PMD_SIZE);
                    mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                         MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes, we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If __unmap_hugepage_range() tries to unmap the address range over which
    hugepage migration is on the way, we get the wrong page because pte_page()
    doesn't work for migration entries. This patch simply clears the pte for
    migration entries as we do for hwpoison entries.

    Fixes: 290408d4a2 ("hugetlb: hugepage migration core")
    Signed-off-by: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi