13 Aug, 2020

5 commits

  • new_non_cma_page() in gup.c needs to allocate a new page that is not in
    the CMA area. new_non_cma_page() implements this by using the allocation
    scope APIs.

    However, there is a workaround for hugetlb. The normal hugetlb page
    allocation API for migration is alloc_huge_page_nodemask(). It consists
    of two steps: first, dequeue a page from the pool; second, if there is no
    available page on the queue, allocate one using the page allocator.

    new_non_cma_page() can't use this API since the first step (dequeue)
    isn't aware of the scope API that excludes the CMA area. So,
    new_non_cma_page() exports the hugetlb internal function for the second
    step, alloc_migrate_huge_page(), to global scope and uses it directly.
    This is suboptimal since hugetlb pages already on the queue cannot be
    utilized.

    This patch fixes the situation by making the dequeue function in hugetlb
    CMA aware. In the dequeue function, CMA memory is skipped if the
    PF_MEMALLOC_NOCMA flag is found.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1596180906-8442-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
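
    A minimal sketch of the CMA-aware dequeue described above, assuming a
    simplified free-list walk; the helper name below is illustrative, while
    is_migrate_cma_page() and PF_MEMALLOC_NOCMA are existing kernel symbols.

    /* Illustrative sketch only: skip CMA-backed hugetlb pages when the
     * caller has opened a PF_MEMALLOC_NOCMA allocation scope.
     */
    static struct page *dequeue_huge_page_node_sketch(struct hstate *h, int nid)
    {
        struct page *page;
        bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);

        list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
            /* Honour the scope API: CMA pages are not eligible. */
            if (nocma && is_migrate_cma_page(page))
                continue;
            if (PageHWPoison(page))
                continue;

            list_move(&page->lru, &h->hugepage_activelist);
            set_page_refcounted(page);
            h->free_huge_pages--;
            h->free_huge_pages_node[nid]--;
            return page;
        }
        return NULL;
    }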
     
  • We have a well defined scope API to exclude the CMA region. Use it
    rather than manipulating gfp_mask manually. With this change, we can now
    restore __GFP_MOVABLE in gfp_mask as for usual migration target
    allocation, so ZONE_MOVABLE is also searched by the page allocator. For
    hugetlb, gfp_mask is redefined since it has a regular allocation mask
    filter for migration targets. __GFP_NOWARN is added to the hugetlb
    gfp_mask filter since a new user of the filter, gup, wants to be silent
    when allocation fails.

    Note that this can be considered a fix for commit 9a4e9f3b2d73
    ("mm: update get_user_pages_longterm to migrate pages allocated from CMA
    region"). However, a "Fixes" tag isn't added here since the old behaviour
    is merely suboptimal and doesn't cause any problem.

    Suggested-by: Michal Hocko
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1596180906-8442-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
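
    A minimal sketch of the scope API usage described above; the wrapper
    name is hypothetical, while memalloc_nocma_save()/memalloc_nocma_restore()
    are the existing scope calls, and __GFP_NOWARN matches the gup behaviour
    mentioned in the message.

    static struct page *alloc_non_cma_page_sketch(int nid)
    {
        unsigned int flags;
        struct page *page;

        /* Open an allocation scope that excludes CMA pageblocks ... */
        flags = memalloc_nocma_save();
        /* ... so __GFP_MOVABLE can stay and ZONE_MOVABLE is still searched. */
        page = __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_NOWARN, 0);
        memalloc_nocma_restore(flags);

        return page;
    }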
     
  • There are several similar functions for migration target allocation.
    Since there is no fundamental difference between them, it's better to
    keep just one rather than all the variants. This patch implements the
    base migration target allocation function. In the following patches, the
    variants will be converted to use this function.

    The changes should be mechanical, but, unfortunately, there are some
    differences. First, some callers' nodemask is assigned NULL, since a
    NULL nodemask is treated as all available nodes, that is,
    &node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
    redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE
    if the user-provided gfp_mask has it. This is because a future caller of
    this function needs to set this node constraint. Lastly, if the provided
    node id is NUMA_NO_NODE, it is set to the node where the migration
    source lives. This helps remove the simple wrappers used only to set up
    the node id.

    Note that the PageHighmem() call in the previous function is changed to
    an open-coded "is_highmem_idx()" check since that is more readable.

    [akpm@linux-foundation.org: tweak patch title, per Vlastimil]
    [akpm@linux-foundation.org: fix typo in comment]

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
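
    A rough sketch of the consolidated interface this patch introduces,
    following the names alloc_migration_target() and struct
    migration_target_control; the body below is heavily simplified and
    omits highmem/THP handling.

    struct migration_target_control {
        int nid;              /* preferred node id */
        nodemask_t *nmask;    /* NULL means all nodes (&node_states[N_MEMORY]) */
        gfp_t gfp_mask;
    };

    /* Callers pass the control structure through the 'private' argument. */
    static struct page *alloc_migration_target_sketch(struct page *page,
                                                      unsigned long private)
    {
        struct migration_target_control *mtc;
        gfp_t gfp_mask;
        int nid;

        mtc = (struct migration_target_control *)private;
        gfp_mask = mtc->gfp_mask;
        nid = mtc->nid;
        if (nid == NUMA_NO_NODE)
            nid = page_to_nid(page);   /* follow the migration source */

        if (PageHuge(page)) {
            struct hstate *h = page_hstate(compound_head(page));

            /* regular hugetlb mask, plus __GFP_THISNODE if the user asked */
            gfp_mask = htlb_alloc_mask(h) | (gfp_mask & __GFP_THISNODE);
            return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask);
        }

        return __alloc_pages_nodemask(gfp_mask, 0, nid, mtc->nmask);
    }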
     
  • There is no difference between the two migration callback functions,
    alloc_huge_page_node() and alloc_huge_page_nodemask(), except for
    __GFP_THISNODE handling. It's redundant to have two nearly identical
    functions just to handle this flag, so this patch removes one of them by
    introducing a new argument, gfp_mask, to alloc_huge_page_nodemask().

    With the gfp_mask argument, it is the caller's job to provide the
    correct gfp_mask, so every call site of alloc_huge_page_nodemask() is
    changed to pass one.

    Note that it's safe to remove the node id check in alloc_huge_page_node()
    since there is no caller passing NUMA_NO_NODE as a node id.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
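
    A small sketch of a converted call site under the new interface: the
    caller builds the complete mask with htlb_alloc_mask() and decides about
    __GFP_THISNODE itself. The wrapper name is illustrative.

    static struct page *new_hugetlb_target_sketch(struct hstate *h, int nid,
                                                  nodemask_t *nmask,
                                                  bool thisnode)
    {
        gfp_t gfp_mask = htlb_alloc_mask(h);

        if (thisnode)
            gfp_mask |= __GFP_THISNODE;

        /* gfp_mask is now an explicit argument of the callee */
        return alloc_huge_page_nodemask(h, nid, nmask, gfp_mask);
    }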
     
  • Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
    in at least read mode. This is because the explicit locking in
    huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
    the code, the call to huge_pte_alloc in the else block at the beginning of
    hugetlb_fault was missed.

    Unfortunately, that else clause is exercised when there is no page table
    entry. This will likely lead to a call to huge_pmd_share. If
    huge_pmd_share thinks pmd sharing is possible, it will traverse the
    mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
    modifying the tree, bad things such as addressing exceptions or worse
    could happen.

    Simply remove the else clause. It should have been removed previously.
    The code following the else will call huge_pte_alloc with the appropriate
    locking.

    To prevent this type of issue in the future, add routines to assert that
    i_mmap_rwsem is held, and call these routines in huge pmd sharing
    routines.

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Suggested-by: Matthew Wilcox
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A.Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
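
    A minimal sketch of the lockdep-based assertions this patch adds for the
    pmd sharing paths; the names mirror i_mmap_assert_locked() and
    i_mmap_assert_write_locked().

    static inline void i_mmap_assert_locked(struct address_space *mapping)
    {
        lockdep_assert_held(&mapping->i_mmap_rwsem);
    }

    static inline void i_mmap_assert_write_locked(struct address_space *mapping)
    {
        lockdep_assert_held_write(&mapping->i_mmap_rwsem);
    }

    /* huge_pmd_share() can then start with, for example:      */
    /*     i_mmap_assert_locked(vma->vm_file->f_mapping);      */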
     

10 Jun, 2020

1 commit

  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

5 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
  • There are multiple similar definitions of arch_clear_hugepage_flags() on
    various platforms. Let's just add a generic fallback definition for
    platforms that do not override it. This helps reduce code duplication.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Thomas Bogendoerfer
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Link: http://lkml.kernel.org/r/1588907271-11920-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
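
    A sketch of the override pattern behind the generic fallback: an
    architecture that needs its own version defines both the function and a
    matching macro so the generic header skips its copy. The arch override
    shown in the comment is only an example.

    /* include/linux/hugetlb.h: generic fallback (sketch) */
    #ifndef arch_clear_hugepage_flags
    static inline void arch_clear_hugepage_flags(struct page *page)
    {
    }
    #define arch_clear_hugepage_flags arch_clear_hugepage_flags
    #endif

    /* arch/<foo>/include/asm/hugetlb.h: override (sketch)
     *
     * #define arch_clear_hugepage_flags arch_clear_hugepage_flags
     * static inline void arch_clear_hugepage_flags(struct page *page)
     * {
     *         clear_bit(PG_dcache_clean, &page->flags);
     * }
     */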
     
  • There are multiple similar definitions of is_hugepage_only_range() on
    various platforms. Let's just add a generic fallback definition for
    platforms that do not override it. This helps reduce code duplication.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Mike Kravetz
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Thomas Bogendoerfer
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Link: http://lkml.kernel.org/r/1588907271-11920-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Now that architectures provide arch_hugetlb_valid_size(), parsing of
    "hugepagesz=" can be done in architecture independent code. Create a
    single routine to handle hugepagesz= parsing and remove all arch specific
    routines. We can also remove the interface hugetlb_bad_size() as this is
    no longer used outside arch independent code.

    This also provides consistent behavior of hugetlbfs command line options.
    The hugepagesz= option should only be specified once for a specific size,
    but some architectures allow multiple instances. This appears to be more
    of an oversight from when code was added by some architectures to set up
    ALL huge page sizes.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Sandipan Das
    Reviewed-by: Peter Xu
    Acked-by: Mina Almasry
    Acked-by: Gerald Schaefer [s390]
    Acked-by: Will Deacon
    Cc: Albert Ou
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Longpeng
    Cc: Nitesh Narayan Lal
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Anders Roxell
    Cc: "Aneesh Kumar K.V"
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200417185049.275845-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20200428205614.246260-3-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
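
    A condensed sketch of the arch-independent parsing added here; error
    handling and duplicate-size checks in the real mm/hugetlb.c routine are
    trimmed.

    static int __init hugepagesz_setup(char *s)
    {
        unsigned long size;

        size = (unsigned long)memparse(s, NULL);

        if (!arch_hugetlb_valid_size(size)) {
            pr_err("HugeTLB: unsupported hugepagesz=%s\n", s);
            return 0;
        }

        /* register the hstate for this (validated) huge page size */
        hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
        return 1;
    }
    __setup("hugepagesz=", hugepagesz_setup);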
     
  • Patch series "Clean up hugetlb boot command line processing", v4.

    Longpeng(Mike) reported a weird message from hugetlb command line
    processing and proposed a solution [1]. While the proposed patch does
    address the specific issue, there are other related issues in command line
    processing. As hugetlbfs evolved, updates to command line processing have
    been made to meet immediate needs and not necessarily in a coordinated
    manner. The result is that some processing is done in arch specific code,
    some is done in arch independent code and coordination is problematic.
    Semantics can vary between architectures.

    The patch series does the following:
    - Define arch specific arch_hugetlb_valid_size routine used to validate
    passed huge page sizes.
    - Move hugepagesz= command line parsing out of arch specific code and into
    an arch independent routine.
    - Clean up command line processing to follow desired semantics and
    document those semantics.

    [1] https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpeng2@huawei.com

    This patch (of 3):

    The architecture independent routine hugetlb_default_setup sets up the
    default huge page size. It has no way to verify whether the passed value
    is valid, so it accepts it and attempts to validate it at a later time.
    This requires undocumented cooperation between the arch specific and arch
    independent code.

    For architectures that support more than one huge page size, provide a
    routine arch_hugetlb_valid_size to validate a huge page size.
    hugetlb_default_setup can use this to validate passed values.

    arch_hugetlb_valid_size will also be used in a subsequent patch to move
    processing of the "hugepagesz=" in arch specific code to a common routine
    in arch independent code.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Gerald Schaefer [s390]
    Acked-by: Will Deacon
    Cc: Catalin Marinas
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: David S. Miller
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Jonathan Corbet
    Cc: Longpeng
    Cc: Christophe Leroy
    Cc: Randy Dunlap
    Cc: Mina Almasry
    Cc: Peter Xu
    Cc: Nitesh Narayan Lal
    Cc: Anders Roxell
    Cc: "Aneesh Kumar K.V"
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200428205614.246260-1-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20200428205614.246260-2-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20200417185049.275845-1-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20200417185049.275845-2-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
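
    A minimal sketch of the validation hook: a weak generic version accepts
    only the primary huge page size, and architectures with more sizes
    override it.

    /* mm/hugetlb.c: generic fallback, overridden by arch code (sketch) */
    bool __init __weak arch_hugetlb_valid_size(unsigned long size)
    {
        return size == HPAGE_SIZE;
    }

    /* e.g. an arm64-style override would also accept PMD_SIZE, PUD_SIZE
     * and the contiguous-bit sizes.
     */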
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

11 Apr, 2020

1 commit

  • Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
    at runtime") has added the run-time allocation of gigantic pages.

    However, it actually works only during the early stages of system boot,
    when the majority of memory is free. After some time the memory gets
    fragmented by non-movable pages, so the chances of finding a contiguous
    1GB block get close to zero. Even dropping caches manually doesn't help
    much.

    At large scale, rebooting servers in order to allocate gigantic hugepages
    is quite expensive and complex. At the same time, keeping some constant
    percentage of memory in reserved hugepages even if the workload isn't
    using it is a big waste: not all workloads can benefit from using 1 GB
    pages.

    The following solution can solve the problem:
    1) At boot time a dedicated cma area* is reserved. The size is passed
    as a kernel argument.
    2) Run-time allocations of gigantic hugepages are performed using the
    cma allocator and the dedicated cma area.

    In this case gigantic hugepages can be allocated successfully with a
    high probability; however, the memory isn't completely wasted if nobody
    is using 1GB hugepages: it can be used for pagecache, anon memory, THPs,
    etc.

    * On a multi-node machine a per-node cma area is allocated on each node.
    Subsequent gigantic hugetlb allocations use the first available NUMA
    node if a mask isn't specified by the user.

    Usage:
    1) configure the kernel to allocate a cma area for hugetlb allocations:
    pass hugetlb_cma=10G as a kernel argument

    2) allocate hugetlb pages as usual, e.g.
    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

    If the option isn't enabled or the allocation of the cma area failed,
    the current behavior of the system is preserved.

    x86 and arm-64 are covered by this patch, other architectures can be
    trivially added later.

    The patch contains clean-ups and fixes proposed and implemented by Aslan
    Bakirov and Randy Dunlap. It also contains ideas and suggestions
    proposed by Rik van Riel, Michal Hocko and Mike Kravetz. Thanks!

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Tested-by: Andreas Schaufler
    Acked-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Aslan Bakirov
    Cc: Randy Dunlap
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200407163840.92263-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
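
    A sketch of the runtime gigantic-page path under this scheme, assuming
    hugetlb_cma[] is the per-node CMA area reserved at boot from the
    hugetlb_cma= argument; details of the real allocator are simplified, and
    the fallback keeps the previous behavior.

    static struct page *alloc_gigantic_page_sketch(struct hstate *h,
                                                   gfp_t gfp_mask, int nid,
                                                   nodemask_t *nodemask)
    {
        unsigned long nr_pages = 1UL << huge_page_order(h);
        struct page *page;

        /* Try the dedicated CMA area first, if one was reserved at boot. */
        if (IS_ENABLED(CONFIG_CMA) && hugetlb_cma[nid]) {
            page = cma_alloc(hugetlb_cma[nid], nr_pages,
                             huge_page_order(h), true);
            if (page)
                return page;
        }

        /* Otherwise fall back to the old contiguous allocation attempt. */
        return alloc_contig_pages(nr_pages, gfp_mask, nid, nodemask);
    }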
     

03 Apr, 2020

5 commits

  • When CONFIG_HUGETLB_PAGE is set but not CONFIG_HUGETLBFS, the following
    build failure is encountered:

    In file included from arch/powerpc/mm/fault.c:33:0:
    include/linux/hugetlb.h: In function 'hstate_inode':
    include/linux/hugetlb.h:477:9: error: implicit declaration of function 'HUGETLBFS_SB' [-Werror=implicit-function-declaration]
    return HUGETLBFS_SB(i->i_sb)->hstate;
    ^
    include/linux/hugetlb.h:477:30: error: invalid type argument of '->' (have 'int')
    return HUGETLBFS_SB(i->i_sb)->hstate;
    ^

    Gate hstate_inode() with CONFIG_HUGETLBFS instead of CONFIG_HUGETLB_PAGE.

    Fixes: a137e1cc6d6e ("hugetlbfs: per mount huge page sizes")
    Reported-by: kbuild test robot
    Signed-off-by: Christophe Leroy
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Baoquan He
    Cc: Nishanth Aravamudan
    Cc: Nick Piggin
    Cc: Adam Litke
    Cc: Andi Kleen
    Link: http://lkml.kernel.org/r/7e8c3a3c9a587b9cd8a2f146df32a421b961f3a2.1584432148.git.christophe.leroy@c-s.fr
    Link: https://patchwork.ozlabs.org/patch/1255548/#2386036
    Signed-off-by: Linus Torvalds

    Christophe Leroy
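
    A sketch of the fix: hstate_inode() dereferences the hugetlbfs
    superblock info, so it is now guarded by CONFIG_HUGETLBFS, with a stub
    for the !CONFIG_HUGETLBFS case.

    #ifdef CONFIG_HUGETLBFS
    static inline struct hstate *hstate_inode(struct inode *i)
    {
        return HUGETLBFS_SB(i->i_sb)->hstate;
    }
    #else /* !CONFIG_HUGETLBFS */
    static inline struct hstate *hstate_inode(struct inode *i)
    {
        return NULL;
    }
    #endif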
     
  • For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
    in the resv_map entries, in file_region->reservation_counter.

    After a call to region_chg, we charge the appropriate hugetlb_cgroup,
    and if successful, we pass the hugetlb_cgroup info on to a follow-up
    region_add call. When a file_region entry is added to the resv_map via
    region_add, we put the pointer to that cgroup in
    file_region->reservation_counter. If charging doesn't succeed, we report
    the error to the caller, so that the kernel fails the reservation.

    On region_del, which is when the hugetlb memory is unreserved, we also
    uncharge the file_region->reservation_counter.

    [akpm@linux-foundation.org: forward declare struct file_region]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Mike Kravetz
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-5-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
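
    A sketch of where the uncharge information lives for shared mappings, as
    described above; the field names follow the series, but the layout here
    is abridged.

    /* One entry in a shared mapping's resv_map, describing a reserved range. */
    struct file_region {
        struct list_head link;
        long from;
        long to;
    #ifdef CONFIG_CGROUP_HUGETLB
        /*
         * On region_add() the charged hugetlb_cgroup is recorded here;
         * region_del() uses it to uncharge when the range is unreserved.
         */
        struct page_counter *reservation_counter;
        struct cgroup_subsys_state *css;
    #endif
    };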
     
  • Normally the pointer to the cgroup to uncharge hangs off the struct page,
    and gets queried when it's time to free the page. With hugetlb_cgroup
    reservations, this is not possible, because a page can be reserved by one
    task and actually faulted in by another task.

    The best place to put the hugetlb_cgroup pointer to uncharge for
    reservations is in the resv_map. But, because the resv_map has different
    semantics for private and shared mappings, the code path to
    charge/uncharge shared and private mappings is different. This patch
    implements charging and uncharging for private mappings.

    For private mappings, the counter to uncharge is in
    resv_map->reservation_counter. On initializing the resv_map this is set
    to NULL. On reservation of a region in a private mapping, the task's
    hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
    resv_map->reservation_counter.

    On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

    [akpm@linux-foundation.org: forward declare struct resv_map]
    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200211213128.73302-3-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • These counters will track hugetlb reservations rather than hugetlb memory
    faulted in. This patch only adds the counter; following patches add the
    charging and uncharging of the counter.

    This is patch 1 of a 9-patch series.

    Problem:

    Currently tasks attempting to reserve more hugetlb memory than is
    available get a failure at mmap/shmget time. This is thanks to Hugetlbfs
    Reservations [1]. However, if a task attempts to reserve more hugetlb
    memory than its hugetlb_cgroup limit allows, the kernel will allow the
    mmap/shmget call, but will SIGBUS the task when it attempts to fault in
    the excess memory.

    We have users hitting their hugetlb_cgroup limits and thus we've been
    looking at this failure mode. We'd like to improve this behavior such
    that users violating the hugetlb_cgroup limits get an error at
    mmap/shmget time, rather than getting SIGBUS'd when they try to fault
    the excess memory in. This gives the user an opportunity to fall back
    more gracefully to non-hugetlbfs memory, for example.

    The underlying problem is that today's hugetlb_cgroup accounting happens
    at hugetlb memory *fault* time, rather than at *reservation* time. Thus,
    enforcing the hugetlb_cgroup limit only happens at fault time, and the
    offending task gets SIGBUS'd.

    Proposed Solution:

    A new page counter named
    'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
    slightly different semantics than
    'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

    - While usage_in_bytes tracks all *faulted* hugetlb memory,
    rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and hugetlb
    memory faulted in without a prior reservation.

    - If a task attempts to reserve more memory than limit_in_bytes allows,
    the kernel will allow it to do so. But if a task attempts to reserve
    more memory than rsvd.limit_in_bytes, the kernel will fail this
    reservation.

    This proposal is implemented in this patch series, with tests to verify
    functionality and show the usage.

    Alternatives considered:

    1. A new cgroup, instead of only a new page_counter attached to the
    existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
    duplication with hugetlb_cgroup. Keeping hugetlb related page counters
    under hugetlb_cgroup seemed cleaner as well.

    2. Instead of adding a new counter, we considered adding a sysctl that
    modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
    accounting at reservation time rather than fault time. Adding a new
    page_counter seems better as userspace could, if it wants, choose to
    enforce different cgroups differently: one via limit_in_bytes, and
    another via rsvd.limit_in_bytes. This could be very useful if you're
    transitioning how hugetlb memory is partitioned on your system one
    cgroup at a time, for example. Also, someone may find usage for both
    limit_in_bytes and rsvd.limit_in_bytes concurrently, and this approach
    gives them the option to do so.

    Testing:
    - Added tests passing.
    - Used libhugetlbfs for regression testing.

    [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

    Signed-off-by: Mina Almasry
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: David Rientjes
    Cc: Shuah Khan
    Cc: Shakeel Butt
    Cc: Greg Thelen
    Cc: Sandipan Das
    Link: http://lkml.kernel.org/r/20200211213128.73302-1-almasrymina@google.com
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

    While discussing the issue with huge_pte_offset [1], I remembered that
    there were more outstanding hugetlb races. These issues are:

    1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
    invalid via a call to huge_pmd_unshare by another thread.
    2) hugetlbfs page faults can race with truncation causing invalid global
    reserve counts and state.

    A previous attempt was made to use i_mmap_rwsem in this manner as
    described at [2]. However, those patches were reverted starting with [3]
    due to locking issues.

    To effectively use i_mmap_rwsem to address the above issues it needs to be
    held (in read mode) during page fault processing. However, during fault
    processing we need to lock the page we will be adding. Lock ordering
    requires we take page lock before i_mmap_rwsem. Waiting until after
    taking the page lock is too late in the fault process for the
    synchronization we want to do.

    To address this lock ordering issue, the following patches change the lock
    ordering for hugetlb pages. This is not too invasive as hugetlbfs
    processing is done separately from core mm in many places. However, I don't
    really like this idea. Much ugliness is contained in the new routine
    hugetlb_page_mapping_lock_write() of patch 1.

    The only other way I can think of to address these issues is by catching
    all the races. After catching a race, cleanup, backout, retry ... etc,
    as needed. This can get really ugly, especially for huge page
    reservations. At one time, I started writing some of the reservation
    backout code for page faults and it got so ugly and complicated I went
    down the path of adding synchronization to avoid the races. Any other
    suggestions would be welcome.

    [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
    [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
    [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
    [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

    This patch (of 2):

    While looking at BUGs associated with invalid huge page map counts, it
    was discovered that a huge pte pointer could become 'invalid' and point
    to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:
    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with
    the ptep.
    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

    One problem with this scheme is that it requires taking i_mmap_rwsem
    before taking the page lock during page faults. This is not the order
    specified in the rest of mm code. Handling of hugetlbfs pages is mostly
    isolated today. Therefore, we use this alternative locking order for
    PageHuge() pages.

    mapping->i_mmap_rwsem
      hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
        page->flags PG_locked (lock_page)

    To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
    introduced to write lock the i_mmap_rwsem associated with a page.

    In most cases it is easy to get address_space via vma->vm_file->f_mapping.
    However, in the case of migration or memory errors for anon pages we do
    not have an associated vma. A new routine _get_hugetlb_page_mapping()
    will use anon_vma to get address_space in these cases.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
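
    A compressed sketch of the fault-path ordering described above
    (i_mmap_rwsem in read mode, then the hugetlb fault mutex, then the page
    lock); the real hugetlb_fault()/hugetlbfs code around it is omitted and
    the function name is illustrative.

    static vm_fault_t hugetlb_fault_locking_sketch(struct mm_struct *mm,
                                                   struct vm_area_struct *vma,
                                                   unsigned long address)
    {
        struct address_space *mapping = vma->vm_file->f_mapping;
        struct hstate *h = hstate_vma(vma);
        pgoff_t idx = vma_hugecache_offset(h, vma, address);
        vm_fault_t ret = 0;
        pte_t *ptep;
        u32 hash;

        /* 1) i_mmap_rwsem in read mode, taken before the page lock */
        i_mmap_lock_read(mapping);

        /* 2) hugetlb fault mutex, serializing faults on this (mapping, idx) */
        hash = hugetlb_fault_mutex_hash(mapping, idx);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);

        ptep = huge_pte_alloc(mm, address, huge_page_size(h));
        if (!ptep)
            ret = VM_FAULT_OOM;
        /* 3) ... the actual fault handling and lock_page() happen here ... */

        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        i_mmap_unlock_read(mapping);
        return ret;
    }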
     

17 Dec, 2019

1 commit

  • In the effort of supporting cgroups v2 in Kubernetes, I stumbled on
    the lack of the hugetlb controller.

    When the controller is enabled, it exposes four new files for each
    hugetlb size on non-root cgroups:

    - hugetlb.<hugetlb size>.current
    - hugetlb.<hugetlb size>.max
    - hugetlb.<hugetlb size>.events
    - hugetlb.<hugetlb size>.events.local

    The differences with the legacy hierarchy are in the file names and
    using the value "max" instead of "-1" to disable a limit.

    The file .limit_in_bytes is renamed to .max.

    The file .usage_in_bytes is renamed to .current.

    .failcnt is not provided as a single file anymore, but its value can
    be read through the new flat-keyed files .events and .events.local,
    through the "max" key.

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Tejun Heo

    Giuseppe Scrivano
     

02 Dec, 2019

3 commits

  • The first parameter hstate in function hugetlb_fault_mutex_hash() is not
    used anymore.

    This patch removes it.

    [akpm@linux-foundation.org: various build fixes]
    [cai@lca.pw: fix a GCC compilation warning]
    Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw
    Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Signed-off-by: Qian Cai
    Suggested-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • huge_pte_offset() produced a sparse warning due to an improper return
    type when the kernel was built with !CONFIG_HUGETLB_PAGE. Fix the bad
    type and also convert all the macros in this block to static inline
    wrappers. Two existing wrappers in this block had lines in excess of 80
    columns so clean those up as well.

    No functional change.

    Link: http://lkml.kernel.org/r/20191112194558.139389-3-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Ben Dooks
    Suggested-by: Jason Gunthorpe
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
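
    A before/after sketch of the conversion pattern for the
    !CONFIG_HUGETLB_PAGE stubs; huge_pte_offset() is the case that produced
    the sparse warning, since the old macro expanded to a plain 0 rather
    than a pte_t pointer.

    /* Before (sketch): type-unsafe macro stub */
    /* #define huge_pte_offset(mm, address, sz)  0 */

    /* After: a typed stub with the proper return type */
    static inline pte_t *huge_pte_offset(struct mm_struct *mm,
                                         unsigned long addr, unsigned long sz)
    {
        return NULL;
    }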
     
  • A new clang diagnostic (-Wsizeof-array-div) warns about the calculation
    to determine the number of u32's in an array of unsigned longs.
    Suppress the warning by adding parentheses.

    While looking at the above issue, I noticed that the 'address' parameter
    to hugetlb_fault_mutex_hash is no longer used. So, remove it from the
    definition and all callers.

    No functional change.

    Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Nathan Chancellor
    Reviewed-by: Nathan Chancellor
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Andrew Morton
    Cc: Nick Desaulniers
    Cc: Ilie Halip
    Cc: David Bolvansky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

25 Sep, 2019

1 commit

  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Jul, 2019

2 commits

  • While only powerpc supports the hugepd case, the code is pretty generic
    and I'd like to keep all GUP internals in one place.

    Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Instead of using defines, which loses type safety and provokes unused
    variable warnings from gcc, put the constants into static inlines.

    Link: http://lkml.kernel.org/r/20190522235102.GA15370@mellanox.com
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     

15 May, 2019

1 commit

  • hugetlb uses a fault mutex hash table to prevent page faults of the
    same pages concurrently. The key for shared and private mappings is
    different: shared mappings key off the address_space and file index,
    while private mappings key off the mm and virtual address. Consider a
    private mapping of a populated hugetlbfs file. A fault will map the page
    from the file and, if needed, do a COW to map a writable page.

    Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
    pages. It uses the address_space file index key. However, private
    mappings will use a different key and could race with this code to map
    the file page. This causes problems (BUG) for the page cache remove
    code as it expects the page to be unmapped. A sample stack is:

    page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
    kernel BUG at mm/filemap.c:169!
    ...
    RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
    ...
    Call Trace:
    __delete_from_page_cache+0x39/0x220
    delete_from_page_cache+0x45/0x70
    remove_inode_hugepages+0x13c/0x380
    ? __add_to_page_cache_locked+0x162/0x380
    hugetlbfs_fallocate+0x403/0x540
    ? _cond_resched+0x15/0x30
    ? __inode_security_revalidate+0x5d/0x70
    ? selinux_file_permission+0x100/0x130
    vfs_fallocate+0x13f/0x270
    ksys_fallocate+0x3c/0x80
    __x64_sys_fallocate+0x1a/0x20
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    There seems to be another potential COW issue/race with this approach
    of different private and shared keys as noted in commit 8382d914ebf7
    ("mm, hugetlb: improve page-fault scalability").

    Since every hugetlb mapping (even anon and private) is actually a file
    mapping, just use the address_space index key for all mappings. This
    results in potentially more hash collisions. However, this should not
    be the common case.

    Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
    Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
    Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: Joonsoo Kim
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
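
    A sketch of the hash once every mapping keys off the address_space and
    file index; the jhash2()-based scheme mirrors the existing fault mutex
    code, with surrounding details trimmed.

    static u32 fault_mutex_hash_sketch(struct address_space *mapping,
                                       pgoff_t idx)
    {
        unsigned long key[2];
        u32 hash;

        key[0] = (unsigned long)mapping;
        key[1] = idx;

        hash = jhash2((u32 *)&key, sizeof(key) / sizeof(u32), 0);

        /* num_fault_mutexes is a power-of-two sized mutex table */
        return hash & (num_fault_mutexes - 1);
    }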
     

30 Mar, 2019

1 commit

  • kbuild produces the below warning:

    tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
    head: 5453a3df2a5eb49bc24615d4cf0d66b2aae05e5f
    commit 3d3539018d2c ("mm: create the new vm_fault_t type")
    reproduce:
    # apt-get install sparse
    git checkout 3d3539018d2cbd12e5af4a132636ee7fd8d43ef0
    make ARCH=x86_64 allmodconfig
    make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

    >> mm/memory.c:3968:21: sparse: incorrect type in assignment (different
    >> base types) @@ expected restricted vm_fault_t [usertype] ret @@
    >> got e] ret @@
    mm/memory.c:3968:21: expected restricted vm_fault_t [usertype] ret
    mm/memory.c:3968:21: got int

    This patch converts to return vm_fault_t type for hugetlb_fault() when
    CONFIG_HUGETLB_PAGE=n.

    Regarding the sparse warning, Luc said:

    : This is the expected behaviour. The constant 0 is magic regarding bitwise
    : types but ({ ...; 0; }) is not, it is just an ordinary expression of type
    : 'int'.
    :
    : So, IMHO, Souptick's patch is the right thing to do.

    Link: http://lkml.kernel.org/r/20190318162604.GA31553@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Mike Kravetz
    Cc: Matthew Wilcox
    Cc: Luc Van Oostenryck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
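
    A sketch of the converted stub: the !CONFIG_HUGETLB_PAGE case now
    returns a properly typed vm_fault_t instead of a bare
    statement-expression 0, which is what sparse objected to.

    /* Before (sketch):
     * #define hugetlb_fault(mm, vma, addr, flags) ({ BUG(); 0; })
     */
    static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
                                           struct vm_area_struct *vma,
                                           unsigned long address,
                                           unsigned int flags)
    {
        BUG();
        return 0;
    }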
     

06 Mar, 2019

5 commits

  • This patch updates get_user_pages_longterm to migrate pages allocated
    out of CMA region. This makes sure that we don't keep non-movable pages
    (due to page reference count) in the CMA area.

    This will be used by ppc64 in a later patch to avoid pinning pages in
    the CMA region. ppc64 uses the CMA region for allocating the hardware
    page table (hash page table), and not being able to migrate pages out of
    the CMA region results in page table allocation failures.

    One case where we hit this easily is a guest using a VFIO passthrough
    device. VFIO locks all of the guest's memory, and if the guest memory is
    backed by the CMA region it becomes unmovable, fragmenting the CMA area
    and possibly preventing other guests from allocating a large enough hash
    page table.

    NOTE: We allocate the new page without using __GFP_THISNODE

    Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: David Gibson
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 need to do a conditional TLB flush based on
    the old and new values of the pte. Follow the regular pte change
    protection sequence for hugetlb too. This allows architectures to
    override the update sequence.

    Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Nicholas Piggin
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
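
    A sketch of the start/commit sequence for hugetlb protection changes,
    mirroring the regular ptep_modify_prot_start()/commit() pattern so an
    architecture can hook the update; the loop and TLB flushing in the real
    hugetlb_change_protection() are omitted and the wrapper name is
    illustrative.

    static void huge_change_prot_one_sketch(struct vm_area_struct *vma,
                                            unsigned long address, pte_t *ptep,
                                            pgprot_t newprot)
    {
        pte_t old_pte, pte;

        /* arch hook: may clear the pte and returns the old value */
        old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
        pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
        /* arch hook: install the new pte, flushing based on old vs. new */
        huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
    }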
     
  • Architectures like arm64 have HugeTLB page sizes which differ from the
    generic sizes at the PMD, PUD and PGD levels and are implemented via
    contiguous bits. At present these special size HugeTLB pages cannot be
    identified through macros like (PMD|PUD|PGDIR)_SHIFT and hence are not
    chosen for migration.

    Enabling migration support for these special HugeTLB page sizes along
    with the generic ones (PMD|PUD|PGD) would require identifying all of
    them on a given platform. A platform specific hook can precisely
    enumerate all huge page sizes supported for migration. Instead of
    comparing against the standard huge page orders, let
    hugepage_migration_supported() call a platform hook,
    arch_hugetlb_migration_supported(). The default definition of the
    platform hook maintains the existing semantics of checking the standard
    huge page orders, but an architecture can choose to override the default
    and provide support for a comprehensive set of huge page sizes.

    Link: http://lkml.kernel.org/r/1545121450-1663-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Steve Capper
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
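
    A sketch of the platform hook with a default that preserves the old
    semantics (only standard PMD/PUD/PGDIR sized pages migrate); config
    guards from the real header are dropped for brevity, and an architecture
    such as arm64 overrides the hook to also accept its contiguous-bit sizes.

    #ifndef arch_hugetlb_migration_supported
    static inline bool arch_hugetlb_migration_supported(struct hstate *h)
    {
        return huge_page_shift(h) == PMD_SHIFT ||
               huge_page_shift(h) == PUD_SHIFT ||
               huge_page_shift(h) == PGDIR_SHIFT;
    }
    #endif

    static inline bool hugepage_migration_supported(struct hstate *h)
    {
        return arch_hugetlb_migration_supported(h);
    }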
     
  • Architectures like arm64 have PUD level HugeTLB pages for certain
    configs (a 1GB huge page is PUD based with the ARM64_4K_PAGES base page
    size) that can be enabled for migration. This can be achieved by
    checking for PUD_SHIFT order HugeTLB pages during migration.

    Link: http://lkml.kernel.org/r/1545121450-1663-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Steve Capper
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Patch series "arm64/mm: Enable HugeTLB migration", v4.

    This patch series enables HugeTLB migration support for all supported
    huge page sizes at all levels, including the contiguous bit
    implementation. The following HugeTLB migration support matrix has been
    enabled with this patch series. All permutations have been tested except
    for 16GB.

             CONT PTE    PMD    CONT PMD    PUD
             --------    ---    --------    ---
    4K:         64K       2M       32M       1G
    16K:         2M      32M        1G
    64K:         2M     512M       16G

    First the series adds migration support for PUD based huge pages. It
    then adds a platform specific hook to ask an architecture whether a
    given huge page size is supported for migration, while also providing a
    default fallback that preserves the existing semantics of just checking
    the (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enable HugeTLB
    migration on arm64 and subscribe to this new platform specific hook by
    defining an override.

    The second patch differentiates between movability and migratability
    aspects of huge pages and implements hugepage_movable_supported() which
    can then be used during allocation to decide whether to place the huge
    page in movable zone or not.

    This patch (of 5):

    During huge page allocation its migratability is checked to determine
    whether it should be placed in a movable zone with GFP_HIGHUSER_MOVABLE.
    But the movability aspect of the huge page could depend on factors other
    than just migratability. Movability in itself is a distinct property
    which should not be tied to migratability alone.

    This differentiates the two and implements an enhanced movability check
    which also considers the huge page size to determine whether it is
    feasible to place it in a movable zone. At present it just checks for
    gigantic pages, but going forward it can incorporate other checks.

    Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Steve Capper
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
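
    A sketch of the movability check this patch separates from
    migratability: a size can be migratable yet still unsuitable for
    ZONE_MOVABLE (gigantic pages, at this point). The allocation-mask helper
    at the end shows how it is consumed.

    static inline bool hugepage_movable_supported(struct hstate *h)
    {
        if (!hugepage_migration_supported(h))
            return false;

        /* gigantic pages are migratable but should not sit in ZONE_MOVABLE */
        if (hstate_is_gigantic(h))
            return false;

        return true;
    }

    /* allocation side (sketch) */
    static inline gfp_t htlb_alloc_mask_sketch(struct hstate *h)
    {
        return hugepage_movable_supported(h) ? GFP_HIGHUSER_MOVABLE
                                             : GFP_HIGHUSER;
    }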
     

06 Oct, 2018

1 commit

  • The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption, as writes to the original
    source page can happen after the contents of the page have been copied
    to the target page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to the vm_fault_t type. As part of that cleanup, the
    return type of all other recursively called functions has been changed
    to vm_fault_t.

    The places from which handle_mm_fault() is invoked will be changed to
    the vm_fault_t type in a separate patch.

    vmf_error() is the newly introduced inline function in 4.17-rc6.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

18 Aug, 2018

1 commit

  • This reverts ee8f248d266e ("hugetlb: add phys addr to struct
    huge_bootmem_page").

    At one time powerpc used this field and supporting code. However that
    was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support
    for reserving gigantic huge pages via kernel command line").

    There are no users of this field and supporting code, so remove it.

    Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Cannon Matthews
    Cc: Becky Bruce
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

01 Feb, 2018

4 commits

  • Dan Carpenter has noticed that the mbind migration callback (new_page)
    can get a NULL vma pointer and choke on it inside alloc_huge_page_vma,
    which relies on the VMA to get the hstate. We used to BUG_ON this case,
    but the BUG_ON has been removed recently by "hugetlb, mempolicy: fix the
    mbind hugetlb migration".

    The proper way to handle this is to get the hstate from the migrated
    page and rely on huge_node (resp. get_vma_policy) to do the right thing
    with a NULL VMA. We are currently falling back to the default mempolicy
    in that case, which is in line with what the THP path does here.

    Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Dan Carpenter
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
    pages. alloc_huge_page_noerr uses alloc_huge_page, which is a high-level
    allocation function that has to take care of reserves, overcommit and
    hugetlb cgroup accounting. None of that is really required for page
    migration because the new page is only temporary and will either replace
    the original page or be dropped. This is essentially the same as the
    other migration call paths, so there shouldn't be any reason to handle
    mbind in a special way.

    The current implementation is even suboptimal because the migration
    might fail just because the hugetlb cgroup limit is reached, or the
    overcommit is saturated.

    Fix this by making mbind like other hugetlb migration paths. Add a new
    migration helper alloc_huge_page_vma as a wrapper around
    alloc_huge_page_nodemask with additional mempolicy handling.

    alloc_huge_page_noerr has no more users and it can go.

    Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
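
    A rough sketch of the new wrapper: resolve the mempolicy for the
    faulting VMA/address and hand the result to alloc_huge_page_nodemask();
    reserve, overcommit and cgroup handling are deliberately absent, which
    is the point of the helper. Exact details differ from the real routine.

    struct page *alloc_huge_page_vma_sketch(struct hstate *h,
                                            struct vm_area_struct *vma,
                                            unsigned long address)
    {
        struct mempolicy *mpol;
        nodemask_t *nodemask;
        struct page *page;
        gfp_t gfp_mask;
        int node;

        gfp_mask = htlb_alloc_mask(h);
        node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
        page = alloc_huge_page_nodemask(h, node, nodemask);
        mpol_cond_put(mpol);

        return page;
    }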
     
  • hugepage migration relies on __alloc_buddy_huge_page to get a new page.
    This has 2 main disadvantages.

    1) it doesn't allow migrating any huge page if the pool is completely
    used, which is not an exceptional case as the pool is static and unused
    memory is just wasted.

    2) it leads to weird semantics where migration between two NUMA nodes
    might increase the pool size of the destination NUMA node while the
    page is in use. The issue is caused by per NUMA node surplus page
    tracking (see free_huge_page).

    Address both issues by changing the way we allocate and account pages
    allocated for migration. Those should be temporary by definition. So
    we mark them that way (we will abuse page flags in the 3rd page) and
    update free_huge_page to free such pages to the page allocator. The page
    migration path then just transfers the temporary status from the new
    page to the old one, which will be freed on the last reference. The
    global surplus count will never change during this path, but we still
    have to be careful when migrating a per-node surplus page. This is now
    handled in move_hugetlb_state, which is called from the migration path;
    it copies the hugetlb specific page state and fixes up the accounting
    when needed.

    Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
    reflect its purpose. The new allocation routine for the migration path
    is __alloc_migrate_huge_page.

    The user visible effect of this patch is that migrated pages are really
    temporary and they travel between NUMA nodes as per the migration
    request:

    Before migration
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

    After
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
    /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
    /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

    with the previous implementation, both nodes would have nr_hugepages:1
    until the page is freed.

    Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Implements memfd sealing, similar to shmem:
    - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
    memfd_add_seals(). write() doesn't exist for hugetlbfs.
    - SHRINK: added similar check as shmem_setattr()
    - GROW: added similar check as shmem_setattr() & shmem_fallocate()

    Except for the write() operation, which doesn't exist with hugetlbfs,
    this should make sealing as close as it can be to the shmem support.

    Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau