01 Jul, 2021

7 commits

  • When using HUGETLB_PAGE_FREE_VMEMMAP, freeing the unused vmemmap pages
    associated with each HugeTLB page is off by default. The vmemmap is now
    PMD mapped, so there is no side effect when this feature is enabled while
    there are no HugeTLB pages in the system. Someone may want to enable this
    feature at compile time instead of via the boot command line, so add a
    config option to make it default to on for those who do not want to
    enable it via the command line.
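
    A minimal sketch of how such a compile-time default can be wired up,
    assuming the option is named CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON
    as in this series; the Kconfig option seeds the feature flag and the
    existing hugetlb_free_vmemmap= boot parameter can still override it:

      /* Compile-time default; the boot parameter handler may flip it. */
      bool hugetlb_free_vmemmap_enabled __read_mostly =
              IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);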

    Link: https://lkml.kernel.org/r/20210616094915.34432-4-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Cc: Chen Huang
    Cc: David Hildenbrand
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • Patch series "Split huge PMD mapping of vmemmap pages", v4.

    In order to reduce the difficulty of code review, series [1] disabled
    huge PMD mapping of vmemmap pages when the feature was enabled. In this
    series, we do not disable huge PMD mapping of vmemmap pages anymore;
    instead, we split the huge PMD mapping when needed. When HugeTLB pages
    are freed from the pool, we do not attempt to coalesce and move back to
    a PMD mapping because it is much more complex.

    [1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/

    This patch (of 3):

    In [1], PMD mappings of vmemmap pages were disabled if the feature
    hugetlb_free_vmemmap was enabled. This was done to simplify the initial
    implementation of vmemmap freeing for hugetlb pages. Now, remove this
    simplification by allowing PMD mappings and switching to PTE mappings as
    needed for allocated hugetlb pages.

    When a hugetlb page is allocated, the vmemmap page tables are walked to
    free vmemmap pages. During this walk, huge PMD mappings are split into
    PTE mappings as required. In the unlikely case that PTE pages cannot be
    allocated, return an error (-ENOMEM) and do not optimize the vmemmap of
    the hugetlb page.
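
    The split step looks roughly like the sketch below (condensed from the
    shape used in this series; treat the details as illustrative). A PTE
    page is allocated first, populated to mirror the old leaf PMD one base
    page at a time, and only then installed in place of the PMD, so failing
    to get the PTE page simply aborts the optimization:

      static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start)
      {
              pmd_t __pmd;
              int i;
              unsigned long addr = start;
              struct page *page = pmd_page(*pmd);
              pte_t *pgtable = pte_alloc_one_kernel(&init_mm);

              if (!pgtable)
                      return -ENOMEM; /* caller skips the vmemmap optimization */

              /* Build the PTE table against a temporary pmd value. */
              pmd_populate_kernel(&init_mm, &__pmd, pgtable);

              for (i = 0; i < PMD_SIZE / PAGE_SIZE; i++, addr += PAGE_SIZE) {
                      pte_t entry, *pte;

                      entry = mk_pte(page + i, PAGE_KERNEL);
                      pte = pte_offset_kernel(&__pmd, addr);
                      set_pte_at(&init_mm, addr, pte, entry);
              }

              /* Make the PTEs visible before the PMD that points to them. */
              smp_wmb();
              pmd_populate_kernel(&init_mm, pmd, pgtable);
              flush_tlb_kernel_range(start, start + PMD_SIZE);

              return 0;
      }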

    When HugeTLB pages are freed from the pool, we do not attempt to
    coalesce and move back to a PMD mapping because it is much more complex.

    [1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com

    Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
    Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Reviewed-by: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Chen Huang
    Cc: Jonathan Corbet
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • All the infrastructure is ready, so introduce a nr_free_vmemmap_pages
    field in the hstate to indicate how many vmemmap pages associated with a
    HugeTLB page can be freed to the buddy allocator, and initialize it in
    hugetlb_vmemmap_init(). This patch is the actual enablement of the
    feature.

    There are only (RESERVE_VMEMMAP_SIZE / sizeof(struct page)) struct page
    structs that can be used when CONFIG_HUGETLB_PAGE_FREE_VMEMMAP is
    enabled, so add a BUILD_BUG_ON to catch invalid usage of the tail struct
    page.
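
    A condensed sketch of that initialization (constant and flag names as
    used elsewhere in this series; treat it as illustrative rather than the
    exact code):

      void __init hugetlb_vmemmap_init(struct hstate *h)
      {
              unsigned int nr_pages = pages_per_huge_page(h);
              unsigned int vmemmap_pages;

              /*
               * Only the struct pages backed by the reserved vmemmap area
               * keep their own storage, so tail struct pages beyond that
               * range must never be used to store per-page metadata.
               */
              BUILD_BUG_ON(__NR_USED_SUBPAGE >=
                           RESERVE_VMEMMAP_SIZE / sizeof(struct page));

              if (!hugetlb_free_vmemmap_enabled)
                      return;

              vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
              /*
               * Keep RESERVE_VMEMMAP_NR vmemmap pages mapped; every vmemmap
               * page past them can go back to the buddy allocator.
               */
              if (vmemmap_pages > RESERVE_VMEMMAP_NR)
                      h->nr_free_vmemmap_pages =
                              vmemmap_pages - RESERVE_VMEMMAP_NR;
      }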

    Link: https://lkml.kernel.org/r/20210510030027.56044-10-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Acked-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Reviewed-by: Miaohe Lin
    Tested-by: Chen Huang
    Tested-by: Bodeddula Balasubramaniam
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Balbir Singh
    Cc: Barry Song
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: HORIGUCHI NAOYA
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joao Martins
    Cc: Joerg Roedel
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Oliver Neukum
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • Add a kernel parameter hugetlb_free_vmemmap to enable the feature of
    freeing unused vmemmap pages associated with each hugetlb page on boot.

    We disable PMD mapping of vmemmap pages on the x86-64 arch when this
    feature is enabled, because vmemmap_remap_free() depends on the vmemmap
    being base page mapped.
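
    The parameter handling amounts to a small early_param hook, roughly as
    sketched below; the is_power_of_2() check reflects the constraint that a
    struct page must not cross a page boundary for the remapping to work:

      bool hugetlb_free_vmemmap_enabled;

      static int __init early_hugetlb_free_vmemmap_param(char *buf)
      {
              /* The remapping needs struct page not to straddle pages. */
              if (!is_power_of_2(sizeof(struct page))) {
                      pr_warn("cannot free vmemmap pages because \"struct page\" crosses page boundaries\n");
                      return 0;
              }

              if (!buf)
                      return -EINVAL;

              if (!strcmp(buf, "on"))
                      hugetlb_free_vmemmap_enabled = true;
              else if (!strcmp(buf, "off"))
                      hugetlb_free_vmemmap_enabled = false;
              else
                      return -EINVAL;

              return 0;
      }
      early_param("hugetlb_free_vmemmap", early_hugetlb_free_vmemmap_param);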

    Link: https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Reviewed-by: Oscar Salvador
    Reviewed-by: Barry Song
    Reviewed-by: Miaohe Lin
    Tested-by: Chen Huang
    Tested-by: Bodeddula Balasubramaniam
    Reviewed-by: Mike Kravetz
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Balbir Singh
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: HORIGUCHI NAOYA
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joao Martins
    Cc: Joerg Roedel
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Oliver Neukum
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • When we free a HugeTLB page to the buddy allocator, we need to allocate
    the vmemmap pages associated with it. However, we may not be able to
    allocate the vmemmap pages when the system is under memory pressure. In
    this case, we just refuse to free the HugeTLB page. This changes behavior
    in some corner cases as listed below:

    1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

    2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

    3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmap pages. We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page. This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones. Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

    4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3). virtio-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.

    Vmemmap pages are allocated from the page freeing context. In order for
    those allocations not to be disruptive (e.g. not to trigger the OOM
    killer), __GFP_NORETRY is used. hugetlb_lock is dropped for the
    allocation because a non-sleeping allocation would be too fragile and
    could fail too easily under memory pressure. GFP_ATOMIC or other modes
    that access memory reserves are not used because we want to prevent
    consuming reserves under heavy hugetlb freeing.
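
    A simplified sketch of the allocation helper under those constraints
    (name and shape follow this series; the gfp mask passed in is assumed to
    be GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE so the vmemmap stays on
    the same node, and everything is rolled back if one allocation fails):

      static int alloc_vmemmap_page_list(unsigned long start, unsigned long end,
                                         gfp_t gfp_mask, struct list_head *list)
      {
              unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
              /* @start is a vmemmap address, i.e. a struct page pointer. */
              int nid = page_to_nid((struct page *)start);
              struct page *page, *next;

              while (nr_pages--) {
                      page = alloc_pages_node(nid, gfp_mask, 0);
                      if (!page)
                              goto out;
                      list_add_tail(&page->lru, list);
              }

              return 0;
      out:
              /* Roll back; the HugeTLB page simply is not freed this time. */
              list_for_each_entry_safe(page, next, list, lru)
                      __free_pages(page, 0);
              return -ENOMEM;
      }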

    [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
    Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
    [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
    Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

    Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Signed-off-by: Mike Kravetz
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Balbir Singh
    Cc: Barry Song
    Cc: Bodeddula Balasubramaniam
    Cc: Borislav Petkov
    Cc: Chen Huang
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: HORIGUCHI NAOYA
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joao Martins
    Cc: Joerg Roedel
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Oliver Neukum
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • In a subsequent patch, we will allocate the vmemmap pages when freeing a
    HugeTLB page. But update_and_free_page() can be called from any context,
    so we cannot use GFP_KERNEL to allocate the vmemmap pages there. However,
    we can defer the actual freeing to a kworker to avoid having to use
    GFP_ATOMIC to allocate the vmemmap pages.

    __update_and_free_page() is where the call to allocate vmemmap pages
    will be inserted.
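
    The deferral can be done with a lock-free list and a work item, roughly
    as sketched below (free_vmemmap_pages_per_hpage() and
    __update_and_free_page() are assumed from the surrounding patches): an
    atomic caller only queues the page, and the kworker, running in process
    context, later performs the real freeing where GFP_KERNEL is usable:

      static LLIST_HEAD(hpage_freelist);

      static void free_hpage_workfn(struct work_struct *work)
      {
              struct llist_node *node = llist_del_all(&hpage_freelist);

              while (node) {
                      struct page *page;

                      /* page->mapping doubles as the llist_node while queued. */
                      page = container_of((struct address_space **)node,
                                          struct page, mapping);
                      node = node->next;
                      page->mapping = NULL;

                      __update_and_free_page(size_to_hstate(page_size(page)), page);
                      cond_resched();
              }
      }
      static DECLARE_WORK(free_hpage_work, free_hpage_workfn);

      static void update_and_free_page(struct hstate *h, struct page *page,
                                       bool atomic)
      {
              if (!free_vmemmap_pages_per_hpage(h) || !atomic) {
                      __update_and_free_page(h, page);
                      return;
              }

              /* Defer to process context so GFP_KERNEL can be used there. */
              if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
                      schedule_work(&free_hpage_work);
      }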

    Link: https://lkml.kernel.org/r/20210510030027.56044-6-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Balbir Singh
    Cc: Barry Song
    Cc: Bodeddula Balasubramaniam
    Cc: Borislav Petkov
    Cc: Chen Huang
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: HORIGUCHI NAOYA
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joao Martins
    Cc: Joerg Roedel
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mina Almasry
    Cc: Oliver Neukum
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song
     
  • Every HugeTLB page has more than one struct page structure. We __know__
    that we only use the first 4 (__NR_USED_SUBPAGE) struct page structures
    to store metadata associated with each HugeTLB page.

    There are many struct page structures associated with each HugeTLB page.
    For the tail pages, the value of compound_head is the same, so we can
    reuse the first page of the tail page structures: we map the virtual
    addresses of the remaining pages of tail page structures to the first
    tail page struct and then free those page frames. Therefore, we need to
    reserve two pages as vmemmap areas.

    When we allocate a HugeTLB page from the buddy allocator, we can free
    some of the vmemmap pages associated with it. The appropriate place to
    do this is prep_new_huge_page().

    free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
    associated with a HugeTLB page can be freed, returns zero for now, which
    means the feature is still disabled. We will enable it once all the
    infrastructure is in place.
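
    A sketch of the hook as it stands at this point in the series: the
    helper still reports zero, so free_huge_page_vmemmap(), called from
    prep_new_huge_page(), is a no-op until the final enablement patch.
    vmemmap_remap_free() and the RESERVE_VMEMMAP_SIZE constant are assumed
    from the earlier remapping patches:

      static inline unsigned int free_vmemmap_pages_per_hpage(struct hstate *h)
      {
              return 0;       /* feature still disabled at this point */
      }

      void free_huge_page_vmemmap(struct hstate *h, struct page *head)
      {
              unsigned long vmemmap_addr = (unsigned long)head;
              unsigned long vmemmap_end, vmemmap_reuse;

              if (!free_vmemmap_pages_per_hpage(h))
                      return;

              /* Skip the reserved (still used) part of the vmemmap... */
              vmemmap_addr += RESERVE_VMEMMAP_SIZE;
              vmemmap_end = vmemmap_addr +
                            free_vmemmap_pages_per_hpage(h) * PAGE_SIZE;
              /* ...and reuse the page just before it for all tail pages. */
              vmemmap_reuse = vmemmap_addr - PAGE_SIZE;

              /*
               * Remap [vmemmap_addr, vmemmap_end) to the page backing
               * vmemmap_reuse, then hand the page frames that used to back
               * that range to the buddy allocator.
               */
              vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
      }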

    [willy@infradead.org: fix documentation warning]
    Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org

    Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
    Signed-off-by: Muchun Song
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Oscar Salvador
    Tested-by: Chen Huang
    Tested-by: Bodeddula Balasubramaniam
    Acked-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Anshuman Khandual
    Cc: Balbir Singh
    Cc: Barry Song
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: HORIGUCHI NAOYA
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Joao Martins
    Cc: Joerg Roedel
    Cc: Jonathan Corbet
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Mina Almasry
    Cc: Oliver Neukum
    Cc: Paul E. McKenney
    Cc: Pawan Gupta
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Xiongchun Duan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Muchun Song