07 Aug, 2014

1 commit

  • They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
    passing extra2 == NULL is equivalent to infinity.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

31 Jul, 2014

1 commit

  • PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe1
    ("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still needs
    another symbol to filter *hugetlbfs* pages.

    If a user hopes to filter out user pages, makedumpfile tries to exclude
    them by checking whether the page is anonymous, but hugetlbfs pages
    aren't anonymous even though they are also user pages.

    We know it's possible to detect them in the same way as PageHuge(),
    so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
        if (!PageCompound(page))
            return 0;

        page = compound_head(page);
        return get_compound_page_dtor(page) == free_huge_page;
    }

    For that reason, this patch makes free_huge_page() public so that it
    can be exported to VMCOREINFO.

    Signed-off-by: Atsushi Kumagai
    Acked-by: Baoquan He
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Kumagai
     

05 Jun, 2014

3 commits

  • We already have a function named hugepages_supported(), and the similar
    name hugepage_migration_support() is a bit uncomfortable, so let's rename
    it to hugepage_migration_supported().

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Signed-off-by: Luiz Capitulino
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yasuaki Ishimatsu
    Reviewed-by: Davidlohr Bueso
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zhang Yanfei
    Cc: David Rientjes
    Cc: Marcelo Tosatti
    Cc: Rik van Riel
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luiz Capitulino
     
  • Currently hugepage migration is available for all archs which support
    pmd-level hugepages, but testing is done only for x86_64 and there are
    bugs for other archs. So to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().
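
    One way to express that check, assuming a Kconfig switch along the
    lines of mainline's ARCH_ENABLE_HUGEPAGE_MIGRATION (a sketch, with
    the remaining vma_migratable() conditions elided):

    static inline int vma_migratable(struct vm_area_struct *vma)
    {
        if (vma->vm_flags & (VM_IO | VM_PFNMAP))
            return 0;
    #ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
        /* refuse hugetlb VMAs on archs without hugepage migration */
        if (vma->vm_flags & VM_HUGETLB)
            return 0;
    #endif
        return 1;
    }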

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 May, 2014

1 commit

  • Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is probably not correctly setting
    itself up in this state:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
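
    A sketch of the extracted helper, assuming the architecture signals
    missing support by leaving HPAGE_SHIFT at 0, as described above:

    #ifndef hugepages_supported
    /*
     * Some platforms decide whether they support huge pages at boot
     * time; on those (such as powerpc), HPAGE_SHIFT is set to 0 when
     * there is no such support.
     */
    #define hugepages_supported() (HPAGE_SHIFT != 0)
    #endif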

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

04 Apr, 2014

2 commits

  • There is a race condition if we map the same file in different processes.
    Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
    When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only
    mmap_sem (exclusively). This doesn't prevent other tasks from modifying
    the region structure, so it can be modified by two processes
    concurrently.

    To solve this, introduce a spinlock to resv_map and make the region
    manipulation functions grab it before they do the actual work.
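
    A sketch of the resulting structure, with the new spinlock guarding
    the region list:

    struct resv_map {
        struct kref refs;
        spinlock_t lock;          /* protects the regions list */
        struct list_head regions;
    };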

    [davidlohr@hp.com: updated changelog]
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Joonsoo Kim
    Suggested-by: Joonsoo Kim
    Acked-by: David Gibson
    Cc: David Gibson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Currently, to track reserved and allocated regions, we use two different
    ways, depending on the mapping. For MAP_SHARED, we use the
    address_mapping's private_list, while for MAP_PRIVATE, we use a
    resv_map.

    Now, we are preparing to change the coarse-grained lock which protects
    the region structure to a fine-grained lock, and this difference hinders
    that. So, before changing it, unify region structure handling,
    consistently using a resv_map regardless of the kind of mapping.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

24 Jan, 2014

1 commit

  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which, beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON().
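
    A sketch of the new assertion, assuming a CONFIG_DEBUG_VM build and
    the one-argument dump_page() of that time:

    #define VM_BUG_ON_PAGE(cond, page)                          \
        do {                                                    \
            if (unlikely(cond)) {                               \
                dump_page(page);  /* dump state before dying */ \
                BUG();                                          \
            }                                                   \
        } while (0)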

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Jan, 2014

2 commits

  • This skips the _mapcount mangling for slab and hugetlbfs pages.

    The main trouble in doing this is to guarantee that PageSlab and
    PageHeadHuge remains constant for all get_page/put_page run on the tail
    of slab or hugetlbfs compound pages. Otherwise if they're set during
    get_page but not set during put_page, the _mapcount of the tail page
    would underflow.

    PageHeadHuge will remain true until the compound page is released and
    enters the buddy allocator, so it is in no danger of changing even if
    the tail page is the last reference left on the page.

    PG_slab instead is cleared before the slab frees the head page with
    put_page, so if the tail pin is released after the slab freed the page,
    we would have a problem. But in the slab case the tail pin cannot be
    the last reference left on the page. This is because the slab code is
    free to reuse the compound page after a kfree/kmem_cache_free without
    having to check if there's any tail pin left. In turn, all tail pins
    must always be released while the head is still pinned by the slab
    code, and so we know PG_slab will still be set too.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Dave Jiang reported that he was seeing oopses when running NUMA systems
    and default_hugepagesz=1G. I traced the issue down to
    migrate_page_copy() trying to use the same code for hugetlb pages and
    transparent hugepages. It should not have been trying to pass thp pages
    in there.

    So, add some VM_BUG_ON()s for the next hapless VM developer that tries
    the same thing.

    Signed-off-by: Dave Hansen
    Reviewed-by: Naoya Horiguchi
    Tested-by: Dave Jiang
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

13 Dec, 2013

1 commit

  • With CONFIG_HUGETLBFS=n:

    mm/migrate.c: In function `do_move_page_to_node_array':
    include/linux/hugetlb.h:140:33: warning: statement with no effect [-Wunused-value]
     #define isolate_huge_page(p, l) false
                                     ^
    mm/migrate.c:1170:4: note: in expansion of macro `isolate_huge_page'
        isolate_huge_page(page, &pagelist);
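
    The warning goes away once the !CONFIG_HUGETLBFS stub is a function
    rather than a bare value; a sketch of that shape:

    static inline bool isolate_huge_page(struct page *page,
                                         struct list_head *list)
    {
        return false;
    }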

    Reported-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Signed-off-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

22 Nov, 2013

2 commits

  • Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database
    caused by THP") can cause a dereference of a dangling pointer if
    split_huge_page runs during PageHuge() while there are updates to the
    tail_page->private field.

    It also repeats compound_head twice for hugetlbfs and runs
    compound_head+compound_trans_head for THP when a single call is needed
    in both cases.

    The new code within the PageSlab() check doesn't need to verify that the
    THP page size is never bigger than the smallest hugetlbfs page size, to
    avoid memory corruption.

    A longstanding theoretical race condition was found while fixing the
    above (see the change right after the skip_unlock label, that is
    relevant for the compound_lock path too).

    By re-establishing the _mapcount tail refcounting for all compound
    pages, this also fixes the below problem:

    echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    BUG: Bad page state in process bash pfn:59a01
    page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0
    page flags: 0x1c00000000008000(tail)
    Modules linked in:
    CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x55/0x76
    bad_page+0xd5/0x130
    free_pages_prepare+0x213/0x280
    __free_pages+0x36/0x80
    update_and_free_page+0xc1/0xd0
    free_pool_huge_page+0xc2/0xe0
    set_max_huge_pages.part.58+0x14c/0x220
    nr_hugepages_store_common.isra.60+0xd0/0xf0
    nr_hugepages_store+0x13/0x20
    kobj_attr_store+0xf/0x20
    sysfs_write_file+0x189/0x1e0
    vfs_write+0xc5/0x1f0
    SyS_write+0x55/0xb0
    system_call_fastpath+0x16/0x1b

    Signed-off-by: Khalid Aziz
    Signed-off-by: Andrea Arcangeli
    Tested-by: Khalid Aziz
    Cc: Pravin Shelar
    Cc: Greg Kroah-Hartman
    Cc: Ben Hutchings
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

    if (PageHuge(page) || PageTransHuge(page))
        copy_huge_page(newpage, page);

    So, yay for code reuse. But:

    void copy_huge_page(struct page *dst, struct page *src)
    {
        struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

    void copy_huge_page(struct page *dst, struct page *src)
    {
        struct hstate *h = page_hstate(src);

        if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case").
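
    A sketch of the reworked copy helper in mm/migrate.c, branching
    explicitly on PageHuge() (the gigantic-page path is elided):

    static void copy_huge_page(struct page *dst, struct page *src)
    {
        int i, nr_pages;

        if (PageHuge(src)) {
            /* hugetlbfs page: ask the hstate for the size */
            nr_pages = pages_per_huge_page(page_hstate(src));
        } else {
            /* thp page */
            BUG_ON(!PageTransHuge(src));
            nr_pages = hpage_nr_pages(src);
        }

        for (i = 0; i < nr_pages; i++) {
            cond_resched();
            copy_highpage(dst + i, src + i);
        }
    }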

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Nov, 2013

1 commit

  • Hugetlb supports multiple page sizes. We use the split lock only at the
    PMD level, but not at the PUD level.
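
    A sketch of the per-level lock selection, modeled on the
    huge_pte_lockptr() helper this introduces:

    static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
                                               struct mm_struct *mm,
                                               pte_t *pte)
    {
        if (huge_page_size(h) == PMD_SIZE)
            return pmd_lockptr(mm, (pmd_t *)pte);  /* split PMD lock */
        VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
        return &mm->page_table_lock;               /* PUD level */
    }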

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Sep, 2013

4 commits

  • Currently hugepage migration works well only for pmd-based hugepages
    (mainly due to lack of testing), so we had better not enable migration
    of other levels of hugepages until we are ready for it.

    Some users of hugepage migration (mbind, move_pages, and migrate_pages)
    do a page table walk and check pud/pmd_huge() there, so they are safe.
    But the other users (soft offline and memory hotremove) don't do this,
    so without this patch they can try to migrate unexpected types of
    hugepages.

    To prevent this, we introduce hugepage_migration_support() as an
    architecture-dependent check of whether hugepages are implemented on a
    pmd basis or not. On some architectures multiple sizes of hugepages
    are available, so hugepage_migration_support() also checks the
    hugepage size.
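
    A sketch of the shape of that check, assuming a per-arch
    pmd_huge_support() hook (the exact hook has varied across releases):

    static inline int hugepage_migration_support(struct hstate *h)
    {
        /* only pmd-based hugepages are migratable for now */
        return pmd_huge_support() && (huge_page_shift(h) == PMD_SHIFT);
    }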

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Until now we can't offline memory blocks which contain hugepages because a
    hugepage is considered as an unmovable page. But now with this patch
    series, a hugepage has become movable, so by using hugepage migration we
    can offline such memory blocks.

    What's different from other users of hugepage migration is that we need
    to decompose all the hugepages inside the target memory block into free
    buddy pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with memory offlining. For this
    reason we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().
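
    A sketch of the range walker, assuming a precomputed minimum_order
    for the smallest configured hugepage size and a start_pfn aligned to
    it:

    void dissolve_free_huge_pages(unsigned long start_pfn,
                                  unsigned long end_pfn)
    {
        unsigned long pfn;

        /* step through the block one minimum-order hugepage at a time */
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
            dissolve_free_huge_page(pfn_to_page(pfn));
    }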

    Other than that, what this patch does is straightforward: add hugepage
    migration code, that is, add hugepage handling to the functions which
    scan over pfns and collect hugepages to be migrated, and add a hugepage
    allocation function to alloc_migrate_target().

    As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
    over them because they're larger than a memory block. So for now we
    simply leave them to fail as is.

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend do_mbind() to handle vmas with VM_HUGETLB set. We will be able
    to migrate hugepages with mbind(2) after applying the enablement patch
    which comes later in this series.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently hugepage migration is available only for soft offlining, but
    it's also useful for some other users of page migration (clearly because
    users of hugepages can enjoy the benefits of mempolicy and memory
    hotplug). So this patchset tries to extend such users to support
    hugepage migration.

    The target of this patchset is to enable hugepage migration for NUMA
    related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
    memory hotplug.

    This patchset does not add hugepage migration for memory compaction,
    because users of memory compaction mainly expect to construct thp by
    arranging raw pages, and there's little or no need to compact hugepages.
    CMA, another user of page migration, could benefit from hugepage
    migration, but it is not enabled to support it for now (just because of
    lack of testing and expertise in CMA).

    Hugepage migration of non-pmd-based hugepages (for example, 1GB
    hugepages on x86_64, or hugepages on architectures like ia64) is not
    enabled for now (again, because of lack of testing).

    As for how these are achieved, I extended the API (migrate_pages()) to
    handle hugepages (with patches 1 and 2) and adjusted the code of each
    caller to check and collect movable hugepages (with patches 3-7). The
    remaining 2 patches are miscellaneous ones to avoid unexpected behavior.
    Patch 8 is about making sure that we only migrate pmd-based hugepages,
    and patch 9 is about choosing the appropriate zone for hugepage
    allocation.

    My testing is mainly functional: simply kicking hugepage migration via
    each entry point and confirming that migration is done correctly. Test
    code is available here:

    git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

    And I always run libhugetlbfs test when changing hugetlbfs's code. With
    this patchset, no regression was found in the test.

    This patch (of 9):

    Before enabling each user of page migration to support hugepage,
    this patch enables the list of pages for migration to link not only
    LRU pages, but also hugepages. As a result, putback_movable_pages()
    and migrate_pages() can handle both of LRU pages and hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jul, 2013

2 commits

  • hugetlb_prefault() is not used any more, so this patch removes it.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Pull ARM64 updates from Catalin Marinas:
    "Main features:
    - KVM and Xen ports to AArch64
    - Hugetlbfs and transparent huge pages support for arm64
    - Applied Micro X-Gene Kconfig entry and dts file
    - Cache flushing improvements

    For arm64 huge pages support, there are x86 changes moving part of
    arch/x86/mm/hugetlbpage.c into mm/hugetlb.c to be re-used by arm64"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (66 commits)
    arm64: Add initial DTS for APM X-Gene Storm SOC and APM Mustang board
    arm64: Add defines for APM ARMv8 implementation
    arm64: Enable APM X-Gene SOC family in the defconfig
    arm64: Add Kconfig option for APM X-Gene SOC family
    arm64/Makefile: provide vdso_install target
    ARM64: mm: THP support.
    ARM64: mm: Raise MAX_ORDER for 64KB pages and THP.
    ARM64: mm: HugeTLB support.
    ARM64: mm: Move PTE_PROT_NONE bit.
    ARM64: mm: Make PAGE_NONE pages read only and no-execute.
    ARM64: mm: Restore memblock limit when map_mem finished.
    mm: thp: Correct the HPAGE_PMD_ORDER check.
    x86: mm: Remove general hugetlb code from x86.
    mm: hugetlb: Copy general hugetlb code from x86 to mm.
    x86: mm: Remove x86 version of huge_pmd_share.
    mm: hugetlb: Copy huge_pmd_share from x86 to mm.
    arm64: KVM: document kernel object mappings in HYP
    arm64: KVM: MAINTAINERS update
    arm64: KVM: userspace API documentation
    arm64: KVM: enable initialization of a 32bit vcpu
    ...

    Linus Torvalds
     

26 Jun, 2013

1 commit

  • The futex_keys of process-shared futexes are generated from the page
    offset, the mapping host, and the mapping index of the futex user space
    address. This should result in a unique identifier for each futex.

    However, this is not true when futexes are located in different subpages
    of a hugepage. The reason is that the mapping index for all those
    futexes evaluates to the index of the base page of the hugetlbfs
    mapping. So a futex at offset 0 of the hugepage mapping and another
    one at offset PAGE_SIZE of the same hugepage mapping have identical
    futex_keys. This happens because the futex code blindly uses
    page->index.

    Steps to reproduce the bug:

    1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
    and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
    mapping.

    The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
    PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
    their keys solely depend on the user space address.

    2. Lock mutex1 and mutex2

    3. Create thread1 and in the thread function lock mutex1, which
    results in thread1 blocking on the locked mutex1.

    4. Create thread2 and in the thread function lock mutex2, which
    results in thread2 blocking on the locked mutex2.

    5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
    still blocks on mutex2 because the futex_key points to mutex1.

    To solve this issue we need to take the normal page index of the page
    which contains the futex into account, if the futex is in a hugetlbfs
    mapping. In other words, we calculate the normal page mapping index of
    the subpage in the hugetlbfs mapping.

    Mappings which are not based on hugetlbfs are not affected and still
    use page->index.
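
    A sketch of that evaluation, modeled on __basepage_index() (the
    gigantic-page case is elided):

    pgoff_t __basepage_index(struct page *page)
    {
        struct page *page_head = compound_head(page);
        pgoff_t index = page_index(page_head);
        unsigned long compound_idx = page - page_head;

        /* scale the hugepage index to base pages, add subpage offset */
        return (index << compound_order(page_head)) + compound_idx;
    }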

    Thanks to Mel Gorman who provided a patch for adding proper evaluation
    functions to the hugetlbfs code to avoid exposing hugetlbfs specific
    details to the futex code.

    [ tglx: Massaged changelog ]

    Signed-off-by: Zhang Yi
    Reviewed-by: Jiang Biao
    Tested-by: Ma Chenggong
    Reviewed-by: 'Mel Gorman'
    Acked-by: 'Darren Hart'
    Cc: 'Peter Zijlstra'
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
    Signed-off-by: Thomas Gleixner

    Zhang Yi
     

14 Jun, 2013

1 commit

  • Under x86, multiple puds can be made to reference the same bank of
    huge pmds provided that they represent a full PUD_SIZE of shared
    huge memory that is aligned to a PUD_SIZE boundary.

    The code to share pmds does not require any architecture-specific
    knowledge other than the fact that pmds can be indexed, and thus can
    be beneficial to some other architectures.

    This patch copies the huge pmd sharing (and unsharing) logic from
    x86/ to mm/ and introduces a new config option to activate it:
    CONFIG_ARCH_WANTS_HUGE_PMD_SHARE
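
    The eligibility test itself is small; a sketch modeled on the x86
    vma_shareable() being moved:

    static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
    {
        unsigned long base = addr & PUD_MASK;
        unsigned long end = base + PUD_SIZE;

        /* check for proper vm_flags and page table alignment */
        if (vma->vm_flags & VM_MAYSHARE &&
            vma->vm_start <= base && end <= vma->vm_end)
            return 1;
        return 0;
    }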

    Signed-off-by: Steve Capper
    Acked-by: Catalin Marinas
    Acked-by: Andrew Morton

    Steve Capper
     

08 May, 2013

1 commit

  • The current kernel returns -EINVAL unless a given mmap length is
    "almost" hugepage aligned. This is because in sys_mmap_pgoff() the
    given length is passed to vm_mmap_pgoff() as it is without being aligned
    with hugepage boundary.

    This is a regression introduced in commit 40716e29243d ("hugetlbfs: fix
    alignment of huge page requests"), where alignment code is pushed into
    hugetlb_file_setup() and the variable len in caller side is not changed.

    To fix this, this patch partially reverts that commit, and adds
    alignment code on the caller side. It also introduces hstate_sizelog()
    in order to get the proper hstate for the specified hugepage size.
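
    A sketch of the new helper, assuming a page_size_log of 0 selects
    the default hugepage size:

    static inline struct hstate *hstate_sizelog(int page_size_log)
    {
        if (!page_size_log)
            return &default_hstate;
        return size_to_hstate(1UL << page_size_log);
    }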

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=56881

    [akpm@linux-foundation.org: fix warning when CONFIG_HUGETLB_PAGE=n]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Johannes Weiner
    Cc: Steven Truelove
    Cc: Jianguo Wu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Apr, 2013

1 commit

  • Particularly in oom conditions, it's troublesome that hugetlb memory is
    not displayed. All other meminfo that is emitted will not add up to
    what is expected, and there is no artifact left in the kernel log to
    show that a potentially significant amount of memory is actually
    allocated as hugepages which are not available to be reclaimed.

    Booting with hugepages=8192 on the command line, this memory is now
    shown in oom conditions. For example, with echo m >
    /proc/sysrq-trigger:

    Node 0 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
    Node 1 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
    Node 2 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB
    Node 3 hugepages_total=2048 hugepages_free=2048 hugepages_surp=0 hugepages_size=2048kB

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit

  • Use long type for page counts in mm_populate() so as to avoid integer
    overflow when running the following test code:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
    }

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

23 Feb, 2013

1 commit


17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for his workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore
    batch-handles PTEs, but I no longer think this is the case. It's
    possible numacore is just able to trigger it due to higher rates of
    migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • There was some desire in large applications using MAP_HUGETLB or
    SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
    others. This is useful together with NUMA policy: use 2MB interleaving
    on some mappings, but 1GB on local mappings.

    This patch extends the IPC/SHM syscall interfaces slightly to allow
    specifying the page size.

    It borrows some upper bits in the existing flag arguments and allows
    encoding the log of the desired page size in addition to the *_HUGETLB
    flag. When 0 is specified the default size is used, this makes the
    change fully compatible.

    Extending the internal hugetlb code to handle this is straightforward.
    Instead of a single mount it just keeps an array of them and selects the
    right mount based on the specified page size. When no page size is
    specified it uses the mount of the default page size.

    The change is not visible in /proc/mounts because internal mounts don't
    appear there. It also has very little overhead: the additional mounts
    just consume a super block, but not more memory when not used.

    I also exported the new flags to the user headers (they were previously
    under __KERNEL__). Right now only symbols for x86 and some other
    architectures for 1GB and 2MB are defined. The interface should already
    work for all other architectures, though. Only architectures that
    define multiple hugetlb sizes actually need it (that is currently x86,
    tile, powerpc). However, tile and powerpc have user-configurable
    hugetlb sizes, so it's not easy to add defines. A program on those
    architectures would need to query sysfs and use the appropriate log2.
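
    For example, a program that wants a 1GB-backed segment can encode
    log2(1GB) = 30 into the flags; a sketch using the SHM_HUGE_SHIFT and
    SHM_HUGE_1GB values this patch exports (error handling elided):

    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGE_SHIFT
    #define SHM_HUGE_SHIFT 26
    #endif
    #ifndef SHM_HUGE_1GB
    #define SHM_HUGE_1GB   (30 << SHM_HUGE_SHIFT)
    #endif

    int main(void)
    {
        int id = shmget(IPC_PRIVATE, 1UL << 30,
                        SHM_HUGETLB | SHM_HUGE_1GB | IPC_CREAT | 0600);
        return id == -1;
    }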

    [akpm@linux-foundation.org: cleanups]
    [rientjes@google.com: fix build]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Andi Kleen
    Cc: Michael Kerrisk
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

11 Dec, 2012

1 commit

  • This will be used for three kinds of purposes:

    - to optimize mprotect()

    - to speed up working set scanning for working set areas that
    have not been touched

    - to more accurately scan per real working set

    No change in functionality from this patch.

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Aug, 2012

6 commits

  • If a process creates a large hugetlbfs mapping that is eligible for page
    table sharing and forks heavily with children some of whom fault and
    others which destroy the mapping then it is possible for page tables to
    get corrupted. Some teardowns of the mapping encounter a "bad pmd" and
    output a message to the kernel log. The final teardown will trigger a
    BUG_ON in mm/filemap.c.

    This was reproduced in 3.4 but is known to have existed for a long time
    and goes back at least as far as 2.6.37. It was probably introduced
    in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages
    look like this;

    [ ..........] Lots of bad pmd messages followed by this
    [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7).
    [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7).
    [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7).
    [ 127.186778] ------------[ cut here ]------------
    [ 127.186781] kernel BUG at mm/filemap.c:134!
    [ 127.186782] invalid opcode: 0000 [#1] SMP
    [ 127.186783] CPU 7
    [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod
    [ 127.186801]
    [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR
    [ 127.186804] RIP: 0010:[] [] __delete_from_page_cache+0x15e/0x160
    [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002
    [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0
    [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00
    [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003
    [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8
    [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8
    [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000
    [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0
    [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0)
    [ 127.186821] Stack:
    [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b
    [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98
    [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000
    [ 127.186827] Call Trace:
    [ 127.186829] [] delete_from_page_cache+0x3b/0x80
    [ 127.186832] [] truncate_hugepages+0x115/0x220
    [ 127.186834] [] hugetlbfs_evict_inode+0x13/0x30
    [ 127.186837] [] evict+0xa7/0x1b0
    [ 127.186839] [] iput_final+0xd3/0x1f0
    [ 127.186840] [] iput+0x39/0x50
    [ 127.186842] [] d_kill+0xf8/0x130
    [ 127.186843] [] dput+0xd2/0x1a0
    [ 127.186845] [] __fput+0x170/0x230
    [ 127.186848] [] ? rb_erase+0xce/0x150
    [ 127.186849] [] fput+0x1d/0x30
    [ 127.186851] [] remove_vma+0x37/0x80
    [ 127.186853] [] do_munmap+0x2d2/0x360
    [ 127.186855] [] sys_shmdt+0xc9/0x170
    [ 127.186857] [] system_call_fastpath+0x16/0x1b
    [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0
    [ 127.186868] RIP [] __delete_from_page_cache+0x15e/0x160
    [ 127.186870] RSP
    [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]---

    The bug is a race and not always easy to reproduce. To reproduce it I was
    doing the following on a single socket I7-based machine with 16G of RAM.

    $ hugeadm --pool-pages-max DEFAULT:13G
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax
    $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall
    $ for i in `seq 1 9000`; do ./hugetlbfs-test; done

    On my particular machine, it usually triggers within 10 minutes but
    enabling debug options can change the timing such that it never hits.
    Once the bug is triggered, the machine is in trouble and needs to be
    rebooted. The machine will respond but processes accessing proc like "ps
    aux" will hang due to the BUG_ON. shutdown will also hang and needs a
    hard reset or a sysrq-b.

    The basic problem is a race between page table sharing and teardown. For
    the most part page table sharing depends on i_mmap_mutex. In some cases,
    it is also taking the mm->page_table_lock for the PTE updates but with
    shared page tables, it is the i_mmap_mutex that is more important.

    Unfortunately it appears to also be insufficient. Consider the
    following situation:

    Process A                               Process B
    ---------                               ---------
    hugetlb_fault                           shmdt
                                            LockWrite(mmap_sem)
                                              do_munmap
                                                unmap_region
                                                  unmap_vmas
                                                    unmap_single_vma
                                                      unmap_hugepage_range
                                                        Lock(i_mmap_mutex)
                                                        Lock(mm->page_table_lock)
                                                        huge_pmd_unshare/unmap tables
                                                        Unlock(mm->page_table_lock)
                                                        Unlock(i_mmap_mutex)
    huge_pte_alloc                          ...
      Lock(i_mmap_mutex)                    ...
      vma_prio_walk, find svma, spte        ...
      Lock(mm->page_table_lock)             ...
      share spte                            ...
      Unlock(mm->page_table_lock)           ...
      Unlock(i_mmap_mutex)                  ...
    hugetlb_no_page

    Process A can thus re-share page tables that Process B's teardown
    goes on to free, leaving A operating on freed page tables. The test
    program below reproduces the race; its play() helper touches every
    base page of a mapping:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    static size_t huge_page_size = (2UL << 20);
    static size_t nr_huge_page_A = 512;
    static size_t nr_huge_page_B = 5632;

    static void play(void *addr, size_t size)
    {
        char *start = addr, *end = start + size, *a;

        for (a = start; a < end; a += 4096)
            *a = 0;
    }

    int main(int argc, char **argv)
    {
        key_t key = IPC_PRIVATE;
        size_t sizeA = nr_huge_page_A * huge_page_size;
        size_t sizeB = nr_huge_page_B * huge_page_size;
        int shmidA, shmidB;
        void *addrA = NULL, *addrB = NULL;
        int nr_children = 300, n = 0;

        if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }
        if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) {
            perror("shmget:");
            return 1;
        }

        if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) {
            perror("shmat");
            return 1;
        }

    fork_child:
        switch (fork()) {
        case 0:
            switch (n % 3) {
            case 0:
                play(addrA, sizeA);
                break;
            case 1:
                play(addrB, sizeB);
                break;
            case 2:
                break;
            }
            break;
        case -1:
            perror("fork:");
            break;
        default:
            if (++n < nr_children)
                goto fork_child;
            play(addrA, sizeA);
            break;
        }
        shmdt(addrA);
        shmdt(addrB);
        do {
            wait(NULL);
        } while (--n > 0);
        shmctl(shmidA, IPC_RMID, NULL);
        shmctl(shmidB, IPC_RMID, NULL);
        return 0;
    }

    [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add the control files for hugetlb controller

    [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g]
    [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/]
    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • We will use them later in hugetlb_cgroup.c

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • hugepage_activelist will be used to track currently used HugeTLB pages.
    We need to find the in-use HugeTLB pages to support HugeTLB cgroup
    removal. On cgroup removal we update the page's HugeTLB cgroup to
    point to the parent cgroup.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Use a mmu_gather instead of a temporary linked list for accumulating pages
    when we unmap a hugepage range

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Add an inline helper and use it in the code.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

26 May, 2012

1 commit

  • The tile support for multiple-size huge pages requires tagging
    the hugetlb PTE with a "super" bit for PTEs that are multiples of
    the basic size of a pagetable span. To set that bit properly
    we need to tweak the PTE in make_huge_pte() based on the vma.

    This change provides the API for a subsequent tile-specific
    change to use.
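
    A sketch of such a hook, assuming a generic fallback that leaves the
    PTE untouched so only interested architectures override it:

    #ifndef arch_make_huge_pte
    static inline pte_t arch_make_huge_pte(pte_t entry,
                                           struct vm_area_struct *vma,
                                           struct page *page, int writable)
    {
        /* default: no per-arch adjustment of the huge PTE */
        return entry;
    }
    #endif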

    Reviewed-by: Hillf Danton
    Signed-off-by: Chris Metcalf

    Chris Metcalf
     

22 Mar, 2012

2 commits

  • When calling shmget() with SHM_HUGETLB, shmget aligns the request size to
    PAGE_SIZE, but this is not sufficient.

    Modify hugetlb_file_setup() to align requests to the huge page size, and
    to accept an address argument so that all alignment checks can be
    performed in hugetlb_file_setup(), rather than in its callers. Change
    newseg() and mmap_pgoff() to match the new prototype and eliminate a now
    redundant alignment check.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Steven Truelove
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Truelove
     
  • hugetlbfs_{get,put}_quota() are badly named. They don't interact with the
    general quota handling code, and they don't much resemble its behaviour.
    Rather than being about maintaining limits on on-disk block usage by
    particular users, they are instead about maintaining limits on in-memory
    page usage (including anonymous MAP_PRIVATE copied-on-write pages)
    associated with a particular hugetlbfs filesystem instance.

    Worse, they work by having callbacks to the hugetlbfs filesystem code from
    the low-level page handling code, in particular from free_huge_page().
    This is a layering violation of itself, but more importantly, if the
    kernel does a get_user_pages() on hugepages (which can happen from KVM
    amongst others), then the free_huge_page() can be delayed until after the
    associated inode has already been freed. If an unmount occurs at the
    wrong time, even the hugetlbfs superblock where the "quota" limits are
    stored may have been freed.

    Andrew Barry proposed a patch to fix this by having hugepages, instead
    of storing a pointer to their address_space and reaching the superblock
    from there, store pointers directly to the superblock, bumping the
    reference count as appropriate to avoid it being freed. Andrew Morton
    rejected that version, however, on the grounds that it made the
    existing layering violation worse.

    This is a reworked version of Andrew's patch, which removes the extra, and
    some of the existing, layering violation. It works by introducing the
    concept of a hugepage "subpool" at the lower hugepage mm layer - that is a
    finite logical pool of hugepages to allocate from. hugetlbfs now creates
    a subpool for each filesystem instance with a page limit set, and a
    pointer to the subpool gets added to each allocated hugepage, instead of
    the address_space pointer used now. The subpool has its own lifetime and
    is only freed once all pages in it _and_ all other references to it (i.e.
    superblocks) are gone.

    subpools are optional - a NULL subpool pointer is taken by the code to
    mean that no subpool limits are in effect.
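
    A sketch of the subpool object as described, one per filesystem
    instance with a page limit set:

    struct hugepage_subpool {
        spinlock_t lock;
        long count;       /* references: pages plus superblocks */
        long max_hpages;  /* maximum huge pages, or -1 if no maximum */
        long used_hpages; /* used count against maximum; includes
                           * both allocated and reserved pages */
    };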

    Previous discussion of this bug found in: "Fix refcounting in hugetlbfs
    quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or
    http://marc.info/?l=linux-mm&m=126928970510627&w=1

    v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to
    alloc_huge_page() - since it already takes the vma, it is not necessary.

    Signed-off-by: Andrew Barry
    Signed-off-by: David Gibson
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hillf Danton
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson