12 Feb, 2009

1 commit

  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and to account for filesystem quota. If either fails, the fault
    fails. The impact is that quota ends up being accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To
    help prevent this mistake from happening again, it improves the
    documentation of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman
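
    A minimal sketch of the intended single-accounting flow at fault time, as
    described above. The helper names are illustrative, not the real
    mm/hugetlb.c API: quota should be charged either when the reservation is
    made or at fault time, never both.

    /* Hypothetical sketch -- helper names are illustrative, not the real API. */
    static struct page *fault_in_hugepage(struct vm_area_struct *vma,
                                          unsigned long addr)
    {
            int charged = 0;
            struct page *page;

            if (!vma_has_reserved_page(vma, addr)) {
                    /* No reserve: charge filesystem quota now, at fault time only. */
                    if (hugetlbfs_take_quota(vma->vm_file, 1))
                            return NULL;            /* quota exhausted -> fault fails */
                    charged = 1;
            }

            page = dequeue_or_alloc_hugepage(vma, addr);
            if (!page) {
                    if (charged)
                            hugetlbfs_release_quota(vma->vm_file, 1);
                    return NULL;                    /* allocation failed -> fault fails */
            }
            return page;
    }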
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it is too easy for applications to be
    SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs: it can trigger an
    OOM storm when hugepage pools are too small, and lockups and corrupted
    counters otherwise. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE, but prevents VM_ACCOUNT from being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
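
    A rough sketch of the flag policy described above for hugetlb-backed
    mappings. This is illustrative only (the real change sits in the mmap
    path and hugetlb_reserve_pages()); the helper below is made up.

    /* Illustrative only: how vm_flags could be chosen for a hugetlb mapping. */
    static unsigned long hugetlb_vm_flags(unsigned long vm_flags, int noreserve)
    {
            /*
             * hugetlbfs does its own reservation accounting in hugepage-sized
             * units, so the core VM must never account it: keep VM_ACCOUNT clear.
             */
            vm_flags &= ~VM_ACCOUNT;

            /*
             * The application may opt out of reservations (MAP_NORESERVE or
             * SHM_NORESERVE) at the risk of being killed later.
             */
            if (noreserve)
                    vm_flags |= VM_NORESERVE;

            return vm_flags;
    }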
     

07 Jan, 2009

4 commits

  • At this point we already know that 'addr' is not NULL, so get rid of the
    redundant 'if'. gcc probably eliminates it during optimization anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
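
    The pattern being cleaned up, shown on a made-up __weak hook (the function
    name is illustrative, not the one touched by the patch):

    /*
     * Before (illustrative):
     *         if (addr)
     *                 pr_info("addr=%p\n", addr);
     * After -- 'addr' is already known to be non-NULL here:
     */
    void __weak arch_report_addr(void *addr)
    {
            pr_info("addr=%p\n", addr);
    }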
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
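
    The warning refers to the pattern below; the function names are made up,
    only the shape of the fix matters:

    static void adjust_pool(struct hstate *h, int delta);  /* some void helper */

    /* sparse: "returning void-valued expression" */
    static void shrink_pool(struct hstate *h)
    {
            return adjust_pool(h, -1);
    }

    /* Preferred form: call the helper, then return plainly (or not at all). */
    static void shrink_pool_fixed(struct hstate *h)
    {
            adjust_pool(h, -1);
    }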
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    where a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
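
    A small userspace sketch for eyeballing the two fields: it simply echoes
    the relevant lines from /proc/self/smaps (the fields are of course only
    present on kernels carrying these patches).

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/smaps", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    /* e.g. "KernelPageSize:        4 kB" and "MMUPageSize: ..." */
                    if (!strncmp(line, "KernelPageSize:", 15) ||
                        !strncmp(line, "MMUPageSize:", 12))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }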
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2008

1 commit

  • Oops. Part of the hugetlb private reservation code was not fully
    converted to use hstates.

    When a huge page must be unmapped from VMAs due to a failed COW,
    HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of
    the page size being used. This works if the VMA is using the default
    huge page size. Otherwise we might unmap too much, too little, or
    trigger a BUG_ON. Rare but serious -- fix it.

    Signed-off-by: Adam Litke
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
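
    The shape of the fix, sketched (the exact call site is the failed-COW
    unmap path; surrounding code elided): derive the size from the VMA's
    hstate rather than hardcoding HPAGE_SIZE.

    struct hstate *h = hstate_vma(vma);

    /* Before: always stepped by the default huge page size. */
    /*   unmap_hugepage_range(iter_vma, address, address + HPAGE_SIZE, page); */

    /* After: honour the page size actually backing this VMA. */
    unmap_hugepage_range(iter_vma, address,
                         address + huge_page_size(h), page);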
     

07 Nov, 2008

2 commits

  • As we can determine exactly when a gigantic page is in use we can optimise
    the common regular page cases by pulling out gigantic page initialisation
    into its own function. As gigantic pages are never released to the buddy
    allocator we do not need a destructor. This effectively reverts the
    previous change to the main buddy allocator. It also adds a paranoid
    check to ensure we never release gigantic pages from hugetlbfs to the
    main buddy allocator.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
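
    Roughly the split being described, with the paranoid check on the free
    side (a sketch, not the literal patch):

    /* Initialisation: pick the path based on the page order. */
    if (h->order >= MAX_ORDER)
            prep_compound_gigantic_page(page, h->order);  /* tail pages set up by hand */
    else
            prep_compound_page(page, h->order);           /* common, buddy-sized case */

    /* Release side: gigantic pages must never reach the buddy allocator. */
    VM_BUG_ON(h->order >= MAX_ORDER);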
     
  • When working with hugepages, hugetlbfs assumes that those hugepages are
    smaller than MAX_ORDER. Specifically it assumes that the mem_map is
    contiguous and uses that to optimise access to the elements of the mem_map
    that represent the hugepage. Gigantic pages (such as 16GB pages on
    powerpc) by definition are of greater order than MAX_ORDER (larger than
    MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of
    the buddy allocator guarantee that the mem_map is contiguous within
    maximally aligned areas of MAX_ORDER_NR_PAGES pages.

    This patch adds new mem_map accessors and iterator helpers which handle
    any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to
    implement gigantic page versions of copy_huge_page and clear_huge_page,
    and to allow follow_hugetlb_page to handle gigantic pages.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
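
    A sketch of the kind of accessors added (similar helpers went into
    mm/internal.h): within a MAX_ORDER_NR_PAGES block plain pointer
    arithmetic is fine, across a boundary pfn_to_page() is used instead.

    /* Return the page 'offset' pages after base, tolerating discontiguity. */
    static inline struct page *mem_map_offset(struct page *base, int offset)
    {
            if (unlikely(offset >= MAX_ORDER_NR_PAGES))
                    return pfn_to_page(page_to_pfn(base) + offset);
            return base + offset;
    }

    /* Iterator helper: step to the next page, re-resolving at block boundaries. */
    static inline struct page *mem_map_next(struct page *iter,
                                            struct page *base, int offset)
    {
            if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
                    unsigned long pfn = page_to_pfn(base) + offset;

                    if (!pfn_valid(pfn))
                            return NULL;
                    return pfn_to_page(pfn);
            }
            return iter + 1;
    }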
     

23 Oct, 2008

1 commit


20 Oct, 2008

3 commits

  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for coredumping and hugepages can't be core dumped.

    However, we have now implemented hugepage coredumping. Therefore we should
    support the zero page for hugepages too.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
      A normal page is checked as follows:

        static inline int use_zero_page(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                        return 0;

                return !vma->vm_ops || !vma->vm_ops->fault;
        }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
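
    The core of the split, sketched in simplified form (the list names follow
    the patchset, the helper is condensed): the single active/inactive pair
    becomes a pair per page type, and the anon/file decision hangs off the
    page itself.

    enum lru_list {
            LRU_INACTIVE_ANON,
            LRU_ACTIVE_ANON,
            LRU_INACTIVE_FILE,
            LRU_ACTIVE_FILE,
            NR_LRU_LISTS,
    };

    /* Swap-backed pages (anon, tmpfs) go on the anon lists, the rest on file. */
    static inline enum lru_list page_lru(struct page *page, int active)
    {
            enum lru_list base = page_is_file_cache(page) ?
                                 LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;

            return base + (active ? 1 : 0);
    }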
     

17 Oct, 2008

1 commit

  • The page fault path for normal pages, if the fault is neither a no-page
    fault nor a write-protect fault, will update the DIRTY and ACCESSED bits
    in the page table appropriately.

    The hugepage fault path, however, does not do this, handling only no-page
    or write-protect type faults. It assumes that either the ACCESSED and
    DIRTY bits are irrelevant for hugepages (usually true, since they are
    never swapped) or that they are handled by the arch code.

    This is inconvenient for some software-loaded TLB architectures, where the
    _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write)
    access to the page at the TLB miss. This could be worked around in the
    arch TLB miss code, but the TLB miss fast path is easier to keep simple
    if the hugetlb_fault() path handles this, as the normal page fault path
    does.

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
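
    The shape of the change, sketched: when the huge pte is already present
    and the fault is not a COW, just refresh the access/dirty bits, much as
    the normal fault path does (details and locking elided).

    /* Sketch: inside hugetlb_fault(), once ptep points at a present entry. */
    entry = huge_ptep_get(ptep);
    entry = pte_mkyoung(entry);                       /* mark ACCESSED */
    if (write_access)
            entry = pte_mkdirty(entry);               /* mark DIRTY on write faults */
    if (huge_ptep_set_access_flags(vma, address, ptep, entry, write_access))
            update_mmu_cache(vma, address, entry);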
     

13 Aug, 2008

3 commits

  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc(), which can sleep with the
    mm->page_table_lock held. This is wrong and triggers a may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory,
    without actually making the change. The second phase actually commits the
    change. This patch makes use of this by checking the reservations before
    the page_table_lock is taken, triggering any necessary allocations. The
    check can then be safely repeated within the lock without any allocations
    being required.

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
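
    The two-phase pattern described above, sketched around the fault path
    (the helper names follow the patch; the surrounding code is elided):

    /* Phase 1: before taking mm->page_table_lock.  May kmalloc(), may sleep. */
    if (vma_needs_reservation(h, vma, address) < 0)
            return VM_FAULT_OOM;

    spin_lock(&mm->page_table_lock);
    /* ... allocate and instantiate the huge page under the lock ... */

    /* Phase 2: commit the already-prepared region entry -- no allocation here. */
    vma_commit_reservation(h, vma, address);
    spin_unlock(&mm->page_table_lock);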
     
  • The s390 software large page emulation implements shared page tables by
    using page->index of the first tail page from a compound large page to
    store page table information. This is set up in arch_prepare_hugepage(),
    which is called from alloc_fresh_huge_page_node().

    A similar call to arch_prepare_hugepage() is missing for surplus large
    pages that are allocated in alloc_buddy_huge_page(), which breaks the
    software emulation mode for (surplus) large pages on s390. This patch
    adds the missing call to arch_prepare_hugepage(). It will have no effect
    on other architectures where arch_prepare_hugepage() is a nop.

    Also, use the correct order in the error path in alloc_fresh_huge_page_node().

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Acked-by: Nick Piggin
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
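
    A sketch of the missing call being added (error handling condensed);
    arch_prepare_hugepage() is a nop on architectures other than s390:

    /* In alloc_buddy_huge_page(), mirror what the fresh-page path already does. */
    page = alloc_pages(htlb_alloc_mask | __GFP_COMP | __GFP_NOWARN,
                       huge_page_order(h));
    if (page && arch_prepare_hugepage(page)) {
            __free_pages(page, huge_page_order(h));
            page = NULL;                              /* treat as allocation failure */
    }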
     

07 Aug, 2008

1 commit


02 Aug, 2008

2 commits

  • Some platforms decide whether they support huge pages at boot time. On
    these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is
    set to 0 when there is no such support.

    The patches to introduce multiple huge page support broke that, causing
    the kernel to crash at boot time on machines such as POWER3 which lack
    support for multiple page sizes.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
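
    One plausible form of the guard (illustrative; the actual fix lives in the
    hugetlb initialisation path): bail out of hugetlb setup when the platform
    reported no huge page support at boot.

    static int __init hugetlb_init(void)
    {
            /* HPAGE_SHIFT is 0 on platforms that found no hugepage support. */
            if (HPAGE_SHIFT == 0)
                    return 0;

            /* ... normal hstate setup and pool allocation ... */
            return 0;
    }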
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
    mm/hugetlb.c must #include
    video: Fix up hp6xx driver build regressions.
    sh: defconfig updates.
    sh: Kill off stray mach-rsk7203 reference.
    serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
    sh: Move out individual boards without mach groups.
    sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
    sh: Allow SH-3 and SH-5 to use common headers.
    sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
    sh/maple: clean maple bus code
    sh: More header path fixups for mach dir refactoring.
    sh: Move out the solution engine headers to arch/sh/include/mach-se/
    sh: I2C fix for AP325RXA and Migo-R
    sh: Shuffle the board directories in to mach groups.
    sh: dma-sh: Fix up dreamcast dma.h mach path.
    sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
    sh: Add ARCH_DEFCONFIG entries for sh and sh64.
    sh: Fix compile error of Solution Engine
    sh: Proper __put_user_asm() size mismatch fix.
    sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
    ...

    Linus Torvalds
     

30 Jul, 2008

1 commit

  • This patch fixes the following build error on sh caused by
    commit aa888a74977a8f2120ae9332376e179c39a6b07d
    (hugetlb: support larger than MAX_ORDER):

    ...
    CC mm/hugetlb.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
    make[2]: *** [mm/hugetlb.o] Error 1

    Reported-by: Adrian Bunk
    Signed-off-by: Adrian Bunk
    Signed-off-by: Paul Mundt

    Adrian Bunk
     

29 Jul, 2008

2 commits

  • This patch fixes the following build error on sh caused by commit
    aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than
    MAX_ORDER"):

    mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'

    Signed-off-by: Adrian Bunk
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case with KVM, where a secondary tlb miss will
    walk sptes in hardware and refill the secondary tlb transparently to
    software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids requiring the code that
    tears down the secondary MMU mappings to implement logic like tlb_gather
    in zap_page_range, which would need many IPIs to flush other cpu tlbs for
    each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and a secondary tlb like the shadow-pagetable
    layer of kvm), or a remote DMA in software like XPMEM (hence the need to
    schedule in XPMEM code to send the invalidate to the remote node, while
    there is no need to schedule in kvm/gru as it's an immediate event like
    invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track of whether the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter, increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used at driver startup, a failure can be gracefully handled. Here is
    an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
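
    For orientation, a simplified sketch of the callback set a secondary-MMU
    driver hooks into (see include/linux/mmu_notifier.h for the real
    structure):

    struct mmu_notifier_ops {
            void (*release)(struct mmu_notifier *mn, struct mm_struct *mm);
            int  (*clear_flush_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                                      unsigned long address);
            void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
                                    unsigned long address);
            void (*invalidate_range_start)(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start, unsigned long end);
            void (*invalidate_range_end)(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start, unsigned long end);
    };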
     

27 Jul, 2008

1 commit

  • Fixes a build failure reported by Alan Cox:

    mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507:
    error: implicit declaration of function `cpuset_mems_nr'

    Also reverts Ingo's

    commit e44d1b2998d62a1f2f4d7eb17b56ba396535509f
    Author: Ingo Molnar
    Date: Fri Jul 25 12:57:41 2008 +0200

    mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL

    which fixed the build error but added some unused-static-function warnings.

    Signed-off-by: Nishanth Aravamudan
    Cc: Alan Cox
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

26 Jul, 2008

1 commit

  • On !CONFIG_SYSCTL on x86 with latest -git I get:

    mm/hugetlb.c: In function 'decrement_hugepage_resv_vma':
    mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function)
    mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once
    mm/hugetlb.c:83: error: for each function it appears in.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

25 Jul, 2008

15 commits

  • With shared reservations (and now also with private reservations), we reserve
    huge pages at mmap time. We also account for the mapping against fs quota to
    prevent a reservation from being preempted by quota exhaustion.

    When testing with the libhugetlbfs test suite, I found a problem with quota
    accounting. FS quota for allocated pages is handled correctly but we are not
    releasing quota for private pages that were reserved but never allocated. Do
    this in hugetlb_vm_op_close() at the same time as unused page reservations are
    released.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
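
    The shape of the fix, sketched (names follow the hugetlb code of the time;
    details elided): when the VMA goes away, return the unused reservation
    and the quota that was taken for it.

    /* End of hugetlb_vm_op_close() for a reservation-owning VMA. */
    reserve = (end - start) - region_count(&reservations->regions, start, end);
    if (reserve) {
            hugetlb_acct_memory(h, -reserve);                    /* give pages back */
            hugetlb_put_quota(vma->vm_file->f_mapping, reserve); /* and the quota  */
    }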
     
  • When removing a huge page from the hugepage pool for a fault the system checks
    to see if the mapping requires additional pages to be reserved, and if it does
    whether there are any unreserved pages remaining. If not, the allocation
    fails without even attempting to get a page. In order to determine whether to
    apply this check we call vma_has_private_reserves() which tells us if this vma
    is MAP_PRIVATE and is the owner. This incorrectly triggers the remaining
    reservation test for MAP_SHARED mappings which prevents allocation of the
    final page in the pool even though it is reserved for this mapping.

    In reality we only want to check this for MAP_PRIVATE mappings where the
    process is not the original mapper. Replace vma_has_private_reserves() with
    vma_has_reserves() which indicates whether further reserves are required, and
    update the caller.

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
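
    Roughly the replacement helper (a sketch close to the code of the time):
    shared mappings always have a reservation to draw on, private mappings
    only if this VMA is the original owner.

    /* Do reserved pages back this mapping? */
    static int vma_has_reserves(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & VM_SHARED)
                    return 1;                 /* shared: reserved at mmap time */
            if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
                    return 1;                 /* private: owner holds the reserve */
            return 0;
    }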
     
  • Allow alloc_bootmem_huge_page() to be overridden by architectures that
    can't always use bootmem. This requires huge_boot_pages to be available
    for use by this function.

    This is required for powerpc 16G pages, which have to be reserved prior to
    boot time. The locations of these pages are indicated in the device tree.

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • Allow configurations with a default huge page size that is different from
    the traditional HPAGE_SIZE. The default huge page size is the one
    represented in the legacy /proc ABIs, SHM, and which is defaulted to when
    mounting hugetlbfs filesystems.

    This is implemented with a new kernel option default_hugepagesz=, which
    defaults to HPAGE_SIZE if not specified.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Straightforward extensions for huge pages located in the PUD instead of
    PMDs.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • - Reword sentence to clarify meaning with multiple options
    - Add support for using GB prefixes for the page size
    - Add extra printk to delayed > MAX_ORDER allocation code

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Make some infrastructure changes to allow boot-time allocation of
    different hugepage page sizes.

    - move all basic hstate initialisation into hugetlb_add_hstate
    - create a new function hugetlb_hstate_alloc_pages() to do the
    actual initial page allocations. Call this function early in
    order to allocate giant pages from bootmem.
    - Check for multiple hugepages= parameters

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
    not practical to enlarge MAX_ORDER to 1GB.

    Instead the 1GB pages are only allocated at boot with the bootmem
    allocator, via the hugepages=... option.

    These 1G bootmem pages are never freed. In theory it would be possible to
    implement that with some complications, but since it would be a one-way
    street (>= MAX_ORDER pages cannot be allocated later) I decided not to do
    so for now.

    The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very
    big and the ifdef ugliness did not seem worth it.

    Known problems: /proc/meminfo and "free" do not display the memory
    allocated for gb pages in "Total". This is a little confusing for the
    user.

    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Need this as a separate function for a future patch.

    No behaviour change.

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Provide new hugepages user APIs that are more suited to multiple hstates
    in sysfs. There is a new directory, /sys/kernel/hugepages. Underneath
    that directory there will be a directory per-supported hugepage size,
    e.g.:

    /sys/kernel/hugepages/hugepages-64kB
    /sys/kernel/hugepages/hugepages-16384kB
    /sys/kernel/hugepages/hugepages-16777216kB

    corresponding to 64k, 16m and 16g respectively. Within each
    hugepages-size directory there are a number of files, corresponding to the
    tracked counters in the hstate, e.g.:

    /sys/kernel/hugepages/hugepages-64kB/nr_hugepages
    /sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
    /sys/kernel/hugepages/hugepages-64kB/free_hugepages
    /sys/kernel/hugepages/hugepages-64kB/resv_hugepages
    /sys/kernel/hugepages/hugepages-64kB/surplus_hugepages

    Of these files, the first two are read-write and the latter three are
    read-only. The size of the hugepage being manipulated is trivially
    deducible from the enclosing directory and is always expressed in kB (to
    match meminfo).

    [dave@linux.vnet.ibm.com: fix build]
    [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
    [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Cc: Dave Hansen
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Add the ability to configure the hugetlb hstate used on a per mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
    the page size
    - This option causes the mount code to find the hstate corresponding to the
    specified size, and sets up a pointer to the hstate in the mount's
    superblock.
    - Change the hstate accessors to use this information rather than the
    global_hstate they were using (requires a slight change in mm/memory.c
    so we don't NULL deref in the error-unmap path -- see comments).

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
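
    A condensed sketch of the registration and iteration side (names follow
    the patch, details elided):

    struct hstate hstates[HUGE_MAX_HSTATE];
    static unsigned int max_hstate;                   /* number of registered sizes */

    #define default_hstate (hstates[0])               /* the boot-time default size */
    #define for_each_hstate(h) \
            for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)

    /* Architectures register an additional huge page size by its page order. */
    void __init hugetlb_add_hstate(unsigned order);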
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code which requires these
    fields; callers will do the right thing regardless of the exact hstate they
    are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
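
    A trimmed-down sketch of the structure and the accessors the rest of the
    code uses (field list abbreviated):

    /* Per-size hugetlb state. */
    struct hstate {
            unsigned int order;                 /* huge page size as a page order */
            unsigned long mask;
            unsigned long nr_huge_pages;
            unsigned long free_huge_pages;
            unsigned long resv_huge_pages;
            unsigned long surplus_huge_pages;
            /* ... per-node counts, overcommit limits, etc. ... */
    };

    static inline unsigned long huge_page_size(struct hstate *h)
    {
            return (unsigned long)PAGE_SIZE << h->order;
    }

    static inline unsigned int huge_page_order(struct hstate *h)
    {
            return h->order;
    }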
     
  • Needed to avoid code duplication in follow up patches.

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Hugh adds: vma_pagecache_offset() has a dangerously misleading name, since
    it's using hugepage units: rename it to vma_hugecache_offset().

    [apw@shadowen.org: restack onto fixed MAP_PRIVATE reservations]
    [akpm@linux-foundation.org: vma_split conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Nick Piggin
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
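
    For reference, the renamed helper computes the offset of an address
    within the mapping in huge-page units, not base pages (a sketch close to
    the real function):

    static pgoff_t vma_hugecache_offset(struct hstate *h,
                                        struct vm_area_struct *vma,
                                        unsigned long address)
    {
            return ((address - vma->vm_start) >> huge_page_shift(h)) +
                    (vma->vm_pgoff >> huge_page_order(h));
    }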