25 Jul, 2008

40 commits

  • On 32-bit architectures PAGE_ALIGN() truncates 64-bit values to the 32-bit
    boundary. For example:

    u64 val = PAGE_ALIGN(size);

    always returns a value < 4GB even if size is greater than 4GB.

    The problem resides in PAGE_MASK definition (from include/asm-x86/page.h for
    example):

    #define PAGE_SHIFT 12
    #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
    #define PAGE_MASK (~(PAGE_SIZE-1))
    ...
    #define PAGE_ALIGN(addr) (((addr)+PAGE_SIZE-1)&PAGE_MASK)

    The "~" is performed on a 32-bit value, so everything in "and" with
    PAGE_MASK greater than 4GB will be truncated to the 32-bit boundary.
    Using the ALIGN() macro seems to be the right way, because it uses
    typeof(addr) for the mask.
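
    For reference, the generic ALIGN() macro (from include/linux/kernel.h)
    reads roughly:

    #define ALIGN(x, a)           __ALIGN_MASK(x, (typeof(x))(a) - 1)
    #define __ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))

    Here the mask is computed in typeof(x), so for a u64 the "~" happens at
    full 64-bit width.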

    Also move the PAGE_ALIGN() definitions out of include/asm-*/page.h and
    into include/linux/mm.h.

    See also lkml discussion: http://lkml.org/lkml/2008/6/11/237

    [akpm@linux-foundation.org: fix drivers/media/video/uvc/uvc_queue.c]
    [akpm@linux-foundation.org: fix v850]
    [akpm@linux-foundation.org: fix powerpc]
    [akpm@linux-foundation.org: fix arm]
    [akpm@linux-foundation.org: fix mips]
    [akpm@linux-foundation.org: fix drivers/media/video/pvrusb2/pvrusb2-dvb.c]
    [akpm@linux-foundation.org: fix drivers/mtd/maps/uclinux.c]
    [akpm@linux-foundation.org: fix powerpc]
    Signed-off-by: Andrea Righi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • Make the needlessly global register_page_bootmem_info_section() static.

    Signed-off-by: Adrian Bunk
    Acked-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • This patch contains the following cleanups:
    - make the following needlessly global variables static:
      - required_kernelcore
      - zone_movable_pfn[]
    - make the following needlessly global functions static:
      - move_freepages()
      - move_freepages_block()
      - setup_pageset()
      - find_usable_zone_for_movable()
      - adjust_zone_range_for_zone_movable()
      - __absent_pages_in_range()
      - find_min_pfn_for_node()
      - find_zone_movable_pfns_for_nodes()

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • alloc_pages_exact() is similar to alloc_pages(), except that it allocates
    the minimum number of pages to fulfill the request. This is useful if you
    want to allocate a very large buffer that is slightly larger than an even
    power-of-two number of pages. In that case, alloc_pages() will waste a
    lot of memory.

    I have a video driver that wants to allocate a 5MB buffer. alloc_pages()
    will waste 3MB of physically-contiguous memory.
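
    A minimal sketch of the new pair as described here:

    /* allocate exactly enough pages for 5MB, trimming the excess */
    void *buf = alloc_pages_exact(5 * 1024 * 1024, GFP_KERNEL);
    if (!buf)
            return -ENOMEM;
    /* ...use buf... */
    free_pages_exact(buf, 5 * 1024 * 1024);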

    Signed-off-by: Timur Tabi
    Cc: Andi Kleen
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timur Tabi
     
  • Almost all users of this field need a PFN instead of a physical address,
    so replace node_boot_start with node_min_pfn.

    [Lee.Schermerhorn@hp.com: fix spurious BUG_ON() in mark_bootmem()]
    Signed-off-by: Johannes Weiner
    Cc:
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Since alloc_bootmem_core does no goal-fallback anymore and just returns
    NULL if the allocation fails, we can now use it in alloc_bootmem_section
    without all the fixup code for a misplaced allocation.

    Also, the limit can be the first PFN of the next section, as the
    semantics is that the limit lies _above_ the allocated region, not
    within it.

    Signed-off-by: Johannes Weiner
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    __alloc_bootmem_node already does this; make the interface consistent.

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    The old node-agnostic code tried allocating on all nodes, starting from
    the one with the lowest range. alloc_bootmem_core retried without the
    goal if it could not satisfy it, so the goal was only respected at all
    when it happened to be on the first (lowest page numbers) node (or,
    theoretically, if allocations failed on all nodes before the one holding
    the goal).

    Introduce a non-panicking helper that starts allocating from the node
    holding the goal and falls back only after all these attempts have
    failed, thus moving the goal-fallback code out of alloc_bootmem_core.

    Make all other allocation functions benefit from this new helper.
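
    A rough sketch of the shape of such a helper (names and checks are
    illustrative, not the exact patch):

    static void * __init ___alloc_bootmem_nopanic(unsigned long size,
                    unsigned long align, unsigned long goal,
                    unsigned long limit)
    {
            bootmem_data_t *bdata;
            void *region;

    restart:
            /* bdata_list is sorted by start PFN */
            list_for_each_entry(bdata, &bdata_list, list) {
                    /* skip nodes that end below the goal */
                    if (goal && bdata->node_low_pfn <= PFN_DOWN(goal))
                            continue;

                    region = alloc_bootmem_core(bdata, size, align,
                                                goal, limit);
                    if (region)
                            return region;
            }

            /* all attempts failed: retry once without the goal */
            if (goal) {
                    goal = 0;
                    goto restart;
            }

            return NULL;
    }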

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Introduce new helpers that mark either a range residing completely on
    one node or a node-agnostic range that may also span node boundaries.

    The free/reserve API functions will then directly use these helpers.

    Note that the free/reserve semantics become more strict: while the prior
    code took basically arbitrary range arguments and marked the PFNs that
    happen to fall into that range, the new code requires node-specific ranges
    to be completely on the node. The node-agnostic requests might span node
    boundaries as long as the nodes are contiguous.

    Passing ranges that do not satisfy these criteria is a bug.
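
    The node-specific entry point might look like this (sketch; __reserve()
    and __free() are the bitmap helpers factored out in the next patch):

    static void __init mark_bootmem_node(bootmem_data_t *bdata,
                    unsigned long start, unsigned long end,
                    int reserve, int flags)
    {
            unsigned long sidx, eidx;

            /* the range must be completely on the node */
            BUG_ON(start < bdata->node_min_pfn);
            BUG_ON(end > bdata->node_low_pfn);

            sidx = start - bdata->node_min_pfn;
            eidx = end - bdata->node_min_pfn;

            if (reserve)
                    __reserve(bdata, sidx, eidx, flags);
            else
                    __free(bdata, sidx, eidx);
    }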

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Factor out the common operation of marking a range on the bitmap.
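
    The free side of that common operation might look like (illustrative):

    static void __init __free(bootmem_data_t *bdata,
                    unsigned long sidx, unsigned long eidx)
    {
            unsigned long idx;

            /* clear one bit per page; freeing a free page is a bug */
            for (idx = sidx; idx < eidx; idx++)
                    if (!test_and_clear_bit(idx, bdata->node_bootmem_map))
                            BUG();
    }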

    [akpm@linux-foundation.org: fix various warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • alloc_bootmem_core has become quite nasty to read over time. This is a
    clean rewrite that keeps the semantics.

    bdata->last_pos has been dropped.

    bdata->last_success has been renamed to hint_idx and it is now an index
    relative to the node's range. Since further block searching might start
    at this index, it is now set to the end of a succeeded allocation rather
    than its beginning.

    bdata->last_offset has been renamed to last_end_off to be more clear that
    it represents the ending address of the last allocation relative to the
    node.

    [y-goto@jp.fujitsu.com: fix new alloc_bootmem_core()]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Rewrite the code in a more concise way using fewer variables.

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Cc: Yinghai Lu
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    link_bootmem handles the insertion of a new descriptor into the sorted
    list with more or less three explicit branches: empty list, insert in
    between, and append. These cases can be expressed implicitly.

    Also mark the sorted list as initdata as it can be thrown away after boot
    as well.
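
    All three cases collapse into a single sorted-insert loop, roughly:

    static void __init link_bootmem(bootmem_data_t *bdata)
    {
            struct list_head *iter;

            list_for_each(iter, &bdata_list) {
                    bootmem_data_t *ent;

                    ent = list_entry(iter, bootmem_data_t, list);
                    if (bdata->node_min_pfn < ent->node_min_pfn)
                            break;
            }
            /* handles empty list, insertion in between and append alike */
            list_add_tail(&bdata->list, iter);
    }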

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reincarnate get_mapsize as bootmap_bytes and implement
    bootmem_bootmap_pages on top of it.

    Adjust users of these helpers and make free_all_bootmem_core use
    bootmem_bootmap_pages instead of open-coding it.
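
    The resulting helpers are small (sketch):

    static unsigned long __init bootmap_bytes(unsigned long pages)
    {
            unsigned long bytes = (pages + 7) / 8;  /* one bit per page */

            return ALIGN(bytes, sizeof(long));
    }

    unsigned long __init bootmem_bootmap_pages(unsigned long pages)
    {
            unsigned long bytes = bootmap_bytes(pages);

            return PAGE_ALIGN(bytes) >> PAGE_SHIFT;
    }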

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce the bootmem_debug kernel parameter that enables very verbose
    diagnostics regarding all range operations of bootmem as well as the
    initialization and release of nodes.
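
    Hooking such a flag up takes only an early_param (sketch):

    static int bootmem_debug;

    static int __init bootmem_debug_setup(char *buf)
    {
            bootmem_debug = 1;
            return 0;
    }
    early_param("bootmem_debug", bootmem_debug_setup);

    Booting with bootmem_debug on the kernel command line then enables the
    verbose output.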

    [akpm@linux-foundation.org: fix printk warnings]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Change the description, move a misplaced comment about the allocator
    itself and add me to the list of copyright holders.

    Signed-off-by: Johannes Weiner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This only reorders functions so that further patches will be easier to
    read. No code changed.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With shared reservations (and now also with private reservations), we reserve
    huge pages at mmap time. We also account for the mapping against fs quota to
    prevent a reservation from being preempted by quota exhaustion.

    When testing with the libhugetlbfs test suite, I found a problem with quota
    accounting. FS quota for allocated pages is handled correctly but we are not
    releasing quota for private pages that were reserved but never allocated. Do
    this in hugetlb_vm_op_close() at the same time as unused page reservations are
    released.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • When removing a huge page from the hugepage pool for a fault the system checks
    to see if the mapping requires additional pages to be reserved, and if it does
    whether there are any unreserved pages remaining. If not, the allocation
    fails without even attempting to get a page. In order to determine whether to
    apply this check we call vma_has_private_reserves() which tells us if this vma
    is MAP_PRIVATE and is the owner. This incorrectly triggers the remaining
    reservation test for MAP_SHARED mappings which prevents allocation of the
    final page in the pool even though it is reserved for this mapping.

    In reality we only want to check this for MAP_PRIVATE mappings where the
    process is not the original mapper. Replace vma_has_private_reserves() with
    vma_has_reserves() which indicates whether further reserves are required, and
    update the caller.
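
    A sketch of the replacement based on the description above (the
    HPAGE_RESV_OWNER tracking comes from the related private-reservation
    patches):

    static int vma_has_reserves(struct vm_area_struct *vma)
    {
            /* shared mappings reserved their pages at mmap time */
            if (vma->vm_flags & VM_SHARED)
                    return 1;
            /* private mappings only if we are the original mapper */
            if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
                    return 1;
            return 0;
    }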

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Instead of using the variable mmu_huge_psize to keep track of the huge
    page size we use an array of MMU_PAGE_* values. For each supported huge
    page size we need to know the hugepte_shift value and have a
    pgtable_cache. The hstate or an mmu_huge_psizes index is passed to
    functions so that they know which huge page size they should use.

    The hugepage sizes 16M and 64K are set up (if available on the
    hardware) so that they don't have to be specified on the boot command
    line in order to be used. The number of 16G pages has to be specified
    at boot time though (e.g. hugepagesz=16G hugepages=5).

    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
    Add a check for an overflow in the filesystem size, so that if someone
    calls statfs() on a 16G-blocksize hugetlbfs from a 32-bit binary, it
    reports back EOVERFLOW instead of a size of 0.
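
    The idea, roughly, is to extend the existing 32-bit overflow check in
    put_compat_statfs() to cover the block size fields as well (sketch):

    if (sizeof(ubuf->f_blocks) == 4 &&
        (kbuf->f_blocks | kbuf->f_bfree | kbuf->f_bavail |
         kbuf->f_bsize | kbuf->f_frsize) & 0xffffffff00000000ULL)
            return -EOVERFLOW;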

    Acked-by: Nishanth Aravamudan
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • The huge page size is defined for 16G pages. If a hugepagesz of 16G is
    specified at boot-time then it becomes the huge page size instead of the
    default 16M.

    The change in pgtable-64K.h is to the macro pte_iterate_hashed_subpages,
    making the increment to va (the 1 being shifted) a long so that it is
    not shifted to 0. Otherwise it would create an infinite loop when the
    shift value is for a 16G page (when the base page size is 64K).
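
    The type fix is tiny but essential:

    /* before: 1 is an int, so for a 16G page (shift >= 32) the
     * increment overflows to 0 and va never advances */
    va += 1 << shift;

    /* after: do the shift in long arithmetic */
    va += 1L << shift;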

    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
    The 16G huge pages have to be reserved in the HMC prior to boot. The
    locations of the pages are placed in the device tree. This patch adds
    code to scan the device tree during very early boot and save these page
    locations until hugetlbfs is ready for them.

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • The 16G page locations have been saved during early boot in an array. The
    alloc_bootmem_huge_page() function adds a page from here to the
    huge_boot_pages list.

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • Allow alloc_bootmem_huge_page() to be overridden by architectures that
    can't always use bootmem. This requires huge_boot_pages to be available
    for use by this function.

    This is required for powerpc 16G pages, which have to be reserved prior
    to boot time. The locations of these pages are indicated in the device
    tree.

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
    Allow configurations with a default huge page size that is different
    from the traditional HPAGE_SIZE. The default huge page size is the one
    represented in the legacy /proc ABIs and SHM, and the one defaulted to
    when mounting hugetlbfs filesystems.

    This is implemented with a new kernel option default_hugepagesz=, which
    defaults to HPAGE_SIZE if not specified.
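
    For example (illustrative values), booting an x86-64 machine with:

    default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512

    pre-allocates both pool sizes while making 1G the size reported by the
    legacy ABIs and used for SHM and default hugetlbfs mounts.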

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add an hugepagesz=... option similar to IA64, PPC etc. to x86-64.

    This finally allows selecting GB pages for hugetlbfs on x86, now that
    all the infrastructure is in place.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Acked-by: Adam Litke
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
    Straightforward extensions for huge pages located in the PUD instead of
    PMDs.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • - Reword sentence to clarify meaning with multiple options
    - Add support for using GB prefixes for the page size
    - Add extra printk to delayed > MAX_ORDER allocation code

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Make some infrastructure changes to allow boot-time allocation of
    different hugepage page sizes.

    - move all basic hstate initialisation into hugetlb_add_hstate
    - create a new function hugetlb_hstate_alloc_pages() to do the
      actual initial page allocations. Call this function early in
      order to allocate giant pages from bootmem.
    - Check for multiple hugepages= parameters

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
    not practical to enlarge MAX_ORDER to 1GB.

    Instead the 1GB pages are only allocated at boot using the bootmem
    allocator using the hugepages=... option.

    These 1G bootmem pages are never freed. In theory it would be possible
    to implement that with some complications, but since it would be a
    one-way street (>= MAX_ORDER pages cannot be allocated later) I decided
    against it for now.

    The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very
    big and the ifdef ugliness did not seem worth it.

    Known problems: /proc/meminfo and "free" do not display the memory
    allocated for gb pages in "Total". This is a little confusing for the
    user.
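
    Roughly, the boot-time path stashes bootmem blocks on huge_boot_pages
    and turns them into compound pages once the page allocator is up
    (sketch, simplified to a single node):

    static int __init alloc_bootmem_huge_page(struct hstate *h)
    {
            struct huge_bootmem_page *m;

            /* grab a naturally aligned block from bootmem */
            m = __alloc_bootmem_node_nopanic(NODE_DATA(0),
                            huge_page_size(h), huge_page_size(h), 0);
            if (!m)
                    return 0;

            /* the list head lives inside the block itself until
             * gather-time, when it becomes a compound page */
            m->hstate = h;
            list_add(&m->list, &huge_boot_pages);
            return 1;
    }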

    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • hugetlb will need to get compound pages from bootmem to handle the case of
    them being greater than or equal to MAX_ORDER. Export the constructor
    function needed for this.

    Acked-by: Adam Litke
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
    Straightforward variant of the existing __alloc_bootmem_node that does
    not panic on allocation failure. Used by a subsequent patch when
    allocating giant hugepages at boot -- we don't want to panic if we
    can't allocate as many as the user asked for.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Need this as a separate function for a future patch.

    No behaviour change.

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Provide new hugepages user APIs that are more suited to multiple hstates
    in sysfs. There is a new directory, /sys/kernel/hugepages. Underneath
    that directory there will be a directory per-supported hugepage size,
    e.g.:

    /sys/kernel/hugepages/hugepages-64kB
    /sys/kernel/hugepages/hugepages-16384kB
    /sys/kernel/hugepages/hugepages-16777216kB

    corresponding to 64k, 16m and 16g respectively. Within each
    hugepages-size directory there are a number of files, corresponding to the
    tracked counters in the hstate, e.g.:

    /sys/kernel/hugepages/hugepages-64kB/nr_hugepages
    /sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
    /sys/kernel/hugepages/hugepages-64kB/free_hugepages
    /sys/kernel/hugepages/hugepages-64kB/resv_hugepages
    /sys/kernel/hugepages/hugepages-64kB/surplus_hugepages

    Of these files, the first two are read-write and the latter three are
    read-only. The size of the hugepage being manipulated is trivially
    deducible from the enclosing directory and is always expressed in kB (to
    match meminfo).

    [dave@linux.vnet.ibm.com: fix build]
    [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
    [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Cc: Dave Hansen
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
    Add the ability to configure the hugetlb hstate used on a per-mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
      the page size
    - This option causes the mount code to find the hstate corresponding to
      the specified size, and sets up a pointer to the hstate in the
      mount's superblock.
    - Change the hstate accessors to use this information rather than the
      global_hstate they were using (requires a slight change in mm/memory.c
      so we don't NULL deref in the error-unmap path -- see comments).

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
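
    Usage is then, e.g. (the valid sizes depend on what the architecture
    registered):

    mount -t hugetlbfs -o pagesize=64K none /mnt/huge-64k
    mount -t hugetlbfs -o pagesize=16G none /mnt/huge-16g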

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates (see the
      sketch below)
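
    Illustrative shapes of the array and the registration hook (details may
    differ from the patch):

    struct hstate hstates[HUGE_MAX_HSTATE];
    static unsigned int max_hstate;

    #define for_each_hstate(h) \
            for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)

    /* architectures call this once per supported page-size order */
    void __init hugetlb_add_hstate(unsigned order);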

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code that requires
    these fields; callers do the right thing regardless of the exact hstate
    they are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.
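
    Abbreviated sketch of the struct (the real field list is longer):

    struct hstate {
            unsigned int order;    /* huge page size is PAGE_SIZE << order */
            unsigned long mask;
            unsigned long max_huge_pages;
            unsigned long nr_huge_pages;
            unsigned long free_huge_pages;
            unsigned long resv_huge_pages;
            unsigned long surplus_huge_pages;
            /* ...per-node counters, free lists, name for sysfs... */
    };

    static struct hstate default_hstate;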

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen