07 Jan, 2006

40 commits

  • This adds the function get_swap_page_of_type(), which lets us specify an
    index into swap_info[] and select the swap_info_struct structure to be
    used for allocating a swap page.

    This function (or another one of similar functionality) will be necessary
    for implementing the image-writing part of swsusp in user space. It can
    also be used to simplify the current in-kernel implementation of the
    image-writing part of swsusp.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
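
    A minimal usage sketch, assuming the signature swp_entry_t
    get_swap_page_of_type(int type); everything around the call is
    illustrative:

        /* Allocate one swap slot from the swap area at swap_info[type]. */
        swp_entry_t entry = get_swap_page_of_type(type);

        if (!entry.val)
                return -ENOSPC;         /* that particular swap area is full */
        /* ... write an image page into the slot, or swap_free(entry)
         * to give the slot back ... */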
     
  • On architectures that implement sparsemem but not discontigmem, we want
    to be able to hide the flatmem option in some cases. On ppc64, for
    example, when we select NUMA we must not select flatmem.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • The attached patch makes the SYSV IPC shared memory facilities use the new
    ramfs facilities on a no-MMU kernel.

    The following changes are made:

    (1) There are now shmem_mmap() and shmem_get_unmapped_area() functions to
    allow the IPC SHM facilities to commune with the tiny-shmem and shmem
    code.

    (2) ramfs files now need resizing using do_truncate() rather than by modifying
    the inode size directly (see shmem_file_setup()). This causes ramfs to
    attempt to bind a block of pages of sufficient size to the inode.

    (3) CONFIG_SYSVIPC is no longer contingent on CONFIG_MMU.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
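
    A hedged sketch of point (2); the four-argument do_truncate() shape
    matches contemporary 2.6 kernels but is an assumption here, and the
    helper name is made up for illustration:

        /* Size a ramfs-backed SYSV SHM file via the truncate path so that
         * ramfs tries to bind enough pages to the inode up front. */
        static int shm_size_ramfs_file(struct file *file, loff_t size)
        {
                /* before: file->f_dentry->d_inode->i_size = size; */
                return do_truncate(file->f_dentry, size, 0, file);
        }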
     
  • Optimise page_state manipulations by introducing interrupt-unsafe
    accessors to page_state fields. Callers must provide their own locking
    (either by disabling interrupts or by not updating from interrupt
    context).

    Switch over the hot callsites that can easily be moved inside
    interrupts-off sections.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
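
    A hedged sketch of how the interrupt-unsafe accessors are meant to be
    used; the accessor and field names follow contemporary 2.6 kernels but
    treat the snippet as illustrative:

        unsigned long flags;

        local_irq_save(flags);          /* caller provides the exclusion... */
        __mod_page_state(pgfault, 1);   /* ...so the update can skip it */
        local_irq_restore(flags);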
     
  • Give j and r meaningful names.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The use of k in the inner loop means that the highest zone nr is always
    used if any zone of a node is populated. This means that the policy zone
    is not correctly determined on arches that do not use HIGHMEM, such as
    ia64.

    Change the loop to decrement k which also simplifies the BUG_ON.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
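
    A hedged sketch of the reworked scan, not the literal patch: walking the
    zone index downwards means the first populated zone found really is the
    highest one, and a single BUG_ON covers the no-populated-zone case:

        int k;

        for (k = MAX_NR_ZONES - 1; k >= 0; k--)
                if (pgdat->node_zones[k].present_pages)
                        break;
        BUG_ON(k < 0);  /* every node must have a populated zone */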
     
  • Currently the function that builds a zonelist for a BIND policy has the
    side effect of setting policy_zone. This seems a bit strange. policy_zone
    does not seem to be initialized anywhere else and is therefore 0. Do we
    police ZONE_DMA if no bind policy has been used yet?

    This patch moves the determination of the zone to apply policies to into
    the page allocator. We determine the zone while building the zonelist for
    nodes.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Simplify build_zonelists_node by removing the case statement.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There are numerous places where we check whether a zone is populated.

    Provide a helper function to check for populated zones and convert all
    direct checks of zone->present_pages.

    Signed-off-by: Con Kolivas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
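
    The helper is essentially a one-liner along these lines (sketch):

        static inline int populated_zone(struct zone *zone)
        {
                return !!zone->present_pages;
        }

    so call sites read "if (populated_zone(zone))" instead of testing
    zone->present_pages directly.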
     
  • Cc: Christoph Lameter
    Cc: Rajesh Shah
    Cc: Li Shaohua
    Cc: Srivatsa Vaddagiri
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Revert a patch which went into 2.6.8-rc1. The changelog for that patch was:

    The shrink_zone() logic can, under some circumstances, cause far too many
    pages to be reclaimed. Say, we're scanning at high priority and suddenly
    hit a large number of reclaimable pages on the LRU.

    Change things so we bale out when SWAP_CLUSTER_MAX pages have been
    reclaimed.

    Problem is, this change caused significant imbalance in inter-zone scan
    balancing by truncating scans of larger zones.

    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone
    balancing algorithm would require that if we're scanning 100 pages of
    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will
    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
    reclaimed, effectively causing smaller zones to be scanned relatively
    harder than large ones.

    Now I need to remember what the workload was which caused me to write this
    patch originally, then fix it up in a different way...

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This atomic operation is superfluous: the pte will be added with the
    referenced bit set, and the page will be referenced through this mapping after
    the page fault handler returns anyway.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Optimise rmap functions by minimising atomic operations when we know there
    will be no concurrent modifications.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
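
    A hedged sketch of the idea: a page mapped for the very first time is not
    yet visible to any other CPU, so a plain store can replace an atomic
    increment of the mapcount. The helper names follow this patch series, but
    the body is illustrative:

        void page_add_new_anon_rmap(struct page *page,
                                    struct vm_area_struct *vma,
                                    unsigned long address)
        {
                /* mapcount goes -1 -> 0 without a LOCK'd instruction */
                atomic_set(&page->_mapcount, 0);
                __page_set_anon_rmap(page, vma, address);
        }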
     
  • Cut down size slightly by not passing bad_page the function name (it can
    be determined from dump_stack()), and cut down the number of printks in
    bad_page.

    Also, cut down some branching in the destroy_compound_page path.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add dma32 to zone statistics. Also attempt to arrange struct page_state a
    bit better (visually).

    Signed-off-by: Nick Piggin
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Remove the last bits of Martin's ill-fated sys_set_zone_reclaim().

    Cc: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • As find_lock_page() already checks with TestSetPageLocked() that the
    page is locked, there is no need to call lock_page(), which would try-lock
    the page again (the chances of the page being unlocked in between are
    small). Call __lock_page() directly; this saves one atomic operation.

    Also, mark the truncate-while-slept path as unlikely while we are here.

    (akpm: ug. But this is actually a common path for normal old read()s against
    a page which is under readahead I/O so ho-hum.)

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
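
    A hedged sketch of the reworked function (close to, but not necessarily
    identical with, the patched mm/filemap.c):

        struct page *find_lock_page(struct address_space *mapping,
                                    unsigned long offset)
        {
                struct page *page;

        repeat:
                read_lock_irq(&mapping->tree_lock);
                page = radix_tree_lookup(&mapping->page_tree, offset);
                if (page) {
                        page_cache_get(page);
                        if (TestSetPageLocked(page)) {
                                read_unlock_irq(&mapping->tree_lock);
                                /* was lock_page(): no second trylock */
                                __lock_page(page);
                                read_lock_irq(&mapping->tree_lock);

                                /* Was the page truncated while we slept? */
                                if (unlikely(page->mapping != mapping ||
                                             page->index != offset)) {
                                        unlock_page(page);
                                        page_cache_release(page);
                                        read_unlock_irq(&mapping->tree_lock);
                                        goto repeat;
                                }
                        }
                }
                read_unlock_irq(&mapping->tree_lock);
                return page;
        }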
     
  • The attached patch cleans up the way the bootmem allocator frees pages.

    A new function, __free_pages_bootmem(), is provided in mm/page_alloc.c
    and is called from mm/bootmem.c to turn pages over to the main allocator.
    All the bits of code to initialise pages (clearing PG_reserved and setting
    the page count) are moved here. The checks on page validity are removed,
    on the assumption that the struct page arrays will have been prepared
    correctly.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
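
    A hedged sketch of the hand-over function; the prefetching and
    higher-order batching of the real code are trimmed:

        void __free_pages_bootmem(struct page *page, unsigned int order)
        {
                unsigned int nr_pages = 1 << order;
                unsigned int loop;

                for (loop = 0; loop < nr_pages; loop++) {
                        __ClearPageReserved(&page[loop]);
                        set_page_count(&page[loop], 0);
                }
                set_page_refs(page, order);
                __free_pages(page, order);
        }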
     
  • This patch cleans up the alloc_bootmem fix for swiotlb. It removes the
    alloc_bootmem_*_limit API and fixes the alloc_bootmem_low_* API to do the
    right thing: allocate from memory in the low 32 bits of the address
    space.

    Signed-off-by: Ravikiran Thirumalai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Small cleanups that do not change the generated code with the gccs I've
    tested with.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • read_page_state and __get_page_state only traverse online CPUs, which will
    cause results to fluctuate when CPUs are plugged in or out.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
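
    A hedged sketch of the fix direction: fold the per-cpu counters over
    every possible CPU rather than only the online ones. The helper name is
    made up; the per-cpu variable name follows the 2.6 sources, and in this
    era for_each_cpu() iterated cpu_possible_map (later renamed
    for_each_possible_cpu()):

        /* Sum one page_state counter over all possible CPUs; 'offset' is
         * the word index of the field within struct page_state. */
        static unsigned long sum_page_state(unsigned long offset)
        {
                unsigned long total = 0;
                int cpu;

                for_each_cpu(cpu) {     /* possible cpus: hotplug-stable */
                        unsigned long *in =
                                (unsigned long *)&per_cpu(page_states, cpu);
                        total += in[offset];
                }
                return total;
        }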
     
  • struct per_cpu_pages.low is useless. Remove it.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • bad_range is supposed to be a temporary check. It would be a pity to throw it
    out. Make it depend on CONFIG_DEBUG_VM instead.

    CONFIG_HOLES_IN_ZONE systems were relying on this to check pfn_valid()
    in the page allocator. Add that check to page_is_buddy() instead.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
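
    A hedged sketch of where the check lands:

        static inline int page_is_buddy(struct page *page, int order)
        {
        #ifdef CONFIG_HOLES_IN_ZONE
                if (!pfn_valid(page_to_pfn(page)))  /* was in bad_range() */
                        return 0;
        #endif
                if (PagePrivate(page) &&
                    page_order(page) == order &&
                    page_count(page) == 0)
                        return 1;
                return 0;
        }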
     
  • Micro optimise some conditionals where we don't need lazy evaluation.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
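
    An illustration of the technique with a made-up helper (not one of the
    actual call sites): when both operands are cheap, side-effect-free tests,
    a bitwise OR saves the extra branch that short-circuiting evaluation
    would emit:

        static inline int page_busy(struct page *page)  /* hypothetical */
        {
                /* was: PageDirty(page) || PageWriteback(page) */
                return PageDirty(page) | PageWriteback(page);
        }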
     
  • Inline set_page_refs.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Slightly optimise some page allocation and freeing functions by taking
    advantage of knowing whether or not interrupts are disabled.

    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Minor optimization (though it doesn't help in the PREEMPT case, which is
    severely constrained by the small ZAP_BLOCK_SIZE).
    free_pages_and_swap_cache() works in chunks of 16, calling
    release_pages(), which works in chunks of PAGEVEC_SIZE. But PAGEVEC_SIZE
    was dropped from 16 to 14 in 2.6.10, so we're now doing more
    spin_lock_irq'ing than necessary: use PAGEVEC_SIZE throughout.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
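
    A hedged sketch of the fixed loop (very close to the patched
    mm/swap_state.c):

        void free_pages_and_swap_cache(struct page **pages, int nr)
        {
                struct page **pagep = pages;

                lru_add_drain();
                while (nr) {
                        /* was a hard-coded 16 */
                        int todo = min(nr, PAGEVEC_SIZE);
                        int i;

                        for (i = 0; i < todo; i++)
                                free_swap_cache(pagep[i]);
                        release_pages(pagep, todo, 0);
                        pagep += todo;
                        nr -= todo;
                }
        }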
     
  • The NODES_SPAN_OTHER_NODES config option was created so that DISCONTIGMEM
    could handle pSeries NUMA layouts. However, support for DISCONTIGMEM has
    been replaced by SPARSEMEM on powerpc. As a result, this config option
    and its supporting code are no longer needed.

    I have already sent a patch to Paul that removes the option from powerpc
    specific code. This removes the arch independent piece. Doesn't really
    matter which is applied first.

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • The number of parameters for find_or_alloc_huge_page() increases
    significantly after policy support is added to huge pages. Simplify the
    code by folding find_or_alloc_huge_page() into hugetlb_no_page().

    Adam Litke objected to this piece in an earlier patch, but I think this
    is a good simplification. Diffstat shows that we can get rid of almost
    half of the lines of find_or_alloc_huge_page(). If we can find no
    consensus then let's simply drop this patch.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • mempolicy.c contains a provisional interface for huge page allocation
    based on node numbers. This is in use in SLES9 but was never used (AFAIK)
    in upstream versions of Linux.

    Huge page allocations now use zonelists to figure out where to allocate
    pages. Zonelists let us find the closest huge page, which is what taking
    NUMA distance into consideration for huge page allocations was meant to
    achieve.

    Remove the obsolete functions.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The huge_zonelist() function in the memory policy layer provides a list
    of zones ordered by NUMA distance. The hugetlb layer walks that list
    looking for a zone that has available huge pages and is also in the
    nodeset of the current cpuset.

    This patch does not contain the folding of find_or_alloc_huge_page() that was
    controversial in the earlier discussion.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
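
    A hedged sketch of that walk on the hugetlb side; the shapes follow the
    2.6 hugetlb code, lightly trimmed:

        static struct page *dequeue_huge_page(struct vm_area_struct *vma,
                                              unsigned long address)
        {
                struct zonelist *zonelist = huge_zonelist(vma, address);
                struct zone **z;

                for (z = zonelist->zones; *z; z++) {
                        int nid = (*z)->zone_pgdat->node_id;

                        /* nearest cpuset-allowed zone with a free huge page */
                        if (cpuset_zone_allowed(*z, GFP_HIGHUSER) &&
                            !list_empty(&hugepage_freelists[nid])) {
                                struct page *page =
                                        list_entry(hugepage_freelists[nid].next,
                                                   struct page, lru);
                                list_del(&page->lru);
                                free_huge_pages--;
                                return page;
                        }
                }
                return NULL;
        }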
     
  • This was discussed at
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2

    This patch changes the dequeueing to select a huge page near the node on
    which the process is executing, instead of always beginning the search
    for free pages at node 0. Huge pages are thus placed near the executing
    processor, improving performance.

    The existing implementation can place the huge pages far away from the
    executing processor, causing significant degradation of performance. The
    search starting from zero also means that the lower zones quickly run out
    of memory. Selecting a huge page near the process distributes the huge
    pages better.

    Signed-off-by: Christoph Lameter
    Cc: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
    supported. This helps us to safely use hugetlb pages in many more
    applications. The patch makes the following changes. If needed, I also have
    it broken out according to the following paragraphs.

    1. Add a pair of functions to set/clear write access on huge ptes. The
    writable check in make_huge_pte is moved out to the caller for use by COW
    later.

    2. Hugetlb copy-on-write requires special case handling in the following
    situations:

    - copy_hugetlb_page_range() - Copied pages must be write protected so
    a COW fault will be triggered (if necessary) if those pages are written
    to.

    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
    page cache. MAP_PRIVATE pages still need to be locked however.

    3. Provide hugetlb_cow(), called from hugetlb_fault() and
    hugetlb_no_page(), which handles the COW fault by making the actual copy
    (a hedged sketch follows this entry).

    4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps
    will be allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED
    mapping check.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
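
    The hedged sketch referenced in point 3 above: allocate a fresh huge
    page, copy, and install a writable pte. Locking and error handling are
    trimmed, and the exact helper arguments are assumptions:

        static int hugetlb_cow(struct mm_struct *mm,
                               struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep, pte_t pte)
        {
                struct page *old_page = pte_page(pte);
                struct page *new_page = alloc_huge_page(vma, address);

                if (!new_page)
                        return VM_FAULT_OOM;

                copy_huge_page(new_page, old_page, address);
                /* writable pte: the faulting mapping is MAP_PRIVATE */
                set_huge_pte_at(mm, address & HPAGE_MASK, ptep,
                                make_huge_pte(vma, new_page, 1));
                page_cache_release(old_page);
                return VM_FAULT_MINOR;
        }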
     
  • This patch splits the "no_page()" type activity into its own function,
    hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults
    and delegates to the appropriate handler depending on the type of fault.
    Right now we still have only hugetlb_no_page() but a later patch introduces a
    COW fault.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
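
    A hedged sketch of the split at this stage (before COW support is
    added):

        int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                          unsigned long address, int write_access)
        {
                pte_t *ptep = huge_pte_alloc(mm, address);

                if (!ptep)
                        return VM_FAULT_OOM;
                /* only the no-page handler exists so far */
                if (pte_none(*ptep))
                        return hugetlb_no_page(mm, vma, address, ptep,
                                               write_access);
                return VM_FAULT_MINOR;  /* raced with another fault */
        }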
     
  • find_lock_huge_page() isn't a great name, since it does extra things not
    analogous to find_lock_page(). Rename it find_or_alloc_huge_page(), which
    is closer to the mark.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • cleanup

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Here is the patch to implement madvise(MADV_REMOVE), which frees up a
    given range of pages and its associated backing store. The current
    implementation supports only shmfs/tmpfs; other filesystems return
    -ENOSYS.

    "Some app allocates large tmpfs files, then when some task quits and some
    client disconnect, some memory can be released. However the only way to
    release tmpfs-swap is to MADV_REMOVE". - Andrea Arcangeli

    Databases want to use this feature to drop a section of their bufferpool
    (shared memory segments) - without writing back to disk/swap space.

    This feature is also useful for supporting hot-plug memory on UML.

    Concerns raised by Andrew Morton:

    - "We have no plan for holepunching! If we _do_ have such a plan (or
    might in the future) then what would the API look like? I think
    sys_holepunch(fd, start, len), so we should start out with that."

    - Using madvise is very weird, because people will ask "why do I need to
    mmap my file before I can stick a hole in it?"

    - None of the other madvise operations call into the filesystem in this
    manner. A broad question is: is this capability an MM operation or a
    filesystem operation? truncate, for example, is a filesystem operation
    which sometimes has MM side-effects. madvise is an MM operation and, with
    this patch, it gains FS side-effects, only they're really, really
    significant ones.

    Comments:

    - Andrea suggested the fs operation too, but it's more efficient to have
    it as an mm operation with fs side effects, because callers don't
    immediately know the fd and physical offset of the range. It's possible
    to fix that up in userland and use the fs operation, but it's more
    expensive; the vmas are already in the kernel and we can use them.

    Short term plan & Future Direction:

    - We seem to need this interface only for shmfs/tmpfs files in the short
    term. We have to add hooks into the filesystem for correctness and
    completeness. This is what this patch does.

    - In the future, the plan is to support the fs and mmap APIs as well.
    This also involves implementing (other) filesystem-specific functions.

    - The current patch doesn't support VM_NONLINEAR; this can be addressed
    in the future.

    Signed-off-by: Badari Pulavarty
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michael Kerrisk
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
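
    A hedged userspace sketch of the new interface; the file name and sizes
    are made-up example values, and the MADV_REMOVE fallback define covers
    older libc headers:

        #include <stdio.h>
        #include <sys/mman.h>
        #include <fcntl.h>
        #include <unistd.h>

        #ifndef MADV_REMOVE
        #define MADV_REMOVE 9   /* may be missing from older headers */
        #endif

        int main(void)
        {
                size_t len = 16 * 4096;
                int fd = open("/dev/shm/pool", O_RDWR | O_CREAT, 0600);
                char *p;

                ftruncate(fd, len);
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

                /* ... use the buffer pool, then drop pages 4-7 and their
                 * swap backing without any writeback ... */
                if (madvise(p + 4 * 4096, 4 * 4096, MADV_REMOVE) != 0)
                        perror("madvise(MADV_REMOVE)");
                return 0;
        }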
     
  • This patch creates truncate_inode_pages_range() from
    truncate_inode_pages(); the latter becomes a one-line call to
    truncate_inode_pages_range().

    Reiser4 needs truncate_inode_pages_range() because it tries to keep a
    correspondence between the existence of metadata pointing to data pages
    and the pages to which that metadata points. So, when the metadata for a
    certain part of a file is removed from the filesystem tree, only the
    pages of the corresponding range are truncated.

    (Needed by the madvise(MADV_REMOVE) patch)

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Reiser
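
    The resulting one-liner looks like this:

        void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
        {
                /* "to the end of the file", i.e. an unbounded range */
                truncate_inode_pages_range(mapping, lstart, (loff_t)-1);
        }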
     
  • __add_section defines an unused pointer to the zone's pgdat. Remove this
    definition. This fixes a compile warning.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Two changes to the setting of the ALLOC_CPUSET flag in
    mm/page_alloc.c:__alloc_pages()

    - A bug fix: the "ignoring mins" case should not be honoring
    ALLOC_CPUSET. This case, of all cases, is handling a request that will
    free up more memory than is asked for (exiting tasks, e.g.), so it should
    be allowed to escape cpuset constraints when memory is tight.

    - A logic change to make it simpler. Honor cpusets even on GFP_ATOMIC
    (!wait) requests. With this, cpuset confinement applies to all requests
    except ALLOC_NO_WATERMARKS, so that in a subsequent cleanup patch I can
    remove the ALLOC_CPUSET flag entirely. Since I don't know of any real
    reason this logic has to be either way, I am choosing the path of the
    simplest code.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson