24 Jun, 2005

14 commits

  • This patch removes redundant VM_ClearReadHint from mm/madvise.c which was
    left there by Prasanna's patch.

    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • This patch creates a new kstrdup library function and changes the "local"
    implementations in several places to use this function.

    Most of the changes come from the sound and net subsystems. The sound part
    had already been acknowledged by Takashi Iwai and the net part by David S.
    Miller.

    I left UML alone for now because I would need more time to read the code
    carefully before making changes there.

    Signed-off-by: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
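
    As an illustration of what the kstrdup entry above consolidates, here is a
    minimal userspace sketch of such a string-duplication helper. It assumes
    malloc() as a stand-in for the kernel allocator, so the allocation-flags
    argument that the in-kernel function takes is left out; the name
    kstrdup_sketch is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Userspace sketch of a string-duplication helper.
 * Assumption: malloc() stands in for the kernel allocator, so no
 * allocation-flags argument is taken here.
 */
char *kstrdup_sketch(const char *s)
{
	size_t len;
	char *buf;

	if (!s)
		return NULL;

	len = strlen(s) + 1;	/* include the terminating NUL */
	buf = malloc(len);
	if (buf)
		memcpy(buf, s, len);
	return buf;		/* caller frees; NULL on allocation failure */
}
```

    Each "local" implementation replaced by the patch would have repeated
    roughly this length/allocate/copy sequence.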
     
  • Patch to allocate the control structures for IDE devices on the node of
    the device itself (for NUMA systems). The patch depends on the Slab API
    change patch by Manfred and me (in mm) and the pcidev_to_node patch that I
    posted today.

    Does some realignment too.

    Signed-off-by: Justin M. Forbes
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pravin Shelar
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Make sparse's initialization accessible at runtime. This allows sparse
    mappings to be created after boot in a hotplug situation.

    This patch is separated from the previous one just to give an indication how
    much of the sparse infrastructure is *just* for hotplug memory.

    The section_mem_map doesn't really store a pointer. It stores something that
    is convenient to do some math against to get a pointer. It isn't valid to
    just do *section_mem_map, so I don't think it should be stored as a pointer.

    There are a couple of things I'd like to store about a section. First of all,
    the fact that it is !NULL does not mean that it is present. There could be
    such a combination where section_mem_map *is* NULL, but the math gets you
    properly to a real mem_map. So, I don't think that check is safe.

    Since we're storing 32-bit-aligned structures, we have a few bits in the
    bottom of the pointer to play with. Use one bit to encode whether there's
    really a mem_map there, and the other one to tell whether there's a valid
    section there. We need to distinguish between the two because sometimes
    there's a gap between when a section is discovered to be present and when we
    can get the mem_map for it.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jack Steiner
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
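
    The low-bit encoding described in the entry above can be sketched roughly
    as follows. This is a toy model, not the patch itself: the struct, the
    flag names and the helper functions are all hypothetical, and the stored
    value is treated as a plain integer rather than a pointer.

```c
#include <stdint.h>

/* Flag names and values are illustrative, not taken from the patch. */
#define TOY_SECTION_PRESENT	((uintptr_t)1 << 0)	/* section is known to exist */
#define TOY_SECTION_HAS_MAP	((uintptr_t)1 << 1)	/* mem_map value is usable */
#define TOY_SECTION_FLAG_MASK	(TOY_SECTION_PRESENT | TOY_SECTION_HAS_MAP)

struct toy_section {
	uintptr_t section_mem_map;	/* encoded value, never dereferenced */
};

/* Store a 4-byte-aligned value together with the two flag bits. */
void toy_section_set(struct toy_section *s, uintptr_t map, uintptr_t flags)
{
	s->section_mem_map = (map & ~TOY_SECTION_FLAG_MASK) |
			     (flags & TOY_SECTION_FLAG_MASK);
}

int toy_section_present(const struct toy_section *s)
{
	return (s->section_mem_map & TOY_SECTION_PRESENT) != 0;
}

int toy_section_has_map(const struct toy_section *s)
{
	return (s->section_mem_map & TOY_SECTION_HAS_MAP) != 0;
}

/* Strip the flag bits to recover the value used for the pointer math. */
uintptr_t toy_section_map(const struct toy_section *s)
{
	return s->section_mem_map & ~TOY_SECTION_FLAG_MASK;
}
```

    With explicit flag bits, a section whose encoded map value happens to be 0
    can still be reported as present and mapped, which a plain NULL test
    cannot express.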
     
  • The part of the sparsemem patch which modifies memmap_init_zone() has recently
    become a problem. It changes behavior so that there is a call to
    pfn_to_page() for each individual page inside of a node's range:
    node_start_pfn through node_end_pfn. It used to simply do this once, at the
    beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
    of a node made it necessary to change.

    Mike Kravetz recently wrote a patch which made the NUMA code accept some new
    kinds of layouts. The system's memory was laid out like this, with node 0's
    memory in two pieces: one before and one after node 1's memory:

    Node 0: +++++ +++++
    Node 1: +++++

    Previous behavior before Mike's patch was to assign nodes like this:

    Node 0: 00000 XXXXX
    Node 1: 11111

    Where the 'X' areas were simply thrown away. The new behavior was to make the
    pg_data_t span node 0 across all of its areas, including areas that are really
    node 1's:

    Node 0: 000000000000000
    Node 1: 11111

    This wastes a little bit of mem_map space, but ends up being OK, and more
    fully utilizes the system's memory. memmap_init_zone() initializes all of the
    "struct page"s for node 0, even for the "hole", but those never get used,
    because there is no pfn_to_page() that resolves to those pages. However, because
    it calls pfn_to_page() only once, memmap_init_zone() always uses the pages that
    were allocated for node0->node_mem_map:

        struct page *start = pfn_to_page(start_pfn);
        /* effectively start = &node->node_mem_map[0] */
        for (page = start; page < (start + size); page++) {
                init_page_here(); /* ... */
        }

    Slow, and wasteful, but generally harmless.

    But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
    does):

        for (pfn = start_pfn; pfn < (start_pfn + size); pfn++) {
                page = pfn_to_page(pfn);
        }

    And you end up trying to initialize node 1's pages too early, along with bogus
    data from node 0. This patch checks for those weird layouts and declines to
    touch the pages, making the more frequent pfn_to_page() calls OK to do.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
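
    The check described in the entry above can be modelled in a few lines. This
    is a toy sketch under stated assumptions: the pfn ranges mirror the diagram
    (node 0 in two pieces around node 1), "initializing" a page is reduced to
    counting it, and every name here is hypothetical rather than taken from the
    patch.

```c
#include <stddef.h>

struct toy_range {
	unsigned long start_pfn;	/* first pfn in the range */
	unsigned long end_pfn;		/* one past the last pfn */
	int nid;			/* node that really owns the range */
};

/* Hypothetical layout: node 0 owns two pieces around node 1. */
static const struct toy_range toy_layout[] = {
	{  0,  5, 0 },
	{  5, 10, 1 },
	{ 10, 15, 0 },
};

/* Return the owning node of a pfn, or -1 if it falls in no range. */
int toy_pfn_to_nid(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < sizeof(toy_layout) / sizeof(toy_layout[0]); i++)
		if (pfn >= toy_layout[i].start_pfn && pfn < toy_layout[i].end_pfn)
			return toy_layout[i].nid;
	return -1;
}

/*
 * Walk a node's whole span pfn by pfn, but only "initialize" (here: count)
 * the pages that really belong to that node.
 */
unsigned long toy_memmap_init(int nid, unsigned long start_pfn,
			      unsigned long size)
{
	unsigned long pfn, initialized = 0;

	for (pfn = start_pfn; pfn < start_pfn + size; pfn++) {
		if (toy_pfn_to_nid(pfn) != nid)
			continue;	/* another node's page: leave it alone */
		initialized++;
	}
	return initialized;
}
```

    Walking node 0's full span of 15 pfns touches only the 10 pages that node 0
    owns; the 5 pages in the middle are left for node 1.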
     
  • Sparsemem abstracts the use of discontiguous mem_map[]s. This kind of
    mem_map[] is needed by discontiguous memory machines (like in the old
    CONFIG_DISCONTIGMEM case) as well as memory hotplug systems. Sparsemem
    replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
    become a complete replacement.

    A significant advantage over DISCONTIGMEM is that it's completely separated
    from CONFIG_NUMA. When producing this patch, it became apparent that NUMA
    and DISCONTIG are often confused.

    Another advantage is that sparse doesn't require each NUMA node's ranges to be
    contiguous. It can handle overlapping ranges between nodes with no problems,
    where DISCONTIGMEM currently throws away that memory.

    Sparsemem uses an array to provide different pfn_to_page() translations for
    each SECTION_SIZE area of physical memory. This is what allows the mem_map[]
    to be chopped up.

    In order to do quick pfn_to_page() operations, the section number of the page
    is encoded in page->flags. Part of the sparsemem infrastructure enables
    sharing of these bits more dynamically (at compile-time) between the
    page_zone() and sparsemem operations. However, on 32-bit architectures, the
    number of bits is quite limited, and may require growing the size of the
    page->flags type in certain conditions. Several things might force this to
    occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
    memory), an increase in the physical address space, or an increase in the
    number of used page->flags.

    One thing to note is that, once sparsemem is present, the NUMA node
    information no longer needs to be stored in the page->flags. It might provide
    speed increases on certain platforms and will be stored there if there is
    room. But, if out of room, an alternate (theoretically slower) mechanism is
    used.

    This patch introduces CONFIG_FLATMEM. It is used in almost all cases where
    there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
    often have to compile out the same areas of code.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Adrian Bunk
    Signed-off-by: Yasunori Goto
    Signed-off-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
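
    The per-section translation described in the entry above can be sketched as
    a small lookup table. This is a simplified model under stated assumptions:
    the section size, the table length and all toy_* names are made up for the
    example, and the real code stores an encoded value rather than a plain
    pointer.

```c
#include <stddef.h>

#define TOY_SECTION_SHIFT	4	/* 16 pages per section; arbitrary */
#define TOY_PAGES_PER_SECTION	(1UL << TOY_SECTION_SHIFT)
#define TOY_NR_SECTIONS		8

struct toy_page {
	unsigned long flags;
};

/* One entry per section-sized area; NULL means no memory is present there. */
static struct toy_page *toy_sections[TOY_NR_SECTIONS];

/* Register the mem_map chunk that backs one section. */
void toy_section_add(unsigned long section_nr, struct toy_page *map)
{
	if (section_nr < TOY_NR_SECTIONS)
		toy_sections[section_nr] = map;
}

/* Translate a pfn by picking its section, then indexing within it. */
struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
	unsigned long section_nr = pfn >> TOY_SECTION_SHIFT;
	struct toy_page *map;

	if (section_nr >= TOY_NR_SECTIONS)
		return NULL;
	map = toy_sections[section_nr];
	if (!map)
		return NULL;
	return &map[pfn & (TOY_PAGES_PER_SECTION - 1)];
}
```

    Because each section carries its own chunk, the chunks need not be adjacent
    in memory, and sections with no memory simply stay empty.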
     
  • Allow architectures to indicate that they will be providing hooks to indicate
    installed memory areas, memory_present(). Provide prototypes for the i386
    implementation.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • This gives DISCONTIGMEM a bit more help text to explain what it does, not just
    when to choose it.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I got some feedback from users who think that the new "Memory Model" menu is a
    little invasive. This patch will hide that menu, except when
    CONFIG_EXPERIMENTAL is enabled *or* when an individual architecture wants it.

    An individual arch may want to enable it because they've removed their
    arch-specific DISCONTIG prompt in favor of the mm/Kconfig one.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The following patch applies on top of 2.6.12-rc2-mm1. It fixes a minor
    user interaction issue, and an early reference to SPARSEMEM.

    This "choice" menu would always default to FLATMEM, as it was listed first.
    Move it to the end so that the other defaults have a chance first.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Some confusion arose when working on the SPARSEMEM patch between what is
    needed for DISCONTIG vs. NUMA.

    Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
    All of the current NUMA implementations require an implementation of
    DISCONTIG. Because of this, quite a lot of code which is really needed for
    NUMA is actually under DISCONTIG #ifdefs. For SPARSEMEM, we changed some
    of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
    case.

    Introducing this new NEED_MULTIPLE_NODES config option allows code that is
    needed for either NUMA or DISCONTIG to be separated out from code that is
    specific to DISCONTIG.

    One great advantage of this approach is that it doesn't require every
    architecture to be converted over. All of the current implementations
    should "just work", only the ones implementing SPARSEMEM will have to be
    fixed up.

    The change to free_area_init() makes it work inside, or out of the new
    config option.

    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • With sparsemem being introduced, we need a central place for new
    memory-related .config options: mm/Kconfig. This allows us to remove many
    of the duplicated arch-specific options.

    The new option, CONFIG_FLATMEM, is there to enable us to detangle NUMA and
    DISCONTIGMEM. This is a requirement for sparsemem because sparsemem uses
    the NUMA code without the presence of DISCONTIGMEM. The sparsemem patches
    use CONFIG_FLATMEM in generic code, so this patch is a requirement before
    applying them.

    Almost all places that used to do '#ifndef CONFIG_DISCONTIGMEM' should use
    '#ifdef CONFIG_FLATMEM' instead.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
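
    Why the '#ifndef CONFIG_DISCONTIGMEM' test stops being good enough once a
    third memory model exists can be shown with a small compile-time example.
    The macros below are defined by hand as a stand-in for a .config that
    selects sparsemem; the two function names are hypothetical.

```c
/*
 * Stand-in for a .config that selects sparsemem: neither DISCONTIGMEM
 * nor FLATMEM is defined. Set by hand purely for illustration.
 */
#define CONFIG_SPARSEMEM 1

/* Old test: true for every configuration that is not DISCONTIGMEM. */
int old_test_takes_flat_path(void)
{
#ifndef CONFIG_DISCONTIGMEM
	return 1;
#else
	return 0;
#endif
}

/* New test: true only when the flat memory model was actually chosen. */
int new_test_takes_flat_path(void)
{
#ifdef CONFIG_FLATMEM
	return 1;
#else
	return 0;
#endif
}
```

    Under this configuration the old test wrongly selects the flat-memory code
    path, while the new test compiles it out.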
     
  • Generify the value fields in the page_flags. The aim is to allow the location
    and size of these fields to be varied. Additionally we want to move away from
    fixed allocations per field whilst still enforcing the overall bit utilisation
    limits. We rely on the compiler to spot and optimise the accessor functions.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
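
    One way to read "allow the location and size of these fields to be varied"
    is that each field's shift is derived from the widths of the fields above
    it. The sketch below follows that reading; the widths, field order and
    toy_* names are assumptions for the example, not the layout the patch uses.

```c
/*
 * Toy layout: widths are compile-time knobs, and each field's shift is
 * derived from the fields above it, so changing a width moves the rest.
 */
#define TOY_FLAGS_BITS		(sizeof(unsigned long) * 8)
#define TOY_SECTIONS_WIDTH	6
#define TOY_ZONES_WIDTH		2

#define TOY_SECTIONS_SHIFT	(TOY_FLAGS_BITS - TOY_SECTIONS_WIDTH)
#define TOY_ZONES_SHIFT		(TOY_SECTIONS_SHIFT - TOY_ZONES_WIDTH)

#define TOY_SECTIONS_MASK	((1UL << TOY_SECTIONS_WIDTH) - 1)
#define TOY_ZONES_MASK		((1UL << TOY_ZONES_WIDTH) - 1)

unsigned long toy_set_section(unsigned long flags, unsigned long section)
{
	flags &= ~(TOY_SECTIONS_MASK << TOY_SECTIONS_SHIFT);
	return flags | ((section & TOY_SECTIONS_MASK) << TOY_SECTIONS_SHIFT);
}

unsigned long toy_get_section(unsigned long flags)
{
	return (flags >> TOY_SECTIONS_SHIFT) & TOY_SECTIONS_MASK;
}

unsigned long toy_set_zone(unsigned long flags, unsigned long zone)
{
	flags &= ~(TOY_ZONES_MASK << TOY_ZONES_SHIFT);
	return flags | ((zone & TOY_ZONES_MASK) << TOY_ZONES_SHIFT);
}

unsigned long toy_get_zone(unsigned long flags)
{
	return (flags >> TOY_ZONES_SHIFT) & TOY_ZONES_MASK;
}
```

    Since every shift and mask is a constant expression, a compiler can fold
    the accessors down to a shift and a mask, which is the optimisation the
    entry relies on.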
     
  • Introduce a simple allocator for the NUMA remap space. This space is very
    scarce, used for structures which are best allocated node local.

    This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
    the pgdat->node_mem_map initialized in a consistent place for all
    architectures.

    Issues:
    o alloc_remap takes a node_id where we might expect a pgdat which was intended
    to allow us to allocate the pgdat's using this mechanism; which we do not yet
    do. Could have alloc_remap_node() and alloc_remap_nid() for this purpose.

    Signed-off-by: Andy Whitcroft
    Signed-off-by: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
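
    A "simple allocator" for a scarce per-node window could look roughly like
    the bump allocator below. This is a sketch only: the toy_* names, the
    64-byte rounding and the fixed node count are assumptions, and how the real
    remap windows are set up is not shown in the entry above.

```c
#include <stddef.h>

#define TOY_MAX_NODES	4

/* Per-node window of remap space; the bounds are filled in by setup code. */
struct toy_remap {
	char *next;	/* first free byte */
	char *end;	/* one past the window */
};

static struct toy_remap toy_remap_space[TOY_MAX_NODES];

void toy_remap_init(int nid, void *start, size_t size)
{
	toy_remap_space[nid].next = start;
	toy_remap_space[nid].end = (char *)start + size;
}

/* Hand out the next chunk of a node's window, or NULL when it is exhausted. */
void *toy_alloc_remap(int nid, size_t size)
{
	struct toy_remap *r;
	void *p;

	if (nid < 0 || nid >= TOY_MAX_NODES)
		return NULL;
	r = &toy_remap_space[nid];

	size = (size + 63) & ~(size_t)63;	/* round up; 64 is arbitrary */
	if (!r->next || size > (size_t)(r->end - r->next))
		return NULL;

	p = r->next;
	r->next += size;
	return p;
}
```

    Like the interface discussed under "Issues", this sketch is keyed by node
    id rather than by pgdat, and a NULL return lets the caller fall back to
    another allocator.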
     

23 Jun, 2005

1 commit

  • The boot_pageset needs to be preserved for hotplugging and for offline
    processors and nodes. Otherwise pointers will point into memory that now
    has a different use. /proc/zoneinfo is currently showing strange results
    if processors / nodes are not present.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

22 Jun, 2005

25 commits

  • OOM killer prints a stray newline.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Vlasenko
     
  • It's common practice to msync a large address range regularly, in which
    often only a few ptes have actually been dirtied since the previous pass.

    sync_pte_range then goes much faster if it tests whether pte is dirty
    before locating and accessing each struct page cacheline; and it is hardly
    slowed by ptep_clear_flush_dirty repeating that test in the opposite case,
    when every pte actually is dirty.

    But beware, s390's pte_dirty always says false, since its dirty bit is kept
    in the storage key, located via the struct page address. So skip this
    optimization in its case: use a pte_maybe_dirty macro which just says true
    if page_test_and_clear_dirty is implemented.

    Signed-off-by: Abhijit Karmarkar
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Abhijit Karmarkar
     
  • Remember that ironic get_user_pages race? when the raised page_count on a
    page swapped out led do_wp_page to decide that it had to copy on write, so
    substituted a different page into userspace. 2.6.7 onwards have Andrea's
    solution, where try_to_unmap_one backs out if it finds page_count raised.

    Which works, but is unsatisfying (rmap.c has no other page_count heuristics),
    and was found a few months ago to hang an intensive page migration test. A
    year ago I was hesitant to engage page_mapcount, now it seems the right fix.

    So remove the page_count hack from try_to_unmap_one; and use activate_page in
    unuse_mm when dropping lock, to replace its secondary effect of helping
    swapoff to make progress in that case.

    Simplify can_share_swap_page (now called only on anonymous pages) to check
    page_mapcount + page_swapcount == 1: still needs the page lock to stabilize
    their (pessimistic) sum, but does not need swapper_space.tree_lock for that.

    In do_swap_page, move swap_free and unlock_page below page_add_anon_rmap, to
    keep sum on the high side, and correct when can_share_swap_page called.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A small optimization to do_wp_page's check for whether to avoid copy by
    reusing the page already mapped. It can never share a cached file page,
    nor can it share a reserved page (often the empty zero page), so it's a
    waste of time to lock and unlock in those cases. Which nowadays can both
    be neatly excluded by a preliminary PageAnon test.

    Christoph has reported that a preliminary page_count test proved valuable
    for scalability here, but PageAnon covers more common cases all at once.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since its birth, get_user_pages has been calling a misguided get_page_map
    function. follow_page has already returned NULL if the pfn is invalid, we
    cannot reach an invalid pfn from a validated struct page.

    Remove get_page_map, and the messy rewind in get_user_pages to cope with
    its failure. Oh, and could we please call that "struct page *page" like
    everywhere else, instead of "struct page *map"?

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Since free_pages_check complains if PG_reclaim or PG_slab is set, bad_page
    ought to clear them to avoid repetitive reports (Nikita noticed this too).
    Let prep_new_page check page_count and PG_slab as free_pages_check does.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Strict mbind's check for currently mapped pages being on node has been
    using a slow loop which re-evaluates pgd, pud, pmd, pte for each entry:
    replace that by a standard four-level page table walk like others in mm.
    Since mmap_sem is held for writing, page_table_lock can be taken at the
    inner level to limit latency.

    Signed-off-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Strict mbind's check that pages already mapped are on right node has been
    using pte_page without checking if pfn_valid, and without page_table_lock
    to prevent spurious failures when try_to_unmap_one intervenes between the
    pte_present and the pte_page.

    Signed-off-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • To improve shmem scalability, we allowed tmpfs instances which don't need
    their blocks or inodes limited not to count them, and not to allocate any
    sbinfo. Which was okay when the only use for the sbinfo was accounting
    blocks and inodes; but since then a couple of unrelated projects extending
    tmpfs want to store other data in the sbinfo. Whether either extension
    reaches mainline is beside the point: I'm guilty of a bad design decision,
    and should restore sbinfo to make any such future extensions easier.

    So, once again allocate a shmem_sb_info for every shmem/tmpfs instance, and
    now let max_blocks 0 indicate unlimited blocks, and max_inodes 0 unlimited
    inodes. Brent Casavant verified (many months ago) that this does not
    perceptibly impact the scalability (since the unlimited sbinfo cacheline is
    repeatedly accessed but only once dirtied).

    And merge shmem_set_size into its sole caller shmem_remount_fs.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
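
    The "0 means unlimited" convention from the entry above can be sketched as
    follows. Only the block limit is modelled; the struct and function names
    are hypothetical and the real accounting also involves locking.

```c
struct toy_sbinfo {
	unsigned long max_blocks;	/* 0 means "no limit" */
	unsigned long free_blocks;	/* only meaningful when limited */
};

/* Try to account one block; return 0 on success, -1 when the limit is hit. */
int toy_charge_block(struct toy_sbinfo *sbinfo)
{
	if (sbinfo->max_blocks == 0)
		return 0;	/* unlimited: read the field, write nothing */
	if (sbinfo->free_blocks == 0)
		return -1;
	sbinfo->free_blocks--;
	return 0;
}
```

    In the unlimited case the structure is only read, which matches the
    observation that its cacheline is repeatedly accessed but rarely dirtied.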
     
  • Reduce size of the huge per_cpu_pageset structure in __initdata introduced
    into mm1 with the pageset localization patchset. Use one specially
    configured pageset per cpu for all zones and nodes during bootup.

    - Avoid duplication of pageset initialization code.
    - do the adding to the pageset list before potential free_pages_bulk
    in free_hot_cold_page (otherwise we would have to hold a page
    in a pageset during the period that the boot pagesets are in use).
    - remove mistaken __cpuinitdata attribute and revert back to __initdata
    for the boot pageset. A boot pageset is not necessary for cpu hotplug.

    Tested for UP SMP NUMA on x86_64 (2.6.12-rc6-mm1): UP SMP NUMA
    Tested on IA64 (2.6.12-rc5-mm2): NUMA (2.6.12-rc6-mm1 broken for IA64
    because of sparsemem patches)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The pageset array can potentially acquire a huge amount of memory on large
    NUMA systems. For example, on a system with 512 processors and 256 nodes
    there will be 256*512 pagesets. If each pageset only holds 5 pages then we
    are talking about 655360 pages. With a 16K page size on IA64 this results in
    potentially 10 Gigabytes of memory being trapped in pagesets. The typical
    cases are much less for smaller systems but there is still the potential of
    memory being trapped in off node pagesets. Off node memory may be rarely
    used if local memory is available and so we may potentially have memory in
    seldom used pagesets without this patch.

    The slab allocator flushes its per cpu caches every 2 seconds. The
    following patch flushes the off node pageset caches in the same way by
    tying into the slab flush.

    The patch also changes /proc/zoneinfo to include the number of pages
    currently in each pageset.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
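
    The figures in the entry above follow from simple multiplication, spelled
    out here as a check. The function names are made up for the example; the
    inputs are the ones quoted in the text.

```c
/* Worst-case pages held in pagesets: one pageset per (node, cpu) pair. */
unsigned long long trapped_pages(unsigned long long nodes,
				 unsigned long long cpus,
				 unsigned long long pages_per_pageset)
{
	return nodes * cpus * pages_per_pageset;
}

/* Convert a page count to bytes for a given page size. */
unsigned long long trapped_bytes(unsigned long long pages,
				 unsigned long long page_size)
{
	return pages * page_size;
}
```

    With 256 nodes, 512 processors and 5 pages per pageset this gives 655360
    pages, and at 16K per page that is 10 gigabytes.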
     
  • This patch provides more debug info when the system is OOM. It displays
    memory stats (basically sysrq-m info) from __alloc_pages() when page
    allocation fails and during OOM kill.

    Thanks to Dave Jones for coming up with the idea.

    Signed-off-by: Janet Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Janet Morgan
     
  • By making the offset argument of __read_page_state an unsigned long instead of
    unsigned, we can avoid forcing the compiler to sign extend a usually constant
    argument. This saves 1 instruction on x86-64.

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     
  • By making the offset argument of __mod_page_state an unsigned long instead
    of unsigned, we can avoid forcing the compiler to sign extend a usually
    constant argument. This saves 1 instruction on x86-64.

    Signed-off-by: Benjamin LaHaise
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin LaHaise
     
  • try_to_free_pages accepts a third argument, order, but hasn't used it since
    before 2.6.0. The following patch removes the argument and updates all the
    calls to try_to_free_pages.

    Signed-off-by: Darren Hart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
     
  • The topdown changes in 2.6.12-rc1 can cause large allocations with a large
    stack limit to fail, despite there being space available. The
    mmap_base-len is only valid when len <= mmap_base. However, nothing in
    the topdown allocator checks this. It's only (now) caught at a higher level,
    which will cause the allocation to simply fail. The following change
    restores the fallback to the bottom-up path, which will allow large
    allocations with a large stack limit to potentially still succeed.

    Signed-off-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
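
    The underlying hazard is unsigned wrap-around: subtracting a length larger
    than mmap_base yields a huge bogus address instead of a negative one. A
    minimal sketch of the guard, with a hypothetical function name and a
    simplified return convention, might look like this:

```c
/*
 * Pick a start address for a top-down search. Returns 1 and stores the
 * candidate in *addr when mmap_base - len is meaningful, 0 when the caller
 * should fall back to the bottom-up search instead.
 */
int toy_topdown_candidate(unsigned long mmap_base, unsigned long len,
			  unsigned long *addr)
{
	if (len > mmap_base)
		return 0;	/* mmap_base - len would wrap around */
	*addr = mmap_base - len;
	return 1;
}
```

    Returning a "fall back" indication rather than an error is what lets a
    large request still be tried against the bottom-up path.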
     
  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change new areas were always
    searched from the base address on.

    So now new small areas are cluttering holes of all sizes
    throughout the whole mmap-able region, whereas before small areas
    tended to close holes near the base, leaving holes far from the base
    large and available for larger requests.

    2) the free_area_cache also is set to the location of the last
    munmap-ed area so in scenarios where we allocate e.g. five regions of
    1K each, then free regions 4 2 3 in this order the next request for 1K
    will be placed in the position of the old region 3, whereas before we
    appended it to the still active region 1, placing it at the location
    of the old region 2. Before we had 1 free region of 2K, now we only
    get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache descriptor,
    cached_hole_size, that contains the largest known hole size below the
    current free_area_cache. If a new request comes in the size is compared
    against the cached_hole_size and if the request can be filled with a hole
    below free_area_cache the search is started from the base instead.

    The results look promising. 2.6.12-rc4 fragments quickly: my (earlier
    posted) leakme.c test program terminates after 50000+ iterations with 96
    distinct and fragmented maps in /proc/self/maps. It performs nicely (as
    expected) with thread creation, though: Ingo's test_str02 with 20000
    threads requires 0.7s system time.

    Taking out Ingo's patch (un-patch available per request) by basically
    deleting all mentions of free_area_cache from the kernel and starting the
    search for new memory always at the respective bases we observe: leakme
    terminates successfully with 11 distinct, hardly fragmented areas in
    /proc/self/maps but thread creation is grindingly slow: 30+s(!) system
    time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) the appended patch works fine with leakme: it ends with
    only 7 distinct areas in /proc/self/maps and also thread creation seems
    sufficiently fast with 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
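
    The decision rule described in the entry above (compare the request size
    against cached_hole_size, and restart from the base when a known hole below
    the cache could fit) can be sketched as below. The struct, the base address
    and the function names are hypothetical, and resetting cached_hole_size on
    a rescan is an assumption of this sketch rather than something the text
    states.

```c
#define TOY_UNMAPPED_BASE	0x40000000UL	/* arbitrary base for the example */

struct toy_mm {
	unsigned long free_area_cache;	/* where the last search ended */
	unsigned long cached_hole_size;	/* largest hole known below the cache */
};

/* Decide where a search for len bytes should begin. */
unsigned long toy_search_start(struct toy_mm *mm, unsigned long len)
{
	if (len > mm->cached_hole_size)
		return mm->free_area_cache;	/* no known hole below is big enough */

	/* A hole below might fit: rescan from the base and relearn hole sizes. */
	mm->cached_hole_size = 0;
	return TOY_UNMAPPED_BASE;
}

/* While scanning, remember the largest hole that was skipped. */
void toy_note_hole(struct toy_mm *mm, unsigned long hole_size)
{
	if (hole_size > mm->cached_hole_size)
		mm->cached_hole_size = hole_size;
}
```

    Small requests therefore go back to filling holes near the base, while
    requests larger than any known hole keep the fast path through
    free_area_cache.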
     
  • This patch modifies the way pagesets in struct zone are managed.

    Each zone has a per-cpu array of pagesets. So any particular CPU has some
    memory in each zone structure which belongs to itself. Even if that CPU is
    not local to that zone.

    So the patch relocates the pagesets for each cpu to the node that is nearest
    to the cpu instead of allocating the pagesets in the (possibly remote) target
    zone. This means that the operations to manage pages on remote zone can be
    done with information available locally.

    We play a macro trick so that non-NUMA machines avoid the additional
    pointer chase on the page allocator fastpath.

    AIM7 benchmark on a 32 CPU SGI Altix

    w/o patches:
    Tasks  jobs/min  jti  jobs/min/task    real      cpu
        1    484.68  100       484.6769   12.01     1.97  Fri Mar 25 11:01:42 2005
      100  27140.46   89       271.4046   21.44   148.71  Fri Mar 25 11:02:04 2005
      200  30792.02   82       153.9601   37.80   296.72  Fri Mar 25 11:02:42 2005
      300  32209.27   81       107.3642   54.21   451.34  Fri Mar 25 11:03:37 2005
      400  34962.83   78        87.4071   66.59   588.97  Fri Mar 25 11:04:44 2005
      500  31676.92   75        63.3538   91.87   742.71  Fri Mar 25 11:06:16 2005
      600  36032.69   73        60.0545   96.91   885.44  Fri Mar 25 11:07:54 2005
      700  35540.43   77        50.7720  114.63  1024.28  Fri Mar 25 11:09:49 2005
      800  33906.70   74        42.3834  137.32  1181.65  Fri Mar 25 11:12:06 2005
      900  34120.67   73        37.9119  153.51  1325.26  Fri Mar 25 11:14:41 2005
     1000  34802.37   74        34.8024  167.23  1465.26  Fri Mar 25 11:17:28 2005

    with slab API changes and pageset patch:

    Tasks  jobs/min  jti  jobs/min/task    real      cpu
        1    485.00  100       485.0000   12.00     1.96  Fri Mar 25 11:46:18 2005
      100  28000.96   89       280.0096   20.79   150.45  Fri Mar 25 11:46:39 2005
      200  32285.80   79       161.4290   36.05   293.37  Fri Mar 25 11:47:16 2005
      300  40424.15   84       134.7472   43.19   438.42  Fri Mar 25 11:47:59 2005
      400  39155.01   79        97.8875   59.46   590.05  Fri Mar 25 11:48:59 2005
      500  37881.25   82        75.7625   76.82   730.19  Fri Mar 25 11:50:16 2005
      600  39083.14   78        65.1386   89.35   872.79  Fri Mar 25 11:51:46 2005
      700  38627.83   77        55.1826  105.47  1022.46  Fri Mar 25 11:53:32 2005
      800  39631.94   78        49.5399  117.48  1169.94  Fri Mar 25 11:55:30 2005
      900  36903.70   79        41.0041  141.94  1310.78  Fri Mar 25 11:57:53 2005
     1000  36201.23   77        36.2012  160.77  1458.31  Fri Mar 25 12:00:34 2005

    Signed-off-by: Christoph Lameter
    Signed-off-by: Shobhit Dayal
    Signed-off-by: Shai Fultheim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
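
    The "macro trick" mentioned in the entry above is plausibly of the
    following shape: the zone holds either embedded pagesets or pointers to
    them, and an accessor macro hides the difference. This is a sketch under
    that assumption; TOY_NUMA and all toy_* names are stand-ins, not the
    identifiers used by the patch.

```c
#define TOY_NR_CPUS	4

struct toy_pageset {
	int count;	/* pages currently held */
};

struct toy_zone {
#ifdef TOY_NUMA
	struct toy_pageset *pageset[TOY_NR_CPUS];	/* allocated near each cpu */
#else
	struct toy_pageset pageset[TOY_NR_CPUS];	/* embedded in the zone */
#endif
};

/* Both variants yield a struct toy_pageset *, so callers look identical. */
#ifdef TOY_NUMA
#define toy_zone_pcp(z, cpu)	((z)->pageset[(cpu)])
#else
#define toy_zone_pcp(z, cpu)	(&(z)->pageset[(cpu)])
#endif

/* Caller code is the same in both configurations. */
void toy_add_page(struct toy_zone *z, int cpu)
{
	toy_zone_pcp(z, cpu)->count++;
}
```

    In the non-NUMA build the macro takes the address of an embedded element,
    so no extra pointer load is needed on the fast path.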
     
  • A lot of the code in arch/*/mm/hugetlbpage.c is quite similar. This patch
    attempts to consolidate a lot of the code across the arches, putting the
    combined version in mm/hugetlb.c. There are a couple of uglyish hacks in
    order to convert all the hugepage archs, but the result is a very large
    reduction in the total amount of code. It also means things like hugepage
    lazy allocation could be implemented in one place, instead of six.

    Tested, at least a little, on ppc64, i386 and x86_64.

    Notes:
    - this patch changes the meaning of set_huge_pte() to be more
    analogous to set_pte()
    - does SH4 need a special huge_ptep_get_and_clear()??

    Acked-by: William Lee Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • When early zone reclaim is turned on the LRU is scanned more frequently when a
    zone is low on memory. This limits when the zone reclaim can be called by
    skipping the scan if another thread (either via kswapd or sync reclaim) is
    already reclaiming from the zone.

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • When using the early zone reclaim, it was noticed that allocating new pages
    that should be spread across the whole system caused eviction of local pages.

    This adds a new GFP flag to prevent early reclaim from happening during
    certain allocation attempts. The example that is implemented here is for page
    cache pages. We want page cache pages to be spread across the whole system,
    and we don't want page cache pages to evict other pages to get local memory.

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • This is the core of the (much simplified) early reclaim. The goal of this
    patch is to reclaim some easily-freed pages from a zone before falling back
    onto another zone.

    One of the major uses of this is NUMA machines. With the default allocator
    behavior the allocator would look for memory in another zone, which might be
    off-node, before trying to reclaim from the current zone.

    This adds a zone tuneable to enable early zone reclaim. It is selected on a
    per-zone basis and is turned on/off via syscall.

    Adding some extra throttling on the reclaim was also required (patch
    4/4). Without it, the machine would grind to a crawl when doing a "make -j"
    kernel build. Even with this patch the System Time is higher on
    average, but it seems tolerable. Here are some numbers for kernbench
    runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

                           wall  user   sys  %cpu  ctx sw.  sleeps
                           ----  ----   ---  ----  -------  ------
    No patch               1009  1384   847   258   298170  504402
    w/patch, no reclaim     880  1376   667   288   254064  396745
    w/patch & reclaim      1079  1385   926   252   291625  548873

    These numbers are the average of 2 runs of 3 "make -j" runs done right
    after system boot. Run-to-run variability for "make -j" is huge, so
    these numbers aren't terribly useful except to see that with reclaim
    the benchmark still finishes in a reasonable amount of time.

    I also looked at the NUMA hit/miss stats for the "make -j" runs and the
    reclaim doesn't make any difference when the machine is thrashing away.

    Doing a "make -j8" on a single node that is filled with page cache pages
    takes 700 seconds with reclaim turned on and 735 seconds without reclaim
    (due to remote memory accesses).

    The simple zone_reclaim syscall program is at
    http://www.bork.org/~mort/sgi/zone_reclaim.c

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • Here's the next round of these patches. These are totally different in
    an attempt to meet the "simpler" request after the last patches. For
    reference the earlier threads are:

    http://marc.theaimsgroup.com/?l=linux-kernel&m=110839604924587&w=2
    http://marc.theaimsgroup.com/?l=linux-mm&m=111461480721249&w=2

    This set of patches replaces my other vm- patches that are currently in
    -mm. So they're against 2.6.12-rc5-mm1 about half way through the -mm
    patchset.

    As I said already this patch is a lot simpler. The reclaim is turned on
    or off on a per-zone basis using a syscall. I haven't tested the x86
    syscall, so it might be wrong. It uses the existing reclaim/pageout
    code with the small addition of a may_swap flag to scan_control
    (patch 1/4).

    I also added __GFP_NORECLAIM (patch 3/4) so that certain allocation
    types can be flagged to never cause reclaim. This was a deficiency
    that was in all of my earlier patch sets. Previously, doing a big
    buffered read would fill one zone with page cache and then start to
    reclaim from that same zone, leaving the other zones untouched.

    Adding some extra throttling on the reclaim was also required (patch
    4/4). Without it, the machine would grind to a crawl when doing a "make -j"
    kernel build. Even with this patch the System Time is higher on
    average, but it seems tolerable. Here are some numbers for kernbench
    runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

                           wall  user   sys  %cpu  ctx sw.  sleeps
                           ----  ----   ---  ----  -------  ------
    No patch               1009  1384   847   258   298170  504402
    w/patch, no reclaim     880  1376   667   288   254064  396745
    w/patch & reclaim      1079  1385   926   252   291625  548873

    These numbers are the average of 2 runs of 3 "make -j" runs done right
    after system boot. Run-to-run variability for "make -j" is huge, so
    these numbers aren't terribly useful except to see that with reclaim
    the benchmark still finishes in a reasonable amount of time.

    I also looked at the NUMA hit/miss stats for the "make -j" runs and the
    reclaim doesn't make any difference when the machine is thrashing away.

    Doing a "make -j8" on a single node that is filled with page cache pages
    takes 700 seconds with reclaim turned on and 735 seconds without reclaim
    (due to remote memory accesses).

    The simple zone_reclaim syscall program is at
    http://www.bork.org/~mort/sgi/zone_reclaim.c

    This patch:

    This adds an extra switch to the scan_control struct. It simply lets the
    reclaim code know if it's allowed to swap pages out.

    This was required for a simple per-zone reclaimer. Without this addition
    pages would be swapped out as soon as a zone ran out of memory and the early
    reclaim kicked in.

    Signed-off-by: Martin Hicks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Hicks
     
  • Add /proc/zoneinfo file to display information about memory zones. Useful
    to analyze VM behaviour.

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
     
  • This attempts to merge back the split maps. This code is mostly copied
    from Chrisw's mlock merging from post 2.6.11 trees. The only difference is
    in munmapped_error handling. Also passed prev to willneed/dontneed,
    even though they do not handle it now, since I felt it would be cleaner
    than handling prev in madvise_vma in some cases and in a subfunction in
    others.

    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prasanna Meda