17 Oct, 2007

17 commits

  • Introduces new zone flag interface for testing and setting flags:

    int zone_test_and_set_flag(struct zone *zone, zone_flags_t flag)

    Instead of setting and clearing ZONE_RECLAIM_LOCKED each time shrink_zone() is
    called, this flag is tested and set before starting zone reclaim. Zone reclaim
    starts in __alloc_pages() when a zone's watermark fails and the system is in
    zone_reclaim_mode. If the zone is already under reclaim, there is no need to
    start again, so it is simply considered full for that allocation attempt.
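
    A minimal sketch of how such a flag could guard zone reclaim (the helper names
    follow the interface above; the surrounding control flow is illustrative, not
    the exact mainline code):

    /* hedged sketch: only one thread reclaims a given zone at a time */
    if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
            return 0;       /* already under reclaim: treat the zone as full */

    ret = __zone_reclaim(zone, gfp_mask, order);
    zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
    return ret;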

    There is a change of behavior with regard to concurrent zone shrinking. It is
    now possible for try_to_free_pages() or kswapd to already be shrinking a
    particular zone when __alloc_pages() starts zone reclaim. In this case, it is
    possible for two concurrent threads to invoke shrink_zone() for a single zone.

    This change forbids a zone from being in zone reclaim twice, which was always
    the behavior, but allows for concurrent try_to_free_pages() or kswapd shrinking
    when zone reclaim starts.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • OOM killer synchronization should be done with zone granularity so that memory
    policy and cpuset allocations may have their corresponding zones locked, and to
    allow parallel kills for other OOM conditions that may exist elsewhere in the
    system. DMA allocations can be targeted at the zone level, which would not be
    possible if locking were done per node or globally.

    Synchronization shall be done with a variation of "trylocks." The goal is to
    put the current task to sleep and restart the failed allocation attempt later
    if the trylock fails. Otherwise, the OOM killer is invoked.

    Each zone in the zonelist that __alloc_pages() was called with is checked for
    the newly-introduced ZONE_OOM_LOCKED flag. If any zone has this flag present,
    the "trylock" to serialize the OOM killer fails and returns zero. Otherwise,
    all the zones have ZONE_OOM_LOCKED set and the try_set_zone_oom() function
    returns non-zero.
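
    A hedged sketch of the "trylock" walk described above, assuming a
    NULL-terminated zones[] array in the zonelist; the lock name and the helper
    names are illustrative:

    int try_set_zone_oom(struct zonelist *zonelist)
    {
            struct zone **z = zonelist->zones;
            int ret = 1;

            spin_lock(&zone_scan_lock);
            do {
                    if (zone_is_oom_locked(*z)) {
                            ret = 0;        /* someone else already OOM-locked a zone */
                            goto out;
                    }
            } while (*(++z) != NULL);

            /* no zone was locked: mark every zone in the zonelist as OOM-locked */
            for (z = zonelist->zones; *z != NULL; z++)
                    zone_set_flag(*z, ZONE_OOM_LOCKED);
    out:
            spin_unlock(&zone_scan_lock);
            return ret;
    }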

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Convert the int all_unreclaimable member of struct zone to unsigned long
    flags. This can now be used to specify several different zone flags such as
    all_unreclaimable and reclaim_in_progress, which can now be removed and
    converted to a per-zone flag.

    Flags are set and cleared as follows:

    zone_set_flag(struct zone *zone, zone_flags_t flag)
    zone_clear_flag(struct zone *zone, zone_flags_t flag)

    Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
    which have the same semantics as the old zone->all_unreclaimable and
    zone->reclaim_in_progress, respectively. Also converts all current users that
    set or clear either flag to use the new interface.

    Helper functions are defined to test the flags:

    int zone_is_all_unreclaimable(const struct zone *zone)
    int zone_is_reclaim_locked(const struct zone *zone)

    All flag operators are of the atomic variety because there are currently
    readers that do not take zone->lock.
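
    A hedged sketch of how these helpers can be built on the atomic bit operations
    (the flag values and the flags field follow the description above):

    typedef enum {
            ZONE_ALL_UNRECLAIMABLE,         /* replaces zone->all_unreclaimable */
            ZONE_RECLAIM_LOCKED,            /* replaces zone->reclaim_in_progress */
    } zone_flags_t;

    static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
    {
            set_bit(flag, &zone->flags);
    }

    static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
    {
            clear_bit(flag, &zone->flags);
    }

    static inline int zone_is_all_unreclaimable(const struct zone *zone)
    {
            return test_bit(ZONE_ALL_UNRECLAIMABLE, &zone->flags);
    }

    static inline int zone_is_reclaim_locked(const struct zone *zone)
    {
            return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
    }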

    [akpm@linux-foundation.org: add needed include]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Implement a generic chunk-of-pages isolation method by using the page grouping
    ops.

    This patch adds MIGRATE_ISOLATE to MIGRATE_TYPES. As a result:
    - MIGRATE_TYPES increases.
    - the bitmap for the migratetype is enlarged.

    Pages of the MIGRATE_ISOLATE migratetype will not be allocated even if they are
    free. This allows *freed* pages to be isolated from users. How the pages are
    freed is not the purpose of this patch; reclaim and migration code may be used
    to free them.

    If start_isolate_page_range(start, end) is called:
    - the migratetype of the range becomes MIGRATE_ISOLATE if its current type is
    MIGRATE_MOVABLE. (*) This check can be revisited as other memory-reclaim work
    progresses.
    - MIGRATE_ISOLATE is not on the migratetype fallback list.
    - all free pages and will-be-freed pages in the range are isolated.
    To check whether all pages in the range are isolated, use test_pages_isolated().
    To cancel the isolation, use undo_isolate_page_range().
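
    A hedged sketch of the intended calling sequence (the return-value conventions
    shown here are assumptions, not guaranteed by the description above):

    /* mark the blocks covering [start_pfn, end_pfn) as MIGRATE_ISOLATE */
    if (start_isolate_page_range(start_pfn, end_pfn))
            return -EBUSY;          /* a block in the range was not movable */

    /* ...free the pages in the range, e.g. via reclaim or migration... */

    if (test_pages_isolated(start_pfn, end_pfn) == 0) {
            /* every page in the range is now a free, isolated page */
    } else {
            undo_isolate_page_range(start_pfn, end_pfn);    /* give the blocks back */
    }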

    Changes V6 -> V7
    - removed unnecessary #ifdef

    There is HOLES_IN_ZONE handling code... I would be glad if we could remove it.

    Signed-off-by: Yasunori Goto
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
    The information is collected only on request so there is no runtime overhead.
    The statistics are in three parts:

    The first part prints information on the size of blocks that pages are
    being grouped on and looks like

    Page block order: 10
    Pages per block: 1024

    The second part is a more detailed version of /proc/buddyinfo and looks like

    Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
    Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
    Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
    Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
    Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
    Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
    Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
    Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4

    The third part looks like

    Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
    Node 0, zone      DMA             0            1            2            1
    Node 0, zone   Normal             3           17           94            4

    To walk the zones within a node with interrupts disabled, walk_zones_in_node()
    is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
    /proc/pagetypeinfo to reduce code duplication. It seems specific to what
    vmstat.c requires but could be broken out as a general utility function in
    mmzone.c if there were other potential users.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently mobility grouping works at the MAX_ORDER_NR_PAGES level. This makes
    sense for the majority of users where this is also the huge page size.
    However, on platforms like ia64 where the huge page size is runtime
    configurable it is desirable to group at a lower order. On x86_64 and
    occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.

    This patch groups pages together based on the value of HUGETLB_PAGE_ORDER. It
    uses a compile-time constant if possible and a variable where the huge page
    size is runtime configurable.
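
    A hedged sketch of how the grouping order can be selected at build time or at
    boot (the config symbol names here are assumptions):

    #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
    /* huge page size is chosen at boot (e.g. ia64), so the order is a variable */
    extern int pageblock_order;
    #else
    /* huge page size is a build-time constant, so the order can be one too */
    #define pageblock_order         HUGETLB_PAGE_ORDER
    #endif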

    It is assumed that grouping should be done at the lowest sensible order and
    that the user would not want to override this. If this is not true,
    page_block order could be forced to a variable initialised via a boot-time
    kernel parameter.

    One potential issue with this patch is that IA64 now parses hugepagesz with
    early_param() instead of __setup(). __setup() is called after the memory
    allocator has been initialised and the pageblock bitmaps have already been set
    up. In tests on one IA64 machine there did not seem to be any problem with
    using early_param(), and it may in fact be more correct as it guarantees the
    parameter is handled before hugepages= is parsed.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Grouping high-order atomic allocations together was intended to allow bursty
    users of atomic allocations, such as e1000, to work in situations where their
    preallocated buffers were depleted. This did not work in at
    least one case with a wireless network adapter needing order-1 allocations
    frequently. To resolve that, the free pages used for min_free_kbytes were
    moved to separate contiguous blocks with the patch
    bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.

    It is felt that keeping the free pages in the same contiguous blocks should
    be sufficient for bursty short-lived high-order atomic allocations to
    succeed, maybe even with the e1000. Even if there is a failure, increasing
    the value of min_free_kbytes will free pages as contiguous blocks, in
    contrast to the standard buddy allocator which makes no attempt to keep the
    minimum number of free pages contiguous.

    This patch backs out grouping high order atomic allocations together to
    determine if it is really needed or not. If a new report comes in about
    high-order atomic allocations failing, the feature can be reintroduced to
    determine if it fixes the problem or not. As a side-effect, this patch
    reduces by 1 the number of bits required to track the mobility type of
    pages within a MAX_ORDER_NR_PAGES block.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Grouping pages by mobility can be disabled at compile-time. This was
    considered undesirable by a number of people. However, in the current stack of
    patches, it is not a simple case of just dropping the configurable patch as it
    would cause merge conflicts. This patch backs out the configuration option.

    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The standard buddy allocator always favours the smallest block of pages.
    The effect of this is that the pages kept free to satisfy min_free_kbytes tend
    to be preserved at the same location in memory since boot time, for a very
    long time, and as a contiguous block. When an administrator sets the
    reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
    that remain free. This allows the occasional high atomic allocation to
    succeed up until the point the blocks are split. In practice, it is
    difficult to split these blocks but when they do split, the benefit of
    having min_free_kbytes for contiguous blocks disappears. Additionally,
    increasing min_free_kbytes once the system has been running for some time
    has no guarantee of creating contiguous blocks.

    On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
    blocks when there are no free pages of the appropriate type available. A
    side-effect of this is that all blocks in memory tend to be used up and
    the contiguous free blocks from boot time are not preserved like in the
    vanilla allocator. This can cause a problem if a new caller is unwilling
    to reclaim or does not reclaim for long enough.

    A failure scenario was found for a wireless network device allocating
    order-1 atomic allocations but the allocations were not intense or frequent
    enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
    This was reproduced on a desktop by booting with mem=256mb, forcing the
    driver to allocate at order-1, running a bittorrent client (downloading a
    debian ISO) and building a kernel with -j2.

    This patch addresses the problem on the desktop machine booted with
    mem=256mb. It works by setting aside a reserve of MAX_ORDER_NR_PAGES
    blocks, the number of which depends on the value of min_free_kbytes. These
    blocks are only fallen back to when there are no other free pages. Then the
    smallest possible page is used, just like in the normal buddy allocator,
    instead of the largest possible page, in order to preserve contiguous pages.
    The pages in the free lists of the reserve blocks are never taken for another
    migrate type. The result is that even if min_free_kbytes is set to a low
    value, contiguous blocks will be preserved in the MIGRATE_RESERVE blocks.
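
    A hedged sketch of a fallback table in which the reserve is only reached when
    every other type is exhausted (the array contents and dimensions here are
    illustrative, not necessarily the exact mainline table):

    static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES - 1] = {
            [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
            [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
            [MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE }, /* never used */
    };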

    This works better than the vanilla allocator because if min_free_kbytes is
    increased, a new reserve block will be chosen based on the location of
    reclaimable pages and the block will free up as contiguous pages. In the
    vanilla allocator, no effort is made to target a block of pages to free as
    contiguous pages and min_free_kbytes pages are scattered randomly.

    This effect has been observed on the test machine. min_free_kbytes was set
    initially low but it was kept as a contiguous free block within
    MIGRATE_RESERVE. min_free_kbytes was then set to a higher value and over a
    period of time, the free blocks were within the reserve and coalescing.
    How long it takes to free up depends on how quickly LRU is rotating.
    Amusingly, this means that more activity will free the blocks faster.

    This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
    effective than grouping contiguous free pages together. It all depends on
    whether the number of active atomic high allocations exceeds
    min_free_kbytes or not. If the number of active allocations exceeds
    min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
    should be set higher. Once there are no more reports of allocation
    failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC so we can
    see whether the reports stay away.

    Credit to Mariusz Kozlowski for discovering the problem, describing the
    failure scenario and testing patches and scenarios.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are problems in the use of SPARSEMEM and pageblock flags that cause
    problems on ia64.

    The first part of the problem is that units are incorrect in
    SECTION_BLOCKFLAGS_BITS computation. This results in a map_section's
    section_mem_map being treated as part of a bitmap which isn't good. This
    was evident with an invalid virtual address when mem_init attempted to free
    bootmem pages while relinquishing control from the bootmem allocator.

    The second part of the problem occurs because the pageblock flags bitmap is
    located within the mem_section. The SECTIONS_PER_ROOT computation using
    sizeof(mem_section) may then not be a power of 2, depending on the size of the
    bitmap, which renders masks and other such things no longer power-of-2 based.
    This issue was seen with SPARSEMEM_EXTREME on ia64. This patch moves the
    bitmap outside of mem_section and uses a pointer instead in the
    mem_section. The bitmaps are allocated when the section is being
    initialised.
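
    A hedged sketch of the resulting layout (field names are illustrative):

    struct mem_section {
            unsigned long section_mem_map;
            /*
             * The pageblock flags live outside the section and are only pointed
             * to, so sizeof(struct mem_section) remains a power of two and the
             * SECTIONS_PER_ROOT computation stays cheap.
             */
            unsigned long *pageblock_flags;
    };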

    Note that sparse_early_usemap_alloc() does not use alloc_remap() like
    sparse_early_mem_map_alloc() does. The allocation required for the bitmap on
    x86, the only architecture that uses alloc_remap(), is typically smaller than
    a cache line. alloc_remap() pads out allocations to the cache size, which
    would be a needless waste.

    Credit to Bob Picco for identifying the original problem and effecting a
    fix for the SECTION_BLOCKFLAGS_BITS calculation. Credit to Andy Whitcroft
    for devising the best way of allocating the bitmaps only when required for
    the section.

    [wli@holomorphy.com: warning fix]
    Signed-off-by: Bob Picco
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Cc: "Luck, Tony"
    Signed-off-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In rare cases, the kernel needs to allocate a high-order block of pages
    without sleeping. For example, this is the case with e1000 cards configured
    to use jumbo frames. Migrating or reclaiming pages in this situation is not
    an option.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_HIGHATOMIC type is exactly what it sounds
    like. Care is taken that pages of other migrate types do not use the same
    blocks as high-order atomic allocations.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch marks a number of allocations that are either short-lived such as
    network buffers or are reclaimable such as inode allocations. When something
    like updatedb is called, long-lived and unmovable kernel allocations tend to
    be spread throughout the address space which increases fragmentation.

    This patch groups these allocations together as much as possible by adding a
    new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be
    reclaimed on demand, but not moved. i.e. they can be migrated by deleting
    them and re-reading the information from elsewhere.

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The grouping mechanism has some memory overhead and a more complex allocation
    path. This patch allows the strategy to be disabled for small memory systems
    or if it is known that the workload is suffering because of the strategy. It
    also acts to show where the page grouping strategy interacts with the standard
    buddy allocator.

    Signed-off-by: Mel Gorman
    Signed-off-by: Joel Schopp
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds the core of the fragmentation reduction strategy. It works by
    grouping pages together based on their ability to migrate or be reclaimed.
    Basically, it works by breaking the list in zone->free_area list into
    MIGRATE_TYPES number of lists.
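
    A hedged sketch of the resulting structure, with one free list per mobility
    type at each order:

    struct free_area {
            struct list_head free_list[MIGRATE_TYPES];  /* per-type free lists */
            unsigned long    nr_free;                   /* free blocks at this order, all types */
    };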

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Here is the latest revision of the anti-fragmentation patches. Of particular
    note in this version is special treatment of high-order atomic allocations.
    Care is taken to group them together and avoid grouping pages of other types
    near them. Artificial tests imply that it works. I'm trying to get the
    hardware together that would allow setting up of a "real" test. If anyone
    already has a setup and test that can trigger the atomic-allocation problem,
    I'd appreciate a test of these patches and a report. The second major change
    is that these patches will apply cleanly with patches that implement
    anti-fragmentation through zones.

    kernbench shows effectively no performance difference varying between -0.2%
    and +2% on a variety of test machines. Success rates for huge page allocation
    are dramatically increased. For example, on a ppc64 machine, the vanilla
    kernel was only able to allocate 1% of memory as a hugepage and this was due
    to a single hugepage reserved as min_free_kbytes. With these patches applied,
    17% was allocatable as superpages. With reclaim-related fixes from Andy
    Whitcroft, it was 40% and further reclaim-related improvements should increase
    this further.

    Changelog Since V28
    o Group high-order atomic allocations together
    o It is no longer required to set min_free_kbytes to 10% of memory. A value
    of 16384 in most cases will be sufficient
    o Now applied with zone-based anti-fragmentation
    o Fix incorrect VM_BUG_ON within buffered_rmqueue()
    o Reorder the stack so later patches do not back out work from earlier patches
    o Fix bug where journal pages were being treated as movable
    o Bias placement of non-movable pages to lower PFNs
    o More aggressive clustering of reclaimable pages in reaction to workloads
    like updatedb that flood the size of inode caches

    Changelog Since V27

    o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation was giving
    the mistaken impression that it was the 100% solution for high order
    allocations. Instead, it greatly increases the chances high-order
    allocations will succeed and lays the foundation for defragmentation and
    memory hot-remove to work properly
    o Redefine page groupings based on ability to migrate or reclaim instead of
    basing on reclaimability alone
    o Get rid of spurious inits
    o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
    searched for a page of the appropriate type
    o Added more explanation commentary
    o Fix up bug in pageblock code where the bitmap was used before being initialised

    Changelog Since V26
    o Fix double init of lists in setup_pageset

    Changelog Since V25
    o Fix loop order of for_each_rclmtype_order so that order of loop matches args
    o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
    o Rename get_pageblock_type() to get_page_rclmtype()
    o Fix alignment problem in move_freepages()
    o Add mechanism for assigning flags to blocks of pages instead of page->flags
    o On fallback, do not examine the preferred list of free pages a second time

    The purpose of these patches is to reduce external fragmentation by grouping
    pages of related types together. When pages are migrated (or reclaimed under
    memory pressure), large contiguous pages will be freed.

    This patch works by categorising allocations by their ability to migrate:

    Movable - The pages may be moved with the page migration mechanism. These are
    generally userspace pages.

    Reclaimable - These are allocations for some kernel caches that are
    reclaimable or allocations that are known to be very short-lived.

    Unmovable - These are pages that are allocated by the kernel that
    are not trivially reclaimed. For example, the memory allocated for a
    loaded module would be in this category. By default, allocations are
    considered to be of this type.

    HighAtomic - These are high-order allocations belonging to callers that
    cannot sleep or perform any IO. In practice, this is restricted to
    jumbo frame allocation for network receive. It is assumed that the
    allocations are short-lived.

    Instead of having one MAX_ORDER-sized array of free lists in struct free_area,
    there is one for each type of reclaimability. Once a 2^MAX_ORDER block of
    pages is split for a type of allocation, it is added to the free-lists for
    that type, in effect reserving it. Hence, over time, pages of the different
    types can be clustered together.

    When the preferred freelists are exhausted, the largest possible block is taken
    from an alternative list. Buddies that are split from that large block are
    placed on the preferred allocation-type freelists to mitigate fragmentation.

    This implementation gives best-effort for low fragmentation in all zones.
    Ideally, min_free_kbytes needs to be set to a value equal to 4 * (1 <<
    (MAX_ORDER-1)) pages in most cases. With MAX_ORDER = 11 and 4K pages that is
    4 * 1024 pages = 16MB, i.e. min_free_kbytes = 16384 on x86 and x86_64 for
    example.

    Our tests show that about 60-70% of physical memory can be allocated on a
    desktop after a few days uptime. In benchmarks and stress tests, we are
    finding that 80% of memory is available as contiguous blocks at the end of the
    test. To compare, a standard kernel was getting < 1% of memory as large pages
    on a desktop and about 8-12% of memory as large pages at the end of stress
    tests.

    Following this email are 12 patches that implement the page grouping feature.
    The first patch introduces a mechanism for storing flags related to a whole
    block of pages. Then allocations are split between movable and all other
    allocations. Following that are patches to deal with per-cpu pages and make
    the mechanism configurable. The next patch moves free pages between lists
    when partially allocated blocks are used for pages of another migrate type.
    The second-last patch groups reclaimable kernel allocations such as inode
    caches together. The final patch related to groupings keeps high-order atomic
    allocations grouped together.

    The last two patches are more concerned with control of fragmentation. The
    second-last patch biases placement of non-movable allocations towards the
    start of memory. This is with a view to supporting memory hot-remove of DIMMs
    with higher PFNs in the future. The biasing could be enforced much more
    heavily, but it would have a cost. The last patch aggressively clusters
    reclaimable pages like inode caches together.

    The fragmentation reduction strategy needs to track if pages within a block
    can be moved or reclaimed so that pages are freed to the appropriate list.
    This patch adds a bitmap for flags affecting a whole MAX_ORDER block of
    pages.

    In non-SPARSEMEM configurations, the bitmap is stored in the struct zone and
    allocated during initialisation. SPARSEMEM statically allocates the bitmap in
    a struct mem_section so that bitmaps do not have to be resized during memory
    hotadd. This wastes a small amount of memory per unused section (usually
    sizeof(unsigned long)) but the complexity of dynamically allocating the memory
    is quite high.
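
    A hedged sketch of how the per-block migrate type could be read and written
    via that bitmap (the *_flags_group() helpers and the PB_migrate bit range are
    assumptions here):

    static inline int get_pageblock_migratetype(struct page *page)
    {
            return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
    }

    static void set_pageblock_migratetype(struct page *page, int migratetype)
    {
            set_pageblock_flags_group(page, (unsigned long)migratetype,
                                      PB_migrate, PB_migrate_end);
    }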

    Additional credit to Andy Whitcroft, who reviewed an earlier implementation
    of the mechanism and suggested how to make it a *lot* cleaner.

    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
    first zone of a nodelist. That only works if the node has memory. A
    memoryless node will have its first zone on another pgdat (node).

    GFP_THISNODE currently will simply return memory on the first pgdat. Thus it
    is returning memory on other nodes. GFP_THISNODE should fail if there is no
    local memory on a node.

    Add a new set of zonelists for each node that contain only the zones belonging
    to the node itself, so that no fallback is possible.

    Then modify gfp_type to pick up the right zone based on the presence of
    __GFP_THISNODE.

    Drop the existing GFP_THISNODE checks from the page allocator's hot path.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We have flags to indicate whether a section actually has a valid mem_map
    associated with it. This is never set and we rely solely on the present bit
    to indicate a section is valid. By definition a section is not valid if it
    has no mem_map and there is a window during init where the present bit is set
    but there is no mem_map, during which pfn_valid() will return true
    incorrectly.

    Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid
    mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add new
    present_section{,_nr} and pfn_present() interfaces for those users who care to
    know that a section is going to be valid.
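
    A hedged sketch of the distinction (the bit names follow the description
    above; the exact encoding is an assumption):

    static inline int present_section(struct mem_section *section)
    {
            return section && (section->section_mem_map & SECTION_MARKED_PRESENT);
    }

    static inline int valid_section(struct mem_section *section)
    {
            return section && (section->section_mem_map & SECTION_HAS_MEM_MAP);
    }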

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

23 Aug, 2007

1 commit

  • The NUMA layer only supports NUMA policies for the highest zone. When
    ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
    ZONE_MOVABLE. The result is that policies are only applied to allocations
    like anonymous pages and page cache allocated from ZONE_MOVABLE when the
    zone is used.

    This patch applies policies to the two highest zones when the highest zone
    is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
    zone, it's always functionally equivalent.

    The patch has been tested on a variety of machines both NUMA and non-NUMA
    covering x86, x86_64 and ppc64. No abnormal results were seen in
    kernbench, tbench, dbench or hackbench. It passes regression tests from
    the numactl package with and without kernelcore= once numactl tests are
    patched to wait for vmstat counters to update.

    akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
    ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
    the mobility or get the other solution that Mel is working on. That solution
    would only use a single zonelist per node and filter on the fly. That may
    help performance and also help to make memory policies work better."

    Signed-off-by: Mel Gorman
    Acked-by: Lee Schermerhorn
    Tested-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Aug, 2007

1 commit

  • The arm26 port has been in a state where it was far from even compiling
    for quite some time.

    Ian Molton agreed with the removal.

    Signed-off-by: Adrian Bunk
    Cc: Ian Molton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

18 Jul, 2007

2 commits

  • When we are out of memory of a suitable size we enter reclaim. The current
    reclaim algorithm targets pages in LRU order, which is great for fairness at
    order-0 but highly unsuitable if you desire pages at higher orders. To get
    pages of higher order we must shoot down a very high proportion of memory;
    >95% in a lot of cases.

    This patch set adds a lumpy reclaim algorithm to the allocator. It targets
    groups of pages at the specified order anchored at the end of the active and
    inactive lists. This encourages groups of pages at the requested orders to
    move from active to inactive, and active to free lists. This behaviour is
    only triggered out of direct reclaim when higher order pages have been
    requested.

    This patch set is particularly effective when utilised with an
    anti-fragmentation scheme which groups pages of similar reclaimability
    together.

    This patch set is based on Peter Zijlstra's lumpy reclaim V2 patch which forms
    the foundation. Credit to Mel Gorman for sanity checking.

    Mel said:

    The patches have an application with hugepage pool resizing.

    When lumpy-reclaim is used with ZONE_MOVABLE, the hugepages pool can
    be resized with greater reliability. Testing on a desktop machine with 2GB
    of RAM showed that growing the hugepage pool with ZONE_MOVABLE on its own
    was very slow as the success rate was quite low. Without lumpy-reclaim,
    each attempt to grow the pool by 100 pages would yield 1 or 2 hugepages.
    With lumpy-reclaim, getting 40 to 70 hugepages on each attempt was typical.

    [akpm@osdl.org: ia64 pfn_to_nid fixes and loop cleanup]
    [bunk@stusta.de: static declarations for internal functions]
    [a.p.zijlstra@chello.nl: initial lumpy V2 implementation]
    Signed-off-by: Andy Whitcroft
    Acked-by: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
    that is only usable by allocations that specify both __GFP_HIGHMEM and
    __GFP_MOVABLE. This has the effect of keeping all non-movable pages within a
    single memory partition while allowing movable allocations to be satisfied
    from either partition. The patches may be applied with the list-based
    anti-fragmentation patches that group pages together based on mobility.
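
    A hedged sketch of the zone selection rule (a simplified fragment of the gfp
    mask check, not the full gfp_zone() implementation):

    /* only allocations that are both highmem-capable and movable may use it */
    if ((gfp_mask & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
                    (__GFP_HIGHMEM | __GFP_MOVABLE))
            return ZONE_MOVABLE;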

    The size of the zone is determined by a kernelcore= parameter specified at
    boot-time. This specifies how much memory is usable by non-movable
    allocations and the remainder is used for ZONE_MOVABLE. Any range of pages
    within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

    When selecting a zone to take pages from for ZONE_MOVABLE, there are two
    things to consider. First, only memory from the highest populated zone is
    used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
    but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
    the amount of memory usable by the kernel will be spread evenly throughout
    NUMA nodes where possible. If the nodes are not of equal size, the amount of
    memory usable by the kernel on some nodes may be greater than others.

    By default, the zone is not as useful for hugetlb allocations because they are
    pinned and non-migratable (currently at least). A sysctl is provided that
    allows huge pages to be allocated from that zone. This means that the huge
    page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
    the system assuming that pages are not mlocked. Despite huge pages being
    non-movable, we do not introduce additional external fragmentation of note as
    huge pages are always the largest contiguous block we care about.

    Credit goes to Andy Whitcroft for catching a large variety of problems during
    review of the patches.

    This patch creates an additional zone, ZONE_MOVABLE. This zone is only usable
    by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE. Hot-added
    memory continues to be placed in their existing destination as there is no
    mechanism to redirect them to a specific zone.

    [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Signed-off-by: Yasunori Goto
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

1 commit

  • Make zonelist creation policy selectable from sysctl/boot option v6.

    This patch makes NUMA's zonelist (of pgdat) order selectable.
    Available orders are Default (automatic) / Node-based / Zone-based.

    [Default Order]
    The kernel selects Node-based or Zone-based order automatically.

    [Node-based Order]
    This policy treats the locality of memory as the most important parameter.
    The zonelist order is created by each zone's locality. This means lower zones
    (e.g. ZONE_DMA) can be used before a higher zone (e.g. ZONE_NORMAL) is
    exhausted. In other words, ZONE_DMA will be in the middle of the zonelist.
    The current 2.6.21 kernel uses this.

    Pros.
    * A user can expect local memory as much as possible.
    Cons.
    * A lower zone will be exhausted before a higher zone. This may cause OOM_KILL.

    This may be suitable if ZONE_DMA is relatively big, you never see OOM_KILL
    because of ZONE_DMA exhaustion, and you need the best locality.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    [Zone-based order]
    This policy treats the zone type as the most important parameter.
    The zonelist order is created by zone-type order. This means a lower zone will
    never be used before a higher zone is exhausted. In other words, ZONE_DMA will
    always be at the tail of the zonelist.

    Pros.
    * OOM_KILL (because of a lower zone) occurs only if all zones are exhausted.
    Cons.
    * memory locality may not be best.

    (example)
    assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

    *node(0)'s memory allocation order:

    node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

    *node(1)'s memory allocation order:

    node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

    The boot option "numa_zonelist_order=" and a proc/sysctl interface are supported.

    command:
    %echo N > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Node-based order.

    command:
    %echo Z > /proc/sys/vm/numa_zonelist_order

    Will rebuild zonelist in Zone-based order.

    Thanks to Lee Schermerhorn, who gave me much help and code.

    [Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Cc: "jesse.barnes@intel.com"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 May, 2007

1 commit

  • Currently the slab allocators contain callbacks into the page allocator to
    perform the draining of pagesets on remote nodes. This requires SLUB to have
    a whole subsystem in order to be compatible with SLAB. Moving node draining
    out of the slab allocators avoids a section of code in SLUB.

    Move the node draining so that it is done when the vm statistics are updated.
    At that point we are already touching all the cachelines with the pagesets of
    a processor.

    Add an expire counter there. If we have to update per zone or global vm
    statistics then assume that the pageset will require subsequent draining.

    The expire counter will be decremented on each vm stats update pass until it
    reaches zero. Then we will drain one batch from the pageset. The draining
    will cause vm counter updates which will then cause another expiration until
    the pcp is empty. So we will drain a batch every 3 seconds.
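
    A hedged sketch of the expiry logic in the vm statistics update path (field
    names are assumptions; the real code also skips zones local to the processor):

    /* called from the periodic vm statistics update for each remote pageset */
    if (p->expire && --p->expire == 0 && p->pcp.count)
            drain_zone_pages(zone, &p->pcp);        /* drain one batch */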

    Note that remote node draining is a somewhat esoteric feature that is required
    on large NUMA systems because otherwise significant portions of system memory
    can become trapped in pcp queues. The number of pcps is determined by the
    number of processors and nodes in a system. A system with 4 processors and 2
    nodes has 8 pcps which is okay. But a system with 1024 processors and 512
    nodes has 512k pcps with a high potential for a large amount of memory being
    caught in them.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 May, 2007

1 commit

  • Generally we work under the assumption that the mem_map array is contiguous
    and valid out to a MAX_ORDER_NR_PAGES block of pages, i.e. that if we
    have validated any page within this MAX_ORDER_NR_PAGES block we need not check
    any other. This is not true when CONFIG_HOLES_IN_ZONE is set and we must
    check each and every reference we make from a pfn.

    Add a pfn_valid_within() helper which should be used when scanning pages
    within a MAX_ORDER_NR_PAGES block when we have already checked the validity
    of the block normally with pfn_valid(). This can then be optimised away when
    we do not have holes within a MAX_ORDER_NR_PAGES block of pages.
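
    A hedged sketch of the helper; this is the natural definition implied by the
    description, treat it as illustrative:

    #ifdef CONFIG_HOLES_IN_ZONE
    #define pfn_valid_within(pfn)   pfn_valid(pfn)
    #else
    #define pfn_valid_within(pfn)   (1)     /* no holes: block-level check suffices */
    #endif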

    Signed-off-by: Andy Whitcroft
    Acked-by: Mel Gorman
    Acked-by: Bob Picco
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

12 Feb, 2007

5 commits

  • Make ZONE_DMA optional in core code.

    - ifdef all code for ZONE_DMA and related definitions following the example
    for ZONE_DMA32 and ZONE_HIGHMEM.

    - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
    0.

    - Modify the VM statistics to work correctly without a DMA zone.

    - Modify slab to not create DMA slabs if there is no ZONE_DMA.

    [akpm@osdl.org: cleanup]
    [jdike@addtoit.com: build fix]
    [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Values are readily available via ZVC per node and global sums.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The global and per zone counter sums are in arrays of longs. Reorder the ZVCs
    so that the most frequently used ZVCs are put into the same cacheline. That
    way calculations of the global, node and per zone vm state touches only a
    single cacheline. This is mostly important for 64 bit systems where one 128
    byte cacheline takes only 8 longs.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This again simplifies some of the VM counter calculations through the use
    of the ZVC consolidated counters.

    [michal.k.k.piotrowski@gmail.com: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Michal Piotrowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The determination of the dirty ratio to determine writeback behavior is
    currently based on the number of total pages on the system.

    However, not all pages in the system may be dirtied. Thus the ratio is always
    too low and can never reach 100%. The ratio may be particularly skewed if
    large hugepage allocations, slab allocations or device driver buffers make
    large sections of memory not available anymore. In that case we may get into
    a situation in which, for example, the background writeback ratio of 40% cannot be
    reached anymore which leads to undesired writeback behavior.

    This patchset fixes that issue by determining the ratio based on the actual
    pages that may potentially be dirty. These are the pages on the active and
    the inactive list plus free pages.
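
    A hedged sketch of the idea using the consolidated (ZVC) counters (the helper
    name and the counter names here are assumptions):

    static unsigned long determine_dirtyable_memory(void)
    {
            /* pages that could potentially be dirtied: free + active + inactive */
            return global_page_state(NR_FREE_PAGES) +
                   global_page_state(NR_ACTIVE) +
                   global_page_state(NR_INACTIVE);
    }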

    The problem with those counts has so far been that it is expensive to
    calculate these because counts from multiple nodes and multiple zones will
    have to be summed up. This patchset makes these counters ZVC counters. This
    means that a current sum per zone, per node and for the whole system is always
    available via global variables and not expensive anymore to calculate.

    The patchset results in some other good side effects:

    - Removal of the various functions that sum up free, active and inactive
    page counts

    - Cleanup of the functions that display information via the proc filesystem.

    This patch:

    The use of a ZVC for nr_inactive and nr_active allows a simplification of some
    counter operations. More ZVC functionality is used for sums etc in the
    following patches.

    [akpm@osdl.org: UP build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Jan, 2007

1 commit

  • Fix an oops experienced on the Cell architecture when init-time functions,
    early_*(), are called at runtime. It alters the call paths to make sure
    that the callers explicitly say whether the call is being made on behalf of
    a hotplug event, or happening at boot-time.

    It has been compile tested on ppc64, ia64, s390, i386 and x86_64.

    Acked-by: Arnd Bergmann
    Signed-off-by: Dave Hansen
    Cc: Yasunori Goto
    Acked-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: Martin Schwidefsky
    Acked-by: Heiko Carstens
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

08 Dec, 2006

3 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Rearrange the struct members in the 'struct zonelist_cache' structure, so
    as to put the readonly (once initialized) z_to_n[] array first, where it
    will come right after the zones[] array in struct zonelist.

    This pretty much eliminates the chance that the two frequently written
    elements of 'struct zonelist_cache', the fullzones bitmap and last_full_zap
    times, will end up on the same cache line as the performance sensitive,
    frequently read, never (after init) written zones[] array.
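
    A hedged sketch of the ordering described above (sizes and the bitmap macro
    are illustrative):

    struct zonelist_cache {
            unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];          /* read-only after init */
            DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);      /* written frequently */
            unsigned long last_full_zap;                            /* written frequently */
    };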

    Keeping frequently written data off frequently read cache lines is good for
    performance.

    Thanks to Rohit Seth for the suggestion.

    Signed-off-by: Paul Jackson
    Cc: Rohit Seth
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Optimize the critical zonelist scanning for free pages in the kernel memory
    allocator by caching the zones that were found to be full recently, and
    skipping them.

    Remembers the zones in a zonelist that were short of free memory in the
    last second. And it stashes a zone-to-node table in the zonelist struct,
    to optimize that conversion (minimize its cache footprint.)

    Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago. Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last weeks patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)

    See below for some performance benchmark results. After all that
    discussion with David on why I didn't need them, I went and got
    some ;). I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably. At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real'). Good. See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct. That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

    There are two reasons I pursued this alternative, over some earlier
    proposals that would have focused on optimizing the fake numa
    emulation case by caching the last useful zone:

    1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use. Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past. So this is not guaranteed. Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

    2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

    See comment in the patch below for a description of what it actually does.

    Technical details of note (or controversy):

    - See the use of "zlc_active" and "did_zlc_setup" below, to delay
    adding any work for this new mechanism until we've looked at the
    first zone in zonelist. I figured the odds of the first zone
    having the memory we needed were high enough that we should just
    look there, first, then get fancy only if we need to keep looking.

    - Some odd hackery was needed to add items to struct zonelist, while
    not tripping up the custom zonelists built by the mm/mempolicy.c
    code for MPOL_BIND. My usual wordy comments below explain this.
    Search for "MPOL_BIND".

    - Some per-node data in the struct zonelist is now modified frequently,
    with no locking. Multiple CPU cores on a node could hit and mangle
    this data. The theory is that this is just performance hint data,
    and the memory allocator will work just fine despite any such mangling.
    The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
    (a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
    all be self correcting after at most a one second delay.

    - This still does a linear scan of the same lengths as before. All
    I've optimized is making the scan faster, not algorithmically
    shorter. It is now able to scan a compact array of 'unsigned
    short' in the case of many full nodes, so one cache line should
    cover quite a few nodes, rather than each node hitting another
    one or two new and distinct cache lines.

    - If both Andi and Nick don't find this too complicated, I will be
    (pleasantly) flabbergasted.

    - I removed the comment claiming we only use one cacheline's worth of
    zonelist. We seem, at least in the fake numa case, to have put the
    lie to that claim.

    - I pay no attention to the various watermarks and such in this performance
    hint. A node could be marked full for one watermark, and then skipped
    over when searching for a page using a different watermark. I think
    that's actually quite ok, as it will tend to slightly increase the
    spreading of memory over other nodes, away from a memory stressed node.

    ===============

    Performance - some benchmark results and analysis:

    This benchmark runs a memory hog program that uses multiple
    threads to touch a lot of memory as quickly as it can.

    Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
    the total 96 GBytes on the system, and using 1, 19, 37, or 55
    threads (on a 56 CPU system.) System, user and real (elapsed)
    timings were recorded for each run, shown in units of seconds,
    in the table below.

    Two kernels were tested - 2.6.18-mm3 and the same kernel with
    this zonelist caching patch added. The table also shows the
    percentage improvement the zonelist caching sys time is over
    (lower than) the stock *-mm kernel.

        number      2.6.18-mm3        zonelist-cache     delta (< 0 good)  percent
    GBs      N     ------------      --------------      ----------------  systime
    mem  threads    sys user  real    sys  user  real     sys  user  real   better
     12       1     153   24   177    151    24   176      -2     0    -1       1%
     12      19      99   22     8     99    22     8       0     0     0       0%
     12      37     111   25     6    112    25     6       1     0     0      -0%
     12      55     115   25     5    110    23     5      -5    -2     0       4%
     38       1     502   74   576    497    73   570      -5    -1    -6       0%
     38      19     426   78    48    373    76    39     -53    -2    -9      12%
     38      37     544   83    36    547    82    36       3    -1     0      -0%
     38      55     501   77    23    511    80    24      10     3     1      -1%
     64       1     917  125  1042    890   124  1014     -27    -1   -28       2%
     64      19    1118  138   119    965   141   103    -153     3   -16      13%
     64      37    1202  151    94   1136   150    81     -66    -1   -13       5%
     64      55    1118  141    61   1072   140    58     -46    -1    -3       4%
     90       1    1342  177  1519   1275   174  1450     -67    -3   -69       4%
     90      19    2392  199   192   2116   189   176    -276   -10   -16      11%
     90      37    3313  238   175   2972   225   145    -341   -13   -30      10%
     90      55    1948  210   104   1843   213   100    -105     3    -4       5%

    Notes:
    1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

    2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass. The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass. For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

    3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

    4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

    5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest. I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory a long ways away. Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- let's hope not.

    6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect. Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallelizable workload.

    7) The intermediate thread count passes, when asking for a lot of memory forcing
    them to go to a few neighboring nodes, improved the most with this zonelist
    caching patch.

    Conclusions:
    * This zonelist cache patch probably makes little difference one way or the
    other for most workloads on real numa hardware, if those workloads avoid
    heavy off node allocations.
    * For memory intensive workloads requiring substantial off-node allocations
    on real numa hardware, this patch improves both kernel and elapsed timings
    up to ten per-cent.
    * For fake numa systems, I'm optimistic, but will have to leave that up to
    Rohit Seth to actually test (once I get him a 2.6.18 backport.)

    Signed-off-by: Paul Jackson
    Cc: Rohit Seth
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

29 Oct, 2006

1 commit

  • The temp_priority field in zone is racy, as we can walk through a reclaim
    path, and just before we copy it into prev_priority, it can be overwritten
    (say with DEF_PRIORITY) by another reclaimer.

    The same bug is contained in both try_to_free_pages and balance_pgdat, but
    it is fixed slightly differently. In balance_pgdat, we keep a separate
    priority record per zone in a local array. In try_to_free_pages there is
    no need to do this, as the priority level is the same for all zones that we
    reclaim from.

    Impact of this bug is that temp_priority is copied into prev_priority, and
    setting this artificially high causes reclaimers to set distress
    artificially low. They then fail to reclaim mapped pages, when they are,
    in fact, under severe memory pressure (their priority may be as low as 0).
    This causes the OOM killer to fire incorrectly.

    From: Andrew Morton

    __zone_reclaim() isn't modifying zone->prev_priority. But zone->prev_priority
    is used in the decision whether or not to bring mapped pages onto the inactive
    list. Hence there's a risk here that __zone_reclaim() will fail because
    zone->prev_priority is large (ie: low urgency) and lots of mapped pages end up
    stuck on the active list.

    Fix that up by decreasing (ie making more urgent) zone->prev_priority as
    __zone_reclaim() scans the zone's pages.

    This bug perhaps explains why ZONE_RECLAIM_PRIORITY was created. It should be
    possible to remove that now, and to just start out at DEF_PRIORITY?

    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Bligh
     

22 Oct, 2006

1 commit

  • Reintroduce NODES_SPAN_OTHER_NODES for powerpc

    Revert "[PATCH] Remove SPAN_OTHER_NODES config definition"
    This reverts commit f62859bb6871c5e4a8e591c60befc8caaf54db8c.
    Revert "[PATCH] mm: remove arch independent NODES_SPAN_OTHER_NODES"
    This reverts commit a94b3ab7eab4edcc9b2cb474b188f774c331adf7.

    Also update the comments to indicate that this is still required
    and where it's used.

    Signed-off-by: Andy Whitcroft
    Cc: Paul Mackerras
    Cc: Mike Kravetz
    Cc: Benjamin Herrenschmidt
    Acked-by: Mel Gorman
    Acked-by: Will Schmidt
    Cc: Christoph Lameter
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

27 Sep, 2006

4 commits

  • Add the node in order to optimize zone_to_nid.
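
    A minimal sketch of the idea, assuming the cached field is simply called
    node (the field name is an assumption): zone_to_nid() can return a value
    stored in the zone itself instead of chasing zone->zone_pgdat on every
    lookup:

    /* Sketch: cache the node id in struct zone at init time so that
     * zone_to_nid() avoids the pointer chase through zone->zone_pgdat. */
    static inline int zone_to_nid(struct zone *zone)
    {
    #ifdef CONFIG_NUMA
            return zone->node;
    #else
            return 0;
    #endif
    }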

    Signed-off-by: Christoph Lameter
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This moves the definition of struct page from mm.h to its own header file
    page-struct.h. This is a prereq to fix SetPageUptodate which is broken on
    s390:

    #define SetPageUptodate(_page)                                      \
    do {                                                                \
            struct page *__page = (_page);                              \
            if (!test_and_set_bit(PG_uptodate, &__page->flags))         \
                    page_test_and_clear_dirty(_page);                   \
    } while (0)

    _page gets used twice in this macro, which can cause subtle bugs. Using
    __page for the page_test_and_clear_dirty call doesn't work since it causes
    yet another problem with the page_test_and_clear_dirty macro as well.

    In order to avoid all these problems caused by macros, it seems to be a good
    idea to get rid of them and convert them to static inline functions.
    Because of header file include order it's necessary to have a separate
    header file for the struct page definition.
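
    A sketch of what the static inline replacement might look like (simplified;
    the real s390 version also has to cope with page_test_and_clear_dirty being
    a macro itself):

    /* As a static inline, the page argument is evaluated exactly once,
     * so the double-expansion problem of the macro version cannot occur. */
    static inline void SetPageUptodate(struct page *page)
    {
            if (!test_and_set_bit(PG_uptodate, &page->flags))
                    page_test_and_clear_dirty(page);
    }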

    Cc: Martin Schwidefsky
    Signed-off-by: Heiko Carstens
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • The VM is supposed to minimise the number of pages which get written off the
    LRU (for IO scheduling efficiency, and for high reclaim-success rates). But
    we don't actually have a clear way of showing how true this is.

    So add `nr_vmscan_write' to /proc/vmstat and /proc/zoneinfo - the number of
    pages which have been written by the vm scanner in this zone and globally.
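
    A minimal sketch of where such a counter would be bumped (the helper and its
    call site are assumptions; essentially, wherever the scanner decides to write
    a page back during reclaim):

    /* Sketch: account a page the VM scanner is about to write back.
     * The per-zone counter is what surfaces as nr_vmscan_write in
     * /proc/vmstat and /proc/zoneinfo. */
    static inline void note_vmscan_write(struct page *page)
    {
            inc_zone_page_state(page, NR_VMSCAN_WRITE);
    }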

    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • At a basic level, architectures define structures to record where active
    ranges of page frames are located. Once located, the code to calculate zone
    sizes and holes in each architecture is very similar. Some of this zone and
    hole sizing code is difficult to read for no good reason. This set of patches
    eliminates the similar-looking architecture-specific code.

    The patches introduce a mechanism where architectures register where the
    active ranges of page frames are with add_active_range(). When all areas have
    been discovered, free_area_init_nodes() is called to initialise the pgdat and
    zones. The zone sizes and holes are then calculated in an architecture
    independent manner.

    Patch 1 introduces the mechanism for registering and initialising PFN ranges
    Patch 2 changes ppc to use the mechanism - 139 arch-specific LOC removed
    Patch 3 changes x86 to use the mechanism - 136 arch-specific LOC removed
    Patch 4 changes x86_64 to use the mechanism - 74 arch-specific LOC removed
    Patch 5 changes ia64 to use the mechanism - 52 arch-specific LOC removed
    Patch 6 accounts for mem_map as a memory hole as the pages are not reclaimable.
    It adjusts the watermarks slightly.

    Tony Luck has successfully tested for ia64 on Itanium with tiger_defconfig,
    gensparse_defconfig and defconfig. Bob Picco has also tested and debugged on
    IA64. Jack Steiner successfully boot tested on a mammoth SGI IA64-based
    machine. These tests were against 2.6.17-rc1 with release 3 of these
    patches, but there have been no ia64 changes since release 3.

    There are differences in the zone sizes for x86_64 as the arch-specific code
    for x86_64 accounts the kernel image and the starting mem_maps as memory holes
    but the architecture-independent code accounts the memory as present.

    The big benefit of this set of patches is a sizable reduction of
    architecture-specific code, some of which is very hairy. There should be a
    greater reduction when other architectures use the same mechanisms for zone
    and hole sizing but I lack the hardware to test on.

    Additional credit:
    Dave Hansen for the initial suggestion and comments on early patches
    Andy Whitcroft for reviewing early versions and catching numerous
    errors
    Tony Luck for testing and debugging on IA64
    Bob Picco for fixing bugs related to pfn registration, reviewing a
    number of patch revisions, providing a number of suggestions
    on future direction and testing heavily
    Jack Steiner and Robin Holt for testing on IA64 and clarifying
    issues related to memory holes
    Yasunori for testing on IA64
    Andi Kleen for reviewing and feeding back about x86_64
    Christian Kujau for providing valuable information related to ACPI
    problems on x86_64 and testing potential fixes

    This patch:

    Define the structure to represent an active range of page frames within a node
    in an architecture independent manner. Architectures are expected to register
    active ranges of PFNs using add_active_range(nid, start_pfn, end_pfn) and call
    free_area_init_nodes() passing the PFNs of the end of each zone.
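
    As a hedged illustration of how an architecture might use this interface
    (the arch hook name, zone indices and PFN numbers below are made up; only
    add_active_range() and free_area_init_nodes() come from the description
    above):

    /* Sketch: register two nodes' worth of memory and let the core VM
     * compute zone sizes and holes. */
    void __init example_arch_register_memory(void)       /* hypothetical hook */
    {
            unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

            /* One call per (node, contiguous PFN range) pair. */
            add_active_range(0, 0, 0x40000);              /* node 0: 0 - 1GB */
            add_active_range(1, 0x100000, 0x140000);      /* node 1: 4 - 5GB */

            /* Pass the highest PFN each zone may reach; zones not filled
             * in are left empty in this sketch. */
            max_zone_pfns[ZONE_DMA]    = 0x1000;          /* first 16MB */
            max_zone_pfns[ZONE_NORMAL] = 0x140000;
            free_area_init_nodes(max_zone_pfns);
    }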

    Signed-off-by: Mel Gorman
    Signed-off-by: Bob Picco
    Cc: Dave Hansen
    Cc: Andy Whitcroft
    Cc: Andi Kleen
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Keith Mannthey"
    Cc: "Luck, Tony"
    Cc: KAMEZAWA Hiroyuki
    Cc: Yasunori Goto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

26 Sep, 2006

1 commit

  • Currently one can enable slab reclaim by setting an explicit option in
    /proc/sys/vm/zone_reclaim_mode. Slab reclaim is then used as a final
    option if freeing unmapped file-backed pages does not free enough pages
    to allow a local allocation.

    However, that means that the slab can grow excessively and that most memory
    of a node may be used by slabs. We have had a case where a machine with
    46GB of memory was using 40-42GB for slab. Zone reclaim was effective in
    dealing with pagecache pages. However, slab reclaim was only done during
    global reclaim (which is a bit rare on NUMA systems).

    This patch implements slab reclaim during zone reclaim. Zone reclaim
    occurs if there is a danger of an off node allocation. At that point we

    1. Shrink the per node page cache if the number of pagecache
    pages is more than min_unmapped_ratio percent of pages in a zone.

    2. Shrink the slab cache if the number of the node's reclaimable slab pages
    (this patch depends on an earlier one that implements that counter)
    is more than min_slab_ratio (a new /proc/sys/vm tunable).

    The shrinking of the slab cache is a bit problematic since it is not node
    specific. So we simply calculate what point in the slab we want to reach
    (current per node slab use minus the number of pages that need to be
    allocated) and then repeatedly run the global reclaim until that is
    unsuccessful or we have reached the limit. I hope we will have zone based
    slab reclaim at some point which will make that easier.
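
    Roughly, the loop described above looks like the following sketch (the
    function and parameter names are illustrative, and the per-node
    reclaimable-slab counter is assumed to be NR_SLAB_RECLAIMABLE; only
    shrink_slab() is taken as given):

    /* Sketch: shrink slab globally until this node's reclaimable slab
     * drops to "current use minus the pages we need", progress stops,
     * or we give up after a fixed number of attempts. */
    static void shrink_slab_towards_target(struct zone *zone,
                                           unsigned long nr_pages,
                                           unsigned long nr_scanned,
                                           unsigned long lru_pages)
    {
            unsigned long slab = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
            unsigned long target = slab > nr_pages ? slab - nr_pages : 0;
            int tries = 0;

            while (zone_page_state(zone, NR_SLAB_RECLAIMABLE) > target &&
                   tries++ < 10) {
                    /* shrink_slab() is not node aware; stop once it makes
                     * no more progress. */
                    if (!shrink_slab(nr_scanned, GFP_KERNEL, lru_pages))
                            break;
            }
    }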

    The default for min_slab_ratio is 5%.

    Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

    [akpm@osdl.org: cleanups]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter