17 Oct, 2007

6 commits

  • This patch contains the following cleanups:
    - every file should include the headers containing the prototypes for
    its global functions
    - make the following needlessly global functions static:
      - migrate_to_node()
      - do_mbind()
      - sp_alloc()
      - mpol_rebind_policy()

    [akpm@linux-foundation.org: fix uninitialised var warning]
    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Here's a cut at fixing up uses of the online node map in generic code.

    mm/shmem.c:shmem_parse_mpol()

    Ensure nodelist is subset of nodes with memory.
    Use node_states[N_HIGH_MEMORY] as default for missing
    nodelist for interleave policy.

    mm/shmem.c:shmem_fill_super()

    initialize policy_nodes to node_states[N_HIGH_MEMORY]

    mm/page-writeback.c:highmem_dirtyable_memory()

    sum over nodes with memory

    mm/page_alloc.c:zlc_setup()

    allowednodes - use nodes with memory.

    mm/page_alloc.c:default_zonelist_order()

    average over nodes with memory.

    mm/page_alloc.c:find_next_best_node()

    skip nodes w/o memory.
    N_HIGH_MEMORY state mask may not be initialized at this time,
    unless we want to depend on early_calculate_totalpages() [see
    below]. Will ZONE_MOVABLE ever be configurable?

    mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

    spread kernelcore over nodes with memory.

    This required calling early_calculate_totalpages()
    unconditionally, and populating N_HIGH_MEMORY node
    state therein from nodes in the early_node_map[].
    If we can depend on this, we can eliminate the
    population of N_HIGH_MEMORY mask from __build_all_zonelists()
    and use the N_HIGH_MEMORY mask in find_next_best_node().

    mm/mempolicy.c:mpol_check_policy()

    Ensure nodes specified for policy are subset of
    nodes with memory.
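
    A minimal sketch of the recurring pattern (the helper name check_nodes() is
    illustrative, not from the patch), using the node_states[N_HIGH_MEMORY]
    interface this series switches to: default a missing nodelist to the nodes
    that have memory, and reject masks naming memoryless nodes.

        #include <linux/errno.h>
        #include <linux/nodemask.h>

        /* Hedged sketch only; not the literal mpol_check_policy() change. */
        static int check_nodes(nodemask_t *nodes)
        {
                if (nodes_empty(*nodes)) {
                        /* default: all nodes with (high) memory */
                        *nodes = node_states[N_HIGH_MEMORY];
                        return 0;
                }
                /* a policy may only name nodes that actually have memory */
                if (!nodes_subset(*nodes, node_states[N_HIGH_MEMORY]))
                        return -EINVAL;
                return 0;
        }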

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Online nodes now may have no memory. The checks and initialization must
    therefore be changed to no longer use the online functions.

    This will correctly initialize the interleave on bootup to only target nodes
    with memory and will make sys_move_pages return an error when a page is to be
    moved to a memoryless node. Similarly we will get an error if MPOL_BIND and
    MPOL_INTERLEAVE is used on a memoryless node.

    These are somewhat new semantics. So far one could specify memoryless nodes
    and we would maybe do the right thing and just ignore the node (or we'd do
    something strange like with MPOL_INTERLEAVE). If we want to allow the
    specification of memoryless nodes via memory policies then we need to keep
    checking for online nodes.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on
    memoryless nodes will be redirected to nodes with memory. This results in an
    imbalance because the nodes neighboring memoryless nodes will get
    significantly more interleave hits than the rest of the nodes on the system.

    We can avoid this imbalance by clearing the nodes in the interleave node set
    that have no memory. If we use the node map of the memory nodes instead of
    the online nodes then we have only the nodes we want.
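
    A hedged one-function sketch of that intersection (the function name is
    illustrative), again relying on the node_states[N_HIGH_MEMORY] map
    described above:

        #include <linux/nodemask.h>

        /* Keep only interleave candidates that actually have memory. */
        static void clear_memoryless_nodes(nodemask_t *interleave)
        {
                nodes_and(*interleave, *interleave, node_states[N_HIGH_MEMORY]);
        }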

    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Allow an application to query the memories allowed by its context.

    Updated numa_memory_policy.txt to mention that applications can use this to
    obtain allowed memories for constructing valid policies.

    TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
    wrapper--e.g., numa_get_mems_allowed() ?

    Also, update numa syscall man pages.

    Tested with memtoy V>=0.13.
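
    A small userspace sketch of the query (assumes libnuma's <numaif.h>; the
    MPOL_F_MEMS_ALLOWED value is taken from <linux/mempolicy.h> if the header
    does not define it):

        /* gcc -o memsallowed memsallowed.c -lnuma */
        #include <numaif.h>
        #include <stdio.h>
        #include <string.h>

        #ifndef MPOL_F_MEMS_ALLOWED
        #define MPOL_F_MEMS_ALLOWED (1 << 2)    /* "return allowed memories" */
        #endif

        int main(void)
        {
                unsigned long mask[16];         /* room for 16 longs of node bits */
                unsigned long maxnode = 8 * sizeof(mask);

                memset(mask, 0, sizeof(mask));
                if (get_mempolicy(NULL, mask, maxnode, NULL, MPOL_F_MEMS_ALLOWED)) {
                        perror("get_mempolicy");
                        return 1;
                }
                for (unsigned long nid = 0; nid < maxnode; nid++)
                        if (mask[nid / (8 * sizeof(long))] &
                            (1UL << (nid % (8 * sizeof(long)))))
                                printf("node %lu allowed\n", nid);
                return 0;
        }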

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch cleans up duplicate includes in
    mm/

    Signed-off-by: Jesper Juhl
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     

20 Sep, 2007

1 commit

  • This patch proposes fixes to the reference counting of memory policy in the
    page allocation paths and in show_numa_map(). Extracted from my "Memory
    Policy Cleanups and Enhancements" series as stand-alone.

    Shared policy lookup [shmem] has always added a reference to the policy,
    but this was never unrefed after page allocation or after formatting the
    numa map data.

    Default system policy should not require additional ref counting, nor
    should the current task's task policy. However, show_numa_map() calls
    get_vma_policy() to examine what may be [likely is] another task's policy.
    The latter case needs protection against freeing of the policy.

    This patch adds a reference count to a mempolicy returned by
    get_vma_policy() when the policy is a vma policy or another task's
    mempolicy. Again, shared policy is already reference counted on lookup. A
    matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
    shared and vma policies, and in show_numa_map() for shared and another
    task's mempolicy. We can call __mpol_free() directly, saving an admittedly
    inexpensive inline NULL test, because we know we have a non-NULL policy.

    Handling policy ref counts for hugepages is a bit trickier.
    huge_zonelist() returns a zone list that might come from a shared or vma
    'BIND' policy. In this case, we should hold the reference until after the
    huge page allocation in dequeue_hugepage(). The patch modifies
    huge_zonelist() to return a pointer to the mempolicy if it needs to be
    unref'd after allocation.
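
    A hedged sketch of the conditional "unref" pattern described above, as it
    would look inside mm/mempolicy.c (where default_policy is visible); this is
    an illustration, not the literal diff:

        /* Drop the reference get_vma_policy() may have taken. The system
         * default policy and the task's own policy are not ref counted. */
        static void example_drop_policy_ref(struct mempolicy *pol)
        {
                if (pol != &default_policy && pol != current->mempolicy)
                        __mpol_free(pol);
        }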

    Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

                     w/o patch               w/ refcount patch
                   Avg      Std Devn         Avg      Std Devn
    Real:       100.59        0.38        100.63        0.43
    User:      1209.60        0.37       1209.91        0.31
    System:      81.52        0.42         81.64        0.34

    Signed-off-by: Lee Schermerhorn
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

31 Aug, 2007

1 commit

  • Page migration currently does not check whether the target of the move
    contains invalid nodes (if root attempts to migrate pages), and it may try
    to allocate from invalid nodes if these are specified, leading to oopses.

    Return -EINVAL if an offline node is specified.

    Signed-off-by: Christoph Lameter
    Cc: Shaohua Li
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Aug, 2007

1 commit

  • The NUMA layer only supports NUMA policies for the highest zone. When
    ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
    ZONE_MOVABLE. The result is that policies are only applied to allocations
    like anonymous pages and page cache allocated from ZONE_MOVABLE when the
    zone is used.

    This patch applies policies to the two highest zones when the highest zone
    is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
    zone, it's always functionally equivalent.
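
    A hedged sketch of the resulting policy-zone bookkeeping (it mirrors the
    shape described above; treat it as illustrative rather than the exact hunk):

        #include <linux/mempolicy.h>

        /* Track the highest zone that memory policies apply to, but never
         * let ZONE_MOVABLE itself become the policy zone. */
        static inline void example_check_highest_zone(enum zone_type k)
        {
                if (k > policy_zone && k != ZONE_MOVABLE)
                        policy_zone = k;
        }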

    The patch has been tested on a variety of machines both NUMA and non-NUMA
    covering x86, x86_64 and ppc64. No abnormal results were seen in
    kernbench, tbench, dbench or hackbench. It passes regression tests from
    the numactl package with and without kernelcore= once numactl tests are
    patched to wait for vmstat counters to update.

    akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
    ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
    the mobility or get the other solution that Mel is working on. That solution
    would only use a single zonelist per node and filter on the fly. That may
    help performance and also help to make memory policies work better."

    Signed-off-by: Mel Gorman
    Acked-by: Lee Schermerhorn
    Tested-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

18 Jul, 2007

2 commits

  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is often known at allocation time whether a page may be migrated or not.
    This patch adds a flag called __GFP_MOVABLE and a new mask called
    GFP_HIGH_MOVABLE. Allocations using __GFP_MOVABLE can either be migrated
    using the page migration mechanism or reclaimed by syncing with backing
    storage and discarding.
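
    A minimal kernel-side sketch of marking an allocation movable (a generic
    illustration, not a hunk from this patch):

        #include <linux/gfp.h>

        /* The page's contents can be migrated or reclaimed, so the allocator
         * may satisfy this from ZONE_MOVABLE when that zone is configured. */
        static struct page *example_alloc_movable(void)
        {
                return alloc_page(GFP_HIGHUSER | __GFP_MOVABLE);
        }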

    An API function very similar to alloc_zeroed_user_highpage() is added for
    __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
    flags used by alloc_zeroed_user_highpage() are not changed because it would
    change the semantics of an existing API. After this patch is applied there
    are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
    be marked deprecated if this patch is merged.

    Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
    shmem.c to keep all flag modifications to inode->mapping in the
    shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
    Hugh Dickins.

    Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
    concept. Credit to Hugh Dickins for catching issues with shmem swap vector
    and ramfs allocations.

    [akpm@linux-foundation.org: build fix]
    [hugh@veritas.com: __GFP_ZERO cleanup]
    Signed-off-by: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

2 commits

  • Enabling debugging fails to build due to the nodemask variable in
    do_mbind() having changed names, and then oopses on boot due to the
    assumption that the nodemask can be dereferenced -- which doesn't work out
    so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by
    numa_default_policy().

    This fixes it up, and switches from PDprintk() to pr_debug() while
    we're at it.

    Signed-off-by: Paul Mundt
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • This converts the default system init memory policy to use a dynamically
    created node map instead of defaulting to all online nodes. Nodes of a
    certain size (>= 16MB) are judged to be suitable for interleave, and are added
    to the map. If all nodes are smaller in size, the largest one is
    automatically selected.
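
    A hedged sketch of the selection rule (the names and the 16MB threshold
    follow the description above; this is an illustration, not the literal
    numa_policy_init() hunk):

        #include <linux/mmzone.h>
        #include <linux/nodemask.h>

        /* Build the default interleave mask: prefer nodes >= 16MB, else
         * fall back to the single largest node. */
        static nodemask_t example_default_interleave_mask(void)
        {
                nodemask_t mask = NODE_MASK_NONE;
                unsigned long largest = 0;
                int nid, prefer = -1;

                for_each_online_node(nid) {
                        unsigned long pages = node_present_pages(nid);

                        if (pages > largest) {
                                largest = pages;
                                prefer = nid;
                        }
                        if (pages > (16UL << (20 - PAGE_SHIFT)))
                                node_set(nid, mask);
                }
                if (nodes_empty(mask) && prefer >= 0)
                        node_set(prefer, mask);
                return mask;
        }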

    Without this, tiny nodes find themselves out of memory before we even make it
    to userspace. Systems with large nodes will notice no change.

    Only the system init policy is affected by this change; the regular
    MPOL_DEFAULT policy is still switched to later on in the boot process as
    normal.

    Signed-off-by: Paul Mundt
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     

05 Mar, 2007

1 commit

  • Currently we do not check for vma flags if sys_move_pages is called to move
    individual pages. If sys_migrate_pages is called to move pages then we
    check for vm_flags that indicate a non-migratable vma, but that check still
    includes VM_LOCKED, even though we can migrate mlocked pages.

    Extract the vma_migratable check from mm/mempolicy.c, fix it, and put it
    into migrate.h so that it can be used from both locations.

    The problem was spotted by Lee Schermerhorn.
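
    A hedged sketch of the fixed helper (it mirrors the shape described above;
    VM_LOCKED is deliberately absent because mlocked pages can be migrated):

        #include <linux/mm.h>

        static inline int example_vma_migratable(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_IO | VM_HUGETLB | VM_PFNMAP | VM_RESERVED))
                        return 0;
                return 1;
        }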

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

21 Feb, 2007

1 commit


12 Feb, 2007

1 commit

  • This patchset follows up on the earlier work in Andrew's tree to reduce the
    number of zones. The patches allow going down to a minimum of 2 zones. This
    one also allows making ZONE_DMA optional, so the number of zones can be
    reduced to one.

    ZONE_DMA is usually used for ISA DMA devices. There are a number of reasons
    why we would not want to have ZONE_DMA:

    1. Some arches do not need ZONE_DMA at all.

    2. With the advent of IOMMUs DMA zones are no longer needed.
    The necessity of DMA zones may drastically be reduced
    in the future. This patchset allows a compilation of
    a kernel without that overhead.

    3. Devices that require ISA DMA are becoming rare these days. None of
    my systems has any need for ISA DMA.

    4. The presence of an additional zone unnecessarily complicates
    VM operations because it must be scanned and balancing
    logic must operate on it.

    5. With only ZONE_NORMAL one can reach the situation where
    we have only one zone. This will allow the unrolling of many
    loops in the VM and allows the optimization of various
    code paths in the VM.

    6. Having only a single zone in a NUMA system results in a
    1-1 correspondence between nodes and zones. Various additional
    optimizations to critical VM paths become possible.

    Many systems today can operate just fine with a single zone. If you look at
    what is in ZONE_DMA then one usually sees that nothing uses it. The DMA slabs
    are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
    will be empty instead).

    On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty. Why
    constantly look at an empty zone in /proc/zoneinfo and an empty slab in
    /proc/slabinfo? Non-i386 systems also frequently have no need for ZONE_DMA,
    and the zones stay empty.

    The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).

    The RFC posted earlier (see
    http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
    of #ifdefs in them. An effort has been made to minimize the number of #ifdefs
    and make this as compact as possible. The job was made much easier by the
    ongoing efforts of others to extract common arch specific functionality.

    I have been running this for a while now on my desktop and finally Linux is
    using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:

    christoph@pentium940:~$ cat /proc/zoneinfo
    Node 0, zone   Normal
      pages free     4435
            min      1448
            low      1810
            high     2172
            active   241786
            inactive 210170
            scanned  0 (a: 0 i: 0)
            spanned  524224
            present  524224
        nr_anon_pages         61680
        nr_mapped             14271
        nr_file_pages         390264
        nr_slab_reclaimable   27564
        nr_slab_unreclaimable 1793
        nr_page_table_pages   449
        nr_dirty              39
        nr_writeback          0
        nr_unstable           0
        nr_bounce             0
        cpu: 0 pcp: 0
              count: 156
              high:  186
              batch: 31
        cpu: 0 pcp: 1
              count: 9
              high:  62
              batch: 15
        vm stats threshold: 20
        cpu: 1 pcp: 0
              count: 177
              high:  186
              batch: 31
        cpu: 1 pcp: 1
              count: 12
              high:  62
              batch: 15
        vm stats threshold: 20
      all_unreclaimable: 0
      prev_priority:     12
      temp_priority:     12
      start_pfn:         0

    This patch:

    In two places in the VM we use ZONE_DMA to refer to the first zone. If
    ZONE_DMA is optional then other zones may be first. So simply replace
    ZONE_DMA with zone 0.

    This also fixes ZONETABLE_PGSHIFT. If we have only a single zone then
    ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
    number related to a pgdat. However, we still need a zonetable to index all
    the zones for each node if this is a NUMA system. Therefore define
    ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.

    [apw@shadowen.org: fix mismerge]
    Acked-by: Christoph Hellwig
    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Cc: Paul Mundt
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jan, 2007

1 commit

  • Currently one can specify an arbitrary node mask to mbind that includes
    nodes not allowed. If that is done with an interleave policy then we will
    go around all the nodes. Those outside of the currently allowed cpuset
    will be redirected to the border nodes. Interleave will then create
    imbalances at the borders of the cpuset.

    This patch restricts the nodes to the currently allowed cpuset.

    The RFC for this patch was discussed at
    http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2

    Signed-off-by: Christoph Lameter
    Cc: Paul Jackson
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Dec, 2006

1 commit


08 Dec, 2006

4 commits

  • - move some file_operations structs into the .rodata section

    - move static strings from policy_types[] array into the .rodata section

    - fix generic seq_operations usages, so that those structs may be defined
    as "const" as well

    [akpm@osdl.org: couple of fixes]
    Signed-off-by: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • NUMA node ids are passed as either int or unsigned int almost exclusively,
    yet page_to_nid and zone_to_nid both return unsigned long. This is a
    throwback to when page_to_nid was a #define and was thus exposing the real
    type of the page flags field.

    In addition to fixing up the definitions of page_to_nid and zone_to_nid I
    audited the users of these functions identifying the following incorrect
    uses:

    1) mm/page_alloc.c show_node() -- printk dumping the node id,
    2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
    against numa_node_id() which returns an int from cpu_to_node(), and
    3) mm/mempolicy.c check_pte_range -- used as an index in node_isset which
    uses bit_set which in generic code takes an int.

    Signed-off-by: Andy Whitcroft
    Cc: Christoph Lameter
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Optimize the critical zonelist scanning for free pages in the kernel memory
    allocator by caching the zones that were found to be full recently, and
    skipping them.

    It remembers the zones in a zonelist that were short of free memory in the
    last second. And it stashes a zone-to-node table in the zonelist struct,
    to optimize that conversion (minimize its cache footprint.)

    Recent changes:

    This differs in a significant way from a similar patch that I
    posted a week ago. Now, instead of having a nodemask_t of
    recently full nodes, I have a bitmask of recently full zones.
    This solves a problem that last week's patch had, which on
    systems with multiple zones per node (such as DMA zone) would
    take seeing any of these zones full as meaning that all zones
    on that node were full.

    Also I changed names - from "zonelist faster" to "zonelist cache",
    as that seemed to better convey what we're doing here - caching
    some of the key zonelist state (for faster access.)
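
    For reference, a hedged sketch of the cached state (the field names follow
    the text above; the exact layout is illustrative):

        /* Per-zonelist cache: which zones were recently found full, plus a
         * zone-index -> node-id table, re-zapped about once per second. */
        struct zonelist_cache {
                unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];     /* zone -> node  */
                DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* recently full */
                unsigned long last_full_zap;                       /* jiffies of last zap */
        };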

    See below for some performance benchmark results. After all that
    discussion with David on why I didn't need them, I went and got
    some ;). I wanted to verify that I had not hurt the normal case
    of memory allocation noticeably. At least for my one little
    microbenchmark, I found (1) the normal case wasn't affected, and
    (2) workloads that forced scanning across multiple nodes for
    memory improved, with up to 10% fewer System CPU cycles and lower
    elapsed clock time ('sys' and 'real'). Good. See details, below.

    I didn't have the logic in get_page_from_freelist() for various
    full nodes and zone reclaim failures correct. That should be
    fixed up now - notice the new goto labels zonelist_scan,
    this_zone_full, and try_next_zone, in get_page_from_freelist().

    There are two reasons I pursued this alternative, over some earlier
    proposals that would have focused on optimizing the fake numa
    emulation case by caching the last useful zone:

    1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
    have seen real customer loads where the cost to scan the zonelist
    was a problem, due to many nodes being full of memory before
    we got to a node we could use. Or at least, I think we have.
    This was related to me by another engineer, based on experiences
    from some time past. So this is not guaranteed. Most likely, though.

    The following approach should help such real numa systems just as
    much as it helps fake numa systems, or any combination thereof.

    2) The effort to distinguish fake from real numa, using node_distance,
    so that we could cache a fake numa node and optimize choosing
    it over equivalent distance fake nodes, while continuing to
    properly scan all real nodes in distance order, was going to
    require a nasty blob of zonelist and node distance munging.

    The following approach has no new dependency on node distances or
    zone sorting.

    See comment in the patch below for a description of what it actually does.

    Technical details of note (or controversy):

    - See the use of "zlc_active" and "did_zlc_setup" below, to delay
    adding any work for this new mechanism until we've looked at the
    first zone in zonelist. I figured the odds of the first zone
    having the memory we needed were high enough that we should just
    look there, first, then get fancy only if we need to keep looking.

    - Some odd hackery was needed to add items to struct zonelist, while
    not tripping up the custom zonelists built by the mm/mempolicy.c
    code for MPOL_BIND. My usual wordy comments below explain this.
    Search for "MPOL_BIND".

    - Some per-node data in the struct zonelist is now modified frequently,
    with no locking. Multiple CPU cores on a node could hit and mangle
    this data. The theory is that this is just performance hint data,
    and the memory allocator will work just fine despite any such mangling.
    The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
    (a bitmask) and 'last_full_zap' (unsigned long jiffies). It should
    all be self correcting after at most a one second delay.

    - This still does a linear scan of the same lengths as before. All
    I've optimized is making the scan faster, not algorithmically
    shorter. It is now able to scan a compact array of 'unsigned
    short' in the case of many full nodes, so one cache line should
    cover quite a few nodes, rather than each node hitting another
    one or two new and distinct cache lines.

    - If both Andi and Nick don't find this too complicated, I will be
    (pleasantly) flabbergasted.

    - I removed the comment claiming we only use one cacheline's worth of
    zonelist. We seem, at least in the fake numa case, to have put the
    lie to that claim.

    - I pay no attention to the various watermarks and such in this performance
    hint. A node could be marked full for one watermark, and then skipped
    over when searching for a page using a different watermark. I think
    that's actually quite ok, as it will tend to slightly increase the
    spreading of memory over other nodes, away from a memory stressed node.

    ===============

    Performance - some benchmark results and analysis:

    This benchmark runs a memory hog program that uses multiple
    threads to touch a lot of memory as quickly as it can.

    Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
    the total 96 GBytes on the system, and using 1, 19, 37, or 55
    threads (on a 56 CPU system.) System, user and real (elapsed)
    timings were recorded for each run, shown in units of seconds,
    in the table below.

    Two kernels were tested - 2.6.18-mm3 and the same kernel with
    this zonelist caching patch added. The table also shows the
    percentage improvement the zonelist caching sys time is over
    (lower than) the stock *-mm kernel.

    GBs  number        2.6.18-mm3        zonelist-cache     delta (< 0 good)  percent
    mem  threads     sys  user  real    sys  user  real     sys  user  real   systime better
     12     1        153    24   177    151    24   176      -2     0    -1      1%
     12    19         99    22     8     99    22     8       0     0     0      0%
     12    37        111    25     6    112    25     6       1     0     0     -0%
     12    55        115    25     5    110    23     5      -5    -2     0      4%
     38     1        502    74   576    497    73   570      -5    -1    -6      0%
     38    19        426    78    48    373    76    39     -53    -2    -9     12%
     38    37        544    83    36    547    82    36       3    -1     0     -0%
     38    55        501    77    23    511    80    24      10     3     1     -1%
     64     1        917   125  1042    890   124  1014     -27    -1   -28      2%
     64    19       1118   138   119    965   141   103    -153     3   -16     13%
     64    37       1202   151    94   1136   150    81     -66    -1   -13      5%
     64    55       1118   141    61   1072   140    58     -46    -1    -3      4%
     90     1       1342   177  1519   1275   174  1450     -67    -3   -69      4%
     90    19       2392   199   192   2116   189   176    -276   -10   -16     11%
     90    37       3313   238   175   2972   225   145    -341   -13   -30     10%
     90    55       1948   210   104   1843   213   100    -105     3    -4      5%

    Notes:
    1) This test ran a memory hog program that started a specified number N of
    threads, and had each thread allocate and touch 1/N'th of
    the total memory to be used in the test run in a single loop,
    writing a constant word to memory, one store every 4096 bytes.
    Watching this test during some earlier trial runs, I would see
    each of these threads sit down on one CPU and stay there, for
    the remainder of the pass, a different CPU for each thread.

    2) The 'real' column is not comparable to the 'sys' or 'user' columns.
    The 'real' column is seconds wall clock time elapsed, from beginning
    to end of that test pass. The 'sys' and 'user' columns are total
    CPU seconds spent on that test pass. For a 19 thread test run,
    for example, the sum of 'sys' and 'user' could be up to 19 times the
    number of 'real' elapsed wall clock seconds.

    3) Tests were run on a fresh, single-user boot, to minimize the amount
    of memory already in use at the start of the test, and to minimize
    the amount of background activity that might interfere.

    4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.

    5) Notice that the 'real' time gets large for the single thread runs, even
    though the measured 'sys' and 'user' times are modest. I'm not sure what
    that means - probably something to do with it being slow for one thread to
    be accessing memory a long way away. Perhaps the fake numa system, running
    ostensibly the same workload, would not show this substantial degradation
    of 'real' time for one thread on many nodes -- let's hope not.

    6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
    ran quite efficiently, as one might expect. Each pair of threads needed
    to allocate and touch the memory on the node the two threads shared, a
    pleasantly parallelizable workload.

    7) The intermediate thread count passes, when asking for a lot of memory,
    forcing them to go to a few neighboring nodes, improved the most with this
    zonelist caching patch.

    Conclusions:
    * This zonelist cache patch probably makes little difference one way or the
    other for most workloads on real numa hardware, if those workloads avoid
    heavy off node allocations.
    * For memory intensive workloads requiring substantial off-node allocations
    on real numa hardware, this patch improves both kernel and elapsed timings
    by up to ten percent.
    * For fake numa systems, I'm optimistic, but will have to leave that up to
    Rohit Seth to actually test (once I get him a 2.6.18 backport.)

    Signed-off-by: Paul Jackson
    Cc: Rohit Seth
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

12 Oct, 2006

1 commit


01 Oct, 2006

1 commit


27 Sep, 2006

1 commit

  • This patch ensures that the slab node lists in the NUMA case only contain
    slabs that belong to that specific node. All slab allocations use
    GFP_THISNODE when calling into the page allocator. If an allocation fails
    then we fall back in the slab allocator according to the zonelists appropriate
    for a certain context.

    This allows a replication of the behavior of alloc_pages and alloc_pages_node
    in the slab layer.

    Currently allocations requested from the page allocator may be redirected via
    cpusets to other nodes. This results in remote pages on nodelists and that in
    turn results in interrupt latency issues during cache draining. Plus the slab
    is handing out memory as local when it is really remote.

    Fallback for slab memory allocations will occur within the slab allocator and
    not in the page allocator. This is necessary in order to be able to use the
    existing pools of objects on the nodes that we fall back to before adding more
    pages to a slab.

    The fallback function ensures that the nodes we fall back to obey cpuset
    restrictions of the current context. We do not allocate objects from outside
    of the current cpuset context like before.

    Note that the implementation of locality constraints within the slab allocator
    requires importing logic from the page allocator. This is a mishmash that is
    not that great. Other allocators (uncached allocator, vmalloc, huge pages)
    face similar problems and have similar minimal reimplementations of the basic
    fallback logic of the page allocator. There is another way of implementing a
    slab by avoiding per-node lists (see the modular slab), but this won't work
    within the existing slab.

    V1->V2:
    - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
    - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
    #ifdef

    [akpm@osdl.org: build fix]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

26 Sep, 2006

5 commits

  • There are many places where we need to determine the node of a zone.
    Currently we use a difficult-to-read sequence of pointer dereferences.
    Put that into an inline function and use it throughout the VM. Maybe we
    can find a way to optimize the lookup in the future.
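
    A hedged sketch of the helper (illustrative; the lookup it wraps is the
    pointer chase described above):

        #include <linux/mmzone.h>

        static inline int example_zone_to_nid(struct zone *zone)
        {
                return zone->zone_pgdat->node_id;
        }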

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • …mory policy restrictions

    Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes. This
    flag is essential if a kernel component requires memory to be located on a
    certain node. It will be needed for alloc_pages_node() to force allocation
    on the indicated node and for alloc_pages() to force allocation on the
    current node.
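
    A minimal usage sketch (a generic illustration, not a hunk from the patch):

        #include <linux/gfp.h>

        /* Allocate strictly on @nid; do not fall back to other nodes. */
        static struct page *example_alloc_on_node(int nid)
        {
                return alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
        }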

    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

    Christoph Lameter
     
  • I wonder why we need this bitmask indexing into zone->node_zonelists[]?

    We always start with the highest zone and then include all lower zones
    if we build zonelists.

    Are there really cases where we need allocation from ZONE_DMA or
    ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
    of highest_zone() makes that already impossible.

    If we go linear on the index then gfp_zone() == highest_zone() and a lot
    of definitions fall by the wayside.

    We can now revert back to the use of gfp_zone() in mempolicy.c ;-)

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After we have done this we can now do some typing cleanup.

    The memory policy layer keeps a policy_zone that specifies
    the zone that gets memory policies applied. This variable
    can now be of type enum zone_type.

    The check_highest_zone function and the build_zonelists function must
    then also take an enum zone_type parameter.

    Plus there are a number of loops over zones that also should use
    zone_type.

    We run into some troubles at some points with functions that need a
    zone_type variable to become -1. Fix that up.

    [pj@sgi.com: fix set_mempolicy() crash]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • There is a check in zonelist_policy() that compares pieces of the bitmap
    obtained from a gfp mask via GFP_ZONETYPES with a zone number.

    The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
    The policy_zone is a zone number with the possible values of ZONE_DMA,
    ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
    of values.

    For some reason this seemed to work before the zone reduction patchset (it
    definitely works on SGI boxes since we just have one zone and the check
    cannot fail).

    With the zone reduction patchset this check definitely fails on systems
    with two zones if the system actually has memory in both zones.

    This is because ZONE_NORMAL is selected using no __GFP flag at
    all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
    is set. __GFP_DMA is 0x01. So gfp_zone(gfpmask) == 1.

    policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
    populated.

    For ZONE_NORMAL gfp_zone() yields 0, which is less than
    policy_zone (== ZONE_NORMAL), and so policy is not applied to regular memory
    allocations!

    Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
    to DMA allocations!

    What we really want in that place is to establish the highest allowable
    zone for a given gfp_mask. If the highest zone is higher than or equal to the
    policy_zone then memory policies need to be applied. We have such
    a highest_zone() function in page_alloc.c.

    So move the highest_zone() function from mm/page_alloc.c into
    include/linux/gfp.h. On the way we simplify the function and use the new
    zone_type that was also introduced with the zone reduction patchset plus we
    also specify the right type for the gfp flags parameter.
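
    A hedged sketch of the corrected test (illustrative only): policies apply
    when the highest zone allowed by the gfp mask reaches policy_zone.

        /* Assumes the relocated highest_zone() helper described above. */
        static inline int example_policy_applies(gfp_t gfp)
        {
                return highest_zone(gfp) >= policy_zone;
        }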

    Signed-off-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

02 Sep, 2006

1 commit

  • Since vma->vm_pgoff is in units of smallpages, VMAs for huge pages have the
    lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in bad
    offsets to the interleave functions. Take this difference from small pages
    into account when calculating the offset. This does add a 0-bit shift into
    the small-page path (via alloc_page_vma()), but I think that is negligible.
    Also add a BUG_ON to prevent the offset from growing due to a negative
    right-shift, which probably shouldn't be allowed anyway.
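
    A hedged sketch of the offset computation (the helper name is illustrative;
    for huge pages shift == HPAGE_SHIFT, and for small pages it reduces to a
    0-bit shift):

        #include <linux/mm.h>

        static unsigned long example_interleave_offset(struct vm_area_struct *vma,
                                                       unsigned long addr, int shift)
        {
                unsigned long off;

                BUG_ON(shift < PAGE_SHIFT);
                /* vm_pgoff is in small-page units: drop the always-zero low
                 * bits, then add addr's offset within the vma in units of
                 * 1 << shift. */
                off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
                off += (addr - vma->vm_start) >> shift;
                return off;
        }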

    Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
    expected.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Adam Litke
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

01 Jul, 2006

1 commit

  • The numa statistics are really event counters. But they are per node and
    so we have had special treatment for these counters through additional
    fields on the pcp structure. We can now use the per zone nature of the
    zoned VM counters to realize these.

    This will shrink the size of the pcp structure on NUMA systems. We will
    have some room to add additional per zone counters that will all still fit
    in the same cacheline.

    Bits    Prior pcp size            Size after patch         We can add
    ----------------------------------------------------------------------
    64      128 bytes (16 words)      80 bytes (10 words)      48
    32       76 bytes (19 words)      56 bytes (14 words)       8 (64 byte cacheline)
                                                               72 (128 byte)

    Remove the special statistics for numa and replace them with zoned vm
    counters. This has the side effect that global sums of these events now
    show up in /proc/vmstat.

    Also take the opportunity to move the zone_statistics() function from
    page_alloc.c into vmstat.c.

    Discussions:
    V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2

    Signed-off-by: Christoph Lameter
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

27 Jun, 2006

1 commit

  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits, this
    pinning continues. With 8K stacks on a 32bit machine the amount pinned per
    file descriptor is about 10K.

    Normally I would figure a reasonable per user process limit is about 100
    processes. With 80 processes, each with 1000 file descriptors, I can trigger
    the OOM killer on a 32bit kernel, because I have pinned about 800MB of
    useless data.

    This patch replaces the struct task_struct pointer with a pointer to a struct
    task_ref which has a struct task_struct pointer, so the pinning of dead
    tasks does not happen.

    The code now has to contend with the fact that the task may now exit at any
    time, which is a little, but not much, more complicated.

    With this change it takes about 1000 processes each opening up 1000 file
    descriptors before I can trigger the OOM killer. Much better.

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

26 Jun, 2006

1 commit

  • Hooks for calling vma specific migration functions

    With this patch a vma may define a vma->vm_ops->migrate function. That
    function may perform page migration on its own (some vmas may not contain
    page structs and therefore cannot be handled by regular page migration, and
    pages in a vma may require special preparatory treatment before migration
    is possible, etc.). Only mmap_sem is held when the migration function is
    called. The
    migrate() function gets passed two sets of nodemasks describing the source and
    the target of the migration. The flags parameter either contains

    MPOL_MF_MOVE which means that only pages used exclusively by
    the specified mm should be moved

    or

    MPOL_MF_MOVE_ALL which means that pages shared with other processes
    should also be moved.

    The migration function returns 0 on success or an error condition. An error
    condition will prevent regular page migration from occurring.
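
    A hedged sketch of the hook's shape as described above (the signature is
    illustrative):

        /* New method in struct vm_operations_struct: */
        int (*migrate)(struct vm_area_struct *vma,
                       const nodemask_t *from, const nodemask_t *to,
                       unsigned long flags);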

    On its own this patch cannot be included since there are no users for this
    functionality. But it seems that the uncached allocator will need this
    functionality at some point.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Jun, 2006

4 commits

  • This patch inserts security_task_movememory hook calls into memory management
    code to enable security modules to mediate this operation between tasks.

    Since the last posting, the hook has been renamed following feedback from
    Christoph Lameter.

    Signed-off-by: David Quigley
    Acked-by: Stephen Smalley
    Signed-off-by: James Morris
    Cc: Andi Kleen
    Acked-by: Christoph Lameter
    Acked-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Quigley
     
  • move_pages() is used to move individual pages of a process. The function can
    be used to determine the location of pages and to move them onto the desired
    node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move,
                    addresses_of_pages[],
                    nodes[] or NULL,
                    status[],
                    flags);

    The addresses of pages is an array of void * pointing to the
    pages to be moved.

    The nodes array contains the node numbers that the pages should be moved
    to. If a NULL is passed instead of an array then no pages are moved but
    the status array is updated. The status request may be used to determine
    the page state before issuing another move_pages() to move pages.

    The status array will contain the state of all individual page migration
    attempts when the function terminates. The status array is only valid if
    move_pages() completed successfully.

    Possible page states in status[]:

    0..MAX_NUMNODES The page is now on the indicated node.

    -ENOENT Page is not present

    -EACCES Page is mapped by multiple processes and can only
    be moved if MPOL_MF_MOVE_ALL is specified.

    -EPERM The page has been mlocked by a process/driver and
    cannot be moved.

    -EBUSY Page is busy and cannot be moved. Try again later.

    -EFAULT Invalid address (no VMA or zero page).

    -ENOMEM Unable to allocate memory on target node.

    -EIO Unable to write back page. The page must be written
    back in order to move it since the page is dirty and the
    filesystem does not provide a migration function that
    would allow the moving of dirty pages.

    -EINVAL A dirty page cannot be moved. The filesystem does not provide
    a migration function and has no ability to write back pages.

    The flags parameter indicates what types of pages to move:

    MPOL_MF_MOVE Move pages that are only mapped by the process.

    MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
    Requires sufficient capabilities.

    Possible return codes from move_pages()

    -ENOENT No pages found that would require moving. All pages
    are either already on the target node, not present, had an
    invalid address or could not be moved because they were
    mapped by multiple processes.

    -EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
    to migrate pages in a kernel thread.

    -EPERM MPOL_MF_MOVE_ALL specified without sufficient privileges,
    or an attempt to move a process belonging to another user.

    -EACCES One of the target nodes is not allowed by the current cpuset.

    -ENODEV One of the target nodes is not online.

    -ESRCH Process does not exist.

    -E2BIG Too many pages to move.

    -ENOMEM Not enough memory to allocate control array.

    -EFAULT Parameters could not be accessed.

    A test program for move_pages() may be found with the patches
    on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
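
    A hedged userspace sketch of calling the new syscall through libnuma's
    <numaif.h> (link with -lnuma): query where a page currently lives, then
    try to move it to node 0.

        #include <numaif.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
                long psz = sysconf(_SC_PAGESIZE);
                void *buf = NULL;
                void *pages[1];
                int target[1] = { 0 };          /* move to node 0 */
                int status[1] = { -1 };

                if (posix_memalign(&buf, psz, psz))
                        return 1;
                ((char *)buf)[0] = 1;           /* fault the page in */
                pages[0] = buf;

                /* nodes == NULL: just report the page's current node */
                if (move_pages(getpid(), 1, pages, NULL, status, 0) == 0)
                        printf("page is on node %d\n", status[0]);

                if (move_pages(getpid(), 1, pages, target, status,
                               MPOL_MF_MOVE) == 0)
                        printf("after move: status %d\n", status[0]);
                else
                        perror("move_pages");

                free(buf);
                return 0;
        }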

    From: Christoph Lameter

    Detailed results for sys_move_pages()

    Pass a pointer to an integer to get_new_page() that may be used to
    indicate where the completion status of a migration operation should be
    placed. This allows sys_move_pages() to report back exactly what happened to
    each page.

    Wish there would be a better way to do this. Looks a bit hacky.

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Instead of passing a list of new pages, pass a function to allocate a new
    page. This allows the correct placement of MPOL_INTERLEAVE pages during page
    migration. It also further simplifies the callers of migrate_pages().
    migrate_pages() becomes similar to migrate_pages_to(), so drop
    migrate_pages_to(). The batching of new page allocations becomes unnecessary.
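
    A hedged sketch of an allocation callback of the shape described above (the
    name is illustrative; 'private' here is assumed to carry the target node):

        #include <linux/gfp.h>

        static struct page *example_new_page(struct page *page,
                                             unsigned long private, int **result)
        {
                /* allocate the replacement page on the requested node */
                return alloc_pages_node((int)private, GFP_HIGHUSER, 0);
        }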

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    Signed-off-by: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Jes Sorensen
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Apr, 2006

1 commit