25 May, 2011

2 commits

  • When CONFIG_TMPFS=n, mpol_to_str() is not declared in mempolicy.h.
    However, in the NUMA case, the definition is always compiled.

    Since it is not strictly true that tmpfs is the only client, and since the
    symbol was always lurking around anyway, export mpol_to_str()
    unconditionally. Furthermore, this will allow us to move show_numa_map()
    out of mempolicy.c and into the procfs subsystem.
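
    In header terms, the change amounts to hoisting one declaration out of the
    CONFIG_TMPFS guard; a sketch, with the signature as it stood in this era:

        /* include/linux/mempolicy.h -- now declared for all NUMA configs,
         * not only when CONFIG_TMPFS is set (sketch, not the full diff) */
        #ifdef CONFIG_NUMA
        extern int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
                               int no_context);
        #endif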

    Signed-off-by: Stephen Wilson
    Cc: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
    get_vma_policy() was marked static as all clients were local to
    mempolicy.c.

    However, the decision to generate /proc/pid/numa_maps in the numa memory
    policy code and outside the procfs subsystem introduces an artificial
    interdependency between the two systems. Exporting get_vma_policy() once
    again is the first step to clean up this interdependency.
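
    A sketch of the declaration that goes back into include/linux/mempolicy.h
    (signature as used by mempolicy.c at this point):

        /* look up the policy covering (task, vma, addr); shared policies
         * come back with an extra reference held */
        struct mempolicy *get_vma_policy(struct task_struct *task,
                                         struct vm_area_struct *vma,
                                         unsigned long addr);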

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     

10 Aug, 2010

1 commit

  • The oom killer presently kills current whenever there is no more memory
    free or reclaimable on its mempolicy's nodes. There is no guarantee that
    current is a memory-hogging task or that killing it will free any
    substantial amount of memory, however.

    In such situations, it is better to scan the tasklist for tasks that are
    allowed to allocate on current's set of nodes and kill the task with the
    highest badness() score. This ensures that the most memory-hogging task,
    or the one configured by the user with /proc/pid/oom_adj, is always
    selected in such scenarios.
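
    A simplified sketch of the intersection test such a scan needs (the patch
    adds a helper along these lines; the name and details here are assumed,
    not the exact diff):

        /* Would tsk's policy let it allocate on any node in 'mask'?
         * Only MPOL_BIND/MPOL_INTERLEAVE truly restrict the nodes;
         * MPOL_PREFERRED may still fall back elsewhere under oom. */
        static bool task_nodemask_intersects(struct task_struct *tsk,
                                             const nodemask_t *mask)
        {
                struct mempolicy *pol;
                bool ret = true;

                if (!mask)
                        return ret;
                task_lock(tsk);
                pol = tsk->mempolicy;
                if (pol && (pol->mode == MPOL_BIND ||
                            pol->mode == MPOL_INTERLEAVE))
                        ret = nodes_intersects(pol->v.nodes, *mask);
                task_unlock(tsk);
                return ret;
        }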

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 May, 2010

1 commit

  • Nick Piggin reported that the allocator may see an empty nodemask when
    changing a cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores. The problem is that the allocator can't find a node
    from which to allocate a page when changing a cpuset's mems, even though
    there is a lot of free memory. The reason is as follows:

    (mpol: mempolicy)
    task1                    task1's mpol    task2
    alloc page               1
      alloc on node0? NO     1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1               set new bits
                             0                 clear disallowed bits
      alloc on node1? NO     0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    Several hours later, OOM will happen even though there is a lot of free
    memory.

    This patchset fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). We use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears the
    newly disallowed nodes only after the read-side task ends its current
    memory allocation.

    This patch:

    In order to fix the case where there is no node to allocate memory from,
    when we want to update the mempolicy and mems_allowed, we expand the set
    of nodes first (set all the newly allowed nodes) and shrink the set of
    nodes lazily (clear the newly disallowed nodes). But the mempolicy's
    rebind functions may break the expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of a cpuset's mems. The 1st step
    expands the set of the mempolicy's nodes; the 2nd step shrinks it. The
    two-step form is used when there is no real lock to protect the mempolicy
    on the read side; otherwise we can do the rebind work at once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,
            MPOL_REBIND_STEP1,
            MPOL_REBIND_STEP2,
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated in two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Otherwise we pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and
    MPOL_REBIND_STEP2 to do the second.

    Besides that, a long time may pass between these two steps, and we have to
    release the lock that protects the mempolicy and mems_allowed. If we take
    the lock once again, we must check whether the current mempolicy is in the
    middle of a rebind (the first step has been done), because the task may
    have allocated a new mempolicy while we did not hold the lock. So we
    define the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.
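
    A sketch of how a caller is expected to use the two steps (the helper
    shown here is hypothetical; the real cpuset-side caller arrives in the
    next patch):

        static void change_task_nodemask(struct task_struct *tsk,
                                         nodemask_t *newmems)
        {
                /* step 1: expand -- mems_allowed holds old and new bits */
                nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
                mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);

                /* ...wait until no read-side allocation is in flight... */

                /* step 2: shrink -- clear the newly disallowed bits */
                mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
                tsk->mems_allowed = *newmems;
        }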

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

16 Dec, 2009

1 commit

  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 Jul, 2008

1 commit

  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.
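
    Roughly the shape of vma_migratable() after the move (a sketch; the exact
    body in include/linux/mempolicy.h at the time may differ slightly):

        #ifdef CONFIG_NUMA
        static inline int vma_migratable(struct vm_area_struct *vma)
        {
                if (vma->vm_flags &
                    (VM_IO | VM_HUGETLB | VM_PFNMAP | VM_RESERVED))
                        return 0;
                /* migration allocates pages in the highest zone; this
                 * policy_zone reference is why the helper is NUMA-only */
                if (vma->vm_file &&
                    gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
                                                        < policy_zone)
                        return 0;
                return 1;
        }
        #endif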

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

28 Apr, 2008

15 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c.

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().

    NOTE 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Now that we're using "preferred local" policy for system default, we need to
    make this as fast as possible. Because of the variable size of the mempolicy
    structure [based on size of nodemasks], the preferred_node may be in a
    different cacheline from the mode. This can result in accessing an extra
    cacheline in the normal case of system default policy. I suspect this is
    the cause of an observed 2-3% slowdown in page fault testing relative to
    the kernel without this patch series.

    To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
    flags member which is guaranteed [?] to be in the same cacheline as the mode
    itself.

    Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
    for both anon and shmem segments with system default and vma [preferred local]
    policy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy referencing counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of a mempolicy, based on (vma, address), need only add a reference
    for a shared policy, and we need only unref the policy when finished for
    shared policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.
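
    A sketch of the conditional put described above (close to, but not
    necessarily byte-for-byte, the helper the patch adds):

        /* drop the lookup reference only for shared policies; vma, task
         * and system-default policies were never extra-referenced */
        static inline void mpol_cond_put(struct mempolicy *pol)
        {
                if (pol && (pol->flags & MPOL_F_SHARED))
                        __mpol_put(pol);
        }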

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • As part of yet another rework of mempolicy reference counting, we want to be
    able to identify shared policies efficiently, because they have an extra ref
    taken on lookup that needs to be removed when we're finished using the policy.

    Note: the extra ref is required because the policies are
    shared between tasks/processes and can be changed/freed
    by one task while another task is using them--e.g., for
    page allocation.

    Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
    indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
    member of the struct mempolicy added by David. MPOL_F_SHARED, and any future
    "internal mode flags" are reserved from bit zero up, as they will never be
    passed in the upper bits of the mode argument of a mempolicy API.

    I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
    rb-tree. Don't need/want to clear the flag when removing from the tree as the
    mempolicy is freed [unref'd] internally to the sp_delete() function. However,
    a task could hold another reference on this mempolicy from a prior lookup. We
    need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
    unref, eventually freeing, the mempolicy.

    A later patch in this series will introduce a function to conditionally unref
    [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the
    only reason] to unref/free a policy via the conditional free.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The terms 'policy' and 'mode' are both used in various places to describe the
    semantics of the value stored in the 'policy' member of struct mempolicy.
    Furthermore, the term 'policy' is used to refer to that member, to the entire
    struct mempolicy and to the more abstract concept of the tuple consisting of a
    "mode" and an optional node or set of nodes. Recently, we have added "mode
    flags" that are passed in the upper bits of the 'mode' [or sometimes,
    'policy'] member of the numa APIs.

    I'd like to resolve this confusion, which perhaps only exists in my mind, by
    renaming the 'policy' member to 'mode' throughout, and fixing up the
    Documentation. Man pages will be updated separately.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Removes the forward declaration of vm_area_struct in linux/mempolicy.h.
    We already get it from the linux/slab.h -> linux/gfp.h include.

    Removes the unused mpol_set_vma_default() macro from linux/mempolicy.h.

    Removes the extern definition of default_policy since it is only referenced,
    as it should be, in mm/mempolicy.c.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
    nodemasks passed via set_mempolicy() or mbind() should be considered relative
    to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped onto
    the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is
    5-7 (the second, third, and fourth node of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

    mems      result
    1-3       1-3
    1-7       2-4
    1,5-6     1,5-6
    1,5-7     5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the currently
    effected nodemask to the relative nodemask.

    This mempolicy mode flag was conceived of by Paul Jackson.
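
    A minimal userspace sketch of the example above (assumes a numaif.h that
    defines MPOL_F_RELATIVE_NODES; link with -lnuma):

        #include <numaif.h>
        #include <stdlib.h>

        int main(void)
        {
                /* nodes 1-3, relative to the task's mems_allowed */
                unsigned long nodemask =
                        (1UL << 1) | (1UL << 2) | (1UL << 3);

                if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
                                  &nodemask, 8 * sizeof(nodemask)) != 0)
                        exit(1);
                /* with mems_allowed = 4-7, allocations interleave over 5-7 */
                return 0;
        }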

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
            ...
            union {
                    nodemask_t cpuset_mems_allowed;
                    nodemask_t user_nodemask;
            } w;
    }

    that stores the nodemask that the user passed when creating the
    mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.

    This creates an interesting side-effect in that it allows the mempolicy
    "intent" to lie dormant, taking no effect until the task has access to the
    node(s) that it desires. For example, if you currently ask for an
    interleaved policy over a set of nodes that you do not have access to, the
    mempolicy is not created and the task continues to use the previous
    policy. With this change, however, it is possible to create the same
    mempolicy; it simply takes effect only once access to nodes in the
    nodemask is acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new(),
    since it is now obsolete. The unused vma_mpol_equal() is also removed.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
            ...
            unsigned short policy;
            unsigned short flags;
    }

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
            ...
    case MPOL_INTERLEAVE:
            ...
    };

    statements or

    if (policy == MPOL_INTERLEAVE) {
            ...
    }

    statements. Such applications would need to use optional mode flags when
    calling set_mempolicy() or mbind() for these previously implemented statements
    to stop working. If an application does start using optional mode flags, it
    will need to mask the optional flags off the policy in switch and conditional
    statements that only test mode.
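
    A userspace sketch of the masking such an application would do (assumes a
    numaif.h that defines the MPOL_F_*_NODES flags; link with -lnuma):

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
                int policy;

                if (get_mempolicy(&policy, NULL, 0, NULL, 0) == 0) {
                        /* strip the optional mode flags before testing */
                        int mode = policy &
                            ~(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES);

                        if (mode == MPOL_INTERLEAVE)
                                printf("interleave policy in effect\n");
                }
                return 0;
        }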

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
    MPOL_INTERLEAVE, are better declared as part of an enum since they are
    sequentially numbered and cannot be combined.

    The policy member of struct mempolicy is also converted from type short to
    type unsigned short. A negative policy does not have any legitimate meaning,
    so it is possible to change its type in preparation for adding optional mode
    flags later.

    The equivalent member of struct shmem_sb_info is also changed from int to
    unsigned short.

    For compatibility, the policy formal to get_mempolicy() remains as a pointer
    to an int:

    int get_mempolicy(int *policy, unsigned long *nmask,
                      unsigned long maxnode, unsigned long addr,
                      unsigned long flags);

    although the only possible values are in the range of type unsigned short.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
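
    The new entry point, per the description above (declaration sketch):

        /* like __alloc_pages(), but each zone in the zonelist is also
         * filtered against 'nodemask'; MPOL_BIND passes its nodemask
         * here instead of building a custom zonelist */
        struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                            struct zonelist *zonelist,
                                            nodemask_t *nodemask);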

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce a node_zonelist() helper function. It is used to lookup the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems.
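
    A sketch of the helper (at this point in the series each node still has
    one zonelist per zone type, so the GFP mask picks which one; the exact
    body is assumed):

        static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
        {
                return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
        }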

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Oct, 2007

1 commit

  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

2 commits

  • This patch contains the following cleanups:
    - every file should include the headers containing the prototypes for
    its global functions
    - make the following needlessly global functions static:
    - migrate_to_node()
    - do_mbind()
    - sp_alloc()
    - mpol_rebind_policy()

    [akpm@linux-foundation.org: fix uninitialised var warning]
    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Allow an application to query the memories allowed by its context.

    Updated numa_memory_policy.txt to mention that applications can use this to
    obtain allowed memories for constructing valid policies.

    TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
    wrapper--e.g., numa_get_mems_allowed() ?

    Also, update numa syscall man pages.

    Tested with memtoy V>=0.13.
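
    A userspace sketch of the query (assumes a numaif.h that defines
    MPOL_F_MEMS_ALLOWED; link with -lnuma):

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long mask[16] = { 0 };  /* room for 1024 nodes */

                if (get_mempolicy(NULL, mask, sizeof(mask) * 8, NULL,
                                  MPOL_F_MEMS_ALLOWED) == 0)
                        printf("mems_allowed[0] = 0x%lx\n", mask[0]);
                return 0;
        }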

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

20 Sep, 2007

1 commit

  • This patch proposes fixes to the reference counting of memory policy in the
    page allocation paths and in show_numa_map(). Extracted from my "Memory
    Policy Cleanups and Enhancements" series as stand-alone.

    Shared policy lookup [shmem] has always added a reference to the policy,
    but this was never unrefed after page allocation or after formatting the
    numa map data.

    Default system policy should not require additional ref counting, nor
    should the current task's task policy. However, show_numa_map() calls
    get_vma_policy() to examine what may be [likely is] another task's policy.
    The latter case needs protection against freeing of the policy.

    This patch adds a reference count to a mempolicy returned by
    get_vma_policy() when the policy is a vma policy or another task's
    mempolicy. Again, shared policy is already reference counted on lookup. A
    matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
    shared and vma policies, and in show_numa_map() for shared and another
    task's mempolicy. We can call __mpol_free() directly, saving an admittedly
    inexpensive inline NULL test, because we know we have a non-NULL policy.

    Handling policy ref counts for hugepages is a bit trickier.
    huge_zonelist() returns a zone list that might come from a shared or vma
    'BIND' policy. In this case, we should hold the reference until after the
    huge page allocation in dequeue_hugepage(). The patch modifies
    huge_zonelist() to return a pointer to the mempolicy if it needs to be
    unref'd after allocation.

    Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

                 w/o patch             w/ refcount patch
                 Avg      Std Devn     Avg      Std Devn
    Real:        100.59   0.38         100.63   0.43
    User:        1209.60  0.37         1209.91  0.31
    System:      81.52    0.42         81.64    0.34

    Signed-off-by: Lee Schermerhorn
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

23 Aug, 2007

1 commit

  • The NUMA layer only supports NUMA policies for the highest zone. When
    ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
    ZONE_MOVABLE. The result is that policies are only applied to allocations
    like anonymous pages and page cache allocated from ZONE_MOVABLE when the
    zone is used.

    This patch applies policies to the two highest zones when the highest zone
    is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
    zone, it's always functionally equivalent.
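
    The check ends up looking roughly like this (a sketch of the mempolicy.h
    helper after the patch):

        /* ZONE_MOVABLE is never recorded as the policy zone, so policies
         * keep applying to the highest "real" zone beneath it */
        static inline void check_highest_zone(enum zone_type k)
        {
                if (k > policy_zone && k != ZONE_MOVABLE)
                        policy_zone = k;
        }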

    The patch has been tested on a variety of machines both NUMA and non-NUMA
    covering x86, x86_64 and ppc64. No abnormal results were seen in
    kernbench, tbench, dbench or hackbench. It passes regression tests from
    the numactl package with and without kernelcore= once numactl tests are
    patched to wait for vmstat counters to update.

    akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
    ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge
    the mobility or get the other solution that Mel is working on. That solution
    would only use a single zonelist per node and filter on the fly. That may
    help performance and also help to make memory policies work better."

    Signed-off-by: Mel Gorman
    Acked-by: Lee Schermerhorn
    Tested-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Jul, 2007

1 commit

  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.
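
    Usage sketch (standard sysctl path assumed):

    # echo 1 > /proc/sys/vm/hugepages_treat_as_movable
    # echo 64 > /proc/sys/vm/nr_hugepages

    after which the additional pool pages are allocated from ZONE_MOVABLE.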

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Oct, 2006

1 commit

  • Mistyped an ifdef CONFIG_CPUSETS - fixed.

    I doubt that anyone ever noticed. The impact of this typo was
    that if someone:
    1) was using MPOL_BIND to force off node allocations
    2) while using cpusets to constrain memory placement
    3) when that cpuset was migrating that job's memory
    4) while the tasks in that job were actively forking
    then there was a rare chance that future allocations using
    that MPOL_BIND policy would be node local, not off node.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

26 Sep, 2006

1 commit

  • After we have done this we can now do some typing cleanup.

    The memory policy layer keeps a policy_zone that specifies
    the zone that gets memory policies applied. This variable
    can now be of type enum zone_type.

    The check_highest_zone() function and the build_zonelists() function must
    then also take an enum zone_type parameter.

    Plus there are a number of loops over zones that also should use
    zone_type.

    We run into some troubles at some points with functions that need a
    zone_type variable to become -1. Fix that up.

    [pj@sgi.com: fix set_mempolicy() crash]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

21 Jun, 2006

1 commit

  • * git://git.infradead.org/hdrcleanup-2.6: (63 commits)
    [S390] __FD_foo definitions.
    Switch to __s32 types in joystick.h instead of C99 types for consistency.
    Add to headers included for userspace in
    Move inclusion of out of user scope in asm-x86_64/mtrr.h
    Remove struct fddi_statistics from user view in
    Move user-visible parts of drivers/s390/crypto/z90crypt.h to include/asm-s390
    Revert include/media changes: Mauro says those ioctls are only used in-kernel(!)
    Include and use __uXX types in
    Use __uXX types in , include too
    Remove private struct dx_hash_info from public view in
    Include and use __uXX types in
    Use __uXX types in for struct divert_blk et al.
    Use __u32 for elf_addr_t in , not u32. It's user-visible.
    Remove PPP_FCS from user view in , remove __P mess entirely
    Use __uXX types in user-visible structures in
    Don't use 'u32' in user-visible struct ip_conntrack_old_tuple.
    Use __uXX types for S390 DASD volume label definitions which are user-visible
    S390 BIODASDREADCMB ioctl should use __u64 not u64 type.
    Remove unneeded inclusion of from
    Fix private integer types used in V4L2 ioctls.
    ...

    Manually resolve conflict in include/linux/mtd/physmap.h

    Linus Torvalds
     

09 Jun, 2006

1 commit

  • From: Ralf Baechle

    <linux/mempolicy.h> uses struct mm_struct and relies on a definition or
    declaration somehow magically being dragged in, which may result in a
    build error:

    [...]
    CC mm/mempolicy.o
    In file included from mm/mempolicy.c:69:
    include/linux/mempolicy.h:150: warning: ‘struct mm_struct’ declared inside parameter list
    include/linux/mempolicy.h:150: warning: its scope is only this definition or declaration, which is probably not what you want
    include/linux/mempolicy.h:175: warning: ‘struct mm_struct’ declared inside parameter list
    mm/mempolicy.c:622: error: conflicting types for ‘do_migrate_pages’
    include/linux/mempolicy.h:175: error: previous declaration of ‘do_migrate_pages’ was here
    mm/mempolicy.c:1661: error: conflicting types for ‘mpol_rebind_mm’
    include/linux/mempolicy.h:150: error: previous declaration of ‘mpol_rebind_mm’ was here
    make[1]: *** [mm/mempolicy.o] Error 1
    make: *** [mm] Error 2
    [ralf@denk linux-ip35]$

    Including <linux/sched.h> is a step in the direction of include hell, so
    this is fixed by adding a forward declaration of struct mm_struct instead.
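
    The fix itself is a single line near the top of the header:

        /* forward declaration -- avoids pulling in all of sched.h */
        struct mm_struct;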

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

24 Mar, 2006

1 commit

  • The hooks in the slab cache allocator code path for support of NUMA
    mempolicies and cpuset memory spreading are in an important code path. Many
    systems will use neither feature.

    This patch optimizes those hooks down to a single check of some bits in
    the current task's task_struct flags. For non-NUMA systems, this hook and
    related code is already ifdef'd out.

    The optimization is done by using another task flag, set if the task is using
    a non-default NUMA mempolicy. Taking this flag bit along with the
    PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
    memory spreading' patch set, one can check for the combination of any of these
    special case memory placement mechanisms with a single test of the current
    task's task_struct flags.
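
    A sketch of the consolidated fast-path check in the slab allocator (flag
    and helper names as introduced by this series; PF_SPREAD_PAGE plays the
    same role on the page cache side, and the surrounding code is elided):

        /* one branch covers mempolicy and slab spreading */
        if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
                objp = alternate_node_alloc(cachep, flags);
                if (objp != NULL)
                        return objp;
        }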

    This patch also tightens up the code, to save a few bytes of kernel text
    space, and moves some of it out of line. Due to the nested inlines called
    from multiple places, we were ending up with three copies of this code, which
    once we get off the main code path (for local node allocation) seems a bit
    wasteful of instruction memory.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

19 Jan, 2006

1 commit

  • This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
    imbalance in memory allocation during bootup.

    The slab allocator in 2.6.13 is not numa aware and simply calls
    alloc_pages(). This means that memory policies may control the behavior of
    alloc_pages(). During bootup the memory policy is set to MPOL_INTERLEAVE
    resulting in the spreading out of allocations during bootup over all
    available nodes. The slab allocator in 2.6.13 has only a single list of
    slab pages. As a result the per cpu slab cache and the spinlock controlled
    page lists may contain slab entries from off node memory. The slab
    allocator in 2.6.13 makes no effort to discern the locality of an entry on
    its lists.

    The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
    explicitly by calling alloc_pages_node(). The NUMA slab allocator manages
    slab entries by having lists of available slab pages for each node. The
    per cpu slab cache can only contain slab entries associated with the node
    local to the processor. This guarantees that the default allocation mode
    of the slab allocator always assigns local memory if available.

    Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
    anymore. In 2.6.14 all node unspecific slab allocations are performed on
    the boot processor. This means that most of key data structures are
    allocated on one node. Most processors will have to refer to these
    structures making the boot node a potential bottleneck. This may reduce
    performance and cause unnecessary memory pressure on the boot node.

    This patch implements NUMA policies in the slab layer. There is the need
    for explicit application of NUMA memory policies by the slab allocator
    itself, since the NUMA slab allocator no longer lets the page allocator
    control locality.

    The check for policies is made directly at the beginning of __cache_alloc
    using current->mempolicy. The memory policy is already frequently checked
    by the page allocator (alloc_page_vma() and alloc_page_current()). So it
    is highly likely that the cacheline is present. For MPOL_INTERLEAVE
    kmalloc() will spread out each request to one node after another so that an
    equal distribution of allocations can be obtained during bootup.
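
    A sketch of that check at the top of __cache_alloc (the node-picking
    function is the one renamed in V2; the exact code is assumed):

        if (unlikely(current->mempolicy && !in_interrupt())) {
                int nid = slab_node(current->mempolicy);

                if (nid != numa_node_id())
                        return __cache_alloc_node(cachep, flags, nid);
        }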

    It is not possible to push the policy check to lower layers of the NUMA
    slab allocator since the per cpu caches are now only containing slab
    entries from the current node. If the policy says that the local node is
    not to be preferred, or is forbidden, then there is no point in checking the
    slab cache or local list of slab pages. The allocation better be directed
    immediately to the lists containing slab entries for the allowed set of
    nodes.

    This way of applying policy also fixes another strange behavior in 2.6.13.
    alloc_pages() is controlled by the memory allocation policy of the current
    process. It could therefore be that one process is running with
    MPOL_INTERLEAVE and would, e.g., obtain a new page following that policy
    since no slab entries are in the lists anymore. A page can typically be
    used for multiple slab entries, but let's say that the current process is
    only using one. The other entries are then added to the slab lists. These
    are now non-local entries in the slab lists despite the possible
    availability of local pages that would provide faster access and increase
    the performance of the application.

    Another process without MPOL_INTERLEAVE may now run and expect a local slab
    entry from kmalloc(). However, there are still these free slab entries
    from the off node page obtained from the other process via MPOL_INTERLEAVE
    in the cache. The process will then get an off node slab entry although
    other slab entries may be available that are local to that process. This
    means that the policy of one process may contaminate the locality of the
    slab caches for other processes.

    This patch in effect ensures that a per-process policy is followed for the
    allocation of slab entries and that there cannot be a memory policy
    influence from one process to another. A process with default policy will
    always get a local slab entry if one is available. And the process using
    memory policies will get its memory arranged as requested. Off-node slab
    allocation will require the use of spinlocks and will make the use of per
    cpu caches not possible. A process using memory policies to redirect
    allocations offnode will have to cope with additional lock overhead in
    addition to the latency added by the need to access a remote slab entry.

    Changes V1->V2
    - Remove #ifdef CONFIG_NUMA by moving forward declaration into
    prior #ifdef CONFIG_NUMA section.

    - Give the function determining the node number to use a saner
    name.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.
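
    Usage sketch, using the options named above (the exact syntax is per the
    patch's documentation update and may differ):

    # mount -t tmpfs -o mpol=interleave,mpol_nodelist=0-3 tmpfs /mnt/tmp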

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

09 Jan, 2006

5 commits

  • Fix more of a longstanding bug in cpuset/mempolicy interaction.

    NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's
    cpuset to just the Memory Nodes allowed by that cpuset. The kernel
    maintains internal state for each mempolicy, tracking what nodes are used
    for the MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

    When a task's cpuset memory placement changes, whether because the cpuset
    changed, or because the task was attached to a different cpuset, then the
    task's mempolicies have to be rebound to the new cpuset placement, so as
    to preserve the cpuset-relative numbering of the nodes in that policy.

    An earlier fix handled such mempolicy rebinding for mempolicies attached to a
    task.

    This fix rebinds mempolicies attached to vma's (address ranges in a task's
    address space). Due to the need to hold the task->mm->mmap_sem semaphore
    while updating vma's, the rebinding of vma mempolicies has to be done when
    the cpuset memory placement is changed, at which time mmap_sem can be
    safely acquired. The task's mempolicy is rebound later, when the task next
    attempts to allocate memory and notices that its
    task->cpuset_mems_generation is out-of-date with its cpuset's
    mems_generation.

    Because walking the tasklist to find all tasks attached to a changing cpuset
    requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
    affected tasks while doing the tasklist scan. In general, one cannot acquire
    a semaphore (which can sleep) while already holding a spinlock (such as
    tasklist_lock). So a list of mm references has to be built up during the
    tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
    acquired, and the vma's in that mm rebound.

    Once the tasklist lock is dropped, affected tasks may fork new tasks, before
    their mm's are rebound. A kernel global 'cpuset_being_rebound' is set to
    point to the cpuset being rebound (there can only be one; cpuset modifications
    are done under a global 'manage_sem' semaphore), and the mpol_copy code that
    is used to copy a task's mempolicies during fork catches such forking tasks,
    and ensures their children are also rebound.

    When a task is moved to a different cpuset, it is easier, as there is only
    one task involved. Its mm->vma's are scanned, using the same
    mpol_rebind_policy() as used above.

    It may happen that both the mpol_copy hook and the update done via the
    tasklist scan update the same mm twice. This is ok, as the mempolicies of
    each vma in an mm keep track of what mems_allowed they are relative to, and
    safely no-op a second request to rebind to the same nodes.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Cleanup, reorganize and make more robust the mempolicy.c code to rebind
    mempolicies relative to the containing cpuset after a task's memory placement
    changes.

    The real motivator for this cleanup patch is to lay more groundwork for the
    upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
    after the containing cpuset memory placement changes.

    NUMA mempolicies are constrained by the cpuset their task is a member of.
    When either (1) a task is moved to a different cpuset, or (2) the 'mems'
    mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
    node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
    be recalculated, relative to their new cpuset placement.

    The old code used an unreliable method of determining what was the old
    mems_allowed constraining the mempolicy. It just looked at the task's
    mems_allowed value. This sort of worked with the present code, which just
    rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
    referring to the old nodes. But in an upcoming patch, the vma mempolicies
    will be rebound as well. Then the order in which the various task and vma
    mempolicies are updated will no longer be deterministic, and one can no longer
    count on the task->mems_allowed holding the old value for as long as needed.
    It's not even clear if the current code was guaranteed to work reliably for
    task mempolicies.

    So I added a mems_allowed field to each mempolicy, stating exactly what
    mems_allowed the policy is relative to, and updated synchronously and reliably
    anytime that the mempolicy is rebound.

    Also removed a useless wrapper routine, numa_policy_rebind(), and had its
    caller, cpuset_update_task_memory_state(), call directly to the rewritten
    policy_rebind() routine, and made that rebind routine extern instead of
    static, and added a "mpol_" prefix to its name, making it
    mpol_rebind_policy().

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Since the numa_maps functionality is now in mempolicy.c, we no longer
    need to export get_vma_policy().

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add a boolean "memory_migrate" to each cpuset, represented by a file
    containing "0" or "1" in each directory below /dev/cpuset.

    It defaults to false (file contains "0"). It can be set true by writing
    "1" to the file.

    If true, then anytime that a task is attached to the cpuset so marked, the
    pages of that task will be moved to that cpuset, preserving, to the extent
    practical, the cpuset-relative placement of the pages.

    Also anytime that a cpuset so marked has its memory placement changed (by
    writing to its "mems" file), the tasks in that cpuset will have their pages
    moved to the cpusets new nodes, preserving, to the extent practical, the
    cpuset-relative placement of the moved pages.
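
    Usage sketch, per the description above ('mycpuset' is a hypothetical
    cpuset directory):

    # echo 1 > /dev/cpuset/mycpuset/memory_migrate
    # echo $$ > /dev/cpuset/mycpuset/tasks

    after which the attached task's pages are moved to mycpuset's mems.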

    Signed-off-by: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • sys_migrate_pages implementation using swap based page migration

    This is the original API proposed by Ray Bryant in his posts during the first
    half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

    The intent of sys_migrate_pages is to migrate the memory of a process. A
    process may have migrated to another node. Memory was allocated optimally
    for the prior context. sys_migrate_pages allows shifting the memory to the
    new node.

    sys_migrate_pages is also useful if a process's available memory nodes
    have changed through cpuset operations, to manually move the process's
    memory. Paul
    Jackson is working on an automated mechanism that will allow an automatic
    migration if the cpuset of a process is changed. However, a user may decide
    to manually control the migration.

    This implementation is put into the policy layer since it uses concepts and
    functions that are also needed for mbind and friends. The patch also provides
    a do_migrate_pages function that may be useful for cpusets to automatically
    move memory. sys_migrate_pages does not modify policies in contrast to Ray's
    implementation.

    The current code here is based on the swap based page migration capability and
    thus is not able to preserve the physical layout relative to it containing
    nodeset (which may be a cpuset). When direct page migration becomes available
    then the implementation needs to be changed to do a isomorphic move of pages
    between different nodesets. The current implementation simply evicts all
    pages in source nodeset that are not in the target nodeset.

    Patch supports ia64, i386 and x86_64.
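
    A userspace sketch of the call, via the wrapper as it later appears in
    libnuma (assumes a numaif.h declaring migrate_pages(); the pid is
    hypothetical):

        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long from = 1UL << 0;   /* source: node 0 */
                unsigned long to   = 1UL << 1;   /* target: node 1 */
                int pid = 1234;                  /* hypothetical target pid */

                if (migrate_pages(pid, 8 * sizeof(from), &from, &to) < 0)
                        perror("migrate_pages");
                return 0;
        }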

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter