25 Jul, 2008

1 commit

  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (e.g. huge page
    size, number of huge pages currently allocated, etc.).

    The hstate structure is then passed around to the code that requires these
    fields; callers will do the right thing regardless of the exact hstate they
    are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.
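
    For illustration, a minimal sketch of the idea (field names abbreviated;
    the actual struct in mm/hugetlb.c carries more state, e.g. free lists and
    per-node counters):

    struct hstate {
        unsigned int order;            /* huge page size is PAGE_SIZE << order */
        unsigned long mask;            /* mask for huge-page alignment */
        unsigned long nr_huge_pages;   /* huge pages currently allocated */
        unsigned long free_huge_pages; /* huge pages on the free lists */
        /* ... */
    };

    static struct hstate default_hstate;

    /* helpers then take an hstate instead of using global constants, e.g.: */
    static inline unsigned long huge_page_size(struct hstate *h)
    {
        return (unsigned long)PAGE_SIZE << h->order;
    }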

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Jul, 2008

1 commit

  • Flags considered internal to the mempolicy kernel code are stored as part
    of the "flags" member of struct mempolicy.

    Before exposing a policy type to userspace via get_mempolicy(), these
    internal flags must be masked. Flags exposed to userspace, however,
    should still be returned to the user.
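
    For illustration, the masking in do_get_mempolicy() amounts to something
    like the following sketch, where MPOL_MODE_FLAGS is the mask of
    user-visible optional mode flags:

    *policy = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
    /*
     * Internal flags such as MPOL_F_SHARED must be masked off before
     * exposing the policy to userspace; the user-visible mode flags
     * are passed through unchanged.
     */
    *policy |= (pol->flags & MPOL_MODE_FLAGS);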

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

28 Apr, 2008

24 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().
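
    For reference, the reworked interfaces look roughly like:

    int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context);
    int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
                    int no_context);
    void mpol_shared_policy_init(struct shared_policy *sp,
                                 struct mempolicy *mpol);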

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent
    to "local allocation". Because shared policies are at the same "scope" level
    as vma policies [see Documentation/vm/numa_memory_policy.txt], MPOL_DEFAULT
    means "fall back to current task policy".

    This patch extends the memory policy string parsing function to accept
    "local" for MPOL_PREFERRED + MPOL_F_LOCAL. This allows one to specify local
    allocation as the default policy for shared memory areas via the tmpfs mpol
    mount option, regardless of the current task's policy.

    "local" is already displayed for this policy, so the parser now accepts the
    same format that the display produces.
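
    For example, local allocation can then be requested at mount time
    (mount point illustrative):

    mount -t tmpfs -o mpol=local tmpfs /mnt/tmp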

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().
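
    For example, the policy portion of a numa_maps line that previously
    displayed as (illustrative):

    interleave=static=0-3

    now displays as:

    interleave=static:0-3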

    Note 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mpol_to_str() formats memory policies into printable strings. Currently this
    is only used to display "numa_maps". A subsequent patch will use
    mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to
    remove essentially duplicate code in mm/shmem.c. This patch cleans up
    mpol_to_str() generally and in preparation for that patch.

    1) show_numa_maps() is not checking the return code from mpol_to_str().
    There's not a lot we can do in this context if mpol_to_str() did return the
    error [insufficient space in buffer]. Proposed "solution": just check,
    under DEBUG_VM, that callers are providing sufficient buffer space for the
    policy, flags, and a few nodes. This way, we'll get some display.
    show_numa_maps() is providing a 50-byte buffer, so it won't trip this
    check. 50-bytes should be sufficient unless one has a large number of
    nodes in a very sparse nodemask.

    2) The display of the new mode flags ["static" & "relative"] was set up to
    display multiple flags, separated by a "bar" '|'. However, this support is
    incomplete--e.g., need_bar was never incremented; and currently, these two
    flags are mutually exclusive. So remove the "bar" support, for now, and
    only display one flag.

    3) Use snprintf() to format flags, so as not to overflow the buffer. Not
    that it's ever happened, AFAIK.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Now that we're using "preferred local" policy for system default, we need to
    make this as fast as possible. Because of the variable size of the mempolicy
    structure [based on size of nodemasks], the preferred_node may be in a
    different cacheline from the mode. This can result in accessing an extra
    cacheline in the normal case of system default policy. Suspect this is the
    cause of an observed 2-3% slowdown in page fault testing relative to kernel
    without this patch series.

    To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
    flags member which is guaranteed [?] to be in the same cacheline as the mode
    itself.
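
    As an illustration (hypothetical helper name; the real change adjusts the
    existing allocation-path switch statements):

    /* resolve the node for MPOL_PREFERRED without touching v.preferred_node,
     * which may sit in a different cacheline, in the common "local" case
     */
    static int preferred_node(struct mempolicy *pol)
    {
        if (pol->flags & MPOL_F_LOCAL)     /* same cacheline as pol->mode */
            return numa_node_id();         /* allocate on the local node */
        return pol->v.preferred_node;
    }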

    Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
    for both anon and shmem segments with system default and vma [preferred local]
    policy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
    v.preferred_node < 0 -- i.e., "local allocation":

    1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
    to fetch the nodemask associated with a policy. Currently,
    get_policy_nodemask() returns the set of nodes with memory, when
    the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
    Change to return an empty nodemask, as this is what was specified
    to achieve "local allocation".

    2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
    called to adjust any task and vma policy nodes to be valid in the
    new cpuset. However, when the policy is MPOL_PREFERRED, and the
    preferred_node is < 0 ["local allocation"], there is nothing to
    remap: local allocation remains valid in any cpuset, so skip it.
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
    [set_mempolicy(), mbind() and internal versions], the kernel simply installs a
    NULL struct mempolicy pointer in the appropriate context: task policy, vma
    policy, or shared policy. This causes any use of that policy to "fall back"
    to the next most specific policy scope.

    The only use of MPOL_DEFAULT to mean "local allocation" is in the system
    default policy. This requires extra checks/cases for MPOL_DEFAULT in many
    mempolicy.c functions.

    There is another, "preferred" way to specify local allocation via the APIs.
    That is using the MPOL_PREFERRED policy mode with an empty nodemask.
    Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
    All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
    node local to the cpu where the allocation occurs.
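
    For example, a task can request local allocation explicitly with an
    empty nodemask (userspace sketch):

    #include <numaif.h>

    /* MPOL_PREFERRED with an empty/NULL nodemask means "local allocation";
     * it is recorded internally as preferred_node == -1
     */
    set_mempolicy(MPOL_PREFERRED, NULL, 0);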

    System default policy, except during boot, is hard-coded to "local
    allocation". By using the MPOL_PREFERRED mode with a negative value of
    preferred node for system default policy, MPOL_DEFAULT will never occur in the
    'policy' member of a struct mempolicy. Thus, we can remove all checks for
    MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
    paths.

    In slab_node() return local node id when policy pointer is NULL. No need to
    set a pol value to take the switch default. Replace switch default with
    BUG()--i.e., shouldn't happen.

    With this patch MPOL_DEFAULT is only used in the APIs, including internal
    calls to do_set_mempolicy() and in the display of policy in
    /proc/<pid>/numa_maps. It always means "fall back" to the next most
    specific policy scope. This simplifies the description of memory policies
    quite a bit, with no visible change in behavior.

    get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
    the requested policy [task or vma/shared] is NULL. These are the values one
    would supply via set_mempolicy() or mbind() to achieve that condition--default
    behavior.

    This patch updates Documentation to reflect this change.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy referencing counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of mempolicy, based on (vma, address) need only add a reference for
    shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.
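
    For illustration, the conditional put/copy helpers look roughly like
    this sketch:

    static inline int mpol_needs_cond_ref(struct mempolicy *pol)
    {
        return (pol && (pol->flags & MPOL_F_SHARED));
    }

    /* drop the reference taken on lookup, but only for shared policies */
    static inline void mpol_cond_put(struct mempolicy *pol)
    {
        if (mpol_needs_cond_ref(pol))
            __mpol_put(pol);
    }

    /* copy a shared policy to a caller-supplied [on-stack] mempolicy that
     * will not require a conditional unref
     */
    static inline struct mempolicy *mpol_cond_copy(struct mempolicy *to,
                                                   struct mempolicy *from)
    {
        if (!mpol_needs_cond_ref(from))
            return from;
        *to = *from;
        to->flags &= ~MPOL_F_SHARED;   /* the copy needs no unref */
        __mpol_put(from);              /* drop the lookup reference */
        return to;
    }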

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • As part of yet another rework of mempolicy reference counting, we want to be
    able to identify shared policies efficiently, because they have an extra ref
    taken on lookup that needs to be removed when we're finished using the policy.

    Note: the extra ref is required because the policies are
    shared between tasks/processes and can be changed/freed
    by one task while another task is using them--e.g., for
    page allocation.

    Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
    indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
    member of the struct mempolicy added by David. MPOL_F_SHARED, and any future
    "internal mode flags" are reserved from bit zero up, as they will never be
    passed in the upper bits of the mode argument of a mempolicy API.

    I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
    rb-tree. Don't need/want to clear the flag when removing from the tree as the
    mempolicy is freed [unref'd] internally to the sp_delete() function. However,
    a task could hold another reference on this mempolicy from a prior lookup. We
    need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
    unref, eventually freeing, the mempolicy.

    A later patch in this series will introduce a function to conditionally unref
    [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the
    only reason] to unref/free a policy via the conditional free.
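
    For illustration:

    /* internal flags are allocated from bit 0 up, so they can never collide
     * with the optional mode flags passed in the upper bits of the mode
     * argument of the mempolicy APIs
     */
    #define MPOL_F_SHARED  (1 << 0)    /* identify shared policies */

    /* set when the policy is installed in the shared policy rb-tree: */
    new->flags |= MPOL_F_SHARED;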

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The terms 'policy' and 'mode' are both used in various places to describe the
    semantics of the value stored in the 'policy' member of struct mempolicy.
    Furthermore, the term 'policy' is used to refer to that member, to the entire
    struct mempolicy and to the more abstract concept of the tuple consisting of a
    "mode" and an optional node or set of nodes. Recently, we have added "mode
    flags" that are passed in the upper bits of the 'mode' [or sometimes,
    'policy'] member of the numa APIs.

    I'd like to resolve this confusion, which perhaps only exists in my mind, by
    renaming the 'policy' member to 'mode' throughout, and fixing up the
    Documentation. Man pages will be updated separately.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • get_vma_policy() is not handling fallback to task policy correctly when the
    get_policy() vm_op returns NULL. The NULL overwrites the 'pol' variable that
    was holding the fallback task mempolicy. So, it was falling back directly to
    system default policy.

    Fix get_vma_policy() to use only non-NULL policy returned from the vma
    get_policy op.

    shm_get_policy() was falling back to current task's mempolicy if the "backing
    file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
    the vma policy is null. This is incorrect for show_numa_maps() which is
    likely querying the numa_maps of some task other than current. Remove this
    fallback.
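
    The fixed lookup, roughly (a sketch of the resulting logic):

    static struct mempolicy *get_vma_policy(struct task_struct *task,
                    struct vm_area_struct *vma, unsigned long addr)
    {
        struct mempolicy *pol = task->mempolicy;   /* fallback */

        if (vma) {
            if (vma->vm_ops && vma->vm_ops->get_policy) {
                struct mempolicy *vpol = vma->vm_ops->get_policy(vma, addr);
                if (vpol)
                    pol = vpol;   /* NULL no longer clobbers the fallback */
            } else if (vma->vm_policy)
                pol = vma->vm_policy;
        }
        if (!pol)
            pol = &default_policy;
        return pol;
    }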

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
    while examining each vma's mempolicy. A vma's mempolicy can fall back to the
    task's policy. However, the task could be changing its task policy and free
    the one that show_numa_maps() is examining.

    To prevent this, grab the mmap_sem for write when updating task mempolicy.
    Pointed out to me by Christoph Lameter and extracted and reworked from
    Christoph's alternative mempol reference counting patch.

    This is analogous to the way that do_mbind() and do_get_mempolicy() prevent
    races between tasks sharing an mm_struct [a.k.a. threads] setting and
    querying a mempolicy for a particular address.

    Note: this is necessary, but not sufficient, to allow us to stop taking an
    extra reference on "other task's mempolicy" in get_vma_policy. Subsequent
    patches will complete this update, allowing us to simplify the tests for
    whether we need to unref a mempolicy at various points in the code.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for
    MPOL_PREFERRED policies that were created with an empty nodemask (for purely
    local allocations). Such policies are never invalidated by changes to the
    task's allowed mems and never need rebinding relative to a cpuset's
    placement.

    Also fixes a bug identified by Lee Schermerhorn that disallowed empty
    nodemasks to be passed to MPOL_PREFERRED to specify local allocations. [A
    different, somewhat incomplete, patch already existed in 25-rc5-mm1.]

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Randy Dunlap
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Create a mempolicy_operations structure that currently points to two
    functions[*] for the various modes:

    int (*create)(struct mempolicy *, const nodemask_t *);
    void (*rebind)(struct mempolicy *, const nodemask_t *);

    This splits the implementation for the various modes out of two large
    functions, mpol_new() and mpol_rebind_policy(). Eventually it may be
    beneficial to add additional functions to accommodate the existing switch()
    statements in mm/mempolicy.c.

    [*] The ->create() function for MPOL_DEFAULT is currently NULL since no
    struct mempolicy is dynamically allocated.
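
    The resulting table looks roughly like this (handler names illustrative):

    static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
        [MPOL_DEFAULT] = {
            .rebind = mpol_rebind_default,   /* ->create is NULL, see [*] */
        },
        [MPOL_INTERLEAVE] = {
            .create = mpol_new_interleave,
            .rebind = mpol_rebind_nodemask,
        },
        [MPOL_PREFERRED] = {
            .create = mpol_new_preferred,
            .rebind = mpol_rebind_preferred,
        },
        [MPOL_BIND] = {
            .create = mpol_new_bind,
            .rebind = mpol_rebind_nodemask,
        },
    };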

    [Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests]
    Signed-off-by: David Rientjes
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Lee Schermerhorn
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid
    having to declare function prototypes.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
    nodemasks passed via set_mempolicy() or mbind() should be considered relative
    to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped onto
    the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the effected nodemask is
    5-7 (the second, third, and fourth node of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

    mems      result
    1-3       1-3
    1-7       2-4
    1,5-6     1,5-6
    1,5-7     5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the currently
    effected nodemask to the relative nodemask.

    This mempolicy mode flag was conceived of by Paul Jackson.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
        ...
        union {
            nodemask_t cpuset_mems_allowed;
            nodemask_t user_nodemask;
        } w;
    };

    that stores the nodemask that the user passed when he or she created the
    mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.
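
    For example, a task could request an interleave over nodes 1-3 that is
    never remapped (userspace sketch):

    #include <numaif.h>

    unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3); /* nodes 1-3 */
    set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES, &nodes,
                  8 * sizeof(nodes));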

    This creates an interesting side-effect in that it allows the mempolicy
    "intent" to lie dormant and uneffected until it has access to the node(s) that
    it desires. For example, if you currently ask for an interleaved policy over
    a set of nodes that you do not have access to, the mempolicy is not created
    and the task continues to use the previous policy. With this change, however,
    it is possible to create the same mempolicy; it is only effected when access
    to nodes in the nodemask is acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new() since it
    is now obsoleted. The unused vma_mpol_equal() is also removed.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
        ...
        unsigned short policy;
        unsigned short flags;
    };

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.
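
    A sketch of the split as done in the syscall entry points:

    unsigned short flags = mode & MPOL_MODE_FLAGS;  /* optional mode flags */
    mode &= ~MPOL_MODE_FLAGS;                       /* the mode proper */
    if ((unsigned int)mode >= MPOL_MAX)
        return -EINVAL;                 /* illegal mode and/or flag bits */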

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
        ...
    case MPOL_INTERLEAVE:
        ...
    }

    statements or

    if (policy == MPOL_INTERLEAVE) {
        ...
    }

    statements. Such statements only stop working if an application starts
    passing the optional mode flags to set_mempolicy() or mbind(); in that
    case, it will need to mask the optional flags off the policy value in
    switch and conditional statements that only test the mode.

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
    MPOL_INTERLEAVE, are better declared as part of an enum since they are
    sequentially numbered and cannot be combined.
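
    The modes thus become:

    enum {
        MPOL_DEFAULT,
        MPOL_PREFERRED,
        MPOL_BIND,
        MPOL_INTERLEAVE,
        MPOL_MAX,   /* always last member of enum */
    };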

    The policy member of struct mempolicy is also converted from type short to
    type unsigned short. A negative policy does not have any legitimate meaning,
    so it is possible to change its type in preparation for adding optional mode
    flags later.

    The equivalent member of struct shmem_sb_info is also changed from int to
    unsigned short.

    For compatibility, the policy formal to get_mempolicy() remains as a pointer
    to an int:

    int get_mempolicy(int *policy, unsigned long *nmask,
                      unsigned long maxnode, unsigned long addr,
                      unsigned long flags);

    although the only possible values lie within the range of type unsigned
    short.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.
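
    The nodemask-filtered entry point looks roughly like:

    struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                        struct zonelist *zonelist,
                                        nodemask_t *nodemask);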

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if it was found that accessing zone->node is
    significant which may be the case on workloads where nodemasks are heavily
    used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
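
    For illustration, a sketch of the structure and its helpers:

    struct zoneref {
        struct zone *zone;   /* pointer to the actual zone */
        int zone_idx;        /* cached zone_idx(zone) */
    };

    static inline struct zone *zonelist_zone(struct zoneref *zoneref)
    {
        return zoneref->zone;
    }

    static inline int zonelist_zone_idx(struct zoneref *zoneref)
    {
        return zoneref->zone_idx;   /* no pointer chase, no subtraction */
    }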

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce a node_zonelist() helper function. It is used to lookup the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems.
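
    A sketch of the helper as introduced at this point in the series (later
    patches in the set change the indexing):

    static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
    {
        return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
    }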

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Mar, 2008

1 commit

  • Address 3 known bugs in the current memory policy reference counting method.
    I have a series of patches to rework the reference counting to reduce overhead
    in the allocation path. However, that series will require testing in -mm once
    I repost it.

    1) alloc_page_vma() does not release the extra reference taken for
    vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in
    leaking mempolicy structures. This is probably occurring, but not being
    noticed.

    Fix: add the conditional release of the reference.

    2) hugezonelist unconditionally releases a reference on the mempolicy when
    mode == MPOL_INTERLEAVE. This can result in decrementing the reference
    count for system default policy [should have no ill effect] or premature
    freeing of task policy. If this occurred, the next allocation using task
    mempolicy would use the freed structure and probably BUG out.

    Fix: add the necessary check to the release.

    3) The current reference counting method assumes that vma 'get_policy()'
    methods automatically add an extra reference a non-NULL returned mempolicy.
    This is true for shmem_get_policy() used by tmpfs mappings, including
    regular page shm segments. However, SHM_HUGETLB shm's, backed by
    hugetlbfs, just use the vma policy without the extra reference. This
    results in freeing of the vma policy on the first allocation, with reuse of
    the freed mempolicy structure on subsequent allocations.

    Fix: Rather than add another condition to the conditional reference
    release, which occur in the allocation path, just add a reference when
    returning the vma policy in shm_get_policy() to match the assumptions.
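
    A simplified sketch of fix 3 in ipc/shm.c:

    static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
                                            unsigned long addr)
    {
        struct shm_file_data *sfd = shm_file_data(vma->vm_file);
        struct mempolicy *pol = NULL;

        if (sfd->vm_ops->get_policy)
            pol = sfd->vm_ops->get_policy(vma, addr);
        else if (vma->vm_policy) {
            pol = vma->vm_policy;
            mpol_get(pol);   /* callers assume the extra reference */
        }
        return pol;
    }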

    Signed-off-by: Lee Schermerhorn
    Cc: Greg KH
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

15 Feb, 2008

1 commit

  • seq_path() is always called with a dentry and a vfsmount from a struct path.
    Make seq_path() take it directly as an argument.
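
    The signature change, roughly:

    /* before */
    int seq_path(struct seq_file *m, struct vfsmount *mnt,
                 struct dentry *dentry, char *esc);
    /* after */
    int seq_path(struct seq_file *m, struct path *path, char *esc);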

    Signed-off-by: Jan Blunck
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

12 Feb, 2008

1 commit

  • KOSAKI Motohiro noted that "numactl --interleave=all ..." failed in the
    presence of memoryless nodes. This patch attempts to fix that problem.

    Some background:

    numactl --interleave=all calls set_mempolicy(2) with a fully populated
    [out to MAXNUMNODES] nodemask. set_mempolicy() [in do_set_mempolicy()]
    calls contextualize_policy() which requires that the nodemask be a
    subset of the current task's mems_allowed; else EINVAL will be returned.

    A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
    i.e., nodes with memory. So, a fully populated nodemask will be
    declared invalid if it includes memoryless nodes.

    NOTE: the same thing will occur when running in a cpuset
    with restricted mems_allowed--for the same reason:
    node mask contains dis-allowed nodes.

    mbind(2), on the other hand, just masks off any nodes in the nodemask
    that are not included in the caller's mems_allowed.

    In each case [mbind() and set_mempolicy()], mpol_check_policy() will
    complain [again, resulting in EINVAL] if the nodemask contains any
    memoryless nodes. This is somewhat redundant as mpol_new() will remove
    memoryless nodes for interleave policy, as will bind_zonelist()--called
    by mpol_new() for BIND policy.

    Proposed fix:

    1) modify contextualize_policy logic to:
       a) remember whether the incoming node mask is empty.
       b) if not, restrict the nodemask to allowed nodes, as is
          currently done in-line for mbind(). This guarantees
          that the resulting mask includes only nodes with memory.

          NOTE: this is a [benign, IMO] change in behavior for
          set_mempolicy(). Dis-allowed nodes will be
          silently ignored, rather than returning an error.

       c) fold this code into mpol_check_policy(), replace the 2 calls to
          contextualize_policy() with direct calls to mpol_check_policy(),
          and remove contextualize_policy().

    2) In existing mpol_check_policy() logic, after "contextualization":
       a) MPOL_DEFAULT: require that the incoming mask "was_empty"
       b) MPOL_{BIND|INTERLEAVE}: require that the contextualized nodemask
          contains at least one node.
       c) add a case for MPOL_PREFERRED: if the incoming mask was not empty
          and the resulting mask IS empty, the user specified invalid nodes;
          return EINVAL.
       d) remove the now redundant check for memoryless nodes

    3) remove the now redundant masking of policy nodes for interleave
    policy from mpol_new().

    4) Now that mpol_check_policy() contextualizes the nodemask, remove
    the in-line nodes_and() from sys_mbind(). I believe that this
    restores mbind() to the behavior before the memoryless-nodes
    patch series. E.g., we'll no longer treat an invalid nodemask
    with MPOL_PREFERRED as local allocation.
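
    A sketch of the folded logic described above (names as in the 2.6.24-era
    sources; details may differ):

    static int mpol_check_policy(int mode, nodemask_t *nodes)
    {
        int was_empty, is_empty;

        if (!nodes)
            return 0;

        was_empty = nodes_empty(*nodes);
        if (!was_empty)   /* "contextualize": restrict to allowed nodes */
            nodes_and(*nodes, *nodes, current->mems_allowed);
        is_empty = nodes_empty(*nodes);

        switch (mode) {
        case MPOL_DEFAULT:
            if (!was_empty)
                return -EINVAL;   /* MPOL_DEFAULT takes no nodemask */
            break;
        case MPOL_BIND:
        case MPOL_INTERLEAVE:
            if (is_empty)
                return -EINVAL;   /* need at least one allowed node */
            break;
        case MPOL_PREFERRED:
            if (!was_empty && is_empty)
                return -EINVAL;   /* all specified nodes dis-allowed */
            break;
        }
        return 0;
    }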

    [ Patch history:

    v1 -> v2:
    - Communicate whether or not incoming node mask was empty to
    mpol_check_policy() for better error checking.
    - As suggested by David Rientjes, remove the now unused
    cpuset_nodes_subset_current_mems_allowed() from cpuset.h

    v2 -> v3:
    - As suggested by KOSAKI Motohiro, fold the "contextualization"
    of policy nodemask into mpol_check_policy(). Looks a little
    cleaner. ]

    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

15 Nov, 2007

1 commit

  • We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
    mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas. For
    anon-regions, we just fail to migrate any pages beyond the 1st vma in the
    range.

    This occurs because do_mbind() collects a list of pages to migrate by
    calling check_range(). check_range() walks the task's mm, spanning vmas as
    necessary, to collect the migratable pages into a list. Then, do_mbind()
    calls migrate_pages() passing the list of pages, a function to allocate new
    pages based on vma policy [new_vma_page()], and a pointer to the first vma
    of the range.

    For each page in the list, new_vma_page() calls page_address_in_vma()
    passing the page and the vma [first in range] to obtain the address to get
    for alloc_page_vma(). The page address is needed to get interleaving
    policy correct. If the pages in the list come from multiple vmas,
    eventually, new_vma_page() will pass that page to page_address_in_vma()
    with the incorrect vma. For !PageAnon pages, this will result in a bug
    check in rmap.c:vma_address(). For anon pages, vma_address() will just
    return EFAULT and fail the migration.

    This patch modifies new_vma_page() to check the return value from
    page_address_in_vma(). If the return value is EFAULT, new_vma_page()
    searches forward via vm_next for the vma that maps the page--i.e., one that
    does not return EFAULT. This assumes that the pages in the list handed to
    migrate_pages() are in address order. This is currently the case. The patch
    documents this assumption in a new comment block for new_vma_page().

    If new_vma_page() cannot locate the vma mapping the page in a forward
    search in the mm, it will pass a NULL vma to alloc_page_vma(). This will
    result in the allocation using the task policy, if any, else system default
    policy. This situation is unlikely, but the patch documents this behavior
    with a comment.
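
    The reworked callback, roughly (per the description above):

    static struct page *new_vma_page(struct page *page, unsigned long private,
                                     int **result)
    {
        struct vm_area_struct *vma = (struct vm_area_struct *)private;
        unsigned long address = 0;

        /*
         * The pages handed to migrate_pages() are in address order, so a
         * forward search from the first vma of the range finds the vma
         * that actually maps the page.
         */
        while (vma) {
            address = page_address_in_vma(page, vma);
            if (address != -EFAULT)
                break;
            vma = vma->vm_next;
        }
        /* vma == NULL: fall back to task or system default policy */
        return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
    }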

    Note, this patch results in restarting from the first vma in a multi-vma
    range each time new_vma_page() is called. If this is not acceptable, we
    can make the vma argument a pointer, both in new_vma_page() and its caller
    unmap_and_move() so that the value held by the loop in migrate_pages()
    always passes down the last vma in which a page was found. This will
    require changes to all new_page_t functions passed to migrate_pages(). Is
    this necessary?

    For this patch to work, we can't bug check in vma_address() for pages
    outside the argument vma. This patch removes the BUG_ON(). All other
    callers [besides new_vma_page()] already check the return status.

    Tested on x86_64, 4 node NUMA platform.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

20 Oct, 2007

3 commits

  • find_task_by_something is a set of macros used to find a task by pid,
    depending on what kind of pid is proposed - a global or a virtual one. All
    of them are wrappers around the most generic one - find_task_by_pid_type_ns()
    - and just substitute some args for it.

    It turned out that dereferencing the current->nsproxy->pid_ns construction
    and pushing one more argument on the stack inline cause kernel text size to
    grow.

    This patch moves all this stuff out-of-line into kernel/pid.c. Together
    with the next patch it saves a bit less than 400 bytes from the .text
    section.
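
    For illustration, the out-of-line wrappers in kernel/pid.c look roughly
    like:

    struct task_struct *find_task_by_pid(pid_t nr)
    {
        return find_task_by_pid_type_ns(PIDTYPE_PID, nr, &init_pid_ns);
    }

    struct task_struct *find_task_by_vpid(pid_t vnr)
    {
        return find_task_by_pid_type_ns(PIDTYPE_PID, vnr,
                                        current->nsproxy->pid_ns);
    }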

    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is the largest patch in the set. Make all (I hope) the places where
    the pid is shown to or obtained from userspace operate on the virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
    or the pid's global nr, obtained with pid_nr() call;
    - when seeking the task from kernel code with the stored id one
    should use find_task_by_pid() call that works with global pids;
    - when showing pid's numerical value to the user the virtual one
    should be used, but however when one shows task's pid outside this
    task's namespace the global one is to be used;
    - when getting a pid from userspace one needs to treat it as
    the virtual one and use the appropriate task/pid-searching functions.
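
    In code, these rules amount to roughly the following fragment ('task' and
    'pid_from_user' stand for any task pointer and user-supplied pid):

    pid_t shown  = task_pid_vnr(task);  /* virtual nr: show to userspace */
    pid_t stored = task_pid_nr(task);   /* global nr: safe to store in-kernel */
    struct task_struct *t = find_task_by_vpid(pid_from_user); /* from user */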

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Remove the filesystem support logic from the cpusets system and make
    cpusets a cgroup subsystem.

    The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
    passed through to the cgroup filesystem with the appropriate options to
    emulate the old cpuset filesystem behaviour.

    Signed-off-by: Paul Menage
    Cc: Serge E. Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Paul Jackson
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: Srivatsa Vaddagiri
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

17 Oct, 2007

6 commits

  • This patch contains the following cleanups:
    - every file should include the headers containing the prototypes for
    its global functions
    - make the following needlessly global functions static:
    - migrate_to_node()
    - do_mbind()
    - sp_alloc()
    - mpol_rebind_policy()

    [akpm@linux-foundation.org: fix uninitialised var warning]
    Signed-off-by: Adrian Bunk
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Here's a cut at fixing up uses of the online node map in generic code.

    mm/shmem.c:shmem_parse_mpol()

    Ensure nodelist is subset of nodes with memory.
    Use node_states[N_HIGH_MEMORY] as default for missing
    nodelist for interleave policy.

    mm/shmem.c:shmem_fill_super()

    initialize policy_nodes to node_states[N_HIGH_MEMORY]

    mm/page-writeback.c:highmem_dirtyable_memory()

    sum over nodes with memory

    mm/page_alloc.c:zlc_setup()

    allowednodes - use nodes with memory.

    mm/page_alloc.c:default_zonelist_order()

    average over nodes with memory.

    mm/page_alloc.c:find_next_best_node()

    skip nodes w/o memory.
    N_HIGH_MEMORY state mask may not be initialized at this time,
    unless we want to depend on early_calculate_totalpages() [see
    below]. Will ZONE_MOVABLE ever be configurable?

    mm/page_alloc.c:find_zone_movable_pfns_for_nodes()

    spread kernelcore over nodes with memory.

    This required calling early_calculate_totalpages()
    unconditionally, and populating N_HIGH_MEMORY node
    state therein from nodes in the early_node_map[].
    If we can depend on this, we can eliminate the
    population of N_HIGH_MEMORY mask from __build_all_zonelists()
    and use the N_HIGH_MEMORY mask in find_next_best_node().

    mm/mempolicy.c:mpol_check_policy()

    Ensure nodes specified for policy are subset of
    nodes with memory.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Shaohua Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Online nodes now may have no memory. The checks and initialization must
    therefore be changed to no longer use the online functions.

    This will correctly initialize the interleave on bootup to only target nodes
    with memory and will make sys_move_pages return an error when a page is to be
    moved to a memoryless node. Similarly we will get an error if MPOL_BIND or
    MPOL_INTERLEAVE is used on a memoryless node.

    These are somewhat new semantics. So far one could specify memoryless nodes
    and we would maybe do the right thing and just ignore the node (or we'd do
    something strange like with MPOL_INTERLEAVE). If we want to allow the
    specification of memoryless nodes via memory policies then we need to keep
    checking for online nodes.

    Signed-off-by: Christoph Lameter
    Acked-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on
    memoryless nodes will be redirected to nodes with memory. This results in an
    imbalance because the neighboring nodes to memoryless nodes will get
    significantly more interleave hits than the rest of the nodes on the system.

    We can avoid this imbalance by clearing the nodes in the interleave node set
    that have no memory. If we use the node map of the memory nodes instead of
    the online nodes then we have only the nodes we want.
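
    A sketch of the change (illustrative):

    /* build the interleave set only from nodes that actually have memory */
    nodes_and(pol->v.nodes, *nodes, node_states[N_HIGH_MEMORY]);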

    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Tested-by: Lee Schermerhorn
    Acked-by: Bob Picco
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Allow an application to query the memories allowed by its context.

    Updated numa_memory_policy.txt to mention that applications can use this to
    obtain allowed memories for constructing valid policies.
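
    A userspace sketch of the query (the flag added here is
    MPOL_F_MEMS_ALLOWED):

    #include <numaif.h>

    unsigned long mask[8] = { 0 };   /* room for 512 nodes */
    get_mempolicy(NULL, mask, 512, 0, MPOL_F_MEMS_ALLOWED);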

    TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
    wrapper--e.g., numa_get_mems_allowed() ?

    Also, update numa syscall man pages.

    Tested with memtoy V>=0.13.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch cleans up duplicate includes in
    mm/

    Signed-off-by: Jesper Juhl
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl