29 Oct, 2009

2 commits

  • If migrate_prep() fails, the new variable is leaked. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If mbind() receives an invalid address, do_mbind() leaks a page. The
    following test program detects this leak.

    This patch fixes it.

    migrate_efault.c
    =======================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <numa.h>
    #include <numaif.h>

    static unsigned long pagesize;

    static void* make_hole_mapping(void)
    {
            void* addr;

            addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
                        MAP_ANON|MAP_PRIVATE, 0, 0);  /* fd is ignored for MAP_ANON */
            if (addr == MAP_FAILED)
                    return NULL;

            /* populate the pages */
            memset(addr, 0, pagesize*3);

            /* punch a memory hole in the middle page */
            munmap(addr+pagesize, pagesize);

            return addr;
    }

    int main(int argc, char** argv)
    {
            void* addr;
            int ch;
            int node;
            struct bitmask *nmask = numa_allocate_nodemask();
            int err;
            int node_set = 0;

            while ((ch = getopt(argc, argv, "n:")) != -1) {
                    switch (ch) {
                    case 'n':
                            node = strtol(optarg, NULL, 0);
                            numa_bitmask_setbit(nmask, node);
                            node_set = 1;
                            break;
                    default:
                            ;
                    }
            }
            argc -= optind;
            argv += optind;

            if (!node_set)
                    numa_bitmask_setbit(nmask, 0);

            pagesize = getpagesize();

            addr = make_hole_mapping();

            err = mbind(addr, pagesize*3, MPOL_BIND, nmask->maskp, nmask->size,
                        MPOL_MF_MOVE_ALL);
            if (err)
                    perror("mbind ");

            return 0;
    }
    =======================================

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

08 Aug, 2009

1 commit

  • At first, init_task's mems_allowed is initialized as follows:
    init_task->mems_allowed == node_state[N_POSSIBLE]

    And the cpuset's top_cpuset mask is initialized as follows:
    top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]

    Before 2.6.29:
    A policy's mems_allowed was initialized like this:

    1. update task->mems_allowed from its cpuset->mems_allowed.
    2. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

    The task's mems_allowed was updated with reference to top_cpuset's, and a
    cpuset's mems_allowed is always aware of N_HIGH_MEMORY.

    In 2.6.30: after commit 58568d2a8215cb6f55caf2332017d7bdff954e1c
    ("cpuset,mm: update tasks' mems_allowed in time"), a policy's mems_allowed
    is initialized like this:

    1. policy->mems_allowed = nodes_and(task->mems_allowed, user's mask)

    Here, if the task is in top_cpuset, task->mems_allowed is never updated
    from init's. Assume the user executes a command such as
    # numactl --interleave=all ...
    Then:

    policy->mems_allowed = nodes_and(N_POSSIBLE, ALL_SET_MASK)

    So the policy's mems_allowed can include a possible node which has no
    pgdat.

    MPOL_INTERLEAVE just scans the nodemask of task->mems_allowed and accesses
    it directly -- i.e., NODE_DATA(nid)->zonelist -- even if
    NODE_DATA(nid) == NULL.

    What we need, then, is to make policy->mems_allowed aware of
    N_HIGH_MEMORY. This patch does that. But to do so, an extra nodemask is
    needed, and nodemask_t is too big to put on the stack. Because cpumask
    already has the CPUMASK_ALLOC() interface for this, I added an equivalent
    for nodemasks.

    This patch keeps the old behavior. I feel this fix itself is just a
    Band-Aid; a fundamental fix would have to take memory hotplug into
    account, and that takes time. (task->mems_allowed should be
    N_HIGH_MEMORY, I think.)

    mpol_set_nodemask() should be aware of N_HIGH_MEMORY, and a policy's
    nodemask should include only online nodes.

    In the old behavior this was guaranteed by frequent references into the
    cpuset code. Now most of those are gone, and mempolicy has to check it by
    itself.

    To do that check, a few nodemask_t variables are needed for calculating
    nodemasks. But nodemask_t can be big, and it is not good to allocate them
    on the stack.

    cpumask_t already has CPUMASK_ALLOC/FREE as an easy way to get a scratch
    area; NODEMASK_ALLOC/FREE should exist too.
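
    A minimal sketch of the kind of scratch-nodemask helpers described above,
    modeled on CPUMASK_ALLOC(); the macro names and exact shapes here are
    illustrative assumptions, not necessarily what the final patch uses:

    #define NODEMASK_ALLOC(type, name) \
            type *name = kmalloc(sizeof(*name), GFP_KERNEL)  /* heap, not stack */
    #define NODEMASK_FREE(name)     kfree(name)

    The point is simply that a large nodemask_t (or a structure bundling
    several of them) lands on the heap instead of the kernel stack.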

    [akpm@linux-foundation.org: cleanups & tweaks]
    Tested-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

17 Jun, 2009

2 commits

  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
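
    A sketch of the shape of such a helper (simplified; the details are
    assumptions, not the literal patch):

    /* Like alloc_pages_node(), but the caller guarantees nid is valid, so
     * only a debug-build assertion remains instead of a runtime branch. */
    static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
    {
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
    }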

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fix allocating page cache/slab objects on a disallowed node when memory
    spread is set, by updating tasks' mems_allowed after their cpuset's mems
    is changed.

    In order to update tasks' mems_allowed in time, we must modify the memory
    policy code, because the memory policy was originally applied in the
    process's own context. After this patch, one task directly manipulates
    another's mems_allowed, and we use alloc_lock in the task_struct to
    protect the task's mems_allowed and memory policy.

    But in the fast path we do not take the lock, because adding a lock there
    may lead to a performance regression. Without the lock, though, the task
    might see an empty nodemask while its cpuset's mems_allowed is being
    changed to some non-overlapping set. To avoid that, we first set all the
    newly allowed nodes and only then clear the newly disallowed ones.
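
    A sketch of that ordering (simplified, not the exact kernel code;
    'newmems' stands for the cpuset's new nodemask):

    /* step 1: grow the mask -- concurrent allocators still see the old nodes */
    nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);

    /* step 2: shrink it to exactly the new set -- it was never empty */
    nodes_and(tsk->mems_allowed, tsk->mems_allowed, *newmems);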

    [lee.schermerhorn@hp.com:
    The rework of mpol_new() to extract the adjusting of the node mask to
    apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
    with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
    allocation. Fix this by adding the check for MPOL_PREFERRED and an empty
    node mask to mpol_new_mempolicy().

    Remove the now unneeded 'nodes = NULL' from mpol_new().

    Note that mpol_new_mempolicy() is always called with a non-NULL
    'nodes' parameter now that it has been removed from mpol_new().
    Therefore, we don't need to test nodes for NULL before testing it for
    'empty'. However, just to be extra paranoid, add a VM_BUG_ON() to
    verify this assumption.]
    [lee.schermerhorn@hp.com:

    I don't think the function name 'mpol_new_mempolicy' is descriptive
    enough to differentiate it from mpol_new().

    This function applies cpuset set context, usually constraining nodes
    to those allowed by the cpuset. However, when the RELATIVE_NODES flag
    is set, it also translates the nodes. So I settled on
    'mpol_set_nodemask()', because the comment block for mpol_new() mentions
    that we need to call this function to "set nodes".

    Some additional minor line length, whitespace and typo cleanup.]
    Signed-off-by: Miao Xie
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Paul Menage
    Cc: Nick Piggin
    Cc: Yasunori Goto
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

14 Jan, 2009

1 commit


14 Nov, 2008

4 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed the conflicts above by using the non-'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.
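
    A sketch of the resulting read-side pattern (simplified; accessor details
    are assumptions):

    const struct cred *cred;
    uid_t euid;

    rcu_read_lock();
    cred = rcu_dereference(task->cred);   /* no full lock needed to read */
    euid = cred->euid;
    rcu_read_unlock();                    /* creds are freed via RCU, never under us */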

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.
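
    For illustration, the kind of conversion this amounts to ('file_uid' is a
    hypothetical variable, not from the patch):

    /* before */  if (current->euid == file_uid) { /* ... */ }
    /* after  */  if (current_euid() == file_uid) { /* ... */ }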

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: Al Viro
    Cc: linux-audit@redhat.com
    Cc: containers@lists.linux-foundation.org
    Cc: linux-mm@kvack.org
    Signed-off-by: James Morris

    David Howells
     

07 Nov, 2008

1 commit

  • Move the migrate_prep() call outside the mmap_sem for the following
    system calls:

    1. sys_move_pages()
    2. sys_migrate_pages()
    3. sys_mbind()

    It really does not matter when we flush the lru. The system is free to
    add pages onto the lru even during migration, which will make the page
    migration either skip the page (mbind, migrate_pages) or return a busy
    state (move_pages).

    This fixes the following lockdep warning (and potential deadlock):

    Some VM code path takes
      mmap_sem -> kevent_wq (via lru_add_drain_all())

    net/core/dev.c::dev_ioctl() takes
      rtnl_lock -> mmap_sem (*)
      (*) the ioctl does copy_from_user() and can therefore page fault.

    linkwatch_event takes
      kevent_wq -> rtnl_lock
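
    To break the mmap_sem -> kevent_wq edge, the calls end up ordered roughly
    like this (simplified sketch; 'do_the_migration' is a hypothetical
    placeholder for each syscall's real body):

    migrate_prep();                 /* lru_add_drain_all(): used to run inside mmap_sem */

    down_read(&mm->mmap_sem);
    err = do_the_migration(mm);     /* mbind / migrate_pages / move_pages work */
    up_read(&mm->mmap_sem);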

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Reported-by: Heiko Carstens
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 Oct, 2008

2 commits

  • When the system contains lots of mlocked or otherwise unevictable pages,
    the pageout code (kswapd) can spend lots of time scanning over these
    pages. Worse still, the presence of lots of unevictable pages can confuse
    kswapd into thinking that more aggressive pageout modes are required,
    resulting in all kinds of bad behaviour.

    This patch provides infrastructure to manage pages excluded from
    reclaim--i.e., hidden from vmscan. It is based on a patch by Larry
    Woodman of Red Hat, reworked to maintain "unevictable" pages on a
    separate per-zone LRU list, to "hide" them from vmscan.

    Kosaki Motohiro added the support for the memory controller unevictable
    lru list.

    Pages on the unevictable list have both PG_unevictable and PG_lru set.
    Thus, PG_unevictable is analogous to and mutually exclusive with
    PG_active--it specifies which LRU list the page is on.

    The unevictable infrastructure is enabled by a new mm Kconfig option
    [CONFIG_]UNEVICTABLE_LRU.

    A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
    not a page may be evictable. Subsequent patches will add the various
    !evictable tests. We'll want to keep these tests light-weight for use in
    shrink_active_list() and, possibly, the fault path.

    To avoid races between tasks putting pages [back] onto an LRU list and
    tasks that might be moving the page from non-evictable to evictable state,
    the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
    -- tests the "evictability" of a page after placing it on the LRU, before
    dropping the reference. If the page has become unevictable,
    putback_lru_page() will redo the 'putback', thus moving the page to the
    unevictable list. This way, we avoid "stranding" evictable pages on the
    unevictable list.
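
    A sketch of that recheck (control flow only; the helper names here are
    hypothetical, not the real implementation):

    void putback_lru_page(struct page *page)
    {
            int was_evictable = page_evictable(page, NULL);

            add_page_to_lru(page, was_evictable);    /* hypothetical helper */

            /* The page's evictability may have changed while we were adding
             * it; recheck before dropping our reference and redo the putback
             * if so, so pages do not end up stranded on the wrong list. */
            if (was_evictable != page_evictable(page, NULL))
                    redo_putback(page);              /* hypothetical helper */

            put_page(page);
    }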

    [akpm@linux-foundation.org: fix fallout from out-of-order merge]
    [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
    [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
    [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
    [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
    [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
    [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
    [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Debugged-by: Benjamin Kidwell
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • On large memory systems, the VM can spend way too much time scanning
    through pages that it cannot (or should not) evict from memory. Not only
    does it use up CPU time, but it also provokes lock contention and can
    leave large systems under memory pressure in a catatonic state.

    This patch series improves VM scalability by:

    1) putting filesystem backed, swap backed and unevictable pages
    onto their own LRUs, so the system only scans the pages that it
    can/should evict from memory

    2) switching to two handed clock replacement for the anonymous LRUs,
    so the number of pages that need to be scanned when the system
    starts swapping is bound to a reasonable number

    3) keeping unevictable pages off the LRU completely, so the
    VM does not waste CPU time scanning them. ramfs, ramdisk,
    SHM_LOCKED shared memory segments and mlock()ed VMA pages
    are kept on the unevictable list.

    This patch:

    isolate_lru_page() logically belongs in vmscan.c rather than in migrate.c.

    It is arguable, because we don't need that function without memory
    migration, so there is a valid argument for keeping it in migrate.c.
    However, a subsequent patch needs to make use of it in the core mm, so we
    can happily move it to vmscan.c.

    Also, make the function a little more generic by not requiring that it
    adds an isolated page to a given list. Callers can do that.

    Note that we now have '__isolate_lru_page()', which does something quite
    different and is visible outside of vmscan.c for use with the memory
    controller. Methinks we need to rationalize these names/purposes. --lts

    [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
    Signed-off-by: Nick Piggin
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

13 Aug, 2008

1 commit


25 Jul, 2008

1 commit

  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (e.g. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code that requires
    these fields; that code will do the right thing regardless of the exact
    hstate it is operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.
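
    A sketch of the kind of state being collected (the field names here are
    illustrative assumptions):

    struct hstate {
            unsigned int order;              /* huge page size = PAGE_SIZE << order */
            unsigned long nr_huge_pages;     /* currently allocated */
            unsigned long free_huge_pages;   /* sitting on the free lists */
            /* ... per-node counts, reservation counts, etc. ... */
    };

    static struct hstate default_hstate;     /* the single global instance, for now */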

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

05 Jul, 2008

1 commit

  • Flags considered internal to the mempolicy kernel code are stored as part
    of the "flags" member of struct mempolicy.

    Before exposing a policy type to userspace via get_mempolicy(), these
    internal flags must be masked. Flags exposed to userspace, however,
    should still be returned to the user.
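
    A sketch of the masking (assumed shape, not the literal diff):

    /* strip kernel-internal flags; keep the user-visible mode flags */
    *policy = pol->mode | (pol->flags & MPOL_MODE_FLAGS);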

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

28 Apr, 2008

24 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent
    to "local allocation". Because shared policies are at the same "scope"
    level as vma policies [see Documentation/vm/numa_memory_policy.txt],
    MPOL_DEFAULT means "fall back to the current task policy".

    This patch extends the memory policy string parsing function to display
    "local" for MPOL_PREFERRED + MPOL_F_LOCAL. This allows one to specify local
    allocation as the default policy for shared memory areas via the tmpfs mpol
    mount option, regardless of the current task's policy.

    Also, "local" is now displayed for this policy. This patch allows us to
    accept the same input format as the display.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().

    Note 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • mpol_to_str() formats memory policies into printable strings. Currently this
    is only used to display "numa_maps". A subsequent patch will use
    mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to
    remove essentially duplicate code in mm/shmem.c. This patch cleans up
    mpol_to_str() generally and in preparation for that patch.

    1) show_numa_maps() is not checking the return code from mpol_to_str().
    There's not a lot we can do in this context if mpol_to_str() returned an
    error [insufficient space in the buffer]. Proposed "solution": just check,
    under DEBUG_VM, that callers are providing sufficient buffer space for the
    policy, flags, and a few nodes. This way, we'll get some display.
    show_numa_maps() provides a 50-byte buffer, so it won't trip this check.
    50 bytes should be sufficient unless one has a large number of nodes in a
    very sparse nodemask.

    2) The display of the new mode flags ["static" & "relative"] was set up to
    display multiple flags, separated by a "bar" '|'. However, this support is
    incomplete--e.g., need_bar was never incremented; and currently, these two
    flags are mutually exclusive. So remove the "bar" support, for now, and
    only display one flag.

    3) Use snprintf() to format flags, so as not to overflow the buffer. Not
    that it has ever happened, AFAIK.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Now that we're using "preferred local" policy for system default, we need to
    make this as fast as possible. Because of the variable size of the mempolicy
    structure [based on size of nodemasks], the preferred_node may be in a
    different cacheline from the mode. This can result in accessing an extra
    cacheline in the normal case of system default policy. I suspect this is
    the cause of an observed 2-3% slowdown in page fault testing relative to
    the kernel without this patch series.

    To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
    flags member which is guaranteed [?] to be in the same cacheline as the mode
    itself.

    Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
    for both anon and shmem segments with system default and vma [preferred local]
    policy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
    v.preferred_node < 0 -- i.e., "local allocation":

    1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
    to fetch the nodemask associated with a policy. Currently,
    get_policy_nodemask() returns the set of nodes with memory, when
    the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
    Change to return an empty nodemask, as this is what was specified
    to achieve "local allocation".

    2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
    called to adjust any task and vma policy nodes to be valid in the
    new cpuset. However, when the policy is MPOL_PREFERRED and the
    preferred_node is < 0 -- i.e., "local allocation" -- there is no
    preferred node to remap.
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
    [set_mempolicy(), mbind() and internal versions], the kernel simply installs a
    NULL struct mempolicy pointer in the appropriate context: task policy, vma
    policy, or shared policy. This causes any use of that policy to "fall back"
    to the next most specific policy scope.

    The only use of MPOL_DEFAULT to mean "local allocation" is in the system
    default policy. This requires extra checks/cases for MPOL_DEFAULT in many
    mempolicy.c functions.

    There is another, "preferred" way to specify local allocation via the APIs.
    That is using the MPOL_PREFERRED policy mode with an empty nodemask.
    Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
    All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
    node local to the cpu where the allocation occurs.
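
    For reference, a minimal userspace illustration of requesting explicit
    local allocation this way (assumes libnuma's <numaif.h>; link with -lnuma):

    #include <stdio.h>
    #include <numaif.h>

    int main(void)
    {
            /* MPOL_PREFERRED with an empty (NULL) nodemask == local allocation */
            if (set_mempolicy(MPOL_PREFERRED, NULL, 0) != 0)
                    perror("set_mempolicy");
            return 0;
    }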

    System default policy, except during boot, is hard-coded to "local
    allocation". By using the MPOL_PREFERRED mode with a negative value of
    preferred node for system default policy, MPOL_DEFAULT will never occur in the
    'policy' member of a struct mempolicy. Thus, we can remove all checks for
    MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
    paths.

    In slab_node(), return the local node id when the policy pointer is NULL.
    No need to set a pol value to take the switch default. Replace the switch
    default with BUG()--i.e., it shouldn't happen.

    With this patch MPOL_DEFAULT is only used in the APIs, including internal
    calls to do_set_mempolicy() and in the display of policy in
    /proc/<pid>/numa_maps. It always means "fall back" to the next most
    specific policy scope. This simplifies the description of memory policies
    quite a bit, with no visible change in behavior.

    get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
    the requested policy [task or vma/shared] is NULL. These are the values one
    would supply via set_mempolicy() or mbind() to achieve that condition--default
    behavior.

    This patch updates Documentation to reflect this change.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref operations in
    what are usually fast paths. In other areas, further inspection reveals
    that I botched the unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy referencing counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of mempolicy, based on (vma, address) need only add a reference for
    shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.
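
    A sketch of the conditional-put helper (assumed shape):

    static inline void mpol_cond_put(struct mempolicy *pol)
    {
            if (pol && (pol->flags & MPOL_F_SHARED))
                    __mpol_put(pol);        /* drop the extra ref taken at lookup */
    }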

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • As part of yet another rework of mempolicy reference counting, we want to be
    able to identify shared policies efficiently, because they have an extra ref
    taken on lookup that needs to be removed when we're finished using the policy.

    Note: the extra ref is required because the policies are
    shared between tasks/processes and can be changed/freed
    by one task while another task is using them--e.g., for
    page allocation.

    Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
    indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
    member of the struct mempolicy added by David. MPOL_F_SHARED, and any future
    "internal mode flags" are reserved from bit zero up, as they will never be
    passed in the upper bits of the mode argument of a mempolicy API.

    I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
    rb-tree. Don't need/want to clear the flag when removing from the tree as the
    mempolicy is freed [unref'd] internally to the sp_delete() function. However,
    a task could hold another reference on this mempolicy from a prior lookup. We
    need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
    unref, eventually freeing, the mempolicy.

    A later patch in this series will introduce a function to conditionally unref
    [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the
    only reason] to unref/free a policy via the conditional free.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The terms 'policy' and 'mode' are both used in various places to describe the
    semantics of the value stored in the 'policy' member of struct mempolicy.
    Furthermore, the term 'policy' is used to refer to that member, to the entire
    struct mempolicy and to the more abstract concept of the tuple consisting of a
    "mode" and an optional node or set of nodes. Recently, we have added "mode
    flags" that are passed in the upper bits of the 'mode' [or sometimes,
    'policy'] member of the numa APIs.

    I'd like to resolve this confusion, which perhaps only exists in my mind, by
    renaming the 'policy' member to 'mode' throughout, and fixing up the
    Documentation. Man pages will be updated separately.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • get_vma_policy() is not handling fallback to task policy correctly when the
    get_policy() vm_op returns NULL. The NULL overwrites the 'pol' variable that
    was holding the fallback task mempolicy. So, it was falling back directly to
    system default policy.

    Fix get_vma_policy() to use only non-NULL policy returned from the vma
    get_policy op.

    shm_get_policy() was falling back to current task's mempolicy if the "backing
    file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
    the vma policy is null. This is incorrect for show_numa_maps() which is
    likely querying the numa_maps of some task other than current. Remove this
    fallback.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
    while examining each vma's mempolicy. A vma's mempolicy can fall back to
    the task's policy. However, the task could be changing its task policy
    and free the one that show_numa_maps() is examining.

    To prevent this, grab the mmap_sem for write when updating task mempolicy.
    Pointed out to me by Christoph Lameter and extracted and reworked from
    Christoph's alternative mempol reference counting patch.

    This is analogous to the way that do_mbind() and do_get_mempolicy()
    prevent races between tasks sharing an mm_struct [a.k.a. threads] setting
    and querying a mempolicy for a particular address.

    Note: this is necessary, but not sufficient, to allow us to stop taking an
    extra reference on "other task's mempolicy" in get_vma_policy. Subsequent
    patches will complete this update, allowing us to simplify the tests for
    whether we need to unref a mempolicy at various points in the code.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for
    MPOL_PREFERRED policies that were created with an empty nodemask (for purely
    local allocations). They'll never be invalidated when the allowed mems of
    a task change, nor do they need to be rebound relative to a cpuset's
    placement.

    Also fixes a bug identified by Lee Schermerhorn that disallowed empty
    nodemasks to be passed to MPOL_PREFERRED to specify local allocations. [A
    different, somewhat incomplete, patch already existed in 25-rc5-mm1.]

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: Randy Dunlap
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Create a mempolicy_operations structure that currently points to two
    functions[*] for the various modes:

    int (*create)(struct mempolicy *, const nodemask_t *);
    void (*rebind)(struct mempolicy *, const nodemask_t *);

    This splits the implementation for the various modes out of two large
    functions, mpol_new() and mpol_rebind_policy(). Eventually it may be
    beneficial to add additional functions to accommodate the existing switch()
    statements in mm/mempolicy.c.

    [*] The ->create() function for MPOL_DEFAULT is currently NULL since no
    struct mempolicy is dynamically allocated.
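
    A sketch of what such an ops table could look like (the per-mode function
    names here are assumptions):

    static const struct mempolicy_operations {
            int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
            void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
    } mpol_ops[MPOL_MAX] = {
            [MPOL_DEFAULT]    = { /* no dynamically allocated mempolicy */ },
            [MPOL_PREFERRED]  = { .create = mpol_new_preferred,
                                  .rebind = mpol_rebind_preferred },
            [MPOL_INTERLEAVE] = { .create = mpol_new_interleave,
                                  .rebind = mpol_rebind_nodemask },
            /* ... MPOL_BIND ... */
    };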

    [Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests]
    Signed-off-by: David Rientjes
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Lee Schermerhorn
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid
    having to declare function prototypes.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, which specifies
    that nodemasks passed via set_mempolicy() or mbind() should be considered
    relative to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped
    onto the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the resulting nodemask
    is 5-7 (the second, third, and fourth nodes of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

        mems      result
        1-3       1-3
        1-7       2-4
        1,5-6     1,5-6
        1,5-7     5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the
    currently effective nodemask to the relative nodemask.
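
    For illustration only (plain userspace C, not kernel code), one way to
    compute the fold-and-map described above; it reproduces both the 1-3 onto
    4-7 example and the rebind table:

    #include <stdio.h>

    /* Map relative bit i to the i-th allowed node (modulo the number of
     * allowed nodes). Nodemasks are modeled as plain 64-bit words here. */
    static unsigned long fold_relative(unsigned long relative, unsigned long allowed)
    {
            int allowed_nodes[64], nallowed = 0;
            unsigned long result = 0;

            for (int n = 0; n < 64; n++)
                    if (allowed & (1UL << n))
                            allowed_nodes[nallowed++] = n;

            if (!nallowed)
                    return 0;

            for (int i = 0; i < 64; i++)
                    if (relative & (1UL << i))
                            result |= 1UL << allowed_nodes[i % nallowed];

            return result;
    }

    int main(void)
    {
            unsigned long rel  = 0x0eUL;    /* relative nodes 1-3 */
            unsigned long mems = 0xf0UL;    /* mems_allowed 4-7 */

            /* prints 0xe0, i.e. nodes 5-7, matching the example above */
            printf("0x%lx\n", fold_relative(rel, mems));
            return 0;
    }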

    This mempolicy mode flag was conceived of by Paul Jackson.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
            ...
            union {
                    nodemask_t cpuset_mems_allowed;
                    nodemask_t user_nodemask;
            } w;
    };

    that stores the nodemask that the user passed when creating the mempolicy
    via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.

    This creates an interesting side-effect in that it allows the mempolicy
    "intent" to lie dormant and unaffected until the task has access to the
    node(s) that it desires. For example, if you currently ask for an
    interleaved policy over a set of nodes that you do not have access to, the
    mempolicy is not created and the task continues to use the previous
    policy. With this change, however, it is possible to create the same
    mempolicy; it only takes effect when access to nodes in the nodemask is
    acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new(),
    since it is now obsolete. The unused vma_mpol_equal() is also removed.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int
    policy' formal parameter of either the set_mempolicy() or mbind() syscall.
    A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
            ...
            unsigned short policy;
            unsigned short flags;
    };

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the
    syscall if there are additional flags, and storing it in the new 'flags'
    member of struct mempolicy. The intersection of the actual with
    ~MPOL_MODE_FLAGS is stored in the 'policy' member of the struct, and all
    current users of pol->policy remain unchanged.
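
    A sketch of the split (simplified):

    unsigned short flags = mode & MPOL_MODE_FLAGS;   /* optional flag bits */

    mode &= ~MPOL_MODE_FLAGS;                        /* the mode itself */
    if (mode >= MPOL_MAX)                            /* unknown mode or stray bits */
            return -EINVAL;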

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
            ...
    case MPOL_INTERLEAVE:
            ...
    };

    statements or

    if (policy == MPOL_INTERLEAVE) {
            ...
    }

    statements. Such previously implemented statements only stop working if an
    application starts passing optional mode flags to set_mempolicy() or
    mbind(). If an application does start using optional mode flags, it will
    need to mask the optional flags off the policy in switch and conditional
    statements that only test the mode.

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
    MPOL_INTERLEAVE, are better declared as part of an enum since they are
    sequentially numbered and cannot be combined.
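
    Sketched, the declaration becomes something like this (an end-of-range
    marker such as MPOL_MAX is an assumption):

    enum {
            MPOL_DEFAULT,
            MPOL_PREFERRED,
            MPOL_BIND,
            MPOL_INTERLEAVE,
            MPOL_MAX,       /* end-of-range marker, not a real policy */
    };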

    The policy member of struct mempolicy is also converted from type short to
    type unsigned short. A negative policy does not have any legitimate meaning,
    so it is possible to change its type in preparation for adding optional mode
    flags later.

    The equivalent member of struct shmem_sb_info is also changed from int to
    unsigned short.

    For compatibility, the policy formal to get_mempolicy() remains as a pointer
    to an int:

    int get_mempolicy(int *policy, unsigned long *nmask,
                      unsigned long maxnode, unsigned long addr,
                      unsigned long flags);

    although the only possible values are within the range of type unsigned
    short.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.
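
    A sketch of the interface this adds (assumed signature):

    struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                        struct zonelist *zonelist,
                                        nodemask_t *nodemask);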

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is
    costly as it involves a lookup of another structure and a subtraction
    operation. As the zone_idx is often required, it should be quickly
    accessible. The node idx could also be stored here if it was found that
    accessing zone->node is significant, which may be the case on workloads
    where nodemasks are heavily used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
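
    Sketched from the description above (the field names are assumptions):

    struct zoneref {
            struct zone *zone;      /* pointer to the actual zone */
            int zone_idx;           /* cached zone_idx(zone) */
    };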

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Introduce a node_zonelist() helper function. It is used to lookup the
    appropriate zonelist given a node and a GFP mask. The patch on its own is a
    cleanup but it helps clarify parts of the two-zonelist-per-node patchset. If
    necessary, it can be merged with the next patch in this set without problems.
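
    A sketch of the helper (assumed shape):

    static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
    {
            /* pick the node's zonelist appropriate for the GFP zone bits */
            return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
    }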

    Reviewed-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman