10 Oct, 2014

3 commits

  • - get_vma_policy(task) is not safe if task != current, remove this
    argument.

    - get_vma_policy() no longer has callers outside of mempolicy.c,
    make it static.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Extract the code which looks up a vma's policy from get_vma_policy()
    into a new helper, __get_vma_policy(). Export get_task_policy().

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
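
    A rough sketch of the refactoring described in the entry above, assuming
    the usual lookup shape in mm/mempolicy.c; the function bodies are an
    approximation, not the exact kernel hunk. The vma-specific lookup is
    split into __get_vma_policy(), and the wrapper falls back to the calling
    task's policy.

        struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
                                           unsigned long addr)
        {
                struct mempolicy *pol = NULL;

                if (vma) {
                        if (vma->vm_ops && vma->vm_ops->get_policy)
                                pol = vma->vm_ops->get_policy(vma, addr); /* e.g. shmem */
                        else if (vma->vm_policy)
                                pol = vma->vm_policy;
                }
                return pol;
        }

        static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
                                                unsigned long addr)
        {
                struct mempolicy *pol = __get_vma_policy(vma, addr);

                if (!pol)
                        pol = get_task_policy(current); /* exported per this entry */
                return pol;
        }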
     
  • 1. vma_policy_mof(task) is simply not safe unless task == current;
    it can race with do_exit()->mpol_put(). Remove this arg and update
    its single caller.

    2. vma cannot be NULL, so remove this check and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Jun, 2014

1 commit

  • Currently hugepage migration is available for all archs which support
    pmd-level hugepage, but testing is done only for x86_64 and there are
    bugs for other archs. So, to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
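
    A rough sketch of the vma_migratable() check described above; the helper
    name hugepage_migration_support() is from memory of that kernel era and
    should be treated as an assumption, not a quote of the hunk.

        /* vma_migratable(): additionally refuse hugetlb VMAs whose page size
         * the architecture cannot migrate, so move_pages() fails cleanly
         * instead of hitting the BUG_ON() in follow_page(FOLL_GET). */
        if (vma->vm_flags & VM_HUGETLB) {
                if (!hugepage_migration_support(hstate_vma(vma)))
                        return 0;
        }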
     

08 Apr, 2014

2 commits

  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().

    Running TCP_RR with netperf-2.4.5 through localhost on a 16-cpu machine
    with 64GB of memory and without a mempolicy:

    threads     before      after
         16    1249409    1244487
         32    1281786    1246783
         48    1239175    1239138
         64    1244642    1241841
         80    1244346    1248918
         96    1266436    1254316
        112    1307398    1312135
        128    1327607    1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using the freed bit shortly
    for memcg oom reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
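
    The change described above boils down to replacing the flag test with a
    direct pointer test in the CONFIG_SLAB allocation path; a simplified
    sketch (the real check in mm/slab.c also folds in cpuset slab spreading,
    and the helper call is shown from memory):

        /* before: a dedicated per-process flag mirrored current->mempolicy */
        if (unlikely(current->flags & PF_MEMPOLICY))
                objp = alternate_node_alloc(cachep, flags);

        /* after: test the mempolicy pointer itself; the flag bit is freed up */
        if (unlikely(current->mempolicy))
                objp = alternate_node_alloc(cachep, flags);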
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes with
    mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Jan, 2014

1 commit

  • Mempolicies only exist for CONFIG_NUMA configurations. Therefore, a
    certain class of functions is unneeded in configurations where
    CONFIG_NUMA is disabled, such as functions that duplicate existing
    mempolicies, look up existing policies, set certain mempolicy traits, or
    test mempolicies for certain attributes.

    Remove the unneeded functions so that any future callers get a compile-
    time error and protect their code with CONFIG_NUMA as required.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Nov, 2013

1 commit

  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when doing
    *p++ = '=' and *p++ = ':' and maxlen has been reached.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
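
    A small, self-contained userspace analogue of the behaviour described
    above -- truncate like snprintf() when the buffer is short and fall back
    to "unknown" for an unrecognized mode. It is illustrative only, not the
    kernel's mpol_to_str().

        #include <stdio.h>

        enum mode { MODE_DEFAULT, MODE_PREFER, MODE_BIND, MODE_INTERLEAVE, MODE_MAX };

        static const char * const mode_names[MODE_MAX] = {
                "default", "prefer", "bind", "interleave",
        };

        /* Never fails: unknown modes become "unknown", short buffers truncate. */
        static void mode_to_str(char *buf, size_t maxlen, int mode)
        {
                const char *name = "unknown";

                if (mode >= 0 && mode < MODE_MAX && mode_names[mode])
                        name = mode_names[mode];
                snprintf(buf, maxlen, "%s", name);      /* always NUL-terminates */
        }

        int main(void)
        {
                char small[5];

                mode_to_str(small, sizeof(small), MODE_INTERLEAVE);
                printf("%s\n", small);                  /* "inte": truncated, terminated */
                mode_to_str(small, sizeof(small), 42);
                printf("%s\n", small);                  /* "unkn" */
                return 0;
        }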
     

09 Oct, 2013

1 commit

  • There is a 90% regression observed with a large Oracle performance test
    on a 4 node system. Profiles indicated that the overhead was due to
    contention on sp_lock when looking up shared memory policies. These
    policies do not have the appropriate flags to allow them to be
    automatically balanced so trapping faults on them is pointless. This
    patch skips VMAs that do not have MPOL_F_MOF set.

    [riel@redhat.com: Initial patch]

    Signed-off-by: Mel Gorman
    Reported-and-tested-by: Joe Mario
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-32-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
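
    A sketch of the check described above, in the NUMA hinting-fault scanner;
    at this point vma_policy_mof() still took a task argument (it later lost
    it, see the 10 Oct 2014 entries), and the surrounding loop is paraphrased
    rather than quoted.

        /* task_numa_work()-style VMA walk: don't trap hinting faults on VMAs
         * whose policy is not marked migrate-on-fault (MPOL_F_MOF); for
         * shared policies that only buys sp_lock contention, never migration. */
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (!vma_migratable(vma) || !vma_policy_mof(p, vma))
                        continue;
                /* ... mark the range pte_numa so faults are trapped ... */
        }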
     

12 Sep, 2013

2 commits

  • Enable hugepage migration from migrate_pages(2), move_pages(2), and
    mbind(2).

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying. Add a new trivial helper that does
    mpol_dup() + vma_set_policy() to simplify the callers.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
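
    The helper this entry adds is essentially mpol_dup() plus storing the
    result on the destination vma; a sketch of its likely shape (the name
    vma_dup_policy() and the error handling are as I recall them, not
    necessarily the exact kernel code):

        int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
        {
                struct mempolicy *pol = mpol_dup(vma_policy(src));

                if (IS_ERR(pol))
                        return PTR_ERR(pol);
                dst->vm_policy = pol;           /* i.e. vma_set_policy(dst, pol) */
                return 0;
        }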
     

03 Jan, 2013

2 commits

  • Sasha was fuzzing with trinity and reported the following problem:

    BUG: sleeping function called from invalid context at kernel/mutex.c:269
    in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
    2 locks held by trinity-main/6361:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0
    #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0
    Pid: 6361, comm: trinity-main Tainted: G W
    3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
    Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

    This was triggered by a different version of automatic NUMA balancing,
    but in theory the current version is vulnerable to the same problem.

    do_numa_page
    -> numa_migrate_prep
    -> mpol_misplaced
    -> get_vma_policy
    -> shmem_get_policy

    It's very unlikely this will happen as shared pages are not marked
    pte_numa -- see the page_mapcount() check in change_pte_range() -- but
    it is possible.

    To address this, this patch restores sp->lock as originally implemented
    by Kosaki Motohiro. In the path where get_vma_policy() is called, it
    should not be calling sp_alloc() so it is not necessary to treat the PTL
    specially.

    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused argument (formerly no_context) from mpol_parse_str()
    and from mpol_to_str().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

1 commit

  • This patch provides a new function to test whether a page resides
    on a node that is appropriate for the mempolicy for the vma and
    address where the page is supposed to be mapped. This involves
    looking up the node where the page belongs. So, the function
    returns that node so that it may be used to allocate the page
    without consulting the policy again.

    A subsequent patch will call this function from the fault path.
    Because of this, I don't want to go ahead and allocate the page, e.g.,
    via alloc_page_vma() only to have to free it if it has the correct
    policy. So, I just mimic the alloc_page_vma() node computation
    logic--sort of.

    Note: we could use this function to implement a MPOL_MF_STRICT
    behavior when migrating pages to match mbind() mempolicy--e.g.,
    to ensure that pages in an interleaved range are reinterleaved
    rather than left where they are when they reside on any node in
    the interleave nodemask.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Linus Torvalds
    [ Added MPOL_F_LAZY to trigger migrate-on-fault;
    simplified code now that we don't have to bother
    with special crap for interleaved ]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Lee Schermerhorn
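
    A hedged sketch of the fault-path usage anticipated here; the caller shape
    and the migrate_page_to_node() helper are illustrations of the pattern,
    not code from the patch.

        /* NUMA fault path: ask which node the policy would place this page
         * on. A negative return means "leave it where it is"; otherwise the
         * returned node doubles as the migration target, so the policy does
         * not have to be consulted a second time. */
        int target_nid = mpol_misplaced(page, vma, addr);

        if (target_nid >= 0)
                migrate_page_to_node(page, target_nid); /* hypothetical helper */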
     

07 Dec, 2012

1 commit

  • This fixes a regression in 3.7-rc, which has since gone into stable.

    Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
    imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
    refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
    on expecting alloc_page_vma() to drop the refcount it had acquired.
    This deserves a rework: but for now fix the leak in shmem_alloc_page().

    Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
    the same refcounting there as in shmem_alloc_page(), delete its onstack
    mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
    those were invented to let swapin_readahead() make an unknown number of
    calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
    alloc_pages_vma() has kept refcount in balance, so now no problem.

    Reported-and-tested-by: Tommi Rantala
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Oct, 2012

1 commit


09 Oct, 2012

2 commits

  • shared_policy_replace() use of sp_alloc() is unsafe. 1) sp_node cannot
    be dereferenced if sp->lock is not held and 2) another thread can modify
    sp_node between spin_unlock for allocating a new sp node and next
    spin_lock. The bug was introduced before 2.6.12-rc2.

    Kosaki's original patch for this problem was to allocate an sp node and
    policy within shared_policy_replace and initialise it when the lock is
    reacquired. I was not keen on this approach because it partially
    duplicates sp_alloc(). As the paths where sp->lock is taken are not that
    performance critical, this patch converts sp->lock to sp->mutex so it can
    sleep when calling sp_alloc().

    [kosaki.motohiro@jp.fujitsu.com: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off the
    VMA; it has since lost its original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. Seems like
    nobody cares about it; it is not exported to userspace directly, it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or the pair
    VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO | VM_DONTEXPAND | VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

20 Jun, 2012

1 commit

  • slab_node() could access current->mempolicy from interrupt context.
    However there's a race condition during exit where the mempolicy
    is first freed and then the pointer zeroed.

    Using this from interrupts seems bogus anyway. The interrupt
    will interrupt a random process and therefore get a random
    mempolicy. Many times, this will be idle's, which no one can change.

    Just disable this here and always use the local node for slab
    from interrupts. I also cleaned up the callers of slab_node() a bit,
    which always passed the same argument.

    I believe the original mempolicy code did that in fact,
    so it's likely a regression.

    v2: send version with correct logic
    v3: simplify. fix typo.
    Reported-by: Arun Sharma
    Cc: penberg@kernel.org
    Cc: cl@linux.com
    Signed-off-by: Andi Kleen
    [tdmackey@twitter.com: Rework control flow based on feedback from
    cl@linux.com, fix logic, and cleanup current task_struct reference]
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Mackey
    Signed-off-by: Pekka Enberg

    Andi Kleen
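
    The guard described above amounts to bailing out to the local node before
    touching current->mempolicy when running in interrupt context; a minimal
    sketch (whether the local node came from numa_node_id() or numa_mem_id()
    at the time is a detail I am not certain of):

        /* slab_node()-style lookup: an interrupt runs on top of an arbitrary
         * task, so that task's mempolicy is both meaningless here and racy
         * against its exit path -- just use the local node. */
        if (in_interrupt())
                return numa_node_id();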
     

30 May, 2012

1 commit

  • s/from_nodes/from/ and s/to_nodes/to/. The "_nodes" is redundant - it
    duplicates the argument's type.

    Done in a fit of irritation over 80-col issues :(

    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Jan, 2012

1 commit


25 May, 2011

2 commits

  • When CONFIG_TMPFS=n mpol_to_str() is not declared in mempolicy.h.
    However, in the NUMA case, the definition is always compiled.

    Since it is not strictly true that tmpfs is the only client, and since the
    symbol was always lurking around anyways, export mpol_to_str()
    unconditionally. Furthermore, this will allow us to move show_numa_map()
    out of mempolicy.c and into the procfs subsystem.

    Signed-off-by: Stephen Wilson
    Cc: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     
  • In commit 48fce3429d ("mempolicies: unexport get_vma_policy()")
    get_vma_policy() was marked static as all clients were local to
    mempolicy.c.

    However, the decision to generate /proc/pid/numa_maps in the numa memory
    policy code and outside the procfs subsystem introduces an artificial
    interdependency between the two systems. Exporting get_vma_policy() once
    again is the first step to clean up this interdependency.

    Signed-off-by: Stephen Wilson
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Alexey Dobriyan
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Wilson
     

10 Aug, 2010

1 commit

  • The oom killer presently kills current whenever there is no more memory
    free or reclaimable on its mempolicy's nodes. There is no guarantee that
    current is a memory-hogging task or that killing it will free any
    substantial amount of memory, however.

    In such situations, it is better to scan the tasklist for tasks that are
    allowed to allocate on current's set of nodes and kill the one with the
    highest badness() score. This ensures that the most memory-hogging task,
    or the one configured by the user with /proc/pid/oom_adj, is always
    selected in such scenarios.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 May, 2010

1 commit

  • Nick Piggin reported that the allocator may see an empty nodemask when
    changing cpuset's mems[1]. It happens only on kernels that do not do
    atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

    But I found that there is also a problem on kernels that can do atomic
    nodemask_t stores. The problem is that the allocator can't find a node
    to allocate a page from when changing a cpuset's mems, even though there
    is a lot of free memory. The reason is like this:

    (mpol: mempolicy)
    task1                      task1's mpol   task2
    alloc page                 1
      alloc on node0? NO       1
                               1              change mems from 1 to 0
                               1              rebind task1's mpol
                               0-1              set new bits
                               0                clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    I can reproduce it with the attached program by the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
      <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

    several hours later, an oom will happen even though there is a lot of free memory.

    This patchset fixes this problem by expanding the nodes range first (set
    newly allowed bits) and shrinking it lazily (clear newly disallowed bits).
    So we use a variable to tell the write-side task that a read-side task is
    reading the nodemask, and the write-side task clears newly disallowed
    nodes after the read-side task ends its current memory allocation.

    This patch:

    In order to fix the "no node to alloc memory" problem, when we want to
    update mempolicy and mems_allowed, we expand the set of nodes first (set
    all the newly allowed nodes) and shrink the set of nodes lazily (clear
    the disallowed nodes). But the mempolicy's rebind functions may break
    the expanding.

    So we restructure the mempolicy's rebind functions and split the rebind
    work into two steps, just like the update of cpuset's mems: the 1st step
    expands the set of the mempolicy's nodes, and the 2nd step shrinks it.
    The two-step form is used when there is no real lock to protect the
    mempolicy on the read side; otherwise we can do the rebind work at once.

    In order to implement it, we define

    enum mpol_rebind_step {
            MPOL_REBIND_ONCE,
            MPOL_REBIND_STEP1,
            MPOL_REBIND_STEP2,
            MPOL_REBIND_NSTEP,
    };

    If the mempolicy needn't be updated by two steps, we can pass
    MPOL_REBIND_ONCE to the rebind functions. Or we can pass
    MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
    MPOL_REBIND_STEP2 to do the second step work.

    Besides that, it may be a long time between these two steps, and we have
    to release the lock that protects mempolicy and mems_allowed. If we take
    the lock once again, we must check whether the current mempolicy is in
    the middle of a rebind (the first step has been done) or not, because the
    task may install a new mempolicy while we don't hold the lock. So we
    defined the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

    The new functions will be used in the next patch.

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
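
    A self-contained userspace illustration of why the two-step ordering
    works: expanding first means a concurrent reader always sees at least one
    allowed node, and the shrink happens only after readers are done. Plain
    unsigned long bitmasks stand in for nodemask_t; this shows the idea, not
    the kernel's rebind code.

        #include <stdio.h>

        static unsigned long allowed;           /* stands in for the policy nodemask */

        static void rebind_step1(unsigned long newmask)
        {
                allowed |= newmask;             /* expand: set newly allowed nodes  */
        }

        static void rebind_step2(unsigned long newmask)
        {
                allowed = newmask;              /* shrink: drop disallowed nodes    */
        }

        int main(void)
        {
                allowed = 0x2;                                  /* node 1           */
                rebind_step1(0x1);                              /* moving to node 0 */
                printf("during rebind: %#lx\n", allowed);       /* 0x3, never empty */
                rebind_step2(0x1);
                printf("after rebind:  %#lx\n", allowed);       /* 0x1              */
                return 0;
        }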
     

16 Dec, 2009

1 commit

  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
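
    A rough sketch of the nodes_allowed derivation summarized in the bullet
    list above. The function name and the exact field accesses are
    illustrative; the real helper lives in mm/mempolicy.c and handles the
    "local" preferred case and allocation details differently.

        /* Fill *mask from current's mempolicy; return false for the default
         * (NULL) policy so the caller falls back to node_online_map. */
        bool nodes_allowed_from_mempolicy(nodemask_t *mask)
        {
                struct mempolicy *pol = current->mempolicy;

                if (!pol)                               /* default policy */
                        return false;
                switch (pol->mode) {
                case MPOL_PREFERRED:                    /* includes explicit local */
                        init_nodemask_of_node(mask, pol->v.preferred_node);
                        break;
                case MPOL_BIND:
                case MPOL_INTERLEAVE:
                        *mask = pol->v.nodes;           /* use the policy's nodemask */
                        break;
                default:
                        return false;                   /* treat anything else as default */
                }
                return true;
        }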
     

25 Jul, 2008

1 commit

  • We'd like to support CONFIG_MEMORY_HOTREMOVE on s390, which depends on
    CONFIG_MIGRATION. So far, CONFIG_MIGRATION is only available with NUMA
    support.

    This patch makes CONFIG_MIGRATION selectable for architectures that define
    ARCH_ENABLE_MEMORY_HOTREMOVE. When MIGRATION is enabled w/o NUMA, the
    kernel won't compile because migrate_vmas() does not know about
    vm_ops->migrate() and vma_migratable() does not know about policy_zone.
    To fix this, those two functions can be restricted to '#ifdef CONFIG_NUMA'
    because they are not being used w/o NUMA. vma_migratable() is moved over
    from migrate.h to mempolicy.h.

    [kosaki.motohiro@jp.fujitsu.com: build fix]
    Acked-by: Christoph Lameter
    Signed-off-by: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

28 Apr, 2008

12 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
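
    A sketch of the shmem_get_sbmpol() addition described above, following
    the stated locking rule (take the reference under the sb stat_lock so a
    concurrent remount cannot free the policy); details may differ from the
    actual patch.

        static struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
        {
                struct mempolicy *mpol = NULL;

                if (sbinfo->mpol) {
                        spin_lock(&sbinfo->stat_lock);  /* vs. remount replacing mpol */
                        mpol = sbinfo->mpol;
                        mpol_get(mpol);         /* dropped in mpol_shared_policy_init() */
                        spin_unlock(&sbinfo->stat_lock);
                }
                return mpol;
        }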
     
  • mm/shmem.c currently contains functions to parse and display memory policy
    strings for the tmpfs 'mpol' mount option. Move this to mm/mempolicy.c with
    the rest of the mempolicy support. With subsequent patches, we'll be able to
    remove knowledge of the details [mode, flags, policy, ...] completely from
    shmem.c

    1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
    mm/mempolicy.c. Rework to use the policy_types[] array [used by
    mpol_to_str()] to look up mode by name.

    2) use mpol_to_str() to format policy for shmem_show_mpol(). mpol_to_str()
    expects a pointer to a struct mempolicy, so temporarily construct one.
    This will be replaced with a reference to a struct mempolicy in the tmpfs
    superblock in a subsequent patch.

    NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
    sign '=' as the nodemask delimiter to match mpol_parse_str() and the
    tmpfs/shmem mpol mount option formatting that now uses mpol_to_str(). This
    is a user visible change to numa_maps, but then the addition of the mode
    flags already changed the display. It makes sense to me to have the mounts
    and numa_maps display the policy in the same format. However, if anyone
    objects strongly, I can pass the desired nodemask delimiter as an arg to
    mpol_to_str().

    Note 2: Like show_numa_map(), I don't check the return code from
    mpol_to_str(). I do use a longer buffer than the one provided by
    show_numa_map(), which seems to have sufficed so far.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Now that we're using "preferred local" policy for system default, we need to
    make this as fast as possible. Because of the variable size of the mempolicy
    structure [based on size of nodemasks], the preferred_node may be in a
    different cacheline from the mode. This can result in accessing an extra
    cacheline in the normal case of system default policy. Suspect this is the
    cause of an observed 2-3% slowdown in page fault testing relative to kernel
    without this patch series.

    To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
    flags member which is guaranteed [?] to be in the same cacheline as the mode
    itself.

    Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
    for both anon and shmem segments with system default and vma [preferred local]
    policy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy referencing counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of mempolicy, based on (vma, address) need only add a reference for
    shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
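
    The conditional put described above reduces to checking the MPOL_F_SHARED
    flag covered in the next entry below; a sketch, with field and flag names
    taken from the surrounding entries rather than quoted from the patch:

        /* Only shared policies carry the extra reference taken at lookup
         * time; anything else must not be unref'd here. */
        static inline void mpol_cond_put(struct mempolicy *pol)
        {
                if (pol && (pol->flags & MPOL_F_SHARED))
                        __mpol_put(pol);
        }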
     
  • As part of yet another rework of mempolicy reference counting, we want to be
    able to identify shared policies efficiently, because they have an extra ref
    taken on lookup that needs to be removed when we're finished using the policy.

    Note: the extra ref is required because the policies are
    shared between tasks/processes and can be changed/freed
    by one task while another task is using them--e.g., for
    page allocation.

    Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
    indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
    member of the struct mempolicy added by David. MPOL_F_SHARED, and any future
    "internal mode flags" are reserved from bit zero up, as they will never be
    passed in the upper bits of the mode argument of a mempolicy API.

    I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
    rb-tree. Don't need/want to clear the flag when removing from the tree as the
    mempolicy is freed [unref'd] internally to the sp_delete() function. However,
    a task could hold another reference on this mempolicy from a prior lookup. We
    need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
    unref, eventually freeing, the mempolicy.

    A later patch in this series will introduce a function to conditionally unref
    [mpol_free] a policy. The MPOL_F_SHARED flag is one reason [currently the
    only reason] to unref/free a policy via the conditional free.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • The terms 'policy' and 'mode' are both used in various places to describe the
    semantics of the value stored in the 'policy' member of struct mempolicy.
    Furthermore, the term 'policy' is used to refer to that member, to the entire
    struct mempolicy and to the more abstract concept of the tuple consisting of a
    "mode" and an optional node or set of nodes. Recently, we have added "mode
    flags" that are passed in the upper bits of the 'mode' [or sometimes,
    'policy'] member of the numa APIs.

    I'd like to resolve this confusion, which perhaps only exists in my mind, by
    renaming the 'policy' member to 'mode' throughout, and fixing up the
    Documentation. Man pages will be updated separately.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Removes the forward declaration of vm_area_struct in linux/mempolicy.h. We already
    get it from the linux/slab.h -> linux/gfp.h include.

    Removes the unused mpol_set_vma_default() macro from linux/mempolicy.h.

    Removes the extern definition of default_policy since it is only referenced,
    as it should be, in mm/mempolicy.c.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
    nodemasks passed via set_mempolicy() or mbind() should be considered relative
    to the current task's mems_allowed.

    When the mempolicy is created, the passed nodemask is folded and mapped onto
    the current task's mems_allowed. For example, consider a task using
    set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
    nodemask of 1-3. If current's mems_allowed is 4-7, the resulting nodemask is
    5-7 (the second, third, and fourth node of mems_allowed).

    If the same task is attached to a cpuset, the mempolicy nodemask is rebound
    each time the mems are changed. Some possible rebinds and results are:

    mems       result
    1-3        1-3
    1-7        2-4
    1,5-6      1,5-6
    1,5-7      5-7

    Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
    to the resultant nodemask from the relative remap.

    In the MPOL_PREFERRED case, the preferred node is remapped from the currently
    effective nodemask to the relative nodemask.

    This mempolicy mode flag was conceived of by Paul Jackson .

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
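
    A self-contained userspace model of the relative folding in the examples
    above: relative node i selects the (i mod N)-th set bit of mems_allowed,
    where N is the number of allowed nodes. Plain bitmasks stand in for
    nodemask_t; this mirrors the worked examples, not the kernel's
    implementation.

        #include <stdio.h>

        static unsigned long remap_relative(unsigned long rel, unsigned long allowed)
        {
                int allowed_nodes[64], n = 0;
                unsigned long out = 0;

                for (int node = 0; node < 64; node++)       /* index the allowed nodes */
                        if (allowed & (1UL << node))
                                allowed_nodes[n++] = node;
                if (!n)
                        return 0;
                for (int i = 0; i < 64; i++)                /* fold relative bits onto them */
                        if (rel & (1UL << i))
                                out |= 1UL << allowed_nodes[i % n];
                return out;
        }

        int main(void)
        {
                unsigned long rel = 0xe;                     /* relative nodes 1-3 */

                printf("%#lx\n", remap_relative(rel, 0xf0)); /* mems 4-7   -> 0xe0 (5-7)   */
                printf("%#lx\n", remap_relative(rel, 0x62)); /* mems 1,5-6 -> 0x62 (1,5-6) */
                return 0;
        }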
     
  • Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
    node remap when the policy is rebound.

    Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
    a union with cpuset_mems_allowed:

    struct mempolicy {
            ...
            union {
                    nodemask_t cpuset_mems_allowed;
                    nodemask_t user_nodemask;
            } w;
    }

    that stores the nodemask that the user passed when he or she created the
    mempolicy via set_mempolicy() or mbind(). When using MPOL_F_STATIC_NODES,
    which is passed with any mempolicy mode, the user's passed nodemask
    intersected with the VMA or task's allowed nodes is always used when
    determining the preferred node, setting the MPOL_BIND zonelist, or creating
    the interleave nodemask. This happens whenever the policy is rebound,
    including when a task's cpuset assignment changes or the cpuset's mems are
    changed.

    This creates an interesting side-effect in that it allows the mempolicy
    "intent" to lie dormant and unaffected until it has access to the node(s)
    that it desires. For example, if you currently ask for an interleaved
    policy over a set of nodes that you do not have access to, the mempolicy
    is not created and the task continues to use the previous policy. With
    this change, however, it is possible to create the same mempolicy; it
    only takes effect when access to nodes in the nodemask is acquired.

    It is also possible to mount tmpfs with the static nodemask behavior when
    specifying a node or nodemask. To do this, simply add "=static" immediately
    following the mempolicy mode at mount time:

    mount -o remount mpol=interleave=static:1-3

    Also removes mpol_check_policy() and folds its logic into mpol_new() since it
    is now obsoleted. The unused vma_mpol_equal() is also removed.

    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
            ...
            unsigned short policy;
            unsigned short flags;
    }

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
            ...
    case MPOL_INTERLEAVE:
            ...
    };

    statements or

    if (policy == MPOL_INTERLEAVE) {
            ...
    }

    statements. Such applications would need to use optional mode flags when
    calling set_mempolicy() or mbind() for these previously implemented statements
    to stop working. If an application does start using optional mode flags, it
    will need to mask the optional flags off the policy in switch and conditional
    statements that only test mode.

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
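
    A minimal, self-contained sketch of the mode/flags split described above.
    The flag values are written out here for illustration; the real
    definitions live in include/linux/mempolicy.h.

        #include <stdio.h>

        #define MPOL_F_STATIC_NODES     (1 << 15)
        #define MPOL_F_RELATIVE_NODES   (1 << 14)
        #define MPOL_MODE_FLAGS         (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

        int main(void)
        {
                int policy = 3 /* e.g. MPOL_INTERLEAVE */ | MPOL_F_STATIC_NODES;
                unsigned short flags = policy & MPOL_MODE_FLAGS;
                unsigned short mode  = policy & ~MPOL_MODE_FLAGS;

                /* Userspace that tests the raw value must now mask the flags off: */
                if (mode == 3)
                        printf("interleave, flags 0x%x\n", flags);
                return 0;
        }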