12 Sep, 2013

6 commits

  • new_vma_page() is called only by page migration called from do_mbind(),
    where pages to be migrated are queued into a pagelist by
    queue_pages_range(). queue_pages_range() confirms that a queued page
    belongs to some vma, so the !vma case is not supposed to happen. This
    patch adds BUG_ON() to catch this unexpected case.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The function check_range() (and its family) is not well-named, because it
    does not only check something but also moves pages from list to list to
    migrate them. So queue_pages_*range is a more desirable name.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to
    migrate hugepage with mbind(2) after applying the enablement patch which
    comes later in this series.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Extend check_range() to handle vma with VM_HUGETLB set. We will be able
    to migrate hugepage with migrate_pages(2) after applying the enablement
    patch which comes later in this series.

    Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for
    example), we simply skip it now.

    Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by
    pmd/pud. This is not true in some architectures implementing hugepage
    with other mechanisms like ia64, but it's OK because pmd_huge/pud_huge
    simply return 0 in such arch and page walker simply ignores such
    hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If node == NUMA_NO_NODE, pol is NULL, so we should return NULL instead of
    doing the "if (!pol->mode)" check.

    [akpm@linux-foundation.org: reorganise code]
    Signed-off-by: Jianguo Wu
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Simple cleanup. Every user of vma_set_policy() does the same work, which
    looks a bit annoying imho. Add the new trivial helper which does
    mpol_dup() + vma_set_policy() to simplify the callers.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Aug, 2013

1 commit

  • vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area,
    area is the vma the caller needs to update and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Mar, 2013

2 commits

  • Currently, n_new is wrongly initialized: the start and end parameters are
    inverted. Let's fix it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Hillf Danton
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • n->end is accessed in sp_insert(). Thus it should be updated
    before calling sp_insert(). This mistake may cause a kernel panic.

    Signed-off-by: Hillf Danton
    Signed-off-by: KOSAKI Motohiro
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Feb, 2013

5 commits

  • Make a sweep through mm/ and convert code that uses -1 directly to using
    the more appropriate NUMA_NO_NODE.

    Signed-off-by: David Rientjes
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration of KSM pages is now safe: remove the PageKsm restrictions from
    mempolicy.c and migrate.c.

    But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
    are irrelevant to KSM: it looks as if that code was preventing hotremove
    migration of KSM pages, unless they happened to be in swapcache.

    There is some question as to whether enforcing a NUMA mempolicy migration
    ought to migrate KSM pages, mapped into entirely unrelated processes; but
    moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
    and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
    any area where this is a worry.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
    based on a nodemask as well as a gfp_mask"), but it does not match its
    comments, because it does not check zones above policy_zone.

    Also, commit b377fd3982ad ("Apply memory policies to top two highest
    zones when highest zone is ZONE_MOVABLE") tells us that if the
    highest zone is ZONE_MOVABLE, we should also apply memory policies to
    it, so ZONE_MOVABLE should be a valid zone for policies.
    is_valid_nodemask() needs to be changed to match it.

    Fix: check all zones, even those whose zoneid > policy_zone. Use
    nodes_intersects() instead of open code to check it.

    Reported-by: Wen Congyang
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
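
The difference between the old and the new check can be illustrated with a small userspace model. This is only a sketch in Python, not kernel code: the zone constants, the `populated` mapping and both function names are hypothetical stand-ins, and the "old" behaviour is modelled purely from the description above (zones above policy_zone are ignored).

```python
# Toy model of the nodemask validity check. All names are illustrative
# assumptions; the real code works on struct zone and nodemask_t.
ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE = 0, 1, 2


def old_is_valid_nodemask(nodemask, populated, policy_zone):
    """Old behaviour as described: only zones up to policy_zone count."""
    return any(
        zone <= policy_zone
        for nid in nodemask
        for zone in populated.get(nid, ())
    )


def new_is_valid_nodemask(nodemask, populated):
    """New behaviour as described: any populated zone on any node counts."""
    return any(populated.get(nid) for nid in nodemask)


# Node 0 has ZONE_NORMAL memory, node 1 only has ZONE_MOVABLE memory.
populated = {0: {ZONE_NORMAL}, 1: {ZONE_MOVABLE}}
mask = {1}
old_result = old_is_valid_nodemask(mask, populated, ZONE_NORMAL)
new_result = new_is_valid_nodemask(mask, populated)
```

In this model a nodemask naming only the ZONE_MOVABLE node is rejected by the old check and accepted by the new one, which is the case the commit message is concerned with.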
     

03 Jan, 2013

3 commits

  • Sasha was fuzzing with trinity and reported the following problem:

    BUG: sleeping function called from invalid context at kernel/mutex.c:269
    in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
    2 locks held by trinity-main/6361:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0
    #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0
    Pid: 6361, comm: trinity-main Tainted: G W
    3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
    Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

    This was triggered by a different version of automatic NUMA balancing
    but in theory the current version is vulnerable to the same problem.

    do_numa_page
    -> numa_migrate_prep
    -> mpol_misplaced
    -> get_vma_policy
    -> shmem_get_policy

    It's very unlikely this will happen as shared pages are not marked
    pte_numa -- see the page_mapcount() check in change_pte_range() -- but
    it is possible.

    To address this, this patch restores sp->lock as originally implemented
    by Kosaki Motohiro. In the path where get_vma_policy() is called, it
    should not be calling sp_alloc() so it is not necessary to treat the PTL
    specially.

    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused argument (formerly no_context) from mpol_parse_str()
    and from mpol_to_str().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
    mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts
    or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
    in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
    pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
    worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic.

    Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
    when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when
    no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
    which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
    With slab poisoning, you can then rely on mpol_to_str() to set the bit
    for node 0x6b6b, probably in the next page above the caller's stack.

    mpol_parse_str() is only called from shmem_parse_options(): no_context
    is always true, so call it unused for now, and remove !no_context code.
    Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
    expect. Then mpol_to_str() can ignore its no_context argument also,
    the mpol being appropriately initialized whether contextualized or not.
    Rename its no_context unused too, and let subsequent patch remove them
    (that's not needed for stable backporting, which would involve rejects).

    I don't understand why MPOL_LOCAL is described as a pseudo-policy:
    it's a reasonable policy which suffers from a confusing implementation
    in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be
    much more robust if MPOL_LOCAL were recognized in switch statements
    throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
    empty) nodes mask like everyone else, instead of its preferred_node
    variant (I presume an optimization from the days before MPOL_LOCAL).
    But that would take me too long to get right and fully tested.

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have mm, but not vma.

    This change is preparation for the huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Dec, 2012

1 commit


11 Dec, 2012

11 commits

  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • …ask<->node relationships

    Note: This two-stage filter was taken directly from the sched/numa patch
    "sched, numa, mm: Add the scanning page fault machinery" but is
    only a partial extraction. As the end result is not necessarily
    recognisable, the signed-offs-by had to be removed. Will be added
    back if requested.

    While it is desirable that all threads in a process run on its home
    node, this is not always possible or necessary. There may be more
    threads than exist within the node or the node might be over-subscribed
    with unrelated processes.

    This can cause a situation whereby a page gets migrated off its home
    node because the threads clearing pte_numa were running off-node. This
    patch uses page->last_nid to build a two-stage filter before pages get
    migrated to avoid problems with short or unlikely task<->node
    relationships.

    Signed-off-by: Mel Gorman <mgorman@suse.de>

    Mel Gorman
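
A minimal userspace sketch of such a two-stage filter is given below. The commit message only says that page->last_nid is used; the exact rule modelled here (migrate only when two consecutive hinting faults come from the same node) and the initial value of -1 are assumptions made for illustration, and the `Page` class and `should_migrate()` are hypothetical names.

```python
# Toy model of a two-stage filter keyed on the last faulting node.
class Page:
    def __init__(self):
        # -1 stands for "no hinting fault recorded yet" (assumed sentinel).
        self.last_nid = -1


def should_migrate(page, faulting_nid):
    """Record the faulting node and report whether it matches the previous one.

    A single fault from an unusual node only updates last_nid; a second
    consecutive fault from the same node is what allows migration.
    """
    previous = page.last_nid
    page.last_nid = faulting_nid
    return previous == faulting_nid


page = Page()
first = should_migrate(page, 1)    # first fault from node 1: filtered
second = should_migrate(page, 1)   # second consecutive fault from node 1
stray = should_migrate(page, 0)    # one-off fault from node 0: filtered
```

The point of the second stage is that a short-lived task<->node relationship, such as one stray access from another node, does not move the page on its own.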
     
  • This is the simplest possible policy that still does something of note.
    When a pte_numa is faulted, it is moved immediately. Any replacement
    policy must at least do better than this and in all likelihood this
    policy regresses normal workloads.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u         = basic unit = sizeof(void *)
    Ca        = cost of struct page access = sizeof(struct page) / u
    Cpte      = Cost PTE access = Ca
    Cupdate   = Cost PTE update = (2 * Cpte) + (2 * Wlock)
                where Cpte is incurred twice for a read and a write and Wlock
                is a constant representing the cost of taking or releasing a
                lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw   = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci        = Cost of page isolation = Ca + Wi
                where Wi is a constant that should reflect the approximate cost
                of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
                where Wnuma is the approximate NUMA factor. 1 is local. 1.2
                would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
    where this is measured for the last N number of
    numa hints recorded. When the workload is fully
    converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
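
The costing model above can be evaluated with a few lines of Python. This is a sketch only: the default values for the pointer size, sizeof(struct page), PAGE_SIZE, Wlock and Wi are assumptions chosen for illustration, and the class and function names are hypothetical. Only the formulas themselves come from the commit message; note that the total uses Cpte rather than Cupdate, exactly as written there.

```python
from dataclasses import dataclass


@dataclass
class CostParams:
    ptr_size: int = 8             # u = sizeof(void *), assuming 64-bit
    struct_page_size: int = 64    # sizeof(struct page), assumed value
    page_size: int = 4096         # PAGE_SIZE, assumed value
    w_isolate: float = 1.0        # Wi, placeholder constant
    w_numa: float = 1.2           # Wnuma, the example NUMA factor
    c_numahint: float = 1000.0    # Cnumahint, "some high constant"


def balancing_cost(p, numa_pte_updates, numa_hint_faults, numa_pages_migrated):
    """Evaluate the 'Balancing cost' formula from the vmstat counters."""
    c_a = p.struct_page_size / p.ptr_size          # Ca
    c_pte = c_a                                    # Cpte
    c_pagerw = c_a + p.page_size / p.ptr_size      # Cpagerw
    c_i = c_a + p.w_isolate                        # Ci
    c_pagecopy = c_pagerw + c_pagerw * p.w_numa + c_i + c_i * p.w_numa
    return (c_pte * numa_pte_updates
            + p.c_numahint * numa_hint_faults
            + c_i * numa_pages_migrated
            + c_pagecopy * numa_pages_migrated)


def convergence(numa_hint_faults_local, numa_hint_faults):
    """Ratio of local hinting faults; None when no faults were recorded."""
    if numa_hint_faults == 0:
        return None
    return numa_hint_faults_local / numa_hint_faults


params = CostParams()
cost = balancing_cost(params, numa_pte_updates=10,
                      numa_hint_faults=0, numa_pages_migrated=0)
ratio = convergence(5, 10)
```

The counter names match the vmstat fields named in the commit message, so the inputs could be taken from two /proc/vmstat snapshots; the absolute result is only meaningful relative to the chosen constants.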
     
  • The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
    explicitly request lazy migration is a good idea but the actual
    API has not been well reviewed and once released we have to support it.
    For now this patch prevents an application using the services. This
    will need to be revisited.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • This patch converts change_prot_numa() to use change_protection(). As
    pte_numa and friends check the PTE bits directly it is necessary for
    change_protection() to use pmd_mknuma(). Hence the required
    modifications to change_protection() are a little clumsy but the
    end result is that most of the numa page table helpers are just one or
    two instructions.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • NOTE: Once again there is a lot of patch stealing and the end result
    is sufficiently different that I had to drop the signed-offs.
    Will re-add if the original authors are ok with that.

    This patch adds another mbind() flag to request "lazy migration". The
    flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
    pages are marked PROT_NONE. The pages will be migrated in the fault
    path on "first touch", if the policy dictates at that time.

    "Lazy Migration" will allow testing of migrate-on-fault via mbind().
    Also allows applications to specify that only subsequently touched
    pages be migrated to obey new policy, instead of all pages in range.
    This can be useful for multi-threaded applications working on a
    large shared data area that is initialized by an initial thread
    resulting in all pages on one [or a few, if overflowed] nodes.
    After PROT_NONE, the pages in regions assigned to the worker threads
    will be automatically migrated local to the threads on 1st touch.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Lee Schermerhorn
     
  • This patch provides a new function to test whether a page resides
    on a node that is appropriate for the mempolicy for the vma and
    address where the page is supposed to be mapped. This involves
    looking up the node where the page belongs. So, the function
    returns that node so that it may be used to allocated the page
    without consulting the policy again.

    A subsequent patch will call this function from the fault path.
    Because of this, I don't want to go ahead and allocate the page, e.g.,
    via alloc_page_vma() only to have to free it if it has the correct
    policy. So, I just mimic the alloc_page_vma() node computation
    logic--sort of.

    Note: we could use this function to implement a MPOL_MF_STRICT
    behavior when migrating pages to match mbind() mempolicy--e.g.,
    to ensure that pages in an interleaved range are reinterleaved
    rather than left where they are when they reside on any page in
    the interleave nodemask.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Linus Torvalds
    [ Added MPOL_F_LAZY to trigger migrate-on-fault;
    simplified code now that we don't have to bother
    with special crap for interleaved ]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Lee Schermerhorn
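
The idea of "compute the node the policy wants and compare it with the node the page is on" can be sketched as a userspace model. Everything below is an illustrative assumption rather than the kernel implementation: the mode strings, the `misplaced_target()` name, the use of -1 for "page is fine where it is", and the simplified node selection (first node for bind, index modulo node count for interleave) are all hypothetical.

```python
# Toy model of a misplacement check against a memory policy.
MPOL_PREFERRED = "preferred"
MPOL_BIND = "bind"
MPOL_INTERLEAVE = "interleave"


def misplaced_target(page_nid, mode, nodes, local_nid, page_index=0):
    """Return the node the page should live on, or -1 if it is well placed.

    nodes is a sorted list of node ids allowed by the policy. An empty list
    is only meaningful for MPOL_PREFERRED, where it means "local node".
    """
    if mode == MPOL_PREFERRED:
        target = nodes[0] if nodes else local_nid
    elif mode == MPOL_BIND:
        if not nodes:
            raise ValueError("bind policy needs at least one node")
        if page_nid in nodes:
            return -1            # any node in the mask is acceptable
        target = nodes[0]
    elif mode == MPOL_INTERLEAVE:
        if not nodes:
            raise ValueError("interleave policy needs at least one node")
        target = nodes[page_index % len(nodes)]
    else:
        raise ValueError("unknown policy mode: %r" % (mode,))
    return -1 if target == page_nid else target


# A page on node 0 under a policy preferring node 1 should move to node 1.
wanted = misplaced_target(0, MPOL_PREFERRED, [1], local_nid=0)
```

Returning the target node rather than a boolean mirrors the point made above: the caller can allocate on that node directly without consulting the policy a second time.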
     
  • This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
    to mbind(). When the NOOP policy is used with the 'MOVE and 'LAZY
    flags, mbind() will map the pages PROT_NONE so that they will be
    migrated on the next touch.

    This allows an application to prepare for a new phase of operation
    where different regions of shared storage will be assigned to
    worker threads, w/o changing policy. Note that we could just use
    "default" policy in this case. However, this also allows an
    application to request that pages be migrated, only if necessary,
    to follow any arbitrary policy that might currently apply to a
    range of pages, without knowing the policy, or without specifying
    multiple mbind()s for ranges with different policies.

    [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

    Bug-Reported-by: Reported-by: Fengguang Wu
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Lee Schermerhorn
     
  • Make MPOL_LOCAL a real and exposed policy such that applications that
    relied on the previous default behaviour can explicitly request it.

    Requested-by: Christoph Lameter
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Peter Zijlstra
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

07 Dec, 2012

1 commit

  • This fixes a regression in 3.7-rc, which has since gone into stable.

    Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
    imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
    refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
    on expecting alloc_page_vma() to drop the refcount it had acquired.
    This deserves a rework: but for now fix the leak in shmem_alloc_page().

    Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
    the same refcounting there as in shmem_alloc_page(), delete its onstack
    mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
    those were invented to let swapin_readahead() make an unknown number of
    calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
    alloc_pages_vma() has kept refcount in balance, so now no problem.

    Reported-and-tested-by: Tommi Rantala
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Oct, 2012

1 commit

  • When reading /proc/pid/numa_maps, it's possible to return the contents of
    the stack where the mempolicy string should be printed if the policy gets
    freed from beneath us.

    This happens because mpol_to_str() may return an error, and the
    stack-allocated buffer is then printed without ever being stored to.

    There are two possible error conditions in mpol_to_str():

    - if the buffer allocated is insufficient for the string to be stored,
    and

    - if the mempolicy has an invalid mode.

    The first error condition is not triggered in any of the callers to
    mpol_to_str(): at least 50 bytes is always allocated on the stack and this
    is sufficient for the string to be written. A future patch should convert
    this into BUILD_BUG_ON() since we know the maximum strlen possible, but
    that's not -rc material.

    The second error condition is possible if a race occurs in dropping a
    reference to a task's mempolicy causing it to be freed during the read().
    The slab poison value is then used for the mode and mpol_to_str() returns
    -EINVAL.

    This race is only possible because get_vma_policy() believes that
    mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
    does not hold mm->mmap_sem when dropping the reference or setting
    task->mempolicy to NULL: it uses task_lock(task) instead.

    Thus, it's required for the caller of a task mempolicy to hold
    task_lock(task) while grabbing the mempolicy and reading it. Callers with
    a vma policy store their mempolicy earlier and can simply increment the
    reference count so it's guaranteed not to be freed.

    Reported-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

6 commits

  • Revert commit 0def08e3acc2 because check_range() can't fail in
    migrate_to_node() considering the current use cases.

    Quote from Johannes

    : I think it makes sense to revert. Not because of the semantics, but I
    : just don't see how check_range() could even fail for this callsite:
    :
    : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
    : find_vma()
    :
    : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
    : and so can not fail
    :
    : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
    : continue until addr == end, so we never fail with -EIO

    I also added a new VM_BUG_ON to catch a future migrate_to_node() use case
    which might pass MPOL_MF_STRICT.

    Suggested-by: Johannes Weiner
    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vasiliy Kulikov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier
    related damage v3") introduced a potential memory corruption.
    shmem_alloc_page() uses a pseudo vma and it has one significant unique
    combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.

    get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
    and mpol_cond_put() DOES decrease a policy ref when a policy has
    MPOL_F_SHARED. Therefore, when a cpuset update race occurs,
    alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the
    reference count and frees the policy prematurely.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
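
The imbalance can be reproduced in a toy reference-counting model. This is a userspace sketch, not kernel code: the Python functions only borrow the kernel names for readability, the flag value is arbitrary, and the "fixed" branch models the later behaviour in which get_vma_policy() takes its own reference on a shared policy.

```python
# Toy model of the policy refcount across cpuset retries.
MPOL_F_SHARED = 1  # arbitrary flag value for this model


class Policy:
    def __init__(self, flags=0):
        self.flags = flags
        self.refcnt = 1      # reference held by the owner of the policy
        self.freed = False


def mpol_get(pol):
    pol.refcnt += 1


def mpol_put(pol):
    pol.refcnt -= 1
    if pol.refcnt == 0:
        pol.freed = True


def mpol_cond_put(pol):
    # Drops a reference only for shared policies.
    if pol.flags & MPOL_F_SHARED:
        mpol_put(pol)


def alloc_pages_vma(pol, cpuset_retries, fixed):
    """One pass per attempt; each retry repeats the lookup and the put."""
    for _ in range(cpuset_retries + 1):
        if fixed and pol.flags & MPOL_F_SHARED:
            mpol_get(pol)    # fixed lookup takes its own reference
        mpol_cond_put(pol)   # dropped on every pass, including retries


# Buggy case: the caller's lookup reference is consumed by the first pass,
# and the retry pass drops the owner's reference as well.
buggy = Policy(MPOL_F_SHARED)
mpol_get(buggy)                       # reference taken by the caller's lookup
alloc_pages_vma(buggy, cpuset_retries=1, fixed=False)
```

With no retry the counts balance in the buggy model, which may be why the problem only shows up when a cpuset update races with the allocation.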
     
  • When shared_policy_replace() fails to allocate, new->policy is not freed
    correctly by mpol_set_shared_policy(). The problem is that the shared
    mempolicy code directly calls kmem_cache_free() in multiple places, where
    it is easy to make a mistake.

    This patch creates an sp_free wrapper function and uses it. The bug was
    introduced pre-git age (IOW, before 2.6.12-rc2).

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The use of sp_alloc() in shared_policy_replace() is unsafe: 1) sp_node
    cannot be dereferenced if sp->lock is not held and 2) another thread can
    modify sp_node between the spin_unlock for allocating a new sp node and
    the next spin_lock. The bug was introduced before 2.6.12-rc2.

    Kosaki's original patch for this problem was to allocate an sp node and
    policy within shared_policy_replace and initialise it when the lock is
    reacquired. I was not keen on this approach because it partially
    duplicates sp_alloc(). As the paths where sp->lock is taken are not that
    performance critical, this patch converts sp->lock to sp->mutex so it can
    sleep when calling sp_alloc().
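
    The locking pattern can be sketched in userspace C with a pthread
    mutex. This is only an analogy under stated assumptions: range_list,
    range_node and the range_list_*() functions are hypothetical names,
    a linked list stands in for the kernel's tree, and malloc() stands in
    for an allocation that may sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct range_node {
    unsigned long start, end;
    struct range_node *next;
};

struct range_list {
    pthread_mutex_t mutex;      /* sleeping lock: allocating under it is fine */
    struct range_node *head;
    int count;
};

void range_list_init(struct range_list *rl)
{
    assert(rl != NULL);
    pthread_mutex_init(&rl->mutex, NULL);
    rl->head = NULL;
    rl->count = 0;
}

/* Returns 0 on success, -1 if the allocation failed. */
int range_list_add(struct range_list *rl, unsigned long start,
                   unsigned long end)
{
    struct range_node *n;
    int ret = 0;

    pthread_mutex_lock(&rl->mutex);
    /*
     * The allocation happens with the lock held, so no other thread can
     * change the list between the allocation and the insertion.
     */
    n = malloc(sizeof(*n));
    if (!n) {
        ret = -1;
        goto out;
    }
    n->start = start;
    n->end = end;
    n->next = rl->head;
    rl->head = n;
    rl->count++;
out:
    pthread_mutex_unlock(&rl->mutex);
    return ret;
}

/* Frees every node; the list can be reused afterwards. */
void range_list_clear(struct range_list *rl)
{
    pthread_mutex_lock(&rl->mutex);
    while (rl->head) {
        struct range_node *n = rl->head;

        rl->head = n->next;
        free(n);
        rl->count--;
    }
    pthread_mutex_unlock(&rl->mutex);
}
```

    The trade-off is the one the changelog names: a mutex costs more than
    a spinlock, which is acceptable only because these paths are not
    performance critical.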

    [kosaki.motohiro@jp.fujitsu.com: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug report with slab debugging enabled:

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b

    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    The problem is that the structure is being prematurely freed due to a
    reference count imbalance. In the following case mbind(addr, len) should
    replace the memory policies of both vma1 and vma2, so they will come to
    share the same mempolicy, and the new mempolicy will have the
    MPOL_F_SHARED flag.

    +-------------------+-------------------+
    |       vma1        |    vma2(shmem)    |
    +-------------------+-------------------+
    |                                       |
    addr                                addr+len

    alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair to
    maintain the mempolicy reference count. The current rule is that
    get_vma_policy() only increments the refcount for a shmem VMA and
    mpol_cond_put() only decrements the refcount if the policy has
    MPOL_F_SHARED.

    In the above case, vma1 is not a shmem vma and vma->policy has
    MPOL_F_SHARED! The reference count will be decreased, even though it was
    not increased, whenever alloc_page_vma() is called. This has been broken
    since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting]
    in 2008.
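
    The imbalance can be reproduced in a small userspace model. This is a
    sketch of the rule as described above, not the kernel functions:
    F_SHARED, struct policy, struct vma, get_policy(), cond_put() and
    simulate_alloc() are hypothetical stand-ins, and nothing is actually
    freed, so the model only shows the counter drifting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define F_SHARED 0x1u           /* stand-in for MPOL_F_SHARED */

struct policy {
    int refcnt;
    unsigned int flags;
};

struct vma {
    bool is_shmem;
    struct policy *policy;
};

/* Old rule, lookup side: only a shmem VMA takes a reference. */
struct policy *get_policy(struct vma *vma)
{
    assert(vma != NULL && vma->policy != NULL);
    if (vma->is_shmem)
        vma->policy->refcnt++;
    return vma->policy;
}

/* Old rule, release side: drop a reference if the policy is flagged shared. */
void cond_put(struct policy *pol)
{
    if (pol->flags & F_SHARED)
        pol->refcnt--;
}

/*
 * One allocation: look up the policy, allocate (elided), then release.
 * Returns the reference count left behind.
 */
int simulate_alloc(struct vma *vma)
{
    struct policy *pol = get_policy(vma);

    cond_put(pol);
    return pol->refcnt;
}
```

    In this model, a shmem VMA leaves the count unchanged after each
    allocation, while a non-shmem VMA pointing at the same F_SHARED policy
    loses one reference per call. Starting from a count of 2, two
    allocations through the non-shmem VMA bring it to 0 while both VMAs
    still point at the policy.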

    There is another serious bug with the sharing of memory policies.
    Currently, the mempolicy rebind logic (it is called from cpuset
    rebinding) ignores the refcount of a mempolicy and overrides it
    forcibly. Thus, any mempolicy sharing may cause mempolicy corruption.
    The bug was
    introduced by commit [68860ec1: cpusets: automatic numa mempolicy
    rebinding].

    Ideally, the shared policy handling would be rewritten to either
    properly handle COW of the policy structures or at least reference count
    MPOL_F_SHARED based exclusively on information within the policy.
    However, this patch takes the easier approach of disabling any policy
    sharing between VMAs. Each new range allocated with sp_alloc will
    allocate a new policy, set the reference count to 1 and drop the
    reference count of the old policy. This increases the memory footprint
    but is not expected to be a major problem as mbind() is unlikely to be
    used for fine-grained ranges. It is also inefficient because it means
    we allocate a new policy even in cases where mbind_range() could use the
    new_policy passed to it. However, it is more straight-forward and the
    change should be invisible to the user.
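
    The copy-per-range approach can be sketched in userspace C. The names
    below (struct policy, struct sp_range, policy_dup(), range_alloc(),
    range_free()) are hypothetical stand-ins rather than the kernel's
    definitions, and dropping the reference on the old policy is left to
    the caller in this sketch.

```c
#include <assert.h>
#include <stdlib.h>

struct policy {
    int refcnt;
    int mode;                   /* stand-in for the policy contents */
};

struct sp_range {
    unsigned long start, end;
    struct policy *policy;      /* private to this range */
};

/* Make a private copy so that no two ranges share one policy object. */
struct policy *policy_dup(const struct policy *old)
{
    struct policy *copy;

    assert(old != NULL);
    copy = malloc(sizeof(*copy));
    if (!copy)
        return NULL;
    *copy = *old;
    copy->refcnt = 1;           /* the new range is the only holder */
    return copy;
}

/* Every range gets its own copy of pol; pol itself is not modified. */
struct sp_range *range_alloc(unsigned long start, unsigned long end,
                             const struct policy *pol)
{
    struct sp_range *r = malloc(sizeof(*r));

    if (!r)
        return NULL;
    r->policy = policy_dup(pol);
    if (!r->policy) {
        free(r);
        return NULL;
    }
    r->start = start;
    r->end = end;
    return r;
}

void range_free(struct sp_range *r)
{
    if (!r)
        return;
    free(r->policy);            /* sole holder, so no refcount check needed */
    free(r);
}
```

    The cost is one extra policy object per range, which matches the
    memory footprint trade-off described above.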

    [mgorman@suse.de: Edited changelog]
    Reported-by: Dave Jones
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle
    vma->vm_policy linkages") removed the code that updates vma->vm_policy,
    but that update is the purpose of mbind_range(). Now, mbind_range() is
    virtually a no-op, and while the commit does prevent memory corruption
    it is not the right fix. This patch is a revert.

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro