04 Aug, 2013

1 commit

  • commit 3964acd0dbec123aa0a621973a2a0580034b4788 upstream.

    vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this
    is doubly wrong:

    1. This leaks vma->vm_policy if it is not NULL and not equal to
    next->vm_policy.

    This can happen if vma_merge() expands "area", not prev (case 8).

    2. This sets the wrong policy if vma_merge() joins prev and area:
    area is the vma the caller needs to update, and it still has the
    old policy.

    Revert commit 1444f92c8498 ("mm: merging memory blocks resets
    mempolicy") which introduced these problems.

    Change mbind_range() to recheck mpol_equal() after vma_merge() to fix
    the problem that commit tried to address.
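
    To make the two problems concrete, here is a minimal userspace model
    (illustrative only; the struct and function names are invented and do
    not match the kernel's): blindly overwriting a refcounted policy
    pointer leaks the old reference, while the fixed path drops the old
    reference and skips the update when the policies are already
    equivalent, which is what the mpol_equal() recheck in mbind_range()
    achieves.

        /* Toy model of the leak and of the recheck-then-replace fix. */
        #include <stdio.h>
        #include <stdlib.h>

        struct policy { int refcnt; int mode; };

        static struct policy *pol_alloc(int mode)
        {
            struct policy *p = malloc(sizeof(*p));
            p->refcnt = 1;
            p->mode = mode;
            return p;
        }

        static struct policy *pol_get(struct policy *p)
        {
            p->refcnt++;
            return p;
        }

        static void pol_put(struct policy *p)
        {
            if (p && --p->refcnt == 0)
                free(p);
        }

        struct area { struct policy *pol; };

        /* Buggy: overwrites area->pol and leaks the old reference
         * (the vma_set_policy(vma, vma_policy(next)) pattern). */
        static void set_policy_buggy(struct area *a, struct policy *newpol)
        {
            a->pol = pol_get(newpol);
        }

        /* Fixed: skip the update when the policies are already equivalent
         * (the mpol_equal() recheck) and drop the old reference otherwise. */
        static void set_policy_fixed(struct area *a, struct policy *newpol)
        {
            if (a->pol && a->pol->mode == newpol->mode)
                return;
            pol_put(a->pol);
            a->pol = pol_get(newpol);
        }

        int main(void)
        {
            struct policy *oldpol = pol_alloc(1), *newpol = pol_alloc(2);
            struct area a = { .pol = oldpol };

            set_policy_buggy(&a, newpol);   /* oldpol is now leaked */
            set_policy_fixed(&a, newpol);   /* no-op: modes already match */
            printf("newpol refcnt = %d\n", newpol->refcnt);
            pol_put(a.pol);
            pol_put(newpol);
            return 0;
        }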

    Signed-off-by: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Cc: Steven T Hampson
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

09 Mar, 2013

2 commits

  • Currently, n_new is wrongly initialized: the start and end parameters
    are inverted. Let's fix it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Hillf Danton
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • n->end is accessed in sp_insert(). Thus it should be updated
    before calling sp_insert(). This mistake may cause a kernel panic.
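
    As a toy illustration of why the ordering matters (names are made up
    and do not match the kernel's sp_node/sp_insert): if the insert routine
    reads the node's end field to place it, that field must be set before
    the call.

        #include <stdio.h>

        struct range_node { unsigned long start, end; struct range_node *next; };

        /* Keeps the list sorted by 'end'; it reads n->end, so n must be
         * fully initialised before this is called. */
        static void insert_sorted(struct range_node **head, struct range_node *n)
        {
            while (*head && (*head)->end < n->end)
                head = &(*head)->next;
            n->next = *head;
            *head = n;
        }

        int main(void)
        {
            struct range_node a = { .start = 0, .end = 10 };
            struct range_node b = { .start = 10 };      /* end not set yet */
            struct range_node *head = NULL;

            insert_sorted(&head, &a);
            b.end = 20;                 /* must happen before the insert... */
            insert_sorted(&head, &b);   /* ...or b would sort before a */

            for (struct range_node *p = head; p; p = p->next)
                printf("[%lu, %lu)\n", p->start, p->end);
            return 0;
        }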

    Signed-off-by: Hillf Danton
    Signed-off-by: KOSAKI Motohiro
    Cc: Sasha Levin
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Dave Jones
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Feb, 2013

5 commits

  • Make a sweep through mm/ and convert code that uses -1 directly to using
    the more appropriate NUMA_NO_NODE.

    Signed-off-by: David Rientjes
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now that all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration of KSM pages is now safe: remove the PageKsm restrictions from
    mempolicy.c and migrate.c.

    But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
    are irrelevant to KSM: it looks as if that code was preventing hotremove
    migration of KSM pages, unless they happened to be in swapcache.

    There is some question as to whether enforcing a NUMA mempolicy migration
    ought to migrate KSM pages, mapped into entirely unrelated processes; but
    moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
    and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
    any area where this is a worry.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
    based on a nodemask as well as a gfp_mask"), but it does not match its
    comments, because it does not check zones above policy_zone.

    Also, commit b377fd3982ad ("Apply memory policies to top two highest
    zones when highest zone is ZONE_MOVABLE") tells us that if the highest
    zone is ZONE_MOVABLE, we should also apply memory policies to it, so
    ZONE_MOVABLE should be a valid zone for policies.
    is_valid_nodemask() needs to be changed to match.

    Fix: check all zones, even those with zoneid > policy_zone. Use
    nodes_intersects() instead of open-coding the check.
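
    A kernel-context sketch of what the fixed check boils down to (this is
    an assumption about its final shape, not a quote of the patch;
    nodes_intersects() and node_states[] are existing kernel symbols):

        static int is_valid_nodemask(const nodemask_t *nodemask)
        {
            /* Valid as long as the requested nodes intersect the set of
             * nodes that actually have memory, whatever their zones. */
            return nodes_intersects(*nodemask, node_states[N_MEMORY]);
        }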

    Reported-by: Wen Congyang
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Tang Chen
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Jiang Liu
    Cc: Jianguo Wu
    Cc: Kamezawa Hiroyuki
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

03 Jan, 2013

3 commits

  • Sasha was fuzzing with trinity and reported the following problem:

    BUG: sleeping function called from invalid context at kernel/mutex.c:269
    in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main
    2 locks held by trinity-main/6361:
    #0: (&mm->mmap_sem){++++++}, at: [] __do_page_fault+0x1e4/0x4f0
    #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [] handle_pte_fault+0x3f7/0x6a0
    Pid: 6361, comm: trinity-main Tainted: G W
    3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74
    Call Trace:
    __might_sleep+0x1c3/0x1e0
    mutex_lock_nested+0x29/0x50
    mpol_shared_policy_lookup+0x2e/0x90
    shmem_get_policy+0x2e/0x30
    get_vma_policy+0x5a/0xa0
    mpol_misplaced+0x41/0x1d0
    handle_pte_fault+0x465/0x6a0

    This was triggered by a different version of automatic NUMA balancing
    but in theory the current version is vulnerable to the same problem.

    do_numa_page
    -> numa_migrate_prep
    -> mpol_misplaced
    -> get_vma_policy
    -> shmem_get_policy

    It's very unlikely this will happen as shared pages are not marked
    pte_numa -- see the page_mapcount() check in change_pte_range() -- but
    it is possible.

    To address this, this patch restores sp->lock as originally implemented
    by Kosaki Motohiro. In the path where get_vma_policy() is called, it
    should not be calling sp_alloc() so it is not necessary to treat the PTL
    specially.

    Signed-off-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Remove the unused argument (formerly no_context) from mpol_parse_str()
    and from mpol_to_str().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA
    mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts
    or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often
    in a page table (causing "Bad swap" or "Bad page map" warning or "Bad
    pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere
    worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic.

    Recent NUMA enhancements are not to blame: this dates back to 2.6.35,
    when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when
    no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(),
    which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags.
    With slab poisoning, you can then rely on mpol_to_str() to set the bit
    for node 0x6b6b, probably in the next page above the caller's stack.

    mpol_parse_str() is only called from shmem_parse_options(): no_context
    is always true, so call it unused for now, and remove !no_context code.
    Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might
    expect. Then mpol_to_str() can ignore its no_context argument also,
    the mpol being appropriately initialized whether contextualized or not.
    Rename its no_context unused too, and let subsequent patch remove them
    (that's not needed for stable backporting, which would involve rejects).

    I don't understand why MPOL_LOCAL is described as a pseudo-policy:
    it's a reasonable policy which suffers from a confusing implementation
    in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be
    much more robust if MPOL_LOCAL were recognized in switch statements
    throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly
    empty) nodes mask like everyone else, instead of its preferred_node
    variant (I presume an optimization from the days before MPOL_LOCAL).
    But that would take me too long to get right and fully tested.

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

2 commits

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.
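
    A kernel-context sketch of the kind of substitution involved (the loop
    body and helper are hypothetical; for_each_node_state() and the node
    state names are real):

        int nid;

        /* Before: only nodes with normal or high memory were visited. */
        for_each_node_state(nid, N_HIGH_MEMORY)
            update_node(nid);       /* hypothetical per-node work */

        /* After: visit every node that has any memory at all. */
        for_each_node_state(nid, N_MEMORY)
            update_node(nid);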

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • Pass vma instead of mm and add address parameter.

    In most cases we already have the vma on the stack. We provide
    split_huge_page_pmd_mm() for the few cases when we have the mm but not
    the vma.

    This change is preparation to huge zero pmd splitting implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Dec, 2012

1 commit


11 Dec, 2012

11 commits

  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • mm: numa: Use a two-stage filter to restrict pages being migrated
    for unlikely task<->node relationships

    Note: This two-stage filter was taken directly from the sched/numa patch
    "sched, numa, mm: Add the scanning page fault machinery" but is
    only a partial extraction. As the end result is not necessarily
    recognisable, the signed-offs-by had to be removed. Will be added
    back if requested.

    While it is desirable that all threads in a process run on its home
    node, this is not always possible or necessary. There may be more
    threads than exist within the node or the node might be over-subscribed
    with unrelated processes.

    This can cause a situation whereby a page gets migrated off its home
    node because the threads clearing pte_numa were running off-node. This
    patch uses page->last_nid to build a two-stage filter before pages get
    migrated to avoid problems with short or unlikely task<->node
    relationships.
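
    A rough userspace model of the two-stage idea (purely illustrative; the
    real logic lives in the NUMA hinting fault path and uses page->last_nid):
    remember which node faulted last and only migrate when the same node
    shows up twice in a row, so a one-off off-node access does not pull the
    page away.

        #include <stdio.h>

        struct toy_page { int nid; int last_nid; };

        /* Returns 1 if the page should be migrated towards this_nid. */
        static int should_migrate(struct toy_page *page, int this_nid)
        {
            int last = page->last_nid;

            page->last_nid = this_nid;  /* stage 1: record who faulted */
            if (last != this_nid)
                return 0;               /* stage 2: demand two hits in a row */
            return page->nid != this_nid;
        }

        int main(void)
        {
            struct toy_page p = { .nid = 0, .last_nid = -1 };

            printf("%d\n", should_migrate(&p, 1));  /* 0: first fault from node 1 */
            printf("%d\n", should_migrate(&p, 1));  /* 1: second consecutive fault */
            return 0;
        }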

    Signed-off-by: Mel Gorman <mgorman@suse.de>

    Mel Gorman
     
  • This is the simplest possible policy that still does something of note.
    When a pte_numa is faulted, it is moved immediately. Any replacement
    policy must at least do better than this and in all likelihood this
    policy regresses normal workloads.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u         = basic unit = sizeof(void *)
    Ca        = cost of struct page access = sizeof(struct page) / u
    Cpte      = cost of PTE access = Ca
    Cupdate   = cost of PTE update = (2 * Cpte) + (2 * Wlock)
                where Cpte is incurred twice for a read and a write and
                Wlock is a constant representing the cost of taking or
                releasing a lock
    Cnumahint = cost of a minor page fault = some high constant e.g. 1000
    Cpagerw   = cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci        = cost of page isolation = Ca + Wi
                where Wi is a constant that should reflect the approximate
                cost of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
                where Wnuma is the approximate NUMA factor. 1 is local. 1.2
                would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
                  where this is measured for the last N numa hints
                  recorded. When the workload is fully converged
                  the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.
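
    As a worked example of the costing above (made-up counter values;
    sizeof(struct page) == 64, a 4K page size and the weights are all
    assumptions chosen only to show the arithmetic):

        #include <stdio.h>

        int main(void)
        {
            const double u = 8.0;                   /* basic unit, sizeof(void *) */
            const double Ca = 64.0 / u;             /* struct page access */
            const double Cpte = Ca;                 /* PTE access */
            const double Cnumahint = 1000.0;        /* minor fault cost */
            const double Cpagerw = Ca + 4096.0 / u; /* read or write a full page */
            const double Wi = 1.0;                  /* assumed isolation lock cost */
            const double Ci = Ca + Wi;              /* page isolation */
            const double Wnuma = 1.2;               /* remote 20% more expensive */
            const double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

            /* Hypothetical vmstat deltas over a measurement interval. */
            const double numa_pte_updates = 1e6;
            const double numa_hint_faults = 2e5;
            const double numa_hint_faults_local = 1.5e5;
            const double numa_pages_migrated = 5e4;

            double cost = Cpte * numa_pte_updates +
                          Cnumahint * numa_hint_faults +
                          Ci * numa_pages_migrated +
                          Cpagecopy * numa_pages_migrated;

            printf("balancing cost (arbitrary units): %.0f\n", cost);
            printf("convergence: %.2f\n",
                   numa_hint_faults_local / numa_hint_faults);
            return 0;
        }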

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
     
  • The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to
    explicitly request lazy migration is a good idea but the actual
    API has not been well reviewed and once released we have to support it.
    For now this patch prevents an application from using the services.
    This will need to be revisited.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • This patch converts change_prot_numa() to use change_protection(). As
    pte_numa and friends check the PTE bits directly it is necessary for
    change_protection() to use pmd_mknuma(). Hence the required
    modifications to change_protection() are a little clumsy but the
    end result is that most of the numa page table helpers are just one or
    two instructions.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • NOTE: Once again there is a lot of patch stealing and the end result
    is sufficiently different that I had to drop the signed-offs.
    Will re-add if the original authors are ok with that.

    This patch adds another mbind() flag to request "lazy migration". The
    flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
    pages are marked PROT_NONE. The pages will be migrated in the fault
    path on "first touch", if the policy dictates at that time.

    "Lazy Migration" will allow testing of migrate-on-fault via mbind().
    Also allows applications to specify that only subsequently touched
    pages be migrated to obey new policy, instead of all pages in range.
    This can be useful for multi-threaded applications working on a
    large shared data area that is initialized by an initial thread
    resulting in all pages on one [or a few, if overflowed] nodes.
    After PROT_NONE, the pages in regions assigned to the worker threads
    will be automatically migrated local to the threads on 1st touch.
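
    A hedged userspace sketch of what such a request could look like through
    mbind(2) (the prototype comes from <numaif.h> in libnuma; MPOL_MF_LAZY may
    be missing from older headers, so its kernel uapi value is assumed below,
    and per the MPOL_NOOP/MPOL_MF_LAZY entry above the kernel may reject the
    flag from userspace, in which case EINVAL is the expected outcome):

        #include <errno.h>
        #include <numaif.h>         /* mbind(), MPOL_*; link with -lnuma */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        #ifndef MPOL_MF_LAZY
        #define MPOL_MF_LAZY (1 << 3)   /* assumed value from linux/mempolicy.h */
        #endif

        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 64 * (size_t)page;
            unsigned long nodemask = 1UL << 0;          /* node 0 only */
            void *buf = aligned_alloc((size_t)page, len);

            if (!buf)
                return 1;
            memset(buf, 0, len);                        /* fault the pages in */

            /* Ask that pages off node 0 be marked for migration on next touch. */
            if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                      MPOL_MF_MOVE | MPOL_MF_LAZY) != 0)
                fprintf(stderr, "mbind: %s\n", strerror(errno));

            free(buf);
            return 0;
        }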

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Lee Schermerhorn
     
  • This patch provides a new function to test whether a page resides
    on a node that is appropriate for the mempolicy for the vma and
    address where the page is supposed to be mapped. This involves
    looking up the node where the page belongs. So, the function
    returns that node so that it may be used to allocate the page
    without consulting the policy again.

    A subsequent patch will call this function from the fault path.
    Because of this, I don't want to go ahead and allocate the page, e.g.,
    via alloc_page_vma() only to have to free it if it has the correct
    policy. So, I just mimic the alloc_page_vma() node computation
    logic--sort of.

    Note: we could use this function to implement a MPOL_MF_STRICT
    behavior when migrating pages to match mbind() mempolicy--e.g.,
    to ensure that pages in an interleaved range are reinterleaved
    rather than left where they are when they reside on any page in
    the interleave nodemask.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Linus Torvalds
    [ Added MPOL_F_LAZY to trigger migrate-on-fault;
    simplified code now that we don't have to bother
    with special crap for interleaved ]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Lee Schermerhorn
     
  • This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
    to mbind(). When the NOOP policy is used with the MOVE and LAZY
    flags, mbind() will map the pages PROT_NONE so that they will be
    migrated on the next touch.

    This allows an application to prepare for a new phase of operation
    where different regions of shared storage will be assigned to
    worker threads, w/o changing policy. Note that we could just use
    "default" policy in this case. However, this also allows an
    application to request that pages be migrated, only if necessary,
    to follow any arbitrary policy that might currently apply to a
    range of pages, without knowing the policy, or without specifying
    multiple mbind()s for ranges with different policies.

    [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

    Bug-Reported-by: Fengguang Wu
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Lee Schermerhorn
     
  • Make MPOL_LOCAL a real and exposed policy such that applications that
    relied on the previous default behaviour can explicitly request it.
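
    For example, a task could now request local allocation explicitly via
    set_mempolicy(2) (a sketch; MPOL_LOCAL may be absent from older
    <numaif.h> headers, so its kernel uapi value is assumed here):

        #include <errno.h>
        #include <numaif.h>         /* set_mempolicy(); link with -lnuma */
        #include <stdio.h>
        #include <string.h>

        #ifndef MPOL_LOCAL
        #define MPOL_LOCAL 4        /* assumed value from linux/mempolicy.h */
        #endif

        int main(void)
        {
            /* MPOL_LOCAL takes no nodemask: allocate on the node of the CPU
             * that touches the memory. */
            if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0) {
                fprintf(stderr, "set_mempolicy: %s\n", strerror(errno));
                return 1;
            }
            puts("task policy set to MPOL_LOCAL");
            return 0;
        }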

    Requested-by: Christoph Lameter
    Reviewed-by: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Peter Zijlstra
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

07 Dec, 2012

1 commit

  • This fixes a regression in 3.7-rc, which has since gone into stable.

    Commit 00442ad04a5e ("mempolicy: fix a memory corruption by refcount
    imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the
    refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went
    on expecting alloc_page_vma() to drop the refcount it had acquired.
    This deserves a rework: but for now fix the leak in shmem_alloc_page().

    Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use
    the same refcounting there as in shmem_alloc_page(), delete its onstack
    mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() -
    those were invented to let swapin_readahead() make an unknown number of
    calls to alloc_pages_vma() with one mempolicy; but since 00442ad04a5e,
    alloc_pages_vma() has kept refcount in balance, so now no problem.

    Reported-and-tested-by: Tommi Rantala
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Oct, 2012

1 commit

  • When reading /proc/pid/numa_maps, it's possible to return the contents of
    the stack where the mempolicy string should be printed if the policy gets
    freed from beneath us.

    This happens because mpol_to_str() may return an error, and the
    stack-allocated buffer is then printed without ever being stored.

    There are two possible error conditions in mpol_to_str():

    - if the buffer allocated is insufficient for the string to be stored,
    and

    - if the mempolicy has an invalid mode.

    The first error condition is not triggered in any of the callers to
    mpol_to_str(): at least 50 bytes is always allocated on the stack and this
    is sufficient for the string to be written. A future patch should convert
    this into BUILD_BUG_ON() since we know the maximum strlen possible, but
    that's not -rc material.

    The second error condition is possible if a race occurs in dropping a
    reference to a task's mempolicy causing it to be freed during the read().
    The slab poison value is then used for the mode and mpol_to_str() returns
    -EINVAL.

    This race is only possible because get_vma_policy() believes that
    mm->mmap_sem protects task->mempolicy, which isn't true. The exit path
    does not hold mm->mmap_sem when dropping the reference or setting
    task->mempolicy to NULL: it uses task_lock(task) instead.

    Thus, it's required for the caller of a task mempolicy to hold
    task_lock(task) while grabbing the mempolicy and reading it. Callers with
    a vma policy store their mempolicy earlier and can simply increment the
    reference count so it's guaranteed not to be freed.
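
    A kernel-context sketch of the required pattern for the task-policy case
    (an illustration of the rule stated above, not the literal patch):

        struct mempolicy *pol;

        task_lock(task);            /* pins task->mempolicy against exit */
        pol = task->mempolicy;
        if (pol)
            mpol_get(pol);          /* take our own reference while pinned */
        task_unlock(task);

        /* ... read/print the policy ... */

        if (pol)
            mpol_put(pol);          /* drop the reference when done */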

    Reported-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

6 commits

  • Revert commit 0def08e3acc2 because check_range can't fail in
    migrate_to_node considering current use cases.

    Quote from Johannes

    : I think it makes sense to revert. Not because of the semantics, but I
    : just don't see how check_range() could even fail for this callsite:
    :
    : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
    : find_vma()
    :
    : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
    : and so can not fail
    :
    : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
    : continue until addr == end, so we never fail with -EIO

    And I added a new VM_BUG_ON for checking migrate_to_node's future use
    case which might pass MPOL_MF_STRICT.

    Suggested-by: Johannes Weiner
    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vasiliy Kulikov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier
    related damage v3") introduced a potential memory corruption.
    shmem_alloc_page() uses a pseudo vma and it has one significant unique
    combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.

    get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
    and mpol_cond_put() DOES decrease a policy ref when a policy has
    MPOL_F_SHARED. Therefore, when a cpuset update race occurs,
    alloc_pages_vma() falls into the 'goto retry_cpuset' path, decrements the
    reference count and frees the policy prematurely.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When shared_policy_replace() fails to allocate, new->policy is not freed
    correctly by mpol_set_shared_policy(). The problem is that the shared
    mempolicy code directly calls kmem_cache_free() in multiple places where
    it is easy to make a mistake.

    This patch creates an sp_free wrapper function and uses it. The bug was
    introduced in the pre-git age (IOW, before 2.6.12-rc2).
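
    A kernel-context sketch of what such a wrapper looks like (the field and
    cache names are assumptions; the point is that the refcount drop and the
    slab free are paired in one place):

        static void sp_free(struct sp_node *n)
        {
            mpol_put(n->policy);            /* drop the policy reference */
            kmem_cache_free(sn_cache, n);   /* then free the node itself */
        }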

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shared_policy_replace()'s use of sp_alloc() is unsafe: 1) sp_node cannot
    be dereferenced if sp->lock is not held and 2) another thread can modify
    sp_node between spin_unlock for allocating a new sp node and next
    spin_lock. The bug was introduced before 2.6.12-rc2.

    Kosaki's original patch for this problem was to allocate an sp node and
    policy within shared_policy_replace and initialise it when the lock is
    reacquired. I was not keen on this approach because it partially
    duplicates sp_alloc(). As the paths where sp->lock is taken are not that
    performance critical this patch converts sp->lock to sp->mutex so it can
    sleep when calling sp_alloc().

    [kosaki.motohiro@jp.fujitsu.com: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Dave Jones' system call fuzz testing tool "trinity" triggered the
    following error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b

    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    The problem is that the structure is being prematurely freed due to a
    reference count imbalance. In the following case mbind(addr, len) should
    replace the memory policies of both vma1 and vma2 and thus they will
    come to share the same mempolicy, and the new mempolicy will have the
    MPOL_F_SHARED flag.

    +-------------------+-------------------+
    |       vma1        |    vma2(shmem)    |
    +-------------------+-------------------+
    |                                       |
  addr                                  addr+len

    alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair for
    maintaining the mempolicy reference count. The current rule is that
    get_vma_policy() only increments the refcount for a shmem VMA and
    mpol_cond_put() only decrements the refcount if the policy has
    MPOL_F_SHARED.

    In the above case, vma1 is not a shmem vma yet vma->policy has
    MPOL_F_SHARED! The reference count will be decreased even though it was
    not increased whenever alloc_page_vma() is called. This has been broken
    since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting]
    in 2008.
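
    A toy model of the asymmetry (illustrative only, not kernel code): the
    get side keys off the vma type while the put side keys off the policy
    flag, so a shared policy reachable from a non-shmem vma is put without
    ever having been got.

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_policy { int refcnt; bool shared; };
        struct toy_vma { struct toy_policy *pol; bool is_shmem; };

        static void get_policy(struct toy_vma *v)
        {
            if (v->is_shmem)            /* get keyed on the vma type... */
                v->pol->refcnt++;
        }

        static void put_policy(struct toy_vma *v)
        {
            if (v->pol->shared)         /* ...but put keyed on the policy flag */
                v->pol->refcnt--;
        }

        int main(void)
        {
            struct toy_policy shared_pol = { .refcnt = 1, .shared = true };
            struct toy_vma vma1 = { .pol = &shared_pol, .is_shmem = false };

            get_policy(&vma1);          /* no-op: vma1 is not shmem */
            put_policy(&vma1);          /* drops the count anyway */
            printf("refcnt = %d\n", shared_pol.refcnt);     /* 0: freed too early */
            return 0;
        }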

    There is another serious bug with the sharing of memory policies.
    Currently, the mempolicy rebind logic (called from cpuset rebinding)
    ignores the refcount of the mempolicy and overrides it forcibly. Thus, any
    mempolicy sharing may cause mempolicy corruption. The bug was
    introduced by commit [68860ec1: cpusets: automatic numa mempolicy
    rebinding].

    Ideally, the shared policy handling would be rewritten to either
    properly handle COW of the policy structures or at least reference count
    MPOL_F_SHARED based exclusively on information within the policy.
    However, this patch takes the easier approach of disabling any policy
    sharing between VMAs. Each new range allocated with sp_alloc will
    allocate a new policy, set the reference count to 1 and drop the
    reference count of the old policy. This increases the memory footprint
    but is not expected to be a major problem as mbind() is unlikely to be
    used for fine-grained ranges. It is also inefficient because it means
    we allocate a new policy even in cases where mbind_range() could use the
    new_policy passed to it. However, it is more straight-forward and the
    change should be invisible to the user.

    [mgorman@suse.de: Edited changelog]
    Reported-by: Dave Jones
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle
    vma->vm_policy linkages") removed the vma->vm_policy update code, but that
    is the purpose of mbind_range(). Now mbind_range() is virtually a no-op,
    and while it does not allow memory corruption, it is not the right fix.
    This patch is a revert.

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

07 Sep, 2012

1 commit

  • Trivially triggerable, found by trinity:

    kernel BUG at mm/mempolicy.c:2546!
    Process trinity-child2 (pid: 23988, threadinfo ffff88010197e000, task ffff88007821a670)
    Call Trace:
    show_numa_map+0xd5/0x450
    show_pid_numa_map+0x13/0x20
    traverse+0xf2/0x230
    seq_read+0x34b/0x3e0
    vfs_read+0xac/0x180
    sys_pread64+0xa2/0xc0
    system_call_fastpath+0x1a/0x1f
    RIP: mpol_to_str+0x156/0x360

    Cc: stable@vger.kernel.org
    Signed-off-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Dave Jones
     

31 Jul, 2012

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Most of the changes included are from Christoph Lameter's "common
    slab" patch series that unifies common parts of SLUB, SLAB, and SLOB
    allocators. The unification is needed for Glauber Costa's "kmem
    memcg" work that will hopefully appear for v3.7.

    The rest of the changes are fixes and speedups by various people."

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (32 commits)
    mm: Fix build warning in kmem_cache_create()
    slob: Fix early boot kernel crash
    mm, slub: ensure irqs are enabled for kmemcheck
    mm, sl[aou]b: Move kmem_cache_create mutex handling to common code
    mm, sl[aou]b: Use a common mutex definition
    mm, sl[aou]b: Common definition for boot state of the slab allocators
    mm, sl[aou]b: Extract common code for kmem_cache_create()
    slub: remove invalid reference to list iterator variable
    mm: Fix signal SIGFPE in slabinfo.c.
    slab: move FULL state transition to an initcall
    slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro"
    mm, slab: Build fix for recent kmem_cache changes
    slab: rename gfpflags to allocflags
    slub: refactoring unfreeze_partials()
    slub: use __cmpxchg_double_slab() at interrupt disabled place
    slab/mempolicy: always use local policy from interrupt context
    slab: Get rid of obj_size macro
    mm, sl[aou]b: Extract common fields from struct kmem_cache
    slab: Remove some accessors
    slab: Use page struct fields instead of casting
    ...

    Linus Torvalds
     

21 Jun, 2012

1 commit

  • If the range passed to mbind() is not allocated on nodes set in the
    nodemask, it migrates the pages to respect the constraint.

    The final formal argument of migrate_pages() is a mode of type enum migrate_mode,
    not a boolean. do_mbind() is currently passing "true" which is the
    equivalent of MIGRATE_SYNC_LIGHT. This should instead be MIGRATE_SYNC
    for synchronous page migration.

    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Jun, 2012

1 commit

  • slab_node() could access current->mempolicy from interrupt context.
    However there's a race condition during exit where the mempolicy
    is first freed and then the pointer zeroed.

    Using this from interrupts seems bogus anyway. The interrupt
    will interrupt a random process and therefore get a random
    mempolicy. Many times this will be the idle task's, which no one can
    change.

    Just disable this here and always use local for slab
    from interrupts. I also cleaned up the callers of slab_node a bit
    which always passed the same argument.

    I believe the original mempolicy code did that in fact,
    so it's likely a regression.

    v2: send version with correct logic
    v3: simplify. fix typo.
    Reported-by: Arun Sharma
    Cc: penberg@kernel.org
    Cc: cl@linux.com
    Signed-off-by: Andi Kleen
    [tdmackey@twitter.com: Rework control flow based on feedback from
    cl@linux.com, fix logic, and cleanup current task_struct reference]
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: KOSAKI Motohiro
    Signed-off-by: David Mackey
    Signed-off-by: Pekka Enberg

    Andi Kleen
     

30 May, 2012

2 commits

  • s/from_nodes/from/ and s/to_nodes/to/. The "_nodes" is redundant - it
    duplicates the argument's type.

    Done in a fit of irritation over 80-col issues :(

    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • While running an application that moves tasks from one cpuset to another
    I noticed that it takes much longer and moves many more pages than
    expected.

    The reason for this is do_migrate_pages() does its best to preserve the
    relative node differential from the first node of the cpuset because the
    application may have been written with that in mind. If memory was
    interleaved on the nodes of the source cpuset by an application
    do_migrate_pages() will try its best to maintain that interleaving on
    the nodes of the destination cpuset. This means copying the memory from
    all source nodes to the destination nodes even if the source and
    destination nodes overlap.

    This is a problem for userspace NUMA placement tools. The amount of
    time spent doing extra memory moves cancels out some of the NUMA
    performance improvements. Furthermore, if the number of source and
    destination nodes differ, there is no way to maintain the previous
    interleaving layout anyway.

    This patch changes do_migrate_pages() to only preserve the relative
    layout inside the program if the number of NUMA nodes in the source and
    destination mask are the same. If the number is different, we do a much
    more efficient migration by not touching memory that is in an allowed
    node.

    This preserves the old behaviour for programs that want it, while
    allowing a userspace NUMA placement tool to use the new, faster
    migration. This improves performance in our tests by up to a factor of
    7.

    Without this change migrating tasks from a cpuset containing nodes 0-7
    to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if
    they are in the both the source and destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 4 to 3
    Migrating 1 to 4
    Migrating 3 to 4
    Migrating 0 to 3
    Migrating 2 to 3

    With this change we only migrate from nodes that are not in the
    destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 2 to 3
    Migrating 1 to 4
    Migrating 0 to 3

    Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
    containing 3,4,5 we still do move everything so that we preserve the
    desired NUMA offsets:

    Migrating 4 to 5
    Migrating 3 to 4
    Migrating 2 to 3

    As far as performance is concerned this simple patch improves the time
    it takes to move 14, 20 and 26 large tasks from a cpuset containing
    nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
    Here are the timings with and without the patch:

    BEFORE PATCH -- Move times: 59, 140, 651 seconds
    ============

    Moving 14 tasks from nodes (0-7) to nodes (1,3)
    numad(8780) do_migrate_pages (mm=0xffff88081d414400
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 14 tasks...)
    PID 8890 moved to node(s) 1,3 in 59.2 seconds

    Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
    numad(8780) do_migrate_pages (mm=0xffff88081d88c700
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 20 tasks...)
    PID 8962 moved to node(s) 1,4-5 in 139.88 seconds

    Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
    numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
    from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
    numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 26 tasks...)
    PID 9058 moved to node(s) 1-3,5 in 651.45 seconds

    AFTER PATCH -- Move times: 42, 56, 93 seconds
    ===========

    Moving 14 tasks from nodes (0-7) to nodes (5,7)
    numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
    (Above moves repeated for each of the 14 tasks...)
    PID 33221 moved to node(s) 5,7 in 41.67 seconds

    Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
    numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 20 tasks...)
    PID 33289 moved to node(s) 1,3,5 in 56.3 seconds

    Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
    numad(33209) do_migrate_pages (mm=0xffff88101d924400
    from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
    numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
    (Above moves repeated for each of the 26 tasks...)
    PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds

    [akpm@linux-foundation.org: clean up comment layout]
    Signed-off-by: Larry Woodman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman