24 Feb, 2013

6 commits

  • No functional change, but the only purpose of the offlining argument to
    migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now that all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration of KSM pages is now safe: remove the PageKsm restrictions from
    mempolicy.c and migrate.c.

    But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
    are irrelevant to KSM: it looks as if that code was preventing hotremove
    migration of KSM pages, unless they happened to be in swapcache.

    There is some question as to whether enforcing a NUMA mempolicy migration
    ought to migrate KSM pages, mapped into entirely unrelated processes; but
    moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
    and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
    any area where this is a worry.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM page migration is already supported in the case of memory hotremove,
    which takes the ksm_thread_mutex across all its migrations to keep life
    simple.

    But the new KSM NUMA merge_across_nodes knob introduces a problem, when
    it's set to non-default 0: if a KSM page is migrated to a different NUMA
    node, how do we migrate its stable node to the right tree? And what if
    that collides with an existing stable node?

    So far there's no provision for that, and this patch does not attempt to
    deal with it either. But how will I test a solution, when I don't know
    how to hotremove memory? The best answer is to enable KSM page migration
    in all cases now, and test more common cases. With THP and compaction
    added since KSM came in, page migration is now mainstream, and it's a
    shame that a KSM page can frustrate freeing a page block.

    Without worrying about merge_across_nodes 0 for now, this patch gets KSM
    page migration working reliably for default merge_across_nodes 1 (but
    leave the patch enabling it until near the end of the series).

    It's much simpler than I'd originally imagined, and does not require an
    additional tier of locking: page migration relies on the page lock, KSM
    page reclaim relies on the page lock, the page lock is enough for KSM page
    migration too.

    Almost all the care has to be in get_ksm_page(): that's the function which
    worries about when a stable node is stale and should be freed, now it also
    has to worry about the KSM page being migrated.

    The only new overhead is an additional put/get/lock/unlock_page when
    stable_tree_search() arrives at a matching node: to make sure migration
    respects the raised page count, and so does not migrate the page while
    we're busy with it here. That's probably avoidable, either by changing
    internal interfaces from using kpage to stable_node, or by moving the
    ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
    swapcache); but this works well, and I've no urge to pull it apart now.

    (Descents of the stable tree may pass through nodes whose KSM pages are
    under migration: being unlocked, the raised page count does not prevent
    that, nor need it: it's safe to memcmp against either old or new page.)

    You might worry about mremap, and whether page migration's rmap_walk to
    remove migration entries will find all the KSM locations where it inserted
    earlier: that should already be handled, by the satisfyingly heavy hammer
    of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
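
    A minimal sketch of the pin-then-lock pattern described above, assuming a
    hypothetical helper name; this is not the actual KSM code, only the
    get_page()/lock_page()/unlock_page()/put_page() calls are real kernel
    primitives:

        /*
         * Sketch: once stable_tree_search() has found a matching page, take
         * an extra reference and the page lock.  Migration serializes on the
         * page lock and insists on an exact reference count, so both steps
         * keep the page from being migrated while we are busy with it.
         */
        static struct page *hold_matching_page(struct page *kpage)
        {
                get_page(kpage);    /* raised count makes migration back off */
                lock_page(kpage);   /* and the lock blocks __unmap_and_move() */
                return kpage;       /* caller later does unlock_page + put_page */
        }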
     
  • The function names page_xchg_last_nid(), page_last_nid() and
    reset_page_last_nid() were judged to be inconsistent so rename them to a
    struct_field_op style pattern. As it looked jarring to have
    reset_page_mapcount() and page_nid_reset_last() beside each other in
    memmap_init_zone(), this patch also renames reset_page_mapcount() to
    page_mapcount_reset(). There are others like init_page_count() but as
    it is used throughout the arch code a rename would likely cause more
    conflicts than it is worth.

    [akpm@linux-foundation.org: fix zcache]
    Signed-off-by: Mel Gorman
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
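
    Read as a struct_field_op pattern, the renames presumably map out as below;
    page_nid_reset_last() and page_mapcount_reset() are named in the log
    itself, the others are inferred from the same pattern:

        /* old name                     ->  struct_field_op style name   */
        /* page_xchg_last_nid(page, n)  ->  page_nid_xchg_last(page, n)  */
        /* page_last_nid(page)          ->  page_nid_last(page)          */
        /* reset_page_last_nid(page)    ->  page_nid_reset_last(page)    */
        /* reset_page_mapcount(page)    ->  page_mapcount_reset(page)    */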
     
  • When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
    THP before migrating") Hugh Dickins noted that the control flow for
    transhuge migration was difficult to follow. Unconditionally calling
    put_page() in numamigrate_isolate_page() made the failure paths of both
    migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
    complex than they should be. Further, he was extremely wary of an
    unlock_page() ever happening after a put_page(), even if that
    put_page() could never be the final put_page().

    Hugh implemented the following cleanup to simplify the path by calling
    putback_lru_page() inside numamigrate_isolate_page() if it failed to
    isolate and always calling unlock_page() within
    migrate_misplaced_transhuge_page().

    There is no functional change after this patch is applied but the code
    is easier to follow and unlock_page() always happens before put_page().

    [mgorman@suse.de: changelog only]
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
    one base page is being migrated when in fact it can also be checking
    THP.

    The consequence is that a migration will be attempted when a target
    node is nearly full, only to fail later. It's unlikely to be user-visible
    but it should be fixed. While we are there, migrate_balanced_pgdat()
    should treat nr_migrate_pages as an unsigned long as it is treated as a
    watermark.

    Signed-off-by: Mel Gorman
    Suggested-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Simon Jeons
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
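
    A rough sketch of the watermark logic being corrected, with
    nr_migrate_pages as an unsigned long and THP callers passing the compound
    page's true size rather than 1 (the function name is illustrative, not the
    kernel's exact code):

        /*
         * Sketch: does this node have room for nr_migrate_pages more pages?
         * nr_migrate_pages is added to a watermark, which is an unsigned
         * long, so it should be unsigned long too.  THP-aware callers pass
         * hpage_nr_pages(page) instead of 1.
         */
        static bool node_has_room(pg_data_t *pgdat, unsigned long nr_migrate_pages)
        {
                int z;

                for (z = pgdat->nr_zones - 1; z >= 0; z--) {
                        struct zone *zone = pgdat->node_zones + z;

                        if (!populated_zone(zone))
                                continue;
                        if (zone_watermark_ok(zone, 0,
                                              high_wmark_pages(zone) + nr_migrate_pages,
                                              0, 0))
                                return true;
                }
                return false;
        }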
     

05 Feb, 2013

1 commit


12 Jan, 2013

1 commit

  • Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does
    not check page_count before migrating like base page migration and
    khugepaged. He could not see why this was safe, and he is right.

    The potential impact of the bug is avoided due to the limitations of
    NUMA balancing. The page_mapcount() check ensures that only a single
    address space is using this page and as THPs are typically private it
    should not be possible for another address space to fault it in
    parallel. If the address space has one associated task then it's
    difficult to have both a GUP pin and be referencing the page at the same
    time. If there are multiple tasks then a buggy scenario requires that
    another thread be accessing the page while the direct IO is in flight.
    This is dodgy behaviour, as there is a possibility of corruption with or
    without THP migration.

    While we happen to be safe for the most part, it is shoddy to depend on
    such "safety", so this patch checks the page count similarly to anonymous
    pages. Note that this does not mean that the page_mapcount() check can
    go away. If we were to remove the page_mapcount() check, the THP
    would have to be unmapped from all referencing PTEs, replaced with
    migration PTEs and restored properly afterwards.

    Signed-off-by: Mel Gorman
    Reported-by: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
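
    The refcount reasoning can be illustrated with a small standalone model;
    this expresses the invariant being relied on, not the kernel's literal
    test:

        #include <stdbool.h>

        /*
         * Conceptual model: an anonymous THP mapped by `page_mapcount` page
         * tables, plus one extra reference held by the migration path that
         * isolated it, should have exactly page_mapcount + 1 references.
         * Anything more (e.g. a get_user_pages() pin taken for direct IO)
         * means someone else is using the page and migration must back off.
         */
        bool migration_refs_look_safe(int page_count, int page_mapcount)
        {
                return page_count == page_mapcount + 1;
        }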
     

18 Dec, 2012

1 commit

  • This build error is currently hidden by the fact that the x86
    implementation of 'update_mmu_cache_pmd()' is a macro that doesn't use
    its last argument, but commit b32967ff101a ("mm: numa: Add THP migration
    for the NUMA working set scanning fault case") introduced a call with
    the wrong third argument.

    In the akpm tree, it causes this build error:

    mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
    mm/migrate.c:1666:2: error: incompatible type for argument 3 of 'update_mmu_cache_pmd'
    arch/x86/include/asm/pgtable.h:792:20: note: expected 'struct pmd_t *' but argument is of type 'pmd_t'

    Fix it.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
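
    As a concrete illustration of the switch, code iterating over nodes with
    memory would now use N_MEMORY rather than N_HIGH_MEMORY (a sketch; the
    function and loop body are made up):

        #include <linux/nodemask.h>
        #include <linux/printk.h>

        /* Sketch: visit every node that has any memory at all. */
        static void walk_memory_nodes(void)
        {
                int nid;

                /* previously this would have iterated over N_HIGH_MEMORY */
                for_each_node_state(nid, N_MEMORY)
                        pr_info("node %d has memory\n", nid);
        }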
     

12 Dec, 2012

4 commits

  • The patch "mm: introduce compaction and migration for virtio ballooned pages"
    hacks around putback_lru_pages() in order to allow ballooned pages to be
    re-inserted on the balloon page list as if a ballooned page were an LRU page.

    As ballooned pages are not legitimate LRU pages, this patch introduces
    putback_movable_pages() to properly cope with cases where the isolated
    pageset contains ballooned pages and LRU pages, thus fixing the mentioned
    inelegant hack around putback_lru_pages().

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
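
    From the description, putback_movable_pages() presumably mirrors
    putback_lru_pages() except that ballooned pages are handed back to the
    balloon; a sketch, with the balloon helper names taken as approximate
    rather than verbatim:

        /*
         * Sketch: return each isolated page to where it came from -- the
         * balloon's page list for ballooned pages, the LRU for the rest.
         */
        static void putback_movable_pages_sketch(struct list_head *pages)
        {
                struct page *page, *next;

                list_for_each_entry_safe(page, next, pages, lru) {
                        list_del(&page->lru);
                        if (balloon_page_movable(page))
                                balloon_page_putback(page);
                        else
                                putback_lru_page(page);
                }
        }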
     
  • Memory fragmentation introduced by ballooning might significantly reduce
    the number of 2MB contiguous memory blocks that can be used within a guest,
    thus imposing performance penalties associated with the reduced number of
    transparent huge pages that could be used by the guest workload.

    This patch introduces the helper functions as well as the necessary changes
    to teach compaction and migration bits how to cope with pages which are
    part of a guest memory balloon, in order to make them movable by memory
    compaction procedures.

    Signed-off-by: Rafael Aquini
    Acked-by: Mel Gorman
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     
  • Memory fragmentation introduced by ballooning might significantly reduce
    the number of 2MB contiguous memory blocks that can be used within a
    guest, thus imposing performance penalties associated with the reduced
    number of transparent huge pages that could be used by the guest workload.

    This patch-set follows the main idea discussed at 2012 LSFMMS session:
    "Ballooning for transparent huge pages" -- http://lwn.net/Articles/490114/
    to introduce the required changes to the virtio_balloon driver, as well as
    the changes to the core compaction & migration bits, in order to make
    those subsystems aware of ballooned pages and allow memory balloon pages
    to become movable within a guest, thus avoiding the aforementioned
    fragmentation issue.

    Following are numbers that prove this patch benefits on allowing
    compaction to be more effective at memory ballooned guests.

    Results for the STRESS-HIGHALLOC benchmark, from Mel Gorman's mmtests suite,
    running on a 4GB RAM KVM guest which was ballooning 512MB RAM in 64MB
    chunks every minute (inflating/deflating) while the test was running:

    ===BEGIN stress-highalloc

    STRESS-HIGHALLOC
                         highalloc-3.7      highalloc-3.7
                         rc4-clean          rc4-patch
    Pass 1               55.00 ( 0.00%)     62.00 ( 7.00%)
    Pass 2               54.00 ( 0.00%)     62.00 ( 8.00%)
    while Rested         75.00 ( 0.00%)     80.00 ( 5.00%)

    MMTests Statistics: duration
                         3.7                3.7
                         rc4-clean          rc4-patch
    User                 1207.59            1207.46
    System               1300.55            1299.61
    Elapsed              2273.72            2157.06

    MMTests Statistics: vmstat
                         3.7                3.7
                         rc4-clean          rc4-patch
    Page Ins 3581516 2374368
    Page Outs 11148692 10410332
    Swap Ins 80 47
    Swap Outs 3641 476
    Direct pages scanned 37978 33826
    Kswapd pages scanned 1828245 1342869
    Kswapd pages reclaimed 1710236 1304099
    Direct pages reclaimed 32207 31005
    Kswapd efficiency 93% 97%
    Kswapd velocity 804.077 622.546
    Direct efficiency 84% 91%
    Direct velocity 16.703 15.682
    Percentage direct scans 2% 2%
    Page writes by reclaim 79252 9704
    Page writes file 75611 9228
    Page writes anon 3641 476
    Page reclaim immediate 16764 11014
    Page rescued immediate 0 0
    Slabs scanned 2171904 2152448
    Direct inode steals 385 2261
    Kswapd inode steals 659137 609670
    Kswapd skipped wait 1 69
    THP fault alloc 546 631
    THP collapse alloc 361 339
    THP splits 259 263
    THP fault fallback 98 50
    THP collapse fail 20 17
    Compaction stalls 747 499
    Compaction success 244 145
    Compaction failures 503 354
    Compaction pages moved 370888 474837
    Compaction move failure 77378 65259

    ===END stress-highalloc

    This patch:

    Introduce MIGRATEPAGE_SUCCESS as the default return code for the
    address_space_operations.migratepage() method, and document the expected
    return code for the same method in failure cases.

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
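
    For a mapping with no buffers, a minimal migratepage() implementation built
    on the generic helpers would return MIGRATEPAGE_SUCCESS roughly like this
    (a sketch; treat the helper signatures as approximate for this era):

        /* Sketch of a minimal ->migratepage() for a simple mapping. */
        static int sketch_migratepage(struct address_space *mapping,
                                      struct page *newpage, struct page *page,
                                      enum migrate_mode mode)
        {
                int rc;

                /* swap 'page' for 'newpage' in the mapping's radix tree */
                rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode);
                if (rc != MIGRATEPAGE_SUCCESS)
                        return rc;      /* e.g. -EAGAIN: caller may retry */

                migrate_page_copy(newpage, page);  /* copy data and flags */
                return MIGRATEPAGE_SUCCESS;
        }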
     
  • Several places need to find the pmd by (mm_struct, address), so introduce a
    function to simplify it.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ni zhan Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
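
    The helper presumably wraps the usual pgd -> pud -> pmd descent; a sketch
    of such a lookup with an invented name (three-level paging as of this era):

        /* Sketch: return the pmd covering 'address' in 'mm', or NULL if absent. */
        static pmd_t *find_pmd_sketch(struct mm_struct *mm, unsigned long address)
        {
                pgd_t *pgd;
                pud_t *pud;
                pmd_t *pmd;

                pgd = pgd_offset(mm, address);
                if (!pgd_present(*pgd))
                        return NULL;

                pud = pud_offset(pgd, address);
                if (!pud_present(*pud))
                        return NULL;

                pmd = pmd_offset(pud, address);
                if (!pmd_present(*pmd))
                        return NULL;

                return pmd;
        }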
     

11 Dec, 2012

13 commits

  • rmap_walk_anon() and try_to_unmap_anon() appear to be too
    careful about locking the anon vma: while they need protection
    against anon vma list modifications, they do not need exclusive
    access to the list itself.

    Transforming this exclusive lock to a read-locked rwsem removes
    a global lock from the hot path of page-migration intense
    threaded workloads which can cause pathological performance like
    this:

    96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
    |
    --- perf_trace_sched_switch
    __schedule
    schedule
    schedule_preempt_disabled
    __mutex_lock_common.isra.6
    __mutex_lock_slowpath
    mutex_lock
    |
    |--50.61%-- rmap_walk
    | move_to_new_page
    | migrate_pages
    | migrate_misplaced_page
    | __do_numa_page.isra.69
    | handle_pte_fault
    | handle_mm_fault
    | __do_page_fault
    | do_page_fault
    | page_fault
    | __memset_sse2
    | |
    | --100.00%-- worker_thread
    | |
    | --100.00%-- start_thread
    |
    --49.39%-- page_lock_anon_vma
    try_to_unmap_anon
    try_to_unmap
    migrate_pages
    migrate_misplaced_page
    __do_numa_page.isra.69
    handle_pte_fault
    handle_mm_fault
    __do_page_fault
    do_page_fault
    page_fault
    __memset_sse2
    |
    --100.00%-- worker_thread
    start_thread

    With this change applied the profile is now nicely flat
    and there's no anon-vma related scheduling/blocking.

    Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
    to make it clearer that it's an exclusive write-lock in
    that case - suggested by Rik van Riel.

    Suggested-by: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman

    Ingo Molnar
     
  • If there is excessive migration due to NUMA balancing it gets rate
    limited. It does this by counting the number of pages it has migrated
    recently but counts a transhuge page as 1 page. Account for it properly.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Subject says it all. Allocation failures and a failure to isolate should
    be accounted as a migration failure. This is partially another
    difference between base page and transhuge page migration. A base page
    migration makes multiple attempts for these conditions before it would
    be accounted for as a failure.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Commit "Add THP migration for the NUMA working set scanning fault case"
    breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK are defined to
    explode without CONFIG_TRANSPARENT_HUGEPAGE:

    mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
    mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
    mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed

    CONFIG_NUMA_BALANCING allows compilation without enabling transparent
    hugepages, so define the dummy function for such a configuration and only
    define migrate_misplaced_transhuge_page_put() when transparent hugepages
    are enabled.

    Signed-off-by: David Rientjes
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Note: This is very heavily based on a patch from Peter Zijlstra with
    fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
    put a lot of migration logic into mm/huge_memory.c where it does
    not belong. This version tries to share some of the migration
    logic with migrate_misplaced_page. However, it should be noted
    that now migrate.c is doing more with the pagetable manipulation
    than is preferred. The end result is barely recognisable so as
    before, the signed-offs had to be removed but will be re-added if
    the original authors are ok with it.

    Add THP migration for the NUMA working set scanning fault case.

    It uses the page lock to serialize. No migration pte dance is
    necessary because the pte is already unmapped when we decide
    to migrate.

    [dhillf@gmail.com: Fix memory leak on isolation failure]
    [dhillf@gmail.com: Fix transfer of last_nid information]
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Pass last_nid from misplaced page to newly allocated migration target page.

    Signed-off-by: Hillf Danton
    Signed-off-by: Mel Gorman

    Hillf Danton
     
  • If there are a large number of NUMA hinting faults and all of them
    are resulting in migrations it may indicate that memory is just
    bouncing uselessly around. NUMA balancing cost is likely exceeding
    any benefit from locality. Rate limit the PTE updates if the node
    is migration rate-limited. As noted in the comments, this distorts
    the NUMA faulting statistics.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • NOTE: This is very heavily based on similar logic in autonuma. It should
    be signed off by Andrea, but because there was no standalone
    patch and it's sufficiently different from what he did,
    the signed-off is omitted. Will be added back if requested.

    If a large number of pages are misplaced then the memory bus can be
    saturated just migrating pages between nodes. This patch rate-limits
    the amount of memory that can be migrating between nodes.

    Signed-off-by: Mel Gorman

    Mel Gorman
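
    A standalone model of the windowed rate limiting described here; the struct
    and field names are invented for illustration and are not the kernel's:

        #include <stdbool.h>

        /* Per-node state: pages already migrated in the current time window. */
        struct node_ratelimit {
                unsigned long window_end;        /* when the current window closes */
                unsigned long pages_this_window;
        };

        /*
         * Return true if migrating nr_pages more pages to this node should be
         * skipped because the per-window budget has already been spent.
         */
        bool migration_rate_limited(struct node_ratelimit *rl, unsigned long now,
                                    unsigned long window_len,
                                    unsigned long budget_pages,
                                    unsigned long nr_pages)
        {
                if (now >= rl->window_end) {             /* open a fresh window */
                        rl->window_end = now + window_len;
                        rl->pages_this_window = 0;
                }
                if (rl->pages_this_window + nr_pages > budget_pages)
                        return true;                     /* over budget: skip */
                rl->pages_this_window += nr_pages;
                return false;
        }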
     
  • It is tricky to quantify the basic cost of automatic NUMA placement in a
    meaningful manner. This patch adds some vmstats that can be used as part
    of a basic costing model.

    u = basic unit = sizeof(void *)
    Ca = cost of struct page access = sizeof(struct page) / u
    Cpte = Cost PTE access = Ca
    Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
            where Cpte is incurred twice for a read and a write and Wlock
            is a constant representing the cost of taking or releasing a
            lock
    Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
    Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
    Ci = Cost of page isolation = Ca + Wi
            where Wi is a constant that should reflect the approximate cost
            of the locking operation
    Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
            where Wnuma is the approximate NUMA factor. 1 is local. 1.2
            would imply that remote accesses are 20% more expensive

    Balancing cost = Cpte * numa_pte_updates +
                     Cnumahint * numa_hint_faults +
                     Ci * numa_pages_migrated +
                     Cpagecopy * numa_pages_migrated

    Note that numa_pages_migrated is used as a measure of how many pages
    were isolated even though it would miss pages that failed to migrate. A
    vmstat counter could have been added for it but the isolation cost is
    pretty marginal in comparison to the overall cost so it seemed overkill.

    The ideal way to measure automatic placement benefit would be to count
    the number of remote accesses versus local accesses and do something like

    benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

    but the information is not readily available. As a workload converges, the
    expectation would be that the number of remote numa hints would reduce to 0.

    convergence = numa_hint_faults_local / numa_hint_faults
            where this is measured for the last N
            numa hints recorded. When the workload is fully
            converged the value is 1.

    This can measure if the placement policy is converging and how fast it is
    doing it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Mel Gorman
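
    The costing model translates directly into arithmetic. A standalone sketch,
    with the unspecified weights (Wlock, Wi, Wnuma), the struct page size and
    the sample counter values chosen arbitrarily for illustration:

        #include <stdio.h>

        int main(void)
        {
                const double u = sizeof(void *);   /* basic unit */
                const double Ca = 64.0 / u;        /* struct page access; 64-byte struct page assumed */
                const double Wlock = 1.0, Wi = 1.0, Wnuma = 1.2;   /* placeholder weights */
                const double page_size = 4096.0;

                const double Cpte = Ca;                     /* PTE access */
                const double Cnumahint = 1000.0;            /* minor fault, per the model */
                const double Cpagerw = Ca + page_size / u;  /* read or write a full page */
                const double Ci = Ca + Wi;                  /* page isolation */
                const double Cpagecopy = Cpagerw + Cpagerw * Wnuma + Ci + Ci * Wnuma;

                /* example vmstat readings (made-up numbers) */
                const double numa_pte_updates = 1e6;
                const double numa_hint_faults = 2e5;
                const double numa_hint_faults_local = 1.5e5;
                const double numa_pages_migrated = 5e4;

                double cost = Cpte * numa_pte_updates +
                              Cnumahint * numa_hint_faults +
                              Ci * numa_pages_migrated +
                              Cpagecopy * numa_pages_migrated;
                double convergence = numa_hint_faults_local / numa_hint_faults;

                printf("balancing cost ~ %.0f units, convergence %.2f\n",
                       cost, convergence);
                return 0;
        }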
     
  • If we have to avoid migrating to a node that is nearly full, put page
    and return zero.

    Signed-off-by: Hillf Danton
    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • Note: This was originally based on Peter's patch "mm/migrate: Introduce
    migrate_misplaced_page()" but borrows extremely heavily from Andrea's
    "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
    collection". The end result is barely recognisable so signed-offs
    had to be dropped. If original authors are ok with it, I'll
    re-add the signed-off-bys.

    Add migrate_misplaced_page() which deals with migrating pages from
    faults.

    Based-on-work-by: Lee Schermerhorn
    Based-on-work-by: Peter Zijlstra
    Based-on-work-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Peter Zijlstra
     
  • The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
    about migration activity but not the type or the reason. This patch adds
    a tracepoint to identify the type of page migration and why the page is
    being migrated.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     
  • The compact_pages_moved and compact_pagemigrate_failed events are
    convenient for determining if compaction is active and to what
    degree migration is succeeding but it's at the wrong level. Other
    users of migration may also want to know if migration is working
    properly and this will be particularly true for any automated
    NUMA migration. This patch moves the counters down to migration
    with the new events called pgmigrate_success and pgmigrate_fail.
    The compact_blocks_moved counter is removed because while it was
    useful for debugging initially, it's worthless now as no meaningful
    conclusions can be drawn from its value.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel

    Mel Gorman
     

01 Aug, 2012

3 commits

  • Compaction (and page migration in general) can currently be hindered
    through pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.

    The replacement page will still show up temporarily in the rss/cache
    statistics, this can be fixed in a later patch as it's less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With HugeTLB pages, the hugetlb cgroup is uncharged in the compound page
    destructor. Since we are holding a hugepage reference, we can be sure
    that the old page won't get uncharged till the last put_page().

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Since we migrate only one hugepage, don't use a linked list for passing the
    page around. Directly pass the page that needs to be migrated as an argument.
    This also removes the usage of page->lru in the migrate path.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

04 Jun, 2012

1 commit

  • New tmpfs use of !PageUptodate pages for fallocate() is triggering the
    WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
    is called from migrate_page_copy() for compaction.

    It is anomalous that migration should use __set_page_dirty_nobuffers()
    on an address_space that does not participate in dirty and writeback
    accounting; and this has also been observed to insert surprising dirty
    tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

    We should probably give migrate_page_copy() a better way to preserve the
    tag and migrate accounting info, when mapping_cap_account_dirty(). But
    that needs some more work: so in the interim, avoid the warning by using
    a simple SetPageDirty on PageSwapBacked pages.

    Reported-and-tested-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
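
    The interim behaviour described above presumably amounts to something like
    the following in the migration copy path (a sketch with an invented helper
    name, not the literal change):

        /* Sketch: transfer the dirty bit from the old page to its replacement. */
        static void transfer_dirty(struct page *newpage, struct page *page)
        {
                if (!PageDirty(page))
                        return;
                if (PageSwapBacked(page))
                        SetPageDirty(newpage);   /* tmpfs/anon: no dirty accounting */
                else
                        __set_page_dirty_nobuffers(newpage);
        }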
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures that if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed, removing the need to add
    an additional check to see whether the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as internal error values.
    Most uid/gid setting system calls treat these values specially
    anyway, so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

16 May, 2012

1 commit


26 Apr, 2012

1 commit

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it is,
    it would get dereferenced anyway by mmput()ing it.

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Mar, 2012

1 commit

  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extends the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
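
    The ordering being fixed is roughly the following pattern; a sketch with
    the permission checks and error handling elided, not the literal kernel
    code:

        /*
         * Sketch: keep the task stable with rcu_read_lock() plus a task
         * reference across the permission checks, and resolve the mm exactly
         * once, just before the migration work itself.
         */
        rcu_read_lock();
        task = pid ? find_task_by_vpid(pid) : current;
        if (!task) {
                rcu_read_unlock();
                return -ESRCH;
        }
        get_task_struct(task);          /* pin the task for the blocking checks */

        /* ... credential checks against 'task', still under rcu ... */
        rcu_read_unlock();

        /* ... security and cpuset checks; safe thanks to the task reference ... */

        mm = get_task_mm(task);         /* switch from task to mm exactly once */
        put_task_struct(task);
        if (!mm)
                return -EINVAL;
        /* ... do the page migration against 'mm', then mmput(mm) ... */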
     

06 Mar, 2012

1 commit

  • When moving tasks from old memcg (with move_charge_at_immigrate on new
    memcg), followed by removal of old memcg, hit General Protection Fault in
    mem_cgroup_lru_del_list() (called from release_pages called from
    free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
    exit_mmap from mmput from exit_mm from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats, and take page off lru before freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; but we do when it comes to mem_cgroup_lru_del_list().

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Feb, 2012

1 commit

  • Postpone resetting page->mapping until the final remove_migration_ptes().
    Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
    work.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

2 commits

  • This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
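
    The three modes map onto a small enum; a sketch of the distinctions being
    drawn (the comments are mine, not the header's):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* never block (async compaction, THP allocation) */
                MIGRATE_SYNC_LIGHT,     /* may block, but not on writeback / ->writepage */
                MIGRATE_SYNC,           /* fully synchronous: memory hotplug and the like */
        };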
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

        if (PageDirty(page) && !sync &&
            mapping->a_ops->migratepage != migrate_page)
                rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
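
    "Fail gracefully" means the callback, now told which mode it is running in,
    reports an error instead of blocking; a sketch using the migrate_mode enum
    from the entry above (the would-block predicate is invented for
    illustration):

        /* Sketch: a ->migratepage() that refuses to block in async mode. */
        static int sketchfs_migratepage(struct address_space *mapping,
                                        struct page *newpage, struct page *page,
                                        enum migrate_mode mode)
        {
                /*
                 * If migrating this page would mean waiting on writeback or an
                 * fs-private lock, an async caller must not be blocked: return
                 * an error and let compaction move on to the next page.
                 */
                if (mode == MIGRATE_ASYNC && sketchfs_page_would_block(page))
                        return -EAGAIN;

                return migrate_page(mapping, newpage, page, mode);
        }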