21 Mar, 2014

1 commit

  • Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
    little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
    indicating that remove_migration_ptes() failed to find one of the
    migration entries that was temporarily inserted.

    The problem comes from remap_file_pages()'s switch from vma_interval_tree
    (good for inserting the migration entry) to i_mmap_nonlinear list (no good
    for locating it again); but can only be a problem if the remap_file_pages()
    range does not cover the whole of the vma (zap_pte() clears the range).

    remove_migration_ptes() needs a file_nonlinear method to go down the
    i_mmap_nonlinear list, applying linear location to look for migration
    entries in those vmas too, just in case there was this race.

    The file_nonlinear method does need rmap_walk_control.arg to do this;
    but it never needed vma passed in - vma comes from its own iteration.

    Reported-and-tested-by: Dave Jones
    Reported-and-tested-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
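
    A rough sketch of what remove_migration_ptes() looks like with the
    file_nonlinear method wired into rmap_walk_control (an approximation of
    the committed code):

        static void remove_migration_ptes(struct page *old, struct page *new)
        {
                struct rmap_walk_control rwc = {
                        .rmap_one = remove_migration_pte,
                        .arg = old,
                        /* also walk i_mmap_nonlinear, in case the migration
                         * entry was left in a nonlinear vma by this race */
                        .file_nonlinear = remove_linear_migration_ptes_from_nonlinear,
                };

                rmap_walk(new, &rwc);
        }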
     

11 Mar, 2014

1 commit

  • GFP_THISNODE is for callers that implement their own clever fallback to
    remote nodes. It restricts the allocation to the specified node and
    does not invoke reclaim, assuming that the caller will take care of it
    when the fallback fails, e.g. through a subsequent allocation request
    without GFP_THISNODE set.

    However, many current GFP_THISNODE users only want the node exclusive
    aspect of the flag, without actually implementing their own fallback or
    triggering reclaim if necessary. This results in things like page
    migration failing prematurely even when there is easily reclaimable
    memory available, unless kswapd happens to be running already or a
    concurrent allocation attempt triggers the necessary reclaim.

    Convert all callsites that don't implement their own fallback strategy
    to __GFP_THISNODE. This restricts the allocation to a single node too, but
    at the same time allows the allocator to enter the slowpath, wake
    kswapd, and invoke direct reclaim if necessary, to make the allocation
    happen when memory is full.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Jan Stancek
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
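
    A minimal sketch of the difference, for a hypothetical caller allocating
    an order-0 page on node nid (illustrative, not taken from any one
    callsite):

        struct page *page;

        /* GFP_THISNODE: node-exclusive and no reclaim; the caller is
         * expected to implement its own fallback, e.g.: */
        page = alloc_pages_node(nid, GFP_THISNODE, 0);
        if (!page)
                page = alloc_pages(GFP_KERNEL, 0);      /* caller's fallback */

        /* __GFP_THISNODE: still restricted to nid, but the allocator may
         * wake kswapd and enter direct reclaim to satisfy the request. */
        page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);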
     

28 Jan, 2014

1 commit


24 Jan, 2014

2 commits

  • Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copies
    over the cpupid at page migration time. It is unnecessary to set it
    again in migrate_misplaced_transhuge_page().

    Signed-off-by: Wanpeng Li
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that such a page dump is
    quite useful to people debugging issues in mm.

    This patch adds VM_BUG_ON_PAGE(cond, page), which does everything
    VM_BUG_ON() does and additionally dumps the page before executing the
    actual BUG_ON.

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
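
    In spirit the macro expands to something like the following (the exact
    dump_page() signature has varied between kernel versions, so take this
    as a sketch):

        #define VM_BUG_ON_PAGE(cond, page)                              \
                do {                                                    \
                        if (unlikely(cond)) {                           \
                                dump_page(page);   /* page state first */\
                                BUG();                                  \
                        }                                               \
                } while (0)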
     

22 Jan, 2014

8 commits

  • fail_migrate_page() isn't used anywhere, so remove it.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Some parts of putback_lru_pages() and putback_movable_pages() are
    duplicated, which makes it confusing which of the two should be used.
    putback_lru_pages() is not really needed any more, so remove it; this
    makes the code easier to understand and maintain.

    The comment on putback_movable_pages() is also stale now, so fix it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We should remove the page from the list if we fail with ENOSYS, since
    migrate_pages() considers error cases other than -ENOMEM and -EAGAIN as
    permanent failures and assumes that the page has been removed from the
    list. Without this patch, we could overcount the number of failures.

    In addition, we should put back the new hugepage if
    !hugepage_migration_support(). Otherwise we would leak hugepage memory.

    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
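
    A sketch of the two fixes in unmap_and_move_huge_page(), as described
    above (the exact committed hunk may be arranged differently):

        if (!hugepage_migration_support(page_hstate(hpage))) {
                /*
                 * Take the hugepage off the migration list here: the caller
                 * treats -ENOSYS as a permanent failure and assumes the page
                 * is no longer on the list.
                 */
                putback_active_hugepage(hpage);
                return -ENOSYS;
        }

        /* allocate the destination only after the check, so an unusable
         * new hugepage is never left dangling */
        new_hpage = get_new_page(hpage, private, &result);
        if (!new_hpage)
                return -ENOMEM;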
     
  • Let's add a comment about where the failed page goes, which makes the
    code more readable.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Acked-by: Christoph Lameter
    Reviewed-by: Wanpeng Li
    Acked-by: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A low local/remote numa hinting fault ratio is potentially explained by
    failed migrations. This patch adds a tracepoint that fires when
    migration fails due to migration rate limitation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NUMA migrate rate limiting protects a migration counter and window
    using a lock, but in some cases this lock can be contended. It is not
    critical that the number of pages be exact; lost updates are
    acceptable. Reduce the importance of this lock.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
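
    Roughly the idea (a sketch using the pg_data_t field names of that era,
    not necessarily the exact committed code): only the window reset takes
    the lock, and even that is skipped via trylock, while the counter itself
    accepts unlocked, possibly lost updates:

        if (time_after(jiffies, pgdat->numabalancing_migrate_next_window) &&
            spin_trylock(&pgdat->numabalancing_migrate_lock)) {
                pgdat->numabalancing_migrate_nr_pages = 0;
                pgdat->numabalancing_migrate_next_window = jiffies +
                        msecs_to_jiffies(migrate_interval_millisecs);
                spin_unlock(&pgdat->numabalancing_migrate_lock);
        }
        if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
                return true;

        /* Unlocked, non-atomic update: a lost update only makes the page
         * count slightly imprecise, which is deemed acceptable. */
        pgdat->numabalancing_migrate_nr_pages += nr_pages;
        return false;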
     
  • numamigrate_update_ratelimit and numamigrate_isolate_page only have
    callers in mm/migrate.c. This patch makes them static.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Each rmap traversal case differs slightly, so we need function
    pointers, and arguments for them, in order to handle these differences.

    For this purpose, struct rmap_walk_control is introduced in this patch,
    and will be extended in a following patch. Introduction and extension
    are kept separate because that clarifies the changes.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
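
    For reference, the control structure roughly as it looks once this
    patch and its follow-ups have landed (around 3.14; the initial patch
    carries only a subset of these fields):

        struct rmap_walk_control {
                void *arg;              /* handed through to the callbacks */
                int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                                unsigned long addr, void *arg);
                int (*done)(struct page *page);         /* stop walk early */
                int (*file_nonlinear)(struct page *page,
                                struct address_space *mapping, void *arg);
                struct anon_vma *(*anon_lock)(struct page *page);
                bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
        };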
     

22 Dec, 2013

1 commit

  • The arbitrary restriction on page counts offered by the core
    migrate_page_move_mapping() code results in rather suspicious looking
    fiddling with page reference counts in the aio_migratepage() operation.
    To fix this, make migrate_page_move_mapping() take an extra_count parameter
    that allows aio to tell the code about its own reference count on the page
    being migrated.

    While cleaning up aio_migratepage(), make it validate that the old page
    being passed in is actually what aio_migratepage() expects to prevent
    misbehaviour in the case of races.

    Signed-off-by: Benjamin LaHaise

    Benjamin LaHaise
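
    With extra_count, aio_migratepage() can simply declare its own
    reference instead of adjusting page counts by hand; a sketch of the
    call, where mode is the migrate_mode handed to the ->migratepage
    callback (surrounding fs/aio.c locking omitted):

        /* aio holds one extra reference on the ring page being migrated */
        rc = migrate_page_move_mapping(mapping, new, old, NULL, mode, 1);
        if (rc != MIGRATEPAGE_SUCCESS)
                return rc;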
     

19 Dec, 2013

5 commits

  • THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a PMD changes during a THP migration then migration aborts but the
    failure path is doing more work than is necessary.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MMU notifiers must be called on THP page migration or secondary MMUs
    will get very confused.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 Nov, 2013

1 commit

  • Right now, the migration code in migrate_page_copy() uses copy_huge_page()
    for hugetlbfs and thp pages:

        if (PageHuge(page) || PageTransHuge(page))
                copy_huge_page(newpage, page);

    So, yay for code reuse. But:

        void copy_huge_page(struct page *dst, struct page *src)
        {
                struct hstate *h = page_hstate(src);

    and a non-hugetlbfs page has no page_hstate(). This works 99% of the
    time because page_hstate() determines the hstate from the page order
    alone. Since the page order of a THP page matches the default hugetlbfs
    page order, it works.

    But, if you change the default huge page size on the boot command-line
    (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
    so page_hstate() returns null and copy_huge_page() oopses pretty fast
    since copy_huge_page() dereferences the hstate:

        void copy_huge_page(struct page *dst, struct page *src)
        {
                struct hstate *h = page_hstate(src);

                if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
                ...

    Mel noticed that the migration code is really the only user of these
    functions. This moves all the copy code over to migrate.c and makes
    copy_huge_page() work for THP by checking for it explicitly.

    I believe the bug was introduced in commit b32967ff101a ("mm: numa: Add
    THP migration for the NUMA working set scanning fault case")

    [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Naoya Horiguchi
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Tested-by: Dave Jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
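
    Roughly what the copy ends up looking like in mm/migrate.c with the
    explicit THP check (helper names approximate):

        static void copy_huge_page(struct page *dst, struct page *src)
        {
                int i;
                int nr_pages;

                if (PageHuge(src)) {
                        /* hugetlbfs page: an hstate exists for this size */
                        struct hstate *h = page_hstate(src);

                        nr_pages = pages_per_huge_page(h);
                        if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
                                /* gigantic pages use a separate copy loop */
                                copy_gigantic_page(dst, src);
                                return;
                        }
                } else {
                        /* thp page: no hstate is involved at all */
                        BUG_ON(!PageTransHuge(src));
                        nr_pages = hpage_nr_pages(src);
                }

                for (i = 0; i < nr_pages; i++) {
                        cond_resched();
                        copy_highpage(dst + i, src + i);
                }
        }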
     

15 Nov, 2013

2 commits

  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugetlb supports multiple page sizes. We use split lock only for PMD
    level, but not for PUD.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

1 commit

  • THP migration uses the page lock to guard against parallel allocations
    but there are cases like this still open

        Task A                          Task B
        ---------------------           ---------------------
        do_huge_pmd_numa_page           do_huge_pmd_numa_page
        lock_page
        mpol_misplaced == -1
        unlock_page
        goto clear_pmdnuma
                                        lock_page
                                        mpol_misplaced == 2
                                        migrate_misplaced_transhuge
        pmd = pmd_mknonnuma
        set_pmd_at

    During hours of testing, one crashed with weird errors and while I have
    no direct evidence, I suspect something like the race above happened.
    This patch extends the page lock to being held until the pmd_numa is
    cleared to prevent migration starting in parallel while the pmd_numa is
    being cleared. It also flushes the old pmd entry and orders pagetable
    insertion before rmap insertion.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

1 commit

  • If page migration is turned on in config and a page is under
    migration, we may lose the soft dirty bit. If fork or mprotect is
    called on migrating pages then, once migration is complete, the pages
    do not obtain the soft dirty bit in the corresponding pte entries. Fix
    it by adding an appropriate test on swap entries.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
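
    The shape of the fix, sketched for the fork path (the mprotect path
    gets an analogous check; variable names follow mm/memory.c's
    copy_one_pte() only approximately):

        swp_entry_t entry = pte_to_swp_entry(*src_pte);

        if (is_write_migration_entry(entry) && is_cow_mapping(vm_flags)) {
                /* COW mappings get a read-only migration entry ... */
                make_migration_entry_read(&entry);
                pte = swp_entry_to_pte(entry);
                /* ... but the soft-dirty bit must survive the rewrite */
                if (pte_swp_soft_dirty(*src_pte))
                        pte = pte_swp_mksoft_dirty(pte);
                set_pte_at(src_mm, addr, src_pte, pte);
        }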
     

09 Oct, 2013

5 commits

  • After page migration, the new page has the nidpid unset. This makes
    every fault on a recently migrated page look like a first numa fault,
    leading to another page migration.

    Copying over the nidpid at page migration time should prevent erroneous
    migrations of recently migrated pages.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-46-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fixed compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
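
    Conceptually the accounting splits each per-node fault counter into a
    shared and a private bucket, chosen by whether the pid bits recorded in
    the page match the faulting task. A sketch with illustrative names
    (nidpid_pid_matches_current() is a placeholder for the real bit
    comparison in kernel/sched/fair.c):

        /* two buckets per node: [2*nid] shared, [2*nid + 1] private */
        static inline int task_faults_idx(int nid, int priv)
        {
                return 2 * nid + priv;
        }

        /* at fault time */
        priv = nidpid_pid_matches_current(last_nidpid);  /* placeholder */
        p->numa_faults[task_faults_idx(page_to_nid(page), priv)] += pages;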
     
  • Currently automatic NUMA balancing is unable to distinguish between
    false shared versus private pages except by ignoring pages with an
    elevated page_mapcount entirely. This avoids shared pages bouncing
    between the nodes whose tasks are using them, but it ignores quite a
    lot of data.

    This patch kicks away the training wheels now that support for
    identifying shared/private pages is in place. The ordering is chosen so
    that the impact of the shared/private detection can be easily measured.
    Note that the patch does not migrate shared, file-backed pages within
    vmas marked VM_EXEC as these are generally shared library pages.
    Migrating such pages is not beneficial as there is an expectation that
    they are read-shared between caches and iTLB and iCache pressure is
    generally low.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-28-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • THP migration uses the page lock to guard against parallel allocations
    but there are cases like this still open

        Task A                          Task B
        ---------------------           ---------------------
        do_huge_pmd_numa_page           do_huge_pmd_numa_page
        lock_page
        mpol_misplaced == -1
        unlock_page
        goto clear_pmdnuma
                                        lock_page
                                        mpol_misplaced == 2
                                        migrate_misplaced_transhuge
        pmd = pmd_mknonnuma
        set_pmd_at

    During hours of testing, one crashed with weird errors and while I have
    no direct evidence, I suspect something like the race above happened.
    This patch extends the page lock to being held until the pmd_numa is
    cleared to prevent migration starting in parallel while the pmd_numa is
    being cleared. It also flushes the old pmd entry and orders pagetable
    insertion before rmap insertion.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

01 Oct, 2013

1 commit

  • Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such balloon page leak opens a race window against LRU lists
    shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue, by assuring the proper tests are made at
    putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
    isolated balloon pages being wrongly reinserted in LRU lists.

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

14 Sep, 2013

1 commit

  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

12 Sep, 2013

5 commits

  • This patch is based on KOSAKI's work and I add a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots of
    free pages in a zone but only order-0 and order-1 pages, which means the
    zone is heavily fragmented. A high-order allocation can then cause long
    stalls (e.g. 60 seconds) in the direct reclaim path, especially in a
    no-swap and no-compaction environment. This problem happened on v3.4,
    but it seems the issue still lives in the current tree; the reason is
    that do_try_to_free_pages() enters a livelock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there is little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to order-0, because kswapd thinks order-0 is the
    most important. Look at 73ce02e9 in detail. If watermarks are ok, kswapd
    will go back to sleep and may leave zone->all_unreclaimable = 0. It
    assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever
    while kswapd sleeps forever, until someone like a watchdog detects it
    and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that approach
    would be racy. Thus this patch removes the zone->all_unreclaimable
    field completely and recalculates the zone reclaimable state every
    time.

    Note: we can't take the idea that direct reclaim sees
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. commit 929bea7c71
    (vmscan: all_unreclaimable() use zone->all_unreclaimable as a name)
    describes the detail.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
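
    The recalculated state boils down to comparing scan effort against what
    is left to scan, along the lines of the helpers named in the akpm note
    above:

        static bool zone_reclaimable(struct zone *zone)
        {
                /* scanned six times the reclaimable pages without a page
                 * being freed: treat the zone as unreclaimable for now */
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }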
     
  • Currently hugepage migration works well only for pmd-based hugepages
    (mainly due to lack of testing,) so we had better not enable migration of
    other levels of hugepages until we are ready for it.

    Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
    page table walk and check pud/pmd_huge() there, so they are safe. But the
    other users (softoffline and memory hotremove) don't do this, so without
    this patch they can try to migrate unexpected types of hugepages.

    To prevent this, we introduce hugepage_migration_support() as an
    architecture-dependent check of whether hugepages are implemented on a
    pmd basis or not. And on some architectures multiple sizes of hugepages
    are available, so hugepage_migration_support() also checks the hugepage
    size.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
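
    The check ties migratability to pmd-based hugepage sizes, roughly (the
    split between the generic helper and the per-architecture part may
    differ):

        static inline int hugepage_migration_support(struct hstate *h)
        {
                /* migratable only if this size is backed by a single pmd */
                return pmd_huge_support() && (huge_page_shift(h) == PMD_SHIFT);
        }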
     
  • Extend move_pages() to handle vma with VM_HUGETLB set. We will be able to
    migrate hugepage with move_pages(2) after applying the enablement patch
    which comes later in this series.

    We avoid getting refcount on tail pages of hugepage, because unlike thp,
    hugepage is not split and we need not care about races with splitting.

    And migration of larger (1GB for x86_64) hugepages is not enabled.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently migrate_huge_page() takes a pointer to a hugepage to be migrated
    as an argument, instead of taking a pointer to the list of hugepages to be
    migrated. This behavior was introduced in commit 189ebff28 ("hugetlb:
    simplify migrate_huge_page()"), and was OK because until now hugepage
    migration is enabled only for soft-offlining which migrates only one
    hugepage in a single call.

    But the situation will change in the later patches in this series which
    enable other users of page migration to support hugepage migration. They
    can kick migration for both of normal pages and hugepages in a single
    call, so we need to go back to original implementation which uses linked
    lists to collect the hugepages to be migrated.

    With this patch, soft_offline_huge_page() switches to use migrate_pages(),
    and migrate_huge_page() is not used any more. So let's remove it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently hugepage migration is available only for soft offlining, but
    it's also useful for some other users of page migration (clearly because
    users of hugepage can enjoy the benefit of mempolicy and memory hotplug.)
    So this patchset tries to extend such users to support hugepage migration.

    The target of this patchset is to enable hugepage migration for NUMA
    related system calls (migrate_pages(2), move_pages(2), and mbind(2)), and
    memory hotplug.

    This patchset does not add hugepage migration for memory compaction,
    because users of memory compaction mainly expect to construct thp by
    arranging raw pages, and there's little or no need to compact hugepages.
    CMA, another user of page migration, can have benefit from hugepage
    migration, but is not enabled to support it for now (just because of lack
    of testing and expertise in CMA.)

    Hugepage migration of non pmd-based hugepage (for example 1GB hugepage in
    x86_64, or hugepages in architectures like ia64) is not enabled for now
    (again, because of lack of testing.)

    As for how these are achieved, I extended the API (migrate_pages()) to
    handle hugepage (with patch 1 and 2) and adjusted code of each caller to
    check and collect movable hugepages (with patch 3-7). Remaining 2 patches
    are kind of miscellaneous ones to avoid unexpected behavior. Patch 8 is
    about making sure that we only migrate pmd-based hugepages. And patch 9
    is about choosing appropriate zone for hugepage allocation.

    My test is mainly a functional one, simply kicking hugepage migration
    via each entry point and confirming that migration is done correctly.
    Test code
    is available here:

    git://github.com/Naoya-Horiguchi/test_hugepage_migration_extension.git

    And I always run libhugetlbfs test when changing hugetlbfs's code. With
    this patchset, no regression was found in the test.

    This patch (of 9):

    Before enabling each user of page migration to support hugepage,
    this patch enables the list of pages for migration to link not only
    LRU pages, but also hugepages. As a result, putback_movable_pages()
    and migrate_pages() can handle both LRU pages and hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Jul, 2013

1 commit

  • As the aio job will pin the ring pages, memory migration of those pages
    will fail. In order to fix this problem we use an anon inode to manage
    the aio ring pages, and set up the migratepage callback in the anon
    inode's address space, so that during memory migration the aio ring
    pages will be moved to another memory node safely.

    Signed-off-by: Gu Zheng
    Signed-off-by: Benjamin LaHaise

    Gu Zheng
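
    The hook-up amounts to giving the ring's anon-inode mapping an
    address_space_operations with a migratepage callback, along these lines
    (callback body and error handling omitted):

        static const struct address_space_operations aio_ctx_aops = {
                .migratepage    = aio_migratepage, /* move a pinned ring page */
        };

        /* ... and when the ring's backing file is created: */
        file->f_inode->i_mapping->a_ops = &aio_ctx_aops;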
     

13 Jun, 2013

1 commit

  • When we have a page fault for an address which is backed by a hugepage
    under migration, the kernel can't wait correctly and does busy looping
    on the hugepage fault until the migration finishes. As a result, users
    who try to kick hugepage migration (via soft offlining, for example)
    occasionally experience long delays or soft lockups.

    This is because pte_offset_map_lock() can't get a correct migration entry
    or a correct page table lock for hugepage. This patch introduces
    migration_entry_wait_huge() to solve this.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Rik van Riel
    Reviewed-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: KOSAKI Motohiro
    Cc: [2.6.35+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
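
    The new helper mirrors migration_entry_wait() but hands
    __migration_entry_wait() the hugepage's own pte and the lock that
    actually protects it; a sketch against the 3.10-era locking, where
    hugetlb ptes are guarded by mm->page_table_lock:

        void migration_entry_wait_huge(struct mm_struct *mm, pte_t *pte)
        {
                spinlock_t *ptl = &mm->page_table_lock;

                /* wait on the real huge pte, not a pte_offset_map_lock()
                 * lookup that is wrong for hugepages */
                __migration_entry_wait(mm, pte, ptl);
        }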
     

25 May, 2013

1 commit

  • Page 'new' during MIGRATION can't be flushed with flush_cache_page().
    Using flush_cache_page(vma, addr, pfn) is justified only if the page is
    already placed in process page table, and that is done right after
    flush_cache_page(). But without it the arch function has no knowledge
    of process PTE and does nothing.

    Besides that, flush_cache_page() flushes an application cache page, but
    the kernel has a different page virtual address and dirtied it.

    Replace it with flush_dcache_page(new) which is the proper usage.

    The old page is flushed in try_to_unmap_one() before migration.

    This bug occurs on a Sead3 board with an M14Kc MIPS CPU without cache
    aliasing (but a Harvard architecture with separate I and D caches), in
    a tight memory environment (128MB), every 1-3 days of SOAK testing. It
    fails in cc1 during a kernel build (SIGILL, SIGBUS, SIGSEGV) if
    CONFIG_COMPACTION is switched on.

    Signed-off-by: Leonid Yegoshin
    Cc: Leonid Yegoshin
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Ralf Baechle
    Cc: Russell King
    Cc: David Miller
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leonid Yegoshin
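
    The change in remove_migration_pte() is essentially the following
    (a sketch; surrounding hugetlb handling omitted):

        /*
         * Old (wrong): the new page is not in the process page table yet,
         * so the architecture has no user mapping to flush here.
         *
         *      flush_cache_page(vma, addr, pte_pfn(pte));
         *
         * New: flush the kernel's own, dirtied view of the page instead.
         */
        flush_dcache_page(new);
        set_pte_at(mm, addr, ptep, pte);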