13 Oct, 2018

1 commit

  • commit e125fe405abedc1dc8a5b2229b80ee91c1434015 upstream.

    A transparent huge page is represented by a single entry on an LRU list.
    Therefore, we can only make unevictable an entire compound page, not
    individual subpages.

    If a user tries to mlock() part of a huge page, we want the rest of the
    page to be reclaimable.

    We handle this by keeping PTE-mapped huge pages on the normal LRU lists: the
    PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

    The introduction of THP migration breaks[1] the rules around mlocking THP
    pages. If we had a single PMD mapping of the page in an mlocked VMA, the
    page would get mlocked regardless of the PTE mappings of the page.

    For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
    remove_migration_pmd().

    Anon THP pages can only be shared between processes via fork(). An mlocked
    page can only be shared if the parent mlocked it before forking; otherwise
    CoW will be triggered on mlock().

    For anon THP, we can fix the issue by munlocking the page when removing the
    PTE migration entry for the page. PTEs for the page will always come after
    the mlocked PMD: rmap walks VMAs from oldest to newest.

    Test-case:

    /* Build with: cc test.c -lnuma (mbind/MPOL_* come from numaif.h). */
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <numaif.h>

    int main(void)
    {
            unsigned long nodemask = 4;
            void *addr;

            addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

            if (fork()) {
                    wait(NULL);
                    return 0;
            }

            mlock(addr, 4UL << 10);
            mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
                  &nodemask, 4, MPOL_MF_MOVE);

            return 0;
    }

    [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Reviewed-by: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
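
    For concreteness, this is what the tag looks like at the top of a file (a
    minimal illustration of the convention; C source files use the C++-style
    comment, headers a block comment):

    /* First line of a C source file: */
    // SPDX-License-Identifier: GPL-2.0

    /* First line of a header file: */
    /* SPDX-License-Identifier: GPL-2.0 */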

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX license identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Oct, 2017

1 commit

  • The index was incremented before its last use, so the second array
    dereference could access an invalid address (not to mention that it did
    not properly clear the entry we intended to clear).

    Link: http://lkml.kernel.org/r/1506973525-16491-1-git-send-email-jglisse@redhat.com
    Fixes: 8315ada7f095bf ("mm/migrate: allow migrate_vma() to alloc new page on empty entry")
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Jérôme Glisse
    Cc: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Hairgrove
     

09 Sep, 2017

9 commits

  • This moves all new code, including the new page migration helpers, behind
    a kernel Kconfig option so that there is no code bloat for architectures
    or users that do not want to use HMM or any of its associated features.

    arm allyesconfig (first without the patchset, then with the patchset and
    this patch):
    text data bss dec hex filename
    83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
    83722364 46511131 27582964 157816459 968168b vmlinux

    [jglisse@redhat.com: struct hmm is only used by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessed from the CPU in a cache-coherent fashion. Add a new
    type of ZONE_DEVICE to represent such memory. The use cases are the same
    as for un-addressable device memory, but without all the corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This allows callers of migrate_vma() to allocate a new page for an empty
    CPU page table entry (pte_none or backed by the zero page). This is only
    for anonymous memory, and no new page will be instantiated if userfaultfd
    is armed.

    This is useful for device drivers that want to migrate a range of virtual
    addresses and would rather allocate new memory than have to fault later
    on.

    Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Allow unmapping and restoring the special swap entries of un-addressable
    ZONE_DEVICE memory.

    Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The common case for migration of a virtual address range is that pages are
    mapped only once, inside the vma in which migration is taking place.
    Because we already walk the CPU page table for that range, we can do the
    unmap directly there and set up the special migration swap entries.

    Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This patch adds new memory migration helpers, which migrate the memory
    backing a range of virtual addresses of a process to different memory
    (which can be allocated through a special allocator). It differs from
    NUMA migration by working on a range of virtual addresses, and thus by
    doing migration in chunks that can be large enough to use a DMA engine or
    a special copy-offloading engine.

    Expected users are anyone with heterogeneous memory where different
    memories have different characteristics (latency, bandwidth, ...). As an
    example, IBM platforms with a CAPI bus can make use of this feature to
    migrate between regular memory and CAPI device memory. New CPU
    architectures with a pool of high-performance memory not managed as a
    cache but presented as regular memory (while being faster and with lower
    latency than DDR) will also be prime users of this patch.

    Migration to private device memory will be useful for devices that have a
    large pool of such memory, like GPUs; NVidia plans to use HMM for that.

    Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Introduce a new migration mode that allows offloading the copy to a device
    DMA engine. This changes the workflow of migration, and not all
    address_space migratepage callbacks can support it.

    This is intended to be used by migrate_vma(), which itself is used for
    things like HMM (see include/linux/hmm.h).
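
    For orientation, a hedged sketch of where the new mode sits (assuming the
    include/linux/migrate_mode.h layout of this era; comments paraphrased):

    enum migrate_mode {
            MIGRATE_ASYNC,          /* never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but not on page writeback */
            MIGRATE_SYNC,           /* fully synchronous migration */
            MIGRATE_SYNC_NO_COPY,   /* new: like MIGRATE_SYNC, but the caller
                                       (e.g. a device DMA engine) does the copy */
    };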

    No additional per-filesystem migratepage testing is needed:
    MIGRATE_SYNC_NO_COPY is disabled in all problematic migratepage()
    callbacks, with a comment added to each explaining why (part of this
    patch). Any callback that wishes to support this new mode needs to be
    aware of the difference in the migration flow compared to the other
    modes.

    Some of these callbacks do extra locking while copying (aio, zsmalloc,
    balloon, ...), and for DMA to be effective you want to copy multiple
    pages in one DMA operation. But in the problematic cases you cannot
    easily hold the extra lock across multiple calls to this callback.

    Usual flow is:

        For each page {
            1 - lock page
            2 - call migratepage() callback
            3 - (extra locking in some migratepage() callback)
            4 - migrate page state (freeze refcount, update page cache,
                buffer head, ...)
            5 - copy page
            6 - (unlock any extra lock of migratepage() callback)
            7 - return from migratepage() callback
            8 - unlock page
        }

    The new mode MIGRATE_SYNC_NO_COPY:

        1 - lock multiple pages
        For each page {
            2 - call migratepage() callback
            3 - abort in all problematic migratepage() callback
            4 - migrate page state (freeze refcount, update page cache,
                buffer head, ...)
        } // finished all calls to migratepage() callback
        5 - DMA copy multiple pages
        6 - unlock all the pages

    To support MIGRATE_SYNC_NO_COPY in the problematic cases we would need a
    new callback, migratepages() for instance, that deals with multiple pages
    in one transaction.

    Because the problematic cases are not important for current usage, I did
    not want to complicate this patchset even more for no good reason.

    Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This patch enables thp migration for move_pages(2).

    Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Add thp migration's core code, including conversions between a PMD entry
    and a swap entry, setting PMD migration entry, removing PMD migration
    entry, and waiting on PMD migration entries.

    This patch makes it possible to support thp migration. If you fail to
    allocate a destination page as a thp, you just split the source thp as we
    do now, and then enter the normal page migration path. If you succeed in
    allocating a destination thp, you enter thp migration. Subsequent patches
    actually enable thp migration for each caller of page migration by
    allowing its get_new_page() callback to allocate thps.
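
    As a hedged fragment (helper names as added by this series; the
    surrounding walker context, including the 'retry' label, is assumed),
    this is how a page table walker that encounters a PMD migration entry
    waits for the migration to complete:

    pmd_t pmde = *pmd;

    if (is_pmd_migration_entry(pmde)) {
            /* The PMD was replaced by a migration swap entry; sleep until
             * migration finishes and the real PMD is restored. */
            pmd_migration_entry_wait(mm, pmd);
            goto retry;     /* 'retry' label assumed in the caller */
    }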

    [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
    Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
    [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

21 Aug, 2017

1 commit

  • The 'move_pages()' system call was introduced long, long ago with the
    same permission checks as for sending a signal (except using
    CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).

    That turns out to not be a great choice - while the system call really
    only moves physical page allocations around (and you need other
    capabilities to do a lot of it), you can check the return value to map
    out some of the virtual address choices and defeat ASLR of a binary that
    still shares your uid.
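
    For illustration, a hedged userspace sketch of such a probe (build with
    -lnuma; the probed address and target pid are arbitrary). With nodes ==
    NULL, move_pages(2) moves nothing and only reports, via the status array,
    where each page of the target process resides, so the per-page status
    distinguishes mapped from unmapped addresses:

    #include <numaif.h>     /* move_pages(2) wrapper from libnuma */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            int pid = argc > 1 ? atoi(argv[1]) : 0;   /* target, same uid */
            void *pages[1] = { (void *)0x400000UL };  /* address to probe */
            int status[1];

            /* nodes == NULL: query-only; status[0] becomes node id or -errno */
            if (move_pages(pid, 1, pages, NULL, status, 0) == 0)
                    printf("addr %p: status %d\n", pages[0], status[0]);

            return 0;
    }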

    So change the access checks to the more common 'ptrace_may_access()'
    model instead.

    This tightens the access checks for the uid, and also effectively
    changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
    anybody really _uses_ this legacy system call any more (we have better
    NUMA placement models these days), so I expect nobody to notice.

    Famous last words.

    Reported-by: Otto Ebeling
    Acked-by: Eric W. Biederman
    Cc: Willy Tarreau
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Aug, 2017

1 commit

  • While deferring TLB flushes is a good practice, the reverted patch
    caused pending TLB flushes to be checked while the page-table lock is
    not taken. As a result, on architectures with a weak memory model (PPC),
    Linux may miss a memory barrier, miss the fact that TLB flushes are
    pending, and cause (in theory) memory corruption.

    Since the alternative of using smp_mb__after_unlock_lock() was
    considered a bit open-coded, and the performance impact is expected to
    be small, the previous patch is reverted.

    This reverts b0943d61b8fa ("mm: numa: defer TLB flush for THP migration
    as long as possible").

    Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com
    Signed-off-by: Nadav Amit
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Andy Lutomirski
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Russell King
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

11 Jul, 2017

2 commits

  • When migrating a transparent hugepage, migrate_misplaced_transhuge_page
    guards itself against a concurrent fastgup of the page by checking that
    the page count is equal to 2 before and after installing the new pmd.

    If the page count changes, then the pmd is reverted back to the original
    entry, however there is a small window where the new (possibly writable)
    pmd is installed and the underlying page could be written by userspace.
    Restoring the old pmd could therefore result in loss of data.

    This patch fixes the problem by freezing the page count whilst updating
    the page tables, which protects against a concurrent fastgup without the
    need to restore the old pmd in the failure case (since the page count
    can no longer change under our feet).

    Link: http://lkml.kernel.org/r/1497349722-6731-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Will Deacon
    Acked-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Mark Rutland
    Cc: Steve Capper
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • Currently a hugepage migrated by soft-offline (i.e. due to correctable
    memory errors) is contained as a hugepage, which means many non-error
    pages in it are unreusable, i.e. wasted.

    This patch solves this issue by dissolving source hugepages into buddy.
    As done in the previous patch, PageHWPoison is set only on the head page
    of the error hugepage. Then, when dissolving, we move the PageHWPoison
    flag to the raw error page so that all healthy subpages return back to
    buddy.

    [arnd@arndb.de: fix warnings: replace some macros with inline functions]
    Link: http://lkml.kernel.org/r/20170609102544.2947326-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1496305019-5493-5-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

07 Jul, 2017

1 commit

  • Patch series "HugeTLB migration support for PPC64", v2.

    This patch (of 9):

    The right interface to use to set a hugetlb pte entry is set_huge_pte_at.
    Use that instead of set_pte_at.

    Link: http://lkml.kernel.org/r/1494926612-23928-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

04 May, 2017

3 commits

  • rmap_one's return value controls whether rmap_walk should continue to
    scan other ptes or not, so it's a natural target for conversion to
    boolean. Return true if the scan should be continued. Otherwise, return
    false to stop the scanning.

    This patch makes rmap_one's return value boolean.
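
    A hedged sketch of the convention after this change (the callback
    signature matches struct rmap_walk_control of this era; the helper named
    handle_one_mapping() is hypothetical):

    static bool example_rmap_one(struct page *page, struct vm_area_struct *vma,
                                 unsigned long addr, void *arg)
    {
            if (!handle_one_mapping(page, vma, addr, arg))  /* hypothetical */
                    return false;   /* stop the rmap walk */

            return true;            /* keep scanning other ptes/VMAs */
    }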

    Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are a few places where the code assumes anonymous pages should have
    the SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
    going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
    flag for them. The assumption doesn't hold any more, so fix them.
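
    For context, this is how userspace requests the lazy-free behaviour on an
    anonymous mapping (a minimal hedged sketch; MADV_FREE exists since Linux
    4.5):

    #include <sys/mman.h>
    #include <string.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8     /* fallback if libc headers predate Linux 4.5 */
    #endif

    int main(void)
    {
            size_t len = 1UL << 20;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            memset(p, 0xaa, len);           /* dirty the anonymous pages */
            madvise(p, len, MADV_FREE);     /* kernel may reclaim them lazily */

            return 0;
    }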

    Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • NUMA balancing already checks the watermarks of the target node to
    decide whether it's a suitable balancing target. Whether the node is
    reclaimable or not is irrelevant when we don't intend to reclaim.

    Link: http://lkml.kernel.org/r/20170228214007.5621-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 Apr, 2017

1 commit

  • Commit 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn
    based migration") moved the dec_node_page_state() call (along with the
    page_is_file_cache() call) to after putback_lru_page().

    But page_is_file_cache() can change after putback_lru_page() is called,
    so it should be called before putback_lru_page(), as it was before that
    patch, to prevent NR_ISOLATE_* stats from going negative.

    Without this fix, non-CONFIG_SMP kernels end up hanging in the
    while(too_many_isolated()) { congestion_wait() } loop in
    shrink_active_list() due to the negative stats.

    Mem-Info:
    active_anon:32567 inactive_anon:121 isolated_anon:1
    active_file:6066 inactive_file:6639 isolated_file:4294967295
    ^^^^^^^^^^
    unevictable:0 dirty:115 writeback:0 unstable:0
    slab_reclaimable:2086 slab_unreclaimable:3167
    mapped:3398 shmem:18366 pagetables:1145 bounce:0
    free:1798 free_pcp:13 free_cma:0

    Fixes: 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn based migration")
    Link: http://lkml.kernel.org/r/1492683865-27549-1-git-send-email-rabin.vincent@axis.com
    Signed-off-by: Rabin Vincent
    Acked-by: Michal Hocko
    Cc: Ming Ling
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
     

01 Apr, 2017

1 commit

  • I found that calling page migration for ksm pages causes the following
    bug:

    page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913
    flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked)
    raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001
    raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffff88007d81b800
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/rmap.c:1086!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260
    RSP: 0018:ffffc90002473b30 EFLAGS: 00010282
    RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0
    RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1
    R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80
    R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000
    FS: 00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    remove_migration_pte+0x220/0x2c0
    rmap_walk_ksm+0x143/0x220
    rmap_walk+0x55/0x60
    remove_migration_ptes+0x53/0x80
    migrate_pages+0x8ed/0xb60
    soft_offline_page+0x309/0x8d0
    store_soft_offline_page+0xaf/0xf0
    dev_attr_store+0x18/0x30
    sysfs_kf_write+0x3a/0x50
    kernfs_fop_write+0xff/0x180
    __vfs_write+0x37/0x160
    vfs_write+0xb2/0x1b0
    SyS_write+0x55/0xc0
    do_syscall_64+0x67/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7f51956339e0
    RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0
    RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001
    RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740
    R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400
    R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
    Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48
    RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30
    ---[ end trace a679d00f4af2df48 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled
    ---[ end Kernel panic - not syncing: Fatal exception

    The problem is in the following lines:

    new = page - pvmw.page->index +
    linear_page_index(vma, pvmw.address);

    The 'new' page is calculated from 'page', which is given by the caller as
    the destination page, plus some offset adjustment for thp. But this
    doesn't work properly for ksm pages, because pvmw.page->index doesn't
    change for each address while linear_page_index() does, which means that
    'new' points to a different page for each address backed by the ksm page.
    As a result, we try to set totally unrelated pages as destination pages,
    and that causes a kernel crash.

    This patch fixes the miscalculation and makes ksm page migration work
    fine.
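
    A hedged sketch of the fix in remove_migration_pte(), keeping the quoted
    arithmetic for the non-KSM case:

    /* KSM pages map one page at many unrelated addresses, so the
     * per-address offset arithmetic must be skipped for them. */
    if (PageKsm(page))
            new = page;
    else
            new = page - pvmw.page->index +
                    linear_page_index(vma, pvmw.address);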

    Fixes: 3fe87967c536 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()")
    Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

25 Feb, 2017

2 commits

  • remove_migration_pte() also can easily be converted to
    page_vma_mapped_walk().

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170129173858.45174-13-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Patch series "HWPOISON: soft offlining for non-lru movable page", v6.

    After Minchan's commit bda807d44454 ("mm: migrate: support non-lru
    movable page migration"), some type of non-lru page like zsmalloc and
    virtio-balloon page also support migration.

    Therefore, we can:

    1) soft offline non-lru movable pages, which means that when correctable
    memory errors occur on a non-lru movable page, we can stop using it by
    migrating its data onto another page and disable the original
    (maybe half-broken) one.

    2) enable memory hotplug for non-lru movable pages, i.e. we may offline
    blocks which include such pages, by using non-lru page migration.

    This patchset is heavily dependent on non-lru movable page migration.

    This patch (of 4):

    Change the return type of isolate_movable_page() from bool to int. It
    will return 0 when it isolates a movable page successfully, and return
    -EBUSY when isolation fails.

    There is no functional change within this patch, but it prepares for a
    later patch.

    [xieyisheng1@huawei.com: v6]
    Link: http://lkml.kernel.org/r/1486108770-630-2-git-send-email-xieyisheng1@huawei.com
    Link: http://lkml.kernel.org/r/1485867981-16037-2-git-send-email-ysxie@foxmail.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Andi Kleen
    Cc: Hanjun Guo
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Reza Arbab
    Cc: Taku Izumi
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

26 Dec, 2016

1 commit


13 Dec, 2016

2 commits

  • The bug in khugepaged fixed earlier in this series shows that radix tree
    slot replacement is fragile; and it will become more so when not only
    NULL <-> NULL transitions need to be caught but transitions from and to
    exceptional entries as well. We need checks.

    Re-implement radix_tree_replace_slot() on top of the sanity-checked
    __radix_tree_replace(). This requires existing callers to also pass the
    radix tree root, but it'll warn us when somebody replaces slots with
    contents that need proper accounting (transitions between NULL entries,
    real entries, exceptional entries) and where a replacement through the
    slot pointer would corrupt the radix tree node counts.
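
    A hedged caller-side sketch (the page cache tree is one such caller;
    helper names as in this era's radix tree API):

    /* Callers now pass the root, so the sanity-checked __radix_tree_replace()
     * can account NULL/real/exceptional transitions in the node counts. */
    void **slot = radix_tree_lookup_slot(&mapping->page_tree, index);

    radix_tree_replace_slot(&mapping->page_tree, slot, newpage);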

    Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit bda807d44454 ("mm: migrate: support non-lru movable page
    migration") isolate_migratepages_block() can isolate !PageLRU pages,
    which acct_isolated() would account as NR_ISOLATED_*. Accounting these
    non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense, and it
    can misguide heuristics based on those counters, such as
    pgdat_reclaimable_pages resp. too_many_isolated, which would lead to
    unexpected stalls during direct reclaim without any good reason. Note
    that __alloc_contig_migrate_range can isolate a lot of pages at once.

    On mobile devices such as a 512M RAM Android phone, it may use a big zram
    swap. In some cases zram (zsmalloc) uses too many non-lru but migratable
    pages, such as:

    MemTotal: 468148 kB
    Normal free:5620kB
    Free swap:4736kB
    Total swap:409596kB
    ZRAM: 164616kB(zsmalloc non-lru pages)
    active_anon:60700kB
    inactive_anon:60744kB
    active_file:34420kB
    inactive_file:37532kB

    Fix this by only accounting lru pages to NR_ISOLATED_* in
    isolate_migratepages_block right after they were isolated and we still
    know they were on LRU. Drop acct_isolated because it is called after
    the fact and we've lost that information. Batching per-cpu counter
    doesn't make much improvement anyway. Also make sure that we uncharge
    only LRU pages when putting them back on the LRU in
    putback_movable_pages resp. when unmap_and_move migrates the page.

    [mhocko@suse.com: replace acct_isolated() with direct counting]
    Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
    Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
    Signed-off-by: Ming Ling
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Ling
     

08 Oct, 2016

1 commit

  • vma->vm_page_prot is read locklessly from the rmap_walk; it may be updated
    concurrently, and this patch prevents the risk of reading intermediate
    values.
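
    A hedged sketch of the resulting pattern (the writer is serialized by
    mmap_sem held for write; the rmap-side reader is lockless):

    /* Writer side, e.g. when protections change: */
    WRITE_ONCE(vma->vm_page_prot, newprot);

    /* Lockless reader in the rmap walk; must not observe a torn value: */
    pte_t pte = mk_pte(page, READ_ONCE(vma->vm_page_prot));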

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Jul, 2016

6 commits

  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
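
    A hedged sketch of the resulting definitions (modulo the exact flag
    spelling in gfp.h of this era): GFP_TRANSHUGE_LIGHT allows no reclaim at
    all, while GFP_TRANSHUGE additionally allows direct reclaim.

    #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
    #define GFP_TRANSHUGE       (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)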

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was retried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before considering
    the OOM killer.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages. As
    those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats, as
    it is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities, such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent, but it's
    confusing, so this patch moves the relevant file-based accounting to the
    node. Due to throttling logic in the page allocator for reliable OOM
    detection, it is still necessary to track dirty and writeback pages on a
    per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_PAGES is the number of mapped anon pages.

    This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
    NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we
    have

    NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_MAPPED is the number of mapped anon pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

5 commits

  • With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are met:

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment.

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound, we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimizations later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages differs from that of anon pages: a file page can
    be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Now the VM has a feature to migrate non-lru movable pages, so balloon
    doesn't need custom migration hooks in migrate.c and compaction.c.

    Instead, this patch implements the
    page->mapping->a_ops->{isolate|migrate|putback} functions.

    With that, we can remove the hooks for ballooning in the general
    migration functions and make balloon compaction simple.

    [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
    Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have allowed migration for only LRU pages until now, and it was enough
    to make high-order pages. But recently, embedded systems (e.g., webOS,
    Android) use lots of non-movable pages (e.g., zram, GPU memory), so we
    have seen several reports about troubles with small high-order
    allocations. To fix the problem, there were several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on), but if there are lots of non-movable
    pages in the system, those solutions are void in the long run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For this feature, it introduces migration-related functions in
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations (a minimal sketch follows the list below).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated, so concurrent isolation on several
    CPUs skips the page. If a driver cannot isolate the page, it should
    return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields, so
    the driver shouldn't expect the values in those fields to be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The role of migratepage is to move the content of the old page to
    the new page and set up the fields of struct page for newpage. Keep in
    mind that you should indicate to the VM that the oldpage is no longer
    movable via __ClearPageMovable() under page_lock if you migrated the
    oldpage successfully and return 0. If the driver cannot migrate the page
    at the moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
    migration after a short time, because the VM interprets -EAGAIN as
    "temporary migration failure". On returning any error other than
    -EAGAIN, the VM will give up on the page migration without retrying.

    The driver shouldn't touch the page.lru field that the VM is using in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the isolated
    page to the driver, so the VM calls the driver's putback_page with the
    page whose migration failed. In this function, the driver should put the
    isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable pages.

    * PG_movable

    The driver should use the function below to make a page movable under
    page_lock:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It needs an address_space argument for registering the migration family
    of functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses the
    lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping, which masks off the low two bits of
    page->mapping, so it can get the right struct address_space.

    For testing of non-lru movable pages, the VM supports the __PageMovable
    function. However, it doesn't guarantee identifying a non-lru movable
    page, because the page->mapping field is unified with other variables in
    struct page. Also, if the driver releases the page after isolation by
    the VM, page->mapping doesn't have a stable value although it has
    PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But __PageMovable
    is a cheap check for whether a page is LRU or non-lru movable once the
    page has been isolated, because LRU pages can never have
    PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking
    to test for non-lru movable pages before the more expensive check with
    lock_page in pfn scanning to select a victim.

    For guaranteeing a non-lru movable page, the VM provides the PageMovable
    function. Unlike __PageMovable, PageMovable validates page->mapping and
    mapping->a_ops->isolate_page under lock_page. The lock_page prevents
    sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag, because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated page,
    it means the page has been isolated by the VM, so it shouldn't touch the
    page.lru field. PG_isolated is aliased with the PG_reclaim flag, so the
    driver shouldn't use that flag for its own purposes.
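
    As promised above, a minimal hedged sketch of the driver side, assuming a
    hypothetical "foo" driver with its own bookkeeping helpers (foo_detach,
    foo_copy and foo_attach are illustrative, not real APIs); the callback
    signatures are the ones listed in this description:

    static bool foo_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Detach the page from the driver's own lists; true on success. */
            return foo_detach(page);                /* hypothetical helper */
    }

    static int foo_migratepage(struct address_space *mapping,
                               struct page *newpage, struct page *oldpage,
                               enum migrate_mode mode)
    {
            if (!foo_copy(newpage, oldpage))        /* hypothetical helper */
                    return -EAGAIN;                 /* VM retries shortly */

            /* Old page migrated successfully: it is no longer movable. */
            __ClearPageMovable(oldpage);
            return 0;
    }

    static void foo_putback_page(struct page *page)
    {
            foo_attach(page);       /* hypothetical: back onto driver lists */
    }

    static const struct address_space_operations foo_aops = {
            .isolate_page   = foo_isolate_page,
            .migratepage    = foo_migratepage,
            .putback_page   = foo_putback_page,
    };

    /* Pages are made movable at allocation time, under page_lock, with
     * __SetPageMovable(page, mapping) where mapping->a_ops == &foo_aops. */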

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, I got many reports about performance degradation in embedded
    systems (Android mobile phones, webOS TVs and so on) and easy fork
    failures.

    The problem was fragmentation caused mainly by zram and the GPU driver.
    Under memory pressure, their pages were spread out over all pageblocks
    and could not be migrated by the current compaction algorithm, which
    supports only LRU pages. In the end, compaction cannot work well, so the
    reclaimer shrinks all of the working-set pages. It made the system very
    slow and even made fork, which requires order-2 or order-3 allocations,
    fail easily.

    The other pain point is that they cannot use CMA memory space, so when an
    OOM kill happens, I can see many free pages in the CMA area, which is not
    memory efficient. In our product, which has big CMA memory, it reclaims
    zones too excessively to allocate GPU and zram pages although there is
    lots of free space in CMA, so the system easily becomes very slow.

    To solve these problems, this patch tries to add a facility to migrate
    non-lru pages by introducing new functions and page flags to help
    migration.

    struct address_space_operations {
    ..
    ..
    bool (*isolate_page)(struct page *, isolate_mode_t);
    void (*putback_page)(struct page *);
    ..
    }

    new page flags

    PG_movable
    PG_isolated

    For details, please read description in "mm: migrate: support non-lru
    movable page migration".

    Originally, Gioh Kim had tried to support this feature but he moved on,
    so I took over the work. I took much code from his work and changed it a
    little bit, and Konstantin Khlebnikov helped Gioh a lot, so he deserves
    much credit, too.

    And I should mention Chulmin, who has tested this patchset heavily so I
    could find many bugs. :)

    Thanks, Gioh, Konstantin and Chulmin!

    This patchset consists of five parts.

    1. clean up migration
    mm: use put_page to free page instead of putback_lru_page

    2. add non-lru page migration feature
    mm: migrate: support non-lru movable page migration

    3. rework KVM memory-ballooning
    mm: balloon: use general non-lru movable page feature

    4. zsmalloc refactoring for preparing page migration
    zsmalloc: keep max_object in size_class
    zsmalloc: use bit_spin_lock
    zsmalloc: use accessor
    zsmalloc: factor page chain functionality out
    zsmalloc: introduce zspage structure
    zsmalloc: separate free_zspage from putback_zspage
    zsmalloc: use freeobj for index

    5. zsmalloc page migration
    zsmalloc: page migration support
    zram: use __GFP_MOVABLE for memory allocation

    This patch (of 12):

    The procedure of page migration is as follows:

    First of all, it should isolate a page from the LRU and try to migrate
    the page. If this is successful, it releases the page for freeing.
    Otherwise, it should put the page back on the LRU list.

    For LRU pages, we have used putback_lru_page for both freeing and putback
    to the LRU list. It's okay because put_page is aware of the LRU list, so
    if it releases the last refcount of the page, it removes the page from
    the LRU list. However, it performs unnecessary operations (e.g.,
    lru_cache_add, pagevec and flag operations; not significant, but not
    worth doing) and is harder to extend to new non-lru page migration
    because put_page isn't aware of a non-lru page's data structure.

    To solve the problem, we could add a new hook in put_page with a
    PageMovable flag check, but that would increase overhead in the hot path
    and need a new locking scheme to stabilize the flag check against
    put_page.

    So, this patch cleans it up to divide the two semantics (i.e., put and
    putback). If migration is successful, use put_page instead of
    putback_lru_page, and use putback_lru_page only on failure. That makes
    the code more readable and doesn't add overhead to put_page.

    Comment from Vlastimil
    "Yeah, and compaction (perhaps also other migration users) has to drain
    the lru pvec... Getting rid of this stuff is worth even by itself."

    Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim