16 Jan, 2016

40 commits

  • When I test the code below with 12 processes (i.e., 512M * 12 = 6G
    consumed) on my machine (3G RAM + 12 CPUs + 8G swap), madvise_free is
    significantly slower (about 2x) than madvise_dontneed.

    loop = 5;
    mmap(512M);
    while (loop--) {
            memset(512M);
            madvise(MADV_FREE or MADV_DONTNEED);
    }

    The reason is a large number of swap-ins:

    1) dontneed: 1,612 swapin
    2) madvfree: 879,585 swapin

    If the hinted pages are already swapped out when the syscall is called,
    it is pointless to keep the swap entries in the ptes. Instead, let's
    free the cold pages, because a swap-in is more expensive than (page
    allocation + zeroing).

    With this patch, swap-ins drop from 879,585 to 1,878, and elapsed time
    improves accordingly:

    1) dontneed: 6.10user 233.50system 0:50.44elapsed
    2) madvfree: 6.03user 401.17system 1:30.67elapsed
    3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed
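
    The test loop above can be written as a small C function. This is only
    a sketch: the helper name run_madvise_loop() and the fallback definition
    of MADV_FREE (for headers that predate it) are assumptions, not part of
    the patch, and the mapping size is a parameter so the loop can be tried
    with far less than 512M.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: fallback for headers that lack the macro */
#endif

/*
 * Dirty every page of an anonymous mapping, then hint the kernel with
 * `advice` (MADV_FREE or MADV_DONTNEED), `loops` times in a row.
 * Returns 0 on success, -1 on failure with errno set.
 */
int run_madvise_loop(size_t len, int loops, int advice)
{
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret = 0;
	int saved_errno;

	if (buf == MAP_FAILED)
		return -1;

	while (loops--) {
		memset(buf, 0xa5, len);          /* touch and dirty all pages */
		if (madvise(buf, len, advice)) { /* hand them back to the kernel */
			ret = -1;
			break;
		}
	}

	saved_errno = errno;   /* keep madvise()'s errno across munmap() */
	munmap(buf, len);
	errno = saved_errno;
	return ret;
}
```

    Calling run_madvise_loop(512UL << 20, 5, MADV_FREE) from 12 processes
    would approximate the setup described above. On a kernel without
    MADV_FREE the madvise() call should fail with EINVAL.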

    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Mika Penttil
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • For uapi, macros should have the same value on every architecture where
    possible. MADV_FREE was added to mainline only recently, so redefine
    it now.

    At present, '8' is available on all architectures, so redefine
    MADV_FREE to '8'.
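
    User-space code built against headers that predate MADV_FREE can fall
    back to the uniform value. A minimal sketch; the helper name
    madv_free_advice() is hypothetical and only there for illustration:

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>

/* Assumption: older headers may not define MADV_FREE yet; 8 is the
 * uniform value described above. */
#ifndef MADV_FREE
#define MADV_FREE 8
#endif

/* Returns the advice value a caller would pass to madvise(). */
int madv_free_advice(void)
{
	return MADV_FREE;
}
```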

    [sudipm.mukherjee@gmail.com: correct uniform value of MADV_FREE]
    Signed-off-by: Chen Gang
    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Cc: Ralf Baechle
    Cc: Arnd Bergmann
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Roland Dreier
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Daniel Micay
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Most architectures use asm-generic, but alpha, mips, parisc, xtensa need
    their own definitions.

    This patch defines MADV_FREE for them, so it should fix the build
    breakage on those architectures.

    Maybe I should split this up and feed the pieces to the arch
    maintainers, but it is included here for mmotm convenience.

    [gang.chen.5i5j@gmail.com: let MADV_FREE have same value for all architectures]
    Signed-off-by: Minchan Kim
    Acked-by: Max Filippov
    Cc: Wu Fengguang
    Cc: Michael Kerrisk
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Hugh Dickins
    Cc: Jason Evans
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Mika Penttil
    Cc: Rik van Riel
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Signed-off-by: Chen Gang
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Linux has no way to free pages lazily, while other OSes already
    support this through madvise(MADV_FREE).

    The gain is clear: under memory pressure, the kernel can discard freed
    pages rather than swapping them out or triggering the OOM killer.

    Without memory pressure, freed pages are reused by userspace without
    additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When the madvise syscall is called, the VM clears the dirty bit of the
    ptes in the range. If memory pressure happens, the VM checks the dirty
    bit in the page table; if it is still "clean", the page is a "lazyfree
    page" and the VM can discard it instead of swapping it out. If there
    was a store to the page before the VM picks it for reclaim, the dirty
    bit is set, so the VM swaps the page out instead of discarding it.

    One thing to note is that MADV_FREE basically relies on the dirty bit
    in the page table entry to decide whether the VM may discard the page.
    IOW, if the page table entry has the dirty bit set, the VM must not
    discard the page.

    However, if, for example, a swap-in by read fault happens, the page
    table entry doesn't have the dirty bit set, so MADV_FREE could discard
    the page wrongly.

    To avoid the problem, MADV_FREE did additional checks with PageDirty
    and PageSwapCache. This worked because a swapped-in page lives in the
    swap cache, and once it is evicted from the swap cache the page has
    the PG_dirty flag. So checking both page flags effectively prevents
    wrong discarding by MADV_FREE.

    However, a problem with the above logic is that a swapped-in page
    still has PG_dirty after it is removed from the swap cache, so the VM
    cannot consider the page freeable any more, even if madvise_free is
    called in the future.

    See the example below for details.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of the pages are swapped out
    ..
    ..
    var = *ptr; -> a page is swapped in and could be removed from the
                   swapcache. Then the page table doesn't have the
                   dirty bit set, but the page descriptor has PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. This time, the VM cannot discard the page because the page
    .. has *PG_dirty*

    To solve the problem, this patch clears PG_dirty only if the page is
    owned exclusively by the current process when madvise is called.
    PG_dirty represents the dirtiness of the ptes in several processes, so
    we can clear it only if we own the page exclusively.
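
    The interplay of the two dirty indications can be shown with a toy
    user-space model. This is not kernel code: struct toy_page and the
    toy_*() helpers are made up for illustration, and clearing the pte
    dirty bit even for a shared page is a simplification of the model.

```c
#include <stdbool.h>

/* Toy model of one anonymous page as seen by the calling process. */
struct toy_page {
	bool pte_dirty;  /* dirty bit in the caller's page table entry */
	bool pg_dirty;   /* PG_dirty in the page descriptor */
	int  mapcount;   /* number of processes mapping the page */
};

enum toy_reclaim { TOY_DISCARD, TOY_SWAP_OUT };

/* madvise(MADV_FREE): clear the pte dirty bit; clear PG_dirty only
 * when the caller owns the page exclusively. */
void toy_madvise_free(struct toy_page *p)
{
	p->pte_dirty = false;
	if (p->mapcount == 1)
		p->pg_dirty = false;
}

/* A store after the hint sets the pte dirty bit again. */
void toy_store(struct toy_page *p)
{
	p->pte_dirty = true;
}

/* Swap-in by read fault: the pte stays clean, the descriptor ends up
 * with PG_dirty once the page leaves the swap cache. */
void toy_swapin_by_read(struct toy_page *p)
{
	p->pte_dirty = false;
	p->pg_dirty = true;
}

/* Memory pressure: discard only if neither dirty indication is set. */
enum toy_reclaim toy_reclaim_page(const struct toy_page *p)
{
	if (!p->pte_dirty && !p->pg_dirty)
		return TOY_DISCARD;
	return TOY_SWAP_OUT;
}
```

    In this model, a page that was swapped in by a read and then passed to
    toy_madvise_free() is discardable only when mapcount is 1, which is the
    case the patch addresses.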

    At first, the heavy users would be general-purpose allocators (e.g.,
    jemalloc and tcmalloc, and hopefully glibc will support it too);
    jemalloc/tcmalloc already support the feature on other OSes (e.g.,
    FreeBSD).

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11

    ebizzy benchmark (./ebizzy -S 10 -n 512)

    Higher avg is better.

                 vanilla-jemalloc        MADV_free-jemalloc

    1 thread
    records:     10                      10
    avg:         2961.90                 12069.70
    std:         71.96(2.43%)            186.68(1.55%)
    max:         3070.00                 12385.00
    min:         2796.00                 11746.00

    2 thread
    records:     10                      10
    avg:         5020.00                 17827.00
    std:         264.87(5.28%)           358.52(2.01%)
    max:         5244.00                 18760.00
    min:         4251.00                 17382.00

    4 thread
    records:     10                      10
    avg:         8988.80                 27930.80
    std:         1175.33(13.08%)         3317.33(11.88%)
    max:         9508.00                 30879.00
    min:         5477.00                 21024.00

    8 thread
    records:     10                      10
    avg:         13036.50                33739.40
    std:         170.67(1.31%)           5146.22(15.25%)
    max:         13371.00                40572.00
    min:         12785.00                24088.00

    16 thread
    records:     10                      10
    avg:         11092.40                31424.20
    std:         710.60(6.41%)           3763.89(11.98%)
    max:         12446.00                36635.00
    min:         9949.00                 25669.00

    32 thread
    records:     10                      10
    avg:         11067.00                34495.80
    std:         971.06(8.77%)           2721.36(7.89%)
    max:         12010.00                38598.00
    min:         9002.00                 30636.00

    In summary, MADV_FREE is roughly 2.6x to 4.1x faster than MADV_DONTNEED
    in this benchmark (comparing the avg values above).

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttil
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Shaohua Li
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The use of idr_remove() is forbidden in the callback functions of
    idr_for_each(). It is therefore unsafe to call idr_remove() in
    zram_remove().

    This patch moves the call to idr_remove() from zram_remove() to
    hot_remove_store(). In the destroy_devices() path, idrs are removed by
    idr_destroy(). This solves a use-after-free detected by KASan.

    [akpm@linux-foundation.org: fix coding style, per Sergey]
    Signed-off-by: Jerome Marchand
    Acked-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
    code for looking up pte of a (possibly transhuge) page. Move this code
    to a new helper function, page_check_address_transhuge(), and make the
    above mentioned functions use it.

    This is just a cleanup, no functional changes are intended.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • During freeze_page(), we remove the page from rmap. This munlocks the
    page if it was mlocked. clear_page_mlock() uses the lru cache, which
    temporarily pins the page.

    Let's drain the lru cache before checking the page's count vs. mapcount.
    The change makes an mlocked page split on the first attempt, if it was
    not pinned by somebody else.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Writing 1 into 'split_huge_pages' will try to find and split all huge
    pages in the system. This is useful for debugging.
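
    A sketch of triggering this from C. The helper write_one() is
    hypothetical, and the path /sys/kernel/debug/split_huge_pages is an
    assumption (debugfs mounted at /sys/kernel/debug); adjust it to the
    local mount point.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write the single character "1" to the control file at `path`.
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_one(const char *path)
{
	int fd = open(path, O_WRONLY);
	int ret = 0;
	int saved_errno = 0;

	if (fd < 0)
		return -1;

	if (write(fd, "1", 1) != 1) {
		ret = -1;
		saved_errno = errno;   /* remember why the write failed */
	}
	if (close(fd) < 0 && ret == 0) {
		ret = -1;
		saved_errno = errno;
	}
	if (ret < 0)
		errno = saved_errno;
	return ret;
}
```

    For example, write_one("/sys/kernel/debug/split_huge_pages") run as
    root would request the split, under the path assumption above.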

    [akpm@linux-foundation.org: fix printk text, per Vlastimil]
    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Both page_referenced() and page_idle_clear_pte_refs_one() assume that
    THP can only be mapped with PMD, so there's no reason to look at PTEs
    for PageTransHuge() pages. That's not true anymore: THP can be mapped
    with PTEs too.

    The patch removes the PageTransHuge() test from the functions and
    open-codes the page table check.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Kirill A. Shutemov
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Before the THP refcounting rework, a THP was not allowed to cross a VMA
    boundary. So, if we have a THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With the new THP refcounting and a naive approach to mlocking we can
    end up with this scenario:
     1. we have an mlocked THP, which belongs to one VM_LOCKED VMA.
     2. the process does munlock() on *part* of the THP:
          - the VMA is split into two, one of them VM_LOCKED;
          - the huge PMD is split into a PTE table;
          - the THP is still mlocked;
     3. split_huge_page():
          - it transfers PG_mlocked to *all* small pages regardless of
            whether they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we have an accounting issue already at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split, vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit a situation like the one described
    above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch updates Documentation/vm/transhuge.txt to reflect changes in
    THP design.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • All parts of THP with the new refcounting are now in place. We can now
    allow THP to be enabled.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently we don't split a huge page on partial unmap. This is not
    ideal: it can lead to memory overhead.

    Fortunately, we can detect partial unmap in page_remove_rmap(). But we
    cannot call split_huge_page() from there due to the locking context.

    It's also counterproductive to do it directly from the munmap() codepath:
    in many cases we will hit this from exit(2), and splitting the huge page
    just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page(), which puts the huge
    page into a queue for splitting. The splitting itself will happen when
    we get memory pressure, via the shrinker interface. The page will be
    dropped from the list on freeing, through the compound page destructor.
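
    The queueing idea can be illustrated with a toy user-space model. All
    names here (struct toy_huge_page, toy_deferred_split(), toy_shrink(),
    toy_free_huge_page()) are hypothetical, and the fixed-size array stands
    in for the kernel's list; locking is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_MAX 16

struct toy_huge_page {
	bool queued;  /* currently on the deferred-split queue */
	bool split;   /* already split into small pages */
};

struct toy_split_queue {
	struct toy_huge_page *pages[TOY_QUEUE_MAX];
	size_t len;
};

/* Partial unmap detected: queue the page instead of splitting it now. */
bool toy_deferred_split(struct toy_split_queue *q, struct toy_huge_page *p)
{
	if (p->queued || p->split || q->len == TOY_QUEUE_MAX)
		return false;
	q->pages[q->len++] = p;
	p->queued = true;
	return true;
}

/* Page freed before any memory pressure: drop it from the queue. */
void toy_free_huge_page(struct toy_split_queue *q, struct toy_huge_page *p)
{
	for (size_t i = 0; i < q->len; i++) {
		if (q->pages[i] == p) {
			q->pages[i] = q->pages[--q->len]; /* swap-remove */
			p->queued = false;
			return;
		}
	}
}

/* Memory pressure: split up to nr_to_scan queued pages, return the count. */
size_t toy_shrink(struct toy_split_queue *q, size_t nr_to_scan)
{
	size_t done = 0;

	while (q->len > 0 && done < nr_to_scan) {
		struct toy_huge_page *p = q->pages[--q->len];

		p->queued = false;
		p->split = true;
		done++;
	}
	return done;
}
```

    A page that is freed through toy_free_huge_page() before toy_shrink()
    runs is never split, which mirrors the exit(2) case described above.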

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are not able to migrate THPs. It means it's not enough to split only
    the PMD on migration -- we need to split the compound page under it too.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch adds an implementation of split_huge_page() for the new
    refcounting.

    Unlike the previous implementation, the new split_huge_page() can fail
    if somebody holds a GUP pin on the page. It also means that a pin on
    the page prevents it from being split under you. This makes the
    situation in many places much cleaner.

    The basic scheme of split_huge_page():

    - Check that page_count() is equal to the sum of mapcounts of all
      subpages plus one (the caller's pin). Bail out with -EBUSY otherwise.
      This way we can avoid useless PMD-splits.

    - Freeze the page counters by splitting all PMDs and setting up
      migration PTEs.

    - Re-check the sum of mapcounts against page_count(). The page's counts
      are stable now. -EBUSY if the page is pinned.

    - Split the compound page.

    - Unfreeze the page by removing migration entries.
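
    The pin check in the scheme above can be sketched as a toy user-space
    model. struct toy_compound and toy_split_huge_page() are hypothetical
    names; page table work (freeze/unfreeze) is not modelled, only the
    counter comparison.

```c
#include <errno.h>

#define TOY_NR_SUBPAGES 4 /* stand-in for HPAGE_PMD_NR */

struct toy_compound {
	int refcount;                   /* models page_count() of the head */
	int mapcount[TOY_NR_SUBPAGES];  /* models the per-subpage mapcounts */
	int is_split;
};

static int toy_total_mapcount(const struct toy_compound *p)
{
	int sum = 0;

	for (int i = 0; i < TOY_NR_SUBPAGES; i++)
		sum += p->mapcount[i];
	return sum;
}

/*
 * The caller holds one pin. Any reference beyond the mappings and that
 * pin counts as a foreign pin (e.g. GUP) and makes the split fail.
 */
int toy_split_huge_page(struct toy_compound *p)
{
	/* Pre-check before doing any (unmodelled) PMD splitting. */
	if (toy_total_mapcount(p) != p->refcount - 1)
		return -EBUSY;

	/* Freeze step would go here: split PMDs, set up migration PTEs. */

	/*
	 * Re-check with stable counts. In this single-threaded model the
	 * values cannot have changed; in the kernel a pin may have been
	 * taken between the pre-check and the freeze.
	 */
	if (toy_total_mapcount(p) != p->refcount - 1)
		return -EBUSY;

	p->is_split = 1;  /* split the compound page */

	/* Unfreeze step would go here: remove migration entries. */
	return 0;
}
```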

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
    changes in the thp refcounting rules. This patch fixes them up.

    In the new refcounting, we no longer use tail->_mapcount to keep the
    tail's refcount, and thereby we can simplify get/put_hwpoison_page().

    Another change is that the tail's refcount is not transferred to the
    raw page during thp split (more precisely, under the new rules we don't
    take a refcount on the tail page any more). So when we need a thp
    split, we have to transfer the refcount properly to the 4kB
    soft-offlined page before migration.

    The thp split code goes into the core code only when the precheck
    (total_mapcount(head) == page_count(head) - 1) passes, to avoid a
    useless split; here we assume that one refcount is held by the caller
    of the thp split and the others are taken via mapping. To meet this
    assumption, this patch moves the thp split part in soft_offline_page()
    after get_any_page().

    [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • I saw the following BUG_ON triggered in a testcase where a process calls
    madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
    calls migratepages command repeatedly (doing ping-pong among different
    NUMA nodes) for the first process:

    Soft offlining page 0x60000 at 0x700000600000
    __get_any_page: 0x60000 free buddy page
    page:ffffea0001800000 count:0 mapcount:-127 mapping: (null) index:0x1
    flags: 0x1fffc0000000000()
    page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/include/linux/mm.h:342!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
    CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G O 4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
    RIP: 0010:[] [] put_page+0x5c/0x60
    RSP: 0018:ffff88007c213e00 EFLAGS: 00010246
    Call Trace:
    put_hwpoison_page+0x4e/0x80
    soft_offline_page+0x501/0x520
    SyS_madvise+0x6bc/0x6f0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
    RIP [] put_page+0x5c/0x60
    RSP

    The root cause resides in get_any_page(), which retries to get a
    refcount of the page to be soft-offlined. This function calls
    put_hwpoison_page(), expecting that the target page is put back on the
    LRU list. But it can also be freed to buddy. So the second check needs
    to handle that case.

    Fixes: af8fae7c0886 ("mm/memory-failure.c: clean up soft_offline_page()")
    Signed-off-by: Naoya Horiguchi
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • We're going to use migration entries instead of compound_lock() to
    stabilize page refcounts. Setting up and removing migration entries
    requires the page to be locked.

    Some of split_huge_page() callers already have the page locked. Let's
    require everybody to lock the page before calling split_huge_page().

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration PTE entries to stabilize page counts. If
    the page is mapped with PMDs we need to split the PMD and set up
    migration entries. It's reasonable to combine these operations to avoid
    double-scanning over the page table.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The original split_huge_page() combined two operations: splitting PMDs
    into tables of PTEs and splitting the underlying compound page. This
    patch implements split_huge_pmd(), which splits the given PMD without
    splitting other PMDs this page is mapped with or the underlying
    compound page.

    Without tail page refcounting, the implementation of split_huge_pmd()
    is pretty straightforward.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to have THP mapped with PTEs. It will confuse
    numabalancing. Let's skip them for now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's define page_mapped() to be true for compound pages if any
    sub-page of the compound page is mapped (with PMD or PTE).

    On the other hand, page_mapcount() returns the mapcount for this
    particular small page.

    This will make cases like page_get_anon_vma() behave correctly once we
    allow huge pages to be mapped with PTE.

    Most users outside core-mm should use page_mapcount() instead of
    page_mapped().

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we need to track mapcount on a per-small-page basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times this subpage is mapped with PMDs or PTEs combined.
    But this is rather expensive: mapping or unmapping a THP with a PMD
    would require HPAGE_PMD_NR atomic operations instead of the single one
    we have now.

    The idea is to store separately how many times the page was mapped as
    whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track PTE mapcount.

    We use the same approach as with compound page destructor and compound
    order to store compound_mapcount: use space in first tail page,
    ->mapping this time.

    Any time we map/unmap whole compound page (THP or hugetlb) -- we
    increment/decrement compound_mapcount. When we map part of compound
    page with PTE we operate on ->_mapcount of the subpage.

    page_mapcount() counts both: PTE and PMD mappings of the page.

    Basically, we have the mapcount for a subpage spread over two counters.
    This makes it tricky to detect when the last mapcount for a page goes
    away.

    We introduced PageDoubleMap() for this. When we split a THP PMD for the
    first time and there is another PMD mapping left, we offset ->_mapcount
    in all subpages up by one and set PG_double_map on the compound page.
    These additional references go away with the last compound_mapcount.

    This approach provides a way to detect when the last mapcount goes away
    on a per-small-page basis without introducing new overhead for the most
    common cases.
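
    The two-counter scheme can be traced with a toy user-space model. It
    is a sketch under assumptions: struct toy_thp_counts and the toy_*()
    helpers are hypothetical, counters start at 0 here (the kernel's
    atomic counters use a different base), and no atomics or locking are
    modelled.

```c
#include <stdbool.h>

#define TOY_NR 4 /* stand-in for HPAGE_PMD_NR */

struct toy_thp_counts {
	int  compound_mapcount;         /* mappings of the whole page (PMD) */
	int  subpage_mapcount[TOY_NR];  /* models ->_mapcount of each subpage */
	bool double_map;                /* models PG_double_map */
};

/* Map the whole compound page with a PMD. */
void toy_map_pmd(struct toy_thp_counts *p)
{
	p->compound_mapcount++;
}

/* Split one PMD mapping into TOY_NR PTE mappings. */
void toy_split_pmd(struct toy_thp_counts *p)
{
	/* Every subpage gains a PTE mapping in place of the PMD mapping. */
	for (int i = 0; i < TOY_NR; i++)
		p->subpage_mapcount[i]++;

	/* First split while another PMD mapping remains: offset by one. */
	if (p->compound_mapcount > 1 && !p->double_map) {
		p->double_map = true;
		for (int i = 0; i < TOY_NR; i++)
			p->subpage_mapcount[i]++;
	}

	/* Drop the PMD mapping; the offset goes away with the last one. */
	if (--p->compound_mapcount == 0 && p->double_map) {
		p->double_map = false;
		for (int i = 0; i < TOY_NR; i++)
			p->subpage_mapcount[i]--;
	}
}

/* Total mappings of subpage i: PTE and PMD mappings, minus the offset. */
int toy_page_mapcount(const struct toy_thp_counts *p, int i)
{
	int ret = p->subpage_mapcount[i] + p->compound_mapcount;

	if (p->double_map)
		ret--;
	return ret;
}
```

    With two PMD mappings, toy_page_mapcount() reports 2 for every subpage
    before any split, after the first toy_split_pmd() (one PTE plus one
    PMD mapping) and after the second (two PTE mappings).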

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Acked-by: Jerome Marchand
    Cc: Aneesh Kumar K.V
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
    pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI as
    needed for fast_gup.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
    pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI as
    needed for fast_gup.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Cc: Sasha Levin
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
    pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI as
    needed for fast_gup.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
    pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI as
    needed for fast_gup.

    [arnd@arndb.de: fix unterminated ifdef in header file]
    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    pmdp_splitting_flush() is not needed either: on splitting a PMD we will do
    pmdp_clear_flush() + set_pte_at(). pmdp_clear_flush() will send an IPI as
    needed for fast_gup.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to use migration entries to stabilize page counts. It
    means we don't need compound_lock() for that.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't need special code to stabilize THP. If you hold a reference to
    any subpage of a THP, it will not be split under you.

    The new split_huge_page() also accepts tail pages: there is no need for
    special code to get a reference to the head page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new THP refcounting, we don't need tricks to stabilize a huge
    page. If we hold a reference to a tail page, it can't be split under us.

    This patch effectively reverts a5b338f2b0b1 ("thp: update futex compound
    knowledge").

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Tested-by: Artem Savkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tail page refcounting is utterly complicated and painful to support.

    It uses ->_mapcount on tail pages to store how many times the page is
    pinned. get_page() bumps ->_mapcount on the tail page in addition to
    ->_count on the head. This information is required by split_huge_page()
    to be able to distribute pins from the head of the compound page to the
    tails during the split.

    We will need ->_mapcount to account PTE mappings of subpages of the
    compound page. We eliminate the need for the current meaning of
    ->_mapcount in tail pages by forbidding the split entirely if the page
    is pinned.

    The only user of tail page refcounting is THP which is marked BROKEN for
    now.

    Let's drop all this mess. It makes get_page() and put_page() much
    simpler.
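
    The difference between the two schemes can be sketched in user space.
    This is a toy model, not kernel code: struct toy_page, its fields,
    NR_SUBPAGES and the old_*/new_* helpers are all hypothetical names, and
    the split check is reduced to "refuse while any reference beyond the
    owner's exists", which is simpler than what the kernel really checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_SUBPAGES 4 /* toy value; a real THP has far more subpages */

/* Hypothetical stand-in for struct page; not the kernel layout. */
struct toy_page {
    int count;             /* models ->_count, meaningful on the head only */
    int pins;              /* old scheme: per-tail pins kept in ->_mapcount */
    struct toy_page *head; /* head of the compound page (self for the head) */
};

/* Set up one compound page whose only reference is the owner's. */
static void toy_init(struct toy_page *pages)
{
    for (size_t i = 0; i < NR_SUBPAGES; i++) {
        pages[i].count = 0;
        pages[i].pins = 0;
        pages[i].head = &pages[0];
    }
    pages[0].count = 1;
}

/* Old scheme: a pin on a tail bumps the head count AND the tail's own
 * counter, so that a later split can hand the pins out to the tails. */
static void old_get_page(struct toy_page *page)
{
    page->head->count++;
    if (page != page->head)
        page->pins++;
}

/* New scheme: one counter on the head, nothing to track per tail. */
static void new_get_page(struct toy_page *page)
{
    page->head->count++;
}

static void new_put_page(struct toy_page *page)
{
    page->head->count--;
}

/* New scheme: the split is simply refused while the page is pinned
 * (assumption of this model: "pinned" means count above the owner's 1). */
static int new_split(const struct toy_page *head)
{
    return head->count > 1 ? -EBUSY : 0;
}
```

    In this model a pinned page makes new_split() return -EBUSY, and the
    split succeeds again once the pin is dropped with new_put_page(). No
    per-tail bookkeeping is left, which is what frees ->_mapcount for
    another use.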

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We will re-introduce a new version with the new refcounting later in the
    patchset.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Up to this point we tried to keep the patchset bisectable, but the next
    patches are going to change how the core of THP refcounting works.

    It would be beneficial to split the change into several patches and make
    it more reviewable. Unfortunately, I don't see how we can achieve that
    while keeping THP working.

    Let's hide THP under CONFIG_BROKEN for now and bring it back once the new
    refcounting is established.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
    THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
    are going to be able to split a PMD without splitting the compound page
    and that split_huge_page() can fail.
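
    From user space, the three events can be read as counters. The sketch
    below assumes they show up in /proc/vmstat as lowercase "name value"
    lines (thp_split_page, thp_split_page_failed, thp_split_pmd); the struct
    and function names are hypothetical, and the parser works on a text
    buffer so it does not depend on a particular kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three split counters. */
struct thp_split_stats {
    unsigned long long split_page;
    unsigned long long split_page_failed;
    unsigned long long split_pmd;
};

/* Parse vmstat-style "name value" lines from text.
 * Returns how many of the three counters were found (0..3). */
static int parse_thp_split_stats(const char *text, struct thp_split_stats *out)
{
    int found = 0;
    const char *line = text;

    memset(out, 0, sizeof(*out));
    while (*line) {
        size_t len = strcspn(line, "\n");
        char buf[128], name[64];
        unsigned long long value;
        size_t n = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;

        /* Copy one line so sscanf() cannot run into the next one. */
        memcpy(buf, line, n);
        buf[n] = '\0';

        if (sscanf(buf, "%63s %llu", name, &value) == 2) {
            if (strcmp(name, "thp_split_page") == 0) {
                out->split_page = value;
                found++;
            } else if (strcmp(name, "thp_split_page_failed") == 0) {
                out->split_page_failed = value;
                found++;
            } else if (strcmp(name, "thp_split_pmd") == 0) {
                out->split_pmd = value;
                found++;
            }
        }

        line += len;
        if (*line == '\n')
            line++;
    }
    return found;
}
```

    A caller would read /proc/vmstat into a buffer and pass it in. Keeping
    split_page_failed separate from split_page is what makes failed
    split_huge_page() attempts visible.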

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Christoph Lameter
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting a THP PMD from splitting the
    underlying compound page.

    This patch renames the split_huge_page_pmd*() functions to
    split_huge_pmd*() to reflect the fact that they split only the PMD and
    do not imply splitting the page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov