12 Jan, 2017

1 commit

  • commit 6afcf8ef0ca0a69d014f8edb613d94821f0ae700 upstream.

    Since commit bda807d44454 ("mm: migrate: support non-lru movable page
    migration"), isolate_migratepages_block() can isolate !PageLRU pages,
    which acct_isolated() then accounts as NR_ISOLATED_*. Accounting these
    non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense and it
    can mislead heuristics based on those counters, such as
    pgdat_reclaimable_pages and too_many_isolated, which would lead to
    unexpected stalls during direct reclaim without any good reason. Note
    that __alloc_contig_migrate_range can isolate a lot of pages at once.

    On mobile devices such as a 512MB RAM Android phone, a large zram swap
    may be in use. In some cases zram (zsmalloc) uses many non-lru but
    migratable pages, for example:

    MemTotal: 468148 kB
    Normal free:5620kB
    Free swap:4736kB
    Total swap:409596kB
    ZRAM: 164616kB(zsmalloc non-lru pages)
    active_anon:60700kB
    inactive_anon:60744kB
    active_file:34420kB
    inactive_file:37532kB

    Fix this by only accounting lru pages to NR_ISOLATED_* in
    isolate_migratepages_block right after they were isolated, while we
    still know they were on the LRU. Drop acct_isolated because it is
    called after the fact and we've lost that information. Batching the
    per-cpu counter doesn't bring much improvement anyway. Also make sure
    that we uncharge only LRU pages when putting them back on the LRU in
    putback_movable_pages and, likewise, when unmap_and_move migrates the
    page.
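
    As a rough sketch (illustrative only, not the literal diff; helpers such
    as mod_node_page_state(), page_is_file_cache() and hpage_nr_pages() are
    assumed from the mm code of that era), the accounting now happens only
    on the LRU isolation path:

        /*
         * Sketch: after a page has been successfully isolated from the LRU
         * (non-lru movable pages take a different path and are not counted),
         * account it while we still know it was an LRU page.
         */
        mod_node_page_state(page_pgdat(page),
                            NR_ISOLATED_ANON + page_is_file_cache(page),
                            hpage_nr_pages(page));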

    [mhocko@suse.com: replace acct_isolated() with direct counting]
    Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration")
    Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
    Signed-off-by: Ming Ling
    Signed-off-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ming Ling
     

08 Oct, 2016

1 commit

  • vma->vm_page_prot is read locklessly from the rmap_walk; it may be
    updated concurrently, and this change prevents the risk of reading
    intermediate values.
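
    As a hedged sketch of that pattern (assuming the usual READ_ONCE/
    WRITE_ONCE annotations; not necessarily the exact code of this patch):

        /* writer: publish the new protection with a single store */
        WRITE_ONCE(vma->vm_page_prot, newprot);

        /* lockless reader, e.g. on the rmap_walk path: a single load,
         * so no intermediate/torn value can be observed */
        pgprot_t prot = READ_ONCE(vma->vm_page_prot);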

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Jul, 2016

6 commits

  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).
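
    A paraphrased sketch of the resulting mask selection for THP page faults
    (the helper name below is invented for illustration; the real logic
    lives in alloc_hugepage_direct_gfpmask()):

        /* illustrative only: madvised vmas try as hard as khugepaged,
         * others either reclaim once with __GFP_NORETRY or not at all */
        static gfp_t thp_fault_gfpmask(struct vm_area_struct *vma, bool defrag)
        {
                if (vma->vm_flags & VM_HUGEPAGE)
                        return GFP_TRANSHUGE;
                if (defrag)
                        return GFP_TRANSHUGE | __GFP_NORETRY;
                return GFP_TRANSHUGE_LIGHT;
        }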

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before considering
    the OOM killer.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages.
    As those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats, as
    that is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing so this patch moves the relevant file-based accounting. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_PAGES is the number of mapped anon pages.

    This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
    NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we
    have

    NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_MAPPED is the number of mapped anon pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    node level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

5 commits

  • With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are met:

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment;
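
    Paraphrased, the check reads roughly like the sketch below (not the
    literal fault-path code; fe is the fault-handling context of this
    series, and suitable_size_and_alignment() is a stand-in for the real
    vma checks):

        /* paraphrased: only install a huge mapping if all criteria hold */
        if (PageTransCompound(page) &&                  /* page is compound        */
            pmd_none(*fe->pmd) &&                       /* no pmd entry installed  */
            suitable_size_and_alignment(vma, address))  /* vma fits a huge mapping */
                return do_set_pmd(fe, page);
        /* otherwise fall through and map the page with a regular pte */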

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimization later.

    The PG_double_map optimization doesn't work for file pages since the
    lifecycle of file pages is different compared to anon pages: a file page
    can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Now that the VM has a feature to migrate non-lru movable pages, the
    balloon doesn't need custom migration hooks in migrate.c and
    compaction.c.

    Instead, this patch implements the page->mapping->a_ops->
    {isolate|migrate|putback} functions.

    With that, we can remove the hooks for ballooning from the general
    migration functions and make balloon compaction simple.

    [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
    Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rafael Aquini
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have allowed migration for only LRU pages until now and it was enough
    to make high-order pages. But recently, embedded systems (e.g., webOS,
    Android) use lots of non-movable pages (e.g., zram, GPU memory) so we
    have seen several reports about trouble with small high-order
    allocations. To fix the problem, there were several efforts (e.g.,
    enhancing the compaction algorithm, SLUB fallback to 0-order pages,
    reserved memory, vmalloc and so on) but if there are lots of non-movable
    pages in the system, those solutions are void in the long run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For the feature, this patch introduces migration-related
    functions in address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define three
    functions, which are function pointers of struct
    address_space_operations.

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects from a driver's isolate_page function is to return
    *true* if the driver isolates the page successfully. On returning true,
    the VM marks the page as PG_isolated so concurrent isolation on several
    CPUs skips the page. If a driver cannot isolate the page, it should
    return *false*.

    Once a page is successfully isolated, the VM uses the page.lru fields,
    so the driver shouldn't expect the values in those fields to be
    preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the isolated
    page. The job of migratepage is to move the content of the old page to
    the new page and set up the fields of struct page newpage. Keep in mind
    that you should indicate to the VM that the oldpage is no longer movable
    via __ClearPageMovable() under page_lock if you migrated the oldpage
    successfully and return 0. If the driver cannot migrate the page at the
    moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page
    migration after a short time because the VM interprets -EAGAIN as a
    "temporary migration failure". On returning any error except -EAGAIN,
    the VM will give up the page migration without retrying this time.

    The driver shouldn't touch the page.lru field, which the VM uses, in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration fails on an isolated page, the VM should return the
    isolated page to the driver, so the VM calls the driver's putback_page
    with the page whose migration failed. In this function, the driver
    should put the isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below to make a page movable under
    page_lock:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument for registering the migration family
    of functions which will be called by the VM. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses the
    lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping, which masks off the low two bits of
    page->mapping so it can get the right struct address_space.

    For testing for a non-lru movable page, the VM provides the
    __PageMovable function. However, it doesn't guarantee identifying a
    non-lru movable page because the page->mapping field is unified with
    other variables in struct page. Also, if the driver releases the page
    after isolation by the VM, page->mapping doesn't have a stable value
    although it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-lru movable once the page has been isolated, because LRU pages can
    never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good for
    just peeking to test for non-lru movable pages before the more expensive
    check with lock_page during pfn scanning to select a victim.

    To guarantee a non-lru movable page, the VM provides the PageMovable
    function. Unlike __PageMovable, PageMovable validates page->mapping and
    mapping->a_ops->isolate_page under lock_page. The lock_page prevents
    sudden destruction of page->mapping.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page. So if a CPU encounters a
    PG_isolated non-lru movable page, it can skip it. The driver doesn't
    need to manipulate the flag because the VM will set/clear it
    automatically. Keep in mind that if the driver sees a PG_isolated page,
    it means the page has been isolated by the VM, so it shouldn't touch the
    page.lru field. PG_isolated is aliased with the PG_reclaim flag, so the
    driver shouldn't use the flag for its own purpose.
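
    Putting the pieces above together, a minimal, hypothetical driver
    skeleton using this interface might look as follows (the mydrv_* names
    are invented for illustration; the driver's own bookkeeping and error
    handling are elided):

        #include <linux/fs.h>
        #include <linux/migrate.h>
        #include <linux/pagemap.h>

        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from the driver's own lists; true on success */
                return true;
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *oldpage,
                                     enum migrate_mode mode)
        {
                /* copy oldpage's contents to newpage and fix up driver metadata,
                 * then tell the VM the old page is no longer movable */
                __ClearPageMovable(oldpage);
                return 0;               /* or -EAGAIN for a temporary failure */
        }

        static void mydrv_putback_page(struct page *page)
        {
                /* migration failed: put the page back into the driver's structures */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page = mydrv_isolate_page,
                .migratepage  = mydrv_migratepage,
                .putback_page = mydrv_putback_page,
        };

        /* called by the driver on a page it wants the VM to treat as movable;
         * mapping->a_ops is assumed to point at mydrv_aops */
        static void mydrv_mark_movable(struct page *page,
                                       struct address_space *mapping)
        {
                lock_page(page);
                __SetPageMovable(page, mapping);
                unlock_page(page);
        }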

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Recently, I got many reports about performance degradation in embedded
    systems (Android mobile phones, webOS TVs and so on) and easy fork
    failures.

    The problem was mainly fragmentation caused by zram and GPU drivers.
    Under memory pressure, their pages were spread out over all pageblocks
    and could not be migrated with the current compaction algorithm, which
    supports only LRU pages. In the end, compaction cannot work well so the
    reclaimer shrinks all of the working set pages. It made the system very
    slow and even made it easy to fail to fork, which requires order-[2 or 3]
    allocations.

    The other pain point is that they cannot use CMA memory space, so when
    an OOM kill happens, I can see many free pages in the CMA area, which is
    not memory efficient. In our product, which has a big CMA area, it
    reclaims zones too excessively to allocate GPU and zram pages although
    there is lots of free space in CMA, so the system easily becomes very
    slow.

    To solve these problems, this patch tries to add a facility to migrate
    non-lru pages by introducing new functions and page flags to help
    migration.

    struct address_space_operations {
    ..
    ..
    bool (*isolate_page)(struct page *, isolate_mode_t);
    void (*putback_page)(struct page *);
    ..
    }

    new page flags

    PG_movable
    PG_isolated

    For details, please read description in "mm: migrate: support non-lru
    movable page migration".

    Originally, Gioh Kim had tried to support this feature but he moved on,
    so I took over the work. I took much code from his work and changed it
    a little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves
    much credit, too.

    And I should mention Chulmin, who has tested this patchset heavily so
    that I could find many bugs from him. :)

    Thanks, Gioh, Konstantin and Chulmin!

    This patchset consists of five parts.

    1. clean up migration
    mm: use put_page to free page instead of putback_lru_page

    2. add non-lru page migration feature
    mm: migrate: support non-lru movable page migration

    3. rework KVM memory-ballooning
    mm: balloon: use general non-lru movable page feature

    4. zsmalloc refactoring for preparing page migration
    zsmalloc: keep max_object in size_class
    zsmalloc: use bit_spin_lock
    zsmalloc: use accessor
    zsmalloc: factor page chain functionality out
    zsmalloc: introduce zspage structure
    zsmalloc: separate free_zspage from putback_zspage
    zsmalloc: use freeobj for index

    5. zsmalloc page migration
    zsmalloc: page migration support
    zram: use __GFP_MOVABLE for memory allocation

    This patch (of 12):

    The procedure of page migration is as follows:

    First of all, it should isolate a page from the LRU and try to migrate
    the page. If it is successful, it releases the page for freeing.
    Otherwise, it should put the page back on the LRU list.

    For LRU pages, we have used putback_lru_page for both freeing and
    putting back to the LRU list. That's okay because put_page is aware of
    the LRU list, so if it releases the last refcount of the page, it
    removes the page from the LRU list. However, it performs unnecessary
    operations (e.g., lru_cache_add, pagevec and flag operations; it would
    not be significant but is not worth doing) and makes it harder to
    support new non-lru page migration because put_page isn't aware of a
    non-lru page's data structure.

    To solve the problem, we could add a new hook in put_page with a
    PageMovable flag check, but that would increase overhead in a hot path
    and needs a new locking scheme to stabilize the flag check against
    put_page.

    So, this patch cleans it up to divide the two semantics (i.e., put and
    putback). If migration is successful, use put_page instead of
    putback_lru_page and use putback_lru_page only on failure. That makes
    the code more readable and doesn't add overhead to put_page.
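
    As a minimal sketch, the resulting pattern at the call site is simply
    (paraphrased, not the literal diff):

        if (rc == MIGRATEPAGE_SUCCESS)
                put_page(page);         /* migration done: just drop our reference */
        else
                putback_lru_page(page); /* migration failed: return the page to the LRU */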

    Comment from Vlastimil
    "Yeah, and compaction (perhaps also other migration users) has to drain
    the lru pvec... Getting rid of this stuff is worth even by itself."

    Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

21 May, 2016

1 commit

  • If page migration fails due to -ENOMEM, nr_failed should still be
    incremented for proper statistics.

    This was encountered recently when all page migration vmstats showed 0,
    and it was inferred that migrate_pages() was never called, although in
    the first page migration failed because compaction_alloc() failed to
    find a migration target.

    This patch increments nr_failed so the vmstat is properly accounted on
    ENOMEM.
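
    Schematically, the result-handling switch in migrate_pages() now counts
    the -ENOMEM case as well (paraphrased sketch):

        switch (rc) {
        case -ENOMEM:
                nr_failed++;            /* account the failure before bailing out */
                goto out;
        case -EAGAIN:
                retry++;
                break;
        case MIGRATEPAGE_SUCCESS:
                nr_succeeded++;
                break;
        default:
                nr_failed++;
                break;
        }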

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 May, 2016

1 commit

  • v3.16 commit 07a427884348 ("mm: shmem: avoid atomic operation during
    shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
    by __SetPageSwapBacked, pointing out that the newly allocated page is
    not yet visible to other users (except speculative get_page_unless_zero-
    ers, who may not update page flags before their further checks).

    That was part of a series in which Mel was focused on tmpfs profiles:
    but almost all SetPageSwapBacked uses can be so optimized, with the same
    justification.

    Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
    it's not an error to free a page with PG_swapbacked set.

    Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
    doing it differently in different places; but that's for tidiness - if
    the ordering actually mattered, we should not be using the __variants.

    There's probably scope for further __SetPageFlags in other places, but
    SwapBacked is the one I'm interested in at the moment.
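
    As a small illustration of the pattern (the function name is
    hypothetical): a freshly allocated page is not yet visible to anyone
    else, so the non-atomic __SetPage* variants are sufficient:

        static struct page *alloc_swapbacked_page(gfp_t gfp_mask)
        {
                struct page *page = alloc_page(gfp_mask);

                if (page) {
                        __SetPageLocked(page);      /* non-atomic: no other user yet */
                        __SetPageSwapBacked(page);  /* same justification as above   */
                }
                return page;
        }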

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Reviewed-by: Mel Gorman
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

29 Apr, 2016

1 commit

  • Currently, the migration code increases num_poisoned_pages on a *failed*
    migration page as well as on a successfully migrated one at the trial of
    memory-failure. That makes the stat wrong. As well, it marks the page
    as PG_HWPoison even if the migration trial failed. That would mean we
    cannot recover the corrupted page using the memory-failure facility.

    This patch fixes it.

    Signed-off-by: Minchan Kim
    Reported-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Mar, 2016

3 commits

  • Make remove_migration_ptes() available to be used in split_huge_page().

    A new parameter 'locked' is added: as with try_to_unmap() we need a way
    to indicate that the caller holds the rmap lock.

    We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
    pages are never mlocked.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor in that is the page reference count. Until
    now, the page reference has been manipulated by directly calling atomic
    functions, so we cannot track who manipulates it and where. That makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions. This is a
    preparation step to add a tracepoint to each page reference manipulation
    function. With this facility, we can easily find the reason for a CMA
    allocation failure. There is no functional change in this patch.

    In addition, this patch also converts reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to directly access it (suggested by Andrew).
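
    For example, a direct atomic access to the count becomes a call to one
    of the wrappers introduced here (sketch; free_the_page() is a
    hypothetical stand-in for whatever the caller does when the count hits
    zero):

        /* before: open-coded atomics on page->_count */
        atomic_inc(&page->_count);
        if (atomic_dec_and_test(&page->_count))
                free_the_page(page);

        /* after: the wrappers, which can later carry tracepoints */
        page_ref_inc(page);
        if (page_ref_dec_and_test(page))
                free_the_page(page);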

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We remove one instance of flush_tlb_range here. That was added by
    commit f714f4f20e59 ("mm: numa: call MMU notifiers on THP migration").
    But pmdp_huge_clear_flush_notify should have done the required flush for
    us. Hence remove the extra flush.

    Signed-off-by: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

16 Mar, 2016

4 commits

  • Rather than scattering mem_cgroup_migrate() calls all over the place,
    have a single call from a safe place where every migration operation
    eventually ends up in - migrate_page_copy().

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • During migration, page_owner info is now copied with the rest of the
    page, so the stacktrace leading to free page allocation during migration
    is overwritten. For debugging purposes, it might be however useful to
    know that the page has been migrated since its initial allocation. This
    might happen many times during the lifetime for different reasons and
    fully tracking this, especially with stacktraces would incur extra
    memory costs. As a compromise, store and print the migrate_reason of
    the last migration that occurred to the page. This is enough to
    distinguish compaction, numa balancing etc.

    Example page_owner entry after the patch:

    Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
    PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
    [] __alloc_pages_nodemask+0x134/0x230
    [] alloc_pages_vma+0xb5/0x250
    [] shmem_alloc_page+0x61/0x90
    [] shmem_getpage_gfp+0x678/0x960
    [] shmem_fallocate+0x329/0x440
    [] vfs_fallocate+0x140/0x230
    [] SyS_fallocate+0x44/0x70
    [] entry_SYSCALL_64_fastpath+0x12/0x71
    Page has been migrated, last migrate reason: compaction

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The page_owner mechanism stores gfp_flags of an allocation and stack
    trace that lead to it. During page migration, the original information
    is practically replaced by the allocation of free page as the migration
    target. Arguably this is less useful and might lead to all the
    page_owner info for migratable pages gradually converging towards
    compaction or numa balancing migrations. It has also led to
    inaccuracies such as one fixed by commit e2cfc91120fa ("mm/page_owner:
    set correct gfp_mask on page_owner").

    This patch thus introduces copying the page_owner info during migration.
    However, since the fact that the page has been migrated from its
    original place might be useful for debugging, the next patch will
    introduce a way to track that information as well.

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

28 Feb, 2016

1 commit

  • Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
    flag combination due to confusing semantics. It noted that
    alloc_misplaced_dst_page() was one such user after changes made by
    commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").

    Unfortunately when GFP_THISNODE was removed, users of
    alloc_misplaced_dst_page() started waking kswapd and entering direct
    reclaim because the wrong GFP flags are cleared. The consequence is
    that workloads that used to fit into memory now get reclaimed which is
    addressed by this patch.

    The problem can be demonstrated with "mutilate" that exercises memcached
    which is software dedicated to memory object caching. The configuration
    uses 80% of memory and is run 3 times for varying numbers of clients.
    The results on a 4-socket NUMA box are

    mutilate
                                  4.4.0                 4.4.0
                                vanilla           numaswap-v1
    Hmean    1      8394.71 (  0.00%)      8395.32 (  0.01%)
    Hmean    4     30024.62 (  0.00%)     34513.54 ( 14.95%)
    Hmean    7     32821.08 (  0.00%)     70542.96 (114.93%)
    Hmean    12    55229.67 (  0.00%)     93866.34 ( 69.96%)
    Hmean    21    39438.96 (  0.00%)     85749.21 (117.42%)
    Hmean    30    37796.10 (  0.00%)     50231.49 ( 32.90%)
    Hmean    47    18070.91 (  0.00%)     38530.13 (113.22%)

    The metric is queries/second, with higher being better. The results are
    way outside of the noise and the reason for the improvement is obvious
    from some of the vmstats:

                                     4.4.0           4.4.0
                                   vanilla   numaswap-v1r1
    Minor Faults              1929399272      2146148218
    Major Faults                19746529            3567
    Swap Ins                    57307366            9913
    Swap Outs                   50623229           17094
    Allocation stalls              35909             443
    DMA allocs                         0               0
    DMA32 allocs                72976349       170567396
    Normal allocs             5306640898      5310651252
    Movable allocs                     0               0
    Direct pages scanned       404130893          799577
    Kswapd pages scanned       160230174               0
    Kswapd pages reclaimed      55928786               0
    Direct pages reclaimed       1843936           41921
    Page writes file                2391               0
    Page writes anon            50623229           17094

    The vanilla kernel is swapping like crazy with large amounts of direct
    reclaim and kswapd activity. The figures are aggregate but it's known
    that the bad activity is throughout the entire test.

    Note that simple streaming anon/file memory consumers also see this
    problem but it's not as obvious. In those cases, kswapd is awake when
    it should not be.

    As there are at least two reclaim-related bugs out there, it's worth
    spelling out the user-visible impact. This patch only addresses bugs
    related to excessive reclaim on NUMA hardware when the working set is
    larger than a NUMA node. There is a bug related to high kswapd CPU
    usage but the reports are against laptops and other UMA hardware and it
    is not addressed by this patch.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Jan, 2016

5 commits

  • Currently we don't split huge page on partial unmap. It's not an ideal
    situation. It can lead to memory overhead.

    Fortunately, we can detect partial unmap on page_remove_rmap(). But we
    cannot call split_huge_page() from there due to locking context.

    It's also counterproductive to do directly from munmap() codepath: in
    many cases we will hit this from exit(2) and splitting the huge page
    just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page() which puts the huge page
    into a queue for splitting. The splitting itself will happen when we get
    memory pressure via the shrinker interface. The page will be dropped
    from the list on freeing through the compound page destructor.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to use migration entries instead of compound_lock() to
    stabilize page refcounts. Setting up and removing migration entries
    requires the page to be locked.

    Some of split_huge_page() callers already have the page locked. Let's
    require everybody to lock the page before calling split_huge_page().

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we need to track the mapcount on a per small page basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times this subpage is mapped with PMDs or PTEs combined.
    But this is rather expensive: mapping or unmapping of a THP page with a
    PMD would require HPAGE_PMD_NR atomic operations instead of the single
    one we have now.

    The idea is to store separately how many times the page was mapped as
    whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track PTE mapcount.

    We use the same approach as with compound page destructor and compound
    order to store compound_mapcount: use space in first tail page,
    ->mapping this time.

    Any time we map/unmap whole compound page (THP or hugetlb) -- we
    increment/decrement compound_mapcount. When we map part of compound
    page with PTE we operate on ->_mapcount of the subpage.

    page_mapcount() counts both: PTE and PMD mappings of the page.

    Basically, we have the mapcount for a subpage spread over two counters.
    That makes it tricky to detect when the last mapcount for a page goes
    away.

    We introduced PageDoubleMap() for this. When we split a THP PMD for the
    first time and there's another PMD mapping left, we offset ->_mapcount
    in all subpages by one and set PG_double_map on the compound page.
    These additional references go away with the last compound_mapcount.

    This approach provides a way to detect when last mapcount goes away on
    per small page basis without introducing new overhead for most common
    cases.
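
    As a simplified paraphrase (ignoring the PG_double_map bookkeeping; the
    helper name is invented), a subpage's total mapcount combines the two
    counters like this:

        static inline int subpage_total_mapcount(struct page *page)
        {
                int ret = atomic_read(&page->_mapcount) + 1;    /* PTE mappings */

                if (PageCompound(page))
                        ret += compound_mapcount(compound_head(page)); /* PMD mappings */
                return ret;
        }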

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. It means we cannot rely on the PageTransHuge() check to decide
    whether to map/unmap a small page or a THP.

    The patch adds a new argument to the rmap functions to indicate whether
    we want to operate on the whole compound page or only the small page.

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing tail pages
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Nov, 2015

2 commits

  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity, clearing
    it prevents it.
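
    Concretely (sketch matching the description above), the new name covers
    both reclaim bits, and a caller that must not sleep simply masks off the
    direct-reclaim bit while leaving the kswapd bit alone:

        #define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

        /* must not sleep, but kswapd should still be woken: */
        gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;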

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
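
    For instance, a caller checking whether it may block should now use the
    gfpflags_allow_blocking() helper mentioned above rather than testing
    __GFP_WAIT directly (sketch; pool->lock is a hypothetical lock):

        if (gfpflags_allow_blocking(gfp_mask)) {
                mutex_lock(&pool->lock);            /* sleeping is allowed */
        } else {
                if (!mutex_trylock(&pool->lock))    /* non-blocking caller */
                        return NULL;
        }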

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

8 commits

  • clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
    since v2.6.16 first introduced page migration; and the set_page_dirty()
    which completed its migration of PageDirty, later had to be moderated to
    __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.

    No actual problems seen with this procedure recently, but if you look into
    what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
    achieving, it turns out to be nothing more than moving the PageDirty flag,
    and its NR_FILE_DIRTY stat from one zone to another.

    It would be good to avoid a pile of irrelevant decrementations and
    incrementations, and improper event counting, and unnecessary descent of
    the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
    radix_tree_replace_slot() left in place anyway).

    Do the NR_FILE_DIRTY movement, like the other stats movements, while
    interrupts still disabled in migrate_page_move_mapping(); and don't even
    bother if the zone is the same. Do the PageDirty movement there under
    tree_lock too, where old page is frozen and newpage not yet visible:
    bearing in mind that as soon as newpage becomes visible in radix_tree, an
    un-page-locked set_page_dirty() might interfere (or perhaps that's just
    not possible: anything doing so should already hold an additional
    reference to the old page, preventing its migration; but play safe).

    But we do still need to transfer PageDirty in migrate_page_copy(), for
    those who don't go the mapping route through migrate_page_move_mapping().

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have had trouble in the past from the way in which page migration's
    newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
    fix direct reclaim writeback regression") which proposed a cleanup.

    We have no actual problem now, but I think the procedure would be clearer
    (and alternative get_new_page pools safer to implement) if we assert that
    newpage is not touched until we are sure that it's going to be used -
    except for taking the trylock on it in __unmap_and_move().

    So shift the early initializations from move_to_new_page() into
    migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
    migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
    deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

    Adjust stages 3 to 8 in the Documentation file accordingly.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __unmap_and_move() contains a long stale comment on page_get_anon_vma()
    and PageSwapCache(), with an odd control flow that's hard to follow.
    Mostly this reflects our confusion about the lifetime of an anon_vma, in
    the early days of page migration, before we could take a reference to one.
    Nowadays this seems quite straightforward: cut it all down to essentials.

    I cannot see the relevance of swapcache here at all, so don't treat it any
    differently: I believe the old comment reflects in part our anon_vma
    confusions, and in part the original v2.6.16 page migration technique,
    which used actual swap to migrate anon instead of swap-like migration
    entries. Why should a swapcache page not be migrated with the aid of
    migration entry ptes like everything else? So lose that comment now, and
    enable migration entries for swapcache in the next patch.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little more by calling remove_migration_ptes()
    from the same level, on success or on failure, from __unmap_and_move() or
    from unmap_and_move_huge_page().

    Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
    leave that to when the page is freed. Except for here in page migration,
    it has been an invariant that a PageAnon (bit set in page->mapping) page
    stays PageAnon until it is freed, and I think we're safer to keep to that.

    And with the above rearrangement, it's necessary because zap_pte_range()
    wants to identify whether a migration entry represents a file or an anon
    page, to update the appropriate rss stats without waiting on it.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little by moving the trylock of newpage from
    move_to_new_page() into __unmap_and_move(), where the old page has been
    locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
    accordingly.

    But make one kind-of-functional change on the way: whereas trylock of
    newpage used to BUG() if it failed, now simply return -EAGAIN if so.
    Cutting out BUG()s is good, right? But, to be honest, this is really to
    extend the usefulness of the custom put_new_page feature, allowing a pool
    of new pages to be shared perhaps with racing uses.

    Use an "else" instead of that "skip_unmap" label.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I don't know of any problem from the way it's used in our current tree,
    but there is one defect in page migration's custom put_new_page feature.

    An unused newpage is expected to be released with the put_new_page(), but
    there was one MIGRATEPAGE_SUCCESS (0) path which released it with
    putback_lru_page(): which can be very wrong for a custom pool.

    Fixed more easily by resetting put_new_page once it won't be needed, than
    by adding a further flag to modify the rc test.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's migrate.c not migration,c, and nowadays putback_movable_pages() not
    putback_lru_pages().

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
    mem_cgroup_migrate() doesn't have much to offer in page migration: convert
    migrate_misplaced_transhuge_page() to set_page_memcg() instead.

    Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
    remaining callers are replace_page_cache_page() and shmem_replace_page():
    both of whom passed lrucare true, so just eliminate that argument.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins