18 Mar, 2016

3 commits

  • Make remove_migration_ptes() available to be used in split_huge_page().

    A new parameter 'locked' is added: as with try_to_unmap(), we need a way
    to indicate that the caller holds the rmap lock.

    We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
    pages are never mlocked.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
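
    For reference, a minimal sketch of the resulting helper, assuming the
    rmap_walk_locked() variant added alongside this change; this is a
    simplified rendering, not the exact mm/migrate.c code:

        void remove_migration_ptes(struct page *old, struct page *new, bool locked)
        {
                struct rmap_walk_control rwc = {
                        .rmap_one = remove_migration_pte,  /* restore one migration PTE */
                        .arg = old,
                };

                if (locked)
                        rmap_walk_locked(new, &rwc);       /* caller already holds rmap lock */
                else
                        rmap_walk(new, &rwc);
        }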
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor in that is the page reference count. Until
    now, page references have been manipulated by directly calling atomic
    functions, so we cannot track who manipulates them and where. That makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions (sketched after this
    entry). This is a preparation step for adding a tracepoint to each page
    reference manipulation function. With this facility, we can easily find
    the reason for a CMA allocation failure. There is no functional change
    in this patch.

    In addition, this patch also converts the reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
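
    A minimal sketch of the wrapper approach described above, along the lines
    of the helpers added in include/linux/page_ref.h for the then-current
    page->_count field; the exact function set and the later tracepoint
    hooks are not shown:

        #include <linux/atomic.h>
        #include <linux/mm_types.h>

        /* Read the raw reference count instead of open-coding atomic_read(). */
        static inline int page_ref_count(struct page *page)
        {
                return atomic_read(&page->_count);
        }

        /* Wrappers around the atomic ops; a later patch can add tracepoints here. */
        static inline void page_ref_inc(struct page *page)
        {
                atomic_inc(&page->_count);
        }

        static inline int page_ref_dec_and_test(struct page *page)
        {
                return atomic_dec_and_test(&page->_count);
        }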
     
  • We remove one instance of flush_tlb_range() here. It was added by commit
    f714f4f20e59 ("mm: numa: call MMU notifiers on THP migration"), but
    pmdp_huge_clear_flush_notify() should have done the required flush for
    us. Hence remove the extra flush.

    Signed-off-by: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

16 Mar, 2016

4 commits

  • Rather than scattering mem_cgroup_migrate() calls all over the place,
    have a single call from a safe place that every migration operation
    eventually ends up in: migrate_page_copy().

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
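
    A hedged sketch of the locking rule described above: an unlocked path
    that only needs page->mem_cgroup to stay stable brackets its update with
    lock_page_memcg()/unlock_page_memcg(); the stat-update helper here is a
    made-up placeholder, not a real kernel function:

        static void account_page_dirtied_example(struct page *page)
        {
                lock_page_memcg(page);          /* page->mem_cgroup is now stable */
                if (page->mem_cgroup)
                        update_memcg_dirty_stat(page->mem_cgroup);  /* placeholder */
                unlock_page_memcg(page);
        }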
     
  • During migration, page_owner info is now copied with the rest of the
    page, so the stacktrace leading to free page allocation during migration
    is overwritten. For debugging purposes, it might be however useful to
    know that the page has been migrated since its initial allocation. This
    might happen many times during its lifetime, for different reasons, and
    fully tracking this, especially with stack traces, would incur extra
    memory costs. As a compromise, store and print the migrate_reason of
    the last migration that occurred to the page. This is enough to
    distinguish compaction, numa balancing etc.

    Example page_owner entry after the patch:

    Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
    PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
    [] __alloc_pages_nodemask+0x134/0x230
    [] alloc_pages_vma+0xb5/0x250
    [] shmem_alloc_page+0x61/0x90
    [] shmem_getpage_gfp+0x678/0x960
    [] shmem_fallocate+0x329/0x440
    [] vfs_fallocate+0x140/0x230
    [] SyS_fallocate+0x44/0x70
    [] entry_SYSCALL_64_fastpath+0x12/0x71
    Page has been migrated, last migrate reason: compaction

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The page_owner mechanism stores the gfp_flags of an allocation and the
    stack trace that led to it. During page migration, the original
    information is practically replaced by that of the free page allocated
    as the migration target. Arguably this is less useful and might lead to
    all the page_owner info for migratable pages gradually converging
    towards compaction or numa balancing migrations. It has also led to
    inaccuracies such as the one fixed by commit e2cfc91120fa
    ("mm/page_owner: set correct gfp_mask on page_owner").

    This patch thus introduces copying the page_owner info during migration.
    However, since the fact that the page has been migrated from its
    original place might be useful for debugging, the next patch will
    introduce a way to track that information as well.

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
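
    A sketch of what the copy step amounts to, assuming a helper along the
    lines of copy_page_owner() working on the page_ext-backed page_owner
    data; the field names are indicative rather than exact:

        void copy_page_owner(struct page *oldpage, struct page *newpage)
        {
                struct page_ext *old_ext = lookup_page_ext(oldpage);
                struct page_ext *new_ext = lookup_page_ext(newpage);

                /* Carry the original allocation's context over to the target page. */
                new_ext->order = old_ext->order;
                new_ext->gfp_mask = old_ext->gfp_mask;
                new_ext->nr_entries = old_ext->nr_entries;
                memcpy(new_ext->trace_entries, old_ext->trace_entries,
                       sizeof(old_ext->trace_entries));

                __set_bit(PAGE_EXT_OWNER, &new_ext->flags);
        }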
     

28 Feb, 2016

1 commit

  • Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
    flag combination due to confusing semantics. It noted that
    alloc_misplaced_dst_page() was one such user after changes made by
    commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify").

    Unfortunately when GFP_THISNODE was removed, users of
    alloc_misplaced_dst_page() started waking kswapd and entering direct
    reclaim because the wrong GFP flags are cleared. The consequence is
    that workloads that used to fit into memory now get reclaimed, which is
    addressed by this patch.

    The problem can be demonstrated with "mutilate", which exercises
    memcached, software dedicated to memory object caching. The
    configuration uses 80% of memory and is run 3 times for varying numbers
    of clients.
    The results on a 4-socket NUMA box are

    mutilate
    4.4.0 4.4.0
    vanilla numaswap-v1
    Hmean 1 8394.71 ( 0.00%) 8395.32 ( 0.01%)
    Hmean 4 30024.62 ( 0.00%) 34513.54 ( 14.95%)
    Hmean 7 32821.08 ( 0.00%) 70542.96 (114.93%)
    Hmean 12 55229.67 ( 0.00%) 93866.34 ( 69.96%)
    Hmean 21 39438.96 ( 0.00%) 85749.21 (117.42%)
    Hmean 30 37796.10 ( 0.00%) 50231.49 ( 32.90%)
    Hmean 47 18070.91 ( 0.00%) 38530.13 (113.22%)

    The metric is queries/second with the more the better. The results are
    way outside of the noise and the reason for the improvement is obvious
    from some of the vmstats

    4.4.0 4.4.0
    vanilla numaswap-v1r1
    Minor Faults 1929399272 2146148218
    Major Faults 19746529 3567
    Swap Ins 57307366 9913
    Swap Outs 50623229 17094
    Allocation stalls 35909 443
    DMA allocs 0 0
    DMA32 allocs 72976349 170567396
    Normal allocs 5306640898 5310651252
    Movable allocs 0 0
    Direct pages scanned 404130893 799577
    Kswapd pages scanned 160230174 0
    Kswapd pages reclaimed 55928786 0
    Direct pages reclaimed 1843936 41921
    Page writes file 2391 0
    Page writes anon 50623229 17094

    The vanilla kernel is swapping like crazy with large amounts of direct
    reclaim and kswapd activity. The figures are aggregate but it's known
    that the bad activity is throughout the entire test.

    Note that simple streaming anon/file memory consumers also see this
    problem but it's not as obvious. In those cases, kswapd is awake when
    it should not be.

    As there are at least two reclaim-related bugs out there, it's worth
    spelling out the user-visible impact. This patch only addresses bugs
    related to excessive reclaim on NUMA hardware when the working set is
    larger than a NUMA node. There is a bug related to high kswapd CPU
    usage, but the reports are against laptops and other UMA hardware, and
    it is not addressed by this patch.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
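
    For context, a hedged sketch of the kind of fix this implies: the
    destination allocation for NUMA-balancing migration keeps __GFP_THISNODE
    but masks off the reclaim bits, so a failed attempt neither wakes kswapd
    nor enters direct reclaim. The exact flag combination is as recalled for
    this era and may differ in detail from the final patch:

        static struct page *alloc_misplaced_dst_page(struct page *page,
                                                     unsigned long node,
                                                     int **result)
        {
                int nid = (int)node;

                /* Fail quietly rather than reclaiming if the target node is full. */
                return __alloc_pages_node(nid,
                                          (GFP_HIGHUSER_MOVABLE |
                                           __GFP_THISNODE | __GFP_NOMEMALLOC |
                                           __GFP_NORETRY | __GFP_NOWARN) &
                                          ~__GFP_RECLAIM, 0);
        }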
     

16 Jan, 2016

5 commits

  • Currently we don't split a huge page on partial unmap. That's not an
    ideal situation: it can lead to memory overhead.

    Fortunately, we can detect partial unmap in page_remove_rmap(). But we
    cannot call split_huge_page() from there due to locking context.

    It's also counterproductive to do it directly from the munmap()
    codepath: in many cases we will hit this from exit(2), and splitting the
    huge page just to free it up in small pages is not what we really want.

    The patch introduces deferred_split_huge_page(), which puts the huge
    page into a queue for splitting. The splitting itself will happen when
    we get memory pressure, via the shrinker interface. The page will be
    dropped from the list on freeing, through the compound page destructor.
    A simplified sketch of the queueing follows this entry.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
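
    A simplified sketch of the queueing side described above, assuming a
    single global split queue protected by a spinlock; the real code also
    wires the queue to a shrinker and drops pages from the list in the
    compound page destructor:

        static DEFINE_SPINLOCK(split_queue_lock);
        static LIST_HEAD(split_queue);
        static unsigned long split_queue_len;

        /* Queue a partially unmapped THP; the shrinker splits it under pressure. */
        void deferred_split_huge_page(struct page *page)
        {
                unsigned long flags;

                VM_BUG_ON_PAGE(!PageTransHuge(page), page);

                spin_lock_irqsave(&split_queue_lock, flags);
                if (list_empty(page_deferred_list(page))) {
                        list_add_tail(page_deferred_list(page), &split_queue);
                        split_queue_len++;
                }
                spin_unlock_irqrestore(&split_queue_lock, flags);
        }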
     
  • We're going to use migration entries instead of compound_lock() to
    stabilize page refcounts. Setting up and removing migration entries
    requires the page to be locked.

    Some of split_huge_page() callers already have the page locked. Let's
    require everybody to lock the page before calling split_huge_page().

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. That means we need to track the mapcount on a per-small-page
    basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times the subpage is mapped, with PMDs and PTEs combined.
    But this is rather expensive: mapping or unmapping a THP page with a PMD
    would require HPAGE_PMD_NR atomic operations instead of the single one
    we have now.

    The idea is to store separately how many times the page was mapped as a
    whole -- compound_mapcount. This frees up ->_mapcount in subpages to
    track the PTE mapcount.

    We use the same approach as for the compound page destructor and
    compound order to store compound_mapcount: space in the first tail page,
    ->mapping this time.

    Any time we map/unmap a whole compound page (THP or hugetlb) we
    increment/decrement compound_mapcount. When we map part of a compound
    page with a PTE we operate on ->_mapcount of the subpage.

    page_mapcount() counts both PTE and PMD mappings of the page.

    Basically, the mapcount for a subpage is spread over two counters, which
    makes it tricky to detect when the last mapcount for a page goes away.

    We introduce PageDoubleMap() for this. When we split a THP PMD for the
    first time and another PMD mapping is still left, we bump ->_mapcount in
    all subpages by one and set PG_double_map on the compound page. These
    additional references go away with the last compound_mapcount.

    This approach provides a way to detect when the last mapcount goes away
    on a per-small-page basis without introducing new overhead for the most
    common cases.

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
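
    A simplified sketch of how the two counters combine, assuming
    compound_mapcount lives in the first tail page; the PageDoubleMap()
    correction described above is left out for brevity:

        /* The PMD-level mapcount is stored in the first tail page. */
        static inline atomic_t *compound_mapcount_ptr(struct page *page)
        {
                return &page[1].compound_mapcount;
        }

        /* Simplified: PTE mappings of the subpage plus PMD mappings of the
         * compound page; the real helper also adjusts for PG_double_map. */
        static inline int page_mapcount(struct page *page)
        {
                int ret = atomic_read(&page->_mapcount) + 1;    /* PTE mappings */

                if (PageCompound(page))
                        ret += atomic_read(compound_mapcount_ptr(compound_head(page))) + 1;

                return ret;
        }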
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. That means we cannot rely on a PageTransHuge() check to decide
    whether to map/unmap a small page or the whole THP.

    The patch adds a new argument to the rmap functions to indicate whether
    we want to operate on the whole compound page or only the small page.

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
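
    Roughly, the resulting interface (signatures as introduced by this
    series): the rmap add/remove helpers now take a bool saying whether the
    whole compound page or a single subpage is being (un)mapped:

        /* compound == true:  the whole THP/hugetlb page is mapped via a PMD
         * compound == false: only this 4k subpage is mapped via a PTE */
        void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma,
                                unsigned long address, bool compound);
        void page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma,
                                    unsigned long address, bool compound);
        void page_remove_rmap(struct page *page, bool compound);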
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions --
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Nov, 2015

2 commits

  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity,
    clearing it prevents it.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
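
    A hedged sketch of the flag relationship the rename establishes in
    include/linux/gfp.h:

        /* __GFP_RECLAIM covers both halves of reclaim behaviour: setting it allows
         * kswapd wakeups and direct reclaim, clearing it prevents both. */
        #define __GFP_KSWAPD_RECLAIM  ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
        #define __GFP_DIRECT_RECLAIM  ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* caller may reclaim */
        #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))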
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority
    and have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation
    where an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to
    use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep
    and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
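
    A sketch of the helper the conversion notes refer to: blocking is keyed
    off __GFP_DIRECT_RECLAIM rather than the old __GFP_WAIT test:

        #include <linux/gfp.h>

        /* True if the allocation is allowed to sleep and enter direct reclaim. */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }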
     

06 Nov, 2015

10 commits

  • clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
    since v2.6.16 first introduced page migration; and the set_page_dirty()
    which completed its migration of PageDirty, later had to be moderated to
    __set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.

    No actual problems seen with this procedure recently, but if you look into
    what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
    achieving, it turns out to be nothing more than moving the PageDirty flag,
    and its NR_FILE_DIRTY stat from one zone to another.

    It would be good to avoid a pile of irrelevant decrementations and
    incrementations, and improper event counting, and unnecessary descent of
    the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
    radix_tree_replace_slot() left in place anyway).

    Do the NR_FILE_DIRTY movement, like the other stats movements, while
    interrupts are still disabled in migrate_page_move_mapping(); and don't
    even bother if the zone is the same. Do the PageDirty movement there under
    tree_lock too, where old page is frozen and newpage not yet visible:
    bearing in mind that as soon as newpage becomes visible in radix_tree, an
    un-page-locked set_page_dirty() might interfere (or perhaps that's just
    not possible: anything doing so should already hold an additional
    reference to the old page, preventing its migration; but play safe).

    But we do still need to transfer PageDirty in migrate_page_copy(), for
    those who don't go the mapping route through migrate_page_move_mapping().

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We have had trouble in the past from the way in which page migration's
    newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
    fix direct reclaim writeback regression") which proposed a cleanup.

    We have no actual problem now, but I think the procedure would be clearer
    (and alternative get_new_page pools safer to implement) if we assert that
    newpage is not touched until we are sure that it's going to be used -
    except for taking the trylock on it in __unmap_and_move().

    So shift the early initializations from move_to_new_page() into
    migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
    migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
    deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

    Adjust stages 3 to 8 in the Documentation file accordingly.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __unmap_and_move() contains a long stale comment on page_get_anon_vma()
    and PageSwapCache(), with an odd control flow that's hard to follow.
    Mostly this reflects our confusion about the lifetime of an anon_vma, in
    the early days of page migration, before we could take a reference to one.
    Nowadays this seems quite straightforward: cut it all down to essentials.

    I cannot see the relevance of swapcache here at all, so don't treat it any
    differently: I believe the old comment reflects in part our anon_vma
    confusions, and in part the original v2.6.16 page migration technique,
    which used actual swap to migrate anon instead of swap-like migration
    entries. Why should a swapcache page not be migrated with the aid of
    migration entry ptes like everything else? So lose that comment now, and
    enable migration entries for swapcache in the next patch.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little more by calling remove_migration_ptes()
    from the same level, on success or on failure, from __unmap_and_move() or
    from unmap_and_move_huge_page().

    Don't reset page->mapping of a PageAnon old page in move_to_new_page(),
    leave that to when the page is freed. Except for here in page migration,
    it has been an invariant that a PageAnon (bit set in page->mapping) page
    stays PageAnon until it is freed, and I think we're safer to keep to that.

    And with the above rearrangement, it's necessary because zap_pte_range()
    wants to identify whether a migration entry represents a file or an anon
    page, to update the appropriate rss stats without waiting on it.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Clean up page migration a little by moving the trylock of newpage from
    move_to_new_page() into __unmap_and_move(), where the old page has been
    locked. Adjust unmap_and_move_huge_page() and balloon_page_migrate()
    accordingly.

    But make one kind-of-functional change on the way: whereas trylock of
    newpage used to BUG() if it failed, now simply return -EAGAIN if so.
    Cutting out BUG()s is good, right? But, to be honest, this is really to
    extend the usefulness of the custom put_new_page feature, allowing a pool
    of new pages to be shared perhaps with racing uses.

    Use an "else" instead of that "skip_unmap" label.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I don't know of any problem from the way it's used in our current tree,
    but there is one defect in page migration's custom put_new_page feature.

    An unused newpage is expected to be released with the put_new_page(), but
    there was one MIGRATEPAGE_SUCCESS (0) path which released it with
    putback_lru_page(): which can be very wrong for a custom pool.

    Fixed more easily by resetting put_new_page once it won't be needed, than
    by adding a further flag to modify the rc test.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's migrate.c not migration,c, and nowadays putback_movable_pages() not
    putback_lru_pages().

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After v4.3's commit 0610c25daa3e ("memcg: fix dirty page migration")
    mem_cgroup_migrate() doesn't have much to offer in page migration: convert
    migrate_misplaced_transhuge_page() to set_page_memcg() instead.

    Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
    remaining callers are replace_page_cache_page() and shmem_replace_page():
    both of which passed lrucare true, so just eliminate that argument.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit e6c509f85455 ("mm: use clear_page_mlock() in page_remove_rmap()")
    in v3.7 inadvertently made mlock_migrate_page() impotent: page migration
    unmaps the page from userspace before migrating, and that commit clears
    PageMlocked on the final unmap, leaving mlock_migrate_page() with
    nothing to do. Not a serious bug, the next attempt at reclaiming the
    page would fix it up; but a betrayal of page migration's intent - the
    new page ought to emerge as PageMlocked.

    I don't see how to fix it for mlock_migrate_page() itself; but easily
    fixed in remove_migration_pte(), by calling mlock_vma_page() when the vma
    is VM_LOCKED - under pte lock as in try_to_unmap_one().

    Delete mlock_migrate_page()? Not quite, it does still serve a purpose for
    migrate_misplaced_transhuge_page(): where we could replace it by a test,
    clear_page_mlock(), mlock_vma_page() sequence; but would that be an
    improvement? mlock_migrate_page() is fairly lean, and let's make it
    leaner by skipping the irq save/restore now clearly not needed.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Migration tries up to 10 times to migrate pages that return -EAGAIN until
    it gives up. If some pages fail all retries, they are counted towards the
    number of failed pages that migrate_pages() returns. They should also be
    counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages
    tracepoint.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: "Aneesh Kumar K.V"
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

02 Oct, 2015

1 commit

  • The problem starts with a file backed dirty page which is charged to a
    memcg. Then page migration is used to move oldpage to newpage.

    Migration:
    - copies the oldpage's data to newpage
    - clears oldpage.PG_dirty
    - sets newpage.PG_dirty
    - uncharges oldpage from memcg
    - charges newpage to memcg

    Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
    count.

    However, because newpage is not yet charged, setting newpage.PG_dirty
    does not increment the memcg's dirty page count. After migration
    completes newpage.PG_dirty is eventually cleared, often in
    account_page_cleaned(). At this time newpage is charged to a memcg so
    the memcg's dirty page count is decremented which causes underflow
    because the count was not previously incremented by migration. This
    underflow causes balance_dirty_pages() to see a very large unsigned
    number of dirty memcg pages which leads to aggressive throttling of
    buffered writes by processes in non root memcg.

    This issue:
    - can harm performance of non root memcg buffered writes.
    - can report too small (even negative) values in
    memory.stat[(total_)dirty] counters of all memcg, including the root.

    To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
    page_memcg() and set_page_memcg() helpers.

    Test:
    0) setup and enter limited memcg
    mkdir /sys/fs/cgroup/test
    echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    1) buffered writes baseline
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    2) buffered writes with compaction antagonist to induce migration
    yes 1 > /proc/sys/vm/compact_memory &
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    kill %
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    3) buffered writes without antagonist, should match baseline
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    (speed, dirty residue)
    unpatched patched
    1) 841 MB/s 0 dirty pages 886 MB/s 0 dirty pages
    2) 611 MB/s -33427456 dirty pages 793 MB/s 0 dirty pages
    3) 114 MB/s -33427456 dirty pages 891 MB/s 0 dirty pages

    Notice that unpatched baseline performance (1) fell after
    migration (3): 841 -> 114 MB/s. In the patched kernel, post
    migration performance matches baseline.

    Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
    Signed-off-by: Greg Thelen
    Reported-by: Dave Hansen
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
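
    The helpers mentioned above, sketched roughly as introduced to keep
    migrate.c free of #ifdef CONFIG_MEMCG:

        #ifdef CONFIG_MEMCG
        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return page->mem_cgroup;
        }

        static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
        {
                page->mem_cgroup = memcg;
        }
        #else
        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                return NULL;
        }

        static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
        {
        }
        #endif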
     

23 Sep, 2015

1 commit

  • Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    each hugetlb page maintains its active flag to avoid a race condition
    between multiple calls of isolate_huge_page(), but the current kernel
    doesn't set the flag on a hugepage allocated by migration because the
    proper putback routine isn't called. This means that users could still
    encounter the race referred to by bcc54222309c in this special case, so
    this patch fixes it.

    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: [4.1.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Sep, 2015

1 commit

  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace by setting bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus, by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the
    number of pages that are not used by the workload.

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
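
    A hedged user-space sketch of the interface described above: each bit in
    /sys/kernel/mm/page_idle/bitmap corresponds to one PFN and the file is
    accessed in 8-byte chunks; writing set bits marks pages idle, reading
    later shows which are still idle. Error handling is kept minimal:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                unsigned long pfn = 0x70000;            /* example PFN of interest */
                off_t off = (pfn / 64) * 8;             /* 64 PFNs per 8-byte word */
                uint64_t word = ~0ULL;
                int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);

                if (fd < 0)
                        return 1;

                /* Mark all 64 pages in this word idle (their accessed bits get cleared). */
                pwrite(fd, &word, sizeof(word), off);

                /* ... let the workload run for a while, then re-read the word ... */
                pread(fd, &word, sizeof(word), off);
                printf("page %#lx is %s\n", pfn,
                       (word >> (pfn % 64)) & 1 ? "still idle" : "accessed");

                close(fd);
                return 0;
        }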
     

09 Sep, 2015

2 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would previously have been
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
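
    A sketch of the resulting pair, under the behaviour described above (the
    stricter nid checks arrive with the follow-up patch):

        /* Optimized variant: the caller guarantees nid is a valid node. */
        static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
        {
                VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }

        /* General variant: nid < 0 (including NUMA_NO_NODE) falls back to the
         * current node; the follow-up patch tightens this check. */
        static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                    unsigned int order)
        {
                if (nid < 0)
                        nid = numa_node_id();

                return __alloc_pages_node(nid, gfp_mask, order);
        }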
     
  • Wanpeng Li reported a race between soft_offline_page() and
    unpoison_memory(), which causes the following kernel panic:

    BUG: Bad page state in process bash pfn:97000
    page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00
    flags: 0x1fffff80080048(uptodate|active|swapbacked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x40(active)
    Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
    CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
    Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
    Call Trace:
    dump_stack+0x48/0x5c
    bad_page+0xe6/0x140
    free_pages_prepare+0x2f9/0x320
    ? uncharge_list+0xdd/0x100
    free_hot_cold_page+0x40/0x170
    __put_single_page+0x20/0x30
    put_page+0x25/0x40
    unmap_and_move+0x1a6/0x1f0
    migrate_pages+0x100/0x1d0
    ? kill_procs+0x100/0x100
    ? unlock_page+0x6f/0x90
    __soft_offline_page+0x127/0x2a0
    soft_offline_page+0xa6/0x200

    This race is explained like below:

    CPU0                                    CPU1

    soft_offline_page
    __soft_offline_page
    TestSetPageHWPoison
                                            unpoison_memory
                                            PageHWPoison check (true)
                                            TestClearPageHWPoison
                                            put_page -> release refcount held by
                                                        get_hwpoison_page in unpoison_memory
    put_page -> release refcount held by
                isolate_lru_page in __soft_offline_page
    migrate_pages

    The second put_page() releases the refcount held by isolate_lru_page(),
    which leads to unmap_and_move() releasing the last refcount of the page
    while its mapcount is still 1, since try_to_unmap() is not called if
    there is only one user mapping the page. Either way, the page refcount
    and mapcount will still be messed up if the page is mapped by multiple
    users.

    This race was introduced by commit 4491f71260 ("mm/memory-failure: set
    PageHWPoison before migrate_pages()"), which focuses on preventing the
    reuse of a successfully migrated page. Before this commit we prevented
    the reuse by changing the migratetype to MIGRATE_ISOLATE during soft
    offlining, which has the following problems, so simply reverting the
    commit is not the best option:

    1) it doesn't eliminate the reuse completely, because
    set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
    target page if the pageblock of the page contains one or more
    unmovable pages (i.e. has_unmovable_pages() returns true).

    2) the original code changes migratetype to MIGRATE_ISOLATE
    forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
    regardless of the original migratetype state, which could impact
    other subsystems like memory hotplug or compaction.

    This patch moves SetPageHWPoison to just after put_page() in
    unmap_and_move(), which closes up the reported race window and minimizes
    another race window between SetPageHWPoison and reallocation (which
    would cause the reuse of the soft-offlined page). The latter race window
    still exists but it's acceptable, because it's rare and effectively the
    same as the ordinary "containment failure" case even if it happens, so
    keeping the window open is acceptable.

    Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
    Signed-off-by: Wanpeng Li
    Signed-off-by: Naoya Horiguchi
    Reported-by: Wanpeng Li
    Tested-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

05 Sep, 2015

1 commit

  • The manpage for move_pages(2) specifies that the status code for the
    zero page is supposed to be -EFAULT. Currently the kernel returns
    -ENOENT in this case.

    follow_page() can do it for us if we ask for FOLL_DUMP. Using FOLL_DUMP
    also means that the upper level page table pages are no longer
    allocated.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Christoph Lameter
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Aug, 2015

2 commits

  • Now the page freeing code doesn't consider PageHWPoison a bad page, so
    by setting it before completing the page containment, we can prevent the
    error page from being reused just after successful page migration.

    I added TTU_IGNORE_HWPOISON for try_to_unmap() to make sure that the
    page table entry is transformed into a migration entry, not a hwpoison
    entry.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The race condition addressed in commit add05cecef80 ("mm: soft-offline:
    don't free target page in successful page migration") was not closed
    completely, because that can happen not only for soft-offline, but also
    for hard-offline. Consider that a slab page is about to be freed into
    buddy pool, and then an uncorrected memory error hits the page just
    after entering __free_one_page(), then VM_BUG_ON_PAGE(page->flags &
    PAGE_FLAGS_CHECK_AT_PREP) is triggered, despite the fact that it's not
    necessary because the data on the affected page is not consumed.

    To solve it, this patch drops __PG_HWPOISON from page flag checks at
    allocation/free time. I think it's justified because the __PG_HWPOISON
    flag is defined to prevent the page from being reused, and setting it
    outside the page's alloc-free cycle is a designed behavior (not a bug).

    For recent months, I was annoyed by a BUG_ON when a soft-offlined page
    remains on the lru cache list for a while, which is avoided by calling
    put_page() instead of putback_lru_page() in page migration's success
    path. This means that this patch reverts a major change from commit
    add05cecef80 about the new refcounting rule of soft-offlined pages, so
    "reuse window" revives. This will be closed by a subsequent patch.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Dean Nelson
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

25 Jun, 2015

2 commits

  • We have confusing functions to clear a pmd, pmd_clear_* and pmd_clear.
    Add _huge_ to the pmdp_clear functions so that it is clear they operate
    on hugepage ptes.

    We don't bother about other functions like pmdp_set_wrprotect,
    pmdp_clear_flush_young, because they operate on PTE bits and hence
    already indicate that they are operating on hugepage ptes.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Stress testing showed that soft offline events for a process iterating
    "mmap-pagefault-munmap" loop can trigger
    VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

    Soft offlining page 0x70fe1 at 0x70100008d000
    Soft offlining page 0x705fb at 0x70300008d000
    page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
    flags: 0x1fffff80800000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
    CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
    RIP: free_pcppages_bulk+0x52a/0x6f0
    Call Trace:
    drain_pages_zone+0x3d/0x50
    drain_local_pages+0x1d/0x30
    on_each_cpu_mask+0x46/0x80
    drain_all_pages+0x14b/0x1e0
    soft_offline_page+0x432/0x6e0
    SyS_madvise+0x73c/0x780
    system_call_fastpath+0x12/0x17
    Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
    RIP [] free_pcppages_bulk+0x52a/0x6f0
    RSP
    ---[ end trace 53926436e76d1f35 ]---

    When soft offline successfully migrates a page, the source page is
    supposed to be freed. But there is a race condition where a source page
    looks isolated (i.e. the refcount is 0 and PageHWPoison is set) but is
    somehow still linked to a pcplist. Then another soft offline event calls
    drain_all_pages() and tries to free such a hwpoisoned page, which is
    forbidden.

    This odd page state seems to happen due to the race between put_page() in
    putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
    with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
    drop lru_add_drain_all() in __soft_offline_page()", or to change page
    freeing code for this soft offline's purpose.

    Instead, let's think about the difference between hard offline and soft
    offline. There is an interesting difference in how the in-use page is
    isolated between the two: hard offline marks PageHWPoison on the target
    page first and doesn't free it, keeping its refcount at 1. OTOH, soft
    offline tries to free the target page and then marks PageHWPoison. This
    difference might be the source of the complexity and result in bugs like
    the one above. So making soft offline isolate the page while keeping its
    refcount can be a solution to this problem.

    We can pass the "reason" identifying the caller to the page migration
    code, so let's use this more to avoid calling putback_lru_page() when
    called from soft offline, which effectively does the isolation for soft
    offline. With this change, target pages of soft offline are never reused
    without changing the migratetype, so this patch also removes the related
    code.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Apr, 2015

1 commit

  • With the page flag sanitization patchset, an invalid usage of
    ClearPageSwapCache() is detected in migrate_page_copy().
    migrate_page_copy() is shared by both the normal and hugepage (both thp
    and hugetlb) code paths, so let's check PageSwapCache() and clear it if
    it's set, to avoid misuse of the invalid clear operation.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

15 Apr, 2015

2 commits

  • This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
    scanning even if migrate rate limited") so remove it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-12ubuntu1) :

    mm/migrate.c: In function `migrate_pages':
    mm/migrate.c:1148:1: internal compiler error: in push_minipool_fix, at config/arm/arm.c:13500
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.
    Preprocessed source stored into /tmp/ccPoM1tr.out file, please attach this to your bugreport.
    make[1]: *** [mm/migrate.o] Error 1
    make: *** [mm/migrate.o] Error 2

    Mark unmap_and_move() (which is used in a single place only) "noinline"
    to work around this compiler bug.

    [akpm@linux-foundation.org: make it conditional on gcc-4.7.3 and arm]
    [khilman@kernel.org: fine-tune compiler versions]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Geert Uytterhoeven
    Reported-by: Kevin Hilman
    Cc: Marc Zyngier
    Tested-by: Kevin Hilman
    Tested-by: Lina Iyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
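
    A hedged sketch of the shape of the workaround: a conditional attribute
    macro so the noinline only applies to the affected compiler and
    architecture combination (version bounds here are indicative):

        /* Avoid the ARM gcc 4.7.x internal compiler error by not inlining
         * unmap_and_move(); all other configurations are unaffected. */
        #if defined(CONFIG_ARM) && \
                defined(GCC_VERSION) && GCC_VERSION < 40900 && GCC_VERSION >= 40700
        #define ICE_noinline noinline
        #else
        #define ICE_noinline
        #endif

        /* ...and the definition becomes: static ICE_noinline int unmap_and_move(...) */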
     

13 Feb, 2015

2 commits

  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Automatic NUMA balancing depends on being able to protect PTEs to trap a
    fault and gather reference locality information. Very broadly speaking
    it would mark PTEs as not present and use another bit to distinguish
    between NUMA hinting faults and other types of faults. It was
    universally loved by everybody and caused no problems whatsoever. That
    last sentence might be a lie.

    This series is very heavily based on patches from Linus and Aneesh to
    replace the existing PTE/PMD NUMA helper functions with normal change
    protections. I did alter and add parts of it but I consider them
    relatively minor contributions. At their suggestion, acked-bys are in
    there but I've no problem converting them to Signed-off-by if requested.

    AFAIK, this has received no testing on ppc64 and I'm depending on Aneesh
    for that. I tested trinity under kvm-tool and passed and ran a few
    other basic tests. At the time of writing, only the short-lived tests
    have completed but testing of V2 indicated that long-term testing had no
    surprises. In most cases I'm leaving out detail as it's not that
    interesting.

    specjbb single JVM: There was negligible performance difference in the
    benchmark itself for short runs. However, system activity is
    higher and interrupts are much higher over time -- possibly TLB
    flushes. Migrations are also higher. Overall, this is more overhead
    but considering the problems faced with the old approach I think
    we just have to suck it up and find another way of reducing the
    overhead.

    specjbb multi JVM: Negligible performance difference to the actual benchmark
    but like the single JVM case, the system overhead is noticeably
    higher. Again, interrupts are a major factor.

    autonumabench: This was all over the place and about all that can be
    reasonably concluded is that it's different but not necessarily
    better or worse.

    autonumabench
    3.18.0-rc5 3.18.0-rc5
    mmotm-20141119 protnone-v3r3
    User NUMA01 32380.24 ( 0.00%) 21642.92 ( 33.16%)
    User NUMA01_THEADLOCAL 22481.02 ( 0.00%) 22283.22 ( 0.88%)
    User NUMA02 3137.00 ( 0.00%) 3116.54 ( 0.65%)
    User NUMA02_SMT 1614.03 ( 0.00%) 1543.53 ( 4.37%)
    System NUMA01 322.97 ( 0.00%) 1465.89 (-353.88%)
    System NUMA01_THEADLOCAL 91.87 ( 0.00%) 49.32 ( 46.32%)
    System NUMA02 37.83 ( 0.00%) 14.61 ( 61.38%)
    System NUMA02_SMT 7.36 ( 0.00%) 7.45 ( -1.22%)
    Elapsed NUMA01 716.63 ( 0.00%) 599.29 ( 16.37%)
    Elapsed NUMA01_THEADLOCAL 553.98 ( 0.00%) 539.94 ( 2.53%)
    Elapsed NUMA02 83.85 ( 0.00%) 83.04 ( 0.97%)
    Elapsed NUMA02_SMT 86.57 ( 0.00%) 79.15 ( 8.57%)
    CPU NUMA01 4563.00 ( 0.00%) 3855.00 ( 15.52%)
    CPU NUMA01_THEADLOCAL 4074.00 ( 0.00%) 4136.00 ( -1.52%)
    CPU NUMA02 3785.00 ( 0.00%) 3770.00 ( 0.40%)
    CPU NUMA02_SMT 1872.00 ( 0.00%) 1959.00 ( -4.65%)

    System CPU usage of NUMA01 is worse but it's an adverse workload on this
    machine so I'm reluctant to conclude that it's a problem that matters. On
    the other workloads that are sensible on this machine, system CPU usage is
    great. Overall time to complete the benchmark is comparable

    3.18.0-rc5 3.18.0-rc5
    mmotm-20141119 protnone-v3r3
    User 59612.50 48586.44
    System 460.22 1537.45
    Elapsed 1442.20 1304.29

    NUMA alloc hit 5075182 5743353
    NUMA alloc miss 0 0
    NUMA interleave hit 0 0
    NUMA alloc local 5075174 5743339
    NUMA base PTE updates 637061448 443106883
    NUMA huge PMD updates 1243434 864747
    NUMA page range updates 1273699656 885857347
    NUMA hint faults 1658116 1214277
    NUMA hint local faults 959487 754113
    NUMA hint local percent 57 62
    NUMA pages migrated 5467056 61676398

    The NUMA pages migrated look terrible but when I looked at a graph of the
    activity over time I see that the massive spike in migration activity was
    during NUMA01. This correlates with high system CPU usage and could be
    simply down to bad luck but any modifications that affect that workload
    would be related to scan rates and migrations, not the protection
    mechanism. For all other workloads, migration activity was comparable.

    Overall, headline performance figures are comparable but the overhead is
    higher, mostly in interrupts. To some extent, higher overhead from this
    approach was anticipated but not to this degree. It's going to be
    necessary to reduce this again with a separate series in the future. It's
    still worth going ahead with this series though as it's likely to avoid
    constant headaches with Xen and is probably easier to maintain.

    This patch (of 10):

    A transhuge NUMA hinting fault may find the page is migrating and should
    wait until migration completes. The check is race-prone because the pmd
    is dereferenced outside of the page lock and, while the race is tiny, it
    will be larger if the PMD is cleared while marking PMDs for hinting
    faults. This patch closes the race.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman