14 Oct, 2020

5 commits

  • The function is_huge_zero_page() doesn't call compound_head() to make sure
    the page pointer is a head page. The call to is_huge_zero_page() in
    release_pages() is made before compound_head() is called, so the test would
    fail if release_pages() were called with a tail page of the huge_zero_page,
    and put_page_testzero() would then be called, releasing the page.
    This is unlikely to happen in normal use, or we would be seeing all
    sorts of process data corruption when accessing a THP zero page.

    Looking at other places where is_huge_zero_page() is called, all seem to
    only pass a head page so I think the right solution is to move the call
    to compound_head() in release_pages() to a point before calling
    is_huge_zero_page().

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Yu Zhao
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Christoph Hellwig
    Link: https://lkml.kernel.org/r/20200917173938.16420-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
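
    A minimal sketch of the reordering described above, simplified from the
    release_pages() loop (not the exact upstream diff; i/nr/pages stand in for
    the function's iteration over its page array):

    for (i = 0; i < nr; i++) {
            struct page *page = pages[i];

            page = compound_head(page);     /* moved up, before the check */
            if (is_huge_zero_page(page))
                    continue;               /* never drop the shared zero THP */

            if (!put_page_testzero(page))
                    continue;
            /* ... normal release path ... */
    }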
     
  • Since commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
    pagevecs"), unevictable pages do not go directly back onto the zone's
    unevictable list.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Shakeel Butt
    Link: https://lkml.kernel.org/r/20200927122209.59328-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Since commit 07d802699528 ("mm: devmap: refactor 1-based refcounting for
    ZONE_DEVICE pages"), the function put_devmap_managed_page() has been
    renamed to page_is_devmap_managed().

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: John Hubbard
    Link: https://lkml.kernel.org/r/20200905084453.19353-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • To activate a page, mark_page_accessed() always holds a reference on it.
    It either gets a new reference when adding a page to
    lru_pvecs.activate_page or reuses an existing one it previously got when
    it added a page to lru_pvecs.lru_add. So it doesn't call SetPageActive()
    on a page that doesn't have any reference left. Therefore, the race is
    impossible these days (I didn't bother to dig into its history).

    For other paths, namely reclaim and migration, a reference count is always
    held while calling SetPageActive() on a page.

    SetPageSlabPfmemalloc() also uses SetPageActive(), but it's irrelevant to
    LRU pages.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200818184704.3625199-2-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • We no longer initially add anon pages to the active lruvec after commit
    b518154e59aa ("mm/vmscan: protect the workingset on anonymous LRU").
    Remove activate_page() from unuse_pte(), which seems to have been missed by
    that commit. And make the function static while we are at it.

    Before the commit, we called lru_cache_add_active_or_unevictable() to add
    new ksm pages to active lruvec. Therefore, activate_page() wasn't
    necessary for them in the first place.

    Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Cc: Alexander Duyck
    Cc: Huang Ying
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Mel Gorman
    Cc: Nicholas Piggin
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

20 Sep, 2020

1 commit

  • 5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has
    established that vm_events should count every subpage of a THP, including
    unevictable_pgs_culled and unevictable_pgs_rescued; but
    lru_cache_add_inactive_or_unevictable() was not doing so for
    unevictable_pgs_mlocked, and mm/mlock.c was not doing so for
    unevictable_pgs mlocked, munlocked, cleared and stranded.

    Fix them; but THPs don't go the pagevec way in mlock.c, so no fixes needed
    on that path.

    Fixes: 5d91f31faf8e ("mm: swap: fix vmstats for huge page")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Yang Shi
    Cc: Alex Shi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301408230.5954@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
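
    A hedged illustration of the counting pattern the fix uses (nr_pages is the
    base-page count of the possibly-huge page, i.e. thp_nr_pages() or
    hpage_nr_pages() depending on kernel version; the exact call sites are in
    lru_cache_add_inactive_or_unevictable() and mm/mlock.c):

    int nr_pages = thp_nr_pages(page);

    /* before: count_vm_event(UNEVICTABLE_PGMLOCKED);  counted 1 per THP     */
    count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);  /* count every subpage */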
     

10 Sep, 2020

1 commit

  • Commit eef1a429f234 ("mm/swap.c: piggyback lru_add_drain_all() calls")
    implemented an optimization mechanism to exit the to-be-started LRU
    drain operation (name it A) if another drain operation *started and
    finished* while (A) was blocked on the LRU draining mutex.

    This was done through a seqcount_t latch, which is an abuse of its
    semantics:

    1. seqcount_t latching should be used for the purpose of switching
    between two storage places with sequence protection to allow
    interruptible, preemptible, writer sections. The referenced
    optimization mechanism has absolutely nothing to do with that.

    2. The used raw_write_seqcount_latch() has two SMP write memory
    barriers to ensure one consistent storage place out of the two
    storage places available. A full memory barrier is required
    instead: to guarantee that the pagevec counter stores visible to the
    local CPU are visible to other CPUs -- before loading the current
    drain generation.

    Besides the seqcount_t API abuse, the semantics of a latch sequence
    counter were force-fitted into the referenced optimization. What was
    meant is to track "generations" of LRU draining operations, where
    "global lru draining generation = x" implies that all generations
    0 < n <= x are already *scheduled* for draining.
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87y2pg9erj.fsf@vostro.fn.ogness.net

    Ahmed S. Darwish
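
    A simplified sketch of the generation-based replacement described above
    (names approximate the patched lru_add_drain_all(); the actual per-CPU
    draining and error paths are omitted):

    static unsigned int lru_drain_gen;
    static DEFINE_MUTEX(lock);

    /* snapshot the generation before sleeping on the mutex */
    unsigned int this_gen = smp_load_acquire(&lru_drain_gen);

    mutex_lock(&lock);

    /* another drain started and finished while we slept: nothing left to do */
    if (unlikely(this_gen != lru_drain_gen))
            goto done;

    WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
    smp_mb();   /* full barrier: pagevec counter stores vs. the loads below */

    /* ... schedule and flush the per-CPU drain work ... */
    done:
    mutex_unlock(&lock);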
     

15 Aug, 2020

2 commits

  • A plain read of lru_add_pvec->nr could be interrupted by a write to the
    same variable. The write has local interrupts disabled, but the plain reads
    still result in data races. However, it is unlikely the compilers could do
    much damage here given that lru_add_pvec->nr is an "unsigned char" and
    there is an existing compiler barrier. Thus, annotate the reads using the
    data_race() macro. The data races were reported by KCSAN:

    BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page

    write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
    rotate_reclaimable_page+0x2df/0x490
    pagevec_add at include/linux/pagevec.h:81
    (inlined by) rotate_reclaimable_page at mm/swap.c:259
    end_page_writeback+0x1b5/0x2b0
    end_swap_bio_write+0x1d0/0x280
    bio_endio+0x297/0x560
    dec_pending+0x218/0x430 [dm_mod]
    clone_endio+0xe4/0x2c0 [dm_mod]
    bio_endio+0x297/0x560
    blk_update_request+0x201/0x920
    scsi_end_request+0x6b/0x4a0
    scsi_io_completion+0xb7/0x7e0
    scsi_finish_command+0x1ed/0x2a0
    scsi_softirq_done+0x1c9/0x1d0
    blk_done_softirq+0x181/0x1d0
    __do_softirq+0xd9/0x57c
    irq_exit+0xa2/0xc0
    do_IRQ+0x8b/0x190
    ret_from_intr+0x0/0x42
    delay_tsc+0x46/0x80
    __const_udelay+0x3c/0x40
    __udelay+0x10/0x20
    kcsan_setup_watchpoint+0x202/0x3a0
    __tsan_read1+0xc2/0x100
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
    lru_add_drain_cpu+0xb8/0x3f0
    lru_add_drain_cpu at mm/swap.c:602
    lru_add_drain+0x25/0x40
    shrink_active_list+0xe1/0xc80
    shrink_lruvec+0x766/0xb70
    shrink_node+0x2d6/0xca0
    do_try_to_free_pages+0x1f7/0x9a0
    try_to_free_pages+0x252/0x5b0
    __alloc_pages_slowpath+0x458/0x1290
    __alloc_pages_nodemask+0x3bb/0x450
    alloc_pages_vma+0x8a/0x2c0
    do_anonymous_page+0x16e/0x6f0
    __handle_mm_fault+0xcd5/0xd40
    handle_mm_fault+0xfc/0x2f0
    do_page_fault+0x263/0x6f9
    page_fault+0x34/0x40

    2 locks held by oom02/37761:
    #0: ffff9281e5928808 (&mm->mmap_sem#2){++++}, at: do_page_fault
    #1: ffffffffb3ade380 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part
    irq event stamp: 1949217
    trace_hardirqs_on_thunk+0x1a/0x1c
    __do_softirq+0x2e7/0x57c
    __do_softirq+0x34c/0x57c
    irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 23 PID: 37761 Comm: oom02 Not tainted 5.6.0-rc3-next-20200226+ #6
    Hardware name: HP ProLiant BL660c Gen9, BIOS I38 10/17/2018

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Acked-by: Marco Elver
    Link: http://lkml.kernel.org/r/20200228044018.1263-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
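
    The annotation itself is tiny; a hedged sketch of the kind of change, at
    the pagevec_count() read in lru_add_drain_cpu():

    struct pagevec *pvec = &per_cpu(lru_add_pvec, cpu);

    /* lockless peek at the fill level; racy by design, only a drain hint */
    if (data_race(pagevec_count(pvec)))
            __pagevec_lru_add(pvec);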
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

1 commit

  • In the current implementation, newly created or swapped-in anonymous pages
    start on the active list. Growing the active list results in rebalancing
    the active/inactive lists, so old pages on the active list are demoted to
    the inactive list. Hence, pages on the active list aren't protected at all.

    Following is an example of this situation.

    Assume that there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created or
    swapped-in anonymous pages will be inserted into the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the size
    of the inactive list but less than the total size (active + inactive). To
    solve this potential issue, a following patch will apply workingset
    detection similar to the one that's already applied to the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
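
    A simplified sketch of the add path after this change (mlock accounting and
    debug checks elided); the point is that there is no SetPageActive() here
    any more, so new anonymous pages land on the inactive list and must earn
    activation through references:

    void lru_cache_add_inactive_or_unevictable(struct page *page,
                                               struct vm_area_struct *vma)
    {
            bool unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) ==
                               VM_LOCKED;

            if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
                    /* mlocked: accounting elided in this sketch */
            }
            lru_cache_add(page);    /* inactive (or unevictable) by default */
    }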
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

26 Jun, 2020

1 commit

  • A non-file-lru page could also be activated in mark_page_accessed(), and
    we need to count this activation for nonresident_age.

    Note that it's better for this patch to be squashed into the patch "mm:
    workingset: age nonresident information alongside anonymous pages".

    Link: http://lkml.kernel.org/r/1592288204-27734-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

04 Jun, 2020

11 commits

  • The commit 2262185c5b28 ("mm: per-cgroup memory reclaim stats") added
    PGLAZYFREE, PGACTIVATE & PGDEACTIVATE stats for cgroups but missed a
    couple of places, and PGLAZYFREE missed huge page handling. Fix that.
    Also, for PGLAZYFREE use the irq-unsafe function to update the stat, as
    irqs are already disabled.

    Fixes: 2262185c5b28 ("mm: per-cgroup memory reclaim stats")
    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Link: http://lkml.kernel.org/r/20200527182947.251343-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
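
    A hedged sketch of the pattern the fix uses (names as in the 5.8-era tree;
    nr_pages is the base-page count of the possibly-huge page):

    unsigned long nr_pages = hpage_nr_pages(page);

    /* irq-unsafe __ variants are fine here: irqs are already disabled */
    __count_vm_events(PGLAZYFREE, nr_pages);
    __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, nr_pages);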
     
  • Many of the callbacks called by pagevec_lru_move_fn() do not correctly
    update the vmstats for huge pages. Fix that. Also, have
    __pagevec_lru_add_fn() use the irq-unsafe alternative to update the stat,
    as irqs are already disabled.

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Link: http://lkml.kernel.org/r/20200527182916.249910-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • The VM tries to balance reclaim pressure between anon and file so as to
    reduce the amount of IO incurred due to the memory shortage. It already
    counts refaults and swapins, but in addition it should also count
    writepage calls during reclaim.

    For swap, this is obvious: it's IO that wouldn't have occurred if the
    anonymous memory hadn't been under memory pressure. From a relative
    balancing point of view this makes sense as well: even if anon is cold and
    reclaimable, a cache that isn't thrashing may have equally cold pages that
    don't require IO to reclaim.

    For file writeback, it's trickier: some of the reclaim writepage IO would
    have likely occurred anyway due to dirty expiration. But not all of it -
    premature writeback reduces batching and generates additional writes.
    Since the flushers are already woken up by the time the VM starts writing
    cache pages one by one, let's assume that we're likely causing writes that
    wouldn't have happened without memory pressure. In addition, the per-page
    cost of IO would have probably been much cheaper if written in larger
    batches from the flusher thread rather than the single-page-writes from
    kswapd.

    For our purposes - getting the trend right to accelerate convergence on a
    stable state that doesn't require paging at all - this is sufficiently
    accurate. If we later wanted to optimize for sustained thrashing, we can
    still refine the measurements.

    Count all writepage calls from kswapd as IO cost toward the LRU that the
    page belongs to.

    Why do this dynamically? Don't we know in advance that anon pages require
    IO to reclaim, and so could build in a static bias?

    First, scanning is not the same as reclaiming. If all the anon pages are
    referenced, we may not swap for a while just because we're scanning the
    anon list. During this time, however, it's important that we age
    anonymous memory and the page cache at the same rate so that their
    hot-cold gradients are comparable. Everything else being equal, we still
    want to reclaim the coldest memory overall.

    Second, we keep copies in swap unless the page changes. If there is
    swap-backed data that's mostly read (tmpfs file) and has been swapped out
    before, we can reclaim it without incurring additional IO.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
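
    Roughly, per the description above (a sketch, not the exact diff; nr_pages
    is the base-page count of the page reclaim just submitted for writeback):

    /* shrink_page_list(): remember how much IO reclaim caused */
    stat->nr_pageout += nr_pages;

    /* shrink_inactive_list(): charge that IO as cost of this LRU type */
    lru_note_cost(lruvec, file, stat.nr_pageout);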
     
  • We split the LRU lists into anon and file, and we rebalance the scan
    pressure between them when one of them begins thrashing: if the file cache
    experiences workingset refaults, we increase the pressure on anonymous
    pages; if the workload is stalled on swapins, we increase the pressure on
    the file cache instead.

    With cgroups and their nested LRU lists, we currently don't do this
    correctly. While recursive cgroup reclaim establishes a relative LRU
    order among the pages of all involved cgroups, LRU pressure balancing is
    done on an individual cgroup LRU level. As a result, when one cgroup is
    thrashing on the filesystem cache while a sibling may have cold anonymous
    pages, pressure doesn't get equalized between them.

    This patch moves LRU balancing decision to the root of reclaim - the same
    level where the LRU order is established.

    It does this by tracking LRU cost recursively, so that every level of the
    cgroup tree knows the aggregate LRU cost of all memory within its domain.
    When the page scanner calculates the scan balance for any given individual
    cgroup's LRU list, it uses the values from the ancestor cgroup that
    initiated the reclaim cycle.

    If one sibling is then thrashing on the cache, it will tip the pressure
    balance inside its ancestors, and the next hierarchical reclaim iteration
    will go more after the anon pages in the tree.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
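
    A hedged sketch of the recursive accounting (approximating lru_note_cost()
    after this change; parent_lruvec() walks up the cgroup hierarchy):

    void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
    {
            do {
                    /* every ancestor sees the aggregate cost of its subtree */
                    if (file)
                            lruvec->file_cost += nr_pages;
                    else
                            lruvec->anon_cost += nr_pages;
                    /* (upstream also decays the counters as they grow) */
            } while ((lruvec = parent_lruvec(lruvec)));
    }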
     
  • Since the LRUs were split into anon and file lists, the VM has been
    balancing between page cache and anonymous pages based on per-list ratios
    of scanned vs. rotated pages. In most cases that tips page reclaim
    towards the list that is easier to reclaim and has the fewest actively
    used pages, but there are a few problems with it:

    1. Refaults and LRU rotations are weighted the same way, even though
    one costs IO and the other costs a bit of CPU.

    2. The less we scan an LRU list based on already observed rotations,
    the more we increase the sampling interval for new references, and
    rotations become even more likely on that list. This can enter a
    death spiral in which we stop looking at one list completely until
    the other one is all but annihilated by page reclaim.

    Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
    we have refault detection for the page cache. Along with swapin events,
    they are good indicators of when the file or anon list, respectively, is
    too small for its workingset and needs to grow.

    For example, if the page cache is thrashing, the cache pages need more
    time in memory, while there may be colder pages on the anonymous list.
    Likewise, if swapped pages are faulting back in, it indicates that we
    reclaim anonymous pages too aggressively and should back off.

    Replace LRU rotations with refaults and swapins as the basis for relative
    reclaim cost of the two LRUs. This will have the VM target list balances
    that incur the least amount of IO on aggregate.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Operations like MADV_FREE, FADV_DONTNEED etc. currently move any affected
    active pages to the inactive list to accelerate their reclaim (good) but
    also steer page reclaim toward that LRU type, or away from the other
    (bad).

    The reason why this is undesirable is that such operations are not part of
    the regular page aging cycle, and rather a fluke that doesn't say much
    about the remaining pages on that list; they might all be in heavy use,
    and once the chunk of easy victims has been purged, the VM continues to
    apply elevated pressure on those remaining hot pages. The other LRU,
    meanwhile, might have easily reclaimable pages, and there was never a need
    to steer away from it in the first place.

    As the previous patch outlined, we should focus on recording actually
    observed cost to steer the balance rather than speculating about the
    potential value of one LRU list over the other. In that spirit, leave
    explicitly deactivated pages to the LRU algorithm to pick up, and let
    rotations decide which list is the easiest to reclaim.

    [cai@lca.pw: fix set-but-not-used warning]
    Link: http://lkml.kernel.org/r/20200522133335.GA624@Qians-MacBook-Air.local
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200520232525.798933-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, scan pressure between the anon and file LRU lists is balanced
    based on a mixture of reclaim efficiency and a somewhat vague notion of
    "value" of having certain pages in memory over others. That concept of
    value is problematic, because it has caused us to count any event that
    remotely makes one LRU list more or less preferable for reclaim, even
    when these events are not directly comparable and impose very different
    costs on the system. One example is referenced file pages that we still
    deactivate and referenced anonymous pages that we actually rotate back to
    the head of the list.

    There is also conceptual overlap with the LRU algorithm itself. By
    rotating recently used pages instead of reclaiming them, the algorithm
    already biases the applied scan pressure based on page value. Thus, when
    rebalancing scan pressure due to rotations, we should think of reclaim
    cost, and leave assessing the page value to the LRU algorithm.

    Lastly, considering both value-increasing as well as value-decreasing
    events can sometimes cause the same type of event to be counted twice,
    i.e. how rotating a page increases the LRU value, while reclaiming it
    successfully decreases the value. In itself this will balance out fine,
    but it quietly skews the impact of events that are only recorded once.

    The abstract metric of "value", the murky relationship with the LRU
    algorithm, and accounting both negative and positive events make the
    current pressure balancing model hard to reason about and modify.

    This patch switches to a balancing model of accounting the concrete,
    actually observed cost of reclaiming one LRU over another. For now, that
    cost includes pages that are scanned but rotated back to the list head.
    Subsequent patches will add consideration for IO caused by refaulting of
    recently evicted pages.

    Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
    make everything that affects cost go through a new lru_note_cost()
    function.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
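
    A sketch of the resulting bookkeeping (field names as in the lruvec after
    this change; surrounding members omitted):

    struct lruvec {
            struct list_head        lists[NR_LRU_LISTS];
            /*
             * Observed cost of reclaiming one LRU type over the other;
             * everything that affects it funnels through lru_note_cost().
             */
            unsigned long           anon_cost;
            unsigned long           file_cost;
            /* ... */
    };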
     
  • When the splitlru patches divided page cache and swap-backed pages into
    separate LRU lists, the pressure balance between the lists was biased to
    account for the fact that streaming IO can cause memory pressure with a
    flood of pages that are used only once. New page cache additions would
    tip the balance toward the file LRU, and repeat access would neutralize
    that bias again. This ensured that page reclaim would always go for
    used-once cache first.

    Since e9868505987a ("mm,vmscan: only evict file pages when we have
    plenty"), page reclaim generally skips over swap-backed memory entirely as
    long as there is used-once cache present, and will apply the LRU balancing
    when only repeatedly accessed cache pages are left - at which point the
    previous use-once bias will have been neutralized. This makes the
    use-once cache balancing bias unnecessary.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-7-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • lru_cache_add_anon() and lru_cache_add_file() are the same function, and
    for the purpose of all callers they are equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The reclaim code that balances between swapping and cache reclaim tries to
    predict likely reuse based on in-memory reference patterns alone. This
    works in many cases, but when it fails it cannot detect when the cache is
    thrashing pathologically, or when we're in the middle of a swap storm.

    The high seek cost of rotational drives under which the algorithm evolved
    also meant that mistakes could quickly result in lockups from too
    aggressive swapping (which is predominantly random IO). As a result, the
    balancing code has been tuned over time to a point where it mostly goes
    for page cache and defers swapping until the VM is under significant
    memory pressure.

    The resulting strategy doesn't make optimal caching decisions - where
    optimal is the least amount of IO required to execute the workload.

    The proliferation of fast random IO devices such as SSDs, in-memory
    compression such as zswap, and persistent memory technologies on the
    horizon, has made this undesirable behavior very noticeable: Even in the
    presence of large amounts of cold anonymous memory and a capable swap
    device, the VM refuses to even seriously scan these pages, and can leave
    the page cache thrashing needlessly.

    This series sets out to address this. Since commit a528910e12ec ("mm:
    thrash detection-based file cache sizing") we have exact tracking of
    refault IO - the ultimate cost of reclaiming the wrong pages. This allows
    us to use an IO cost based balancing model that is more aggressive about
    scanning anonymous memory when the cache is thrashing, while being able to
    avoid unnecessary swap storms.

    These patches base the LRU balance on the rate of refaults on each list,
    times the relative IO cost between swap device and filesystem
    (swappiness), in order to optimize reclaim for least IO cost incurred.

    History

    I floated these changes in 2016. At the time they were incomplete and
    full of workarounds due to a lack of infrastructure in the reclaim code:
    We didn't have PageWorkingset, we didn't have hierarchical cgroup
    statistics, and there were problems with the cgroup swap controller. As
    swapping wasn't too high a priority then, the patches stalled out. With all
    dependencies in place now, here we are again with much cleaner,
    feature-complete patches.

    I kept the acks for patches that stayed materially the same :-)

    Below is a series of test results that demonstrate certain problematic
    behavior of the current code, as well as showcase the new code's more
    predictable and appropriate balancing decisions.

    Test #1: No convergence

    This test shows an edge case where the VM currently doesn't converge at
    all on a new file workingset with a stale anon/tmpfs set.

    The test sets up a cold anon set the size of 3/4 RAM, then tries to
    establish a new file set half the size of RAM (flat access pattern).

    The vanilla kernel refuses to even scan anon pages and never converges.
    The file set is perpetually served from the filesystem.

    The first test kernel is with the series up to the workingset patch
    applied. This allows thrashing page cache to challenge the anonymous
    workingset. The VM then scans the lists based on the current
    scanned/rotated balancing algorithm. It converges on a stable state where
    all cold anon pages are pushed out and the fileset is served entirely from
    cache:

    noconverge/5.7-rc5-mm noconverge/5.7-rc5-mm-workingset
    Scanned 417719308.00 ( +0.00%) 64091155.00 ( -84.66%)
    Reclaimed 417711094.00 ( +0.00%) 61640308.00 ( -85.24%)
    Reclaim efficiency % 100.00 ( +0.00%) 96.18 ( -3.78%)
    Scanned file 417719308.00 ( +0.00%) 59211118.00 ( -85.83%)
    Scanned anon 0.00 ( +0.00%) 4880037.00 ( )
    Swapouts 0.00 ( +0.00%) 2439957.00 ( )
    Swapins 0.00 ( +0.00%) 257.00 ( )
    Refaults 415246605.00 ( +0.00%) 59183722.00 ( -85.75%)
    Restore refaults 0.00 ( +0.00%) 54988252.00 ( )

    The second test kernel is with the full patch series applied, which
    replaces the scanned/rotated ratios with refault/swapin rate-based
    balancing. It evicts the cold anon pages more aggressively in the
    presence of a thrashing cache and the absence of swapins, and so converges
    with about 60% of the IO and reclaim activity:

    noconverge/5.7-rc5-mm-workingset noconverge/5.7-rc5-mm-lrubalance
    Scanned 64091155.00 ( +0.00%) 37579741.00 ( -41.37%)
    Reclaimed 61640308.00 ( +0.00%) 35129293.00 ( -43.01%)
    Reclaim efficiency % 96.18 ( +0.00%) 93.48 ( -2.78%)
    Scanned file 59211118.00 ( +0.00%) 32708385.00 ( -44.76%)
    Scanned anon 4880037.00 ( +0.00%) 4871356.00 ( -0.18%)
    Swapouts 2439957.00 ( +0.00%) 2435565.00 ( -0.18%)
    Swapins 257.00 ( +0.00%) 262.00 ( +1.94%)
    Refaults 59183722.00 ( +0.00%) 32675667.00 ( -44.79%)
    Restore refaults 54988252.00 ( +0.00%) 28480430.00 ( -48.21%)

    We're triggering this case in host sideloading scenarios: When a host's
    primary workload is not saturating the machine (primary load is usually
    driven by user activity), we can optimistically sideload a batch job; if
    user activity picks up and the primary workload needs the whole host
    during this time, we freeze the sideload and rely on it getting pushed to
    swap. Frequently that swapping doesn't happen and the completely inactive
    sideload simply stays resident while the expanding primary workload is
    struggling to gain ground.

    Test #2: Kernel build

    This test is a kernel build that is slightly memory-restricted (make -j4
    inside a 400M cgroup).

    Despite the very aggressive swapping of cold anon pages in test #1, this
    test shows that the new kernel carefully balances swap against cache
    refaults when both the anon and the file set are pressured.

    It shows the patched kernel to be slightly better at finding the coldest
    memory from the combined anon and file set to evict under pressure. The
    result is lower aggregate reclaim and paging activity:

    5.7-rc5-mm 5.7-rc5-mm-lrubalance
    Real time 210.60 ( +0.00%) 210.97 ( +0.18%)
    User time 745.42 ( +0.00%) 746.48 ( +0.14%)
    System time 69.78 ( +0.00%) 69.79 ( +0.02%)
    Scanned file 354682.00 ( +0.00%) 293661.00 ( -17.20%)
    Scanned anon 465381.00 ( +0.00%) 378144.00 ( -18.75%)
    Swapouts 185920.00 ( +0.00%) 147801.00 ( -20.50%)
    Swapins 34583.00 ( +0.00%) 32491.00 ( -6.05%)
    Refaults 212664.00 ( +0.00%) 172409.00 ( -18.93%)
    Restore refaults 48861.00 ( +0.00%) 80091.00 ( +63.91%)
    Total paging IO 433167.00 ( +0.00%) 352701.00 ( -18.58%)

    Test #3: Overload

    This next test is not about performance, but rather about the
    predictability of the algorithm. The current balancing behavior doesn't
    always lead to comprehensible results, which makes performance analysis
    and parameter tuning (swappiness e.g.) very difficult.

    The test shows the balancing behavior under equivalent anon and file
    input. Anon and file sets are created of equal size (3/4 RAM), have the
    same access patterns (a hot-cold gradient), and synchronized access rates.
    Swappiness is raised from the default of 60 to 100 to indicate equal IO
    cost between swap and cache.

    With the vanilla balancing code, anon scans make up around 9% of the total
    pages scanned, or a ~1:10 ratio. This is a surprisingly skewed ratio, and
    it's an outcome that is hard to explain given the input parameters to the
    VM.

    The new balancing model targets a 1:2 balance: All else being equal,
    reclaiming a file page costs one page IO - the refault; reclaiming an anon
    page costs two IOs - the swapout and the swapin. In the test we observe a
    ~1:3 balance.

    The scanned and paging IO numbers indicate that the anon LRU algorithm we
    have in place right now does a slightly worse job at picking the coldest
    pages compared to the file algorithm. There is ongoing work to improve
    this, like Joonsoo's anon workingset patches; however, it's difficult to
    compare the two aging strategies when the balancing between them is
    behaving unintuitively.

    The slightly less efficient anon reclaim results in a deviation from the
    optimal 1:2 scan ratio we would like to see here - however, 1:3 is much
    closer to what we'd want to see in this test than the vanilla kernel's
    aging of 10+ cache pages for every anonymous one:

    overload-100/5.7-rc5-mm-workingset overload-100/5.7-rc5-mm-lrubalance-realfile
    Scanned 533633725.00 ( +0.00%) 595687785.00 ( +11.63%)
    Reclaimed 494325440.00 ( +0.00%) 518154380.00 ( +4.82%)
    Reclaim efficiency % 92.63 ( +0.00%) 86.98 ( -6.03%)
    Scanned file 484532894.00 ( +0.00%) 456937722.00 ( -5.70%)
    Scanned anon 49100831.00 ( +0.00%) 138750063.00 ( +182.58%)
    Swapouts 8096423.00 ( +0.00%) 48982142.00 ( +504.98%)
    Swapins 10027384.00 ( +0.00%) 62325044.00 ( +521.55%)
    Refaults 479819973.00 ( +0.00%) 451309483.00 ( -5.94%)
    Restore refaults 426422087.00 ( +0.00%) 399914067.00 ( -6.22%)
    Total paging IO 497943780.00 ( +0.00%) 562616669.00 ( +12.99%)

    Test #4: Parallel IO

    It's important to note that these patches only affect the situation where
    the kernel has to reclaim workingset memory, which is usually a
    transitionary period. The vast majority of page reclaim occurring in a
    system is from trimming the ever-expanding page cache.

    These patches don't affect cache trimming behavior. We never swap as long
    as we only have use-once cache moving through the file LRU; we only
    consider swapping when the cache is actively thrashing.

    The following test demonstrates this. It has an anon workingset that
    takes up half of RAM and then writes a file that is twice the size of RAM
    out to disk.

    As the cache is funneled through the inactive file list, no anon pages are
    scanned (aside from apparently some background noise of 10 pages):

    5.7-rc5-mm 5.7-rc5-mm-lrubalance
    Scanned 10714722.00 ( +0.00%) 10723445.00 ( +0.08%)
    Reclaimed 10703596.00 ( +0.00%) 10712166.00 ( +0.08%)
    Reclaim efficiency % 99.90 ( +0.00%) 99.89 ( -0.00%)
    Scanned file 10714722.00 ( +0.00%) 10723435.00 ( +0.08%)
    Scanned anon 0.00 ( +0.00%) 10.00 ( )
    Swapouts 0.00 ( +0.00%) 7.00 ( )
    Swapins 0.00 ( +0.00%) 0.00 ( +0.00%)
    Refaults 92.00 ( +0.00%) 41.00 ( -54.84%)
    Restore refaults 0.00 ( +0.00%) 0.00 ( +0.00%)
    Total paging IO 92.00 ( +0.00%) 48.00 ( -47.31%)

    This patch (of 14):

    Currently, THP are counted as single pages until they are split right
    before being swapped out. However, at that point the VM is already in the
    middle of reclaim, and adjusting the LRU balance then is useless.

    Always account THP by the number of basepages, and remove the fixup from
    the splitting path.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-1-hannes@cmpxchg.org
    Link: http://lkml.kernel.org/r/20200520232525.798933-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • None of the three callers of get_compound_page_dtor() want to know the
    value; they just want to call the function. Replace it with
    destroy_compound_page() which calls the dtor for them.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
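
    A hedged sketch of the replacement helper (approximating the definition
    added to include/linux/mm.h):

    static inline void destroy_compound_page(struct page *page)
    {
            /* look up the compound destructor and invoke it directly,
             * instead of handing the function pointer back to the caller */
            compound_page_dtors[page[1].compound_dtor](page);
    }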
     

28 May, 2020

1 commit

  • The various struct pagevec per CPU variables are protected by disabling
    either preemption or interrupts across the critical sections. Inside
    these sections spinlocks have to be acquired.

    These spinlocks are regular spinlock_t types which are converted to
    "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
    locks cannot be acquired in preemption or interrupt disabled sections.

    local locks provide a trivial way to substitute preempt and interrupt
    disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
    to preempt_disable() and local_lock_irq() to local_irq_disable().

    Create lru_rotate_pvecs containing the pagevec and the locallock.
    Create lru_pvecs containing the remaining pagevecs and the locallock.
    Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
    exporting the pvec structure.

    Change the relevant call sites to acquire these locks instead of using
    preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
    local_irq_save().

    There is neither a functional change nor a change in the generated
    binary code for non PREEMPT_RT enabled non-debug kernels.

    When lockdep is enabled local locks have lockdep maps embedded. These
    allow lockdep to validate the protections, i.e. inappropriate usage of a
    preemption only protected sections would result in a lockdep warning
    while the same problem would not be noticed with a plain
    preempt_disable() based protection.

    local locks also improve readability as they provide a named scope for
    the protections while preempt/interrupt disable are opaque and scopeless.

    Finally local locks allow PREEMPT_RT to substitute them with real
    locking primitives to ensure the correctness of operation in a fully
    preemptible kernel.

    [ bigeasy: Adopted to use local_lock ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de

    Ingo Molnar
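
    A hedged sketch of the resulting pattern, close to mm/swap.c after this
    change (only the lru_add pagevec shown):

    struct lru_pvecs {
            local_lock_t lock;
            struct pagevec lru_add;
            /* ... the other pagevecs ... */
    };
    static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
            .lock = INIT_LOCAL_LOCK(lock),
    };

    void lru_cache_add(struct page *page)
    {
            struct pagevec *pvec;

            local_lock(&lru_pvecs.lock);    /* preempt_disable() on !PREEMPT_RT */
            pvec = this_cpu_ptr(&lru_pvecs.lru_add);
            get_page(page);
            if (!pagevec_add(pvec, page) || PageCompound(page))
                    __pagevec_lru_add(pvec);
            local_unlock(&lru_pvecs.lock);
    }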
     

08 Apr, 2020

2 commits

  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some usecases which prefer preserving THP, but
    balloon inflation is handled in base page size. So when using shmem THP
    as memory backend, QEMU inflation actually doesn't work as expected since
    it doesn't free memory. But the inflation usecase really needs to get the
    memory freed. (Anonymous THP will also not get freed right away, but will
    be freed eventually when all subpages are unmapped: whereas shmem THP
    still stays in page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag PG_swapbacked.
    This makes page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All of this is put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     

03 Apr, 2020

2 commits

  • A memory barrier is needed after setting the LRU bit, but smp_mb() is too
    strong. Some architectures, e.g. x86, imply a memory barrier with atomic
    operations, so replacing it with smp_mb__after_atomic() is better; it is a
    no-op on strongly ordered machines and a full memory barrier on others.
    With this change the vm-scalability cases perform better on x86: I saw a
    total 6% improvement with this patch and the previous inline fix.

    The test data (lru-file-readtwice throughput) against v5.6-rc4:
    mainline w/ inline fix w/ both (adding this)
    150MB 154MB 159MB

    Fixes: 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping pagevecs")
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Matthew Wilcox (Oracle)
    Link: http://lkml.kernel.org/r/1584500541-46817-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
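
    The change itself is small; roughly, in __pagevec_lru_add_fn() (a sketch,
    not the exact diff):

    SetPageLRU(page);
    /*
     * SetPageLRU() is an atomic RMW, so the cheaper barrier is enough to
     * order the LRU bit store against the reads that follow: a no-op on
     * strongly ordered architectures such as x86, a full barrier elsewhere.
     */
    smp_mb__after_atomic();     /* was: smp_mb() */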
     
  • __pagevec_lru_add() is only used in mm directory now.

    Remove the export symbol.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200126011436.22979-1-richardw.yang@linux.intel.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

01 Feb, 2020

1 commit

  • An upcoming patch changes and complicates the refcounting and especially
    the "put page" aspects of it. In order to keep everything clean,
    refactor the devmap page release routines:

    * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
    limit the functionality to "read only": return a bool, with no side
    effects.

    * Add a new routine, put_devmap_managed_page(), to handle decrementing
    the refcount for ZONE_DEVICE pages.

    * Change callers (just release_pages() and put_page()) to check
    page_is_devmap_managed() before calling the new
    put_devmap_managed_page() routine. This is a performance point:
    put_page() is a hot path, so we need to avoid non- inline function calls
    where possible.

    * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
    limit the functionality to unconditionally freeing a devmap page.

    This is originally based on a separate patch by Ira Weiny, which applied
    to an early version of the put_user_page() experiments. Since then,
    Jérôme Glisse suggested the refactoring described above.

    Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
    Signed-off-by: Ira Weiny
    Signed-off-by: John Hubbard
    Suggested-by: Jérôme Glisse
    Reviewed-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Kirill A. Shutemov
    Cc: Alex Williamson
    Cc: Aneesh Kumar K.V
    Cc: Björn Töpel
    Cc: Daniel Vetter
    Cc: Hans Verkuil
    Cc: Jason Gunthorpe
    Cc: Jason Gunthorpe
    Cc: Jens Axboe
    Cc: Jonathan Corbet
    Cc: Leon Romanovsky
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
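
    A hedged sketch of the resulting put_page() fast path (approximating
    include/linux/mm.h after this refactor):

    static inline void put_page(struct page *page)
    {
            page = compound_head(page);

            /* read-only check stays inline on the hot path ... */
            if (page_is_devmap_managed(page)) {
                    /* ... the refcount work is done by the new routine */
                    put_devmap_managed_page(page);
                    return;
            }

            if (put_page_testzero(page))
                    __put_page(page);
    }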
     

01 Dec, 2019

2 commits

  • This is a very slow operation. Right now POSIX_FADV_DONTNEED is the top
    user because it has to freeze page references when removing it from the
    cache. invalidate_bdev() calls it for the same reason. Both are
    triggered from userspace, so it's easy to generate a storm.

    mlock/mlockall no longer call lru_add_drain_all() - I've seen serious
    slowdowns here on older kernels.

    There are some less obvious paths in memory migration/CMA/offlining
    which shouldn't call it frequently.

    The worst case requires a non-trivial workload because
    lru_add_drain_all() skips cpus where the vectors are empty. Something must
    constantly generate a flow of pages for each cpu. Also, cpus must be
    busy enough to make the scheduling of per-cpu work slower. And the machine
    must be big enough (64+ cpus in our case).

    In our case that was a massive series of mlock calls in map-reduce while
    other tasks wrote logs (which generates flows of new pages in per-cpu
    vectors). The mlock calls were serialized by a mutex and accumulated
    latency of up to 10 seconds or more.

    The kernel does not call lru_add_drain_all on mlock paths since 4.15,
    but the same scenario could be triggered by fadvise(POSIX_FADV_DONTNEED)
    or any other remaining user.

    There is no reason to do the drain again if somebody else already
    drained all the per-cpu vectors while we waited for the lock.

    Piggyback on a drain starting and finishing while we wait for the lock:
    all pages pending at the time of our entry were drained from the
    vectors.

    Callers like POSIX_FADV_DONTNEED retry their operations once after
    draining per-cpu vectors when pages have unexpected references.

    Link: http://lkml.kernel.org/r/157019456205.3142.3369423180908482020.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This avoids duplicated PageReferenced() calls. No behavior change.

    Link: http://lkml.kernel.org/r/20191016225326.GB12497@wfg-t540p.sh.intel.com
    Signed-off-by: Fengguang Wu
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Liu Jingqi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

26 Sep, 2019

1 commit

  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To increase hot start, Android userspace manages the order that apps
    should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd(low memory killer daemon). They are likely to be killed by
    lmkd if the system has to reclaim memory. In that sense they are similar
    to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements but those performance improvements
    will vary based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they are
    good candidates for swap. Under investigation, swapping out only begins
    once the low zone watermark is hit and kswapd wakes up, but the overall
    allocation rate in the system might trip lmkd thresholds and cause a
    cached process to be killed (we measured the performance of swapping out
    vs. zapping the memory by killing a process; unsurprisingly, zapping is
    10x faster even though we use zram, which is much faster than real
    storage), so a kill from lmkd will often satisfy the high zone watermark,
    resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it could provide many chances for platform to use much information to
    optimize memory efficiency.

    To achieve the goal, the patchset introduce two new options for madvise.
    One is MADV_COLD which will deactivate activated pages and the other is
    MADV_PAGEOUT which will reclaim private pages instantly. These new
    options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
    ways to gain some free memory space. MADV_PAGEOUT is similar to
    MADV_DONTNEED in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed immediately; MADV_COLD is similar
    to MADV_FREE in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it could
    give a hint to kernel that the pages can be reclaimed when memory pressure
    happens but data should be preserved for future use. This could reduce
    workingset eviction so it ends up increasing performance.

    This patch introduces the new MADV_COLD hint to madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help kernel in deciding which
    pages to evict early during memory pressure.

    It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
    file LRU's head, because MADV_COLD has slightly different semantics.
    MADV_FREE means it's okay to discard the page under memory pressure
    because its content is *garbage*, so freeing such pages has almost zero
    overhead: we don't need to swap them out, and a later access causes just
    a minor fault. Thus, it makes sense to put those freeable pages on the
    inactive file LRU to compete with other used-once pages. It also makes
    sense from an implementation point of view, because the memory is no
    longer swap-backed until it is re-dirtied. It could even be a bonus that
    such pages can be reclaimed on a swapless system. However, MADV_COLD
    doesn't mean garbage, so reclaiming those pages requires swap-out/swap-in
    in the end, which is a bigger cost. Since VM LRU aging is designed around
    a cost model, cold anonymous pages are better positioned on the inactive
    anon LRU list, not the file LRU. Furthermore, it helps to avoid
    unnecessary scanning if the system doesn't have a swap device. Let's
    start with the simpler way, without adding complexity at this moment.
    However, keep in mind the caveat that workloads with a lot of page cache
    are likely to see MADV_COLD on anonymous memory effectively ignored,
    because we rarely age the anonymous LRU lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
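
    A minimal userspace usage sketch (assumes a kernel and libc headers that
    define MADV_COLD; addr/len describe a page-aligned mapping you own):

    #include <stdio.h>
    #include <sys/mman.h>

    /* Hint that [addr, addr + len) is cold: keep the contents, but make the
     * pages preferred reclaim candidates under memory pressure. */
    static int hint_cold(void *addr, size_t len)
    {
            if (madvise(addr, len, MADV_COLD) != 0) {
                    perror("madvise(MADV_COLD)");   /* e.g. EINVAL without kernel support */
                    return -1;
            }
            return 0;
    }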
     

25 Sep, 2019

2 commits

  • A later patch makes THP deferred split shrinker memcg aware, but it needs
    page->mem_cgroup information in THP destructor, which is called after
    mem_cgroup_uncharge() now.

    So move mem_cgroup_uncharge() from __page_cache_release() to compound page
    destructor, which is called by both THP and other compound pages except
    HugeTLB. And call it in __put_single_page() for single order page.

    Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • This is a cleanup patch that replaces two historical uses of
    list_move_tail() with relatively recent add_page_to_lru_list_tail().

    Link: http://lkml.kernel.org/r/20190716212436.7137-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jason Gunthorpe
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

15 Jul, 2019

2 commits

  • The stuff under sysctl describes /sys interface from userspace
    point of view. So, add it to the admin-guide and remove the
    :orphan: from its index file.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Rename the /proc/sys/ documentation files to ReST, using the
    README file as a template for an index.rst, adding the other
    files there via TOC markup.

    Despite being written on different times with different
    styles, try to make them somewhat coherent with a similar
    look and feel, ensuring that they'll look nice as both
    raw text file and as via the html output produced by the
    Sphinx build system.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

02 Jul, 2019

1 commit

  • release_pages() is an optimized version of a loop around put_page().
    Unfortunately for devmap pages the logic is not entirely correct in
    release_pages(). This is because device pages can be of more types than
    MEMORY_DEVICE_PUBLIC. There are in fact 4 types: private, public, FS DAX,
    and PCI P2PDMA. Some of these have specific needs when "putting" the page
    while others do not.

    This logic to handle any special needs is contained in
    put_devmap_managed_page(). Therefore all devmap pages should be processed
    by this function where we can contain the correct logic for a page put.

    Handle all device type pages within release_pages() by calling
    put_devmap_managed_page() on all devmap pages. If
    put_devmap_managed_page() returns true the page has been put and we
    continue with the next page. A false return of put_devmap_managed_page()
    means the page did not require special processing and should fall to
    "normal" processing.

    This was found via code inspection while determining if release_pages()
    and the new put_user_pages() could be interchangeable.[1]

    [1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com

    Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Reviewed-by: Dan Williams
    Reviewed-by: John Hubbard
    Signed-off-by: Ira Weiny
    Signed-off-by: Jason Gunthorpe

    Ira Weiny
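
    A hedged sketch of the handling this adds inside the release_pages() loop
    (locking details omitted):

    if (is_zone_device_page(page)) {
            /*
             * put_devmap_managed_page() returns true when it performed the
             * put for this devmap type; false means the page just needs the
             * normal put_page_testzero() path below.
             */
            if (put_devmap_managed_page(page))
                    continue;
    }

    if (!put_page_testzero(page))
            continue;
    /* ... normal release path ... */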
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • There is no function named munlock_vma_pages(). Correct it to
    munlock_vma_page().

    Link: http://lkml.kernel.org/r/20190402095609.27181-1-peng.fan@nxp.com
    Signed-off-by: Peng Fan
    Reviewed-by: Andrew Morton
    Reviewed-by: Mukesh Ojha
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peng Fan