20 Jan, 2021

1 commit

  • [ Upstream commit feb889fb40fafc6933339cf1cca8f770126819fb ]

    So technically there is nothing wrong with adding a pinned page to the
    swap cache, but the pinning obviously means that the page can't actually
    be free'd right now anyway, so it's a bit pointless.

    However, the real problem is not with it being a bit pointless: the real
    issue is that after we've added it to the swap cache, we'll try to unmap
    the page. That will succeed, because the code in mm/rmap.c doesn't know
    or care about pinned pages.

    Even the unmapping isn't fatal per se, since the page will stay around
    in memory due to the pinning, and we do hold the connection to it using
    the swap cache. But when we then touch it next and take a page fault,
    the logic in do_swap_page() will map it back into the process as a
    possibly read-only page, and we'll then break the page association on
    the next COW fault.

    Honestly, this issue could have been fixed in any of those other places:
    (a) we could refuse to unmap a pinned page (which makes conceptual
    sense), or (b) we could make sure to re-map a pinned page writably in
    do_swap_page(), or (c) we could just make do_wp_page() not COW the
    pinned page (which was what we historically did before that "mm:
    do_wp_page() simplification" commit).

    But while all of them are equally valid models for breaking this chain,
    not putting pinned pages into the swap cache in the first place is the
    simplest one by far.

    It's also the safest one: the reason why do_wp_page() was changed in the
    first place was that getting the "can I re-use this page" wrong is so
    fraught with errors. If you do it wrong, you end up with an incorrectly
    shared page.

    As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
    do_swap_page() would be a serious bug since it is only a (very good)
    heuristic. Re-using the page requires a hard black-and-white rule with
    no room for ambiguity.

    In contrast, saying "this page is very likely dma pinned, so let's not
    add it to the swap cache and try to unmap it" is an obviously safe thing
    to do, and if the heuristic might very rarely be a false positive, no
    harm is done.
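
As a rough illustration of that choice, here is a hedged sketch of the kind of check involved, placed in the swap-out path of shrink_page_list() before the page would enter the swap cache (a sketch in the spirit of the fix, not necessarily the verbatim upstream hunk):

/* mm/vmscan.c, shrink_page_list() swap-out path (illustrative sketch) */
if (PageAnon(page) && PageSwapBacked(page) && !PageSwapCache(page)) {
        if (!(sc->gfp_mask & __GFP_IO))
                goto keep_locked;
        /* Very likely pinned for DMA: don't swap-cache or unmap it. */
        if (page_maybe_dma_pinned(page))
                goto keep_locked;
        /* ...existing add_to_swap()/swap cache handling continues... */
}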

    Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
    Reported-and-tested-by: Martin Raiber
    Cc: Pavel Begunkov
    Cc: Jens Axboe
    Cc: Peter Xu
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
v2"), the code that checks the secondary MMU's page table access bit has
been broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from
the secondary MMU's page table before the check. This specifically
affects secondary MMUs which unmap the memory in
mmu_notifier_invalidate_range_start(), like KVM.

However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page
table access check before trying to unmap the page. So, at worst,
reclaim will miss accesses in a very short window if we remove the page
table access check from the unmapping code.

There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
reclaim: in memcg reclaim, page_referenced() only accounts accesses from
processes in the same memcg as the target page, but the unmapping code
considers accesses from all processes, decreasing the effectiveness of
memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
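
For reference, the access-bit handling this removes from the unmap path looked roughly like the following sketch of try_to_unmap_one() honouring !(TTU_IGNORE_ACCESS) (illustrative, not the verbatim upstream hunk):

/* mm/rmap.c, try_to_unmap_one() - roughly the check that goes away */
if (!(flags & TTU_IGNORE_ACCESS)) {
        /* Abort the unmap if the (notifier-aware) young bit was set. */
        if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                ret = false;
                page_vma_mapped_walk_done(&pvmw);
                break;
        }
}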

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

15 Nov, 2020

1 commit

  • Previously the negated unsigned long would be cast back to signed long
    which would have the correct negative value. After commit 730ec8c01a2b
    ("mm/vmscan.c: change prototype for shrink_page_list"), the large
    unsigned int converts to a large positive signed long.
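
A minimal user-space demonstration of that conversion pitfall (assuming a typical LP64 system where int is 32-bit and long is 64-bit):

#include <stdio.h>

int main(void)
{
        unsigned long nr_ul = 5;  /* old counter type */
        unsigned int  nr_ui = 5;  /* counter type after 730ec8c01a2b */

        long from_ul = -nr_ul;    /* wraps to ULONG_MAX - 4, reads back as -5 */
        long from_ui = -nr_ui;    /* 32-bit wrap becomes a large positive long */

        printf("-(unsigned long)5 as long: %ld\n", from_ul);  /* -5 */
        printf("-(unsigned int)5  as long: %ld\n", from_ui);  /* 4294967291 */
        return 0;
}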

    Symptoms include CMA allocations hanging forever holding the cma_mutex
    due to alloc_contig_range->...->isolate_migratepages_block waiting
    forever in "while (unlikely(too_many_isolated(pgdat)))".

    [akpm@linux-foundation.org: fix -stat.nr_lazyfree_fail as well, per Michal]

    Fixes: 730ec8c01a2b ("mm/vmscan.c: change prototype for shrink_page_list")
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vaneet Narang
    Cc: Maninder Singh
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: https://lkml.kernel.org/r/20201029032320.1448441-1-npiggin@gmail.com
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

17 Oct, 2020

2 commits


14 Oct, 2020

2 commits

  • fix comments for isolate_lru_page():
    s/fundamentnal/fundamental

    Signed-off-by: Hui Su
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200927173923.GA8058@rlk
    Signed-off-by: Linus Torvalds

    Hui Su
     
We have observed that drop_caches can take a considerable amount of
time, especially when many memcgs are involved, because they add
additional overhead.

    It is quite unfortunate that the operation cannot be interrupted by a
    signal currently. Add a check for fatal signals into the main loop so
    that userspace can control early bailout.

There are two reasons the loop keeps going:

1. We have too many memcgs: even if each memcg frees only one object,
the sum of freed objects is bigger than 10.

2. One pass over all memcgs takes a lot of time, so the memcgs traversed
first have accumulated many freeable objects again by the next pass, and
the freed count ends up bigger than 10 once more.

    We can get the following info through 'ps':

    root:~# ps -aux | grep drop
    root 357956 ... R Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
    root 1771385 ... R Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
    root 1986319 ... R 18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
    root 2002148 ... R Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
    root 2564666 ... R 18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
    root 2639347 ... R Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
    root 3904747 ... R 03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
    root 4016780 ... R Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches

    Use bpftrace follow 'freed' value in drop_slab_node:

    root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
    Attaching 1 probe...
    ^B^C

    @ret:
    [64, 128) 1 | |
    [128, 256) 28 | |
    [256, 512) 107 |@ |
    [512, 1K) 298 |@@@ |
    [1K, 2K) 613 |@@@@@@@ |
    [2K, 4K) 4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [4K, 8K) 442 |@@@@@ |
    [8K, 16K) 299 |@@@ |
    [16K, 32K) 100 |@ |
    [32K, 64K) 139 |@ |
    [64K, 128K) 56 | |
    [128K, 256K) 26 | |
    [256K, 512K) 2 | |

In the while loop, we can check whether a fatal signal is pending and,
if so, break out of the loop.
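
A hedged sketch of the resulting loop, roughly what drop_slab_node() looks like with the bailout (illustrative, not guaranteed to match the upstream code exactly):

void drop_slab_node(int nid)
{
        unsigned long freed;

        do {
                struct mem_cgroup *memcg = NULL;

                /* Let "echo 3 > /proc/sys/vm/drop_caches" be interrupted. */
                if (fatal_signal_pending(current))
                        return;

                freed = 0;
                memcg = mem_cgroup_iter(NULL, NULL, NULL);
                do {
                        freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
                } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
        } while (freed > 10);
}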

    Signed-off-by: Chunxin Zang
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20200909152047.27905-1-zangchunxin@bytedance.com
    Signed-off-by: Linus Torvalds

    Chunxin Zang
     

20 Sep, 2020

1 commit

  • check_move_unevictable_pages() is used in making unevictable shmem pages
    evictable: by shmem_unlock_mapping(), drm_gem_check_release_pagevec() and
    i915/gem check_release_pagevec(). Those may pass down subpages of a huge
    page, when /sys/kernel/mm/transparent_hugepage/shmem_enabled is "force".

    That does not crash or warn at present, but the accounting of vmstats
    unevictable_pgs_scanned and unevictable_pgs_rescued is inconsistent:
    scanned being incremented on each subpage, rescued only on the head (since
    tails already appear evictable once the head has been updated).

    5.8 commit 5d91f31faf8e ("mm: swap: fix vmstats for huge page") has
    established that vm_events in general (and unevictable_pgs_rescued in
    particular) should count every subpage: so follow that precedent here.

    Do this in such a way that if mem_cgroup_page_lruvec() is made stricter
    (to check page->mem_cgroup is always set), no problem: skip the tails
    before calling it, and add thp_nr_pages() to vmstats on the head.
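
A hedged sketch of the per-page loop in check_move_unevictable_pages() with the tail-skip and thp_nr_pages() accounting described above (illustrative, not the verbatim upstream hunk):

for (i = 0; i < pvec->nr; i++) {
        struct page *page = pvec->pages[i];
        int nr_pages;

        /* Tail pages become evictable once the head is moved; skip them. */
        if (PageTransTail(page))
                continue;

        nr_pages = thp_nr_pages(page);
        pgscanned += nr_pages;

        /* ...lruvec lookup and unevictable handling as before... */
        if (page_evictable(page)) {
                /* moved off the unevictable LRU */
                pgrescued += nr_pages;
        }
}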

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Yang Shi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301405000.5954@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Sep, 2020

1 commit

  • We've met softlockup with "CONFIG_PREEMPT_NONE=y", when the target memcg
    doesn't have any reclaimable memory.

    It can be easily reproduced as below:

watchdog: BUG: soft lockup - CPU#0 stuck for 111s! [memcg_test:2204]
    CPU: 0 PID: 2204 Comm: memcg_test Not tainted 5.9.0-rc2+ #12
    Call Trace:
    shrink_lruvec+0x49f/0x640
    shrink_node+0x2a6/0x6f0
    do_try_to_free_pages+0xe9/0x3e0
    try_to_free_mem_cgroup_pages+0xef/0x1f0
    try_charge+0x2c1/0x750
    mem_cgroup_charge+0xd7/0x240
    __add_to_page_cache_locked+0x2fd/0x370
    add_to_page_cache_lru+0x4a/0xc0
    pagecache_get_page+0x10b/0x2f0
    filemap_fault+0x661/0xad0
    ext4_filemap_fault+0x2c/0x40
    __do_fault+0x4d/0xf9
    handle_mm_fault+0x1080/0x1790

It only happens on our 1-vcpu instances, because there's no chance for
the oom reaper to run and reclaim the memory of the to-be-killed process.

Add a cond_resched() in the upper-level shrink_node_memcgs() to solve
this issue. This way we get a scheduling point for each memcg in the
reclaimed hierarchy, without any dependency on the reclaimable memory in
that memcg, thus making reclaim latency more predictable.
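
A hedged sketch of where the scheduling point lands (roughly shrink_node_memcgs(); illustrative, not the verbatim upstream hunk):

static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *target_memcg = sc->target_mem_cgroup;
        struct mem_cgroup *memcg;

        memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
        do {
                /*
                 * One scheduling point per memcg in the hierarchy, even
                 * if this memcg has nothing reclaimable.
                 */
                cond_resched();

                /* ...protection checks, shrink_lruvec(), shrink_slab()... */
        } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
}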

    Suggested-by: Michal Hocko
    Signed-off-by: Xunlei Pang
    Signed-off-by: Andrew Morton
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Link: http://lkml.kernel.org/r/1598495549-67324-1-git-send-email-xlpang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     

15 Aug, 2020

1 commit

  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

7 commits

  • Drop the repeated word "marked".
    Change "time time" to "same time".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-14-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
Now that workingset detection is implemented for the anonymous LRU, we no
longer need a large inactive list to detect frequently accessed pages
before they are reclaimed. This effectively reverts the temporary
measure put in by commit "mm/vmscan: make active/inactive ratio as 1:1
for anon lru".

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This patch implements workingset detection for anonymous LRU. All the
    infrastructure is implemented by the previous patches so this patch just
    activates the workingset detection by installing/retrieving the shadow
    entry and adding refault calculation.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
Workingset detection for anonymous pages will be implemented in the
following patch, and it requires storing the shadow entries in the
swapcache. This patch implements the infrastructure to store the shadow
entry in the swapcache.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1595490560-15117-5-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
To prepare for workingset detection on the anon LRU, this patch splits
the workingset event counters for refault, activate and restore into
anon and file variants, as well as the refaults counter in struct lruvec.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
In the current implementation, newly created or swapped-in anonymous
pages start on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, pages on the active list aren't
protected at all.

    Following is an example of this situation.

Assume 50 hot pages on the active list. Numbers denote the number of
pages on the active/inactive lists (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

This patch fixes the issue. As with the file LRU, newly created or
swapped-in anonymous pages are inserted on the inactive list and promoted
to the active list if they are referenced enough. This simple
modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

As you can see, the hot pages on the active list are now protected.

Note that this implementation has a drawback: a page cannot be promoted
and will be swapped out if its re-access interval is greater than the
size of the inactive list but less than the total size (active +
inactive). To address this, the following patch applies workingset
detection similar to the one already used for the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "workingset protection/detection on the anonymous LRU list", v7.

    * PROBLEM
In the current implementation, newly created or swapped-in anonymous
pages start on the active list. Growing the active list results in
rebalancing the active/inactive lists, so old pages on the active list
are demoted to the inactive list. Hence, hot pages on the active list
aren't protected at all.

    Following is an example of this situation.

Assume 50 hot pages on the active list and a system that can hold 100
pages in total. Numbers denote the number of pages on the
active/inactive lists (active | inactive). (h) stands for hot pages and
(uo) for used-once pages.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

As we can see, the hot pages are swapped out, which will cause swap-ins later.

    * SOLUTION
Since this is what we want to avoid, this patchset implements workingset
protection. As with the file LRU list, newly created or swapped-in
anonymous pages start on the inactive list. Also, as with the file LRU
list, if enough references happen, the page is promoted. This simple
modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

The hot pages remain on the active list. :)

    * EXPERIMENT
I tested this scenario on my test bed and confirmed that this problem
happens with the current implementation. I also checked that it is fixed
by this patchset.

    * SUBJECT
    workingset detection

    * PROBLEM
The later part of the patchset implements workingset detection for the
anonymous LRU list. There is a corner case in which workingset
protection could cause thrashing. If we can avoid that thrashing through
workingset detection, we get better performance.

    Following is an example of thrashing due to the workingset protection.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (will be hot) pages
    50(h) | 50(wh)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(wh)

    4. workload: 50 (will be hot) pages
    50(h) | 50(wh), swap-in 50(wh)

    5. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(wh)

    6. repeat 4, 5

Without workingset detection, this kind of workload's pages cannot be
promoted, and thrashing continues forever.

    * SOLUTION
Therefore, this patchset implements workingset detection. All the
infrastructure for workingset detection is already in place, so there is
not much work to do. First, extend the workingset detection code to deal
with the anonymous LRU list. Then, make the swap cache handle the
exceptional value for the shadow entry. Lastly, install/retrieve the
shadow value into/from the swap cache and check the refault distance.

    * EXPERIMENT
I made a test program that imitates the above scenario and confirmed
that the problem exists. Then I checked that this patchset fixes it.

My test setup is a virtual machine with 8 cpus and 6100 MB of memory,
but the amount of memory the test program can use is only about 280 MB,
because the system uses a large RAM-backed swap and a large ramdisk to
capture the trace.

The test scenario is as follows.

    1. allocate cold memory (512MB)
    2. allocate hot-1 memory (96MB)
    3. activate hot-1 memory (96MB)
    4. allocate another hot-2 memory (96MB)
    5. access cold memory (128MB)
    6. access hot-2 memory (96MB)
    7. repeat 5, 6

Since hot-1 memory (96MB) is on the active list, the inactive list can
contain roughly 190MB of pages. hot-2 memory's re-access interval
(96+128 MB) is more than 190MB, so it cannot be promoted without
workingset detection, and swap-in/out happens repeatedly. With this
patchset, workingset detection works and promotion happens, so
swap-in/out occurs less.

    Here is the result. (average of 5 runs)

    type swap-in swap-out
    base 863240 989945
    patch 681565 809273

As we can see, the patched kernel does fewer swap-ins/outs.

    * OVERALL TEST (ebizzy using modified random function)
ebizzy is a test program in which the main thread allocates lots of
memory and child threads access it randomly for a given amount of time.
Swap-ins will happen if the allocated memory is larger than the system
memory.

A random function following the zipf distribution is used to create
hot/cold memory, and the hot/cold ratio is controlled by a parameter: if
the parameter is high, hot memory is accessed much more often than cold
memory; if the parameter is low, the number of accesses to each memory
region is similar. I use various parameters in order to show the effect
of the patchset on workloads with various hot/cold ratios.

    My test setup is a virtual machine with 8 cpus, 1024 MB memory and 5120 MB
    ram swap.

The result format is as follows.

param: 1-1024-0.1
- 1 (number of threads)
- 1024 (allocated memory size, MB)
- 0.1 (zipf distribution alpha;
0.1 behaves roughly like uniform random,
1.3 means a small portion of memory is hot and the rest is cold)

    pswpin: smaller is better
    std: standard deviation
    improvement: negative is better

    * single thread
    param pswpin std improvement
    base 1-1024.0-0.1 14101983.40 79441.19
    prot 1-1024.0-0.1 14065875.80 136413.01 ( -0.26 )
    detect 1-1024.0-0.1 13910435.60 100804.82 ( -1.36 )
    base 1-1024.0-0.7 7998368.80 43469.32
    prot 1-1024.0-0.7 7622245.80 88318.74 ( -4.70 )
    detect 1-1024.0-0.7 7618515.20 59742.07 ( -4.75 )
    base 1-1024.0-1.3 1017400.80 38756.30
    prot 1-1024.0-1.3 940464.60 29310.69 ( -7.56 )
    detect 1-1024.0-1.3 945511.40 24579.52 ( -7.07 )
    base 1-1280.0-0.1 22895541.40 50016.08
    prot 1-1280.0-0.1 22860305.40 51952.37 ( -0.15 )
    detect 1-1280.0-0.1 22705565.20 93380.35 ( -0.83 )
    base 1-1280.0-0.7 13717645.60 46250.65
    prot 1-1280.0-0.7 12935355.80 64754.43 ( -5.70 )
    detect 1-1280.0-0.7 13040232.00 63304.00 ( -4.94 )
    base 1-1280.0-1.3 1654251.40 4159.68
    prot 1-1280.0-1.3 1522680.60 33673.50 ( -7.95 )
    detect 1-1280.0-1.3 1599207.00 70327.89 ( -3.33 )
    base 1-1536.0-0.1 31621775.40 31156.28
    prot 1-1536.0-0.1 31540355.20 62241.36 ( -0.26 )
    detect 1-1536.0-0.1 31420056.00 123831.27 ( -0.64 )
    base 1-1536.0-0.7 19620760.60 60937.60
    prot 1-1536.0-0.7 18337839.60 56102.58 ( -6.54 )
    detect 1-1536.0-0.7 18599128.00 75289.48 ( -5.21 )
    base 1-1536.0-1.3 2378142.40 20994.43
    prot 1-1536.0-1.3 2166260.60 48455.46 ( -8.91 )
    detect 1-1536.0-1.3 2183762.20 16883.24 ( -8.17 )
    base 1-1792.0-0.1 40259714.80 90750.70
    prot 1-1792.0-0.1 40053917.20 64509.47 ( -0.51 )
    detect 1-1792.0-0.1 39949736.40 104989.64 ( -0.77 )
    base 1-1792.0-0.7 25704884.40 69429.68
    prot 1-1792.0-0.7 23937389.00 79945.60 ( -6.88 )
    detect 1-1792.0-0.7 24271902.00 35044.30 ( -5.57 )
    base 1-1792.0-1.3 3129497.00 32731.86
    prot 1-1792.0-1.3 2796994.40 19017.26 ( -10.62 )
    detect 1-1792.0-1.3 2886840.40 33938.82 ( -7.75 )
    base 1-2048.0-0.1 48746924.40 50863.88
    prot 1-2048.0-0.1 48631954.40 24537.30 ( -0.24 )
    detect 1-2048.0-0.1 48509419.80 27085.34 ( -0.49 )
    base 1-2048.0-0.7 32046424.40 78624.22
    prot 1-2048.0-0.7 29764182.20 86002.26 ( -7.12 )
    detect 1-2048.0-0.7 30250315.80 101282.14 ( -5.60 )
    base 1-2048.0-1.3 3916723.60 24048.55
    prot 1-2048.0-1.3 3490781.60 33292.61 ( -10.87 )
    detect 1-2048.0-1.3 3585002.20 44942.04 ( -8.47 )

    * multi thread
    param pswpin std improvement
    base 8-1024.0-0.1 16219822.60 329474.01
    prot 8-1024.0-0.1 15959494.00 654597.45 ( -1.61 )
    detect 8-1024.0-0.1 15773790.80 502275.25 ( -2.75 )
    base 8-1024.0-0.7 9174107.80 537619.33
    prot 8-1024.0-0.7 8571915.00 385230.08 ( -6.56 )
    detect 8-1024.0-0.7 8489484.20 364683.00 ( -7.46 )
    base 8-1024.0-1.3 1108495.60 83555.98
    prot 8-1024.0-1.3 1038906.20 63465.20 ( -6.28 )
    detect 8-1024.0-1.3 941817.80 32648.80 ( -15.04 )
    base 8-1280.0-0.1 25776114.20 450480.45
    prot 8-1280.0-0.1 25430847.00 465627.07 ( -1.34 )
    detect 8-1280.0-0.1 25282555.00 465666.55 ( -1.91 )
    base 8-1280.0-0.7 15218968.00 702007.69
    prot 8-1280.0-0.7 13957947.80 492643.86 ( -8.29 )
    detect 8-1280.0-0.7 14158331.20 238656.02 ( -6.97 )
    base 8-1280.0-1.3 1792482.80 30512.90
    prot 8-1280.0-1.3 1577686.40 34002.62 ( -11.98 )
    detect 8-1280.0-1.3 1556133.00 22944.79 ( -13.19 )
    base 8-1536.0-0.1 33923761.40 575455.85
    prot 8-1536.0-0.1 32715766.20 300633.51 ( -3.56 )
    detect 8-1536.0-0.1 33158477.40 117764.51 ( -2.26 )
    base 8-1536.0-0.7 20628907.80 303851.34
    prot 8-1536.0-0.7 19329511.20 341719.31 ( -6.30 )
    detect 8-1536.0-0.7 20013934.00 385358.66 ( -2.98 )
    base 8-1536.0-1.3 2588106.40 130769.20
    prot 8-1536.0-1.3 2275222.40 89637.06 ( -12.09 )
    detect 8-1536.0-1.3 2365008.40 124412.55 ( -8.62 )
    base 8-1792.0-0.1 43328279.20 946469.12
    prot 8-1792.0-0.1 41481980.80 525690.89 ( -4.26 )
    detect 8-1792.0-0.1 41713944.60 406798.93 ( -3.73 )
    base 8-1792.0-0.7 27155647.40 536253.57
    prot 8-1792.0-0.7 24989406.80 502734.52 ( -7.98 )
    detect 8-1792.0-0.7 25524806.40 263237.87 ( -6.01 )
    base 8-1792.0-1.3 3260372.80 137907.92
    prot 8-1792.0-1.3 2879187.80 63597.26 ( -11.69 )
    detect 8-1792.0-1.3 2892962.20 33229.13 ( -11.27 )
    base 8-2048.0-0.1 50583989.80 710121.48
    prot 8-2048.0-0.1 49599984.40 228782.42 ( -1.95 )
    detect 8-2048.0-0.1 50578596.00 660971.66 ( -0.01 )
    base 8-2048.0-0.7 33765479.60 812659.55
    prot 8-2048.0-0.7 30767021.20 462907.24 ( -8.88 )
    detect 8-2048.0-0.7 32213068.80 211884.24 ( -4.60 )
    base 8-2048.0-1.3 3941675.80 28436.45
    prot 8-2048.0-1.3 3538742.40 76856.08 ( -10.22 )
    detect 8-2048.0-1.3 3579397.80 58630.95 ( -9.19 )

As we can see, all the cases show improvement. In particular, the test
cases with zipf distribution alpha 1.3 show larger improvements, which
means that the stronger the hot/cold tendency in anon pages, the better
this patchset works.

    This patch (of 6):

The current implementation of LRU management for anonymous pages has
some problems. The most important one is that it doesn't protect the
workingset, that is, the pages on the active LRU list. Although this
problem will be fixed by the following patches, some preparation is
required, and this patch does it.

What the following patches do is implement workingset protection. After
them, newly created or swapped-in pages will start their lifetime on the
inactive list. If the inactive list is too small, there is not enough
chance for those pages to be referenced, and they cannot become part of
the workingset.

In order to give newly created or swapped-in anonymous pages enough
chance to be referenced again, this patch makes the active/inactive LRU
ratio 1:1.

This is just a temporary measure. A later patch in the series introduces
workingset detection for the anonymous LRU, which will be used to better
decide whether pages should start on the active or the inactive list.
Afterwards, this patch is effectively reverted.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/1595490560-15117-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1595490560-15117-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

7 commits

The vmstat pgrefill is useful together with the pgscan and pgsteal stats
to measure reclaim efficiency. However, vmstat's pgrefill is not updated
consistently at the system level: it gets updated for both global and
memcg reclaim, whereas pgscan and pgsteal are updated only for global
reclaim. So, update pgrefill only for global reclaim. If someone is
interested in stats representing both system-level and memcg-level
reclaim, they should consult the root memcg's memory.stat instead of
/proc/vmstat.
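
A hedged sketch of the resulting counting in shrink_active_list() (illustrative; cgroup_reclaim() is the vmscan helper that distinguishes memcg-targeted reclaim from global reclaim):

/* Bump the node-level PGREFILL only for global reclaim; the memcg-level
 * counter is still updated unconditionally. */
if (!cgroup_reclaim(sc))
        __count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);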

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Acked-by: Yafang Shao
    Acked-by: Roman Gushchin
    Acked-by: Chris Down
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200711011459.1159929-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Change "optizimation" to "optimization".

    Signed-off-by: dylan-meiners
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200609185144.10049-1-spacct.spacct@gmail.com
    Signed-off-by: Linus Torvalds

    dylan-meiners
     
  • The global variable "vm_total_pages" is a relic from older days. There is
    only a single user that reads the variable - build_all_zonelists() - and
    the first thing it does is update it.

    Use a local variable in build_all_zonelists() instead and remove the
    global variable.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Huang Ying
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • When an outside process lowers one of the memory limits of a cgroup (or
    uses the force_empty knob in cgroup1), direct reclaim is performed in the
    context of the write(), in order to directly enforce the new limit and
    have it being met by the time the write() returns.

    Currently, this reclaim activity is accounted as memory pressure in the
    cgroup that the writer(!) belongs to. This is unexpected. It
    specifically causes problems for senpai
    (https://github.com/facebookincubator/senpai), which is an agent that
    routinely adjusts the memory limits and performs associated reclaim work
    in tens or even hundreds of cgroups running on the host. The cgroup that
    senpai is running in itself will report elevated levels of memory
    pressure, even though it itself is under no memory shortage or any sort of
    distress.

    Move the psi annotation from the central cgroup reclaim function to
    callsites in the allocation context, and thereby no longer count any
limit-setting reclaim as memory pressure. If the newly set limit pushes
the workload inside the cgroup into direct reclaim, that of course will
continue to count as memory pressure.
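
A hedged sketch of the callsite-level annotation, e.g. in the charge path of mm/memcontrol.c where reclaim is triggered by the workload's own allocation (illustrative, not the verbatim upstream hunk); the limit-setting write path calls the same reclaim function without this annotation:

unsigned long pflags;

/* Attribute the stall to the allocating task's cgroup: this reclaim is
 * caused by its own allocation, not by an outside limit write. */
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                            gfp_mask, true);
psi_memstall_leave(&pflags);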

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Chris Down
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200728135210.379885-2-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
mem_cgroup_protected is currently used both to set the effective low and
min values and to return a mem_cgroup_protection based on the result.
As a user, this can be a little unexpected: it appears to be a simple
predicate function, if not for the big warning in the comment above it
about the order in which it must be executed.

    This change makes it so that we separate the state mutations from the
    actual protection checks, which makes it more obvious where we need to be
    careful mutating internal state, and where we are simply checking and
    don't need to worry about that.
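
A hedged sketch of the resulting calling convention in the reclaim path, with the state mutation done once and followed by plain predicate checks (illustrative; treat the exact control flow as an assumption):

/* mm/vmscan.c, shrink_node_memcgs() (sketch) */
mem_cgroup_calculate_protection(target_memcg, memcg);

if (mem_cgroup_below_min(memcg)) {
        /* Hard protection: never reclaim below memory.min. */
        continue;
} else if (mem_cgroup_below_low(memcg)) {
        /* Soft protection: honour memory.low unless reclaim is stuck. */
        if (!sc->memcg_low_reclaim) {
                sc->memcg_low_skipped = 1;
                continue;
        }
        memcg_memory_event(memcg, MEMCG_LOW);
}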

    [mhocko@suse.com - don't check protection on root memcgs]

    Suggested-by: Johannes Weiner
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/ff3f915097fcee9f6d7041c084ef92d16aaeb56a.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • Patch series "mm, memcg: memory.{low,min} reclaim fix & cleanup", v4.

This series contains a fix for an edge case in my earlier protection
calculation patches, and a patch to make the area overall a little more
robust, to hopefully help avoid this in future.

    This patch (of 2):

    A cgroup can have both memory protection and a memory limit to isolate it
    from its siblings in both directions - for example, to prevent it from
    being shrunk below 2G under high pressure from outside, but also from
    growing beyond 4G under low pressure.

    Commit 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    implemented proportional scan pressure so that multiple siblings in excess
    of their protection settings don't get reclaimed equally but instead in
    accordance to their unprotected portion.

    During limit reclaim, this proportionality shouldn't apply of course:
    there is no competition, all pressure is from within the cgroup and should
    be applied as such. Reclaim should operate at full efficiency.

    However, mem_cgroup_protected() never expected anybody to look at the
    effective protection values when it indicated that the cgroup is above its
    protection. As a result, a query during limit reclaim may return stale
    protection values that were calculated by a previous reclaim cycle in
    which the cgroup did have siblings.

    When this happens, reclaim is unnecessarily hesitant and potentially slow
    to meet the desired limit. In theory this could lead to premature OOM
    kills, although it's not obvious this has occurred in practice.

Work around the problem by special-casing reclaim roots in
mem_cgroup_protection. These memcgs never participate in the reclaim
protection because the reclaim is internal.

    We have to ignore effective protection values for reclaim roots because
    mem_cgroup_protected might be called from racing reclaim contexts with
    different roots. Calculation is relying on root -> leaf tree traversal
    therefore top-down reclaim protection invariants should hold. The only
    exception is the reclaim root which should have effective protection set
    to 0 but that would be problematic for the following setup:

    Let's have global and A's reclaim in parallel:
    |
    A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
    |\
    | C (low = 1G, usage = 2.5G)
    B (low = 1G, usage = 0.5G)

    for A reclaim we have
    B.elow = B.low
    C.elow = C.low

    For the global reclaim
    A.elow = A.low
B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
C.elow = min(C.usage, C.low)

With the effective values resetting we have A reclaim
A.elow = 0
B.elow = B.low
C.elow = C.low

and global reclaim could see the above and then
B.elow = C.elow = 0 because children_low_usage > A.elow

Which means that protected memcgs would get reclaimed.

In the future we would like to make mem_cgroup_protected more robust
against racing reclaim contexts, but that is likely a more complex
solution than this simple workaround.
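
A hedged sketch of that special case (illustrative; roughly what mem_cgroup_protection() can do once it is told the reclaim root, with the field names taken as assumptions):

static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
                                                  struct mem_cgroup *memcg,
                                                  bool in_low_reclaim)
{
        if (mem_cgroup_disabled())
                return 0;

        /* Targeted (limit) reclaim: no protection inside the reclaim root,
         * and don't trust possibly stale effective values. */
        if (root == memcg)
                return 0;

        if (in_low_reclaim)
                return READ_ONCE(memcg->memory.emin);

        return max(READ_ONCE(memcg->memory.emin),
                   READ_ONCE(memcg->memory.elow));
}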

    [hannes@cmpxchg.org - large part of the changelog]
    [mhocko@suse.com - workaround explanation]
    [chris@chrisdown.name - retitle]

    Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
    Signed-off-by: Yafang Shao
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Chris Down
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/cover.1594638158.git.chris@chrisdown.name
    Link: http://lkml.kernel.org/r/044fb8ecffd001c7905d27c0c2ad998069fdc396.1594638158.git.chris@chrisdown.name
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

Internally, global and per-node counters are stored in pages, whereas
memcg and lruvec counters are stored in bytes. This scheme may look odd,
but only for now. Once slab pages are shared between multiple cgroups,
global and node counters will reflect the total number of slab pages.
However, memcg and lruvec counters will be used for per-memcg slab memory
tracking, which accounts individual kernel objects. Keeping global and
node counters in pages helps to avoid additional overhead.

The size of slab memory shouldn't exceed 4 GB on 32-bit machines, so it
will fit into the atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

26 Jun, 2020

1 commit

  • Patch series "fix for "mm: balance LRU lists based on relative
    thrashing" patchset"

This patchset fixes some problems of the patchset "mm: balance LRU
lists based on relative thrashing", which is now merged in mainline.

    Patch "mm: workingset: let cache workingset challenge anon fix" is the
    result of discussion with Johannes. See following link.

    http://lkml.kernel.org/r/20200520232525.798933-6-hannes@cmpxchg.org

The other two are minor issues found while rebasing my patchset.

    This patch (of 3):

    After ("mm: workingset: let cache workingset challenge anon fix"), we
    compare refault distances to active_file + anon. But age of the
    non-resident information is only driven by the file LRU. As a result,
    we may overestimate the recency of any incoming refaults and activate
    them too eagerly, causing unnecessary LRU churn in certain situations.

    Make anon aging drive nonresident age as well to address that.

    Link: http://lkml.kernel.org/r/1592288204-27734-1-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1592288204-27734-2-git-send-email-iamjoonsoo.kim@lge.com
    Fixes: 34e58cac6d8f2a ("mm: workingset: let cache workingset challenge anon")
    Reported-by: Joonsoo Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Joonsoo Kim
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2020

1 commit

  • There are some typos, fix them.

    s/regsitration/registration
    s/santity/sanity
    s/decremeting/decrementing

    Signed-off-by: Ethon Paul
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200411071544.16222-1-ethp@qq.com
    Signed-off-by: Linus Torvalds

    Ethon Paul
     

04 Jun, 2020

14 commits

  • When LRU cost only shows up on one list, we abruptly stop scanning that
    list altogether. That's an extreme reaction: by the time the other list
    starts thrashing and the pendulum swings back, we may have no recent age
    information on the first list anymore, and we could have significant
    latencies until the scanner has caught up.

    Soften this change in the feedback system by ensuring that no list
    receives less than a third of overall pressure, and only distribute the
    other 66% according to LRU cost. This ensures that we maintain a minimum
    rate of aging on the entire workingset while it's being pressured, while
    still allowing a generous rate of convergence when the relative sizes of
    the lists need to adjust.
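
A hedged sketch of the resulting fraction math in get_scan_count() (illustrative; adding the combined cost to each side is what guarantees every list at least roughly a third of the pressure):

total_cost = sc->anon_cost + sc->file_cost;
anon_cost = total_cost + sc->anon_cost;
file_cost = total_cost + sc->file_cost;
total_cost = anon_cost + file_cost;

ap = swappiness * (total_cost + 1);
ap /= anon_cost + 1;

fp = (200 - swappiness) * (total_cost + 1);
fp /= file_cost + 1;

fraction[0] = ap;
fraction[1] = fp;
denominator = ap + fp;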

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-15-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM tries to balance reclaim pressure between anon and file so as to
    reduce the amount of IO incurred due to the memory shortage. It already
    counts refaults and swapins, but in addition it should also count
    writepage calls during reclaim.

    For swap, this is obvious: it's IO that wouldn't have occurred if the
    anonymous memory hadn't been under memory pressure. From a relative
    balancing point of view this makes sense as well: even if anon is cold and
    reclaimable, a cache that isn't thrashing may have equally cold pages that
    don't require IO to reclaim.

    For file writeback, it's trickier: some of the reclaim writepage IO would
    have likely occurred anyway due to dirty expiration. But not all of it -
    premature writeback reduces batching and generates additional writes.
    Since the flushers are already woken up by the time the VM starts writing
cache pages one by one, let's assume that we're likely causing writes that
    wouldn't have happened without memory pressure. In addition, the per-page
    cost of IO would have probably been much cheaper if written in larger
    batches from the flusher thread rather than the single-page-writes from
    kswapd.

    For our purposes - getting the trend right to accelerate convergence on a
    stable state that doesn't require paging at all - this is sufficiently
    accurate. If we later wanted to optimize for sustained thrashing, we can
    still refine the measurements.

    Count all writepage calls from kswapd as IO cost toward the LRU that the
    page belongs to.

    Why do this dynamically? Don't we know in advance that anon pages require
    IO to reclaim, and so could build in a static bias?

    First, scanning is not the same as reclaiming. If all the anon pages are
    referenced, we may not swap for a while just because we're scanning the
    anon list. During this time, however, it's important that we age
    anonymous memory and the page cache at the same rate so that their
    hot-cold gradients are comparable. Everything else being equal, we still
    want to reclaim the coldest memory overall.

    Second, we keep copies in swap unless the page changes. If there is
    swap-backed data that's mostly read (tmpfs file) and has been swapped out
    before, we can reclaim it without incurring additional IO.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-14-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We split the LRU lists into anon and file, and we rebalance the scan
    pressure between them when one of them begins thrashing: if the file cache
    experiences workingset refaults, we increase the pressure on anonymous
    pages; if the workload is stalled on swapins, we increase the pressure on
    the file cache instead.

    With cgroups and their nested LRU lists, we currently don't do this
    correctly. While recursive cgroup reclaim establishes a relative LRU
    order among the pages of all involved cgroups, LRU pressure balancing is
    done on an individual cgroup LRU level. As a result, when one cgroup is
    thrashing on the filesystem cache while a sibling may have cold anonymous
    pages, pressure doesn't get equalized between them.

    This patch moves LRU balancing decision to the root of reclaim - the same
    level where the LRU order is established.

    It does this by tracking LRU cost recursively, so that every level of the
    cgroup tree knows the aggregate LRU cost of all memory within its domain.
    When the page scanner calculates the scan balance for any given individual
    cgroup's LRU list, it uses the values from the ancestor cgroup that
    initiated the reclaim cycle.

    If one sibling is then thrashing on the cache, it will tip the pressure
    balance inside its ancestors, and the next hierarchical reclaim iteration
    will go more after the anon pages in the tree.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since the LRUs were split into anon and file lists, the VM has been
    balancing between page cache and anonymous pages based on per-list ratios
    of scanned vs. rotated pages. In most cases that tips page reclaim
    towards the list that is easier to reclaim and has the fewest actively
    used pages, but there are a few problems with it:

    1. Refaults and LRU rotations are weighted the same way, even though
    one costs IO and the other costs a bit of CPU.

    2. The less we scan an LRU list based on already observed rotations,
    the more we increase the sampling interval for new references, and
    rotations become even more likely on that list. This can enter a
    death spiral in which we stop looking at one list completely until
    the other one is all but annihilated by page reclaim.

    Since commit a528910e12ec ("mm: thrash detection-based file cache sizing")
    we have refault detection for the page cache. Along with swapin events,
    they are good indicators of when the file or anon list, respectively, is
    too small for its workingset and needs to grow.

    For example, if the page cache is thrashing, the cache pages need more
    time in memory, while there may be colder pages on the anonymous list.
    Likewise, if swapped pages are faulting back in, it indicates that we
    reclaim anonymous pages too aggressively and should back off.

    Replace LRU rotations with refaults and swapins as the basis for relative
    reclaim cost of the two LRUs. This will have the VM target list balances
    that incur the least amount of IO on aggregate.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When shrinking the active file list we rotate referenced pages only when
    they're in an executable mapping. The others get deactivated. When it
    comes to balancing scan pressure, though, we count all referenced pages as
    rotated, even the deactivated ones. Yet they do not carry the same cost
    to the system: the deactivated page *might* refault later on, but the
    deactivation is tangible progress toward freeing pages; rotations on the
    other hand cost time and effort without getting any closer to freeing
    memory.

    Don't treat both events as equal. The following patch will hook up LRU
    balancing to cache and anon refaults, which are a much more concrete cost
    signal for reclaiming one list over the other. Thus, remove the maybe-IO
    cost bias from page references, and only note the CPU cost for actual
    rotations that prevent the pages from getting reclaimed.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently, scan pressure between the anon and file LRU lists is balanced
    based on a mixture of reclaim efficiency and a somewhat vague notion of
    "value" of having certain pages in memory over others. That concept of
    value is problematic, because it has caused us to count any event that
remotely makes one LRU list more or less preferable for reclaim, even
    when these events are not directly comparable and impose very different
    costs on the system. One example is referenced file pages that we still
    deactivate and referenced anonymous pages that we actually rotate back to
    the head of the list.

    There is also conceptual overlap with the LRU algorithm itself. By
    rotating recently used pages instead of reclaiming them, the algorithm
    already biases the applied scan pressure based on page value. Thus, when
    rebalancing scan pressure due to rotations, we should think of reclaim
    cost, and leave assessing the page value to the LRU algorithm.

    Lastly, considering both value-increasing as well as value-decreasing
    events can sometimes cause the same type of event to be counted twice,
    i.e. how rotating a page increases the LRU value, while reclaiming it
successfully decreases the value. In itself this will balance out fine,
    but it quietly skews the impact of events that are only recorded once.

    The abstract metric of "value", the murky relationship with the LRU
    algorithm, and accounting both negative and positive events make the
    current pressure balancing model hard to reason about and modify.

    This patch switches to a balancing model of accounting the concrete,
    actually observed cost of reclaiming one LRU over another. For now, that
    cost includes pages that are scanned but rotated back to the list head.
    Subsequent patches will add consideration for IO caused by refaulting of
    recently evicted pages.

    Replace struct zone_reclaim_stat with two cost counters in the lruvec, and
    make everything that affects cost go through a new lru_note_cost()
    function.
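
A hedged sketch of the new hook (illustrative; the upstream version additionally decays the counters so that old events age out):

/* One funnel for everything that makes an LRU costlier to reclaim:
 * rotations for now, refault IO in later patches. */
void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
{
        if (file)
                lruvec->file_cost += nr_pages;
        else
                lruvec->anon_cost += nr_pages;
}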

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-9-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When we calculate the relative scan pressure between the anon and file LRU
    lists, we have to assume that reclaim_stat can contain zeroes. To avoid
    div0 crashes, we add 1 to all denominators like so:

    anon_prio = swappiness;
    file_prio = 200 - anon_prio;

    [...]

    /*
    * The amount of pressure on anon vs file pages is inversely
    * proportional to the fraction of recently scanned pages on
    * each list that were recently referenced and in active use.
    */
    ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
    ap /= reclaim_stat->recent_rotated[0] + 1;

    fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
    fp /= reclaim_stat->recent_rotated[1] + 1;
    spin_unlock_irq(&pgdat->lru_lock);

    fraction[0] = ap;
    fraction[1] = fp;
    denominator = ap + fp + 1;

    While reclaim_stat can contain 0, it's not actually possible for ap + fp
    to be 0. One of anon_prio or file_prio could be zero, but they must still
    add up to 200. And the reclaim_stat fraction, due to the +1 in there, is
    always at least 1. So if one of the two numerators is 0, the other one
    can't be. ap + fp is always at least 1. Drop the + 1.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-8-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the advent of fast random IO devices (SSDs, PMEM) and in-memory swap
    devices such as zswap, it's possible for swap to be much faster than
    filesystems, and for swapping to be preferable over thrashing filesystem
    caches.

    Allow setting swappiness - which defines the rough relative IO cost of
    cache misses between page cache and swap-backed pages - to reflect such
    situations by making the swap-preferred range configurable.
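
A hedged sketch of what the configurable range amounts to, as a sysctl-table excerpt (the excerpt name is made up for illustration; the key point is the 0-200 range, where 100 keeps the old "swap and file IO cost the same" meaning):

/* kernel/sysctl.c (sketch) */
static int two_hundred = 200;

static struct ctl_table vm_table_excerpt[] = {
        {
                .procname     = "swappiness",
                .data         = &vm_swappiness,
                .maxlen       = sizeof(vm_swappiness),
                .mode         = 0644,
                .proc_handler = proc_dointvec_minmax,
                .extra1       = SYSCTL_ZERO,
                .extra2       = &two_hundred,   /* previously capped at 100 */
        },
        { }
};

With that in place, writing a value above 100 to /proc/sys/vm/swappiness expresses that swap-in IO is considered cheaper than refaulting the page cache.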

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Having statistics on pages scanned and pages reclaimed for both anon and
    file pages makes it easier to evaluate changes to LRU balancing.

    While at it, clean up the stat-keeping mess for isolation, putback,
    reclaim stats etc. a bit: first the physical LRU operation (isolation and
    putback), followed by vmstats, reclaim_stats, and then vm events.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200520232525.798933-3-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
try_to_compact_zone() has been replaced by try_to_compact_pages(), so
the comment in should_continue_reclaim() needs to be updated accordingly.

    Signed-off-by: Qiwu Chen
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200501034907.22991-1-chenqiwu@xiaomi.com
    Signed-off-by: Linus Torvalds

    Qiwu Chen
     
Commit 3c710c1ad11b ("mm, vmscan: extract shrink_page_list reclaim
counters into a struct") changed the data type for the function, so
change the return type of the function and its caller accordingly.

    Signed-off-by: Vaneet Narang
    Signed-off-by: Maninder Singh
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Amit Sahrawat
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/1588168259-25604-1-git-send-email-maninder1.s@samsung.com
    Signed-off-by: Linus Torvalds

    Maninder Singh
     
  • Fix an nr_isolate_* mismatch problem between cma and dirty lazyfree pages.

If try_to_unmap_one is used for reclaim and it detects a dirty lazyfree
page, then the lazyfree page is changed back to a normal anon page with
SwapBacked set, per commit 802a3a92ad7a ("mm: reclaim MADV_FREE pages").
Even with that change, the reclaim context correctly counts isolated
files because it uses is_file_lru to distinguish them. And the change to
anon does not happen if try_to_unmap_one is used for migration, so a
migration context like compaction also correctly counts isolated files
even though it uses page_is_file_lru instead of is_file_lru. Recently
page_is_file_cache was renamed to page_is_file_lru by commit 9de4f22a60f7
("mm: code cleanup for MADV_FREE").

But the nr_isolate_* mismatch problem happens on cma alloc. There is
reclaim_clean_pages_from_list, which is used only by cma. It was
introduced by commit 02c6de8d757c ("mm: cma: discard clean pages during
contiguous allocation instead of migration") to reclaim clean file pages
without migration. The cma alloc path uses both
reclaim_clean_pages_from_list and migrate_pages, and it uses
page_is_file_lru to count isolated files. If there are dirty lazyfree
pages allocated from the cma memory region, the pages are counted as
isolated file at the beginning but are counted as isolated anon after
finishing.

    Mem-Info:
    Node 0 active_anon:3045904kB inactive_anon:611448kB active_file:14892kB inactive_file:205636kB unevictable:10416kB isolated(anon):0kB isolated(file):37664kB mapped:630216kB dirty:384kB writeback:0kB shmem:42576kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

As in the log above, there were far too many isolated files, 37664kB,
which triggers too_many_isolated in reclaim even when there are no
actually isolated file pages system wide. It can be reproduced by
running two programs, one writing to MADV_FREE pages and the other doing
cma alloc, respectively. Although isolated anon shows as 0, I found that
the internal value of isolated anon was the negative of the isolated
file count.

    Fix this by compensating the isolated count for both LRU lists. Count
    non-discarded lazyfree pages in shrink_page_list, then compensate the
    counted number in reclaim_clean_pages_from_list.

    Reported-by: Yong-Taek Lee
    Suggested-by: Minchan Kim
    Signed-off-by: Jaewon Kim
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200426011718.30246-1-jaewon31.kim@samsung.com
    Signed-off-by: Linus Torvalds

    Jaewon Kim
     
  • We already defined the helper update_lru_size().

    Let's use this to reduce code duplication.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20200331221550.1011-1-richard.weiyang@gmail.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • None of the three callers of get_compound_page_dtor() want to know the
    value; they just want to call the function. Replace it with
    destroy_compound_page() which calls the dtor for them.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Kirill A. Shutemov
    Link: http://lkml.kernel.org/r/20200517105051.9352-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)