03 Dec, 2016

1 commit

  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    implementation is not correct, though, because memcg_kmem_enabled()
    might become true while we are doing global reclaim, when sc->memcg is
    NULL, which is exactly what Marek has seen:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
    IP: [] mem_cgroup_node_nr_lru_pages+0x20/0x40
    PGD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1
    task: ffff880011863b00 task.stack: ffff880011868000
    RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP: e02b:ffff88001186bc70 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
    RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
    R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
    R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
    Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
    Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
    RIP mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP
    CR2: 0000000000000400
    ---[ end trace 100494b9edbdfc4d ]---

    This patch fixes the issue by checking sc->memcg rather than
    memcg_kmem_enabled(), which is sufficient because shrink_slab() makes
    sure that only memcg-aware shrinkers get a non-NULL memcg, and only
    when memcg_kmem_enabled() is true.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Marek Marczykowski-Górecki
    Tested-by: Marek Marczykowski-Górecki
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Oct, 2016

1 commit

  • Antonio reports the following crash when using fuse under memory pressure:

    kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: all of them
    CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
    Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
    task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
    RIP: shadow_lru_isolate+0x181/0x190
    Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

    which corresponds to the following sanity check in the shadow node
    tracking:

    BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

    The workingset code tracks radix tree nodes that exclusively contain
    shadow entries of evicted pages in them, and this (somewhat obscure)
    line checks whether there are real pages left that would interfere with
    reclaim of the radix tree node under memory pressure.

    While discussing ways how fuse might sneak pages into the radix tree
    past the workingset code, Miklos pointed to replace_page_cache_page(),
    and indeed there is a problem there: it properly accounts for the old
    page being removed - __delete_from_page_cache() does that - but then
    does a raw radix_tree_insert(), not accounting for the replacement
    page. Eventually the page count bits in node->count underflow, leaving
    the node incorrectly linked to the shadow node LRU.

    To address this, make sure replace_page_cache_page() uses the tracked
    page insertion code, page_cache_tree_insert(). This fixes the page
    accounting and makes sure page-containing nodes are properly unlinked
    from the shadow node LRU again.

    Also, make the sanity checks a bit less obscure by using the helpers for
    checking the number of pages and shadows in a radix tree node.

    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Antonio SJ Musumeci
    Debugged-by: Miklos Szeredi
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

29 Jul, 2016

6 commits

  • Working set and refault detection are still zone-based; fix them.

    Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patchset: "Move LRU page reclaim from zones to nodes v9"

    This series moves LRUs from the zones to the node. While this is a
    current rebase, the test results were based on mmotm as of June 23rd.
    Conceptually, this series is simple but there are a lot of details.
    Some of the broad motivations for this are;

    1. The residency of a page partially depends on what zone the page was
    allocated from. This is partially combated by the fair zone allocation
    policy, but that is only a partial solution and it introduces overhead
    in the page allocator paths.

    2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
    example, direct reclaim scans in zonelist order and reclaims even if
    the zone is over the high watermark regardless of the age of pages
    in that LRU. Kswapd on the other hand starts reclaim on the highest
    unbalanced zone. A difference in the distribution of file/anon pages
    due to when they were allocated can result in a difference in aging.
    While the fair zone allocation policy mitigates some of the problems
    here, the page reclaim results on a multi-zone node will always be
    different to a single-zone node. As a result, behaviour can differ
    depending on which node a process was scheduled on.

    3. kswapd and the page allocator scan zones in opposite orders to avoid
    interfering with each other. In the ideal case this mitigates the page
    allocator using pages that were freed very recently, but it's sensitive
    to timing. When kswapd is reclaiming from lower zones it works well, but
    during the rebalancing of the highest zone, the page allocator and
    kswapd interfere with each other. It's worse if the highest zone is
    small and difficult to balance.

    4. slab shrinkers are node-based which makes it harder to identify the exact
    relationship between slab reclaim and LRU reclaim.

    The reason we have zone-based reclaim is that we used to have
    large highmem zones in common configurations and it was necessary
    to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
    less of a concern as machines with lots of memory will (or should) use
    64-bit kernels. Combinations of 32-bit hardware and large amounts of
    memory are rare. Machines that do use highmem should have lower
    highmem:lowmem ratios than those we worried about in the past.

    Conceptually, moving to node LRUs should be easier to understand. The
    page allocator plays fewer tricks to game reclaim and reclaim behaves
    similarly on all nodes.

    The series has been tested on a 16 core UMA machine and a 2-socket 48
    core NUMA machine. The UMA results are presented in most cases as the NUMA
    machine behaved similarly.

    pagealloc
    ---------

    This is a microbenchmark that shows the benefit of removing the fair zone
    allocation policy. It was tested up to order-4 but only orders 0 and 1 are
    shown as the other orders were comparable.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
    Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
    Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
    Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
    Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
    Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
    Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
    Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
    Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
    Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
    Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
    Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
    Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
    Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
    Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
    Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
    Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
    Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
    Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
    Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
    Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
    Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
    Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
    Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
    Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
    Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
    Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

    This shows a steady improvement throughout. The primary benefit is from
    reduced system CPU usage which is obvious from the overall times;

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    User 189.19 191.80
    System 2604.45 2533.56
    Elapsed 2855.30 2786.39

    The vmstats also showed that the fair zone allocation policy was definitely
    removed as can be seen here;

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v8
    DMA32 allocs 28794729769 0
    Normal allocs 48432501431 77227309877
    Movable allocs 0 0

    tiobench on ext4
    ----------------

    tiobench is a benchmark that artificially benefits if old pages remain resident
    while new pages get reclaimed. The fair zone allocation policy mitigates this
    problem so pages age fairly. While the benchmark has problems, it is important
    that tiobench performance remains constant as it implies that page aging
    problems that the fair zone allocation policy fixes are not re-introduced.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
    Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
    Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
    Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
    Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
    Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
    Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
    Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
    Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
    Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
    Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
    Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
    Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
    Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
    Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
    Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
    Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
    Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
    Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
    Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
    Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 approx-v9
    User 645.72 525.90
    System 403.85 331.75
    Elapsed 6795.36 6783.67

    This shows that the series has little or no impact on tiobench, which is
    desirable, as well as a reduction in system CPU usage. It indicates that the fair
    zone allocation policy was removed in a manner that didn't reintroduce
    one class of page aging bug. There were only minor differences in overall
    reclaim activity

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Minor Faults 645838 647465
    Major Faults 573 640
    Swap Ins 0 0
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 46041453 44190646
    Normal allocs 78053072 79887245
    Movable allocs 0 0
    Allocation stalls 24 67
    Stall zone DMA 0 0
    Stall zone DMA32 0 0
    Stall zone Normal 0 2
    Stall zone HighMem 0 0
    Stall zone Movable 0 65
    Direct pages scanned 10969 30609
    Kswapd pages scanned 93375144 93492094
    Kswapd pages reclaimed 93372243 93489370
    Direct pages reclaimed 10969 30609
    Kswapd efficiency 99% 99%
    Kswapd velocity 13741.015 13781.934
    Direct efficiency 100% 100%
    Direct velocity 1.614 4.512
    Percentage direct scans 0% 0%

    kswapd activity was roughly comparable. There were differences in direct
    reclaim activity but negligible in the context of the overall workload
    (velocity of 4 pages per second with the patches applied, 1.6 pages per
    second in the baseline kernel).

    pgbench read-only large configuration on ext4
    ---------------------------------------------

    pgbench is a database benchmark that can be sensitive to page reclaim
    decisions. This also checks if removing the fair zone allocation policy
    is safe

    pgbench Transactions
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
    Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
    Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
    Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
    Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
    Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

    Negligible differences again. As with tiobench, overall reclaim activity
    was comparable.

    bonnie++ on ext4
    ----------------

    No interesting performance difference, negligible differences on reclaim
    stats.

    paralleldd on ext4
    ------------------

    This workload uses varying numbers of dd instances to read large amounts of
    data from disk.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
    Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
    Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
    Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
    Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
    Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    User 1548.01 1552.44
    System 8609.71 8515.08
    Elapsed 3587.10 3594.54

    There is little or no change in performance but some drop in system CPU usage.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Minor Faults 362662 367360
    Major Faults 1204 1143
    Swap Ins 22 0
    Swap Outs 2855 1029
    DMA allocs 0 0
    DMA32 allocs 31409797 28837521
    Normal allocs 46611853 49231282
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 40845270 40869088
    Kswapd pages reclaimed 40830976 40855294
    Direct pages reclaimed 0 0
    Kswapd efficiency 99% 99%
    Kswapd velocity 11386.711 11369.769
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Page writes by reclaim 2855 1029
    Page writes file 0 0
    Page writes anon 2855 1029
    Page reclaim immediate 771 1628
    Sector Reads 293312636 293536360
    Sector Writes 18213568 18186480
    Page rescued immediate 0 0
    Slabs scanned 128257 132747
    Direct inode steals 181 56
    Kswapd inode steals 59 1131

    It basically shows that kswapd was active at roughly the same rate in
    both kernels. There was also comparable slab scanning activity and direct
    reclaim was avoided in both cases. There appears to be a large difference
    in the number of inodes reclaimed, but the workload has few active
    inodes so this is likely a timing artifact.

    stutter
    -------

    stutter simulates a simple workload. One part uses a lot of anonymous
    memory, a second measures mmap latency and a third copies a large file.
    The primary metric is checking for mmap latency.

    stutter
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
    1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
    2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
    3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
    Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
    Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
    Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
    Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
    Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
    Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
    Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
    Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
    Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
    Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
    Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
    Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
    Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

    This shows a number of improvements with the worst-case outlier greatly
    improved.

    Some of the vmstats are interesting

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Swap Ins 163 502
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 618719206 1381662383
    Normal allocs 891235743 564138421
    Movable allocs 0 0
    Allocation stalls 2603 1
    Direct pages scanned 216787 2
    Kswapd pages scanned 50719775 41778378
    Kswapd pages reclaimed 41541765 41777639
    Direct pages reclaimed 209159 0
    Kswapd efficiency 81% 99%
    Kswapd velocity 16859.554 14329.059
    Direct efficiency 96% 0%
    Direct velocity 72.061 0.001
    Percentage direct scans 0% 0%
    Page writes by reclaim 6215049 0
    Page writes file 6215049 0
    Page writes anon 0 0
    Page reclaim immediate 70673 90
    Sector Reads 81940800 81680456
    Sector Writes 100158984 98816036
    Page rescued immediate 0 0
    Slabs scanned 1366954 22683

    While this is not guaranteed in all cases, this particular test showed
    a large reduction in direct reclaim activity. It's also worth noting
    that no page writes were issued from reclaim context.

    This series is not without its hazards. There are at least three areas
    that I'm concerned with, even though I could not reproduce any problems
    in those areas.

    1. Reclaim/compaction is going to be affected because the amount of reclaim is
    no longer targeted at a specific zone. Compaction works on a per-zone basis
    so there is no guarantee that reclaiming a few THPs' worth of pages will
    have a positive impact on compaction success rates.

    2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
    are called is now different. This may or may not be a problem but if it
    is, it'll be because shrinkers are not called enough and some balancing
    is required.

    3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
    distributed between zones and the fair zone allocation policy used to do
    something very similar for anon. The distribution is now different, but not
    necessarily in any way that matters; it's still worth bearing in mind.

    VM statistic counters for reclaim decisions are zone-based. If the kernel
    is to reclaim on a per-node basis then we need to track per-node
    statistics but there is no infrastructure for that. The most notable
    change is that the old node_page_state is renamed to
    sum_zone_node_page_state. The new node_page_state takes a pglist_data and
    uses per-node stats but none exist yet. There is some renaming such as
    vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
    of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
    patch with no functional change. There is a lot of similarity between the
    node and zone helpers which is unfortunate but there was no obvious way of
    reusing the code and maintaining type safety.

    Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there are still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

            23047a96d7cfcfca          063f6715e77a7be5770d6081fe
            ----------------          --------------------------
                     %stddev              %change       %stddev
    21621405 ± 0%       +11.3%      24069657 ± 2%  vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2016

1 commit

  • Commit 612e44939c3c ("mm: workingset: eviction buckets for bigmem/lowbit
    machines") added a printk without a log level. Quieten it by using
    pr_info().

    Link: http://lkml.kernel.org/r/1466982072-29836-2-git-send-email-anton@ozlabs.org
    Signed-off-by: Anton Blanchard
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

18 Mar, 2016

2 commits

  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A page is activated on refault if the refault distance stored in the
    corresponding shadow entry is less than the number of active file pages.
    Since active file pages can't occupy more than half memory, we assume
    that the maximal effective refault distance can't be greater than half
    the number of present pages and size the shadow nodes lru list
    appropriately. Generally speaking, this assumption is correct, but it
    can result in wasting a considerable chunk of memory on stale shadow
    nodes in case the portion of file pages is small, e.g. if a workload
    mostly uses anonymous memory.

    To sort this out, we need to compute the size of the shadow nodes lru
    based not on the maximal possible size, but on the current size of the
    file cache. We
    could take the size of active file lru for the maximal refault distance,
    but active lru is pretty unstable - it can shrink dramatically at
    runtime possibly disrupting workingset detection logic.

    Instead we assume that the maximal refault distance equals half the
    total number of file cache pages. This will protect us against active
    file lru size fluctuations while still being correct, because size of
    active lru is normally maintained lower than size of inactive lru.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Mar, 2016

5 commits

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as the refaults are compared to the
    global number of active cache, cgroups might wrongfully get all their
    refaults activated when their pages are hotter than those of others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For per-cgroup thrash detection, we need to store the memcg ID inside
    the radix tree cookie as well. However, on 32 bit that doesn't leave
    enough bits for the eviction timestamp to cover the necessary range of
    recently evicted pages. The radix tree entry would look like this:

    [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

    12 bits means 4096 pages, means 16M worth of recently evicted pages.
    But refaults are actionable up to distances covering half of memory. To
    not miss refaults, we have to stretch out the range at the cost of how
    precisely we can tell when a page was evicted. This way we can shave
    off lower bits from the eviction timestamp until the necessary range is
    covered. E.g. grouping evictions into 1M buckets (256 pages) will
    stretch the longest representable refault distance to 4G.

    This patch implements eviction buckets that are automatically sized
    according to the available bits and the necessary refault range, in
    preparation for per-cgroup thrash detection.

    The maximum actionable distance is currently half of memory, but to
    support memory hotplug of up to 200% of boot-time memory, we size the
    buckets to cover double the distance. Beyond that, thrashing won't be
    detectable anymore.

    During boot, the kernel will print out the exact parameters, like so:

    [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

    In this example, there are 12 radix entry bits available for the
    eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
    1G machine). Consequently, evictions must be grouped into buckets of
    2^6 pages, or 256K.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-cgroup thrash detection will need to derive a live memcg from the
    eviction cookie, and doing that inside unpack_shadow() will get nasty
    with the reference handling spread over two functions.

    In preparation, make unpack_shadow() clearly about extracting static
    data, and let workingset_refault() do all the higher-level handling.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a compile-time constant, no need to calculate it on refault.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

1 commit

  • Add support for tracking dirty DAX entries in the struct address_space
    radix tree. This tree is already used for dirty page writeback, and it
    already supports the use of exceptional (non struct page*) entries.

    In order to properly track dirty DAX pages we will insert new
    exceptional entries into the radix tree that represent dirty DAX PTE or
    PMD pages. These exceptional entries will also contain the writeback
    addresses for the PTE or PMD faults that we can use at fsync/msync time.

    There are currently two types of exceptional entries (shmem and shadow)
    that can be placed into the radix tree, and this adds a third. We rely
    on the fact that only one type of exceptional entry can be found in a
    given radix tree based on its usage. This happens for free with DAX vs
    shmem but we explicitly prevent shadow entries from being added to radix
    trees for DAX mappings.

    The only shadow entries that would be generated for DAX radix trees
    would be to track zero page mappings that were created for holes. These
    pages would receive minimal benefit from having shadow entries, and the
    choice to have only one type of exceptional entry in a given radix tree
    makes the logic simpler both in clear_exceptional_entry() and in the
    rest of DAX.

    Signed-off-by: Ross Zwisler
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

13 Feb, 2015

2 commits

  • Currently, the isolate callback passed to the list_lru_walk family of
    functions is supposed to just delete an item from the list upon returning
    LRU_REMOVED or LRU_REMOVED_RETRY, while the nr_items counter is fixed up
    by __list_lru_walk_one() after the callback returns. Since the callback is
    allowed to drop the lock after removing an item (it has to return
    LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
    of elements on the list even if we check them under the lock. This makes
    it difficult to move items from one list_lru_one to another, which is
    required for per-memcg list_lru reparenting - we can't just splice the
    lists, we have to move entries one by one.

    This patch therefore introduces helpers that must be used by callback
    functions to isolate items instead of raw list_del/list_move. These are
    list_lru_isolate and list_lru_isolate_move. They not only remove the
    entry from the list, but also fix the nr_items counter, making sure
    nr_items always reflects the actual number of elements on the list if
    checked under the appropriate lock.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Kmem accounting of memcg is unusable now, because it lacks slab shrinker
    support. That means when we hit the limit we will get ENOMEM w/o any
    chance to recover. What we should do then is to call shrink_slab, which
    would reclaim old inode/dentry caches from this cgroup. This is what
    this patch set is intended to do.

    Basically, it does two things. First, it introduces the notion of
    per-memcg slab shrinker. A shrinker that wants to reclaim objects per
    cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
    passed the memory cgroup to scan from in shrink_control->memcg. For
    such shrinkers shrink_slab iterates over the whole cgroup subtree under
    the target cgroup and calls the shrinker for each kmem-active memory
    cgroup.

    Secondly, this patch set makes the list_lru structure per-memcg. It's
    done transparently to list_lru users - everything they have to do is to
    tell list_lru_init that they want a memcg-aware list_lru. Then the
    list_lru will automatically distribute objects among per-memcg lists
    based on which cgroup the object is accounted to. This way, to make FS
    shrinkers (icache, dcache) memcg-aware, we only need to make them use
    memcg-aware list_lru, and this is what this patch set does.

    As before, this patch set only enables per-memcg kmem reclaim when the
    pressure goes from memory.limit, not from memory.kmem.limit. Handling
    memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
    it is still unclear whether we will have this knob in the unified
    hierarchy.

    This patch (of 9):

    NUMA aware slab shrinkers use the list_lru structure to distribute
    objects coming from different NUMA nodes to different lists. Whenever
    such a shrinker needs to count or scan objects from a particular node,
    it issues commands like this:

    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
    isolate_arg, &sc->nr_to_scan);

    where sc is an instance of the shrink_control structure passed to it
    from vmscan.

    To simplify this, let's add special list_lru functions to be used by
    shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
    consolidate the nid and nr_to_scan arguments in the shrink_control
    structure.

    This will also allow us to avoid patching shrinkers that use list_lru
    when we make shrink_slab() per-memcg - all we will have to do is extend
    the shrink_control structure to include the target memcg and make
    list_lru_shrink_{count,walk} handle this appropriately.

    Signed-off-by: Vladimir Davydov
    Suggested-by: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Glauber Costa
    Cc: Alexander Viro
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

04 Apr, 2014

2 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    inactive list and the frequently used list the active list. However,
    with its fixed list sizing, this scheme ultimately was not significantly
    better than a FIFO policy and still thrashed cache based on eviction
    speed, rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the current 50% minimum inactive list size.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner