29 Jul, 2016

12 commits

  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When I did a stress test with hackbench, I got OOM messages frequently,
    which never happened with zone-lru.

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    ..
    ..
    __alloc_pages_nodemask+0xe52/0xe60
    ? new_slab+0x39c/0x3b0
    new_slab+0x39c/0x3b0
    ___slab_alloc.constprop.87+0x6da/0x840
    ? __alloc_skb+0x3c/0x260
    ? _raw_spin_unlock_irq+0x27/0x60
    ? trace_hardirqs_on_caller+0xec/0x1b0
    ? finish_task_switch+0xa6/0x220
    ? poll_select_copy_remaining+0x140/0x140
    __slab_alloc.isra.81.constprop.86+0x40/0x6d
    ? __alloc_skb+0x3c/0x260
    kmem_cache_alloc+0x22c/0x260
    ? __alloc_skb+0x3c/0x260
    __alloc_skb+0x3c/0x260
    alloc_skb_with_frags+0x4e/0x1a0
    sock_alloc_send_pskb+0x16a/0x1b0
    ? wait_for_unix_gc+0x31/0x90
    ? alloc_set_pte+0x2ad/0x310
    unix_stream_sendmsg+0x28d/0x340
    sock_sendmsg+0x2d/0x40
    sock_write_iter+0x6c/0xc0
    __vfs_write+0xc0/0x120
    vfs_write+0x9b/0x1a0
    ? __might_fault+0x49/0xa0
    SyS_write+0x44/0x90
    do_fast_syscall_32+0xa6/0x1e0
    sysenter_past_esp+0x45/0x74

    Mem-Info:
    active_anon:104698 inactive_anon:105791 isolated_anon:192
    active_file:433 inactive_file:283 isolated_file:22
    unevictable:0 dirty:0 writeback:296 unstable:0
    slab_reclaimable:6389 slab_unreclaimable:78927
    mapped:474 shmem:0 pagetables:101426 bounce:0
    free:10518 free_pcp:334 free_cma:0
    Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
    DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
    Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
    HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    25121 total pagecache pages
    24160 pages in swap cache
    Swap cache stats: add 86371, delete 62211, find 42865/60187
    Free swap = 4015560kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9658 pages reserved
    0 pages cma reserved

    The order-0 allocation for the normal zone failed while there was a lot
    of reclaimable memory (i.e., anonymous memory with free swap). I wanted
    to analyze the problem but it was hard because we had removed the
    per-zone LRU stats, so I couldn't tell how much anonymous memory there
    was in the normal/DMA zones.

    When investigating an OOM problem, the reclaimable memory count is a
    crucial stat for finding the cause. Without it, it's hard to parse the
    OOM message, so I believe we should keep it.

    With per-zone LRU stats,

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    Mem-Info:
    active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    54409 total pagecache pages
    53215 pages in swap cache
    Swap cache stats: add 300982, delete 247765, find 157978/226539
    Free swap = 3803244kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9642 pages reserved
    0 pages cma reserved

    With that, we can see the normal zone has about 86M of reclaimable
    memory, so we know something is going wrong in reclaim (I will fix the
    problem in the next patch).

    [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
    Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because the reclaim retry logic,
    compaction retry logic and highmem calculations all depend on per-zone
    stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before the OOM
    killer is considered.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.
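
    As a rough illustration of the retry policy described above, a minimal
    sketch follows. The function name is hypothetical; ac->high_zoneidx,
    ZONE_NORMAL and MAX_RECLAIM_RETRIES are the identifiers named in this
    log, and the body is illustrative rather than the actual mm/page_alloc.c
    change.

    /*
     * Hedged sketch, not the real should_reclaim_retry(): a zone-constrained
     * request has no per-zone LRU statistics to consult, so it is simply
     * retried up to a fixed cap before the OOM killer is considered, and
     * compaction retries are skipped for it entirely.
     */
    static bool constrained_alloc_should_retry(struct alloc_context *ac,
                                               int no_progress_loops)
    {
            if (ac->high_zoneidx < ZONE_NORMAL) {
                    /* Blind retry: bounded by MAX_RECLAIM_RETRIES. */
                    return no_progress_loops < MAX_RECLAIM_RETRIES;
            }

            /*
             * Unconstrained requests fall back to the per-node approximation
             * of how much reclaimable memory remains (not shown here).
             */
            return false;
    }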

    The final step is calculating the number of dirtyable highmem pages. As
    those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a number of stats that were previously accessible via zoneinfo
    that are now invisible. While it is possible to create a new file for
    the node stats, this may be missed by users. Instead this patch prints
    the stats under the first populated zone in /proc/zoneinfo.

    Link: http://lkml.kernel.org/r/1467970510-21195-34-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The vmstat allocstall was fairly useful in the general sense but
    node-based LRUs change that. It's important to know if a stall was for
    an address-limited allocation request as this will require skipping
    pages from other zones. This patch adds pgstall_* counters to replace
    allocstall. The sum of the counters will equal the old allocstall so it
    can be trivially recalculated. A high number of address-limited
    allocation requests may result in a lot of useless LRU scanning for
    suitable pages.

    As address-limited allocations require pages to be skipped, it's
    important to know how much useless LRU scanning took place, so this
    patch adds pgskip* counters. This yields the following model (see the
    sketch after the list):

    1. The number of address-space limited stalls can be accounted for (pgstall)
    2. The amount of useless work required to reclaim the data is accounted (pgskip)
    3. The total number of scans is available from pgscan_kswapd and pgscan_direct
    so from that the ratio of useful to useless scans can be calculated.
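
    For illustration, a small userspace reader of /proc/vmstat is sketched
    below. The counter names pgskip_*, pgscan_kswapd and pgscan_direct are
    taken from this log and should be treated as assumptions about the
    exported names.

    /*
     * Hedged sketch: sum the pgskip_* counters and compare them with the
     * total pgscan counters to estimate how much LRU scanning was useless.
     */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char name[64];
            unsigned long long val, skipped = 0, scanned = 0;
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f) {
                    perror("/proc/vmstat");
                    return 1;
            }
            while (fscanf(f, "%63s %llu", name, &val) == 2) {
                    if (!strncmp(name, "pgskip_", 7))
                            skipped += val;         /* useless scans */
                    else if (!strcmp(name, "pgscan_kswapd") ||
                             !strcmp(name, "pgscan_direct"))
                            scanned += val;         /* total scans */
            }
            fclose(f);
            if (scanned)
                    printf("skipped %llu of %llu scanned pages (%.2f%%)\n",
                           skipped, scanned, 100.0 * skipped / scanned);
            return 0;
    }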

    [mgorman@techsingularity.net: s/pgstall/allocstall/]
    Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer
    be an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to
    the node. Due to throttling logic in the page allocator for reliable
    OOM detection, it is still necessary to track dirty and writeback pages
    on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim makes decisions based on the number of pages that are mapped but
    it's mixing node and zone information. Account NR_FILE_MAPPED and
    NR_ANON_PAGES pages on the node.

    Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Working set and refault detection is still zone-based, fix it.

    Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to the reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both a zone and
    node basis. Most reclaim logic is based on the node counters but the
    retry logic uses the zone counters, which do not distinguish inactive
    and active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patchset: "Move LRU page reclaim from zones to nodes v9"

    This series moves LRUs from the zones to the node. While this is a
    current rebase, the test results were based on mmotm as of June 23rd.
    Conceptually, this series is simple but there are a lot of details.
    Some of the broad motivations for this are;

    1. The residency of a page partially depends on what zone the page was
    allocated from. This is partially combatted by the fair zone allocation
    policy but that is a partial solution that introduces overhead in the
    page allocator paths.

    2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
    example, direct reclaim scans in zonelist order and reclaims even if
    the zone is over the high watermark regardless of the age of pages
    in that LRU. Kswapd on the other hand starts reclaim on the highest
    unbalanced zone. A difference in the distribution of file/anon pages
    due to when they were allocated can result in a difference in aging.
    While the fair zone allocation policy mitigates some of the problems
    here, the page reclaim results on a multi-zone node will always be
    different to those on a single-zone node.

    3. kswapd and the page allocator scan zones in the opposite order to
    avoid interfering with each other, but this is sensitive to timing. In
    the ideal case this mitigates the page allocator using pages that were
    allocated very recently. While kswapd is working on the lower zones it's
    fine, but during the rebalancing of the highest zone the page allocator
    and kswapd interfere with each other. It's worse if the highest zone is
    small and difficult to balance.

    4. slab shrinkers are node-based which makes it harder to identify the exact
    relationship between slab reclaim and LRU reclaim.

    The reason we have zone-based reclaim is that we used to have
    large highmem zones in common configurations and it was necessary
    to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
    less of a concern as machines with lots of memory will (or should) use
    64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
    rare. Machines that do use highmem should have relatively lower
    highmem:lowmem ratios than we worried about in the past.

    Conceptually, moving to node LRUs should be easier to understand. The
    page allocator plays fewer tricks to game reclaim and reclaim behaves
    similarly on all nodes.

    The series has been tested on a 16 core UMA machine and a 2-socket 48
    core NUMA machine. The UMA results are presented in most cases as the NUMA
    machine behaved similarly.

    pagealloc
    ---------

    This is a microbenchmark that shows the benefit of removing the fair zone
    allocation policy. It was tested up to order-4 but only orders 0 and 1
    are shown as the other orders were comparable.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
    Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
    Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
    Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
    Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
    Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
    Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
    Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
    Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
    Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
    Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
    Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
    Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
    Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
    Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
    Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
    Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
    Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
    Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
    Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
    Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
    Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
    Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
    Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
    Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
    Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
    Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

    This shows a steady improvement throughout. The primary benefit is from
    reduced system CPU usage which is obvious from the overall times;

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    User 189.19 191.80
    System 2604.45 2533.56
    Elapsed 2855.30 2786.39

    The vmstats also showed that the fair zone allocation policy was definitely
    removed as can be seen here;

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v8
    DMA32 allocs 28794729769 0
    Normal allocs 48432501431 77227309877
    Movable allocs 0 0

    tiobench on ext4
    ----------------

    tiobench is a benchmark that artificially benefits if old pages remain
    resident while new pages get reclaimed. The fair zone allocation policy
    mitigates this problem so pages age fairly. While the benchmark has
    problems, it is important that tiobench performance remains constant as
    it implies that the page aging problems the fair zone allocation policy
    fixes are not re-introduced.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
    Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
    Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
    Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
    Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
    Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
    Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
    Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
    Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
    Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
    Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
    Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
    Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
    Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
    Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
    Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
    Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
    Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
    Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
    Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
    Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 approx-v9
    User 645.72 525.90
    System 403.85 331.75
    Elapsed 6795.36 6783.67

    This shows that the series has little or no impact on tiobench, which is
    desirable, as well as a reduction in system CPU usage. It indicates that
    the fair zone allocation policy was removed in a manner that didn't
    reintroduce one class of page aging bug. There were only minor
    differences in overall reclaim activity.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Minor Faults 645838 647465
    Major Faults 573 640
    Swap Ins 0 0
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 46041453 44190646
    Normal allocs 78053072 79887245
    Movable allocs 0 0
    Allocation stalls 24 67
    Stall zone DMA 0 0
    Stall zone DMA32 0 0
    Stall zone Normal 0 2
    Stall zone HighMem 0 0
    Stall zone Movable 0 65
    Direct pages scanned 10969 30609
    Kswapd pages scanned 93375144 93492094
    Kswapd pages reclaimed 93372243 93489370
    Direct pages reclaimed 10969 30609
    Kswapd efficiency 99% 99%
    Kswapd velocity 13741.015 13781.934
    Direct efficiency 100% 100%
    Direct velocity 1.614 4.512
    Percentage direct scans 0% 0%

    kswapd activity was roughly comparable. There were differences in direct
    reclaim activity but negligible in the context of the overall workload
    (velocity of 4 pages per second with the patches applied, 1.6 pages per
    second in the baseline kernel).

    pgbench read-only large configuration on ext4
    ---------------------------------------------

    pgbench is a database benchmark that can be sensitive to page reclaim
    decisions. This also checks whether removing the fair zone allocation
    policy is safe.

    pgbench Transactions
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
    Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
    Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
    Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
    Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
    Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

    Negligible differences again. As with tiobench, overall reclaim activity
    was comparable.

    bonnie++ on ext4
    ----------------

    No interesting performance difference, negligible differences on reclaim
    stats.

    paralleldd on ext4
    ------------------

    This workload uses varying numbers of dd instances to read large amounts of
    data from disk.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
    Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
    Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
    Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
    Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
    Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    User 1548.01 1552.44
    System 8609.71 8515.08
    Elapsed 3587.10 3594.54

    There is little or no change in performance but some drop in system CPU usage.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Minor Faults 362662 367360
    Major Faults 1204 1143
    Swap Ins 22 0
    Swap Outs 2855 1029
    DMA allocs 0 0
    DMA32 allocs 31409797 28837521
    Normal allocs 46611853 49231282
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 40845270 40869088
    Kswapd pages reclaimed 40830976 40855294
    Direct pages reclaimed 0 0
    Kswapd efficiency 99% 99%
    Kswapd velocity 11386.711 11369.769
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Page writes by reclaim 2855 1029
    Page writes file 0 0
    Page writes anon 2855 1029
    Page reclaim immediate 771 1628
    Sector Reads 293312636 293536360
    Sector Writes 18213568 18186480
    Page rescued immediate 0 0
    Slabs scanned 128257 132747
    Direct inode steals 181 56
    Kswapd inode steals 59 1131

    It basically shows that kswapd was active at roughly the same rate in
    both kernels. There was also comparable slab scanning activity and direct
    reclaim was avoided in both cases. There appears to be a large difference
    in the number of inodes reclaimed, but the workload has few active inodes
    so it is likely a timing artifact.

    stutter
    -------

    stutter simulates a simple workload. One part uses a lot of anonymous
    memory, a second measures mmap latency and a third copies a large file.
    The primary metric is checking for mmap latency.

    stutter
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
    1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
    2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
    3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
    Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
    Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
    Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
    Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
    Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
    Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
    Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
    Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
    Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
    Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
    Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
    Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
    Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

    This shows a number of improvements with the worst-case outlier greatly
    improved.

    Some of the vmstats are interesting

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Swap Ins 163 502
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 618719206 1381662383
    Normal allocs 891235743 564138421
    Movable allocs 0 0
    Allocation stalls 2603 1
    Direct pages scanned 216787 2
    Kswapd pages scanned 50719775 41778378
    Kswapd pages reclaimed 41541765 41777639
    Direct pages reclaimed 209159 0
    Kswapd efficiency 81% 99%
    Kswapd velocity 16859.554 14329.059
    Direct efficiency 96% 0%
    Direct velocity 72.061 0.001
    Percentage direct scans 0% 0%
    Page writes by reclaim 6215049 0
    Page writes file 6215049 0
    Page writes anon 0 0
    Page reclaim immediate 70673 90
    Sector Reads 81940800 81680456
    Sector Writes 100158984 98816036
    Page rescued immediate 0 0
    Slabs scanned 1366954 22683

    While this is not guaranteed in all cases, this particular test showed
    a large reduction in direct reclaim activity. It's also worth noting
    that no page writes were issued from reclaim context.

    This series is not without its hazards. There are at least three areas
    that I'm concerned with even though I could not reproduce any problems
    in those areas.

    1. Reclaim/compaction is going to be affected because the amount of reclaim is
    no longer targeted at a specific zone. Compaction works on a per-zone basis
    so there is no guarantee that reclaiming a few THPs' worth of pages will
    have a positive impact on compaction success rates.

    2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
    are called is now different. This may or may not be a problem but if it
    is, it'll be because shrinkers are not called enough and some balancing
    is required.

    3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
    distributed between zones and the fair zone allocation policy used to do
    something very similar for anon. The distribution is now different, not
    necessarily in any way that matters, but it's still worth bearing in mind.

    VM statistic counters for reclaim decisions are zone-based. If the kernel
    is to reclaim on a per-node basis then we need to track per-node
    statistics but there is no infrastructure for that. The most notable
    change is that the old node_page_state is renamed to
    sum_zone_node_page_state. The new node_page_state takes a pglist_data and
    uses per-node stats but none exist yet. There is some renaming such as
    vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
    of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
    patch with no functional change. There is a lot of similarity between the
    node and zone helpers which is unfortunate but there was no obvious way of
    reusing the code and maintaining type safety.
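
    A minimal sketch of the split described above, using the names given in
    this log; the exact declarations in the kernel headers may differ.

    /*
     * Hedged sketch: two counter arrays and two lookup helpers, following
     * the renaming described above; exact types and guards may differ.
     */
    extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
    extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];

    /* Old behaviour under a new name: sum a zone counter over all zones
     * of a node. */
    unsigned long sum_zone_node_page_state(int node, enum zone_stat_item item);

    /* New helper: read a genuinely per-node counter from the pglist_data. */
    unsigned long node_page_state(struct pglist_data *pgdat,
                                  enum node_stat_item item);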

    Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

3 commits

    Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. They indicate how many times we allocate and map shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.
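
    For reference, a trivial userspace sketch that picks the two new fields
    out of /proc/meminfo; only the field names are assumed from the text
    above.

    /* Hedged sketch: print the two new /proc/meminfo fields. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[128];
            FILE *f = fopen("/proc/meminfo", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    if (!strncmp(line, "ShmemHugePages:", 15) ||
                        !strncmp(line, "ShmemPmdMapped:", 15))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }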

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    THP_FILE_ALLOC: how many times a huge page was allocated and put into
    the page cache.

    THP_FILE_MAPPED: how many times a file huge page was mapped.

    Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    zram is very popular in parts of the embedded world (e.g., TVs, mobile
    phones). On those systems, zsmalloc's consumed memory size is never
    trivial (one example from a real product system: total memory 800M,
    zsmalloc consumed 150M), so we have used this out-of-tree patch to
    monitor system memory behavior via /proc/vmstat.

    With zsmalloc in vmstat, it helps in tracking down system behavior due
    to memory usage.

    [minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat]
    Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox
    [akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m]
    Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sangseok Lee
    Cc: Chanho Min
    Cc: Chan Gyun Jeong
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

04 Jun, 2016

1 commit

    Per the discussion with Joonsoo Kim [1], we need to check the return
    value of lookup_page_ext() for all call sites since it might return NULL
    in some cases, although it is unlikely, e.g. memory hotplug.

    Tested with ltp with "page_owner=0".

    [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE
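
    The pattern applied at each call site looks roughly like the fragment
    below (a sketch; the surrounding function and error path are
    illustrative).

    /*
     * Hedged sketch of the check added at each call site: bail out early
     * if the lookup fails (e.g. during memory hotplug) instead of
     * dereferencing a NULL pointer.
     */
    struct page_ext *page_ext = lookup_page_ext(page);

    if (unlikely(!page_ext))
            return;
    /* ... continue using page_ext as before ... */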

    [akpm@linux-foundation.org: fix build-breaking typos]
    [arnd@arndb.de: fix build problems from lookup_page_ext]
    Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org
    Signed-off-by: Yang Shi
    Signed-off-by: Arnd Bergmann
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

21 May, 2016

1 commit

    The cpu_stat_off variable is unnecessary since we can check whether a
    workqueue request is pending otherwise. Removal of cpu_stat_off makes
    it pretty easy for the vmstat shepherd to ensure that the proper things
    happen.

    Removing the state also removes all races related to it. Should a
    workqueue not be scheduled as needed for vmstat_update then the shepherd
    will notice and schedule it as needed. Should a workqueue be
    unnecessarily scheduled then the vmstat updater will disable it.

    [akpm@linux-foundation.org: fix indentation, per Michal]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

20 May, 2016

6 commits

  • The function call overhead of get_pfnblock_flags_mask() is measurable in
    the page free paths. This patch uses an inlined version that is faster.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • zone_statistics has one call-site but it's a public function. Make it
    static and inline.

    The performance difference on a page allocator microbenchmark is;

    4.6.0-rc2 4.6.0-rc2
    statbranch-v1r20 statinline-v1r20
    Min alloc-odr0-1 419.00 ( 0.00%) 412.00 ( 1.67%)
    Min alloc-odr0-2 305.00 ( 0.00%) 301.00 ( 1.31%)
    Min alloc-odr0-4 250.00 ( 0.00%) 247.00 ( 1.20%)
    Min alloc-odr0-8 219.00 ( 0.00%) 215.00 ( 1.83%)
    Min alloc-odr0-16 203.00 ( 0.00%) 199.00 ( 1.97%)
    Min alloc-odr0-32 195.00 ( 0.00%) 191.00 ( 2.05%)
    Min alloc-odr0-64 191.00 ( 0.00%) 187.00 ( 2.09%)
    Min alloc-odr0-128 189.00 ( 0.00%) 185.00 ( 2.12%)
    Min alloc-odr0-256 198.00 ( 0.00%) 193.00 ( 2.53%)
    Min alloc-odr0-512 210.00 ( 0.00%) 207.00 ( 1.43%)
    Min alloc-odr0-1024 216.00 ( 0.00%) 213.00 ( 1.39%)
    Min alloc-odr0-2048 221.00 ( 0.00%) 220.00 ( 0.45%)
    Min alloc-odr0-4096 227.00 ( 0.00%) 226.00 ( 0.44%)
    Min alloc-odr0-8192 232.00 ( 0.00%) 229.00 ( 1.29%)
    Min alloc-odr0-16384 232.00 ( 0.00%) 229.00 ( 1.29%)

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • zone_statistics has more branches than it really needs to take an
    unlikely GFP flag into account. Reduce the number and annotate the
    unlikely flag.

    The performance difference on a page allocator microbenchmark is;

    4.6.0-rc2 4.6.0-rc2
    nocompound-v1r10 statbranch-v1r10
    Min alloc-odr0-1 417.00 ( 0.00%) 419.00 ( -0.48%)
    Min alloc-odr0-2 308.00 ( 0.00%) 305.00 ( 0.97%)
    Min alloc-odr0-4 253.00 ( 0.00%) 250.00 ( 1.19%)
    Min alloc-odr0-8 221.00 ( 0.00%) 219.00 ( 0.90%)
    Min alloc-odr0-16 205.00 ( 0.00%) 203.00 ( 0.98%)
    Min alloc-odr0-32 199.00 ( 0.00%) 195.00 ( 2.01%)
    Min alloc-odr0-64 193.00 ( 0.00%) 191.00 ( 1.04%)
    Min alloc-odr0-128 191.00 ( 0.00%) 189.00 ( 1.05%)
    Min alloc-odr0-256 200.00 ( 0.00%) 198.00 ( 1.00%)
    Min alloc-odr0-512 212.00 ( 0.00%) 210.00 ( 0.94%)
    Min alloc-odr0-1024 219.00 ( 0.00%) 216.00 ( 1.37%)
    Min alloc-odr0-2048 225.00 ( 0.00%) 221.00 ( 1.78%)
    Min alloc-odr0-4096 231.00 ( 0.00%) 227.00 ( 1.73%)
    Min alloc-odr0-8192 234.00 ( 0.00%) 232.00 ( 0.85%)
    Min alloc-odr0-16384 234.00 ( 0.00%) 232.00 ( 0.85%)

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Provide /proc/sys/vm/stat_refresh to force an immediate update of
    per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
    before checking counts when testing. Originally added to work around a
    bug which left counts stranded indefinitely on a cpu going idle (an
    inaccuracy magnified when small below-batch numbers represent "huge"
    amounts of memory), but I believe that bug is now fixed: nonetheless,
    this is still a useful knob.

    Its schedule_on_each_cpu() is probably too expensive just to fold into
    reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
    Allow a write or a read to do the same: nothing to read, but "grep -h
    Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and
    since global_page_state() itself is careful to disguise any underflow as
    0, hack in an "Invalid argument" and pr_warn() if a counter is negative
    after the refresh - this helped to fix a misaccounting of
    NR_ISOLATED_FILE in my migration code.

    But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
    often go negative some of the time. I have not yet worked out why, but
    have no evidence that it's actually harmful. Punt for the moment by
    just ignoring the anomaly on those.
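
    As a usage illustration, a hedged userspace sketch that touches
    stat_refresh before sampling /proc/meminfo, mirroring the grep example
    above; it needs root because of the 0600 mode.

    /*
     * Hedged sketch: a read (or write) of /proc/sys/vm/stat_refresh folds
     * the per-cpu vmstat diffs into the global counters, so the meminfo
     * values printed afterwards are up to date.
     */
    #include <stdio.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/sys/vm/stat_refresh", "r");

            if (f) {
                    /* Nothing to read back; the access triggers the refresh. */
                    if (fgets(line, sizeof(line), f))
                            fputs(line, stdout);
                    fclose(f);
            }

            f = fopen("/proc/meminfo", "r");
            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f))
                    fputs(line, stdout);
            fclose(f);
            return 0;
    }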

    Signed-off-by: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andres Lagar-Cavilla
    Cc: Yang Shi
    Cc: Ning Qu
    Cc: Mel Gorman
    Cc: Andres Lagar-Cavilla
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • node_page_state() manually adds statistics per each zone and returns
    total value for all zones. Whenever we add a new zone, we need to
    consider this function and it's really troublesome. Make it handle all
    zones by itself.
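
    A minimal sketch of the idea, assuming the usual NODE_DATA /
    MAX_NR_ZONES / zone_page_state helpers; it is illustrative, not the
    literal patch.

    /*
     * Hedged sketch: iterate over every zone of the node instead of naming
     * each zone counter explicitly, so adding a new zone needs no change
     * here.
     */
    unsigned long node_page_state(int node, enum zone_stat_item item)
    {
            struct zone *zones = NODE_DATA(node)->node_zones;
            unsigned long sum = 0;
            int i;

            for (i = 0; i < MAX_NR_ZONES; i++)
                    sum += zone_page_state(zones + i, item);

            return sum;
    }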

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    There is a system whose nodes' pfns overlap as follows:

    -----pfn-------->
    N0 N1 N2 N0 N1 N2

    Therefore, we need to take care of this overlap when iterating over a
    pfn range.

    There are two places in vmstat.c that iterate over a pfn range and they
    don't consider this overlap. Add the check.

    Without this patch, the above system could over-count the pageblock
    number on a zone.
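
    The added check looks roughly like the pattern below (a sketch, not the
    exact vmstat.c hunk).

    /*
     * Hedged sketch: while walking a zone's pfn range pageblock by
     * pageblock, skip pages that actually belong to an overlapping node.
     */
    unsigned long pfn;

    for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone);
         pfn += pageblock_nr_pages) {
            struct page *page;

            if (!pfn_valid(pfn))
                    continue;
            page = pfn_to_page(pfn);

            /* The pfn range of this zone may interleave with other nodes. */
            if (page_zone(page) != zone)
                    continue;

            /* ... account this pageblock ... */
    }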

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: "Aneesh Kumar K.V"
    Cc: "Rafael J. Wysocki"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

18 Mar, 2016

2 commits

    Count how many times we put a THP in the split queue. Currently, it
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly wrt THP: it often faults in a huge page and then unmaps part
    of it. This leads to unnecessary memory fragmentation and the
    application may require tuning.

    The event also can help with debugging kernel [mis-]behaviour.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. The success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch with the description of what's wrong with
    the old approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake up
    kcompactd on demand in special situations, such as when hugepages are
    not available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.
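
    A rough sketch of the kthread's main loop as described above; the
    wait-queue and helper names are assumptions for illustration, not quoted
    from the patch.

    /*
     * Hedged sketch of a per-node background compaction thread: sleep until
     * the allocator slowpath (or kswapd, once it stops reclaiming) requests
     * work, then run sync compaction.
     */
    static int kcompactd(void *p)
    {
            pg_data_t *pgdat = (pg_data_t *)p;

            while (!kthread_should_stop()) {
                    /* Woken from the slowpath or when kswapd goes to sleep. */
                    wait_event_freezable(pgdat->kcompactd_wait,
                                         kcompactd_work_requested(pgdat));

                    /* Sync compaction only: no allocation is waiting on us,
                     * so there is no latency to minimise. */
                    kcompactd_do_work(pgdat);
            }
            return 0;
    }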

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

16 Mar, 2016

2 commits

  • CONFIG_PAGE_OWNER attempts to impose negligible runtime overhead when
    enabled during compilation, but not actually enabled during runtime by
    boot param page_owner=on. This overhead can be further reduced using
    the static key mechanism, which this patch does.
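
    A minimal sketch of the static key pattern described above; the key and
    function names are illustrative, not necessarily those used in
    mm/page_owner.c.

    #include <linux/jump_label.h>

    /*
     * Hedged sketch: the key starts off and is flipped on only when booted
     * with page_owner=on, so the check below compiles to a no-op branch in
     * the common (disabled) case.
     */
    static DEFINE_STATIC_KEY_FALSE(page_owner_key);

    static void enable_page_owner(void)
    {
            static_branch_enable(&page_owner_key);  /* called once at init */
    }

    void record_page_owner(struct page *page, unsigned int order, gfp_t gfp_mask)
    {
            if (!static_branch_unlikely(&page_owner_key))
                    return;         /* disabled at runtime: nearly free */

            /* ... record order, gfp_mask and the allocation stack ... */
    }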

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    The information in /sys/kernel/debug/page_owner includes the migratetype
    of the pageblock the page belongs to. This is also checked against the
    page's migratetype (as declared by gfp_flags during its allocation), and
    the page is reported as Fallback if its migratetype differs from the
    pageblock's. This is somewhat misleading because in fact fallback
    allocation is not the only reason why these two can differ. It also
    doesn't directly provide the page's migratetype, although it's possible
    to derive that from the gfp_flags.

    It's arguably better to print both the page's and the pageblock's
    migratetype and leave the interpretation to the consumer than to suggest
    fallback allocation as the only possible reason. While at it, we can
    print the migratetypes as strings the same way as /proc/pagetypeinfo
    does, as some of the numeric values depend on kernel configuration. For
    that, this patch moves the migratetype_names array from the #ifdef
    CONFIG_PROC_FS part of mm/vmstat.c to mm/page_alloc.c and exports it.

    With the new format strings for flags, we can now also provide symbolic
    page and gfp flags in the /sys/kernel/debug/page_owner file. This
    replaces the positional printing of page flags as single letters, which
    might have looked nicer, but was limited to a subset of flags, and
    required the user to remember the letters.

    Example page_owner entry after the patch:

    Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
    PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
    [] __alloc_pages_nodemask+0x134/0x230
    [] alloc_pages_current+0x88/0x120
    [] __page_cache_alloc+0xe6/0x120
    [] __do_page_cache_readahead+0xdc/0x240
    [] ondemand_readahead+0x135/0x260
    [] page_cache_sync_readahead+0x31/0x50
    [] generic_file_read_iter+0x453/0x760
    [] __vfs_read+0xa7/0xd0

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

06 Feb, 2016

2 commits

    Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and
    shut down on idle") made vmstat_shepherd deferrable. vmstat_update
    itself still uses a standard timer which might interrupt the idle task.
    This is possible because "mm, vmstat: make quiet_vmstat lighter" removed
    cancel_delayed_work from quiet_vmstat.

    Change vmstat_work to use DEFERRABLE_WORK to prevent pointless wakeups
    from the idle context.
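
    The change amounts to switching the work-item initializer, roughly as
    sketched below; the wrapper function is hypothetical and only the
    INIT_DEFERRABLE_WORK() line reflects the described change.

    #include <linux/workqueue.h>

    /*
     * Hedged sketch: a deferrable delayed work does not wake an idle CPU
     * just to fold vmstat diffs; it runs on the next non-idle tick instead.
     */
    static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

    static void start_vmstat_worker(int cpu)
    {
            /* Before: INIT_DELAYED_WORK(...); after: */
            INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), vmstat_update);
    }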

    Acked-by: Christoph Lameter
    Signed-off-by: Michal Hocko
    Cc: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Mike has reported a considerable overhead of refresh_cpu_vm_stats from
    the idle entry during pipe test:

    12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
    4.75% [kernel] [k] __schedule
    4.70% [kernel] [k] mutex_unlock
    3.14% [kernel] [k] __switch_to

    This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
    deferrable again and shut down on idle") which has placed quiet_vmstat
    into cpu_idle_loop. The main reason here seems to be that the idle
    entry has to get over all zones and perform atomic operations for each
    vmstat entry even though there might be no per cpu diffs. This is a
    pointless overhead for _each_ idle entry.

    Make sure that quiet_vmstat is as light as possible.

    First of all it doesn't make any sense to do any local sync if the
    current cpu is already set in oncpu_stat_off because vmstat_update puts
    itself there only if there is nothing to do.

    Then we can check need_update which should be a cheap way to check for
    potential per-cpu diffs and only then do refresh_cpu_vm_stats.

    The original patch also did cancel_delayed_work which we are not doing
    here. There are two reasons for that. Firstly cancel_delayed_work from
    idle context will blow up on RT kernels (reported by Mike):

    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
    Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
    Call Trace:
    dump_stack+0x49/0x67
    ___might_sleep+0xf5/0x180
    rt_spin_lock+0x20/0x50
    try_to_grab_pending+0x69/0x240
    cancel_delayed_work+0x26/0xe0
    quiet_vmstat+0x75/0xa0
    cpu_idle_loop+0x38/0x3e0
    cpu_startup_entry+0x13/0x20
    start_secondary+0x114/0x140

    And secondly, even on !RT kernels it might add some non trivial overhead
    which is not necessary. Even if the vmstat worker wakes up and preempts
    idle then it will be most likely a single shot noop because the stats
    were already synced and so it would end up on the oncpu_stat_off anyway.
    We just need to teach both vmstat_shepherd and vmstat_update to stop
    scheduling the worker if there is nothing to do.
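
    Roughly, the lighter quiet_vmstat described above reduces to the shape
    below. cpu_stat_off, need_update and refresh_cpu_vm_stats are the names
    used in this log; the body is a sketch, not the exact code.

    /*
     * Hedged sketch: do the cheap checks first and skip the expensive fold
     * (and any cancel_delayed_work) when the idle entry has nothing to do.
     */
    void quiet_vmstat(void)
    {
            if (system_state != SYSTEM_RUNNING)
                    return;

            /* This cpu already parked itself: vmstat_update only does that
             * when there was nothing left to sync. */
            if (cpumask_test_cpu(smp_processor_id(), cpu_stat_off))
                    return;

            /* Cheap check for pending per-cpu diffs before the atomic fold. */
            if (!need_update(smp_processor_id()))
                    return;

            refresh_cpu_vm_stats(false);
    }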

    [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU]
    Signed-off-by: Michal Hocko
    Reported-by: Mike Galbraith
    Acked-by: Christoph Lameter
    Signed-off-by: Mike Galbraith
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Jan, 2016

1 commit

  • If we detect that there is nothing to do just set the flag and do not
    check if it was already set before. Races really do not matter. If the
    flag is set by any code then the shepherd will start dealing with the
    situation and reenable the vmstat workers when necessary again.

    Since commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again
    and shut down on idle") quiet_vmstat might update cpu_stat_off and mark
    a particular cpu to be handled by vmstat_shepherd. This might trigger a
    VM_BUG_ON in vmstat_update because the work item might have been
    sleeping during the idle period and see the cpu_stat_off updated after
    the wake up. The VM_BUG_ON is therefore misleading and no longer
    appropriate. Moreover it doesn't really provide any protection from real
    bugs because vmstat_shepherd will simply reschedule the vmstat_work
    anytime it sees a particular cpu set, or vmstat_update would do the same
    from the worker context directly. Even when the two would race, the
    result wouldn't be incorrect as the counter update is fully idempotent.

    Reported-by: Sasha Levin
    Signed-off-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

16 Jan, 2016

2 commits

    Linux doesn't have the ability to free pages lazily, while other OSes
    have long supported this via madvise(MADV_FREE).

    The gain is clear: the kernel can discard freed pages rather than
    swapping them out or OOMing if memory pressure happens.

    Without memory pressure, freed pages can be reused by userspace without
    additional overhead (e.g., page fault + allocation + zeroing).

    Jason Evans said:

    : Facebook has been using MAP_UNINITIALIZED
    : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
    : several years, but there are operational costs to maintaining this
    : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
    : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
    : increased throughput for much of our workload by ~5%, and although the
    : benefit has decreased using newer hardware and kernels, there is still
    : enough benefit that we cannot reasonably retire it without a replacement.
    :
    : Aside from Facebook operations, there are numerous broadly used
    : applications that would benefit from MADV_FREE. The ones that immediately
    : come to mind are redis, varnish, and MariaDB. I don't have much insight
    : into Android internals and development process, but I would hope to see
    : MADV_FREE support eventually end up there as well to benefit applications
    : linked with the integrated jemalloc.
    :
    : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
    : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
    : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
    : (and AIX, but I'm not sure it even compiles on AIX). The lack of
    : MADV_FREE on Linux forced me down a long series of increasingly
    : sophisticated heuristics for madvise() volume reduction, and even so this
    : remains a common performance issue for people using jemalloc on Linux.
    : Please integrate MADV_FREE; many people will benefit substantially.

    How it works:

    When the madvise syscall is called, the VM clears the dirty bit of the
    ptes covering the range. If memory pressure occurs, the VM checks the
    dirty bit in the page table; if it is still "clean", the page is a
    "lazyfree" page, so the VM can discard it instead of swapping it out.
    If a store to the page happened before the VM picked it for reclaim, the
    dirty bit is set, so the VM swaps the page out instead of discarding it.

    One thing to note is that MADV_FREE fundamentally relies on the dirty
    bit in the page table entry to decide whether the VM is allowed to
    discard the page. IOW, if the page table entry has the dirty bit set,
    the VM must not discard the page.

    However, if, for example, a swap-in happens via a read fault, the page
    table entry does not have the dirty bit set, so MADV_FREE could wrongly
    discard the page.

    To avoid this problem, MADV_FREE performs additional checks on PageDirty
    and PageSwapCache. This works because a swapped-in page lives in the
    swap cache, and once it is evicted from the swap cache it carries the
    PG_dirty flag. Together, these page flag checks effectively prevent
    wrong discards by MADV_FREE.

    However, the problem with the above logic is that a swapped-in page
    still has PG_dirty after it is removed from the swap cache, so the VM
    can no longer consider the page freeable, even if madvise_free is called
    later.

    Look at the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure, so all of the pages are swapped out
    ..
    ..
    var = *ptr;          -> a page is swapped in and may then be removed
                            from the swap cache; the page table entry is
                            not marked dirty, but the page descriptor
                            carries PG_dirty
    ..
    ..
    madvise_free(ptr);   -> does not clear PG_dirty of the page
    ..
    ..
    ..
    .. heavy memory pressure again
    .. this time the VM cannot discard the page because it has *PG_dirty*

    To solve the problem, this patch clears PG_dirty when madvise is called,
    but only if the page is owned exclusively by the current process.
    PG_dirty represents pte dirtiness across several processes, so we may
    clear it only when we own the page exclusively.

    The heaviest users are expected to be general-purpose allocators (e.g.
    jemalloc, tcmalloc, and hopefully glibc in the future); jemalloc and
    tcmalloc already support the feature on other operating systems (e.g.
    FreeBSD). A rough usage sketch follows below.
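
    As a minimal illustration of the intended userspace usage (not taken
    from any particular allocator), the sketch below maps an anonymous
    region, lazily frees it with madvise(MADV_FREE) and then reuses it; the
    fallback #define is only there in case older headers lack the constant:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8              /* Linux value; missing from old headers */
    #endif

    int main(void)
    {
        size_t len = 64UL << 20;     /* 64MB anonymous mapping */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        memset(p, 0xaa, len);        /* dirty the pages */

        /* Tell the kernel these pages may be discarded lazily: under
         * memory pressure they are dropped instead of swapped out, and a
         * later store cancels the free for the touched page. */
        if (madvise(p, len, MADV_FREE) != 0)
            perror("madvise(MADV_FREE)");

        p[0] = 1;                    /* reuse: the store dirties the page again */
        printf("first byte after reuse: %d\n", p[0]);

        munmap(p, len);
        return 0;
    }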

    barrios@blaptop:~/benchmark/ebizzy$ lscpu
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 12
    On-line CPU(s) list: 0-11
    Thread(s) per core: 1
    Core(s) per socket: 1
    Socket(s): 12
    NUMA node(s): 1
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 2
    Stepping: 3
    CPU MHz: 3200.185
    BogoMIPS: 6400.53
    Virtualization: VT-x
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 4096K
    NUMA node0 CPU(s): 0-11

    ebizzy benchmark (./ebizzy -S 10 -n 512)

    Higher avg is better.

    vanilla-jemalloc MADV_free-jemalloc

    1 thread
    records: 10 records: 10
    avg: 2961.90 avg: 12069.70
    std: 71.96(2.43%) std: 186.68(1.55%)
    max: 3070.00 max: 12385.00
    min: 2796.00 min: 11746.00

    2 thread
    records: 10 records: 10
    avg: 5020.00 avg: 17827.00
    std: 264.87(5.28%) std: 358.52(2.01%)
    max: 5244.00 max: 18760.00
    min: 4251.00 min: 17382.00

    4 thread
    records: 10 records: 10
    avg: 8988.80 avg: 27930.80
    std: 1175.33(13.08%) std: 3317.33(11.88%)
    max: 9508.00 max: 30879.00
    min: 5477.00 min: 21024.00

    8 thread
    records: 10 records: 10
    avg: 13036.50 avg: 33739.40
    std: 170.67(1.31%) std: 5146.22(15.25%)
    max: 13371.00 max: 40572.00
    min: 12785.00 min: 24088.00

    16 thread
    records: 10 records: 10
    avg: 11092.40 avg: 31424.20
    std: 710.60(6.41%) std: 3763.89(11.98%)
    max: 12446.00 max: 36635.00
    min: 9949.00 min: 25669.00

    32 thread
    records: 10 records: 10
    avg: 11067.00 avg: 34495.80
    std: 971.06(8.77%) std: 2721.36(7.89%)
    max: 12010.00 max: 38598.00
    min: 9002.00 min: 30636.00

    In summary, MADV_FREE is much faster than MADV_DONTNEED.

    This patch (of 12):

    Add core MADV_FREE implementation.

    [akpm@linux-foundation.org: small cleanups]
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: Mika Penttilä
    Cc: Michael Kerrisk
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Jason Evans
    Cc: Daniel Micay
    Cc: "Kirill A. Shutemov"
    Cc: Shaohua Li
    Cc:
    Cc: Andy Lutomirski
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: "Shaohua Li"
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The patch replaces THP_SPLIT with three events: THP_SPLIT_PAGE,
    THP_SPLIT_PAGE_FAILED and THP_SPLIT_PMD. It reflects the fact that we
    are going to be able to split a PMD without splitting the compound page,
    and that split_huge_page() can fail.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Christoph Lameter
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • Currently the vmstat updater is not deferrable as a result of commit
    ba4877b9ca51 ("vmstat: do not use deferrable delayed work for
    vmstat_update"). This in turn can cause multiple interruptions of the
    applications because the vmstat updater may run at any time.

    Make vmstat_update deferrable again and provide a function that folds
    the differentials when the processor is going into idle mode, thus
    addressing the issue of the above commit in a clean way.

    Note that the shepherd thread will continue scanning the differentials
    from another processor and will reenable the vmstat workers if it
    detects any changes.

    Fixes: ba4877b9ca51 ("vmstat: do not use deferrable delayed work for vmstat_update")
    Signed-off-by: Christoph Lameter
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Jan, 2016

1 commit

  • kernel test robot has reported the following crash:

    BUG: unable to handle kernel NULL pointer dereference at 00000100
    IP: [] __queue_work+0x26/0x390
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
    Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
    CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
    Workqueue: events vmstat_shepherd
    task: cb684600 ti: cb7ba000 task.ti: cb7ba000
    EIP: 0060:[] EFLAGS: 00010046 CPU: 0
    EIP is at __queue_work+0x26/0x390
    EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
    ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
    DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
    CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
    Stack:
    Call Trace:
    __queue_delayed_work+0xa1/0x160
    queue_delayed_work_on+0x36/0x60
    vmstat_shepherd+0xad/0xf0
    process_one_work+0x1aa/0x4c0
    worker_thread+0x41/0x440
    kthread+0xb0/0xd0
    ret_from_kernel_thread+0x21/0x40

    The reason is that start_shepherd_timer schedules the shepherd work item
    (vmstat_shepherd), which uses vmstat_wq, before setup_vmstat allocates
    that workqueue, so if the remaining initialization takes more than HZ we
    might end up queueing work on a NULL vmstat_wq. This is really unlikely
    but not impossible.

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Reported-by: kernel test robot
    Signed-off-by: Michal Hocko
    Tested-by: Tetsuo Handa
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Dec, 2015

1 commit

  • mod_zone_page_state() takes a "delta" integer argument. delta contains
    the number of pages that should be added or subtracted from a struct
    zone's vm_stat field.

    If a zone is larger than 8TB this will cause overflows. E.g. for a
    zone with a size slightly larger than 8TB the line

    mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

    in mm/page_alloc.c:free_area_init_core() will produce a negative value
    for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
    corresponds to 0x8xxxxxxx pages, which will be sign extended to a
    negative value.

    Fix this by changing the delta argument to long type.
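
    A tiny arithmetic illustration of the problem (assuming 4KB pages and an
    LP64 system, so long is 64-bit; this is not the kernel code itself):

    #include <stdio.h>

    int main(void)
    {
        unsigned long managed_pages = 0x80000000UL; /* ~8TB worth of 4KB pages */
        int  delta_int  = managed_pages;   /* old prototype: int delta    */
        long delta_long = managed_pages;   /* fixed prototype: long delta */

        /* The int version wraps to a negative value, so NR_ALLOC_BATCH
         * ends up negative; the long version keeps the correct count. */
        printf("int delta:  %d\n", delta_int);
        printf("long delta: %ld\n", delta_long);
        return 0;
    }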

    This could fix an early boot problem seen on s390, where we have a 9TB
    system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
    rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
    completely empty, so it tries to reclaim pages in an endless loop.

    This was seen on a heavily patched 3.10 kernel. One possible
    explanation seems to be the overflow caused by mod_zone_page_state().
    Unfortunately I did not have the chance to verify that this patch
    actually fixes the problem, since I don't have access to the system
    right now. However the overflow problem does exist anyway.

    Given the description that a system with slightly less than 8TB does
    work, this seems to be a candidate for the observed problem.

    Signed-off-by: Heiko Carstens
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

13 Dec, 2015

2 commits

  • Tetsuo Handa has reported that the system might basically livelock in
    an OOM condition without triggering the OOM killer.

    The issue is caused by internal dependency of the direct reclaim on
    vmstat counter updates (via zone_reclaimable) which are performed from
    the workqueue context. If all the current workers get assigned to an
    allocation request, though, they will be looping inside the allocator
    trying to reclaim memory but zone_reclaimable can see stalled numbers so
    it will consider a zone reclaimable even though it has been scanned way
    too much. The WQ concurrency logic will not consider this situation a
    congested workqueue because it relies on the assumption that a worker
    would have to sleep in such a situation. This also means that it doesn't
    try to spawn new workers or invoke the rescuer thread if one is assigned
    to the queue.

    In order to fix this issue we need to do two things. First, we have to
    let the wq concurrency code know that we are in trouble, so we have to
    do a short sleep. In order to avoid reintroducing the issues handled by
    0e093d99763e ("writeback: do not sleep on the congestion queue if there
    are no congested BDIs or if significant congestion is not being
    encountered in the current zone"), we limit the sleep to worker threads,
    which are the ones of interest anyway.

    The second thing to do is to create a dedicated workqueue for vmstat and
    mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
    have a spare worker thread for it.

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 016c13daa5c9 ("mm, page_alloc: use masks and shifts when
    converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
    MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names
    wasn't updated to reflect that.

    As a result, the file /proc/pagetypeinfo shows the counts for Movable as
    Reclaimable and vice versa.

    Additionally, commit 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks
    for high-order atomic allocations on demand") introduced
    MIGRATE_HIGHATOMIC, but did not add a letter for it to
    show_migration_types(), so it does not appear in the listing of free
    areas during page allocation failures or OOM kills.

    This patch fixes both problems. The atomic reserves will show with a
    letter 'H' in the free areas listings.
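
    A self-contained toy version of the first problem (the enum and name
    arrays below are simplified stand-ins, not the kernel's full
    definitions): when the enum order changes but the name array does not,
    the wrong label is printed for each type:

    #include <stdio.h>

    /* order with MIGRATE_MOVABLE ahead of MIGRATE_RECLAIMABLE, as after
     * the enum swap */
    enum migratetype { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE,
                       MIGRATE_RECLAIMABLE, MIGRATE_TYPES };

    /* stale array still in the old order */
    static const char * const stale_names[MIGRATE_TYPES] = {
        "Unmovable", "Reclaimable", "Movable",
    };

    /* array updated to match the enum */
    static const char * const fixed_names[MIGRATE_TYPES] = {
        "Unmovable", "Movable", "Reclaimable",
    };

    int main(void)
    {
        printf("MIGRATE_MOVABLE shown as:     %s (stale) vs %s (fixed)\n",
               stale_names[MIGRATE_MOVABLE], fixed_names[MIGRATE_MOVABLE]);
        printf("MIGRATE_RECLAIMABLE shown as: %s (stale) vs %s (fixed)\n",
               stale_names[MIGRATE_RECLAIMABLE], fixed_names[MIGRATE_RECLAIMABLE]);
        return 0;
    }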

    Fixes: 016c13daa5c9 ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
    Fixes: 0aaa29a56e4f ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

07 Nov, 2015

2 commits

  • High-order watermark checking exists for two reasons -- kswapd high-order
    awareness and protection for high-order atomic requests. Historically the
    kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
    high-order free pages for as long as possible. This patch introduces
    MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
    allocations on demand and avoids using those blocks for order-0
    allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

    A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
    allocation request steals a pageblock, but the total number is limited
    to 1% of the zone. Callers that speculatively abuse atomic allocations
    for long-lived high-order allocations to access the reserve will quickly
    fail.
    Note that SLUB is currently not such an abuser as it reclaims at least
    once. It is possible that the pageblock stolen has few suitable
    high-order pages and will need to steal again in the near future but there
    would need to be strong justification to search all pageblocks for an
    ideal candidate.

    The pageblocks are unreserved if an allocation fails after a direct
    reclaim attempt.

    The watermark checks account for the reserved pageblocks when the
    allocation request is not a high-order atomic allocation.

    The reserved pageblocks can not be used for order-0 allocations. This may
    allow temporary wastage until a failed reclaim reassigns the pageblock.
    This is deliberate as the intent of the reservation is to satisfy a
    limited number of atomic high-order short-lived requests if the system
    requires them.

    The stutter benchmark was used to evaluate this but while it was running
    there was a systemtap script that randomly allocated between 1 high-order
    page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
    is much larger than the potential reserve and it does not attempt to be
    realistic. It is intended to stress random high-order allocations from
    an unknown source and to show that there is a reduction in failures
    without introducing an anomaly where atomic allocations are more
    reliable than regular allocations. The amount of memory reserved varied
    throughout the
    workload as reserves were created and reclaimed under memory pressure.
    The allocation failures once the workload warmed up were as follows:

    4.2-rc5-vanilla 70%
    4.2-rc5-atomic-reserve 56%

    The failure rate was also measured while building multiple kernels: it
    was 14% without this patch and 6% with it applied.

    Overall, this is a small reduction but the reserves are small relative to
    the number of allocation requests. In early versions of the patch, the
    failure rate reduced by a much larger amount but that required much larger
    reserves and perversely made atomic allocations seem more reliable than
    regular allocations.

    [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MIGRATE_RESERVE preserves an old property of the buddy allocator that
    existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
    tended to remain contiguous until the only alternative was to fail the
    allocation. At the time it was discovered that high-order atomic
    allocations relied on this property, so MIGRATE_RESERVE was introduced.
    A later patch will introduce an alternative, MIGRATE_HIGHATOMIC, so this
    patch deletes MIGRATE_RESERVE and its supporting code to make that
    easier to review.
    Note that this patch in isolation may look like a false regression if
    someone was bisecting high-order atomic allocation failures.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

06 Nov, 2015

1 commit

  • With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
    (4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
    stack. Uninlining node_page_state() reduces this to 440 bytes.

    The stack consumption issue is fixed by newer gcc (4.8.4); however, with
    that compiler this patch reduces the node.o text size from 7314 bytes to
    4578 bytes.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton