05 Aug, 2016

1 commit

  • Paul Mackerras and Reza Arbab reported that machines with memoryless
    nodes fail when vmstats are refreshed. Paul reported an oops as follows

    Unable to handle kernel paging request for data at address 0xff7a10000
    Faulting instruction address: 0xc000000000270cd0
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
    task: c000000ff0680010 task.stack: c000000ff0704000
    NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
    REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
    MSR: 9000000102009033 CR: 846b6824 XER: 20000000
    CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
    NIP refresh_zone_stat_thresholds+0x80/0x240
    LR refresh_zone_stat_thresholds+0x98/0x240
    Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

    Both supplied potential fixes but one potentially missed checks and the
    other had redundant initialisations. This version initialises
    per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.
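    A minimal sketch of that approach (assuming the 4.8-era
    setup_per_cpu_pageset() and struct per_cpu_nodestat; illustrative, not a
    verbatim excerpt):

    void __init setup_per_cpu_pageset(void)
    {
            struct pglist_data *pgdat;
            struct zone *zone;

            for_each_populated_zone(zone)
                    setup_zone_pageset(zone);

            /*
             * Allocate node stats for every online pgdat, including
             * memoryless nodes, rather than as a side-effect of zone init.
             */
            for_each_online_pgdat(pgdat)
                    pgdat->per_cpu_nodestats =
                            alloc_percpu(struct per_cpu_nodestat);
    }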

    Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Paul Mackerras
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Aug, 2016

1 commit

  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary (one patch per tree,
    with a check in a month or two to remove the old definitions).
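    As an illustration, each treewide conversion is a one-line rename
    (setup_foo() here is a hypothetical example function):

    -static int __init_refok setup_foo(void)
    +static int __ref setup_foo(void)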

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     

29 Jul, 2016

28 commits

  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
    all, unless the system admin has overridden the default, or the application
    has indicated via madvise that it can benefit from THPs. In both cases, it
    means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
    wouldn't, the second compaction attempt is still async and will detect the
    contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
    in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In the context of direct compaction, for some types of allocations we
    would like the compaction to either succeed or definitely fail while
    trying as hard as possible. Current async/sync_light migration mode is
    insufficient, as there are heuristics such as caching scanner positions,
    marking pageblocks as unsuitable or deferring compaction for a zone. At
    least the final compaction attempt should be able to override these
    heuristics.

    To communicate how hard compaction should try, we replace migration mode
    with a new enum compact_priority and change the relevant function
    signatures. In compact_zone_order() where struct compact_control is
    constructed, the priority is mapped to suitable control flags. This
    patch itself has no functional change, as the current priority levels
    are mapped back to the same migration modes as before. Expanding them
    will be done next.

    Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
    removed, as the only caller exists under CONFIG_COMPACTION.
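    For reference, a sketch of the new enum and its initial mapping back to
    migration modes (names as in this patch; treat as illustrative):

    enum compact_priority {
            COMPACT_PRIO_SYNC_LIGHT,
            MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
            COMPACT_PRIO_ASYNC,
            INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
    };

    /* In compact_zone_order(), priority maps to a migration mode: */
    .mode = (prio == COMPACT_PRIO_ASYNC) ?
                            MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,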

    Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
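    The resulting flag definitions look roughly like this (4.8-era gfp.h;
    illustrative):

    #define GFP_TRANSHUGE_LIGHT     ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                     __GFP_NOMEMALLOC | __GFP_NOWARN) & \
                                     ~__GFP_RECLAIM)
    #define GFP_TRANSHUGE           (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)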

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Since THP allocations during page faults can be costly, extra decisions
    are employed for them to avoid excessive reclaim and compaction, if the
    initial compaction doesn't look promising. The detection has never been
    perfect as there is no gfp flag specific to THP allocations. At this
    moment it checks the whole combination of flags that makes up
    GFP_TRANSHUGE, and hopes that no other users of such a combination exist,
    or would mind being treated the same way. Extra care is also taken to
    separate allocations from khugepaged, where latency doesn't matter that
    much.

    It is however possible to distinguish these allocations in a simpler and
    more reliable way. The key observation is that after the initial
    compaction followed by the first iteration of "standard"
    reclaim/compaction, both __GFP_NORETRY allocations and costly
    allocations without __GFP_REPEAT are declared as failures:

    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
            goto nopage;

    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_REPEAT
     */
    if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
            goto nopage;

    This means we can further distinguish allocations that are costly order
    *and* additionally include the __GFP_NORETRY flag. As it happens,
    GFP_TRANSHUGE allocations do already fall into this category. This will
    also allow other costly allocations with similar high-order benefit vs
    latency considerations to use this semantic. Furthermore, we can
    distinguish THP allocations that should try a bit harder (such as from
    khugepageed) by removing __GFP_NORETRY, as will be done in the next
    patch.
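    A sketch of the resulting check in the allocator slowpath (shape as
    described above; illustrative, not a verbatim excerpt):

    /*
     * Costly order plus __GFP_NORETRY now identifies lightweight costly
     * allocations such as THP page faults, with no need to compare the
     * gfp_mask against the full GFP_TRANSHUGE combination.
     */
    if (order > PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_NORETRY)) {
            /* back off early instead of retrying reclaim/compaction */
    }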

    Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The retry loop in __alloc_pages_slowpath is supposed to keep trying
    reclaim and compaction (and OOM), until either the allocation succeeds,
    or it returns with failure. Success here is more probable when reclaim
    precedes compaction, as certain watermarks have to be met for compaction
    to even try, and more free pages increase the probability of compaction
    success. On the other hand, starting with light async compaction (if
    the watermarks allow it), can be more efficient, especially for smaller
    orders, if there's enough free memory which is just fragmented.

    Thus, the current code starts with compaction before reclaim, and to
    make sure that the last reclaim is always followed by a final
    compaction, there's another direct compaction call at the end of the
    loop. This makes the code hard to follow and adds some duplicated
    handling of migration_mode decisions. It's also somewhat inefficient
    that even if reclaim or compaction decides not to retry, the final
    compaction is still attempted. Some gfp flag combinations also shortcut
    these retry decisions with "goto noretry;", making the code even harder
    to follow.

    This patch attempts to restructure the code with only minimal functional
    changes. The call to the first compaction and THP-specific checks are
    now placed above the retry loop, and the "noretry" direct compaction is
    removed.

    The initial compaction is additionally restricted only to costly orders,
    as we can expect smaller orders to be held back by watermarks, and only
    larger orders to suffer primarily from fragmentation. This better
    matches the checks in reclaim's shrink_zones().

    There are two other smaller functional changes. One is that the upgrade
    from async migration to light sync migration will always occur after the
    initial compaction. This is how it has been until the recent patch "mm,
    oom: protect !costly allocations some more", which introduced upgrading
    the mode based on COMPACT_COMPLETE result, but kept the final compaction
    always upgraded, which made it even more special. It's better to return
    to the simpler handling for now, as migration modes will be further
    modified later in the series.

    The second change is that once both reclaim and compaction declare it's
    not worth retrying the reclaim/compact loop, there is no final
    compaction attempt. As argued above, this is intentional. If that
    final compaction were to succeed, it would be due to a wrong retry
    decision, or simply a race with somebody else freeing memory for us.

    The main outcome of this patch should be simpler code. Logically, the
    initial compaction without reclaim is the exceptional case to the
    reclaim/compaction scheme, but prior to the patch, it was the last loop
    iteration that was exceptional. Now the code matches the logic better.
    The change also enables the following patches.

    Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
    kswapd, it first tries get_page_from_freelist() with the new
    alloc_flags, as it may succeed e.g. due to using min watermark instead
    of low watermark. It makes sense to do this attempt before adjusting the
    zonelist based on alloc_flags/gfp_mask, as it's still a relatively fast
    path if we just wake up kswapd and successfully allocate.

    This patch therefore moves the initial attempt above the retry label and
    reorganizes the part below the retry label a bit. We still have to
    attempt get_page_from_freelist() on each retry, as some allocations
    cannot do that as part of direct reclaim or compaction, and yet are not
    allowed to fail (even though they do a WARN_ON_ONCE() and thus should
    not exist). We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
    and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
    it. As a side-effect, the attempts from direct reclaim/compaction will
    also no longer obey watermarks once this is set, but there's little harm
    in that.

    Kswapd wakeups are also done on each retry to be safe from potential
    races resulting in kswapd going to sleep while a process (that may not
    be able to reclaim by itself) is still looping.
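    The reorganised slowpath then roughly takes this shape (a simplified
    sketch of the 4.8-era __alloc_pages_slowpath(), not a verbatim excerpt):

            if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                    wake_all_kswapds(order, ac);

            /* initial attempt with the new alloc_flags, before "retry" */
            page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
            if (page)
                    goto got_pg;

    retry:
            /* re-wake kswapd on every retry in case it went to sleep */
            if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                    wake_all_kswapds(order, ac);

            if (gfp_pfmemalloc_allowed(gfp_mask))
                    alloc_flags = ALLOC_NO_WATERMARKS;

            page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
            if (page)
                    goto got_pg;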

    Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
    initialized, so move the initialization above the retry: label. Also
    make the comment above the initialization more descriptive.

    The only exception to alloc_flags being constant is
    ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
    allocating thread. We can fix this, and make the code simpler and a bit
    more effective at the same time, by moving the part that determines
    ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

    This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
    places in __alloc_pages_slowpath() anymore. The only two tests for the
    flag can instead call gfp_pfmemalloc_allowed().
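    After the move, gfp_pfmemalloc_allowed() looks roughly like this
    (4.8-era code; illustrative):

    bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
    {
            if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
                    return false;
            if (gfp_mask & __GFP_MEMALLOC)
                    return true;
            if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
                    return true;
            /* the TIF_MEMDIE part moved here from gfp_to_alloc_flags() */
            if (!in_interrupt() &&
                            ((current->flags & PF_MEMALLOC) ||
                             unlikely(test_thread_flag(TIF_MEMDIE))))
                    return true;
            return false;
    }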

    Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
    This only makes sense if each kernel stack exists entirely in one zone,
    and allowing vmapped stacks could break this assumption.

    Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
    allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
    architectures. Keep it simple and use KiB.
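    The accounting then becomes a simple unit conversion (sketch based on
    the patch's account_kernel_stack(); illustrative):

    /* charge in KiB, which divides both PAGE_SIZE and THREAD_SIZE on
     * every architecture; account is +1 on alloc, -1 on free */
    mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                        THREAD_SIZE / 1024 * account);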

    Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
    Signed-off-by: Andy Lutomirski
    Cc: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Josh Poimboeuf
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When I did a stress test with hackbench, I got OOM messages frequently,
    which never happened with zone-lru.

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    ..
    ..
    __alloc_pages_nodemask+0xe52/0xe60
    ? new_slab+0x39c/0x3b0
    new_slab+0x39c/0x3b0
    ___slab_alloc.constprop.87+0x6da/0x840
    ? __alloc_skb+0x3c/0x260
    ? _raw_spin_unlock_irq+0x27/0x60
    ? trace_hardirqs_on_caller+0xec/0x1b0
    ? finish_task_switch+0xa6/0x220
    ? poll_select_copy_remaining+0x140/0x140
    __slab_alloc.isra.81.constprop.86+0x40/0x6d
    ? __alloc_skb+0x3c/0x260
    kmem_cache_alloc+0x22c/0x260
    ? __alloc_skb+0x3c/0x260
    __alloc_skb+0x3c/0x260
    alloc_skb_with_frags+0x4e/0x1a0
    sock_alloc_send_pskb+0x16a/0x1b0
    ? wait_for_unix_gc+0x31/0x90
    ? alloc_set_pte+0x2ad/0x310
    unix_stream_sendmsg+0x28d/0x340
    sock_sendmsg+0x2d/0x40
    sock_write_iter+0x6c/0xc0
    __vfs_write+0xc0/0x120
    vfs_write+0x9b/0x1a0
    ? __might_fault+0x49/0xa0
    SyS_write+0x44/0x90
    do_fast_syscall_32+0xa6/0x1e0
    sysenter_past_esp+0x45/0x74

    Mem-Info:
    active_anon:104698 inactive_anon:105791 isolated_anon:192
    active_file:433 inactive_file:283 isolated_file:22
    unevictable:0 dirty:0 writeback:296 unstable:0
    slab_reclaimable:6389 slab_unreclaimable:78927
    mapped:474 shmem:0 pagetables:101426 bounce:0
    free:10518 free_pcp:334 free_cma:0
    Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
    DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
    Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
    HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    25121 total pagecache pages
    24160 pages in swap cache
    Swap cache stats: add 86371, delete 62211, find 42865/60187
    Free swap = 4015560kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9658 pages reserved
    0 pages cma reserved

    The order-0 allocation for the normal zone failed while there was a lot
    of reclaimable memory (i.e., anonymous memory with free swap). I wanted
    to analyze the problem but it was hard because we removed the per-zone
    LRU stats, so I couldn't tell how much anonymous memory there was in the
    normal/DMA zones.

    When we investigate an OOM problem, the reclaimable memory count is a
    crucial stat for finding the cause. Without it, it's hard to parse the
    OOM message, so I believe we should keep it.

    With per-zone lru stat,

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    Mem-Info:
    active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    54409 total pagecache pages
    53215 pages in swap cache
    Swap cache stats: add 300982, delete 247765, find 157978/226539
    Free swap = 3803244kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9642 pages reserved
    0 pages cma reserved

    With that, we can see the normal zone has about 86M of reclaimable
    memory, so we know something is going wrong in reclaim (I will fix the
    problem in the next patch).

    [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
    Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    node_pages_scanned represents the number of pages scanned on a node for
    reclaim, so it's pointless to show it in kilobytes.

    As well, node_pages_scanned is a per-node value, not per-zone.

    This patch changes node_pages_scanned from per-zone kilobytes to a
    per-node count.

    [minchan@kernel.org: fix node_pages_scanned]
    Link: http://lkml.kernel.org/r/20160716101431.GA10305@bbox
    Link: http://lkml.kernel.org/r/1468588165-12461-5-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before considering the
    OOM killer.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages. As
    those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is partially a preparation patch for more vmstat work but it also
    has the slight advantage that __count_zid_vm_events is cheaper to
    calculate than __count_zone_vm_events().
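    The zid-based helper reduces to a simple offset calculation (4.8-era
    macro; illustrative):

    #define __count_zid_vm_events(item, zid, delta) \
            __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)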

    Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a page is about to be dirtied then the page allocator attempts to
    limit the total number of dirty pages that exist in any given zone.
    The call to node_dirty_ok is expensive so this patch records if the last
    pgdat examined hit the dirty limits. In some cases, this reduces the
    number of calls to node_dirty_ok().
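    A sketch of the caching in get_page_from_freelist()
    (last_pgdat_dirty_limit is the variable this patch introduces;
    illustrative):

    if (ac->spread_dirty_pages) {
            /* skip the node already known to be over its dirty limit */
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                    continue;

            if (!node_dirty_ok(zone->zone_pgdat)) {
                    last_pgdat_dirty_limit = zone->zone_pgdat;
                    continue;
            }
    }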

    Link: http://lkml.kernel.org/r/1467970510-21195-31-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer
    be an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective and will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.
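    At the interface level, the conversion is (4.8-era prototypes;
    illustrative):

    /* before: reclaim driven per zone */
    int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order);

    /* after: reclaim driven per node */
    int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
                     unsigned int order);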

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The ac_classzone_idx is used as the basis for waking kswapd and that is
    based on the preferred zoneref. If the preferred zoneref's first zone
    is lower than what is available on other nodes, it's possible that
    kswapd is woken on a zone with only higher, but still eligible, zones.
    As classzone_idx is strictly adhered to now, it causes a problem because
    eligible pages are skipped.

    For example, node 0 has only DMA32 and node 1 has only NORMAL. An
    allocating context running on node 0 may wake kswapd on node 1 telling
    it to skip all NORMAL pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd is woken when zones are below the low watermark but the wakeup
    decision is not taking the classzone into account. Now that reclaim is
    node-based, it is only required to wake kswapd once per node and only if
    all zones are unbalanced for the requested classzone.

    Note that one node might be checked multiple times if the zonelist is
    ordered by node because there is no cheap way of tracking what nodes
    have already been visited. For zone-ordering, each node should be
    checked only once.

    Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to the node. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim makes decisions based on the number of pages that are mapped but
    it's mixing node and zone information. Account NR_FILE_MAPPED and
    NR_ANON_PAGES pages on the node.
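    For example, in the rmap code the accounting call becomes node-based
    (sketch; illustrative):

    /* in page_add_file_rmap() and friends: account against the node */
    __inc_node_page_state(page, NR_FILE_MAPPED);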

    Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Historically dirty pages were spread among zones but now that LRUs are
    per-node it is more appropriate to consider dirty pages in a node.

    Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd goes through some complex steps trying to figure out if it should
    stay awake based on the classzone_idx and the requested order. It is
    unnecessarily complex and passes in an invalid classzone_idx to
    balance_pgdat(). What matters most of all is whether a larger order has
    been requested and whether kswapd successfully reclaimed at the previous
    order. This patch irons out the logic to check just that and the end
    result is less headache inducing.

    Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both the zone and
    the node. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it makes the series easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It could
    potentially be implemented similarly to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.
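    A compatibility helper keeps existing zone-based call sites working
    (4.8-era mmzone.h; illustrative):

    static inline spinlock_t *zone_lru_lock(struct zone *zone)
    {
            return &zone->zone_pgdat->lru_lock;
    }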

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patchset: "Move LRU page reclaim from zones to nodes v9"

    This series moves LRUs from the zones to the node. While this is a
    current rebase, the test results were based on mmotm as of June 23rd.
    Conceptually, this series is simple but there are a lot of details.
    Some of the broad motivations for this are;

    1. The residency of a page partially depends on what zone the page was
    allocated from. This is partially combatted by the fair zone allocation
    policy but that is a partial solution that introduces overhead in the
    page allocator paths.

    2. Currently, reclaim on node 0 behaves slightly differently to node 1. For
    example, direct reclaim scans in zonelist order and reclaims even if
    the zone is over the high watermark regardless of the age of pages
    in that LRU. Kswapd on the other hand starts reclaim on the highest
    unbalanced zone. A difference in the distribution of file/anon pages
    due to when they were allocated can result in a difference in aging.
    While the fair zone allocation policy mitigates some of the problems
    here, the page reclaim results on a multi-zone node will always be
    different to a single-zone node.

    3. kswapd and the page allocator scan zones in the opposite order to
    avoid interfering with each other. In the ideal case this stops the page
    allocator using pages that were allocated very recently, but it's
    sensitive to timing. When kswapd is reclaiming from lower zones then it's
    great, but during the rebalancing of the highest zone, the page allocator
    and kswapd interfere with each other. It's worse if the highest zone is
    small and difficult to balance.

    4. slab shrinkers are node-based which makes it harder to identify the exact
    relationship between slab reclaim and LRU reclaim.

    The reason we have zone-based reclaim is that we used to have
    large highmem zones in common configurations and it was necessary
    to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
    less of a concern as machines with lots of memory will (or should) use
    64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
    rare. Machines that do use highmem should have lower highmem:lowmem
    ratios than we worried about in the past.

    Conceptually, moving to node LRUs should be easier to understand. The
    page allocator plays fewer tricks to game reclaim and reclaim behaves
    similarly on all nodes.

    The series has been tested on a 16 core UMA machine and a 2-socket 48
    core NUMA machine. The UMA results are presented in most cases as the NUMA
    machine behaved similarly.

    pagealloc
    ---------

    This is a microbenchmark that shows the benefit of removing the fair zone
    allocation policy. It was tested up to order-4 but only orders 0 and 1 are
    shown as the other orders were comparable.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
    Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
    Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
    Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
    Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
    Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
    Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
    Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
    Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
    Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
    Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
    Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
    Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
    Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
    Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
    Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
    Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
    Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
    Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
    Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
    Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
    Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
    Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
    Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
    Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
    Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
    Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

    This shows a steady improvement throughout. The primary benefit is from
    reduced system CPU usage which is obvious from the overall times;

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    User 189.19 191.80
    System 2604.45 2533.56
    Elapsed 2855.30 2786.39

    The vmstats also showed that the fair zone allocation policy was definitely
    removed as can be seen here;

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v8
    DMA32 allocs 28794729769 0
    Normal allocs 48432501431 77227309877
    Movable allocs 0 0

    tiobench on ext4
    ----------------

    tiobench is a benchmark that artificially benefits if old pages remain resident
    while new pages get reclaimed. The fair zone allocation policy mitigates this
    problem so pages age fairly. While the benchmark has problems, it is important
    that tiobench performance remains constant as it implies that page aging
    problems that the fair zone allocation policy fixes are not re-introduced.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
    Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
    Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
    Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
    Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
    Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
    Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
    Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
    Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
    Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
    Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
    Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
    Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
    Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
    Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
    Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
    Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
    Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
    Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
    Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
    Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 approx-v9
    User 645.72 525.90
    System 403.85 331.75
    Elapsed 6795.36 6783.67

    This shows that the series has little or no impact on tiobench, which is
    desirable, along with a reduction in system CPU usage. It indicates that
    the fair zone allocation policy was removed in a manner that didn't
    reintroduce one class of page aging bug. There were only minor
    differences in overall reclaim activity:

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Minor Faults 645838 647465
    Major Faults 573 640
    Swap Ins 0 0
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 46041453 44190646
    Normal allocs 78053072 79887245
    Movable allocs 0 0
    Allocation stalls 24 67
    Stall zone DMA 0 0
    Stall zone DMA32 0 0
    Stall zone Normal 0 2
    Stall zone HighMem 0 0
    Stall zone Movable 0 65
    Direct pages scanned 10969 30609
    Kswapd pages scanned 93375144 93492094
    Kswapd pages reclaimed 93372243 93489370
    Direct pages reclaimed 10969 30609
    Kswapd efficiency 99% 99%
    Kswapd velocity 13741.015 13781.934
    Direct efficiency 100% 100%
    Direct velocity 1.614 4.512
    Percentage direct scans 0% 0%

    kswapd activity was roughly comparable. There were differences in direct
    reclaim activity but negligible in the context of the overall workload
    (velocity of 4 pages per second with the patches applied, 1.6 pages per
    second in the baseline kernel).

    pgbench read-only large configuration on ext4
    ---------------------------------------------

    pgbench is a database benchmark that can be sensitive to page reclaim
    decisions. This also checks if removing the fair zone allocation policy
    is safe

    pgbench Transactions
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
    Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
    Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
    Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
    Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
    Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

    Negligible differences again. As with tiobench, overall reclaim activity
    was comparable.

    bonnie++ on ext4
    ----------------

    No interesting performance difference, negligible differences on reclaim
    stats.

    paralleldd on ext4
    ------------------

    This workload uses varying numbers of dd instances to read large amounts of
    data from disk.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
    Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
    Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
    Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
    Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
    Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    User 1548.01 1552.44
    System 8609.71 8515.08
    Elapsed 3587.10 3594.54

    There is little or no change in performance but some drop in system CPU usage.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Minor Faults 362662 367360
    Major Faults 1204 1143
    Swap Ins 22 0
    Swap Outs 2855 1029
    DMA allocs 0 0
    DMA32 allocs 31409797 28837521
    Normal allocs 46611853 49231282
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 40845270 40869088
    Kswapd pages reclaimed 40830976 40855294
    Direct pages reclaimed 0 0
    Kswapd efficiency 99% 99%
    Kswapd velocity 11386.711 11369.769
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Page writes by reclaim 2855 1029
    Page writes file 0 0
    Page writes anon 2855 1029
    Page reclaim immediate 771 1628
    Sector Reads 293312636 293536360
    Sector Writes 18213568 18186480
    Page rescued immediate 0 0
    Slabs scanned 128257 132747
    Direct inode steals 181 56
    Kswapd inode steals 59 1131

    It basically shows that kswapd was active at roughly the same rate in
    both kernels. There was also comparable slab scanning activity and direct
    reclaim was avoided in both cases. There appears to be a large difference
    in the number of inodes reclaimed, but the workload has few active inodes
    so it is likely a timing artifact.

    stutter
    -------

    stutter simulates a simple workload. One part uses a lot of anonymous
    memory, a second measures mmap latency and a third copies a large file.
    The primary metric is checking for mmap latency.

    stutter
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
    1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
    2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
    3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
    Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
    Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
    Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
    Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
    Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
    Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
    Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
    Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
    Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
    Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
    Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
    Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
    Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

    This shows a number of improvements with the worst-case outlier greatly
    improved.

    Some of the vmstats are interesting

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Swap Ins 163 502
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 618719206 1381662383
    Normal allocs 891235743 564138421
    Movable allocs 0 0
    Allocation stalls 2603 1
    Direct pages scanned 216787 2
    Kswapd pages scanned 50719775 41778378
    Kswapd pages reclaimed 41541765 41777639
    Direct pages reclaimed 209159 0
    Kswapd efficiency 81% 99%
    Kswapd velocity 16859.554 14329.059
    Direct efficiency 96% 0%
    Direct velocity 72.061 0.001
    Percentage direct scans 0% 0%
    Page writes by reclaim 6215049 0
    Page writes file 6215049 0
    Page writes anon 0 0
    Page reclaim immediate 70673 90
    Sector Reads 81940800 81680456
    Sector Writes 100158984 98816036
    Page rescued immediate 0 0
    Slabs scanned 1366954 22683

    While this is not guaranteed in all cases, this particular test showed
    a large reduction in direct reclaim activity. It's also worth noting
    that no page writes were issued from reclaim context.

    This series is not without its hazards. There are at least three areas
    that I'm concerned with even though I could not reproduce any problems in
    those areas.

    1. Reclaim/compaction is going to be affected because the amount of reclaim is
    no longer targeted at a specific zone. Compaction works on a per-zone basis
    so there is no guarantee that reclaiming a few THPs' worth of pages will
    have a positive impact on compaction success rates.

    2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
    are called is now different. This may or may not be a problem but if it
    is, it'll be because shrinkers are not called enough and some balancing
    is required.

    3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
    distributed between zones and the fair zone allocation policy used to do
    something very similar for anon. The distribution is now different, not
    necessarily in any way that matters, but it's still worth bearing in mind.

    VM statistic counters for reclaim decisions are zone-based. If the kernel
    is to reclaim on a per-node basis then we need to track per-node
    statistics but there is no infrastructure for that. The most notable
    change is that the old node_page_state is renamed to
    sum_zone_node_page_state. The new node_page_state takes a pglist_data and
    uses per-node stats but none exist yet. There is some renaming such as
    vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
    of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
    patch with no functional change. There is a lot of similarity between the
    node and zone helpers which is unfortunate but there was no obvious way of
    reusing the code and maintaining type safety.
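    A sketch of the resulting helper split (4.8-era prototypes;
    illustrative):

    /* sums the per-zone counters of one node (the old node_page_state) */
    unsigned long sum_zone_node_page_state(int node, enum zone_stat_item item);

    /* reads a true per-node counter (no users until later patches) */
    unsigned long node_page_state(struct pglist_data *pgdat,
                                  enum node_stat_item item);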

    Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The helper early_page_nid_uninitialised() has been dead since commit
    974a786e63c9 ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
    dead code.

    Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    We need to ensure the comment is consistent with the code.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

27 Jul, 2016

10 commits

  • Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. They indicate how many times we allocate and map shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTEs. This way we can avoid an mlock leak into a
    non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check if the page may
    be PTE-mapped. PG_double_map is set by page_add_file_rmap() when the
    page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently, to charge a non-slab allocation to kmemcg one has to use
    alloc_kmem_pages helper with __GFP_ACCOUNT flag. A page allocated with
    this helper should finally be freed using free_kmem_pages, otherwise it
    won't be uncharged.

    This API suits its current users fine, but it turns out to be impossible
    to use along with page reference counting, i.e. when an allocation is
    supposed to be freed with put_page, as is the case with pipe or unix
    socket buffers.

    To overcome this limitation, this patch moves charging/uncharging to
    generic page allocator paths, i.e. to __alloc_pages_nodemask and
    free_pages_prepare, and zaps alloc/free_kmem_pages helpers. This way,
    one can use any of the available page allocation functions to get the
    allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
    just as with kmalloc and friends. A charged page will be
    automatically uncharged on free.

    To make it possible, we need to mark pages charged to kmemcg somehow.
    To avoid introducing a new page flag, we make use of page->_mapcount for
    marking such pages. Since pages charged to kmemcg are not supposed to
    be mapped to userspace, it should work just fine. There are other
    (ab)users of page->_mapcount - buddy and balloon pages - but we don't
    conflict with them.

    In case kmemcg is compiled out or not used at runtime, this patch
    introduces no overhead to generic page allocator paths. If kmemcg is
    used, it costs one extra gfp flags check on alloc and one extra
    page->_mapcount check on free, which shouldn't hurt performance
    because the data accessed are hot.
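
    A minimal usage sketch of the resulting API; the flags and helpers
    are the generic page allocator ones, and the error handling shown is
    only illustrative:

    /* Allocate an order-0 page charged to the current kmemcg. */
    struct page *page = alloc_page(GFP_KERNEL | __GFP_ACCOUNT);
    if (!page)
            return -ENOMEM;
    /* ... use the page, e.g. hang it off a pipe buffer ... */

    /* Dropping the last reference uncharges the page automatically. */
    put_page(page);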

    Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    - Hand the memcg_kmem_enabled() check out to the caller. This reduces
    the number of function definitions, making the code easier to follow.
    At the same time it doesn't result in code bloat, because all of these
    functions are used in only one or two places.

    - Move the __GFP_ACCOUNT check to the caller as well, so that one
    wouldn't have to dive deep into the memcg implementation to see which
    allocations are charged and which are not (see the sketch after this
    list).

    - Refresh comments.
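
    A hedged sketch of the caller-side pattern this yields in the
    allocator's charge path; the cleanup branch is illustrative:

    /* Both cheap checks now live at the call site. */
    if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT)) {
            if (memcg_kmem_charge(page, gfp_mask, order)) {
                    /* charge failed: back out the allocation */
                    __free_pages(page, order);
                    page = NULL;
            }
    }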

    Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Eric Dumazet
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
    This patch is motivated by Hugh's and Vlastimil's concern [1].

    There are two ways to get a free page from the allocator. One is the
    normal memory allocation API and the other is __isolate_free_page(),
    which is used internally for compaction and pageblock isolation. The
    latter is rather tricky since it doesn't do the whole post-allocation
    processing done by the normal API.

    One problem I already know of is that a poisoned page would not be
    checked if it is allocated by __isolate_free_page(). There may be
    more.

    We could add more debug logic for allocated pages in the future, and
    this separation would cause more problems. I'd like to fix this
    situation now. The solution is simple: this patch commonizes the
    logic for newly allocated pages and uses it at all sites, which
    solves the problem.
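
    A hedged sketch of the shape such a common hook could take, gathering
    the post-allocation steps the normal API already performs; the hook
    name and the exact set of calls are assumptions based on the
    description above:

    /* Hypothetical: run once on every newly allocated page. */
    static void post_alloc_hook(struct page *page, unsigned int order,
                                gfp_t gfp_flags)
    {
            set_page_private(page, 0);
            set_page_refcounted(page);

            arch_alloc_page(page, order);
            kernel_map_pages(page, 1 << order, 1);
            kernel_poison_pages(page, 1 << order, 1); /* poison check */
            kasan_alloc_pages(page, order);
            set_page_owner(page, order, gfp_flags);
    }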

    [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

    [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    split_page() calls set_page_owner() to set up page_owner for each
    page. But this has the drawback that the head page and the others end
    up with different stacktraces, because the callsite of
    set_page_owner() is slightly different. To avoid this problem, this
    patch copies the head page's page_owner to the others. It introduces
    a new function, split_page_owner(), but it also removes the other
    function, get_page_owner_gfp(), so it looks like a good trade.
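
    A hedged sketch of what split_page_owner() could look like; the copy
    helper and its name are assumptions for illustration:

    /* Hypothetical: propagate the head page's owner info to the rest. */
    void split_page_owner(struct page *page, unsigned int order)
    {
            int i;

            for (i = 1; i < (1 << order); i++)
                    copy_page_owner(page, page + i); /* assumed helper */
    }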

    Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    It's not necessary to initialize page_owner while holding the zone
    lock; doing so just adds contention on the zone lock. That is not a
    big problem since it is only a debug feature, but it is still better
    to avoid, so do it. This is also a preparation step for using
    stackdepot in the page owner feature: stackdepot allocates new pages
    when there is no reserved space, and holding the zone lock in that
    case would cause a deadlock.

    Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    We don't need to split freepages while holding the zone lock. Doing
    so causes more contention on the zone lock, which is undesirable.

    [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    We have allowed migration for only LRU pages until now, and it was
    enough to build high-order pages. But recently, embedded systems
    (e.g., webOS, Android) use lots of non-movable pages (e.g., zram, GPU
    memory), so we have seen several reports of trouble with small
    high-order allocations. To fix the problem there have been several
    efforts (e.g., enhancing the compaction algorithm, SLUB fallback to
    order-0 pages, reserved memory, vmalloc and so on), but if there are
    lots of non-movable pages in the system, those solutions fall short
    in the long run.

    So, this patch adds a facility to turn non-movable pages into movable
    ones. For this feature, it introduces migration-related functions in
    address_space_operations as well as some page flags.

    If a driver wants to make its own pages movable, it should define
    three functions, which are function pointers of struct
    address_space_operations (a driver-side sketch follows this
    description).

    1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

    What the VM expects of the driver's isolate_page function is to
    return *true* if the driver isolates the page successfully. On
    returning true, the VM marks the page as PG_isolated so that
    concurrent isolation on several CPUs skips the page. If the driver
    cannot isolate the page, it should return *false*.

    Once a page is successfully isolated, the VM uses the page.lru
    fields, so the driver shouldn't expect the values in those fields to
    be preserved.

    2. int (*migratepage) (struct address_space *mapping,
    struct page *newpage, struct page *oldpage, enum migrate_mode);

    After isolation, the VM calls the driver's migratepage with the
    isolated page. The job of migratepage is to move the contents of the
    old page to the new page and set up the fields of struct page for
    newpage. Keep in mind that if you migrated the oldpage successfully
    and return 0, you should indicate to the VM that the oldpage is no
    longer movable via __ClearPageMovable() under page_lock. If the
    driver cannot migrate the page at the moment, it can return -EAGAIN.
    On -EAGAIN, the VM will retry page migration after a short time
    because it interprets -EAGAIN as a "temporary migration failure". On
    returning any error other than -EAGAIN, the VM gives up on migrating
    the page without retrying this time.

    The driver shouldn't touch the page.lru field, which the VM uses, in
    these functions.

    3. void (*putback_page)(struct page *);

    If migration of an isolated page fails, the VM should return the page
    to the driver, so it calls the driver's putback_page with the page
    whose migration failed. In this function, the driver should put the
    isolated page back into its own data structure.

    4. non-lru movable page flags

    There are two page flags for supporting non-lru movable page.

    * PG_movable

    The driver should use the function below, under page_lock, to make a
    page movable:

    void __SetPageMovable(struct page *page, struct address_space *mapping)

    It takes an address_space argument in order to register the family of
    migration functions that the VM will call. Strictly speaking,
    PG_movable is not a real flag of struct page. Rather, the VM reuses
    the lower bits of page->mapping to represent it:

    #define PAGE_MAPPING_MOVABLE 0x2
    page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

    so the driver shouldn't access page->mapping directly. Instead, the
    driver should use page_mapping(), which masks off the low two bits of
    page->mapping to get the right struct address_space.

    For testing non-lru movable pages, the VM provides the __PageMovable
    function. However, it is not guaranteed to identify a non-lru movable
    page, because the page->mapping field is unioned with other variables
    in struct page. Also, if the driver releases the page after the VM
    has isolated it, page->mapping does not keep a stable value even
    though it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
    But __PageMovable is a cheap way to tell whether a page is LRU or
    non-lru movable once the page has been isolated, because LRU pages
    can never have PAGE_MAPPING_MOVABLE in page->mapping. It is also good
    for just peeking at candidate non-lru movable pages before the more
    expensive lock_page check during pfn scanning to select a victim.

    For a guaranteed check of non-lru movable pages, the VM provides the
    PageMovable function. Unlike __PageMovable, PageMovable validates
    page->mapping and mapping->a_ops->isolate_page under lock_page. The
    lock_page prevents page->mapping from being destroyed underneath it.

    A driver using __SetPageMovable should clear the flag via
    __ClearPageMovable under page_lock before releasing the page.

    * PG_isolated

    To prevent concurrent isolation among several CPUs, the VM marks an
    isolated page as PG_isolated under lock_page, so if a CPU encounters
    a PG_isolated non-lru movable page, it can skip it. The driver
    doesn't need to manipulate the flag because the VM sets and clears it
    automatically. Keep in mind that if the driver sees a PG_isolated
    page, the page has been isolated by the VM, so the driver shouldn't
    touch the page.lru field. PG_isolated is aliased with the PG_reclaim
    flag, so the driver shouldn't use the flag for its own purposes.
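
    Putting the pieces together, here is a hedged driver-side sketch of
    the three callbacks and their registration. The mydrv_* names, the
    mydrv_mapping object, and all the bodies are illustrative
    assumptions; only the callback signatures and the helpers named
    above come from this description:

    static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
    {
            /* Take the driver lock, check the page still belongs to us,
             * detach it from the driver's internal lists, etc.
             */
            return true;    /* isolated; return false on failure */
    }

    static int mydrv_migratepage(struct address_space *mapping,
                                 struct page *newpage, struct page *oldpage,
                                 enum migrate_mode mode)
    {
            /* Copy contents and private state from oldpage to newpage;
             * return -EAGAIN instead for a temporary failure.
             */
            __ClearPageMovable(oldpage);    /* oldpage no longer movable */
            return 0;
    }

    static void mydrv_putback_page(struct page *page)
    {
            /* Migration failed: put the page back on our own lists. */
    }

    static const struct address_space_operations mydrv_aops = {
            .isolate_page   = mydrv_isolate_page,
            .migratepage    = mydrv_migratepage,
            .putback_page   = mydrv_putback_page,
    };

    /* When a freshly allocated page becomes eligible for migration: */
    lock_page(page);
    __SetPageMovable(page, mydrv_mapping); /* mapping->a_ops = &mydrv_aops */
    unlock_page(page);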

    [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
    Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
    Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
    Signed-off-by: Gioh Kim
    Signed-off-by: Minchan Kim
    Signed-off-by: Ganesh Mahendran
    Acked-by: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Rafael Aquini
    Cc: Jonathan Corbet
    Cc: John Einar Reitan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    The gfp mask is part of the OOM context just like the allocation
    order and nodemask, so let's move it into oom_control instead of
    passing it in the argument list.
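
    A hedged sketch of the resulting structure; the other members and
    their order are illustrative:

    struct oom_control {
            struct zonelist *zonelist;
            nodemask_t      *nodemask;
            gfp_t           gfp_mask;       /* moved from the argument list */
            int             order;
    };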

    Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov