20 Jan, 2017

1 commit

  • commit b4536f0c829c8586544c94735c343f9b5070bd01 upstream.

    Nils Holland and Klaus Ethgen have reported unexpected OOM killer
    invocations on 32-bit kernels starting with 4.8:

    kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
    kworker/u4:5 cpuset=/ mems_allowed=0
    CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
    [...]
    Mem-Info:
    active_anon:58685 inactive_anon:90 isolated_anon:0
    active_file:274324 inactive_file:281962 isolated_file:0
    unevictable:0 dirty:649 writeback:0 unstable:0
    slab_reclaimable:40662 slab_unreclaimable:17754
    mapped:7382 shmem:202 pagetables:351 bounce:0
    free:206736 free_pcp:332 free_cma:0
    Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
    DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 813 3474 3474
    Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
    lowmem_reserve[]: 0 0 21292 21292
    HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB

    The OOM killer is clearly premature because there is still a lot
    of page cache in the Normal zone which should satisfy this lowmem
    request. Further debugging has shown that the reclaim cannot make any
    forward progress because the page cache is hidden in the active list,
    which doesn't get rotated because inactive_list_is_low is not memcg
    aware.

    The code simply subtracts per-zone highmem counters from the respective
    memcg's lru sizes, which doesn't make any sense. We can easily end up
    always seeing the resulting active and inactive counts as 0 and returning
    false. This issue is not limited to 32-bit kernels, but in practice the
    effect on systems without CONFIG_HIGHMEM would be much harder to notice
    because we do not invoke the OOM killer for allocation requests
    targeting < ZONE_NORMAL.

    Fix the issue by tracking per-zone lru page counts in mem_cgroup_per_node
    and subtracting per-memcg highmem counts when memcg is enabled. Introduce
    a helper, lruvec_zone_lru_size, which redirects to either the zone
    counters or mem_cgroup_get_zone_lru_size when appropriate.
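
    A rough sketch of the helper's intent (illustrative, not the verbatim
    diff):

        unsigned long lruvec_zone_lru_size(struct lruvec *lruvec,
                                           enum lru_list lru, int zone_idx)
        {
                /* Use the memcg's own per-zone LRU counts when memcg is enabled. */
                if (!mem_cgroup_disabled())
                        return mem_cgroup_get_zone_lru_size(lruvec, lru, zone_idx);

                /* Otherwise the per-zone vmstat counters are authoritative. */
                return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
                                       NR_ZONE_LRU_BASE + lru);
        }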

    We lose the empty-LRU-but-non-zero-lru-size detection introduced by
    ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
    of the inherent zone vs. node discrepancy.

    Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones inactive ratio")
    Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Tested-by: Nils Holland
    Reported-by: Klaus Ethgen
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

06 Jan, 2017

1 commit

  • commit 5f33a0803bbd781de916f5c7448cbbbbc763d911 upstream.

    Our system uses significantly more slab memory with memcg enabled on
    the latest kernel: with a 3.10 kernel slab uses 2G of memory, while with
    a 4.6 kernel 6G is used. The shrinker has a problem. Suppose we have two
    memcgs for one shrinker. In do_shrink_slab:

    1. Check cg1: nr_deferred = 0, assume total_scan = 700. The batch size
    is 1024, so no memory is freed. nr_deferred = 700.

    2. Check cg2: nr_deferred = 700. Assume freeable = 20, then
    total_scan = 10 or 40. Let's assume it's 10. No memory is freed.
    nr_deferred = 10.

    The deferred share of cg1 is lost in this case, and kswapd will free no
    memory even if it runs the above steps again and again.

    The fix makes sure one memcg's deferred share isn't lost.
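
    A sketch of the corrected bookkeeping at the tail of do_shrink_slab()
    (variable names roughly follow the patch; treat this as illustrative):

        /* Carry forward only what was deferred and not scanned this round. */
        if (next_deferred >= scanned)
                next_deferred -= scanned;
        else
                next_deferred = 0;

        /*
         * Move the unused scan count back into the shrinker in a manner
         * that handles concurrent updates; a later memcg with few freeable
         * objects no longer wipes out work deferred for an earlier one.
         */
        if (next_deferred > 0)
                new_nr = atomic_long_add_return(next_deferred,
                                                &shrinker->nr_deferred[nid]);
        else
                new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);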

    Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

03 Dec, 2016

1 commit

  • Boris Zhmurov has reported RCU stalls during the kswapd reclaim:

    INFO: rcu_sched detected stalls on CPUs/tasks:
    23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
    (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
    Task dump for CPU 23:
    kswapd1 R running task 0 148 2 0x00000008
    Call Trace:
    shrink_node+0xd2/0x2f0
    kswapd+0x2cb/0x6a0
    mem_cgroup_shrink_node+0x160/0x160
    kthread+0xbd/0xe0
    __switch_to+0x1fa/0x5c0
    ret_from_fork+0x1f/0x40
    kthread_create_on_node+0x180/0x180

    A closer code inspection has shown that we might indeed miss all the
    scheduling points in the reclaim path if no pages can be isolated from
    the LRU list. This is a pathological case but other reports from Donald
    Buczek have shown that we might indeed hit such a path:

    clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
    kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
    [...]
    kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

    This is a minute-long snapshot which didn't take a single page from the
    LRU. It is not entirely clear why only 1303 pages were scanned during
    that time (maybe heavy IRQ activity was interfering).

    In any case it looks like we can really hit long periods without
    scheduling on non-preemptive kernels, so an explicit cond_resched() in
    shrink_node_memcg, independent of the reclaim operation, is due.
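
    A simplified sketch of the fix in the shrink_node_memcg() scan loop:

        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
               nr[LRU_INACTIVE_FILE]) {
                for_each_evictable_lru(lru) {
                        /* ... shrink_list() for each LRU with pages left ... */
                }

                /* Yield even when nothing could be isolated this pass. */
                cond_resched();

                /* ... proportional scan adjustments ... */
        }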

    Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Boris Zhmurov
    Tested-by: Boris Zhmurov
    Reported-by: Donald Buczek
    Reported-by: "Christopher S. Aker"
    Reported-by: Paul Menzel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

28 Oct, 2016

1 commit

  • On 4.0, we saw a stack corruption from a page fault entering direct
    memory cgroup reclaim, calling into btrfs_releasepage(), which then
    tried to allocate an extent and recursed back into a kmem charge ad
    nauseam:

    [...]
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    memcg_charge_kmem+0x40/0x80
    new_slab+0x2d9/0x5a0
    __slab_alloc+0x2fd/0x44f
    kmem_cache_alloc+0x193/0x1e0
    alloc_extent_state+0x21/0xc0
    __clear_extent_bit+0x2b5/0x400
    try_release_extent_mapping+0x1a3/0x220
    __btrfs_releasepage+0x31/0x70
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    mem_cgroup_try_charge+0x65/0x1c0
    handle_mm_fault+0x117f/0x1510
    __do_page_fault+0x177/0x420
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30

    On later kernels, kmem charging is opt-in rather than opt-out, and that
    particular kmem allocation in btrfs_releasepage() is no longer being
    charged and won't recurse and overrun the stack anymore.

    But it's not impossible for an accounted allocation to happen from the
    memcg direct reclaim context, and we needed to reproduce this crash many
    times before we even got a useful stack trace out of it.

    Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
    avoid recursing into any other form of direct reclaim. Then let
    recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
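
    A simplified sketch of the two halves of the change (not the verbatim
    diff): memcg reclaim marks itself PF_MEMALLOC like other direct
    reclaimers, and try_charge() lets such contexts bypass the limit instead
    of recursing.

        /* try_to_free_mem_cgroup_pages(): behave like any direct reclaimer */
        current->flags |= PF_MEMALLOC;
        nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
        current->flags &= ~PF_MEMALLOC;

        /* try_charge(): never recurse into reclaim from a reclaim context */
        if (unlikely(current->flags & PF_MEMALLOC))
                goto force;     /* charge over the limit rather than recurse */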

    Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Oct, 2016

5 commits

    Use the existing enums instead of a hardcoded index when looking at the
    zonelist. This makes it more readable. No functionality change by this
    patch.
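
    For illustration (assuming the ZONELIST_FALLBACK enum from
    include/linux/mmzone.h; an example of the kind of change, not the exact
    diff):

        /* before */
        struct zonelist *zonelist = &NODE_DATA(nid)->node_zonelists[0];

        /* after */
        struct zonelist *zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];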

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
    excessive pageout activity during reclaim. Too many pages could be put
    under writeback, so LRUs would be full of unreclaimable pages until the
    IO completed, and in turn the OOM killer could be invoked.

    There have been some important changes introduced in the reclaim path
    since then, though. Writers are throttled by balance_dirty_pages when
    initiating buffered IO, and later, under memory pressure, direct reclaim
    is throttled by wait_iff_congested if the node is considered congested
    by dirty pages on LRUs and the underlying bdi is congested by the queued
    IO. kswapd is throttled as well if it encounters pages marked for
    immediate reclaim or under writeback, which signals that there are too
    many pages under writeback already. Finally, should_reclaim_retry does
    congestion_wait if the reclaim cannot make any progress and there are
    too many dirty/writeback pages.

    Another important aspect is that we do not issue any IO from the direct
    reclaim context anymore. Under a heavy parallel load this could queue a
    lot of IO which would be very scattered and thus inefficient, which
    would just make the problem worse.

    These three mechanisms should throttle and keep the amount of IO in a
    steady state even under heavy IO and memory pressure, so yet another
    throttling point doesn't really seem helpful. Quite the contrary: Mikulas
    Patocka has reported that swap backed by dm-crypt doesn't work properly
    because the swapout IO cannot make sufficient progress, as the writeout
    path depends on the dm_crypt worker which has to allocate memory to
    perform the encryption. In order to guarantee forward progress it relies
    on the mempool allocator. mempool_alloc(), however, prefers to use the
    underlying (usually page) allocator before it grabs objects from the
    pool. Such an allocation can dive into the memory reclaim and
    consequently into throttle_vm_writeout. If there are too many dirty
    pages or pages under writeback, it will get throttled even though it is
    in fact the flusher meant to clear pending pages.
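
    The relevant mempool behaviour, heavily simplified (a sketch of
    mempool_alloc()'s structure, not its full logic):

        /* mempool_alloc(), heavily simplified */
        element = pool->alloc(gfp_temp, pool->pool_data);
        if (element)
                return element;         /* page allocator first; this path can
                                           enter reclaim and, before this patch,
                                           hit throttle_vm_writeout */

        /* Only if that fails, take an object from the reserved pool. */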

    kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
    Workqueue: kcryptd kcryptd_crypt [dm_crypt]
    Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

    Let's just drop throttle_vm_writeout altogether. It is not very helpful
    anymore.

    I have tried to test a potential writeback IO runaway similar to the one
    described in the original patch which introduced it [1]: a small
    virtual machine (512MB RAM, 4 CPUs, 2G of swap space and a disk image on
    a rather slow NFS in sync mode on the host) with 8 parallel writers each
    writing 1G worth of data. As soon as the pagecache fills up and direct
    reclaim hits, I start an anon memory consumer in a loop (allocating 300M
    and exiting after populating it) in the background to make the memory
    pressure even stronger as well as to disrupt the steady state of the IO.
    Direct reclaim is throttled because of the congestion, as is kswapd
    hitting congestion_wait due to nr_immediate, but throttle_vm_writeout
    never triggers the sleep throughout the test. Dirty+writeback stay close
    to nr_dirty_threshold with some fluctuations caused by the anon
    consumer.

    [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
    Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikulas Patocka
    Cc: Marcelo Tosatti
    Cc: NeilBrown
    Cc: Ondrej Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    compaction_ready() is used during direct reclaim for costly order
    allocations to skip reclaim for zones where compaction should be
    attempted instead. It combines the standard compaction_suitable()
    check with its own watermark check based on the high watermark plus an
    extra gap, and the result is confusing at best.

    This patch attempts to better structure and document the checks
    involved. First, compaction_suitable() can determine that the
    allocation should either succeed already, or that compaction doesn't
    have enough free pages to proceed. The third possibility is that
    compaction has enough free pages, but we still decide to reclaim first -
    unless we are already above the high watermark plus gap. This does not
    mean that the reclaim will actually reach this watermark during a single
    attempt; it is rather an over-reclaim protection. So document the
    code as such. The check for compaction_deferred() is removed
    completely, as it in fact had no proper role here.
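
    The restructured check ends up roughly like this (a sketch of the
    post-patch logic):

        static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
        {
                unsigned long watermark;
                enum compact_result suitable;

                suitable = compaction_suitable(zone, sc->order, 0, sc->reclaim_idx);
                if (suitable == COMPACT_SUCCESS)
                        return true;    /* allocation should already succeed */
                if (suitable == COMPACT_SKIPPED)
                        return false;   /* compaction needs reclaim's help first */

                /*
                 * Compaction could proceed; still reclaim a buffer of free pages
                 * for it unless we are already above high watermark + compact_gap().
                 */
                watermark = high_wmark_pages(zone) + compact_gap(sc->order);
                return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
        }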

    The result after this patch is mainly less confusing code. We also
    skip some over-reclaim in cases where the allocation should already
    succeed.

    Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Compaction uses a watermark gap of (2UL << order) pages at various
    places and it's not immediately obvious why. Abstract it through a
    compact_gap() wrapper to create a single place with a thorough
    explanation.
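
    The wrapper itself is tiny (sketch):

        static inline unsigned long compact_gap(unsigned int order)
        {
                /*
                 * Free pages over the watermark so compaction has room both to
                 * migrate from and to: roughly twice the requested allocation.
                 */
                return 2UL << order;
        }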

    [vbabka@suse.cz: clarify the comment of compact_gap()]
    Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
    Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • COMPACT_PARTIAL has historically meant that compaction returned after
    doing some work without fully compacting a zone. It however didn't
    distinguish if compaction terminated because it succeeded in creating
    the requested high-order page. This has changed recently and now we
    only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
    high-order watermark check in compaction_suitable() passes and no
    compaction needs to be done.

    So at this point we can make the return value clearer by renaming it to
    COMPACT_SUCCESS. The next patch will remove some redundant tests for
    success where compaction just returned COMPACT_SUCCESS.

    Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

25 Sep, 2016

1 commit

  • init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
    initialized with zeroes in the init_task, and copied from parent to
    child while it is quiescent in arch_dup_task_struct(); so I went to
    delete it.

    But I inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
    check that it was always empty at that point, and found them firing:
    because memcg reclaim can recurse into global reclaim (when allocating
    biosets for swapout in my case), and arrive back at the init_tlb_ubc()
    in shrink_node_memcg().

    Resetting tlb_ubc.flush_required at that point is wrong: if the upper
    level needs a deferred TLB flush, but the lower level turns out not to,
    we miss a TLB flush. But fortunately, that's the only part of the
    protocol that does not nest: with the initialization removed, cpumask
    collects bits from upper and lower levels, and flushes TLB when needed.

    Fixes: 72b252aed506 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org # 4.3+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Sep, 2016

1 commit

  • Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
    of memory when booting a secondary kernel. Srikar Dronamraju reported
    that multiple nodes may have no memory managed by the buddy allocator
    but still return true for populated_zone().

    Commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") was reported to cause kswapd to spin at 100% CPU usage when
    fadump was enabled. The old code happened to deal with the situation of
    a populated node with zero free pages by coincidence, but the current
    code tries to reclaim populated zones without realising that is
    impossible.

    We cannot just convert populated_zone() as many existing users really
    need to check for present_pages. This patch introduces a managed_zone()
    helper and uses it in the few cases where it is critical that the check
    is made for managed pages -- zonelist construction and page reclaim.
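
    A sketch of the distinction the helpers draw (per the mmzone.h inlines):

        /* Memory is present at all (may be entirely reserved, e.g. by fadump). */
        static inline bool populated_zone(struct zone *zone)
        {
                return zone->present_pages;
        }

        /* Pages actually managed by the buddy allocator - what reclaim needs. */
        static inline bool managed_zone(struct zone *zone)
        {
                return zone->managed_pages;
        }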

    Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Srikar Dronamraju
    Tested-by: Srikar Dronamraju
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Aug, 2016

1 commit

  • We must call shrink_slab() for each memory cgroup on both global and
    memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
    changed that so that now shrink_slab() is only called with memcg != NULL
    on memcg reclaim. As a result, memcg-aware shrinkers (including
    dentry/inode) are never invoked on global reclaim. Fix that.
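
    The fix boils down to the condition guarding the per-memcg shrink_slab()
    call inside the memcg iteration (a simplified sketch of the change):

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
                /* ... shrink_node_memcg(pgdat, memcg, sc, &lru_pages); ... */

                /*
                 * Previously this was skipped on global reclaim; now the
                 * memcg-aware shrinkers run for every memcg visited,
                 * on both global and memcg reclaim.
                 */
                if (memcg)
                        shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
                                    sc->nr_scanned - scanned, lru_pages);
        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));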

    Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
    Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

29 Jul, 2016

28 commits

    With node-lru, if there are enough reclaimable pages in highmem but
    nothing in lowmem, the VM can try to shrink the inactive list even
    though the requested zone is lowmem.

    The problem is that if the inactive list is full of highmem pages then a
    direct reclaimer searching for a lowmem page wastes CPU scanning
    uselessly; it just burns CPU. Worse, many direct reclaimers are stalled
    by too_many_isolated if lots of parallel reclaimers are running even
    though there is no reclaimable memory in the inactive list.

    I ran the experiment 4 times on a 32-bit, 2G, 8-CPU KVM machine to get
    the elapsed times.

    hackbench 500 process 2

    = Old =

    1st: 289s 2nd: 310s 3rd: 112s 4th: 272s

    = Now =

    1st: 31s 2nd: 132s 3rd: 162s 4th: 50s.

    [akpm@linux-foundation.org: fixes per Mel]
    Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Page reclaim determines whether a pgdat is unreclaimable by examining
    how many pages have been scanned since a page was freed and comparing
    that to the LRU sizes. Skipped pages are not reclaim candidates but
    contribute to scanned. This can prematurely mark a pgdat as
    unreclaimable and trigger an OOM kill.

    This patch accounts for skipped pages as a partial scan so that an
    unreclaimable pgdat will still be marked as such, but by scaling the
    cost of a skip it avoids the pgdat being marked prematurely.
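
    One way to express that scaling in isolate_lru_pages() (a sketch of the
    approach, not necessarily the exact arithmetic used upstream):

        if (!list_empty(&pages_skipped)) {
                unsigned long total_skipped = 0;
                int zid;

                for (zid = 0; zid < MAX_NR_ZONES; zid++)
                        total_skipped += nr_skipped[zid];

                /*
                 * Count skipped pages as a partial scan so a pgdat can still
                 * eventually be declared unreclaimable, but at a fraction of
                 * the cost of a real scan unless nothing else was left.
                 */
                scan += list_empty(src) ? total_skipped : total_skipped >> 2;

                list_splice(&pages_skipped, src);
        }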

    Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Minchan Kim reported that with per-zone lru state it was possible to
    identify that a normal zone with 86M of anonymous pages could trigger
    OOM with non-atomic order-0 allocations as all pages in the zone were in
    the active list.

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    Call Trace:
    __alloc_pages_nodemask+0xe52/0xe60
    ? new_slab+0x39c/0x3b0
    new_slab+0x39c/0x3b0
    ___slab_alloc.constprop.87+0x6da/0x840
    ? __alloc_skb+0x3c/0x260
    ? enqueue_task_fair+0x73/0xbf0
    ? poll_select_copy_remaining+0x140/0x140
    __slab_alloc.isra.81.constprop.86+0x40/0x6d
    ? __alloc_skb+0x3c/0x260
    kmem_cache_alloc+0x22c/0x260
    ? __alloc_skb+0x3c/0x260
    __alloc_skb+0x3c/0x260
    alloc_skb_with_frags+0x4e/0x1a0
    sock_alloc_send_pskb+0x16a/0x1b0
    ? wait_for_unix_gc+0x31/0x90
    unix_stream_sendmsg+0x28d/0x340
    sock_sendmsg+0x2d/0x40
    sock_write_iter+0x6c/0xc0
    __vfs_write+0xc0/0x120
    vfs_write+0x9b/0x1a0
    ? __might_fault+0x49/0xa0
    SyS_write+0x44/0x90
    do_fast_syscall_32+0xa6/0x1e0

    Mem-Info:
    active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    54409 total pagecache pages
    53215 pages in swap cache
    Swap cache stats: add 300982, delete 247765, find 157978/226539
    Free swap = 3803244kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9642 pages reserved
    0 pages cma reserved

    The problem is due to the active deactivation logic in
    inactive_list_is_low:

    Node 0 active_anon:404412kB inactive_anon:409040kB

    IOW, (inactive_anon of node * inactive_ratio > active_anon of node)
    because of the highmem anonymous stats, so the VM never deactivates the
    normal zone's anonymous pages.

    This patch is a modified version of Minchan's original solution, built
    upon it. The problem with Minchan's patch is that any low zone with an
    imbalanced list could force a rotation.

    In this patch, a zone-constrained global reclaim will rotate the list if
    the inactive/active ratio of all eligible zones needs to be corrected.
    It is possible that higher zone pages will be initially rotated
    prematurely but this is the safer choice to maintain overall LRU age.
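
    A rough sketch of the resulting check in inactive_list_is_low()
    (illustrative only): the node-wide totals are trimmed to the zones the
    current allocation can actually use.

        struct pglist_data *pgdat = lruvec_pgdat(lruvec);

        inactive = lruvec_lru_size(lruvec, inactive_lru);
        active = lruvec_lru_size(lruvec, active_lru);

        /* Drop pages from zones the current allocation cannot use. */
        for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
                struct zone *zone = &pgdat->node_zones[zid];

                if (!populated_zone(zone))
                        continue;

                inactive -= min(inactive, zone_page_state(zone,
                                        NR_ZONE_LRU_BASE + inactive_lru));
                active -= min(active, zone_page_state(zone,
                                        NR_ZONE_LRU_BASE + active_lru));
        }

        return inactive * inactive_ratio < active;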

    Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When I did a stress test with hackbench, I frequently got OOM messages,
    which never happened with zone-lru.

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    ..
    ..
    __alloc_pages_nodemask+0xe52/0xe60
    ? new_slab+0x39c/0x3b0
    new_slab+0x39c/0x3b0
    ___slab_alloc.constprop.87+0x6da/0x840
    ? __alloc_skb+0x3c/0x260
    ? _raw_spin_unlock_irq+0x27/0x60
    ? trace_hardirqs_on_caller+0xec/0x1b0
    ? finish_task_switch+0xa6/0x220
    ? poll_select_copy_remaining+0x140/0x140
    __slab_alloc.isra.81.constprop.86+0x40/0x6d
    ? __alloc_skb+0x3c/0x260
    kmem_cache_alloc+0x22c/0x260
    ? __alloc_skb+0x3c/0x260
    __alloc_skb+0x3c/0x260
    alloc_skb_with_frags+0x4e/0x1a0
    sock_alloc_send_pskb+0x16a/0x1b0
    ? wait_for_unix_gc+0x31/0x90
    ? alloc_set_pte+0x2ad/0x310
    unix_stream_sendmsg+0x28d/0x340
    sock_sendmsg+0x2d/0x40
    sock_write_iter+0x6c/0xc0
    __vfs_write+0xc0/0x120
    vfs_write+0x9b/0x1a0
    ? __might_fault+0x49/0xa0
    SyS_write+0x44/0x90
    do_fast_syscall_32+0xa6/0x1e0
    sysenter_past_esp+0x45/0x74

    Mem-Info:
    active_anon:104698 inactive_anon:105791 isolated_anon:192
    active_file:433 inactive_file:283 isolated_file:22
    unevictable:0 dirty:0 writeback:296 unstable:0
    slab_reclaimable:6389 slab_unreclaimable:78927
    mapped:474 shmem:0 pagetables:101426 bounce:0
    free:10518 free_pcp:334 free_cma:0
    Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
    DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
    Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
    HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    25121 total pagecache pages
    24160 pages in swap cache
    Swap cache stats: add 86371, delete 62211, find 42865/60187
    Free swap = 4015560kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9658 pages reserved
    0 pages cma reserved

    The order-0 allocation for the normal zone failed while there was a lot
    of reclaimable memory (i.e., anonymous memory with free swap). I wanted
    to analyze the problem but it was hard because we removed the per-zone
    lru stats, so I couldn't tell how much anonymous memory there was in the
    normal/dma zones.

    When we investigate an OOM problem, the reclaimable memory count is a
    crucial stat for finding the problem. Without it, it's hard to parse the
    OOM message, so I believe we should keep it.

    With per-zone lru stat,

    gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
    Mem-Info:
    active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
    Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
    DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 809 1965 1965
    Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
    lowmem_reserve[]: 0 0 9247 9247
    HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
    Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
    HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    54409 total pagecache pages
    53215 pages in swap cache
    Swap cache stats: add 300982, delete 247765, find 157978/226539
    Free swap = 3803244kB
    Total swap = 4192252kB
    524186 pages RAM
    295934 pages HighMem/MovableOnly
    9642 pages reserved
    0 pages cma reserved

    With that, we can see the normal zone has 86M of reclaimable memory, so
    we know something goes wrong in reclaim (I will fix the problem in the
    next patch).
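
    The counters come back as zone_stat_item entries alongside the node
    ones (a sketch of the enum additions):

        enum zone_stat_item {
                NR_FREE_PAGES,
                NR_ZONE_LRU_BASE,       /* compaction/reclaim retries and OOM reports */
                NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
                NR_ZONE_ACTIVE_ANON,
                NR_ZONE_INACTIVE_FILE,
                NR_ZONE_ACTIVE_FILE,
                NR_ZONE_UNEVICTABLE,
                /* ... */
        };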

    [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
    Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    With node-lru, the locking is based on the pgdat. As Minchan pointed
    out, there is an opportunity to reduce LRU lock release/acquire in
    check_move_unevictable_pages by only changing the lock on a pgdat
    change.
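
    The pattern is the usual batched-lock one; a simplified sketch of the
    loop inside check_move_unevictable_pages():

        struct pglist_data *pgdat = NULL;
        int i;

        for (i = 0; i < nr_pages; i++) {
                struct page *page = pages[i];
                struct pglist_data *pagepgdat = page_pgdat(page);

                /* Only drop and retake the LRU lock when the pgdat changes. */
                if (pagepgdat != pgdat) {
                        if (pgdat)
                                spin_unlock_irq(&pgdat->lru_lock);
                        pgdat = pagepgdat;
                        spin_lock_irq(&pgdat->lru_lock);
                }

                /* ... move the page to the right LRU ... */
        }
        if (pgdat)
                spin_unlock_irq(&pgdat->lru_lock);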

    [mgorman@techsingularity.net: remove double initialisation]
    Link: http://lkml.kernel.org/r/20160719074835.GC10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1468853426-12858-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As pointed out by Minchan Kim, shrink_zones() checks for populated zones
    in a zonelist but a zonelist can never contain unpopulated zones. While
    it's not related to the node-lru series, it can be cleaned up now.

    Link: http://lkml.kernel.org/r/1468853426-12858-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Minchan Kim reported hitting the following warning on a 32-bit system,
    although it can affect 64-bit systems.

    WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
    mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
    Modules linked in:
    CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

    LRU list contents and counts are updated separately. Counts are updated
    before pages are added to the LRU and updated after pages are removed.
    The warning above is from a check in mem_cgroup_update_lru_size that
    ensures that list sizes of zero are empty.

    The problem is that node-lru needs to account for highmem pages if
    CONFIG_HIGHMEM is set. One impact of the implementation is that the
    sizes are updated in multiple passes when pages from multiple zones are
    isolated. This happens whether HIGHMEM is set or not. When multiple
    zones are isolated, it's possible for a debugging check in memcg to be
    tripped.

    This patch forces all the zone counts to be updated before the memcg
    function is called.
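
    A sketch of the resulting ordering when a batch of pages is taken off
    the LRU (simplified):

        /* First settle every zone's LRU counter for the pages just isolated... */
        for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                if (!nr_zone_taken[zid])
                        continue;
                __update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
        }

        /* ...and only then tell memcg once about the whole batch. */
        #ifdef CONFIG_MEMCG
        mem_cgroup_update_lru_size(lruvec, lru, -nr_taken);
        #endif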

    Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Tested-by: Minchan Kim
    Reported-by: Minchan Kim
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before the OOM
    killer is considered.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages.
    As those calculations only care about the global count of file pages in
    highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The vmstat allocstall was fairly useful in the general sense but
    node-based LRUs change that. It's important to know if a stall was for
    an address-limited allocation request as this will require skipping
    pages from other zones. This patch adds pgstall_* counters to replace
    allocstall. The sum of the counters will equal the old allocstall so it
    can be trivially recalculated. A high number of address-limited
    allocation requests may result in a lot of useless LRU scanning for
    suitable pages.

    As address-limited allocations require pages to be skipped, it's
    important to know how much useless LRU scanning took place so this patch
    adds pgskip* counters. This yields the following model

    1. The number of address-space limited stalls can be accounted for (pgstall)
    2. The amount of useless work required to reclaim the data is accounted (pgskip)
    3. The total number of scans is available from pgscan_kswapd and pgscan_direct
    so from that the ratio of useful to useless scans can be calculated.

    [mgorman@techsingularity.net: s/pgstall/allocstall/]
    Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is convenient when tracking down why the skip count is high because
    it'll show what classzone kswapd woke up at and what zones are being
    isolated.

    Link: http://lkml.kernel.org/r/1467970510-21195-29-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The buffer_heads_over_limit check in kswapd is inconsistent with direct
    reclaim behaviour. It may force an attempt to reclaim from all zones
    and then not reclaim at all because zones higher than required by the
    original request were already balanced.

    This patch causes kswapd to consider reclaiming from all zones if
    buffer_heads_over_limit. However, if there are eligible zones for the
    allocation request that woke kswapd, then no reclaim will occur even if
    buffer_heads_over_limit. This avoids kswapd over-reclaiming just
    because of buffer_heads_over_limit.

    [mgorman@techsingularity.net: fix comment about buffer_heads_over_limit]
    Link: http://lkml.kernel.org/r/1468404004-5085-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-28-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep()
    always passes in 0 for `remaining' and the second call can trivially
    check the parameter in advance.

    Suggested-by: Minchan Kim
    Link: http://lkml.kernel.org/r/1467970510-21195-27-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The scan_control structure has enough information available for
    compaction_ready() to make a decision. The classzone_idx manipulations
    in shrink_zones() are no longer necessary as the highest populated zone
    is no longer used to determine if shrink_slab should be called or not.

    [mgorman@techsingularity.net: remove redundant check in shrink_zones()]
    Link: http://lkml.kernel.org/r/1468588165-12461-3-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-26-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_node receives all information it needs about classzone_idx from
    sc->reclaim_idx so remove the aliases.

    Link: http://lkml.kernel.org/r/1467970510-21195-25-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones, but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective and will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd is woken when zones are below the low watermark but the wakeup
    decision is not taking the classzone into account. Now that reclaim is
    node-based, it is only required to wake kswapd once per node and only if
    all zones are unbalanced for the requested classzone.

    Note that one node might be checked multiple times if the zonelist is
    ordered by node because there is no cheap way of tracking what nodes
    have already been visited. For zone-ordering, each node should be
    checked only once.

    Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to the
    node. Due to throttling logic in the page allocator for reliable OOM
    detection, it is still necessary to track dirty and writeback pages on a
    per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim makes decisions based on the number of pages that are mapped but
    it's mixing node and zone information. Account NR_FILE_MAPPED and
    NR_ANON_PAGES pages on the node.

    Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd scans from highest to lowest for a zone that requires balancing.
    This was necessary when reclaim was per-zone to fairly age pages on
    lower zones. Now that we are reclaiming on a per-node basis, any
    eligible zone can be used and pages will still be aged fairly. This
    patch avoids reclaiming excessively unless buffer_heads are over the
    limit and it's necessary to reclaim from a higher zone than requested by
    the waker of kswapd to relieve low memory pressure.

    [hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
    Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-13-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim may stall if there is too much dirty or congested data on a
    node. This was previously based on zone flags and the logic for
    clearing the flags is in two places. As congestion/dirty tracking is
    now tracked on a per-node basis, we can remove some duplicate logic.

    Link: http://lkml.kernel.org/r/1467970510-21195-12-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    Direct reclaim iterates over all zones in the zonelist and shrinks
    them, but this is in conflict with node-based reclaim. In the default
    case, only shrink once per node.

    Link: http://lkml.kernel.org/r/1467970510-21195-11-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    kswapd goes through some complex steps trying to figure out if it should
    stay awake based on the classzone_idx and the requested order. It is
    unnecessarily complex and passes in an invalid classzone_idx to
    balance_pgdat(). What matters most of all is whether a larger order has
    been requested and whether kswapd successfully reclaimed at the previous
    order. This patch irons out the logic to check just that and the end
    result is less headache inducing.

    Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The balance gap was introduced to apply equal pressure to all zones when
    reclaiming for a higher zone. With node-based LRU, the need for the
    balance gap is removed and the code is dead so remove it.

    [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
    Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
    thinking of reclaim in terms of nodes but kswapd is still zone-centric.
    This patch gets rid of many of the node-based versus zone-based
    decisions.

    o A node is considered balanced when any eligible lower zone is balanced.
    This eliminates one class of age-inversion problem because we avoid
    reclaiming a newer page just because it's in the wrong zone
    o pgdat_balanced disappears because we now only care about one zone being
    balanced.
    o Some anomalies related to writeback and congestion tracking being based on
    zones disappear.
    o kswapd no longer has to take care to reclaim zones in the reverse order
    that the page allocator uses.
    o Most importantly of all, reclaim from node 0 with multiple zones will
    have similar aging and reclaiming characteristics as every
    other node.

    Link: http://lkml.kernel.org/r/1467970510-21195-8-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman