29 Dec, 2018

1 commit

  • commit 68600f623d69da428c6163275f97ca126e1a8ec5 upstream.

    I've noticed that dying memory cgroups are often pinned in memory by a
    single pagecache page. Even under moderate memory pressure they sometimes
    stayed in such state for a long time. That looked strange.

    My investigation showed that the problem is caused by applying the LRU
    pressure balancing math:

    scan = div64_u64(scan * fraction[lru], denominator),

    where

    denominator = fraction[anon] + fraction[file] + 1.

    Because fraction[lru] is always less than denominator, if the initial scan
    size is 1, the result is always 0.

    This means the last page is not scanned and has
    no chance of being reclaimed.

    Fix this by rounding up the result of the division.

    In practice this change significantly improves the speed of dying cgroups
    reclaim.
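
    The arithmetic can be checked with a small user-space sketch. The
    helper names below are hypothetical, plain 64-bit division stands in
    for the kernel's div64_u64() and DIV64_U64_ROUND_UP(), and the fraction
    values used with it are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the old behaviour: truncating division. */
static uint64_t scan_round_down(uint64_t scan, uint64_t fraction,
                                uint64_t denominator)
{
    return scan * fraction / denominator;
}

/* Hypothetical stand-in for the fix: round the quotient up. */
static uint64_t scan_round_up(uint64_t scan, uint64_t fraction,
                              uint64_t denominator)
{
    return (scan * fraction + denominator - 1) / denominator;
}

/* denominator = fraction[anon] + fraction[file] + 1, as in the message. */
static uint64_t lru_denominator(uint64_t fraction_anon, uint64_t fraction_file)
{
    return fraction_anon + fraction_file + 1;
}
```

    With fraction[anon] = 10 and fraction[file] = 90 the denominator is 101,
    so a scan size of 1 truncates to 0 but rounds up to 1, while a scan size
    of 0 still yields 0 in both variants.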

    [guro@fb.com: prevent double calculation of DIV64_U64_ROUND_UP() arguments]
    Link: http://lkml.kernel.org/r/20180829213311.GA13501@castle
    Link: http://lkml.kernel.org/r/20180827162621.30187-3-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

05 Jun, 2018

1 commit

  • commit 145e1a71e090575c74969e3daa8136d1e5b99fc8 upstream.

    George Boole would have noticed a slight error in 4.16 commit
    69d763fc6d3a ("mm: pin address_space before dereferencing it while
    isolating an LRU page"). Fix it, to match both the comment above it,
    and the original behaviour.

    Although anonymous pages are not marked PageDirty at first, we have an
    old habit of calling SetPageDirty when a page is removed from swap
    cache: so there's a category of ex-swap pages that are easily
    migratable, but were inadvertently excluded from compaction's async
    migration in 4.16.
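
    The Boolean slip can be illustrated with a reduced user-space model.
    The types and function names below are hypothetical. The skip rule is
    the older check "mapping && !mapping->a_ops->migratepage"; the "buggy"
    variant is only one plausible form of the mistake, an assumption rather
    than a quote of the kernel source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: has_migratepage stands for a_ops->migratepage != NULL. */
struct model_mapping {
    bool has_migratepage;
};

/* Older rule: skip a dirty page only if it has a mapping that cannot migrate it. */
static bool skip_dirty_original(const struct model_mapping *mapping)
{
    return mapping && !mapping->has_migratepage;
}

/* Correct negation (De Morgan): migratable if there is no mapping OR migratepage exists. */
static bool migrate_dirty_fixed(const struct model_mapping *mapping)
{
    return !mapping || mapping->has_migratepage;
}

/* Assumed slip: demands a mapping, so pages with a NULL mapping are excluded. */
static bool migrate_dirty_buggy(const struct model_mapping *mapping)
{
    return mapping && mapping->has_migratepage;
}
```

    The two "migrate" variants differ only for a NULL mapping, which in this
    model is the dirty ex-swap anonymous page described above.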

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805302014001.12558@eggly.anvils
    Fixes: 69d763fc6d3a ("mm: pin address_space before dereferencing it while isolating an LRU page")
    Signed-off-by: Hugh Dickins
    Acked-by: Minchan Kim
    Acked-by: Mel Gorman
    Reported-by: Ivan Kalvachev
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

30 May, 2018

1 commit

  • [ Upstream commit e92bb4dd9673945179b1fc738c9817dd91bfb629 ]

    When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at the
    same time. This may lead to the following race.

    CPU1                                                        CPU2

    truncate(inode)                                             shrink_active_list()
      ...                                                         page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                                 mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                                    mapping_unevictable(mapping)
                                                                      test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space in inode and swap cache will be freed after an RCU
    grace period, so the races are fixed by enclosing the page_mapping()
    and address_space usage in rcu_read_lock/unlock(). Some comments are
    added in code to make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

26 Apr, 2018

1 commit

  • [ Upstream commit 69d763fc6d3aee787a3e8c8c35092b4f4960fa5d ]

    Minchan Kim asked the following question: which lock protects
    address_space destruction when a race happens between inode truncation
    and __isolate_lru_page? Jan Kara clarified by describing the race as
    follows:

    CPU1                                                        CPU2

    truncate(inode)                                             __isolate_lru_page()
      ...
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                                 mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                                    if (mapping && !mapping->a_ops->migratepage)
    - we've dereferenced mapping which is potentially already free.

    The race is theoretically possible but unlikely. Before the
    delete_from_page_cache, truncate_cleanup_page is called so the page is
    likely to be !PageDirty or PageWriteback which gets skipped by the only
    caller that checks the mapping in __isolate_lru_page. Even if the race
    occurs, a substantial amount of work has to happen during a tiny window
    with no preemption but it could potentially be done using a virtual
    machine to artificially slow one CPU or halt it during the critical
    window.

    This patch should eliminate the race with truncation by try-locking the
    page before dereferencing mapping and aborting if the lock was not
    acquired. There was a suggestion from Huang Ying to use RCU as a
    side-effect to prevent mapping being freed. However, I do not like the
    solution as it's an unconventional means of preserving a mapping and
    it's not a context where rcu_read_lock is obviously protecting rcu data.

    Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
    Fixes: c82449352854 ("mm: compaction: make isolate_lru_page() filter-aware again")
    Signed-off-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

29 Mar, 2018

1 commit

  • commit 1c610d5f93c709df56787f50b3576704ac271826 upstream.

    Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher's wake up in legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in a cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

25 Feb, 2018

1 commit

  • commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream.

    Syzbot caught an oops at unregister_shrinker() because the combination
    of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    when register_shrinker() failed can simplify the error recovery path,
    this patch makes unregister_shrinker() a no-op when register_shrinker()
    failed. Also, reset shrinker->nr_deferred in case unregister_shrinker()
    is called twice by mistake.
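
    The behaviour described above can be modelled in user space. This is a
    sketch, not the kernel code: the struct, the function names and the
    inject_failure parameter are hypothetical, calloc()/free() stand in for
    the kernel allocator, and a boolean stands in for membership of the
    shrinker list.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a shrinker. */
struct model_shrinker {
    long *nr_deferred;  /* allocated by registration, NULL otherwise */
    bool  on_list;      /* models membership of the global shrinker list */
};

/* Returns 0 on success, -1 if the allocation fails or a failure is injected. */
static int model_register_shrinker(struct model_shrinker *s, bool inject_failure)
{
    s->nr_deferred = inject_failure ? NULL : calloc(1, sizeof(*s->nr_deferred));
    if (!s->nr_deferred)
        return -1;
    s->on_list = true;
    return 0;
}

/* Safe after a failed registration and safe to call twice. */
static void model_unregister_shrinker(struct model_shrinker *s)
{
    if (!s->nr_deferred)
        return;                 /* never registered, or already unregistered */
    s->on_list = false;
    free(s->nr_deferred);
    s->nr_deferred = NULL;      /* makes a second call a no-op */
}
```

    In this model a caller that ignores the return value of the register
    function can still run its normal teardown path without touching a list
    entry that was never added.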

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

07 Sep, 2017

4 commits

  • When swapping out THP (Transparent Huge Page), instead of swapping out
    the THP as a whole, sometimes we have to fall back to splitting the THP
    into normal pages before swapping, because no free swap clusters are
    available, or cgroup limit is exceeded, etc. To count the number of
    fallbacks, a new VM event THP_SWPOUT_FALLBACK is added, and counted
    when we fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • In this patch, splitting transparent huge page (THP) during swapping out
    is delayed from after adding the THP into the swap cache to after
    swapping out finishes. After the patch, more operations for the
    anonymous THP reclaiming, such as writing the THP to the swap device
    and removing the THP from the swap cache, could be batched, so the
    performance of anonymous THP swapping out could be improved.

    This is the second step for the THP swap support. The plan is to delay
    splitting the THP step by step and avoid splitting the THP finally.

    With the patchset, the swap out throughput improves 42% (from about
    5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
    with 16 processes. At the same time, the number of IPIs (which reflects
    TLB flushing) dropped by about 78.9%. The test is done on a Xeon E5 v3
    system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170724051840.2309-12-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Tetsuo Handa has reported[1][2][3] that direct reclaimers might get
    stuck in the too_many_isolated loop basically forever because the last
    few pages on the LRU lists are isolated by kswapd, which is stuck on fs
    locks when doing the pageout or slab reclaim. This in turn means that
    there is nobody to actually trigger the oom killer and the system is
    basically unusable.

    too_many_isolated has been introduced by commit 35cd78156c49 ("vmscan:
    throttle direct reclaim when too many pages are isolated already") to
    prevent premature OOM killer invocations because back then no
    reclaim progress could indeed trigger the OOM killer too early.

    But since the oom detection rework in commit 0a0337e0d1d1 ("mm, oom:
    rework oom detection") the allocation/reclaim retry loop considers all
    the reclaimable pages and throttles the allocation at that layer so we
    can loosen the direct reclaim throttling.

    Make the shrink_inactive_list loop over too_many_isolated bounded and
    return immediately when the situation hasn't resolved after the first
    sleep.

    Replace congestion_wait by a simple schedule_timeout_interruptible
    because we are not really waiting on the IO congestion in this path.
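
    A minimal sketch of such a bounded wait follows, assuming a user-space
    model in which the "too many isolated" check and the sleep are supplied
    as callbacks. All names are hypothetical and the real function handles
    more cases than shown here.

```c
#include <stdbool.h>

typedef bool (*too_many_isolated_fn)(void *ctx);
typedef void (*sleep_fn)(void *ctx);

/*
 * Returns true if reclaim may proceed, false if the condition persisted
 * after one sleep and the caller should give up instead of looping.
 */
static bool model_wait_for_isolated(too_many_isolated_fn too_many,
                                    sleep_fn nap, void *ctx)
{
    bool stalled = false;

    while (too_many(ctx)) {
        if (stalled)
            return false;   /* already slept once: do not loop forever */
        nap(ctx);
        stalled = true;
    }
    return true;
}

/* Example context: reports "too many" for a fixed number of checks. */
struct model_ctx {
    int busy_checks_left;
    int naps;
};

static bool model_too_many(void *ctx)
{
    struct model_ctx *c = ctx;

    if (c->busy_checks_left > 0) {
        c->busy_checks_left--;
        return true;
    }
    return false;
}

static void model_nap(void *ctx)
{
    struct model_ctx *c = ctx;

    c->naps++;
}
```

    However long the condition persists, the model sleeps at most once
    before returning control to the caller.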

    Please note that this patch can theoretically cause the OOM killer to
    trigger earlier while there are many pages isolated for the reclaim
    which makes progress only very slowly. This would be obvious from the
    oom report as the number of isolated pages is printed there. If we
    ever hit this, should_reclaim_retry should consider those numbers in
    the evaluation in one way or another.

    [1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
    [2] http://lkml.kernel.org/r/201702212335.DJB30777.JOFMHSFtVLQOOF@I-love.SAKURA.ne.jp
    [3] http://lkml.kernel.org/r/201706300914.CEH95859.FMQOLVFHJFtOOS@I-love.SAKURA.ne.jp

    [mhocko@suse.com: switch to uninterruptible sleep]
    Link: http://lkml.kernel.org/r/20170724065048.GB25221@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170710074842.23175-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some shrinkers may only be able to free a bunch of objects at a time,
    and so free more than the requested nr_to_scan in one pass.

    Other shrinkers may find themselves unable to scan even as many
    objects as they counted, and so underreport. Account for the extra
    freed/scanned objects against the total number of objects we intend to
    scan, otherwise we may end up penalising the slab far more than
    intended. Similarly, we want to add the underperforming scan to the
    deferred pass so that we try harder and harder in future passes.
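
    The accounting idea can be sketched in user space as follows. This is
    an illustrative model, not the kernel loop: the struct, the callback
    type and both example shrinkers are hypothetical. The point shown is
    that the remaining budget is reduced by what the shrinker reports
    having scanned, not by what was requested.

```c
/* Hypothetical request/report structure passed to a shrinker. */
struct model_shrink_control {
    unsigned long nr_to_scan;   /* how many objects the caller asks for */
    unsigned long nr_scanned;   /* how many the shrinker says it scanned */
};

typedef unsigned long (*model_scan_fn)(struct model_shrink_control *sc);

struct model_result {
    unsigned long freed;
    unsigned long scanned;
    unsigned long calls;
    long remaining;             /* budget left over for a later pass */
};

static struct model_result model_do_shrink(long total_scan, long batch,
                                           model_scan_fn scan)
{
    struct model_result r = { 0, 0, 0, 0 };

    while (total_scan >= batch) {
        /* nr_scanned defaults to the request unless the shrinker updates it. */
        struct model_shrink_control sc = {
            (unsigned long)batch, (unsigned long)batch
        };

        r.freed += scan(&sc);
        r.calls++;
        if (sc.nr_scanned == 0)
            break;              /* no progress reported: stop spinning */
        total_scan -= (long)sc.nr_scanned;
        r.scanned += sc.nr_scanned;
    }
    r.remaining = total_scan;
    return r;
}

/* Example shrinker that can only free objects in chunks of 32. */
static unsigned long model_chunky_scan(struct model_shrink_control *sc)
{
    sc->nr_scanned = 32;
    return 32;
}

/* Example shrinker that manages to scan only half of each request. */
static unsigned long model_lazy_scan(struct model_shrink_control *sc)
{
    sc->nr_scanned = sc->nr_to_scan / 2;
    return sc->nr_scanned;
}
```

    With a budget of 64 and a batch of 16, the chunky shrinker exhausts the
    budget in two calls instead of four, while the lazy one is called seven
    times and leaves 8 objects of budget unscanned.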

    Link: http://lkml.kernel.org/r/20170822135325.9191-1-chris@chris-wilson.co.uk
    Signed-off-by: Chris Wilson
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Joonas Lahtinen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     

10 Aug, 2017

1 commit

  • A while ago someone, and I cannot find the email just now, asked if we
    could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
    like we use for other things like workqueues etc. I think this should
    be possible which allows reducing the 'irq' states and will reduce the
    amount of __bfs() lookups we do.

    Removing the 1 IRQ state results in 4 fewer __bfs() walks per
    dependency, improving lockdep performance. And by moving this
    annotation out of the lockdep code it becomes easier for the mm people
    to extend.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: boqun.feng@gmail.com
    Cc: iamjoonsoo.kim@lge.com
    Cc: kernel-team@lge.com
    Cc: kirill@shutemov.name
    Cc: npiggin@gmail.com
    Cc: walken@google.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
    the page allocator. This has been true but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests and they are
    considered too important to fail so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of __GFP_REPEAT flag has been removed for !costly requests we can
    give the original flag a better name and more importantly a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
    that the allocator would try really hard but there is no promise of a
    success. This will work independent of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example)

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most light weight mode which even
    doesn't kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keep retrying in the page allocator.
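
    The levels listed above can be summarised with a small classifier. This
    is a sketch under stated assumptions: the M_GFP_* bit values are
    invented for the example (the real definitions differ), the enum and
    function names are hypothetical, and the precedence applied when
    several modifier flags are combined is an assumption of this model.

```c
/* Hypothetical bit values, chosen only for this example. */
#define M_GFP_HIGH            0x01u
#define M_GFP_IO              0x02u
#define M_GFP_FS              0x04u
#define M_GFP_DIRECT_RECLAIM  0x08u
#define M_GFP_KSWAPD_RECLAIM  0x10u
#define M_GFP_NORETRY         0x20u
#define M_GFP_RETRY_MAYFAIL   0x40u
#define M_GFP_NOFAIL          0x80u

#define M_GFP_RECLAIM (M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_KERNEL  (M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)

enum model_effort {
    EFFORT_NO_RECLAIM,      /* no attempt to free memory at all */
    EFFORT_KSWAPD_ONLY,     /* may wake kswapd, never reclaims directly */
    EFFORT_FAIL_EARLY,      /* NORETRY: give up rather than disrupt */
    EFFORT_DEFAULT,         /* plain GFP_KERNEL behaviour */
    EFFORT_RETRY_MAYFAIL,   /* try hard, fail if reclaim makes no progress */
    EFFORT_NOFAIL           /* loop until the allocation succeeds */
};

static enum model_effort model_classify(unsigned int gfp)
{
    if (!(gfp & M_GFP_RECLAIM))
        return EFFORT_NO_RECLAIM;
    if (!(gfp & M_GFP_DIRECT_RECLAIM))
        return EFFORT_KSWAPD_ONLY;
    if (gfp & M_GFP_NOFAIL)
        return EFFORT_NOFAIL;
    if (gfp & M_GFP_NORETRY)
        return EFFORT_FAIL_EARLY;
    if (gfp & M_GFP_RETRY_MAYFAIL)
        return EFFORT_RETRY_MAYFAIL;
    return EFFORT_DEFAULT;
}
```

    In this model, M_GFP_KERNEL | M_GFP_RETRY_MAYFAIL classifies as
    EFFORT_RETRY_MAYFAIL regardless of allocation order, which is the
    order-independent semantic the rename is meant to express.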

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

1 commit

  • The purpose of the code that commit 623762517e23 ("revert 'mm: vmscan:
    do not swap anon pages just because free+file is low'") reintroduces is
    to prefer swapping anonymous memory rather than thrashing the file lru.

    If the anonymous inactive lru for the set of eligible zones is
    considered low, however, or the length of the list for the given reclaim
    priority does not allow for effective anonymous-only reclaiming, then
    avoid forcing SCAN_ANON. Forcing SCAN_ANON will end up thrashing the
    small list and leave unreclaimed memory on the file lrus.

    If the inactive list is insufficient, fall back to balanced reclaim so
    the file lru doesn't remain untouched.
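
    The decision can be sketched as a pure function. The names and the
    parameter list are hypothetical, and "balanced" here simply means that
    the anon-only shortcut is not taken; the real code continues with
    further heuristics that this model leaves out.

```c
#include <stdbool.h>

enum model_scan_balance { MODEL_SCAN_BALANCED, MODEL_SCAN_ANON };

/*
 * file_plus_free_low:   file + free pages are at or below the watermark
 * inactive_anon_is_low: the inactive anon list is considered low
 * inactive_anon_pages:  length of the inactive anon list
 * priority:             reclaim priority; the list is scaled by >> priority
 */
static enum model_scan_balance
model_pick_balance(bool file_plus_free_low, bool inactive_anon_is_low,
                   unsigned long inactive_anon_pages, unsigned int priority)
{
    if (file_plus_free_low &&
        !inactive_anon_is_low &&
        (inactive_anon_pages >> priority) != 0)
        return MODEL_SCAN_ANON;     /* enough anon to reclaim effectively */

    return MODEL_SCAN_BALANCED;     /* do not thrash a small anon list */
}
```

    At priority 12, a list of 4096 pages still scales to one page and allows
    the anon-only path, while 4095 pages scale to zero and fall back.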

    [akpm@linux-foundation.org: fix build]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705011432220.137835@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 Jul, 2017

7 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • If there is no compound map for a THP (Transparent Huge Page), it is
    possible that the map count of some sub-pages of the THP is 0. So it is
    better to split the THP before swapping out. In this way, the sub-pages
    not mapped will be freed, and we can avoid the unnecessary swap out
    operations for these sub-pages.

    Link: http://lkml.kernel.org/r/20170515112522.32457-6-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To swap out THP (Transparent Huge Page), before splitting the THP, the
    swap cluster will be allocated and the THP will be added into the swap
    cache. But it is possible that the THP cannot be split, so that we must
    delete the THP from the swap cache and free the swap cluster. To avoid
    that, in this patch, whether the THP can be split is checked first.
    The check is inherently racy, but it is good enough for most cases.

    With the patch, the swap out throughput improves by 3.6% (from about
    4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM-simulated PMEM (persistent memory) device. To test
    sequential swap-out, the test case creates 8 processes, which
    sequentially allocate and write to anonymous pages until the RAM and
    part of the swap device are used up.

    Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov [for can_split_huge_page()]
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • add_to_swap() aims to allocate swap space (i.e., a swap slot and a
    swapcache entry). If it fails for a THP, for example due to lack of
    space or because THP swapout was attempted on HDD swap, it is more
    natural for the *caller*, rather than add_to_swap() itself, to split
    the THP and retry with the base pages.

    Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • get_swap_page() now takes a struct page and allocates swap space according
    to the page size (i.e., normal or THP), so it is cleaner to introduce
    put_swap_page() as the counterpart of get_swap_page(). It calls the
    right swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Clang with -Wunsequenced emits a warning:

    mm/vmscan.c:2961:25: error: unsequenced modification and access to 'gfp_mask' [-Wunsequenced]
    .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
    ^

    While it is not clear to me whether the initialization code violates the
    specification (6.7.8 par 19 (ISO/IEC 9899) looks like it disagrees) the
    code is quite confusing and worth cleaning up anyway. Fix this by
    reusing sc.gfp_mask rather than the updated input gfp_mask parameter.
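
    A minimal user-space sketch of the cleaned-up pattern, under stated
    assumptions: the struct, the SKETCH_GFP_FS bit and sketch_gfp_context()
    are stand-ins invented for this example (the latter loosely plays the
    role of current_gfp_context()). The point is only that the initializer
    no longer assigns to the parameter it reads, and later code uses the
    struct field.

```c
typedef unsigned int gfp_t;

#define SKETCH_GFP_FS 0x80u     /* illustrative bit, not the kernel value */

struct sketch_scan_control {
    gfp_t gfp_mask;
    int order;
};

/* Stand-in for a context filter: clears the illustrative FS bit. */
static gfp_t sketch_gfp_context(gfp_t flags)
{
    return flags & ~SKETCH_GFP_FS;
}

static gfp_t sketch_build_sc(gfp_t gfp_mask, int order)
{
    struct sketch_scan_control sc = {
        /* The parameter is only read here, never modified. */
        .gfp_mask = sketch_gfp_context(gfp_mask),
        .order = order,
    };

    /* Later users read sc.gfp_mask instead of the input parameter. */
    return sc.gfp_mask;
}
```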

    Link: http://lkml.kernel.org/r/20170510154030.10720-1-nick.desaulniers@gmail.com
    Signed-off-by: Nick Desaulniers
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Desaulniers
     

23 May, 2017

1 commit

  • To enable smp_processor_id() and might_sleep() debug checks earlier, it's
    required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING.

    Adjust the system_state check in kswapd_run() to handle the extra states.

    Tested-by: Mark Rutland
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Vlastimil Babka
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170516184736.119158930@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 May, 2017

1 commit

  • Although there was plenty of free swap and many anonymous LRU pages in
    eligible zones, an OOM happened.

    balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
    CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x21d/0x3f0
    out_of_memory+0xd8/0x390
    __alloc_pages_slowpath+0xbc1/0xc50
    __alloc_pages_nodemask+0x1a5/0x1c0
    pte_alloc_one+0x20/0x50
    __pte_alloc+0x1e/0x110
    __handle_mm_fault+0x919/0x960
    handle_mm_fault+0x77/0x120
    __do_page_fault+0x27a/0x550
    trace_do_page_fault+0x43/0x150
    do_async_page_fault+0x2c/0x90
    async_page_fault+0x28/0x30
    Mem-Info:
    active_anon:424716 inactive_anon:65314 isolated_anon:0
    active_file:52 inactive_file:46 isolated_file:0
    unevictable:0 dirty:27 writeback:0 unstable:0
    slab_reclaimable:3967 slab_unreclaimable:4125
    mapped:133 shmem:43 pagetables:1674 bounce:0
    free:4637 free_pcp:225 free_cma:0
    Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 992 992 1952
    DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 959
    Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
    DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
    Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
    378 total pagecache pages
    17 pages in swap cache
    Swap cache stats: add 17325, delete 17302, find 0/27
    Free swap = 978940kB
    Total swap = 1048572kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    12629 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned
    [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
    [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
    [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd

    Investigation showed that the page skipping in isolate_lru_pages makes
    reclaim ineffective: it easily returns zero nr_taken, so LRU shrinking
    does effectively nothing and only escalates the reclaim priority
    aggressively. Finally, OOM happens.

    The problem is that get_scan_count determines nr_to_scan from the
    eligible zones only, so even when priority drops to zero, reclaim
    cannot free any pages if the LRU contains mostly ineligible pages.

    get_scan_count:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    size = size >> sc->priority;

    Assume sc->priority is 0 and the LRU list is as follows.

    N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

    (I.e., a small number of eligible pages are at the head of the LRU and
    almost all the others are ineligible pages)

    In that case, size becomes 4, so the VM wants to scan 4 pages, but the 4
    pages at the tail of the LRU are not eligible. If skipped pages are
    counted as scanned, nothing is reclaimed after scanning those 4 pages,
    and the end result is an OOM.

    This patch makes isolate_lru_pages keep scanning until it encounters
    pages from eligible zones.
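
    The effect can be reproduced with a small user-space model. This is a
    sketch only: model_isolate() and its flag are invented for the
    example and are far simpler than isolate_lru_pages(). The LRU is an
    array ordered head to tail, scanning starts at the tail, and the flag
    selects whether ineligible (skipped) pages use up the scan budget.

```c
#include <stddef.h>

/*
 * eligible[i] != 0 means page i sits in a zone usable for this reclaim.
 * Index 0 is the LRU head; scanning starts from the tail (last index).
 * Returns the number of pages taken.
 */
static size_t model_isolate(const int *eligible, size_t nr_pages,
                            size_t nr_to_scan, int skipped_count_as_scanned)
{
    size_t scanned = 0, taken = 0;
    size_t i = nr_pages;

    while (i > 0 && scanned < nr_to_scan) {
        i--;
        if (!eligible[i]) {
            /* Old behaviour: a skipped page still uses up scan budget. */
            if (skipped_count_as_scanned)
                scanned++;
            continue;
        }
        scanned++;
        taken++;
    }
    return taken;
}
```

    With the 20-page list above (4 eligible pages at the head, 16
    ineligible pages behind them) and a budget of 4, the model takes no
    pages when skipped pages are counted and 4 pages when they are not.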

    [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
    Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
    Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

09 May, 2017

1 commit

  • The previous patch ("mm: prevent potential recursive reclaim due to
    clearing PF_MEMALLOC") has shown that simply setting and clearing
    PF_MEMALLOC in current->flags can result in wrongly clearing a
    pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
    Let's introduce helpers that support proper nesting by saving the
    previous state of the flag, similar to the existing memalloc_noio_* and
    memalloc_nofs_* helpers. Convert existing setting/clearing of
    PF_MEMALLOC within mm to the new helpers.

    There are no known issues with the converted code, but the change makes
    it more robust.
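
    The nesting idea can be sketched in user space as follows. The names
    and the bit value are assumptions made for illustration
    (nest_task_flags stands in for current->flags); this is not the
    kernel helper itself. Save returns the previous state of the bit and
    restore puts exactly that state back, so an inner section cannot
    clear a flag that an outer section set.

```c
#define NEST_PF_MEMALLOC 0x0800u    /* illustrative bit value */

static unsigned int nest_task_flags;    /* stand-in for current->flags */

/* Set the flag and return whether it was already set. */
static unsigned int nest_memalloc_save(void)
{
    unsigned int old = nest_task_flags & NEST_PF_MEMALLOC;

    nest_task_flags |= NEST_PF_MEMALLOC;
    return old;
}

/* Restore the flag to the state recorded by the matching save. */
static void nest_memalloc_restore(unsigned int old)
{
    nest_task_flags = (nest_task_flags & ~NEST_PF_MEMALLOC) | old;
}
```

    A plain "set, then clear" pair in the inner section would instead
    leave the outer section running without the flag.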

    Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrey Ryabinin
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

16 commits

  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We only ever count single events, drop the @nr parameter. Rename the
    function accordingly. Remove low-information kerneldoc.

    Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") we noticed bigger IO spikes during changes in cache access
    patterns.

    The patch in question shrunk the inactive list size to leave more room
    for the current workingset in the presence of streaming IO. However,
    workingset transitions that previously happened on the inactive list are
    now pushed out of memory and incur more refaults to complete.

    This patch disables active list protection when refaults are being
    observed. This accelerates workingset transitions, and allows more of
    the new set to establish itself from memory, without eating into the
    ability to protect the established workingset during stable periods.

    The workloads that were measurably affected for us were hit pretty bad
    by it, with refault/majfault rates doubling and tripling during cache
    transitions, and the machines sustaining half-hour periods of 100% IO
    utilization, where they'd previously have sub-minute peaks at 60-90%.

    Stateful services that handle user data tend to be more conservative
    with kernel upgrades. As a result we hit most page cache issues with
    some delay, as was the case here.

    The severity seemed to warrant a stable tag.

    Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list")
    Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: [4.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • try_to_unmap() now returns only SWAP_SUCCESS or SWAP_FAIL, so a boolean
    return type suits it. This patch makes that change.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In 2002, [1] introduced SWAP_AGAIN. At that time, try_to_unmap_one used
    spin_trylock(&mm->page_table_lock), so it was easy to hit contention and
    fail to take the lock; returning SWAP_AGAIN to keep the LRU status made
    sense.

    However, the lock is now mutex-based and we can block instead of
    skipping the pte, so there is only a small window left in which
    SWAP_AGAIN would be returned. Remove SWAP_AGAIN and just return
    SWAP_FAIL.

    [1] c48c43e, minimal rmap

    Link: http://lkml.kernel.org/r/1489555493-14659-7-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • ttu doesn't need to return SWAP_MLOCK. Instead, just return SWAP_FAIL
    because it means the page is not swappable, so it should move to another
    LRU list (active or unevictable). The putback helpers will move it to the
    right list depending on the page's LRU flag.

    Link: http://lkml.kernel.org/r/1489555493-14659-6-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If we find that a lazyfree page is dirty, try_to_unmap_one can just
    SetPageSwapBacked there, as for a PG_mlocked page, and return
    SWAP_FAIL. That is natural because the page is not swappable right
    now, so vmscan can activate it. There is no point in introducing a new
    return value SWAP_DIRTY in try_to_unmap at the moment.

    Link: http://lkml.kernel.org/r/1489555493-14659-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Hillf Danton
    Acked-by: Kirill A. Shutemov
    Cc: Anshuman Khandual
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Reviewing the code, I found that on entry to do_try_to_free_pages
    may_thrash is always clear, and that it retries shrink_zones to tap
    cgroups' reserved memory by setting may_thrash when the first
    shrink_zones pass reclaims nothing.

    However, when memcg is disabled or on the legacy hierarchy, or when no
    memcg is protected by a low limit, this retry is useless: there is no
    cgroup reserve to tap, and we have already worked hard without making
    progress. As Michal pointed out for an earlier version, we are trying
    hard to control the retry logic of the page allocator, and the current
    additional round of reclaim is just lame.

    Therefore, to avoid this unneeded retrying and make code more readable,
    we remove the may_thrash field in scan_control, instead, introduce
    memcg_low_reclaim and memcg_low_skipped, and only retry when
    memcg_low_skipped, by setting memcg_low_reclaim.
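
    The resulting control flow can be sketched with a toy model. The
    struct layouts and function names below are assumptions made for the
    example; only the two field names come from this patch. A first pass
    skips low-protected cgroups and records that it did so; a second pass
    runs only if nothing was reclaimed and something was actually
    skipped.

```c
#include <stdbool.h>

struct low_sc {
    bool memcg_low_reclaim;  /* may reclaim from low-protected cgroups */
    bool memcg_low_skipped;  /* a protected cgroup was skipped this pass */
};

/* Toy cgroup: how much it could give up, and whether it is protected. */
struct low_cg {
    unsigned long reclaimable;
    bool protected_low;
};

static unsigned long low_shrink(struct low_sc *sc,
                                const struct low_cg *cgs, int n)
{
    unsigned long reclaimed = 0;

    for (int i = 0; i < n; i++) {
        if (cgs[i].protected_low && !sc->memcg_low_reclaim) {
            sc->memcg_low_skipped = true;
            continue;
        }
        reclaimed += cgs[i].reclaimable;
    }
    return reclaimed;
}

static unsigned long low_try_to_free(const struct low_cg *cgs, int n,
                                     int *passes)
{
    struct low_sc sc = { false, false };
    unsigned long reclaimed = low_shrink(&sc, cgs, n);

    *passes = 1;
    /* Retry only when the first pass skipped protected memory. */
    if (reclaimed == 0 && sc.memcg_low_skipped) {
        sc.memcg_low_reclaim = true;
        sc.memcg_low_skipped = false;
        reclaimed = low_shrink(&sc, cgs, n);
        *passes = 2;
    }
    return reclaimed;
}
```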

    [xieyisheng1@huawei.com: remove may_thrash field, introduce mem_cgroup_reclaim]
    Link: http://lkml.kernel.org/r/1490191893-5923-1-git-send-email-ysxie@foxmail.com
    Signed-off-by: Yisheng Xie
    Acked-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Michal Hocko
    Suggested-by: Shakeel Butt
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • kswapd is woken to reclaim a node based on a failed allocation request
    from any eligible zone. Once reclaiming in balance_pgdat(), it will
    continue reclaiming until there is an eligible zone available for the
    zone it was woken for. kswapd tracks what zone it was recently woken
    for in pgdat->kswapd_classzone_idx. If it has not been woken recently,
    this zone will be 0.

    However, the decision on whether to sleep is made on
    kswapd_classzone_idx which is 0 without a recent wakeup request and that
    classzone does not account for lowmem reserves. This allows kswapd to
    sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA
    request even if a stream of allocations cannot use that zone. While
    kswapd may be woken again shortly, there are two consequences: the
    pgdat bits that control congestion are cleared prematurely, and direct
    reclaim is more likely because kswapd slept prematurely.

    This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
    invalid index) when there have been no recent wakeups. If there are no
    wakeups, it'll decide whether to sleep based on the highest possible
    zone available (MAX_NR_ZONES - 1). It then becomes critical that the
    "pgdat balanced" decisions during reclaim and when deciding to sleep are
    the same. If there is a mismatch, kswapd can stay awake continually
    trying to balance tiny zones.
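
    A compact sketch of the sentinel handling, assuming an illustrative
    CZ_MAX_NR_ZONES of 4 (the real value depends on the kernel
    configuration). Both helpers are invented for this example; the rule
    in cz_wakeup() of keeping the highest requested index is an
    assumption about how requests are merged, not something stated above.

```c
#define CZ_MAX_NR_ZONES 4   /* illustrative; also the "no request" marker */

/* Record a wakeup request against the stored classzone index. */
static int cz_wakeup(int stored_idx, int requested_idx)
{
    if (stored_idx == CZ_MAX_NR_ZONES)      /* nothing pending yet */
        return requested_idx;
    return stored_idx > requested_idx ? stored_idx : requested_idx;
}

/* Classzone index used when deciding whether kswapd may sleep. */
static int cz_sleep_classzone(int stored_idx)
{
    if (stored_idx == CZ_MAX_NR_ZONES)      /* no recent wakeup */
        return CZ_MAX_NR_ZONES - 1;         /* highest possible zone */
    return stored_idx;
}
```

    With a default of 0, the "no recent wakeup" case was indistinguishable
    from a request for the lowest zone, which is the situation described
    above.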

    simoop was used to evaluate it again. Two of the preparation patches
    regressed the workload so they are included as the second set of
    results. Otherwise this patch looks artificially excellent.

                          4.11.0-rc1               4.11.0-rc1               4.11.0-rc1
                          vanilla                  clear-v2                 keepawake-v2
    Amean p50-Read        21670074.18 (    0.00%)  19786774.76 (    8.69%)  22668332.52 (   -4.61%)
    Amean p95-Read        25456267.64 (    0.00%)  24101956.27 (    5.32%)  26738688.00 (   -5.04%)
    Amean p99-Read        29369064.73 (    0.00%)  27691872.71 (    5.71%)  30991404.52 (   -5.52%)
    Amean p50-Write           1390.30 (    0.00%)      1011.91 (   27.22%)       924.91 (   33.47%)
    Amean p95-Write         412901.57 (    0.00%)     34874.98 (   91.55%)      1362.62 (   99.67%)
    Amean p99-Write        6668722.09 (    0.00%)    575449.60 (   91.37%)     16854.04 (   99.75%)
    Amean p50-Allocation     78714.31 (    0.00%)     84246.26 (   -7.03%)     74729.74 (    5.06%)
    Amean p95-Allocation    175533.51 (    0.00%)    400058.43 ( -127.91%)    101609.74 (   42.11%)
    Amean p99-Allocation    247003.02 (    0.00%)  10905600.00 (-4315.17%)    125765.57 (   49.08%)

    With this patch on top, write and allocation latencies are massively
    improved. The read latencies are slightly impaired but it's worth
    noting that this is mostly due to the IO scheduler and not directly
    related to reclaim. The vmstats are a bit of a mix but the relevant
    ones are as follows;

                                  4.10.0-rc7      4.10.0-rc7      4.10.0-rc7
                              mmots-20170209     clear-v1r25 keepawake-v1r25
    Swap Ins                               0               0               0
    Swap Outs                              0             608               0
    Direct pages scanned             6910672         3132699         6357298
    Kswapd pages scanned            57036946        82488665        56986286
    Kswapd pages reclaimed          55993488        63474329        55939113
    Direct pages reclaimed           6905990         2964843         6352115
    Kswapd efficiency                    98%             76%             98%
    Kswapd velocity                12494.375       17597.507       12488.065
    Direct efficiency                    99%             94%             99%
    Direct velocity                 1513.835         668.306        1393.148
    Page writes by reclaim             0.000     4410243.000           0.000
    Page writes file                       0         4409635               0
    Page writes anon                       0             608               0
    Page reclaim immediate           1036792        14175203         1042571

                                  4.11.0-rc1      4.11.0-rc1      4.11.0-rc1
                                     vanilla        clear-v2    keepawake-v2
    Swap Ins                               0              12               0
    Swap Outs                              0             838               0
    Direct pages scanned             6579706         3237270         6256811
    Kswapd pages scanned            61853702        79961486        54837791
    Kswapd pages reclaimed          60768764        60755788        53849586
    Direct pages reclaimed           6579055         2987453         6256151
    Kswapd efficiency                    98%             75%             98%
    Page writes by reclaim             0.000     4389496.000           0.000
    Page writes file                       0         4388658               0
    Page writes anon                       0             838               0
    Page reclaim immediate           1073573        14473009          982507

    Swap-outs are equivalent to baseline.

    Direct reclaim is reduced but not eliminated. It's worth noting that
    there are two periods of direct reclaim for this workload. The first is
    when it switches from preparing the files for the actual test itself.
    It's a lot of file IO followed by a lot of allocs that reclaims heavily
    for a brief window. While direct reclaim is lower with clear-v2, it is
    due to kswapd scanning aggressively and trying to reclaim the world
    which is not the right thing to do. With the patches applied, there is
    still direct reclaim, but it occurs at the phase change from "creating
    work files" to starting multiple threads that allocate a lot of
    anonymous memory faster than kswapd can reclaim.

    Scanning/reclaim efficiency is restored by this patch.

    Page writes from reclaim context are back at 0 which is ideal.

    The number of pages immediately reclaimed after IO completes is
    slightly improved but it is expected this will vary slightly.

    On UMA, there is almost no change so this is not expected to be a
    universal win.

    [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
    Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
    Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Shantanu Goel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A pgdat tracks if recent reclaim encountered too many dirty, writeback
    or congested pages. The flags control whether kswapd writes pages back
    from reclaim context, tags pages for immediate reclaim when IO
    completes, whether processes block on wait_iff_congested and whether
    kswapd blocks when too many pages marked for immediate reclaim are
    encountered.

    The state is cleared in a check function with side-effects. With the
    patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the
    timing of when the bits get cleared changed. Due to the way the check
    works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA
    allocation because it does not account for lowmem reserves properly.

    For the simoop workload, kswapd is not stalling when it should due to
    the premature clearing, writing pages from reclaim context like crazy
    and generally being unhelpful.

    This patch resets the pgdat bits related to page reclaim only when
    kswapd is going to sleep. The comparison with simoop is then

                          4.11.0-rc1               4.11.0-rc1               4.11.0-rc1
                          vanilla                  fixcheck-v2              clear-v2
    Amean p50-Read        21670074.18 (    0.00%)  20464344.18 (    5.56%)  19786774.76 (    8.69%)
    Amean p95-Read        25456267.64 (    0.00%)  25721423.64 (   -1.04%)  24101956.27 (    5.32%)
    Amean p99-Read        29369064.73 (    0.00%)  30174230.76 (   -2.74%)  27691872.71 (    5.71%)
    Amean p50-Write           1390.30 (    0.00%)      1395.28 (   -0.36%)      1011.91 (   27.22%)
    Amean p95-Write         412901.57 (    0.00%)     37737.74 (   90.86%)     34874.98 (   91.55%)
    Amean p99-Write        6668722.09 (    0.00%)    666489.04 (   90.01%)    575449.60 (   91.37%)
    Amean p50-Allocation     78714.31 (    0.00%)     86286.22 (   -9.62%)     84246.26 (   -7.03%)
    Amean p95-Allocation    175533.51 (    0.00%)    351812.27 ( -100.42%)    400058.43 ( -127.91%)
    Amean p99-Allocation    247003.02 (    0.00%)   6291171.56 (-2447.00%)  10905600.00 (-4315.17%)

    Read latency is improved, write latency is mostly improved but
    allocation latency is regressed. kswapd is still reclaiming
    inefficiently, pages are being written back from writeback context and a
    host of other issues. However, given the change, it needed to be
    spelled out why the side-effect was moved.

    Link: http://lkml.kernel.org/r/20170309075657.25121-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Shantanu Goel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Reduce amount of time kswapd sleeps prematurely", v2.

    The series is unusual in that the first patch fixes one problem and
    introduces other issues that are noted in the changelog. Patch 2 makes
    a minor modification that is worth considering on its own but leaves the
    kernel in a state where it behaves badly. It's not until patch 3 that
    there is an improvement against baseline.

    This was mostly motivated by examining Chris Mason's "simoop" benchmark
    which puts the VM under similar pressure to HADOOP. It has been
    reported that the benchmark has regressed severely during the last
    number of releases. While I cannot reproduce all the same problems
    Chris experienced due to hardware limitations, there were a number of
    problems on a 2-socket machine with a single disk.

    simoop latencies
                          4.11.0-rc1               4.11.0-rc1
                          vanilla                  keepawake-v2
    Amean p50-Read        21670074.18 (    0.00%)  22668332.52 (   -4.61%)
    Amean p95-Read        25456267.64 (    0.00%)  26738688.00 (   -5.04%)
    Amean p99-Read        29369064.73 (    0.00%)  30991404.52 (   -5.52%)
    Amean p50-Write           1390.30 (    0.00%)       924.91 (   33.47%)
    Amean p95-Write         412901.57 (    0.00%)      1362.62 (   99.67%)
    Amean p99-Write        6668722.09 (    0.00%)     16854.04 (   99.75%)
    Amean p50-Allocation     78714.31 (    0.00%)     74729.74 (    5.06%)
    Amean p95-Allocation    175533.51 (    0.00%)    101609.74 (   42.11%)
    Amean p99-Allocation    247003.02 (    0.00%)    125765.57 (   49.08%)

    These are latencies. Read/write are threads reading fixed-size random
    blocks from a simulated database. The allocation latency is mmaping and
    faulting regions of memory. The p50, p95 and p99 figures report the worst
    latencies for 50%, 95% and 99% of the samples respectively.

    For example, the report indicates that while the test was running 99% of
    writes completed 99.75% faster. It's worth noting that on a UMA machine
    no difference in performance with simoop was observed, so mileage
    will vary.
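
    The percentages in brackets are consistent with a plain relative
    change against the vanilla column, where a positive value means lower
    latency. The helper below is a sketch of that arithmetic; its name is
    made up for this example.

```c
/*
 * Relative improvement of "patched" over "baseline" in percent.
 * Positive means the patched latency is lower.
 */
static double pct_improvement(double baseline, double patched)
{
    return (baseline - patched) / baseline * 100.0;
}
```

    For p99-Write, (6668722.09 - 16854.04) / 6668722.09 is about 0.9975,
    which gives the 99.75% quoted above. For p50-Read the same formula
    yields roughly -4.61%, i.e. a small regression.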

    It's noted that there is a slight impact to read latencies but it's
    mostly due to IO scheduler decisions and offset by the large reduction
    in other latencies.

    This patch (of 3):

    The check in prepare_kswapd_sleep needs to match the one in
    balance_pgdat since the latter will return as soon as any one of the
    zones in the classzone is above the watermark. This is especially
    important for higher order allocations since balance_pgdat will
    typically reset the order to zero relying on compaction to create the
    higher order pages. Without this patch, prepare_kswapd_sleep fails to
    wake up kcompactd since the zone balance check fails.
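
    The mismatch boils down to "any zone" versus "all zones". The two
    helpers below are a simplified sketch with invented names; real
    balance checks consult watermarks and orders rather than a precomputed
    boolean per zone.

```c
#include <stdbool.h>

/* balance_pgdat()-style check: one balanced zone is enough. */
static bool zb_any_balanced(const bool *balanced, int classzone_idx)
{
    for (int i = 0; i <= classzone_idx; i++)
        if (balanced[i])
            return true;
    return false;
}

/* Pre-patch sleep check: every zone up to classzone_idx must be balanced. */
static bool zb_all_balanced(const bool *balanced, int classzone_idx)
{
    for (int i = 0; i <= classzone_idx; i++)
        if (!balanced[i])
            return false;
    return true;
}
```

    With only one of three zones balanced, the first check reports
    success and reclaim stops, while the second keeps refusing to let
    kswapd go to sleep.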

    It was first reported against 4.9.7 that kswapd is failing to wake up
    kcompactd due to a mismatch in the zone balance check between
    balance_pgdat() and prepare_kswapd_sleep().

    balance_pgdat() returns as soon as a single zone satisfies the
    allocation but prepare_kswapd_sleep() requires all zones to do the
    same. This causes prepare_kswapd_sleep() to never succeed except in the
    order == 0 case and consequently, wakeup_kcompactd() is never called.
    For the machine that originally motivated this patch, the state of
    compaction from /proc/vmstat looked this way after a day and a half of
    uptime:

    compact_migrate_scanned 240496
    compact_free_scanned 76238632
    compact_isolated 123472
    compact_stall 1791
    compact_fail 29
    compact_success 1762
    compact_daemon_wake 0

    After applying the patch and about 10 hours of uptime the state looks
    like this:

    compact_migrate_scanned 59927299
    compact_free_scanned 2021075136
    compact_isolated 640926
    compact_stall 4
    compact_fail 2
    compact_success 2
    compact_daemon_wake 5160

    Further notes from Mel that motivated him to pick this patch up and
    resend it;

    It was observed for the simoop workload (pressures the VM similar to
    HADOOP) that kswapd was failing to keep ahead of direct reclaim. The
    investigation noted that there was a need to rationalise kswapd
    decisions to reclaim with kswapd decisions to sleep. With this patch on
    a 2-socket box, there was a 49% reduction in direct reclaim scanning.

    However, the impact otherwise is extremely negative. Kswapd reclaim
    efficiency dropped from 98% to 76%. simoop has three latency-related
    metrics for read, write and allocation (an anonymous mmap and fault).

                          4.11.0-rc1               4.11.0-rc1
                          vanilla                  fixcheck-v2
    Amean p50-Read        21670074.18 (    0.00%)  20464344.18 (    5.56%)
    Amean p95-Read        25456267.64 (    0.00%)  25721423.64 (   -1.04%)
    Amean p99-Read        29369064.73 (    0.00%)  30174230.76 (   -2.74%)
    Amean p50-Write           1390.30 (    0.00%)      1395.28 (   -0.36%)
    Amean p95-Write         412901.57 (    0.00%)     37737.74 (   90.86%)
    Amean p99-Write        6668722.09 (    0.00%)    666489.04 (   90.01%)
    Amean p50-Allocation     78714.31 (    0.00%)     86286.22 (   -9.62%)
    Amean p95-Allocation    175533.51 (    0.00%)    351812.27 ( -100.42%)
    Amean p99-Allocation    247003.02 (    0.00%)   6291171.56 (-2447.00%)

    Of greater concern is that the patch causes swapping, and page writes
    from kswapd context rose from 0 pages to 4189753 pages during the hour
    the workload ran for. By and large, the patch has very bad behaviour,
    but this is easily missed as the impact on a UMA machine is negligible.

    This patch is included with the data in case a bisection leads to this
    area. This patch is also a pre-requisite for the rest of the series.

    Link: http://lkml.kernel.org/r/20170309075657.25121-2-mgorman@techsingularity.net
    Signed-off-by: Shantanu Goel
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shantanu Goel
     
  • GFP_NOFS context is used for the following 5 reasons currently:

    - to prevent deadlocks when the lock held by the allocation
    context would be needed during the memory reclaim

    - to prevent stack overflows during the reclaim because the
    allocation is performed from a deep context already

    - to prevent lockups when the allocation context depends on other
    reclaimers to make forward progress indirectly

    - just in case because this would be safe from the fs POV

    - silence lockdep false positives

    Unfortunately, overuse of this allocation context brings some problems to
    the MM. Memory reclaim is much weaker (especially during heavy FS
    metadata workloads), and the OOM killer cannot be invoked because the MM
    layer doesn't have enough information about how much memory is freeable
    by the FS layer.

    In many cases it is far from clear why the weaker context is even used
    and so it might be used unnecessarily. We would like to get rid of
    those as much as possible. One way to do that is to use the flag in
    scopes rather than isolated cases. Such a scope is declared when really
    necessary, tracked per task and all the allocation requests from within
    the context will simply inherit the GFP_NOFS semantic.

    Not only is this easier to understand and maintain, because there are
    far fewer problematic contexts than specific allocation requests, it
    also helps code paths where the FS layer interacts with other layers
    (e.g. crypto, security modules, MM etc.) and there is no easy way to
    convey the allocation context between the layers.

    Introduce the memalloc_nofs_{save,restore} API to control the scope of
    the GFP_NOFS allocation context. This basically copies the
    memalloc_noio_{save,restore} API we have for the other restricted
    allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
    and it is just an alias for PF_FSTRANS, which has been xfs specific
    until recently. There are no PF_FSTRANS users anymore so let's just
    drop it.

    PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
    implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
    memalloc_noio_flags is renamed to current_gfp_context because it now
    cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
    code paths preserve their semantic. kmem_flags_convert() doesn't need
    to evaluate the flag anymore.

    This patch shouldn't introduce any functional changes.

    Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
    usage as much as possible and only use properly documented
    memalloc_nofs_{save,restore} checkpoints where they are appropriate.
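
    The scope semantics described above can be sketched in userspace C. This
    is an illustrative model only, not the kernel code: the MODEL_* constants
    have made-up values, task_flags stands in for the per-task flags word,
    and the model_* functions are hypothetical names that mirror
    memalloc_nofs_save(), memalloc_nofs_restore() and current_gfp_context().

```c
#include <assert.h>

/* Illustrative bit values only; the real kernel constants differ. */
#define MODEL_GFP_IO   0x1u
#define MODEL_GFP_FS   0x2u
#define MODEL_PF_NOIO  0x1u
#define MODEL_PF_NOFS  0x2u

/* Stand-in for the per-task flags word that tracks the scope. */
static unsigned int task_flags;

/* Enter a NOFS scope; return the previous NOFS bit so scopes can nest. */
unsigned int model_nofs_save(void)
{
    unsigned int old = task_flags & MODEL_PF_NOFS;

    task_flags |= MODEL_PF_NOFS;
    return old;
}

/* Leave a NOFS scope by putting back the bit saved on entry. */
void model_nofs_restore(unsigned int old)
{
    assert((old & ~MODEL_PF_NOFS) == 0); /* only the NOFS bit is valid */
    task_flags = (task_flags & ~MODEL_PF_NOFS) | old;
}

/*
 * Apply the task's scope to a requested allocation mask. In this model
 * a NOIO scope also forbids FS recursion; a NOFS scope only drops FS.
 */
unsigned int model_gfp_context(unsigned int gfp)
{
    if (task_flags & MODEL_PF_NOIO)
        gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
    else if (task_flags & MODEL_PF_NOFS)
        gfp &= ~MODEL_GFP_FS;
    return gfp;
}
```

    Returning the old bit from the save call is what lets an inner scope end
    without cancelling an outer one: restoring the inner value leaves the
    NOFS bit set until the outermost restore runs.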

    [akpm@linux-foundation.org: fix comment typo, reflow comment]
    Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages are
    not dirty in pte, the pages could be freed immediately. Otherwise we
    can't reclaim them. We put the pages back to the anonymous LRU list (by
    setting the SwapBacked flag) and the pages will be reclaimed in the
    normal swapout way.

    We use the normal page reclaim policy. Since MADV_FREE pages are put
    into the inactive file list, such pages and inactive file pages are
    reclaimed according to their age. This is expected, because we don't
    want to reclaim too many MADV_FREE pages before used-once pages.

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Patch series "mm: fix some MADV_FREE issues", v5.

    We are trying to use MADV_FREE in jemalloc. Several issues are found.
    Without solving the issues, jemalloc can't use the MADV_FREE feature.

    - Doesn't support systems without swap enabled. If swap is off, we
    can't age anonymous pages, or can't age them efficiently. And since
    MADV_FREE pages are mixed with other anonymous pages, we can't
    reclaim MADV_FREE pages. In the current implementation, MADV_FREE will
    fall back to MADV_DONTNEED without swap enabled. But in our
    environment, a lot of machines don't enable swap. This will prevent
    our setup from using MADV_FREE.

    - Increases memory pressure. Page reclaim is biased towards reclaiming
    file pages rather than anonymous pages. This doesn't make sense for
    MADV_FREE pages, because those pages could be freed easily and
    refilled with a very slight penalty. Even if page reclaim didn't
    favour file pages, there would still be an issue, because MADV_FREE
    pages and other anonymous pages are mixed together. To reclaim a
    MADV_FREE page, we probably must scan a lot of other anonymous pages,
    which is inefficient. In our tests, we usually see OOMs with MADV_FREE
    enabled and none without it.

    - Accounting. There are two accounting problems. We don't have global
    accounting. If the system is abnormal, we don't know if it's a
    problem from the MADV_FREE side. The other problem is RSS accounting.
    MADV_FREE pages are accounted as normal anon pages and reclaimed
    lazily, so an application's RSS becomes bigger. This confuses our
    workloads. We have a monitoring daemon running and if it finds that
    an application's RSS has become abnormal, the daemon will kill the
    application even though the kernel can reclaim the memory easily.

    To address the first two issues, we can either put MADV_FREE pages
    into a separate LRU list (Minchan's previous patches and V1 patches), or
    put them into the LRU_INACTIVE_FILE list (suggested by Johannes). The
    patchset uses the second idea. The reason is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages. So we
    can still efficiently reclaim MADV_FREE pages there without interference
    from other anon and active file pages. Putting the pages into the
    inactive file list also has the advantage of allowing page reclaim to
    prioritize MADV_FREE pages and used-once file pages. MADV_FREE pages are
    put into the LRU list with the SwapBacked flag cleared, so
    PageAnon(page) && !PageSwapBacked(page) indicates a MADV_FREE page.
    These pages will be freed directly without pageout if they are clean;
    otherwise normal swap will reclaim them.
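
    The identification rule and the clean/dirty decision above can be
    sketched as follows. This is a simplified userspace model under stated
    assumptions, not the kernel implementation: struct model_page and the
    model_* names are hypothetical, and the three booleans stand in for
    PageAnon(), PageSwapBacked() and the dirty state of the pte.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the kernel's struct page. */
struct model_page {
    bool anon;        /* models PageAnon(page) */
    bool swap_backed; /* models PageSwapBacked(page) */
    bool dirty;       /* written to by userspace after MADV_FREE */
};

enum model_action {
    MODEL_FREE_NOW,       /* drop the page without pageout */
    MODEL_NORMAL_RECLAIM  /* leave it to the regular swapout path */
};

/* A lazily freeable page is an anonymous page with SwapBacked cleared. */
bool model_is_lazyfree(const struct model_page *p)
{
    assert(p);
    return p->anon && !p->swap_backed;
}

/* Decide what reclaim does with one page in this model. */
enum model_action model_reclaim(struct model_page *p)
{
    assert(p);
    if (model_is_lazyfree(p) && !p->dirty)
        return MODEL_FREE_NOW;

    /* A redirtied lazyfree page becomes a regular anonymous page again. */
    if (model_is_lazyfree(p))
        p->swap_backed = true;
    return MODEL_NORMAL_RECLAIM;
}
```

    Setting swap_backed on the dirty path models putting the page back on
    the anonymous LRU list, so it no longer matches the lazyfree test.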

    For the third issue, the previous post adds global accounting and a
    separate RSS count for MADV_FREE pages. The problem is we never get
    accurate accounting for MADV_FREE pages. The pages are mapped to
    userspace and can be dirtied without the kernel noticing. To get
    accurate accounting, we could write protect the page, but then there is
    extra page fault overhead, which people don't want to pay. Jemalloc
    guys have concerns about the inaccurate accounting, so this post drops
    the accounting patches temporarily. The info exported to
    /proc/pid/smaps for MADV_FREE pages is kept, which is the only place we
    can get accurate accounting right now.

    This patch (of 6):

    Johannes pointed out that TTU_LZFREE is unnecessary. It's true because
    we always have the flag set if we want to do an unmap. For cases where
    we don't do an unmap, the TTU_LZFREE part of the code should never run.

    Also, TTU_UNMAP is unnecessary. If no other flags are set (for example,
    TTU_MIGRATION), an unmap is implied.

    The patch includes Johannes's cleanup and the removal of the dead
    TTU_ACTION macro.

    Link: http://lkml.kernel.org/r/4be3ea1bc56b26fd98a54d0a6f70bec63f6d8980.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This reverts commit d7f05528eedb047efe2288cff777676b028747b6.

    Now that reclaimability of a node is no longer based on the ratio
    between pages scanned and theoretically reclaimable pages, we can remove
    accounting tricks for pages skipped due to zone constraints.

    Link: http://lkml.kernel.org/r/20170228214007.5621-9-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner