07 Nov, 2019

2 commits

  • pagetypeinfo_showfree_print is called with zone->lock held in irq mode.
    This is not really nice because it blocks both interrupts on that CPU
    and the page allocator. On large machines this might even trigger the
    hard lockup detector.

    Considering that pagetypeinfo is a debugging tool, we do not really need
    exact numbers here. The primary reason to look at the output is to see
    how pageblocks are spread among the different migratetypes, and a low
    number of pages is what is interesting, so putting a bound on the number
    of pages walked on each free_list sounds like a reasonable tradeoff.

    The new output will simply show
    [...]
    Node 6, zone Normal, type Movable >100000 >100000 >100000 >100000 41019 31560 23996 10054 3229 983 648

    instead of
    Node 6, zone Normal, type Movable 399568 294127 221558 102119 41019 31560 23996 10054 3229 983 648

    The limit has been chosen arbitrarily and is subject to future change
    should the need arise.

    While we are at it, also drop the zone lock after each free_list
    iteration, which helps IRQ and page allocator responsiveness even
    further, as the irq-disabled lock hold time is always bounded to those
    100k pages.

    [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
    Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Reviewed-by: Waiman Long
    Acked-by: Vlastimil Babka
    Acked-by: David Hildenbrand
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: Jann Horn
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Roman Gushchin
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
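
    A minimal userspace sketch (not part of the patch) of how a consumer of
    /proc/pagetypeinfo could cope with the new capped values, treating a
    leading '>' as "at least this many pages":

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read /proc/pagetypeinfo and normalize counts such as ">100000":
     * strip the '>' and mark the value as a lower bound with a '+'. */
    int main(void)
    {
        FILE *f = fopen("/proc/pagetypeinfo", "r");
        char line[1024];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            char *tok = strtok(line, " \t\n");

            while (tok) {
                int capped = (tok[0] == '>');
                char *start = capped ? tok + 1 : tok;
                char *end;
                unsigned long val = strtoul(start, &end, 10);

                if (end != start && *end == '\0')
                    printf("%lu%s ", val, capped ? "+" : "");
                else
                    printf("%s ", tok);
                tok = strtok(NULL, " \t\n");
            }
            putchar('\n');
        }
        fclose(f);
        return 0;
    }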
     
  • /proc/pagetypeinfo is a debugging tool to examine internal page
    allocator state with respect to fragmentation. It is not very useful for
    anything else, so normal users really do not need to read this file.

    Waiman Long has noticed that reading this file can have negative side
    effects because zone->lock is necessary for gathering the data, and that
    a) interferes with the page allocator and its users and b) can lead to
    hard lockups on large machines which have very long free_lists.

    Reduce both issues by simply not exporting the file to regular users.

    Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: Michal Hocko
    Reported-by: Waiman Long
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Waiman Long
    Acked-by: Rafael Aquini
    Acked-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Jann Horn
    Cc: Song Liu
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
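
    A hedged sketch of the one-line kind of change described above:
    register the procfs entry with mode 0400 instead of 0444 so only root
    can read it (the seq_operations name "pagetypeinfo_op" is assumed here):

    /* in mm/vmstat.c init code: readable by root only */
    proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);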
     

25 Sep, 2019

1 commit

  • In preparation for non-shmem THP, this patch adds a few stats and exposes
    them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
    /proc/<pid>/task/<tid>/smaps.

    This patch is mostly a rewrite of Kirill A. Shutemov's earlier version:
    https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/

    Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
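
    The exact names of the new counters are not quoted in the entry above;
    as an illustration only, a small userspace reader that dumps every
    /proc/meminfo line mentioning huge pages (where the file-backed THP
    counters appear on kernels carrying this series) might look like this:

    #include <stdio.h>
    #include <string.h>

    /* Print every /proc/meminfo line whose key mentions "Huge". */
    int main(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[256];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            if (strstr(line, "Huge"))
                fputs(line, stdout);
        fclose(f);
        return 0;
    }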
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
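
    For reference, the identifier added by this treewide change is a single
    comment at the very top of each affected C file:

    // SPDX-License-Identifier: GPL-2.0-only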
     

20 Apr, 2019

1 commit

  • Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    depends on skipping vmstat entries with empty name introduced in
    7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
    /proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
    semantics of nr_indirectly_reclaimable_bytes").

    So skipping no longer works and /proc/vmstat has misformatted lines " 0".

    This patch simply shows debug counters "nr_tlb_remote_*" for UP.

    Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
    Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Vlastimil Babka
    Cc: Roman Gushchin
    Cc: Jann Horn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
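
    A small userspace check (illustrative, not part of the fix) for the
    symptom described above - /proc/vmstat lines that carry a value but no
    counter name:

    #include <stdio.h>
    #include <ctype.h>

    /* Flag /proc/vmstat lines that start with whitespace or a digit,
     * i.e. lines whose counter name is missing (the " 0" symptom). */
    int main(void)
    {
        FILE *f = fopen("/proc/vmstat", "r");
        char line[256];
        int bad = 0;

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            if (isspace((unsigned char)line[0]) ||
                isdigit((unsigned char)line[0])) {
                printf("misformatted: %s", line);
                bad = 1;
            }
        fclose(f);
        return bad;
    }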
     

06 Mar, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
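
    A hedged sketch of the pattern this cleanup converges on: create the
    debugfs entries and carry on regardless of the return value (the
    extfrag file and ops names below only mirror mm/vmstat.c loosely and
    are assumptions here):

    static int __init extfrag_debug_init(void)
    {
        struct dentry *d;

        /* No error checking: if debugfs is unavailable the files are
         * simply absent, and no code path depends on their presence. */
        d = debugfs_create_dir("extfrag", NULL);
        debugfs_create_file("unusable_index", 0444, d, NULL, &unusable_fops);
        debugfs_create_file("extfrag_index", 0444, d, NULL, &extfrag_fops);
        return 0;
    }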
     

29 Dec, 2018

1 commit

  • totalram_pages, zone->managed_pages and totalhigh_pages updates are
    protected by managed_page_count_lock, but readers never care about it.
    Convert these variables to atomic to avoid readers potentially seeing a
    store tear.

    This patch converts zone->managed_pages. Subsequent patches will convert
    totalram_pages, totalhigh_pages, and eventually managed_page_count_lock
    will be removed.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with the
    prevention of potential store-to-read tearing as a bonus.

    Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
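
    A hedged sketch of the conversion pattern: the field becomes an
    atomic_long_t and readers go through a small accessor, so a torn plain
    load can no longer be observed (zone_managed_pages() follows the series;
    the writer helper name below is purely illustrative):

    /* zone->managed_pages as an atomic_long_t; readers use the helper
     * instead of a plain load, writers use atomic_long_add()/set(). */
    static inline unsigned long zone_managed_pages(struct zone *zone)
    {
        return (unsigned long)atomic_long_read(&zone->managed_pages);
    }

    /* illustrative writer-side helper */
    static inline void zone_add_managed_pages(struct zone *zone, long count)
    {
        atomic_long_add(count, &zone->managed_pages);
    }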
     

19 Nov, 2018

1 commit

  • Scan through the whole array to see if an update is needed. While we're
    at it, use sizeof() to be safe against any possible type changes in the
    future.

    The bug here is that we wouldn't sync per-cpu counters into global ones
    if there was an update of numa_stats for higher cpus. This is highly
    theoretical though, because it is much more probable that zone_stats are
    updated, so we would refresh anyway. So I wouldn't bother marking this
    for stable, but it is a nice thing to fix.

    [mhocko@suse.com: changelog enhancement]
    Link: http://lkml.kernel.org/r/1541601517-17282-1-git-send-email-janne.huttunen@nokia.com
    Fixes: 1d90ca897cb0 ("mm: update NUMA counter threshold size")
    Signed-off-by: Janne Huttunen
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Janne Huttunen
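
    A hedged sketch of the check described above - scan the whole per-cpu
    diff arrays and size the scan with sizeof() of the element rather than
    a hard-coded type (the field names follow mm/vmstat.c of that era and
    are assumptions here):

    /* Is there any per-cpu delta that still needs folding into the global
     * counters? Scan both arrays in full, using the element size so a
     * future type change cannot silently truncate the scan. */
    static bool pcp_needs_update(struct per_cpu_pageset *p)
    {
        if (memchr_inv(p->vm_stat_diff, 0,
                       NR_VM_ZONE_STAT_ITEMS * sizeof(p->vm_stat_diff[0])))
            return true;
    #ifdef CONFIG_NUMA
        if (memchr_inv(p->vm_numa_stat_diff, 0,
                       NR_VM_NUMA_STAT_ITEMS * sizeof(p->vm_numa_stat_diff[0])))
            return true;
    #endif
        return false;
    }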
     

27 Oct, 2018

4 commits

  • Having two gigantic arrays that must manually be kept in sync, including
    ifdefs, isn't exactly robust. To make it easier to catch such issues in
    the future, add a BUILD_BUG_ON().

    Link: http://lkml.kernel.org/r/20181001143138.95119-3-jannh@google.com
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
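
    A hedged sketch of what such a compile-time check looks like - it ties
    the size of the name array to the sum of the item counts it is supposed
    to describe (the exact expression used by the patch may differ):

    /* If someone adds a counter without adding its name (or vice versa),
     * the build fails instead of /proc/vmstat silently misreporting. */
    BUILD_BUG_ON(ARRAY_SIZE(vmstat_text) !=
                 NR_VM_ZONE_STAT_ITEMS + NR_VM_NUMA_STAT_ITEMS +
                 NR_VM_NODE_STAT_ITEMS + NR_VM_WRITEBACK_STAT_ITEMS +
                 (IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? NR_VM_EVENT_ITEMS : 0));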
     
  • Make it easier to catch bugs in the shadow node shrinker by adding a
    counter for the shadow nodes in circulation.

    [akpm@linux-foundation.org: assert that irqs are disabled, for __inc_lruvec_page_state()]
    [akpm@linux-foundation.org: s/WARN_ON_ONCE/VM_WARN_ON_ONCE/, per Johannes]
    Link: http://lkml.kernel.org/r/20181009184732.762-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Refaults happen during transitions between workingsets as well as in-place
    thrashing. Knowing the difference between the two has a range of
    applications, including measuring the impact of memory shortage on the
    system performance, as well as the ability to smarter balance pressure
    between the filesystem cache and the swap-backed workingset.

    During workingset transitions, inactive cache refaults and pushes out
    established active cache. When that active cache isn't stale, however,
    and also ends up refaulting, that's bonafide thrashing.

    Introduce a new page flag that tells on eviction whether the page has been
    active or not in its lifetime. This bit is then stored in the shadow
    entry, to classify refaults as transitioning or thrashing.

    How many page->flags does this leave us with on 32-bit?

    20 bits are always page flags

    21 if you have an MMU

    23 with the zone bits for DMA, Normal, HighMem, Movable

    29 with the sparsemem section bits

    30 if PAE is enabled

    31 with this patch.

    So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes. If
    that's not enough, the system can switch to discontigmem and re-gain the 6
    or 7 sparsemem section bits.

    Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Drake
    Tested-by: Suren Baghdasaryan
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
    commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
    the goal of accounting objects that can be reclaimed, but cannot be
    allocated via a SLAB_RECLAIM_ACCOUNT cache. This is now possible via
    kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
    is converted.

    The counter is however still useful for accounting direct page allocations
    (i.e. not slab) with a shrinker, such as the ION page pool. So keep it,
    and:

    - change granularity to pages to be more like other counters; sub-page
    allocations should be able to use kmalloc
    - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
    - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
    again remove the check for not printing "hidden" counters

    Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: Roman Gushchin
    Cc: Vijayanand Jitta
    Cc: Laura Abbott
    Cc: Sumit Semwal
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
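
    As an illustration of the intended use (a sketch, not taken from the
    patch): a driver page pool that frees its pages through a shrinker can
    account them against the new counter when pages enter and leave the
    pool:

    /* Account a high-order page entering a driver-private pool that a
     * shrinker can drain; pass added=false when it leaves again. */
    static void pool_account_page(struct page *page, unsigned int order,
                                  bool added)
    {
        long nr = added ? (1L << order) : -(1L << order);

        mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE, nr);
    }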
     

06 Oct, 2018

2 commits

  • 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
    on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
    the kernel unconditional to reduce #ifdef soup, but (either to avoid
    showing dummy zero counters to userspace, or because that code was missed)
    didn't update the vmstat_text array, meaning that all following counters would
    be shown with incorrect values.

    This only affects kernel builds with
    CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

    Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
    Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
    VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
    entry in vmstat_text. This causes an out-of-bounds access in
    vmstat_show().

    Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
    is probably very rare.

    Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
    Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

29 Jun, 2018

1 commit

  • Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do; it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    is bound to that CPU. The name of the worker was "kworker/1:1", so it
    should have been a worker bound to CPU1. A worker which can run on any
    CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU, which is the case here. If it could
    run on an arbitrary CPU, then that is the real problem and the one we
    should seek to resolve.

    It is not only this smp_processor_id() that must not be migrated to
    another CPU, but also refresh_cpu_vm_stats(), which might otherwise
    access the wrong per-CPU variables. Not to mention that other code
    relies on the fact that such a worker runs on one specific CPU only.

    Therefore revert that commit; instead we should look into what broke the
    affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

16 May, 2018

1 commit


12 May, 2018

1 commit

  • Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
    no need to export this vm counter to userspace, and some changes are
    expected in reclaimable object accounting, which can alter this counter.

    Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

12 Apr, 2018

1 commit

  • Patch series "indirectly reclaimable memory", v2.

    This patchset introduces the concept of indirectly reclaimable memory
    and applies it to fix the issue where a big number of dentries with
    external names can significantly affect the MemAvailable value.

    This patch (of 3):

    Introduce the concept of indirectly reclaimable memory and add the
    corresponding memory counter and /proc/vmstat item.

    Indirectly reclaimable memory is any sort of memory used by the kernel
    (except for reclaimable slabs) which is actually reclaimable, i.e. will
    be released under memory pressure.

    The counter is in bytes, as it's not always possible to count such
    objects in pages. The name contains BYTES by analogy to
    NR_KERNEL_STACK_KB.

    Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

29 Mar, 2018

1 commit

  • Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id().

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Hill
     

16 Nov, 2017

2 commits

  • This is the second step, which introduces a tunable interface that
    allows NUMA stats to be configured, for optimizing zone_statistics(), as
    suggested by Dave Hansen and Ying Huang.

    =========================================================================

    When page allocation performance becomes a bottleneck and you can
    tolerate some possible tool breakage and decreased numa counter
    precision, you can do:

    echo 0 > /proc/sys/vm/numa_stat

    In this case, NUMA counter updates are ignored. We can see about a
    *4.8%* (185 -> 176) drop in cpu cycles per single page allocation and
    reclaim on Jesper's page_bench01 (single thread), and an *8.1%*
    (343 -> 315) drop in cpu cycles per single page allocation and reclaim
    on Jesper's page_bench03 (88 threads), running on a 2-Socket
    Broadwell-based server (88 threads, 126G memory).

    Benchmark link provided by Jesper D Brouer (increase loop times to
    10000000):

    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    =========================================================================

    When page allocation performance is not a bottleneck and you want all
    tooling to work, you can do:

    echo 1 > /proc/sys/vm/numa_stat

    This is the system default setting.

    Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
    for comments to help improve the original patch.

    [keescook@chromium.org: make sure mutex is a global static]
    Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
    Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Signed-off-by: Kees Cook
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Luis R . Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Sebastian Andrzej Siewior
    Cc: Andrey Ryabinin
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
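
    A small userspace example of flipping the knob described above,
    equivalent to the echo commands (path taken from the entry; needs root):

    #include <stdio.h>

    /* Write "0" or "1" to /proc/sys/vm/numa_stat. */
    static int set_numa_stat(int enable)
    {
        FILE *f = fopen("/proc/sys/vm/numa_stat", "w");

        if (!f) {
            perror("fopen");
            return -1;
        }
        fprintf(f, "%d\n", enable);
        return fclose(f);
    }

    int main(void)
    {
        return set_numa_stat(0) ? 1 : 0;   /* trade precision for speed */
    }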
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

09 Sep, 2017

3 commits

  • To avoid deviation, the per-cpu portion of the NUMA stats in
    vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.

    Since NUMA stats are not read by users frequently, and the kernel does
    not need them to make decisions, it is not a problem to make the readers
    more expensive.

    Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Cc: Ying Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • There is significant overhead from cache bouncing caused by zone
    counters (NUMA associated counters) being updated in parallel in
    multi-threaded page allocation (suggested by Dave Hansen).

    This patch updates the NUMA counter threshold to a fixed size of
    MAX_U16 - 2, as a small threshold greatly increases the update frequency
    of the global counter from the local per-cpu counter (suggested by Ying
    Huang).

    The rationale is that these statistics counters don't affect the
    kernel's decisions, unlike other VM counters, so it's not a problem to
    use a large threshold.

    With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369) per
    single page allocation and reclaim on Jesper's page_bench03 benchmark.

    Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold   CPU cycles      Throughput (88 threads)
        32         799          241760478
        64         640          301628829
       125         537          358906028   system by default (base)
       256         468          412397590
       512         428          450550704
      4096         399          482520943
     20000         394          489009617
     30000         395          488017817
     65533         369 (-31.3%) 521661345 (+45.3%)   with this patchset
       N/A         342 (-36.3%) 562900157 (+56.8%)   disable zone_statistics

    Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
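
    The mechanism behind the threshold column is the usual per-cpu delta
    fold; a minimal generic sketch in plain C (not the kernel
    implementation) of why a larger threshold means fewer updates of the
    shared counter:

    #include <stdint.h>

    /* Each CPU keeps a small local delta; only when it crosses the
     * threshold is the shared (cache-bouncing) global counter touched.
     * With a threshold near U16_MAX the global update becomes rare. */
    struct pcpu_diff { uint16_t diff; };

    static long global_numa_hit;          /* shared counter */

    static void numa_hit_inc(struct pcpu_diff *local, uint16_t threshold)
    {
        if (++local->diff >= threshold) {
            global_numa_hit += local->diff;   /* fold local delta */
            local->diff = 0;
        }
    }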
     
  • Patch series "Separate NUMA statistics from zone statistics", v2.

    Each page allocation updates a set of per-zone statistics with a call to
    zone_statistics(). As discussed at the 2017 MM summit, these are a
    substantial source of overhead in the page allocator and are very rarely
    consumed. This significant overhead comes from cache bouncing caused by
    zone counters (NUMA associated counters) being updated in parallel in
    multi-threaded page allocation (pointed out by Dave Hansen).

    A link to the MM summit slides:
    http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

    To mitigate this overhead, this patchset separates NUMA statistics from
    the zone statistics framework and updates the NUMA counter threshold to
    a fixed size of MAX_U16 - 2, as a small threshold greatly increases the
    update frequency of the global counter from the local per-cpu counter
    (suggested by Ying Huang). The rationale is that these statistics
    counters don't need to be read often, unlike other VM counters, so it's
    not a problem to use a large threshold and make readers more expensive.

    With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
    below) per single page allocation and reclaim on Jesper's page_bench03
    benchmark. Meanwhile, this patchset keeps the same style of virtual
    memory statistics with little end-user-visible effect (it only moves the
    NUMA stats to be shown behind the zone page stats; see the first patch
    for details).

    I did an experiment of single page allocation and reclaim running
    concurrently, using Jesper's page_bench03 benchmark on a 2-Socket
    Broadwell-based server (88 processors with 126G memory), with different
    sizes of the pcp counter threshold.

    Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold   CPU cycles      Throughput (88 threads)
        32         799          241760478
        64         640          301628829
       125         537          358906028   system by default
       256         468          412397590
       512         428          450550704
      4096         399          482520943
     20000         394          489009617
     30000         395          488017817
     65533         369 (-31.3%) 521661345 (+45.3%)   with this patchset
       N/A         342 (-36.3%) 562900157 (+56.8%)   disable zone_statistics

    This patch (of 3):

    In this patch, NUMA statistics are separated from the zone statistics
    framework, and all the call sites of NUMA stats are changed to use
    NUMA-stats-specific functions. There is no functional change except that
    the NUMA stats are shown behind the zone page stats when users *read*
    the zone info.

    E.g. cat /proc/zoneinfo
    ***Base***                     ***With this patch***
    nr_free_pages 3976             nr_free_pages 3976
    nr_zone_inactive_anon 0        nr_zone_inactive_anon 0
    nr_zone_active_anon 0          nr_zone_active_anon 0
    nr_zone_inactive_file 0        nr_zone_inactive_file 0
    nr_zone_active_file 0          nr_zone_active_file 0
    nr_zone_unevictable 0          nr_zone_unevictable 0
    nr_zone_write_pending 0        nr_zone_write_pending 0
    nr_mlock 0                     nr_mlock 0
    nr_page_table_pages 0          nr_page_table_pages 0
    nr_kernel_stack 0              nr_kernel_stack 0
    nr_bounce 0                    nr_bounce 0
    nr_zspages 0                   nr_zspages 0
    numa_hit 0                     *nr_free_cma 0*
    numa_miss 0                    numa_hit 0
    numa_foreign 0                 numa_miss 0
    numa_interleave 0              numa_foreign 0
    numa_local 0                   numa_interleave 0
    numa_other 0                   numa_local 0
    *nr_free_cma 0*                numa_other 0
    ...                            ...
    vm stats threshold: 10         vm stats threshold: 10
    ...                            ...

    The next patch updates the numa stats counter size and threshold.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ying Huang
    Cc: Aaron Lu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     

07 Sep, 2017

6 commits

  • Patch series "mm, swap: VMA based swap readahead", v4.

    The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, consecutive blocks in the
    swap device are read ahead based on a global space locality estimation.
    But consecutive blocks in the swap device just reflect the order of page
    reclaiming and don't necessarily reflect the access pattern in virtual
    memory space. And different tasks in the system may have different
    access patterns, which makes the global space locality estimation
    incorrect.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address will be read ahead instead of the swap slots near the
    faulting swap slot in the swap device. This avoids reading ahead
    unrelated swap slots. At the same time, the swap readahead is changed to
    work per-VMA instead of globally, so that the different access patterns
    of different VMAs can be distinguished, and a different readahead policy
    can be applied accordingly. The original core readahead detection and
    scaling algorithm is reused, because it is an effective algorithm for
    detecting space locality.

    In addition to the swap readahead changes, a new sysfs interface is
    added to show the efficiency of the readahead algorithm and some other
    swap statistics.

    This new implementation will incur more small random reads. On SSD, the
    improved correctness of the estimation and readahead target should
    outweigh the potential increase in overhead; this is also illustrated in
    the test results below. But on HDD the overhead may outweigh the
    benefit, so the original implementation will be used by default.

    The test and result is as follow,

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case, 4 processes to eat 50G
    virtual memory space, repeat the sequential memory writing until 300
    seconds. The first round writing will trigger swap out, the following
    rounds will trigger sequential swap in and out.

    At the same time, run vm-scalability random swap test case in
    background, 8 processes to eat 30G virtual memory space, repeat the
    random memory write until 300 seconds. This will trigger random swap-in
    in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                        Base          Optimized
                        ----          ---------
    throughput          345413 KB/s   414029 KB/s (+19.9%)
    latency.average     97.14 us      61.06 us (-37.1%)
    latency.50th        2 us          1 us
    latency.60th        2 us          1 us
    latency.70th        98 us         2 us
    latency.80th        160 us        2 us
    latency.90th        260 us        217 us
    latency.95th        346 us        369 us
    latency.99th        1.34 ms       1.09 ms
    ra_hit%             52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                        Base          Optimized
                        ----          ---------
    elapsed_time        393.49 s      329.88 s (-16.2%)
    ra_hit%             86.21%        98.82%

    The scores of the base and optimized kernels show no visible difference.
    But the elapsed time is reduced and the readahead hit rate improved, so
    the optimized kernel runs better during the startup and tear down
    stages. And the absolute value of the readahead hit rate is high, which
    shows that the space locality is still valid in some practical
    workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interface.

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of the swap readahead could be measured, so
    that the swap readahead algorithm and parameters could be tuned
    accordingly.

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
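
    A small userspace helper for the two sysfs files introduced by this
    patch, computing the readahead hit ratio reported in the tables above:

    #include <stdio.h>

    static long read_counter(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        long hits = read_counter("/sys/kernel/mm/swap/ra_hits");
        long total = read_counter("/sys/kernel/mm/swap/ra_total");

        if (hits < 0 || total <= 0) {
            fprintf(stderr, "swap readahead stats unavailable\n");
            return 1;
        }
        printf("ra_hit%%: %.2f\n", 100.0 * hits / total);
        return 0;
    }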
     
  • Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
    pagetypeinfo_show_free()'s comment. This commit fixes it.

    Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: SeongJae Park
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • When order is -1 or too big, *1UL << order* will be 0, which will cause
    a divide error. Although it seems that all callers of
    __fragmentation_index() only call it with a valid order, this patch
    makes it more robust.

    Should prevent reoccurrences of
    https://bugzilla.kernel.org/show_bug.cgi?id=196555

    Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
    Signed-off-by: Wen Yang
    Reviewed-by: Jiang Biao
    Suggested-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
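
    A hedged sketch of the guard described above: bail out before the shift
    can produce zero and trigger the divide error (placement inside
    __fragmentation_index() is assumed):

    /* order is unsigned, so -1 shows up as a huge value; anything
     * >= MAX_ORDER would make (1UL << order) zero or undefined and the
     * later division blow up, so warn once and return early. */
    if (WARN_ON_ONCE(order >= MAX_ORDER))
        return 0;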
     
  • global_page_state is error prone as a recent bug report pointed out [1].
    It only returns proper values for zone based counters as the enum it
    gets suggests. We already have global_node_page_state so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
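
    A short illustration of the distinction after the rename (a sketch; the
    counters chosen are just examples): zone-based items go through
    global_zone_page_state(), node-based items through
    global_node_page_state():

    static void report_free_and_dirty(void)
    {
        /* zone-based counter: free pages across all zones */
        unsigned long nr_free  = global_zone_page_state(NR_FREE_PAGES);
        /* node-based counter: dirty file pages are accounted per node */
        unsigned long nr_dirty = global_node_page_state(NR_FILE_DIRTY);

        pr_info("free: %lu dirty: %lu\n", nr_free, nr_dirty);
    }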
     
  • When swapping out THP (Transparent Huge Page), instead of swapping out
    the THP as a whole, sometimes we have to fallback to split the THP into
    normal pages before swapping, because no free swap clusters are
    available, or cgroup limit is exceeded, etc. To count the number of the
    fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when
    we fallback to split the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To support delaying the split of a THP (Transparent Huge Page) until
    after it has been swapped out, we need to enhance the swap writing code
    to support writing a THP as a whole. This will improve swap write IO
    performance.

    As Ming Lei pointed out, this should be based on
    multipage bvec support, which hasn't been merged yet. So this patch is
    only for testing the functionality of the other patches in the series.
    And will be reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

11 Jul, 2017

1 commit

  • pagetypeinfo_showmixedcount_print is found to take a lot of time to
    complete, and it does this while holding the zone lock and disabling
    interrupts. In some cases it is found to take more than a second (on a
    2.4GHz, 8GB RAM, arm64 CPU).

    Avoid taking the zone lock, similar to what is done by read_page_owner,
    which means a possibility of inaccurate results.

    Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: zhongjiang
    Cc: Sergey Senozhatsky
    Cc: Sudip Mukherjee
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sebastian Andrzej Siewior
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

07 Jul, 2017

4 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Show count of oom killer invocations in /proc/vmstat and count of
    processes killed in memory cgroup in knob "memory.events" (in
    memory.oom_control for v1 cgroup).

    Also describe difference between "oom" and "oom_kill" in memory cgroup
    documentation. Currently oom in memory cgroup kills tasks iff shortage
    has happened inside page fault.

    These counters help in monitoring oom kills - until now the only way has
    been grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • pagetypeinfo_showblockcount_print skips over invalid pfns but it would
    report pages which are offline because those have a valid pfn. Their
    migrate type is misleading at best.

    Now that we have pfn_to_online_page() we can use it instead of
    pfn_valid() and fix this.

    [mhocko@suse.com: fix build]
    Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Standardize the file operation variable names related to all four memory
    management /proc interface files. Also change all the symbol
    permissions (S_IRUGO) into octal permissions (0444) as it got complaints
    from checkpatch.pl. This does not create any functional change to the
    interface.

    Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

13 May, 2017

1 commit

  • After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
    zoneinfo"), /proc/zoneinfo will show unpopulated zones.

    A memoryless node, having no populated zones at all, was previously
    ignored, but will now trigger the WARN() in is_zone_first_populated().

    Remove this warning, as its only purpose was to warn of a situation that
    has since been enabled.

    Aside: The "per-node stats" are still printed under the first populated
    zone, but that's not necessarily the first stanza any more. I'm not
    sure which criteria is more important with regard to not breaking
    parsers, but it looks a little weird to the eye.

    Fixes: e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
    Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

04 May, 2017

4 commits

  • After "mm, vmstat: print non-populated zones in zoneinfo",
    /proc/zoneinfo will show unpopulated zones.

    The per-cpu pageset statistics are not relevant for unpopulated zones
    and can be potentially lengthy, so suppress them when they are not
    interesting.

    Also move the lowmem reserve protection information above the pcp stats,
    since it is relevant for all zones per vm.lowmem_reserve_ratio.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Initscripts can use the information (protection levels) from
    /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.

    vm.lowmem_reserve_ratio is an array of ratios for each configured zone
    on the system. If a zone is not populated on an arch, /proc/zoneinfo
    suppresses its output.

    This results in there not being a 1:1 mapping between the set of zones
    emitted by /proc/zoneinfo and the zones configured by
    vm.lowmem_reserve_ratio.

    This patch shows statistics for non-populated zones in /proc/zoneinfo.
    The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
    Without this patch, it is not possible to determine which index in the
    array controls which zone if one or more zones on the system are not
    populated.

    Remaining users of walk_zones_in_node() are unchanged. Files such as
    /proc/pagetypeinfo require certain zone data to be initialized properly
    for display, which is not done for unpopulated zones.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • madvise()'s MADV_FREE indicates pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them
    once there is memory pressure. It might be unfair to always reclaim
    MADV_FREE pages before used-once file pages, but we definitely want to
    reclaim them before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages into the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages.
    Reclaiming MADV_FREE pages will therefore not interfere much with
    anonymous and active file pages. And since the inactive file pages and
    MADV_FREE pages are reclaimed according to their age, we don't reclaim
    too many MADV_FREE pages either. Putting the MADV_FREE pages into the
    LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
    support. This idea was suggested by Johannes.

    This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
    avoid bisect failure, next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • NR_PAGES_SCANNED counts the number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner