06 Oct, 2018

2 commits

  • 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even
    on UP") made the availability of the NR_TLB_REMOTE_FLUSH* counters inside
    the kernel unconditional to reduce #ifdef soup, but (either to avoid
    showing dummy zero counters to userspace, or because that code was missed)
    didn't update the vmstat_array, meaning that all following counters would
    be shown with incorrect values.

    This only affects kernel builds with
    CONFIG_VM_EVENT_COUNTERS=y && CONFIG_DEBUG_TLBFLUSH=y && CONFIG_SMP=n.

    Link: http://lkml.kernel.org/r/20181001143138.95119-2-jannh@google.com
    Fixes: 5dd0b16cdaff ("mm/vmstat: Make NR_TLB_REMOTE_FLUSH_RECEIVED available even on UP")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely") removed the
    VMACACHE_FULL_FLUSHES statistics, but didn't remove the corresponding
    entry in vmstat_text. This causes an out-of-bounds access in
    vmstat_show().

    Luckily this only affects kernels with CONFIG_DEBUG_VM_VMACACHE=y, which
    is probably very rare. (A minimal illustration of this counter-enum /
    vmstat_text mismatch follows the entries for this date.)

    Link: http://lkml.kernel.org/r/20181001143138.95119-1-jannh@google.com
    Fixes: 7a9cdebdcc17 ("mm: get rid of vmacache_flush_all() entirely")
    Signed-off-by: Jann Horn
    Reviewed-by: Kees Cook
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Kemi Wang
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
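
    Both entries above are instances of the same failure mode: the counter
    enum and the vmstat_text[] name table must stay in sync, index for
    index. The following minimal userspace sketch (made-up names, not the
    kernel code) shows what happens when an item is dropped from the enum
    but its string is left in the table -- every later name is paired with
    the wrong value, and the last lookup would run past the end of the
    value array, which is exactly the out-of-bounds read in vmstat_show():

    #include <stdio.h>

    /* Hypothetical counters; imagine COUNTER_C was just removed. */
    enum counter { COUNTER_A, COUNTER_B, COUNTER_D, NR_COUNTERS };

    /* The name table still lists the removed item, so it has one entry
     * more than NR_COUNTERS. */
    static const char *const counter_names[] = {
            "counter_a", "counter_b", "counter_c_removed", "counter_d",
    };

    int main(void)
    {
            unsigned long values[NR_COUNTERS] = { 10, 20, 40 };
            int nr_names = sizeof(counter_names) / sizeof(counter_names[0]);

            /* Mimics vmstat_show(): walk the name table and index the value
             * array with the same index.  "counter_c_removed" prints the
             * value that belongs to counter_d, and "counter_d" would index
             * values[3], one past the end (guarded here, unguarded in the
             * real bug). */
            for (int i = 0; i < nr_names; i++)
                    printf("%s %lu\n", counter_names[i],
                           i < NR_COUNTERS ? values[i] : 0UL);
            return 0;
    }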
     

29 Jun, 2018

1 commit

  • Revert commit c7f26ccfb2c3 ("mm/vmstat.c: fix vmstat_update() preemption
    BUG"). Steven saw a "using smp_processor_id() in preemptible" message
    and added a preempt_disable() section around it to keep it quiet. This
    is not the right thing to do and it does not fix the real problem.

    vmstat_update() is invoked by a kworker on a specific CPU. This worker
    is bound to that CPU. The name of the worker was "kworker/1:1", so it
    should have been a worker bound to CPU1. A worker which can run on any
    CPU would have a `u' before the first digit.

    smp_processor_id() can be used in a preempt-enabled region as long as
    the task is bound to a single CPU, which is the case here (a sketch of
    that pattern follows this entry). If it could run on an arbitrary CPU,
    then *that* is the problem we have and should seek to resolve.

    It is not only this smp_processor_id() call that must not be migrated
    to another CPU: refresh_cpu_vm_stats() might also access the wrong
    per-CPU variables. Not to mention that other code relies on the fact
    that such a worker runs on one specific CPU only.

    Therefore revert that commit; instead, we should look into what broke
    the affinity mask of the kworker.

    Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Steven J. Hill
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
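
    A minimal kernel-context sketch of the pattern this entry relies on
    (names here are made up, not the real vmstat functions): work queued
    with queue_delayed_work_on() runs on a worker bound to one specific
    CPU, so the callback may call smp_processor_id() without disabling
    preemption and without triggering the DEBUG_PREEMPT check:

    #include <linux/cpumask.h>
    #include <linux/jiffies.h>
    #include <linux/percpu.h>
    #include <linux/printk.h>
    #include <linux/smp.h>
    #include <linux/workqueue.h>

    static DEFINE_PER_CPU(struct delayed_work, my_refresh_work);

    static void my_refresh(struct work_struct *w)
    {
            /* No preempt_disable() needed: this item was queued with
             * queue_delayed_work_on(), so it executes on a worker bound to
             * a single CPU ("kworker/N:M", no `u' prefix) and cannot be
             * migrated while it touches that CPU's data. */
            pr_info("refreshing stats on CPU %d\n", smp_processor_id());
    }

    static void my_start_refresh(void)
    {
            int cpu;

            for_each_online_cpu(cpu) {
                    struct delayed_work *dw = &per_cpu(my_refresh_work, cpu);

                    INIT_DEFERRABLE_WORK(dw, my_refresh);
                    queue_delayed_work_on(cpu, system_wq, dw, HZ);
            }
    }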
     

12 May, 2018

1 commit

  • Don't show nr_indirectly_reclaimable in /proc/vmstat, because there is
    no need to export this vm counter to userspace, and some changes are
    expected in reclaimable object accounting, which can alter this counter.

    Link: http://lkml.kernel.org/r/20180425191422.9159-1-guro@fb.com
    Signed-off-by: Roman Gushchin
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

12 Apr, 2018

1 commit

  • Patch series "indirectly reclaimable memory", v2.

    This patchset introduces the concept of indirectly reclaimable memory
    and applies it to fix an issue: a large number of dentries with
    external names can significantly distort the MemAvailable value.

    This patch (of 3):

    Introduce the concept of indirectly reclaimable memory and add the
    corresponding memory counter and /proc/vmstat item.

    Indirectly reclaimable memory is any sort of memory used by the kernel
    (except for reclaimable slabs) which is actually reclaimable, i.e. will
    be released under memory pressure.

    The counter is in bytes, as it's not always possible to count such
    objects in pages. The name contains BYTES by analogy to
    NR_KERNEL_STACK_KB.

    Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
    Signed-off-by: Roman Gushchin
    Reviewed-by: Andrew Morton
    Cc: Alexander Viro
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

29 Mar, 2018

1 commit

  • Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id(). The fix wraps that region in a
    preempt_disable()/preempt_enable() pair to quiet the check; a sketch of
    the pattern follows this entry.

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Hill
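
    The (later reverted, see the 29 Jun 2018 entry above) fix follows the
    usual pattern for silencing this check; a condensed, hypothetical
    version of that pattern looks like this:

    #include <linux/preempt.h>
    #include <linux/smp.h>

    static void my_vmstat_update(void)
    {
            int cpu;

            preempt_disable();              /* pins the task to this CPU */
            cpu = smp_processor_id();       /* no DEBUG_PREEMPT splat now */
            /* ... refresh this CPU's counters, reschedule the work ... */
            (void)cpu;
            preempt_enable();
    }

    As the revert above explains, this only hides the symptom: the worker
    was supposed to be CPU-bound already, and the real question is why its
    affinity mask was broken.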
     

16 Nov, 2017

2 commits

  • This is the second step; it introduces a tunable interface that allows
    the NUMA stats to be configured, for optimizing zone_statistics(), as
    suggested by Dave Hansen and Ying Huang.

    =========================================================================

    When page allocation performance becomes a bottleneck and you can
    tolerate some possible tool breakage and decreased numa counter
    precision, you can do:

    echo 0 > /proc/sys/vm/numa_stat

    In this case, NUMA counter updates are ignored. We can see about a
    *4.8%* (185->176) drop of CPU cycles per single page allocation and
    reclaim on Jesper's page_bench01 (single thread) and an *8.1%*
    (343->315) drop of CPU cycles per single page allocation and reclaim on
    Jesper's page_bench03 (88 threads), running on a 2-socket
    Broadwell-based server (88 threads, 126G memory).

    Benchmark link provided by Jesper D Brouer (increase loop times to
    10000000):

    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    =========================================================================

    When page allocation performance is not a bottleneck and you want all
    tooling to work, you can do:

    echo 1 > /proc/sys/vm/numa_stat

    This is the system default setting. (A small userspace example of
    toggling this knob follows the entries for this date.)

    Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka
    for comments to help improve the original patch.

    [keescook@chromium.org: make sure mutex is a global static]
    Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast
    Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Signed-off-by: Kees Cook
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Luis R . Rodriguez"
    Cc: Kees Cook
    Cc: Jonathan Corbet
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Sebastian Andrzej Siewior
    Cc: Andrey Ryabinin
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file
    list") 'pgdat->inactive_ratio' is not used, except for printing
    "node_inactive_ratio: 0" in /proc/zoneinfo output.

    Remove it.

    Link: http://lkml.kernel.org/r/20171003152611.27483-1-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
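
    A small userspace sketch of driving the tunable described in the first
    entry above (the file name is from that entry; set_numa_stat() is a
    made-up helper and error handling is minimal):

    #include <stdio.h>

    static int set_numa_stat(int enable)
    {
            /* Needs root and a kernel that has this sysctl. */
            FILE *f = fopen("/proc/sys/vm/numa_stat", "w");

            if (!f)
                    return -1;
            /* 0: skip NUMA counter updates on the allocation fast path
             * (cheaper allocations, less precise numa_* counters);
             * 1: default, full accounting. */
            fprintf(f, "%d\n", enable ? 1 : 0);
            return fclose(f);
    }

    int main(void)
    {
            return set_numa_stat(1) == 0 ? 0 : 1;
    }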
     

09 Sep, 2017

3 commits

  • To avoid deviation, the per-CPU NUMA stat diffs in vm_numa_stat_diff[]
    are included when a user *reads* the NUMA stats.

    Since NUMA stats are not read frequently by users, and the kernel does
    not need them to make decisions, it is not a problem to make the
    readers more expensive. (A toy model of this per-CPU diff/threshold
    scheme follows the entries for this date.)

    Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Cc: Ying Huang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • There is significant overhead from cache bouncing caused by zone counter
    (NUMA-associated counter) updates running in parallel in multi-threaded
    page allocation (suggested by Dave Hansen).

    This patch updates the NUMA counter threshold to a fixed size of
    MAX_U16 - 2, as a small threshold greatly increases the update frequency
    of the global counter from the local per-CPU counters (suggested by
    Ying Huang).

    The rationale is that these statistics counters don't affect the
    kernel's decisions, unlike other VM counters, so it's not a problem to
    use a large threshold.

    With this patchset, we see a 31.3% drop of CPU cycles (537->369) per
    single page allocation and reclaim on Jesper's page_bench03 benchmark.

    Benchmark provided by Jesper D Brouer (increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold  CPU cycles  Throughput (88 threads)
        32        799      241760478
        64        640      301628829
       125        537      358906028            system default (base)
       256        468      412397590
       512        428      450550704
      4096        399      482520943
     20000        394      489009617
     30000        395      488017817
     65533        369 (-31.3%)   521661345 (+45.3%)   with this patchset
       N/A        342 (-36.3%)   562900157 (+56.8%)   disable zone_statistics

    Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Suggested-by: Dave Hansen
    Suggested-by: Ying Huang
    Acked-by: Mel Gorman
    Cc: Aaron Lu
    Cc: Andi Kleen
    Cc: Christopher Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
     
  • Patch series "Separate NUMA statistics from zone statistics", v2.

    Each page allocation updates a set of per-zone statistics with a call to
    zone_statistics(). As discussed at the 2017 MM summit, these are a
    substantial source of overhead in the page allocator and are very rarely
    consumed. The overhead comes from cache bouncing caused by zone counter
    (NUMA-associated counter) updates running in parallel in multi-threaded
    page allocation (pointed out by Dave Hansen).

    A link to the MM summit slides:
    http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

    To mitigate this overhead, this patchset separates the NUMA statistics
    from the zone statistics framework and updates the NUMA counter
    threshold to a fixed size of MAX_U16 - 2, as a small threshold greatly
    increases the update frequency of the global counter from the local
    per-CPU counters (suggested by Ying Huang). The rationale is that these
    statistics counters don't need to be read often, unlike other VM
    counters, so it's not a problem to use a large threshold and make
    readers more expensive.

    With this patchset, we see a 31.3% drop of CPU cycles (537->369, see
    below) per single page allocation and reclaim on Jesper's page_bench03
    benchmark. Meanwhile, this patchset keeps the same style of virtual
    memory statistics with little end-user-visible effect (the NUMA stats
    are only moved to appear after the zone page stats; see the first patch
    for details).

    I did an experiment of single page allocation and reclaim running
    concurrently, using Jesper's page_bench03 benchmark on a 2-socket
    Broadwell-based server (88 processors with 126G memory), with different
    sizes of the pcp counter threshold.

    Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
    https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

    Threshold  CPU cycles  Throughput (88 threads)
        32        799      241760478
        64        640      301628829
       125        537      358906028            system default
       256        468      412397590
       512        428      450550704
      4096        399      482520943
     20000        394      489009617
     30000        395      488017817
     65533        369 (-31.3%)   521661345 (+45.3%)   with this patchset
       N/A        342 (-36.3%)   562900157 (+56.8%)   disable zone_statistics

    This patch (of 3):

    In this patch, the NUMA statistics are separated from the zone
    statistics framework, and all the call sites of NUMA stats are changed
    to use NUMA-stats-specific functions. There is no functional change
    except that the NUMA stats are shown after the zone page stats when
    users *read* the zone info.

    E.g. cat /proc/zoneinfo

        ***Base***                   ***With this patch***
        nr_free_pages 3976           nr_free_pages 3976
        nr_zone_inactive_anon 0      nr_zone_inactive_anon 0
        nr_zone_active_anon 0        nr_zone_active_anon 0
        nr_zone_inactive_file 0      nr_zone_inactive_file 0
        nr_zone_active_file 0        nr_zone_active_file 0
        nr_zone_unevictable 0        nr_zone_unevictable 0
        nr_zone_write_pending 0      nr_zone_write_pending 0
        nr_mlock 0                   nr_mlock 0
        nr_page_table_pages 0        nr_page_table_pages 0
        nr_kernel_stack 0            nr_kernel_stack 0
        nr_bounce 0                  nr_bounce 0
        nr_zspages 0                 nr_zspages 0
        numa_hit 0                   *nr_free_cma 0*
        numa_miss 0                  numa_hit 0
        numa_foreign 0               numa_miss 0
        numa_interleave 0            numa_foreign 0
        numa_local 0                 numa_interleave 0
        numa_other 0                 numa_local 0
        *nr_free_cma 0*              numa_other 0
        ...                          ...
        vm stats threshold: 10       vm stats threshold: 10
        ...                          ...

    The next patch updates the numa stats counter size and threshold.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
    Signed-off-by: Kemi Wang
    Reported-by: Jesper Dangaard Brouer
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Christopher Lameter
    Cc: Dave Hansen
    Cc: Andi Kleen
    Cc: Ying Huang
    Cc: Aaron Lu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kemi Wang
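
    The three entries above all revolve around the same per-CPU counter
    scheme; the toy userspace model below (made-up names, not kernel code)
    shows the idea: each "CPU" accumulates a small local diff and only
    folds it into the global counter once the diff crosses a threshold,
    while a reader that wants an exact value adds the unfolded diffs
    itself -- which is why a huge threshold is fine for counters that are
    rarely read:

    #include <stdio.h>

    #define NR_CPUS   4
    #define THRESHOLD 125           /* the old zone-stat style threshold */

    static long global_count;
    static int cpu_diff[NR_CPUS];

    static void inc_counter(int cpu)
    {
            /* Fast path: touch only this CPU's diff.  The global (shared,
             * cache-bouncing) counter is updated rarely. */
            if (++cpu_diff[cpu] > THRESHOLD) {
                    global_count += cpu_diff[cpu];
                    cpu_diff[cpu] = 0;
            }
    }

    static long read_counter(void)
    {
            long v = global_count;

            /* Slow path: the (rare) reader pays for precision by folding
             * in the per-CPU diffs, as the first entry above describes. */
            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    v += cpu_diff[cpu];
            return v;
    }

    int main(void)
    {
            for (int i = 0; i < 1000; i++)
                    inc_counter(i % NR_CPUS);
            printf("global only: %ld, folded: %ld\n",
                   global_count, read_counter());
            return 0;
    }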
     

07 Sep, 2017

6 commits

  • Patch series "mm, swap: VMA based swap readahead", v4.

    The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, the consecutive blocks in
    the swap device are read ahead based on the global space locality
    estimation. But the consecutive blocks in the swap device merely reflect
    the order of page reclaiming and don't necessarily reflect the access
    pattern in virtual memory space. And the different tasks in the system
    may have different access patterns, which makes the global space
    locality estimation incorrect.

    In this patchset, when a page fault occurs, the virtual pages near the
    fault address are read ahead instead of the swap slots near the faulting
    swap slot in the swap device. This avoids reading ahead unrelated swap
    slots. At the same time, the swap readahead is changed to work per-VMA
    instead of globally, so that the different access patterns of the
    different VMAs can be distinguished and a different readahead policy
    applied accordingly. The original core readahead detection and scaling
    algorithm is reused, because it is an effective algorithm for detecting
    space locality.

    In addition to the swap readahead changes, some new sysfs interface is
    added to show the efficiency of the readahead algorithm and some other
    swap statistics.

    This new implementation will incur more small random reads. On SSD, the
    improved correctness of estimation and readahead target should outweigh
    the potential increased overhead; this is also illustrated in the test
    results below. But on HDD, the overhead may outweigh the benefit, so the
    original implementation will be used there by default.

    The test and result is as follow,

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case: 4 processes eat 50G of
    virtual memory space and repeat the sequential memory writing until 300
    seconds have passed. The first round of writing triggers swap-out; the
    following rounds trigger sequential swap-in and swap-out.

    At the same time, the vm-scalability random swap test case runs in the
    background: 8 processes eat 30G of virtual memory space and repeat the
    random memory writes until 300 seconds have passed. This triggers
    random swap-in in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                      Base          Optimized
                      ----          ---------
    throughput        345413 KB/s   414029 KB/s (+19.9%)
    latency.average   97.14 us      61.06 us (-37.1%)
    latency.50th      2 us          1 us
    latency.60th      2 us          1 us
    latency.70th      98 us         2 us
    latency.80th      160 us        2 us
    latency.90th      260 us        217 us
    latency.95th      346 us        369 us
    latency.99th      1.34 ms       1.09 ms
    ra_hit%           52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                      Base          Optimized
                      ----          ---------
    elapsed_time      393.49 s      329.88 s (-16.2%)
    ra_hit%           86.21%        98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improved, so
    the optimized kernel runs better in the startup and tear-down stages.
    And the absolute value of the readahead hit rate is high, which shows
    that the space locality is still valid in some practical workloads.

    This patch (of 5):

    The statistics for total readahead pages and total readahead hits are
    recorded and exported via the following sysfs interface.

    /sys/kernel/mm/swap/ra_hits
    /sys/kernel/mm/swap/ra_total

    With them, the efficiency of the swap readahead could be measured, so
    that the swap readahead algorithm and parameters could be tuned
    accordingly.

    [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
    Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
    pagetypeinfo_show_free()'s comment. This commit fixes it.

    Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
    Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
    Signed-off-by: SeongJae Park
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • When order is -1 or too big, *1UL << order* will be 0, which will cause
    a divide error. Although it seems that all callers of
    __fragmentation_index() only pass a valid order, this patch makes the
    function more robust. (A minimal illustration of such a guard follows
    the entries for this date.)

    Should prevent reoccurrences of
    https://bugzilla.kernel.org/show_bug.cgi?id=196555

    Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
    Signed-off-by: Wen Yang
    Reviewed-by: Jiang Biao
    Suggested-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Yang
     
  • global_page_state is error prone, as a recent bug report pointed out [1].
    It only returns proper values for zone-based counters, as the enum it
    gets suggests. We already have global_node_page_state, so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When swapping out a THP (Transparent Huge Page), instead of swapping out
    the THP as a whole, sometimes we have to fall back to splitting the THP
    into normal pages before swapping, because no free swap clusters are
    available, the cgroup limit is exceeded, etc. To count these fallbacks,
    a new VM event THP_SWPOUT_FALLBACK is added, and it is counted when we
    fall back to splitting the THP.

    Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • To support delaying the splitting of a THP (Transparent Huge Page) until
    after it has been swapped out, we need to enhance the swap writing code
    to support writing a THP as a whole. This will improve swap write IO
    performance.

    As Ming Lei pointed out, this should be based on multipage bvec support,
    which hasn't been merged yet. So this patch is only for testing the
    functionality of the other patches in the series, and it will be
    reimplemented after multipage bvec support is merged.

    Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Shaohua Li
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
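
    For the __fragmentation_index entry above, a minimal userspace
    illustration of the kind of guard being described (blocks_needed() and
    the MAX_ORDER value here are assumptions for the example, not the
    upstream diff): validate 'order' before using 1UL << order as a
    divisor, because a stray order such as -1 or 64 makes the divisor zero
    or undefined and the division faults:

    #include <stdio.h>

    #define MAX_ORDER 11            /* a typical kernel value */

    static long blocks_needed(unsigned long nr_pages, int order)
    {
            if (order < 0 || order >= MAX_ORDER)
                    return -1;      /* reject instead of dividing by 0 */
            return (long)(nr_pages / (1UL << order));
    }

    int main(void)
    {
            printf("%ld\n", blocks_needed(1024, 3));        /* 128 */
            printf("%ld\n", blocks_needed(1024, -1));       /* -1, guarded */
            return 0;
    }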
     

11 Jul, 2017

1 commit

  • pagetypeinfo_showmixedcount_print is found to take a lot of time to
    complete, and it does this while holding the zone lock and with
    interrupts disabled. In some cases it is found to take more than a
    second (on a 2.4 GHz, 8 GB RAM, arm64 CPU).

    Avoid taking the zone lock, similar to what is done by read_page_owner,
    which means the results may be inaccurate.

    Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: zhongjiang
    Cc: Sergey Senozhatsky
    Cc: Sudip Mukherjee
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Sebastian Andrzej Siewior
    Cc: David Rientjes
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     

07 Jul, 2017

4 commits

  • Patch series "mm: per-lruvec slab stats"

    Josef is working on a new approach to balancing slab caches and the page
    cache. For this to work, he needs slab cache statistics on the lruvec
    level. These patches implement that by adding infrastructure that
    allows updating and reading generic VM stat items per lruvec, then
    switches some existing VM accounting sites, including the slab
    accounting ones, to this new cgroup-aware API.

    I'll follow up with more patches on this, because there is actually
    substantial simplification that can be done to the memory controller
    when we replace private memcg accounting with making the existing VM
    accounting sites cgroup-aware. But this is enough for Josef to base his
    slab reclaim work on, so here goes.

    This patch (of 5):

    To re-implement slab cache vs. page cache balancing, we'll need the
    slab counters at the lruvec level, which, ever since lru reclaim was
    moved from the zone to the node, is the intersection of the node, not
    the zone, and the memcg.

    We could retain the per-zone counters for when the page allocator dumps
    its memory information on failures, and have counters on both levels -
    which on all but NUMA node 0 is usually redundant. But let's keep it
    simple for now and just move them. If anybody complains we can restore
    the per-zone counters.

    [hannes@cmpxchg.org: fix oops]
    Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org
    Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Show the count of oom killer invocations in /proc/vmstat and the count
    of processes killed in a memory cgroup in the knob "memory.events" (in
    memory.oom_control for v1 cgroup).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently oom in a memory cgroup kills tasks iff
    the shortage has happened inside a page fault.

    These counters help in monitoring oom kills - until now the only way
    was grepping for magic words in the kernel log.

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • pagetypeinfo_showblockcount_print skips over invalid pfns, but it would
    report pages which are offline because those have a valid pfn. Their
    migrate type is misleading at best.

    Now that we have pfn_to_online_page() we can use it instead of
    pfn_valid() and fix this. (A kernel-context sketch of that substitution
    follows the entries for this date.)

    [mhocko@suse.com: fix build]
    Link: http://lkml.kernel.org/r/20170519072225.GA13041@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170515085827.16474-11-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Balbir Singh
    Cc: Dan Williams
    Cc: Daniel Kiper
    Cc: David Rientjes
    Cc: Heiko Carstens
    Cc: Igor Mammedov
    Cc: Jerome Glisse
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Reza Arbab
    Cc: Tobias Regnery
    Cc: Toshi Kani
    Cc: Vitaly Kuznetsov
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Standardize the file operation variable names related to all four memory
    management /proc interface files. Also change all the symbol
    permissions (S_IRUGO) into octal permissions (0444) as it got complaints
    from checkpatch.pl. This does not create any functional change to the
    interface.

    Link: http://lkml.kernel.org/r/20170427030632.8588-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
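
    A kernel-context sketch of the substitution described in the
    pfn_to_online_page() entry above (the surrounding walk and its names
    are hypothetical; only the pfn_valid() -> pfn_to_online_page() swap is
    the point):

    #include <linux/memory_hotplug.h>
    #include <linux/mm.h>

    static void my_walk_blocks(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;

            for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                    /* Previously: if (!pfn_valid(pfn)) continue; followed by
                     * pfn_to_page(pfn).  pfn_valid() is also true for memory
                     * that is offline, so such blocks were reported even
                     * though their migratetype is stale at best. */
                    struct page *page = pfn_to_online_page(pfn);

                    if (!page)
                            continue;       /* hole or offline section */

                    /* ... account page's pageblock migratetype here ... */
            }
    }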
     

13 May, 2017

1 commit

  • After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
    zoneinfo"), /proc/zoneinfo will show unpopulated zones.

    A memoryless node, having no populated zones at all, was previously
    ignored, but will now trigger the WARN() in is_zone_first_populated().

    Remove this warning, as its only purpose was to warn of a situation that
    has since been enabled.

    Aside: The "per-node stats" are still printed under the first populated
    zone, but that's not necessarily the first stanza any more. I'm not
    sure which criterion is more important with regard to not breaking
    parsers, but it looks a little weird to the eye.

    Fixes: e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
    Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

04 May, 2017

5 commits

  • After "mm, vmstat: print non-populated zones in zoneinfo",
    /proc/zoneinfo will show unpopulated zones.

    The per-cpu pageset statistics are not relevant for unpopulated zones
    and can be potentially lengthy, so suppress them when they are not
    interesting.

    Also move the lowmem reserve protection information above the pcp stats,
    since it is relevant for all zones per vm.lowmem_reserve_ratio.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Initscripts can use the information (protection levels) from
    /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.

    vm.lowmem_reserve_ratio is an array of ratios for each configured zone
    on the system. If a zone is not populated on an arch, /proc/zoneinfo
    suppresses its output.

    This results in there not being a 1:1 mapping between the set of zones
    emitted by /proc/zoneinfo and the zones configured by
    vm.lowmem_reserve_ratio.

    This patch shows statistics for non-populated zones in /proc/zoneinfo.
    The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
    Without this patch, it is not possible to determine which index in the
    array controls which zone if one or more zones on the system are not
    populated.

    Remaining users of walk_zones_in_node() are unchanged. Files such as
    /proc/pagetypeinfo require certain zone data to be initialized properly
    for display, which is not done for unpopulated zones.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • madvise()'s MADV_FREE indicates that pages are 'lazyfree'. They are
    still anonymous pages, but they can be freed without pageout. To
    distinguish them from normal anonymous pages, we clear their SwapBacked
    flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them
    once there is memory pressure. It might also be unfair to always
    reclaim MADV_FREE pages before used-once file pages, yet we definitely
    want to reclaim these pages before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages into the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
    list is tiny nowadays and should be full of used-once file pages.
    Reclaiming MADV_FREE pages will not interfere much with anonymous and
    active file pages, and the inactive file pages and MADV_FREE pages will
    be reclaimed according to their age, so we don't reclaim too many
    MADV_FREE pages either. Putting the MADV_FREE pages into the
    LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
    support. This idea was suggested by Johannes.

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid bisect failure; the next patch will do that.

    The patch is based on Minchan's original patch. (A userspace example of
    MADV_FREE usage follows the entries for this date.)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • NR_PAGES_SCANNED counts number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
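
    A userspace view of the MADV_FREE ("lazyfree") behaviour described a
    few entries above, runnable on any kernel that has MADV_FREE (the
    fallback #define below assumes the usual Linux value):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8             /* Linux value, for older headers */
    #endif

    int main(void)
    {
            size_t len = 4UL << 20;         /* 4 MiB of anonymous memory */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0xaa, len);         /* dirty the anonymous pages */

            /* Mark the range as disposable: under memory pressure the
             * kernel may reclaim these pages without any swap I/O; if that
             * happens they read back as zeroes.  Writing to a page again
             * cancels the lazyfree state for that page. */
            if (madvise(buf, len, MADV_FREE) != 0)
                    perror("madvise(MADV_FREE)");

            munmap(buf, len);
            return 0;
    }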
     

20 Apr, 2017

1 commit

  • Geert has reported a freeze during PM resume and some additional
    debugging has shown that the device_resume worker cannot make a forward
    progress because it waits for an event which is stuck waiting in
    drain_all_pages:

    INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
    Not tainted 4.11.0-rc7-koelsch-00029-g005882e53d62f25d-dirty #3476
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kworker/u4:0 D 0 5 2 0x00000000
    Workqueue: events_unbound async_run_entry_fn
    __schedule
    schedule
    schedule_timeout
    wait_for_common
    dpm_wait_for_superior
    device_resume
    async_resume
    async_run_entry_fn
    process_one_work
    worker_thread
    kthread
    [...]
    bash D 0 1703 1694 0x00000000
    __schedule
    schedule
    schedule_timeout
    wait_for_common
    flush_work
    drain_all_pages
    start_isolate_page_range
    alloc_contig_range
    cma_alloc
    __alloc_from_contiguous
    cma_allocator_alloc
    __dma_alloc
    arm_dma_alloc
    sh_eth_ring_init
    sh_eth_open
    sh_eth_resume
    dpm_run_callback
    device_resume
    dpm_resume
    dpm_resume_end
    suspend_devices_and_enter
    pm_suspend
    state_store
    kernfs_fop_write
    __vfs_write
    vfs_write
    SyS_write
    [...]
    Showing busy workqueues and worker pools:
    [...]
    workqueue mm_percpu_wq: flags=0xc
    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
    delayed: drain_local_pages_wq, vmstat_update
    pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
    delayed: drain_local_pages_wq BAR(1703), vmstat_update

    Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE,
    so it is still frozen this early during resume and we are effectively
    deadlocked. Fix this by dropping WQ_FREEZABLE when creating
    mm_percpu_wq; we really want to have it operational all the time. (A
    sketch of the workqueue creation follows this entry.)

    Fixes: ce612879ddc7 ("mm: move pcp and lru-pcp draining into single wq")
    Reported-and-tested-by: Geert Uytterhoeven
    Debugged-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
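
    A kernel-context sketch of the fix described above (the flag choice
    mirrors the description; treat it as an illustration rather than the
    exact upstream line):

    #include <linux/errno.h>
    #include <linux/init.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *my_mm_wq;

    static int __init my_mm_wq_init(void)
    {
            /* WQ_MEM_RECLAIM gives the queue a rescuer thread; WQ_FREEZABLE
             * is deliberately absent so vmstat updates and per-cpu page
             * draining keep working during suspend/resume. */
            my_mm_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
            return my_mm_wq ? 0 : -ENOMEM;
    }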
     

08 Apr, 2017

1 commit

  • We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
    per cpu lru caches. This seems more than necessary because both can run
    on a single WQ. Both do not block on locks requiring a memory
    allocation nor perform any allocations themselves. We will save one
    rescuer thread this way.

    On the other hand, drain_all_pages() queues work on the system wq, which
    doesn't have a rescuer and so depends on memory allocation (it can get
    stuck when all workers are busy allocating and new ones cannot be
    created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee a
    forward progress. We can reuse the same one as for lru draining and
    vmstat.

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Apr, 2017

1 commit

  • Yang Li has reported that drain_all_pages triggers a WARN_ON which means
    that this function is called earlier than the mm_percpu_wq is
    initialized on arm64 with CMA configured:

    WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
    Modules linked in:
    CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
    Hardware name: Freescale Layerscape 2088A RDB Board (DT)
    task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
    PC is at drain_all_pages+0x244/0x25c
    LR is at start_isolate_page_range+0x14c/0x1f0
    [...]
    drain_all_pages+0x244/0x25c
    start_isolate_page_range+0x14c/0x1f0
    alloc_contig_range+0xec/0x354
    cma_alloc+0x100/0x1fc
    dma_alloc_from_contiguous+0x3c/0x44
    atomic_pool_init+0x7c/0x208
    arm64_dma_init+0x44/0x4c
    do_one_initcall+0x38/0x128
    kernel_init_freeable+0x1a0/0x240
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x20

    Fix this by moving the whole of setup_vmstat, which is an initcall right
    now, to init_mm_internals, which will be called right after the WQ
    subsystem is initialized.

    Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Yang Li
    Tested-by: Yang Li
    Tested-by: Xiaolong Ye
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Mar, 2017

1 commit

  • We added support for PUD-sized transparent hugepages, but we count the
    "thp split pud" event as part of the thp_split_pmd event.

    To separate the event count of thp split pud from pmd, add a new event
    named thp_split_pud. (A condensed sketch of the usual steps for adding
    such a VM event follows this entry.)

    Link: http://lkml.kernel.org/r/1488282380-5076-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Cc: Sebastian Siewior
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: David Rientjes
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
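
    A condensed sketch of the usual steps for wiring up a VM event such as
    thp_split_pud (file locations recalled from memory, shown for
    orientation rather than as an exact diff):

    /* 1. include/linux/vm_event_item.h: add THP_SPLIT_PUD to the
     *    vm_event_item enum (under the PUD-capable config option).
     * 2. mm/vmstat.c: add the matching "thp_split_pud" string at the same
     *    position in vmstat_text[] -- see the 06 Oct 2018 entries above for
     *    what happens when the two get out of sync.
     * 3. At the site that actually splits the PUD, bump the counter:
     */
    #include <linux/vm_event_item.h>
    #include <linux/vmstat.h>

    static void my_note_pud_split(void)
    {
            count_vm_event(THP_SPLIT_PUD);  /* shows up in /proc/vmstat */
    }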
     

23 Feb, 2017

1 commit

  • A "compact_daemon_wake" vmstat exists that represents the number of
    times kcompactd has woken up. This doesn't represent how much work it
    actually did, though.

    It's useful to understand how much compaction work is being done by
    kcompactd versus other methods such as direct compaction and explicitly
    triggered per-node (or system) compaction.

    This adds two new vmstats: "compact_daemon_migrate_scanned" and
    "compact_daemon_free_scanned" to represent the number of pages kcompactd
    has scanned as part of its migration scanner and freeing scanner,
    respectively.

    These values are still accounted for in the general
    "compact_migrate_scanned" and "compact_free_scanned" for compatibility.

    It could be argued that explicitly triggered compaction could also be
    tracked separately, and that could be added if others find it useful.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

02 Dec, 2016

3 commits

  • Install the callbacks via the state machine, but do not invoke them as we
    can initialize the node state without calling the callbacks on all online
    CPUs.

    start_shepherd_timer() is now called outside the get_online_cpus() block,
    which is safe as it only operates on the cpu possible mask.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161129145221.ffc3kg3hd7lxiwj6@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Both iterations over online cpus can be replaced by the proper node
    specific functions.

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161129145113.fn3lw5aazjjvdrr3@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     
  • Both functions are called with protection against cpu hotplug already so
    *_online_cpus() could be dropped.

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Johannes Weiner
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20161126231350.10321-8-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

08 Oct, 2016

3 commits

  • Allow some seq_puts removals by taking a string instead of a single
    char.

    [akpm@linux-foundation.org: update vmstat_show(), per Joe]
    Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
    Signed-off-by: Joe Perches
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Every current KDE system has process named ksysguardd polling files
    below once in several seconds:

    $ strace -e trace=open -p $(pidof ksysguardd)
    Process 1812 attached
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8
    open("/proc/net/dev", O_RDONLY) = 8
    open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory)
    open("/proc/stat", O_RDONLY) = 8
    open("/proc/vmstat", O_RDONLY) = 8

    Hell knows what it is doing, but speed up reading /proc/vmstat by 33%!

    The benchmark is open+read+close 1,000,000 times (a runnable
    reconstruction of it follows the last entry in this section).

    BEFORE
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% )
    15 context-switches # 0.001 K/sec ( +- 1.41% )
    1 cpu-migrations # 0.000 K/sec ( +- 11.11% )
    104 page-faults # 0.008 K/sec ( +- 0.57% )
    45,489,799,349 cycles # 3.460 GHz ( +- 0.03% )
    9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% )
    2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% )
    79,241,190,850 instructions # 1.74 insn per cycle
    # 0.13 stalled cycles per insn ( +- 0.00% )
    17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% )
    176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% )

    13.691078109 seconds time elapsed ( +- 0.03% )
    ^^^^^^^^^^^^

    AFTER
    $ perf stat -r 10 taskset -c 3 ./proc-vmstat

    Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):

    8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% )
    10 context-switches # 0.001 K/sec ( +- 2.13% )
    1 cpu-migrations # 0.000 K/sec
    104 page-faults # 0.012 K/sec ( +- 0.56% )
    30,384,010,730 cycles # 3.497 GHz ( +- 0.07% )
    12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% )
    3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% )
    28,969,052,879 instructions # 0.95 insn per cycle
    # 0.42 stalled cycles per insn ( +- 0.01% )
    6,308,245,891 branches # 726.058 M/sec ( +- 0.00% )
    214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% )

    9.146081052 seconds time elapsed ( +- 0.07% )
    ^^^^^^^^^^^

    vsnprintf() is slow because:

    1. format_decode() is busy looking for the format specifier: 2 branches
    per character (not in this case, but in others);

    2. there are approximately a million branches while parsing the format
    mini-language and everywhere else;

    3. just look at what string() does. /proc/vmstat is a good case because
    most of its content is strings.

    Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
    Signed-off-by: Alexey Dobriyan
    Cc: Joe Perches
    Cc: Andi Kleen
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • In current kernel code, we only call node_set_state(cpu_to_node(cpu),
    N_CPU) when a cpu is hot plugged. But we do not set the node state for
    N_CPU when the cpus are brought online during boot.

    So this could lead to a failure when we check whether a node contains a
    cpu with node_state(node_id, N_CPU).

    One use case is in the node_reclaim function:

    /*
     * Only run node reclaim on the local node or on nodes that do not
     * have associated processors. This will favor the local processor
     * over remote processors and spread off node memory allocations
     * as wide as possible.
     */
    if (node_state(pgdat->node_id, N_CPU) &&
        pgdat->node_id != numa_node_id())
            return NODE_RECLAIM_NOSCAN;

    I instrumented the kernel to call this function after boot and it always
    returned 0 on an x86 desktop machine until I applied the attached patch.

    int num_cpu_node(void)
    {
            int i, nr_cpu_nodes = 0;

            for_each_node(i) {
                    if (node_state(i, N_CPU))
                            ++nr_cpu_nodes;
            }

            return nr_cpu_nodes;
    }

    Fix this by checking each node for online CPUs when we initialize
    vmstat, which is responsible for maintaining the node state.

    Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
    Signed-off-by: Tim Chen
    Acked-by: David Rientjes
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc:
    Cc: Ying
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
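
    The /proc/vmstat entry above measures "open+read+close 1,000,000
    times"; below is a runnable reconstruction of such a micro-benchmark
    (loop count lowered so it finishes quickly -- raise it and wrap the
    program in perf stat to reproduce numbers of that kind):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[8192];

            for (int i = 0; i < 100000; i++) {
                    int fd = open("/proc/vmstat", O_RDONLY);

                    if (fd < 0)
                            return 1;
                    while (read(fd, buf, sizeof(buf)) > 0)
                            ;               /* drain the file */
                    close(fd);
            }
            puts("done");
            return 0;
    }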