12 Feb, 2015

2 commits

  • It was noted that the vm stat shepherd runs every 2 seconds and that the
    vmstat update is then scheduled 2 seconds in the future.

    This yields an effective update interval of twice the configured one,
    which is not desired.

    Change the shepherd so that it does not delay the vmstat update on the
    other cpu. We still have to use schedule_delayed_work since we are using
    a delayed_work_struct, but we can set the delay to 0.
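
    As a rough sketch of the change (the call below is paraphrased from the
    description above, not quoted from the patch):

    /* Shepherd side: queue the remote CPU's vmstat update immediately
     * instead of adding another full interval of delay on top. */
    schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), 0);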

    Signed-off-by: Christoph Lameter
    Acked-by: Michal Hocko
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Vinayak Menon has reported that an excessive number of tasks was throttled
    in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE
    was relatively high compared to NR_INACTIVE_FILE. However it turned out
    that the real number of NR_ISOLATED_FILE was 0 and the per-cpu
    vm_stat_diff wasn't transferred into the global counter.

    vmstat_work, which is responsible for the sync, is defined as deferrable
    delayed work, which means that the defined timeout doesn't wake up an
    idle CPU. A CPU might stay in an idle state for a long time, and the
    general effort is to keep such a CPU in this state as long as possible,
    which might lead to all sorts of trouble for vmstat consumers, as can be
    seen with the excessive direct reclaim throttling.

    This patch basically reverts 39bf6270f524 ("VM statistics: Make timer
    deferrable") but it shouldn't cause any problems for idle CPUs because
    only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5
    ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a
    longer time shouldn't have per-cpu drift.
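
    The difference comes down to how the per-cpu work item is initialized;
    a paraphrased sketch (the handler and work item names follow the usual
    vmstat code, treat them as illustrative):

    /* Deferrable: the underlying timer will not wake an idle CPU, so a
     * pending per-cpu vm_stat_diff may never be folded back. */
    INIT_DEFERRABLE_WORK(&per_cpu(vmstat_work, cpu), vmstat_update);

    /* Non-deferrable (what the revert effectively restores): the timer
     * fires even on an idle CPU; since 7cc36bbddde5 only CPUs with
     * per-cpu drift have this work scheduled at all. */
    INIT_DELAYED_WORK(&per_cpu(vmstat_work, cpu), vmstat_update);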

    Fixes: 39bf6270f524 (VM statistics: Make timer deferrable)
    Signed-off-by: Michal Hocko
    Reported-by: Vinayak Menon
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Feb, 2015

1 commit

  • CONFIG_COMPACTION=y, CONFIG_DEBUG_FS=n:

    mm/vmstat.c:690: warning: 'frag_start' defined but not used
    mm/vmstat.c:702: warning: 'frag_next' defined but not used
    mm/vmstat.c:710: warning: 'frag_stop' defined but not used
    mm/vmstat.c:715: warning: 'walk_zones_in_node' defined but not used

    It's all a bit of a tangly mess and it's unclear why CONFIG_COMPACTION
    figures in there at all. Move frag_start/frag_next/frag_stop and
    migratetype_names[] into the existing CONFIG_PROC_FS block.

    walk_zones_in_node() gets a special ifdef.

    Also move the #include lines up to where #include lines live.

    [axel.lin@ingics.com: fix build error when !CONFIG_PROC_FS]
    Signed-off-by: Axel Lin
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Tested-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

14 Dec, 2014

2 commits

  • These flushes deal with sequence number overflows, such as for long lived
    threads. These are rare, but interesting from a debugging PoV. As such,
    display the number of flushes when vmacache debugging is enabled.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This is the page owner tracking code which was introduced quite a while
    ago. It has been resident in Andrew's tree, but nobody tried to upstream
    it, so it has remained as is. Our company uses this feature actively to
    debug memory leaks or to find memory hoggers, so I decided to upstream
    this feature.

    This functionality helps us to know who allocated a page. When
    allocating a page, we store some information about the allocation in
    extra memory. Later, if we need to know the status of all pages, we can
    get and analyze that stored information.

    In the previous version of this feature, the extra memory was statically
    defined in struct page, but in this version the extra memory is
    allocated outside of struct page. This enables us to turn the feature
    on/off at boot time without considerable memory waste.
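
    As an illustration of the "extra memory" mentioned above, the per-page
    record is conceptually something like the following sketch (field names
    and sizes are illustrative, not the exact upstream layout):

    /* stored in a separately allocated per-page extension area,
     * not in struct page itself */
    struct page_owner_record {
            unsigned int order;        /* allocation order */
            gfp_t gfp_mask;            /* allocation flags */
            unsigned long trace[8];    /* saved allocation stack trace */
            unsigned int nr_entries;   /* valid entries in trace[] */
    };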

    Although we already have tracepoints for tracing page allocation/free,
    using them to analyze page owners is rather complex. We would need to
    enlarge the trace buffer to prevent it from being overwritten before the
    userspace program is launched, and the launched program would have to
    continually dump out the trace buffer for later analysis, which changes
    system behaviour more than just keeping the data in memory, so it is bad
    for debugging.

    Moreover, we can use the page_owner feature for further purposes. For
    example, we can use it for the fragmentation statistics implemented in
    this patch. I also plan to implement some CMA failure debugging features
    using this interface.

    I'd like to give credit to all the developers who contributed to this
    feature, but it's not easy because I don't know the exact history. Sorry
    about that. Below are the people who have "Signed-off-by" in the patches
    in Andrew's tree.

    Contributor:
    Alexander Nyberg
    Mel Gorman
    Dave Hansen
    Minchan Kim
    Michal Nazarewicz
    Andrew Morton
    Jungsoo Son

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

10 Oct, 2014

2 commits

  • vmstat workers are used for folding counter differentials into the zone,
    per node and global counters at certain time intervals. They currently
    run at defined intervals on all processors which will cause some holdoff
    for processors that need minimal intrusion by the OS.

    The current vmstat_update mechanism depends on a deferrable timer firing
    every other second by default which registers a work queue item that runs
    on the local CPU, with the result that we have 1 interrupt and one
    additional schedulable task on each CPU every 2 seconds. If a workload
    indeed causes VM activity or multiple tasks are running on a CPU, then
    there are probably bigger issues to deal with.

    However, some workloads dedicate a CPU for a single CPU bound task. This
    is done in high performance computing, in high frequency financial
    applications, in networking (Intel DPDK, EZchip NPS) and with the advent
    of systems with more and more CPUs over time, this may become more and
    more common to do since when one has enough CPUs one cares less about
    efficiently sharing a CPU with other tasks and more about efficiently
    monopolizing a CPU per task.

    The difference made by having this timer fire and a workqueue kernel
    thread scheduled per second can be enormous. An artificial test
    measuring the worst-case time to do a simple "i++" in an endless loop on
    a bare metal system and under Linux on an isolated CPU with dynticks,
    with and without this patch, has Linux match the bare metal performance
    (~700 cycles) with this patch and lose by a couple of orders of
    magnitude (~200k cycles) without it[*]. The loss occurs for something
    that just calculates statistics. For networking applications, for
    example, this could be the difference between dropping packets and
    sustaining line rate.

    Statistics are important and useful, but it would be great if there were
    a way to keep statistics gathering from producing a huge performance
    difference. This patch does just that.

    This patch creates a vmstat shepherd worker that monitors the per-cpu
    differentials on all processors. If there are differentials on a
    processor, then a vmstat worker local to that processor is created.
    That worker will then start folding the diffs at regular intervals.
    Should the worker find that there is no work to be done, it will make
    the shepherd worker monitor the differentials again.
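
    A condensed sketch of the shepherd described above (the helper names
    and the shepherd work item are paraphrased, not quoted from the patch):

    static void vmstat_shepherd(struct work_struct *w)
    {
            int cpu;

            /* look for CPUs whose per-cpu counters have drifted */
            for_each_online_cpu(cpu)
                    if (need_update(cpu))
                            /* hand the folding to a CPU-local vmstat worker */
                            schedule_delayed_work_on(cpu,
                                    &per_cpu(vmstat_work, cpu),
                                    round_jiffies_relative(sysctl_stat_interval));

            /* re-arm the shepherd itself for the next check */
            schedule_delayed_work(&shepherd,
                    round_jiffies_relative(sysctl_stat_interval));
    }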

    With this patch it is possible then to have periods longer than
    2 seconds without any OS event on a "cpu" (hardware thread).

    The patch shows a very minor increase in system performance.

    hackbench -s 512 -l 2000 -g 15 -f 25 -P

    Results before the patch:

    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.992
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.971
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 5.063

    Hackbench after the patch:

    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.973
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.990
    Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
    Each sender will pass 2000 messages of 512 bytes
    Time: 4.993

    [fengguang.wu@intel.com: cpu_stat_off can be static]
    Signed-off-by: Christoph Lameter
    Reviewed-by: Gilad Ben-Yossef
    Cc: Frederic Weisbecker
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: John Stultz
    Cc: Mike Frysinger
    Cc: Minchan Kim
    Cc: Hakan Akkan
    Cc: Max Krasnyansky
    Cc: "Paul E. McKenney"
    Cc: Hugh Dickins
    Cc: Viresh Kumar
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Always mark pages with PageBalloon even if balloon compaction is disabled
    and expose this mark in /proc/kpageflags as KPF_BALLOON.

    This patch also adds three counters to /proc/vmstat: "balloon_inflate",
    "balloon_deflate" and "balloon_migrate". They accumulate balloon
    activity. The current size of the balloon is (balloon_inflate -
    balloon_deflate) pages.

    All generic balloon code is now gathered under the option
    CONFIG_MEMORY_BALLOON. It should be selected by any ballooning driver
    which wants to use this feature. Currently virtio-balloon is the only
    user.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Aug, 2014

3 commits

  • When kswapd is awake reclaiming, the per-cpu stat thresholds are lowered
    to get more accurate counts to avoid breaching watermarks. This
    threshold update iterates over all possible CPUs which is unnecessary.
    Only online CPUs need to be updated. If a new CPU is onlined,
    refresh_zone_stat_thresholds() will set the thresholds correctly.
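
    The shape of the change is simply the iterator; roughly:

    /* before: touches every possible CPU, even ones never onlined */
    for_each_possible_cpu(cpu)
            per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;

    /* after: only online CPUs; refresh_zone_stat_thresholds() takes care
     * of a CPU that is onlined later */
    for_each_online_cpu(cpu)
            per_cpu_ptr(zone->pageset, cpu)->stat_threshold = threshold;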

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • zone->pages_scanned is a write-intensive cache line during page reclaim
    and it's also updated during page free. Move the counter into vmstat to
    take advantage of the per-cpu updates and do not update it in the free
    paths unless necessary.

    On a small UMA machine running tiobench the difference is marginal. On
    a 4-node machine the overhead is more noticeable. Note that automatic
    NUMA balancing was disabled for this test as otherwise the system CPU
    overhead is unpredictable.

                      3.16.0-rc3    3.16.0-rc3    3.16.0-rc3
                         vanilla  rearrange-v5     vmstat-v5
    User                  746.94        759.78        774.56
    System              65336.22      58350.98      32847.27
    Elapsed             27553.52      27282.02      27415.04

    Note that the overhead reduction will vary depending on where exactly
    pages are allocated and freed.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The arrangement of struct zone has changed over time and now it has
    reached the point where there is some inappropriate sharing going on.
    On x86-64 for example

    o The zone->node field is shared with the zone lock and zone->node is
    accessed frequently from the page allocator due to the fair zone
    allocation policy.

    o span_seqlock is almost never used but shares a cache line with
    free_area

    o Some zone statistics share a cache line with the LRU lock so
    reclaim-intensive and allocator-intensive workloads can bounce the cache
    line on a stat update

    This patch rearranges struct zone to put read-only and read-mostly
    fields together and then splits the page allocator intensive fields, the
    zone statistics and the page reclaim intensive fields into their own
    cache lines. Note that the type of lowmem_reserve changes due to the
    watermark calculations being signed and avoiding a signed/unsigned
    conversion there.

    On the test configuration I used, the overall size of struct zone shrunk
    by one cache line. On smaller machines, this is not likely to be
    noticeable. However, on a 4-node NUMA machine running tiobench the
    system CPU overhead is reduced by this patch.

                      3.16.0-rc3      3.16.0-rc3
                         vanilla  rearrange-v5r9
    User                  746.94          759.78
    System              65336.22        58350.98
    Elapsed             27553.52        27282.02
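
    The grouping relies on the kernel's usual cache line padding helpers; a
    layout sketch with almost all members elided:

    struct zone {
            /* read-mostly fields: watermarks, lowmem_reserve[], node, ... */
            ZONE_PADDING(_pad1_)
            /* write-intensive fields used from the page allocator */
            spinlock_t lock;
            ZONE_PADDING(_pad2_)
            /* write-intensive fields used by page reclaim */
            spinlock_t lru_lock;
            ZONE_PADDING(_pad3_)
            /* zone statistics */
            atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
    } ____cacheline_internodealigned_in_smp;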

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

05 Jun, 2014

3 commits

  • mlocked_vma_newpage() is called with the pte lock held (a spinlock),
    which implies preemption is disabled, and the vm stat counter is not
    modified from interrupt context, so we need not use the irq-safe
    mod_zone_page_state() here; using the light-weight version
    __mod_zone_page_state() would be OK.

    This patch also documents __mod_zone_page_state() and some of its
    callsites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and acked by Christoph.

    Most of the credit goes to Hugh and Christoph for the clarification on
    the usage of __mod_zone_page_state().
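
    In other words, when the caller already runs with preemption disabled
    and the counter is never modified from interrupt context, the cheaper
    helper is sufficient; a sketch of the pattern (call sites paraphrased):

    /* pte lock (a spinlock) is held here, so preemption is off and the
     * non-irq-safe variant can be used */
    __mod_zone_page_state(page_zone(page), NR_MLOCK, hpage_nr_pages(page));

    /* otherwise the irq-safe variant is required */
    mod_zone_page_state(page_zone(page), NR_MLOCK, -1);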

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
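
    The conversion is mechanical; one representative hunk looks roughly like
    this (paraphrased, not quoted from the patch):

    /* before: address calculation via __get_cpu_var() */
    struct vm_event_state *this = &__get_cpu_var(vm_event_states);

    /* after: the equivalent this_cpu operation */
    struct vm_event_state *this = this_cpu_ptr(&vm_event_states);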

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
    hit rate -- exported in /proc/vmstat.

    Any updates to the caching scheme needs this kind of data, thus it can
    save some work re-implementing the counting all the time.
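
    The counting compiles away entirely when the option is off; a sketch of
    the usual pattern (macro and event names are illustrative):

    #ifdef CONFIG_DEBUG_VM_VMACACHE
    #define count_vm_vmacache_event(x) count_vm_event(x)
    #else
    #define count_vm_vmacache_event(x) do {} while (0)
    #endif

    /* in the lookup path */
    count_vm_vmacache_event(VMACACHE_FIND_CALLS);
    /* ... and on a successful lookup */
    count_vm_vmacache_event(VMACACHE_FIND_HITS);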

    Signed-off-by: Davidlohr Bueso
    Cc: Aswin Chandramouleeswaran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

08 Apr, 2014

1 commit

  • Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
    "The purpose of this single series of commits from Srivatsa S Bhat
    (with a small piece from Gautham R Shenoy) touching multiple
    subsystems that use CPU hotplug notifiers is to provide a way to
    register them that will not lead to deadlocks with CPU online/offline
    operations as described in the changelog of commit 93ae4f978ca7f ("CPU
    hotplug: Provide lockless versions of callback registration
    functions").

    The first three commits in the series introduce the API and document
    it and the rest simply goes through the users of CPU hotplug notifiers
    and converts them to using the new method"

    * tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
    net/iucv/iucv.c: Fix CPU hotplug callback registration
    net/core/flow.c: Fix CPU hotplug callback registration
    mm, zswap: Fix CPU hotplug callback registration
    mm, vmstat: Fix CPU hotplug callback registration
    profile: Fix CPU hotplug callback registration
    trace, ring-buffer: Fix CPU hotplug callback registration
    xen, balloon: Fix CPU hotplug callback registration
    hwmon, via-cputemp: Fix CPU hotplug callback registration
    hwmon, coretemp: Fix CPU hotplug callback registration
    thermal, x86-pkg-temp: Fix CPU hotplug callback registration
    octeon, watchdog: Fix CPU hotplug callback registration
    oprofile, nmi-timer: Fix CPU hotplug callback registration
    intel-idle: Fix CPU hotplug callback registration
    clocksource, dummy-timer: Fix CPU hotplug callback registration
    drivers/base/topology.c: Fix CPU hotplug callback registration
    acpi-cpufreq: Fix CPU hotplug callback registration
    zsmalloc: Fix CPU hotplug callback registration
    scsi, fcoe: Fix CPU hotplug callback registration
    scsi, bnx2fc: Fix CPU hotplug callback registration
    scsi, bnx2i: Fix CPU hotplug callback registration
    ...

    Linus Torvalds
     

04 Apr, 2014

3 commits

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list (a rough
    sketch of the resulting node layout follows the list):

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.
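
    Putting the four points together, the node ends up looking roughly like
    the sketch below (field names are illustrative; only the grouping
    follows the list above):

    struct radix_tree_node {
            unsigned int path;   /* offset in parent, packed above height */
            unsigned int count;  /* shadow count packed above entry count */
            union {
                    struct {
                            struct radix_tree_node *parent;  /* for ascending */
                            void *private_data;  /* address_space backpointer */
                    };
                    struct rcu_head rcu_head;    /* only used when freeing */
            };
            struct list_head private_list;       /* shadow node LRU linkage */
            void __rcu *slots[64];               /* at most 64 entries */
    };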

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the frequently used list the "active list". In
    practice, however, this scheme ultimately was not significantly better
    than a FIFO policy and still thrashed cache based on eviction speed,
    rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation for further improving the
    protection ability and reducing the minimum inactive list size from its
    current 50%.
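
    Conceptually the refault check boils down to the following sketch
    (helper names and the zone counter are illustrative of the description
    above, not quoted from the patch):

    static bool refault_should_activate(struct zone *zone, void *shadow)
    {
            unsigned long refault_distance;

            /* eviction stored a snapshot of the zone's inactive ageing
             * counter in the empty radix tree slot; the difference to the
             * current value is the minimum access distance of this page */
            refault_distance = atomic_long_read(&zone->inactive_age) -
                               unpack_shadow(shadow);

            /* the page would have stayed resident had the inactive list
             * been bigger by no more than the active list, so activate it */
            return refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE);
    }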

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Mar, 2014

1 commit

  • Subsystems that want to register CPU hotplug callbacks, as well as perform
    initialization for the CPUs that are already online, often do it as shown
    below:

    get_online_cpus();

    for_each_online_cpu(cpu)
    init_cpu(cpu);

    register_cpu_notifier(&foobar_cpu_notifier);

    put_online_cpus();

    This is wrong, since it is prone to ABBA deadlocks involving the
    cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
    with CPU hotplug operations).

    Instead, the correct and race-free way of performing the callback
    registration is:

    cpu_notifier_register_begin();

    for_each_online_cpu(cpu)
    init_cpu(cpu);

    /* Note the use of the double underscored version of the API */
    __register_cpu_notifier(&foobar_cpu_notifier);

    cpu_notifier_register_done();

    Fix the vmstat code in the MM subsystem by using this latter form of callback
    registration.

    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Cody P Schafer
    Cc: Toshi Kani
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Acked-by: Christoph Lameter
    Acked-by: Rik van Riel
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Rafael J. Wysocki

    Srivatsa S. Bhat
     

25 Jan, 2014

1 commit

  • Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") to cause overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.
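
    The usual way to make such counters optional is a thin wrapper that
    compiles to nothing when the debug option is off; a sketch (macro name
    assumed):

    #ifdef CONFIG_DEBUG_TLBFLUSH
    #define count_vm_tlb_event(x)   count_vm_event(x)
    #else
    #define count_vm_tlb_event(x)   do {} while (0)
    #endif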

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

13 Nov, 2013

3 commits

  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test
    test

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

                  3.12.0       3.12.0
                 vanilla  acctupdates
    User         5169.64      5184.14
    System        100.45        80.02
    Elapsed       252.75       251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                  3.12.0       3.12.0
                                 vanilla  acctupdates
    NUMA page range updates      1408326     11043064
    NUMA huge PMD updates              0        21040
    NUMA PTE updates             1408326       291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a
    node at CPU online event. However, it does not update this N_CPU info at
    CPU offline event.

    Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
    node goes offline, i.e. the node no longer has any online CPU.
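
    The offline path conceptually becomes (a sketch; the surrounding
    notifier cases and teardown are elided):

    case CPU_DEAD:
    case CPU_DEAD_FROZEN: {
            cpumask_t mask;
            int node = cpu_to_node(cpu);

            /* the dead CPU is already gone from cpu_online_mask here, so
             * an empty intersection means no CPU of this node is online */
            cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask);
            if (cpumask_empty(&mask))
                    node_clear_state(node, N_CPU);
            break;
    }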

    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • After a system booted, N_CPU is not set to any node as has_cpu shows an
    empty line.

    # cat /sys/devices/system/node/has_cpu
    (show-empty-line)

    setup_vmstat() registers its CPU notifier callback,
    vmstat_cpuup_callback(), which marks N_CPU to a node when a CPU is
    brought online. However, setup_vmstat() is called after all CPUs are
    launched in the boot sequence.

    Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at
    boot, which is consistent with other operations in
    vmstat_cpuup_callback(), i.e. start_cpu_timer() and
    refresh_zone_stat_thresholds().

    Also added get_online_cpus() to protect the for_each_online_cpu() loop.
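
    The boot-time fix is essentially the same marking the notifier callback
    performs, applied once under the hotplug lock; a sketch:

    get_online_cpus();
    for_each_online_cpu(cpu)
            node_set_state(cpu_to_node(cpu), N_CPU);
    put_online_cpus();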

    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

12 Sep, 2013

7 commits

  • This patch is based on KOSAKI's work and I add a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    Currently, I found the system can enter a state where there are lots of
    free pages in a zone but only order-0 and order-1 pages, which means the
    zone is heavily fragmented; then a high-order allocation can cause a
    long stall (e.g. 60 seconds) in the direct reclaim path, especially in a
    no-swap and no-compaction environment. This problem happened on v3.4,
    but it seems the issue still lives in the current tree; the reason is
    that do_try_to_free_pages() enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to 0-order because kswapd thinks order-0 is the
    most important. Look at 73ce02e9 in detail. If watermarks are ok,
    kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they
    wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So it
    means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects it
    and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so this way would
    be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone reclaimable state every time.

    Note: we can't take the approach of letting direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because it is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
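
    With the flag gone, the reclaimable state is simply derived from
    existing counters on every check; roughly:

    /* a zone is considered unreclaimable once it has been scanned several
     * times (here 6x) the number of pages it could plausibly reclaim */
    static bool zone_reclaimable(struct zone *zone)
    {
            return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }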

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • Disabling interrupts repeatedly can be avoided in the inner loop if we use
    a this_cpu operation.
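
    Concretely, instead of a local_irq_save()/local_irq_restore() pair
    around reading and clearing each per-cpu counter, a single this_cpu
    operation can fetch and zero it; a paraphrased sketch of the inner loop:

    for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
            if (__this_cpu_read(p->vm_stat_diff[i])) {
                    /* grab and clear the diff without touching irqs */
                    int v = this_cpu_xchg(p->vm_stat_diff[i], 0);

                    atomic_long_add(v, &zone->vm_stat[i]);
                    global_diff[i] += v;
            }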

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Both functions that update global counters use the same mechanism.

    Create a function that contains the common code.

    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed and
    therefore no other processor will be accessing the counters. Also
    simplifies synchronization.

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced
    twice and
    activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
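
    The batching itself is small; a paraphrased sketch of the two halves
    (the batch is shown here as a per-zone vmstat item, NR_ALLOC_BATCH;
    take the exact names and call sites as illustrative):

    /* fast path: a local zone whose fair-share batch is exhausted is
     * treated as full and skipped... */
    if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
            continue;
    /* ...and every successful allocation charges the batch */
    __mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));

    /* slow path: before kswapd is kicked, each local zone's batch is reset
     * to a value proportional to its size, restarting the round-robin */
    mod_zone_page_state(zone, NR_ALLOC_BATCH,
                        high_wmark_pages(zone) - low_wmark_pages(zone) -
                        zone_page_state(zone, NR_ALLOC_BATCH));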

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
    counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
    for SMP.

    UP systems do not do remote TLB flushes, so compile those counters out on
    UP.

    arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly. This is
    probably an optimization since both the mtrr code and __flush_tlb() write
    cr4. It would probably be safe to make that a flush_tlb_all() (and then
    get these statistics), but the mtrr code is ancient and I'm hesitant to
    touch it other than to just stick in the counters.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I was investigating some TLB flush scaling issues and realized that we do
    not have any good methods for figuring out how many TLB flushes we are
    doing.

    It would be nice to be able to do these in generic code, but the
    arch-independent calls don't explicitly specify whether we actually need
    to do remote flushes or not. In the end, we really need to know if we
    actually _did_ global vs. local invalidations, so that leaves us with few
    options other than to muck with the counters from arch-specific code.
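
    The counting therefore lives next to the arch-specific flush calls,
    along the lines of (event names are illustrative; call sites
    paraphrased):

    /* remote flush path */
    count_vm_event(NR_TLB_REMOTE_FLUSH);           /* on the initiator */
    count_vm_event(NR_TLB_REMOTE_FLUSH_RECEIVED);  /* on each target CPU */

    /* local flush helpers */
    count_vm_event(NR_TLB_LOCAL_FLUSH_ALL);
    count_vm_event(NR_TLB_LOCAL_FLUSH_ONE);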

    Signed-off-by: Dave Hansen
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. For example, the fix in
    commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
    is a good example of the nasty type of bugs that can be created
    with improper use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

30 Apr, 2013

2 commits


24 Feb, 2013

4 commits

  • Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
    duplication.

    This also switches to using them in compaction (where an additional
    variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
    kmemleak.

    Note that in compaction.c I avoid calling zone_end_pfn() repeatedly
    because I expect at some point the synchronization issues with start_pfn
    & spanned_pages will need fixing, either by actually using the seqlock
    or by clever memory barrier usage.
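
    The two helpers are one-liners; roughly:

    static inline unsigned long zone_end_pfn(const struct zone *zone)
    {
            return zone->zone_start_pfn + zone->spanned_pages;
    }

    static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
    {
            return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
    }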

    Signed-off-by: Cody P Schafer
    Cc: David Hansen
    Cc: Catalin Marinas
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Several functions test MIGRATE_ISOLATE and some of those are hotpaths,
    but MIGRATE_ISOLATE is only used if we enable CONFIG_MEMORY_ISOLATION
    (i.e. CMA, memory-hotplug and memory-failure), which is not a common
    config option. So let's not add unnecessary overhead and code when
    CONFIG_MEMORY_ISOLATION is not enabled.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages
    where what the user really wants is the number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for the workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in the reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I'm no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

13 Dec, 2012

3 commits

  • Currently a zone's present_pages is calculated as below, which is
    inaccurate and may cause trouble for memory hotplug.

    spanned_pages - absent_pages - memmap_pages - dma_reserve.

    While fixing bugs caused by the inaccurate zone->present_pages, we found
    that zone->present_pages has been abused. The field zone->present_pages
    may have different meanings in different contexts:

    1) pages existing in a zone.
    2) pages managed by the buddy system.

    For more discussions about the issue, please refer to:
    http://lkml.org/lkml/2012/11/5/866
    https://patchwork.kernel.org/patch/1346751/

    This patchset tries to introduce a new field named "managed_pages" to
    struct zone, which counts "pages managed by the buddy system", and
    reverts zone->present_pages to counting "physical pages existing in a
    zone", which also keeps it consistent with pgdat->node_present_pages.

    We will set an initial value for zone->managed_pages in function
    free_area_init_core() and will adjust it later if the initial value is
    inaccurate.

    For DMA/normal zones, the initial value is set to:

    (spanned_pages - absent_pages - memmap_pages - dma_reserve)

    Later zone->managed_pages will be adjusted to the accurate value when the
    bootmem allocator frees all free pages to the buddy system in function
    free_all_bootmem_node() and free_all_bootmem().

    The bootmem allocator doesn't touch highmem pages, so highmem zones'
    managed_pages is set to the accurate value "spanned_pages - absent_pages"
    in function free_area_init_core() and won't be updated anymore.

    This patch also adds a new field "managed_pages" to /proc/zoneinfo
    and sysrq showmem.

    [akpm@linux-foundation.org: small comment tweaks]
    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Maciej Rutecki
    Tested-by: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Christoph Lameter
    Signed-off-by: Wen Congyang
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • hzp_alloc is incremented every time a huge zero page is successfully
    allocated. It includes allocations which were dropped due to a race
    with another allocation. Note, it doesn't count every map of the huge
    zero page, only its allocation.

    hzp_alloc_failed is incremented if kernel fails to allocate huge zero
    page and falls back to using small pages.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: "H. Peter Anvin"
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov