01 Oct, 2013

1 commit

  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug fixed by commit cea27eb was later fixed in a different way by
    commit 3dcc0571cd64 ("mm: correctly update zone->managed_pages"), which
    enhances memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory; for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

12 Sep, 2013

16 commits

  • Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as a buddy,
    instead of using the magic number -2.
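
    A minimal userspace sketch of the idea (the struct, helpers and values
    below are simplified stand-ins, not the kernel implementation):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)  /* named sentinel instead of a bare -2 */

    struct page {                           /* toy stand-in for struct page */
        int _mapcount;
        unsigned int order;
    };

    static void set_page_buddy(struct page *page, unsigned int order)
    {
        page->_mapcount = PAGE_BUDDY_MAPCOUNT_VALUE;
        page->order = order;
    }

    static bool page_is_buddy(const struct page *page)
    {
        return page->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
    }

    int main(void)
    {
        struct page p = { ._mapcount = -1 };    /* -1 means "no mappings" */

        set_page_buddy(&p, 3);
        assert(page_is_buddy(&p));
        printf("page is a buddy at order %u\n", p.order);
        return 0;
    }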

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • This patch is based on KOSAKI's work, to which I have added a little more
    description; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can enter a state in which a zone has lots of
    free pages but only order-0 and order-1 ones, which means the zone is
    heavily fragmented. A high-order allocation can then cause a long stall
    (e.g. 60 seconds) in the direct reclaim path, especially in an
    environment with no swap and no compaction. This problem happened on
    v3.4, but the issue still seems to exist in the current tree; the reason
    is that do_try_to_free_pages() enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, because kswapd thinks there is little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to order-0, because kswapd considers order-0 the
    most important; see commit 73ce02e9 for details. If the watermarks are
    OK, kswapd goes back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is done to avoid a too-early oom-kill.
    So direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever while
    kswapd sleeps forever, until something like a watchdog detects it and
    finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path,
    because the direct reclaim path doesn't take any lock and doing so would
    be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy; commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
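
    A simplified model of the recalculated check (toy types and fields; the
    factor of 6 mirrors the heuristic described for this change, but treat
    the whole snippet as an illustration rather than the kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    /* toy stand-in for struct zone with just the fields the check needs */
    struct zone {
        unsigned long pages_scanned;
        unsigned long nr_inactive;
        unsigned long nr_active;
    };

    static unsigned long zone_reclaimable_pages(const struct zone *z)
    {
        return z->nr_inactive + z->nr_active;
    }

    /*
     * Instead of a sticky zone->all_unreclaimable flag that only kswapd may
     * set, derive the state on demand: the zone is considered unreclaimable
     * once it has been scanned several times over without progress.
     */
    static bool zone_reclaimable(const struct zone *z)
    {
        return z->pages_scanned < zone_reclaimable_pages(z) * 6;
    }

    int main(void)
    {
        struct zone z = { .pages_scanned = 7000, .nr_inactive = 500, .nr_active = 500 };

        printf("reclaimable: %s\n", zone_reclaimable(&z) ? "yes" : "no");
        return 0;
    }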

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • cpuset_zone_allowed was changed to cpuset_zone_allowed_softwall and the
    comment was moved to __cpuset_node_allowed_softwall, so fix this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee
     
  • Currently, early_pfn_to_nid() on architectures that support memblock
    walks memblock.memory one entry at a time, so it takes too many tries
    near the end.

    We can use the existing memblock_search() to find the node id for a given
    pfn, which saves some time on bigger systems that have many entries in
    the memblock.memory array (a simplified sketch follows the timings below).

    Here are the timing differences for several machines. In each case with
    the patch less time was spent in __early_pfn_to_nid().

                                3.11-rc5   with patch   difference (%)
                                --------   ----------   --------------
    UV1: 256 nodes  9TB:          411.66       402.47    -9.19 (2.23%)
    UV2: 255 nodes 16TB:         1141.02      1138.12    -2.90 (0.25%)
    UV2:  64 nodes  2TB:          128.15       126.53    -1.62 (1.26%)
    UV2:  32 nodes  2TB:          121.87       121.07    -0.80 (0.66%)
    Time in seconds.
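
    A simplified sketch of the lookup: binary-search a sorted array of memory
    ranges instead of walking them one by one (the array and types are toy
    stand-ins for memblock.memory, not the real memblock_search()):

    #include <stdio.h>

    struct mem_range {                       /* toy memblock.memory entry */
        unsigned long start_pfn, end_pfn;    /* [start_pfn, end_pfn) */
        int nid;
    };

    /* Binary search over ranges sorted by start_pfn: O(log n) per lookup. */
    static int pfn_to_nid(const struct mem_range *r, int n, unsigned long pfn)
    {
        int lo = 0, hi = n - 1;

        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;

            if (pfn < r[mid].start_pfn)
                hi = mid - 1;
            else if (pfn >= r[mid].end_pfn)
                lo = mid + 1;
            else
                return r[mid].nid;
        }
        return -1;                           /* pfn not covered by any range */
    }

    int main(void)
    {
        struct mem_range map[] = {
            { 0x0000, 0x8000, 0 },
            { 0x8000, 0x10000, 1 },
        };

        printf("nid for pfn 0x9000: %d\n", pfn_to_nid(map, 2, 0x9000));
        return 0;
    }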

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Until now we could not offline memory blocks which contain hugepages,
    because a hugepage was considered an unmovable page. With this patch
    series a hugepage becomes movable, so by using hugepage migration we can
    offline such memory blocks.

    What's different from other users of hugepage migration is that we need
    to decompose all the hugepages inside the target memory block into free
    buddy pages after hugepage migration, because otherwise free hugepages
    remaining in the memory block interfere with memory offlining. For this
    reason we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, what this patch does is straightforward: it adds the
    hugepage migration code, that is, hugepage handling in the functions
    which scan over pfns and collect the pages to be migrated, and a hugepage
    allocation function in alloc_migrate_target().

    As for larger hugepages (1GB on x86_64), hot-removing them is not easy
    because they are larger than a memory block, so for now we simply let
    such attempts fail.

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed, so no
    other processor will be accessing its counters; this also simplifies
    synchronization.
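
    A toy model of the split (plain arrays stand in for the per-cpu counters,
    and the function name is illustrative, not the kernel's):

    #include <stdio.h>

    #define NR_CPUS  4
    #define NR_ITEMS 2

    static long cpu_vm_stats[NR_CPUS][NR_ITEMS];    /* per-cpu diffs (toy) */
    static long global_vm_stats[NR_ITEMS];          /* global counters (toy) */

    /*
     * Fold the counters of a CPU that is going offline into the global
     * counters.  Because the CPU is down, nothing else updates its diffs,
     * so no atomics or interrupt disabling are needed on this path.
     */
    static void fold_offline_cpu_vm_stats(int cpu)
    {
        for (int i = 0; i < NR_ITEMS; i++) {
            global_vm_stats[i] += cpu_vm_stats[cpu][i];
            cpu_vm_stats[cpu][i] = 0;
        }
    }

    int main(void)
    {
        cpu_vm_stats[1][0] = 5;
        cpu_vm_stats[1][1] = -2;
        fold_offline_cpu_vm_stats(1);               /* CPU-hotplug (down) path */
        printf("global: %ld %ld\n", global_vm_stats[0], global_vm_stats[1]);
        return 0;
    }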

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the
    slow path. To help the compiler optimize, add the unlikely() macro to the
    ALLOC_NO_WATERMARKS check.

    This patch has no effect right now, because gcc already optimizes this
    properly. But we cannot assume that gcc always gets it right, and nobody
    re-evaluates whether gcc still optimizes properly after their changes;
    for example, this is not optimized properly on v3.10. So adding a
    compiler hint here is reasonable.
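
    For reference, a sketch of what such a hint looks like; unlikely() in the
    kernel is built on GCC's __builtin_expect, but the flag value and the
    function body below are toys:

    #include <stdio.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)  /* same idea as the kernel macro */

    #define ALLOC_NO_WATERMARKS 0x04                /* toy flag value */

    static int allocate_page(unsigned int alloc_flags)
    {
        if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS)) {
            /* rare, high-priority case: ignore the watermarks */
            return 1;
        }
        /* common case: respect the watermarks */
        return 0;
    }

    int main(void)
    {
        printf("%d %d\n", allocate_page(0), allocate_page(ALLOC_NO_WATERMARKS));
        return 0;
    }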

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice
    and activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).
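
    A toy model of the fair per-zone batching described above (the zone
    sizes, the batch proportion and the reset policy are illustrative only):

    #include <stdio.h>

    struct zone {
        const char *name;
        long managed_pages;
        long alloc_batch;       /* remaining fair share for this round */
    };

    /* Give every zone a batch proportional to its size. */
    static void reset_alloc_batches(struct zone *zones, int n)
    {
        for (int i = 0; i < n; i++)
            zones[i].alloc_batch = zones[i].managed_pages / 64;  /* toy proportion */
    }

    /* Allocate one page from the first zone that still has batch left. */
    static struct zone *alloc_page_fair(struct zone *zones, int n)
    {
        for (int i = 0; i < n; i++) {
            if (zones[i].alloc_batch > 0) {
                zones[i].alloc_batch--;
                return &zones[i];
            }
        }
        /* slowpath: every batch is used up; kswapd would be kicked here */
        reset_alloc_batches(zones, n);
        return alloc_page_fair(zones, n);
    }

    int main(void)
    {
        struct zone zones[] = {
            { "Normal", 64 * 1024, 0 },
            { "DMA32", 512 * 1024, 0 },
        };
        long counts[2] = { 0, 0 };

        reset_alloc_batches(zones, 2);
        for (int i = 0; i < 20000; i++)
            counts[alloc_page_fair(zones, 2) - zones]++;
        printf("Normal: %ld pages, DMA32: %ld pages\n", counts[0], counts[1]);
        return 0;
    }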

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Allocations that do not have to respect the watermarks are rare
    high-priority events. Reorder the code such that per-zone dirty limits
    and future checks important only to regular page allocations are ignored
    in these extraordinary situations.

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Tested-by: Zlatko Calusic
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We should not compare loop+1 with the loop end inside the loop body. Just
    duplicate the two lines of code to avoid that check.

    This will help a bit when we have a huge number of pages on a system with
    16TiB of memory.
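
    The general pattern, as a hedged sketch (process() and the array are
    stand-ins; in the kernel the loop body clears page flags and prefetches
    the next struct page):

    #include <stdio.h>

    #define N 8

    static void process(int *item, int *next_hint)
    {
        if (next_hint)              /* e.g. prefetchw(next) in the kernel */
            (void)*next_hint;
        *item += 1;
    }

    /* Before: test "i + 1 < N" on every iteration just for the last element. */
    static void walk_with_branch(int *a)
    {
        for (int i = 0; i < N; i++)
            process(&a[i], i + 1 < N ? &a[i + 1] : NULL);
    }

    /* After: peel the final iteration out of the loop; no per-iteration test. */
    static void walk_peeled(int *a)
    {
        for (int i = 0; i < N - 1; i++)
            process(&a[i], &a[i + 1]);
        process(&a[N - 1], NULL);
    }

    int main(void)
    {
        int a[N] = { 0 }, b[N] = { 0 };

        walk_with_branch(a);
        walk_peeled(b);
        printf("%d %d\n", a[N - 1], b[N - 1]);
        return 0;
    }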

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • In the current code, the value of fallback_migratetype that is printed
    using the mm_page_alloc_extfrag tracepoint, is the value of the
    migratetype *after* it has been set to the preferred migratetype (if the
    ownership was changed). Obviously that wouldn't have been the original
    intent. (We already have a separate 'change_ownership' field to tell
    whether the ownership of the pageblock was changed from the
    fallback_migratetype to the preferred type.)

    The intent of the fallback_migratetype field is to show the migratetype
    from which we borrowed pages in order to satisfy the allocation request.
    So fix the code to print that value correctly.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • The free-page stealing code in __rmqueue_fallback() is somewhat hard to
    follow, and has an incredible amount of subtlety hidden inside!

    First off, there is a minor bug in the reporting of change-of-ownership of
    pageblocks. Under some conditions, we try to move up to
    'pageblock_nr_pages' pages to the preferred allocation list. But we
    change the ownership of that pageblock to the preferred type only if we
    manage to successfully move at least half of that pageblock (or if
    page_group_by_mobility_disabled is set).

    However, the current code ignores the latter part and sets the
    'migratetype' variable to the preferred type, irrespective of whether we
    actually changed the pageblock migratetype of that block or not. So, the
    page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
    'change_ownership' might be shown as 1 when it must have been 0).

    So fixing this involves moving the update of the 'migratetype' variable to
    the right place. But looking closer, we observe that the 'migratetype'
    variable is used subsequently for checks such as "is_migrate_cma()".
    Obviously the intent there is to check if the *fallback* type is
    MIGRATE_CMA, but since we already set the 'migratetype' variable to
    start_migratetype, we end up checking if the *preferred* type is
    MIGRATE_CMA!!

    To make things more interesting, this actually doesn't cause a bug in
    practice, because we never change *anything* if the fallback type is CMA.

    So, restructure the code in such a way that it is trivial to understand
    what is going on, and also fix the above mentioned bug. And while at it,
    also add a comment explaining the subtlety behind the migratetype used in
    the call to expand().

    [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Fix all errors reported by checkpatch and some small spelling mistakes.

    Signed-off-by: Pintu Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     
  • set_pageblock_order() may be called during memory hotplug, so it needs to
    use '__paginginit' instead of '__init'.

    The related warning:

    The function __meminit .free_area_init_node() references
    a function __init .set_pageblock_order().
    If .set_pageblock_order is only used by .free_area_init_node then
    annotate .set_pageblock_order with a matching annotation.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative, and
    the previous calculation here generates an unexpected result. In
    addition, if PAGE_SIZE >= 1MB, the memory size in "numentries" is already
    an integral multiple of 1MB.
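
    A sketch of the guarded rounding (the values and the helper are
    illustrative; only the shape of the guard mirrors the described fix):

    #include <stdio.h>

    #define PAGE_SHIFT 12u                   /* the guard is what matters here */
    #define PAGE_SIZE  (1ul << PAGE_SHIFT)

    /* Round x up to a multiple of step, where step is a power of two. */
    static unsigned long round_up_pow2(unsigned long x, unsigned long step)
    {
        return (x + step - 1) & ~(step - 1);
    }

    int main(void)
    {
        unsigned long numentries = 300000;   /* page count to round up to 1MB */

        /*
         * The old form shifted by "20 - PAGE_SHIFT", which is no longer a
         * valid shift amount once PAGE_SHIFT > 20.  And when PAGE_SIZE is
         * 1MB or more, numentries is already a multiple of 1MB of memory,
         * so the rounding is simply unnecessary.
         */
        if (PAGE_SHIFT < 20)
            numentries = round_up_pow2(numentries, (1ul << 20) / PAGE_SIZE);

        printf("numentries rounded to %lu\n", numentries);
        return 0;
    }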

    Signed-off-by: Jerry Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerry Zhou
     

27 Aug, 2013

1 commit


10 Jul, 2013

4 commits

  • min_free_kbytes is currently updated during memory hotplug (by
    init_per_zone_wmark_min), which is the right thing to do in most cases,
    but it can be unexpected if the admin has increased the value to prevent
    allocation failures and the new min_free_kbytes is then decreased as a
    result of memory hotadd.

    This patch saves the user defined value and allows updating
    min_free_kbytes only if it is higher than the saved one.

    A warning is printed when the new value is ignored.
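
    A minimal sketch of the policy (variable and function names follow the
    description above; the sysctl plumbing and the exact warning text are
    simplified):

    #include <stdio.h>

    static int min_free_kbytes = 2048;
    static int user_min_free_kbytes = -1;   /* -1: never set by the admin */

    /* sysctl write path: remember that the admin chose this value */
    static void set_min_free_kbytes_from_user(int val)
    {
        user_min_free_kbytes = val;
        min_free_kbytes = val;
    }

    /* hotplug path: never lower a value the admin raised */
    static void init_per_zone_wmark_min(int computed)
    {
        if (computed > user_min_free_kbytes)
            min_free_kbytes = computed;
        else
            printf("ignoring computed min_free_kbytes %d; keeping user value %d\n",
                   computed, user_min_free_kbytes);
    }

    int main(void)
    {
        set_min_free_kbytes_from_user(65536);
        init_per_zone_wmark_min(4096);      /* hotadd would have lowered it */
        printf("min_free_kbytes = %d\n", min_free_kbytes);
        return 0;
    }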

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Acked-by: Zhang Yanfei
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
    the order passed. MAX_ORDER is typically 11 and pageblock_order is
    typically 9 on x86. Integer division truncates, so pageblock_order / 2
    is 4. For the first eight iterations, it's guaranteed that
    current_order >= pageblock_order / 2 if it even gets that far!

    So just remove the unlikely(), it's completely bogus.

    Signed-off-by: Zhang Yanfei
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The callers of build_zonelists_node() always pass MAX_NR_ZONES - 1 as the
    zone_type argument, so we can use that value directly in
    build_zonelists_node() and remove the zone_type argument.

    Signed-off-by: Zhang Yanfei
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • When calculating pages in a node, for each zone in that node, we will
    have

    zone_spanned_pages_in_node
    --> get_pfn_range_for_nid
    zone_absent_pages_in_node
    --> get_pfn_range_for_nid

    That is to say, we call get_pfn_range_for_nid() to get the start_pfn and
    end_pfn of the node MAX_NR_ZONES * 2 times, which is totally unnecessary.
    If we call get_pfn_range_for_nid() once beforehand and add two extra
    arguments, node_start_pfn and node_end_pfn, to zone_*_pages_in_node, then
    we can remove the get_pfn_range_for_nid() calls from
    zone_*_pages_in_node.
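
    A sketch of the refactoring (function and argument names mirror the
    description; the bodies are stand-ins):

    #include <stdio.h>

    static void get_pfn_range_for_nid(int nid, unsigned long *start_pfn,
                                      unsigned long *end_pfn)
    {
        /* stand-in: pretend every node spans a fixed-size range */
        *start_pfn = 0x10000ul * nid;
        *end_pfn   = *start_pfn + 0x4000;
    }

    /* After the change the helpers receive the node range as arguments. */
    static unsigned long zone_spanned_pages_in_node(int nid, int zone,
                                                    unsigned long node_start_pfn,
                                                    unsigned long node_end_pfn)
    {
        (void)nid; (void)zone;
        return node_end_pfn - node_start_pfn;   /* stand-in calculation */
    }

    int main(void)
    {
        unsigned long start_pfn, end_pfn, total = 0;
        int nid = 1;

        get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);  /* once per node... */
        for (int zone = 0; zone < 4; zone++)               /* ...not once per helper call */
            total += zone_spanned_pages_in_node(nid, zone, start_pfn, end_pfn);
        printf("total spanned pages (toy): %lu\n", total);
        return 0;
    }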

    [akpm@linux-foundation.org: make definitions more readable]
    Signed-off-by: Zhang Yanfei
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     

04 Jul, 2013

18 commits

  • Introduce helper function mem_init_print_info() to simplify mem_init()
    across different architectures, which also unifies the format and
    information printed.

    Function mem_init_print_info() calculates memory statistics information
    without walking each page, so it should be a little faster on some
    architectures.

    Also introduce another helper get_num_physpages() to kill the global
    variable num_physpages.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • As reported by https://bugzilla.kernel.org/show_bug.cgi?id=53501,
    "MemTotal" from /proc/meminfo means memory pages managed by the buddy
    system (managed_pages), but "MemTotal" from /sys/.../node/nodex/meminfo
    means physical pages present (present_pages) within the NUMA node.
    There's a difference between managed_pages and present_pages due to
    bootmem allocator and reserved pages.

    And Documentation/filesystems/proc.txt says
    MemTotal: Total usable ram (i.e. physical ram minus a few reserved
    bits and the kernel binary code)

    So change /sys/.../node/nodex/meminfo to report available pages within
    the node as "MemTotal".

    Signed-off-by: Jiang Liu
    Reported-by:
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Enhance adjust_managed_page_count() to adjust totalhigh_pages for
    highmem pages, and change code which directly adjusts totalram_pages to
    use adjust_managed_page_count() instead, because it adjusts
    totalram_pages, totalhigh_pages and zone->managed_pages together in a
    safe way.

    Remove inc_totalhigh_pages() and dec_totalhigh_pages() from the
    xen/balloon driver because adjust_managed_page_count() already adjusts
    totalhigh_pages.

    This patch also fixes two bugs:

    1) Enhance the virtio_balloon driver to adjust totalhigh_pages when
    reserving/unreserving pages.
    2) Enhance memory_hotplug.c to adjust totalhigh_pages when hot-removing
    memory.

    We still need to deal with modifications of totalram_pages in file
    arch/powerpc/platforms/pseries/cmm.c, but need help from PPC experts.
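
    A toy model of the consolidated accounting (a pthread mutex stands in for
    the kernel lock and the highmem test is simplified):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct zone { long managed_pages; bool is_highmem; };

    static long totalram_pages, totalhigh_pages;
    static pthread_mutex_t managed_page_count_lock = PTHREAD_MUTEX_INITIALIZER;

    /*
     * One helper adjusts zone->managed_pages, totalram_pages and
     * totalhigh_pages together, so callers cannot update one counter and
     * forget the others (which is what led to the double accounting).
     */
    static void adjust_managed_page_count(struct zone *zone, long count)
    {
        pthread_mutex_lock(&managed_page_count_lock);
        zone->managed_pages += count;
        totalram_pages += count;
        if (zone->is_highmem)
            totalhigh_pages += count;
        pthread_mutex_unlock(&managed_page_count_lock);
    }

    int main(void)
    {
        struct zone highmem = { .managed_pages = 1000, .is_highmem = true };

        totalram_pages = 1000;
        totalhigh_pages = 1000;
        adjust_managed_page_count(&highmem, -256);  /* e.g. a balloon reserving pages */
        printf("ram=%ld high=%ld zone=%ld\n",
               totalram_pages, totalhigh_pages, highmem.managed_pages);
        return 0;
    }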

    [akpm@linux-foundation.org: remove ifdef, per Wanpeng Li, virtio_balloon.c cleanup, per Sergei]
    [akpm@linux-foundation.org: export adjust_managed_page_count() to modules, for drivers/virtio/virtio_balloon.c]
    Signed-off-by: Jiang Liu
    Cc: Chris Metcalf
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Cc: Sergei Shtylyov
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • In order to simplify management of totalram_pages and
    zone->managed_pages, make __free_pages_bootmem() only available at boot
    time. With this change applied, __free_pages_bootmem() will only be
    used by bootmem.c and nobootmem.c at boot time, so mark it as __init.
    Other callers of __free_pages_bootmem() have been converted to use
    free_reserved_page(), which handles totalram_pages and
    zone->managed_pages in a safer way.

    This patch also fixes a bug in free_pagetable() for x86_64, which should
    increase zone->managed_pages instead of zone->present_pages when freeing
    reserved pages.

    And now we have managed_pages_count_lock to protect totalram_pages and
    zone->managed_pages, so remove the redundant ppb_lock lock in
    put_page_bootmem(). This greatly simplifies the locking rules.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Will Deacon
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Currently lock_memory_hotplug()/unlock_memory_hotplug() are used to
    protect totalram_pages and zone->managed_pages. Other than the memory
    hotplug driver, totalram_pages and zone->managed_pages may also be
    modified at runtime by other drivers, such as Xen balloon,
    virtio_balloon etc. For those cases, memory hotplug lock is a little
    too heavy, so introduce a dedicated lock to protect totalram_pages and
    zone->managed_pages.

    Now we have simplified locking rules for totalram_pages and
    zone->managed_pages:

    1) no locking for read accesses because they are unsigned long.
    2) no locking for write accesses at boot time in single-threaded context.
    3) serialize write accesses at runtime by acquiring the dedicated
    managed_page_count_lock.

    Also adjust zone->managed_pages when freeing reserved pages into the
    buddy system, to keep totalram_pages and zone->managed_pages consistent.

    [akpm@linux-foundation.org: don't export adjust_managed_page_count to modules (for now)]
    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
    that all highmem pages will be freed into the buddy system by function
    mem_init(). But that's not always true, some architectures may reserve
    some highmem pages during boot. For example, PPC may allocate highmem
    pages for gigantic HugeTLB pages, and several architectures have code
    which checks the PageReserved flag to exclude highmem pages allocated
    during boot when freeing highmem pages into the buddy system.

    So treat highmem pages in the same way as normal pages, that is to:
    1) reset zone->managed_pages to zero in mem_init().
    2) recalculate managed_pages when freeing pages into the buddy system.

    Signed-off-by: Jiang Liu
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Yinghai Lu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Marek Szyprowski
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use zone->managed_pages instead of zone->present_pages to calculate
    default zonelist order because managed_pages means allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Fix some trivial typos in comments.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Address more review comments from the last round of code review.
    1) Enhance free_reserved_area() to support poisoning freed memory with
    pattern '0'. This could be used to get rid of poison_init_mem()
    on ARM64 (see the sketch below).
    2) A previous patch disabled memory poisoning for initmem on s390
    by mistake, so restore the original behavior.
    3) Remove the redundant PAGE_ALIGN() when calling free_reserved_area().
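
    A hedged sketch of the enhanced helper from item 1 (a userspace stand-in:
    memset and printf replace the kernel's page freeing and logging, and the
    range handling is simplified):

    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096ul

    /* Toy stand-in: "freeing" a reserved page here is a no-op. */
    static void free_reserved_page(void *page) { (void)page; }

    /*
     * poison is a byte pattern; a negative value (e.g. -1) means "do not
     * poison", so pattern '0' can now be requested explicitly.
     */
    static unsigned long free_reserved_area(void *start, void *end, int poison,
                                            const char *s)
    {
        unsigned long pages = 0;
        char *pos;

        for (pos = start; pos + PAGE_SIZE <= (char *)end; pos += PAGE_SIZE, pages++) {
            if ((unsigned int)poison <= 0xFF)
                memset(pos, poison, PAGE_SIZE);
            free_reserved_page(pos);
        }
        if (pages && s)
            printf("Freeing %s memory: %luK\n", s, pages * PAGE_SIZE / 1024);
        return pages;
    }

    int main(void)
    {
        static char initmem[8 * 4096];

        free_reserved_area(initmem, initmem + sizeof(initmem), 0, "unused kernel");
        return 0;
    }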

    Signed-off-by: Jiang Liu
    Cc: Geert Uytterhoeven
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Arnd Bergmann
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Change signature of free_reserved_area() according to Russell King's
    suggestion to fix following build warnings:

    arch/arm/mm/init.c: In function 'mem_init':
    arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
    free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
    ^
    In file included from include/linux/mman.h:4:0,
    from arch/arm/mm/init.c:15:
    include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
    extern unsigned long free_reserved_area(unsigned long start, unsigned long end,

    mm/page_alloc.c: In function 'free_reserved_area':
    >> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
    In file included from arch/mips/include/asm/page.h:49:0,
    from include/linux/mmzone.h:20,
    from include/linux/gfp.h:4,
    from include/linux/mm.h:8,
    from mm/page_alloc.c:18:
    arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
    mm/page_alloc.c: In function 'free_area_init_nodes':
    mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]

    Also address some minor code review comments.

    Signed-off-by: Jiang Liu
    Reported-by: Arnd Bergmann
    Cc: "H. Peter Anvin"
    Cc: "Michael S. Tsirkin"
    Cc:
    Cc: Catalin Marinas
    Cc: Chris Metcalf
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Jeremy Fitzhardinge
    Cc: Jianguo Wu
    Cc: Joonsoo Kim
    Cc: Kamezawa Hiroyuki
    Cc: Konrad Rzeszutek Wilk
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Rusty Russell
    Cc: Tang Chen
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Wen Congyang
    Cc: Will Deacon
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • The memory-remove logic fails to account Total High Memory correctly when
    a memory block which contains highmem is offlined, as shown in the
    example below. The following patch fixes it.

    Before logic memory remove:

    MemTotal: 7603740 kB
    MemFree: 6329612 kB
    Buffers: 94352 kB
    Cached: 872008 kB
    SwapCached: 0 kB
    Active: 626932 kB
    Inactive: 519216 kB
    Active(anon): 180776 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296272 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5704696 kB
    LowTotal: 309068 kB
    LowFree: 624916 kB

    After logic memory remove:

    MemTotal: 7079452 kB
    MemFree: 5805976 kB
    Buffers: 94372 kB
    Cached: 872000 kB
    SwapCached: 0 kB
    Active: 626936 kB
    Inactive: 519236 kB
    Active(anon): 180780 kB
    Inactive(anon): 222944 kB
    Active(file): 446156 kB
    Inactive(file): 296292 kB
    Unevictable: 0 kB
    Mlocked: 0 kB
    HighTotal: 7294672 kB
    HighFree: 5181024 kB
    LowTotal: 4294752076 kB
    LowFree: 624952 kB

    [mhocko@suse.cz: fix CONFIG_HIGHMEM=n build]
    Signed-off-by: Wanpeng Li
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: [2.6.24+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • - check the length of the procfs data before copying it into a
    fixed-size array (see the sketch below).

    - when __parse_numa_zonelist_order() fails, save the error code for
    return.

    - 'char*' --> 'char *' coding style fix
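
    A hedged sketch of the first fix (the buffer size, names and accepted
    strings are illustrative; the real handler parses the numa_zonelist_order
    sysctl string):

    #include <stdio.h>
    #include <string.h>

    #define NUMA_ZONELIST_ORDER_LEN 16

    static char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "default";

    /* Returns 0 on success, -1 on error (stand-in for -EINVAL). */
    static int zonelist_order_write(const char *user_buf, size_t len)
    {
        char saved_string[NUMA_ZONELIST_ORDER_LEN];

        /* Reject oversized input before touching the fixed-size buffer. */
        if (len >= sizeof(saved_string))
            return -1;

        strcpy(saved_string, numa_zonelist_order);      /* keep a copy to restore */
        memcpy(numa_zonelist_order, user_buf, len);
        numa_zonelist_order[len] = '\0';

        if (strcmp(numa_zonelist_order, "node") && strcmp(numa_zonelist_order, "zone")
            && strcmp(numa_zonelist_order, "default")) {
            /* parse failed: restore the old string and return the error */
            strcpy(numa_zonelist_order, saved_string);
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        printf("%d\n", zonelist_order_write("node", 4));
        printf("%d\n", zonelist_order_write("this string is far too long", 28));
        return 0;
    }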

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • When memory hotplug is triggered, we call pageset_init() on per-cpu
    pagesets which both contain pages and are in use, causing both the
    leakage of those pages and (potentially) bad behaviour if a page is
    allocated from a pageset while it is being cleared.

    Avoid this by factoring pageset_set_high_and_batch() (which contains all
    the logic needed to set a pageset's ->high and ->batch irrespective of
    system state) out of zone_pageset_init(), and using the new
    pageset_set_high_and_batch() instead of zone_pageset_init() in
    zone_pcp_update().
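
    A toy sketch of the factoring (the percpu_pagelist_fraction handling and
    the batch/high formulas are simplified stand-ins):

    #include <stdio.h>

    struct per_cpu_pageset { int count, high, batch; };
    struct zone { long managed_pages; struct per_cpu_pageset pcp; };

    static int percpu_pagelist_fraction;    /* 0 means "use the default sizing" */

    /* Only adjusts ->high and ->batch; never re-initialises a live pageset. */
    static void pageset_set_high_and_batch(struct zone *zone,
                                           struct per_cpu_pageset *p)
    {
        if (percpu_pagelist_fraction) {
            p->high = zone->managed_pages / percpu_pagelist_fraction;
            p->batch = p->high / 4;
        } else {
            p->batch = zone->managed_pages / 1024;      /* toy default */
            p->high = 6 * p->batch;
        }
    }

    /* Boot / new-zone path: full init, then sizing. */
    static void zone_pageset_init(struct zone *zone)
    {
        zone->pcp = (struct per_cpu_pageset){ 0 };
        pageset_set_high_and_batch(zone, &zone->pcp);
    }

    /* Hotplug path: the pageset may hold pages, so only resize it. */
    static void zone_pcp_update(struct zone *zone)
    {
        pageset_set_high_and_batch(zone, &zone->pcp);
    }

    int main(void)
    {
        struct zone z = { .managed_pages = 262144 };

        zone_pageset_init(&z);
        z.pcp.count = 17;               /* pages sitting in the pcp list */
        z.managed_pages += 65536;       /* memory hot-add grew the zone */
        zone_pcp_update(&z);            /* resized without dropping the 17 pages */
        printf("count=%d high=%d batch=%d\n", z.pcp.count, z.pcp.high, z.pcp.batch);
        return 0;
    }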

    Signed-off-by: Cody P Schafer
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Previously, zone_pcp_update() called pageset_set_batch() directly,
    essentially assuming that percpu_pagelist_fraction == 0.

    Correct this by calling zone_pageset_init(), which chooses the
    appropriate ->batch and ->high calculations.

    Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • Signed-off-by: Cody P Schafer
    Cc: Gilad Ben-Yossef
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer