Eric Lee / smarc-fsl-linux-kernel

16 Dec, 2009

40 commits

7b5115940 rmap: simplify try_to_unmap_file()

Just simplify the code when `mlocked' is true.

Signed-off-by: Huang Shijie
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Huang Shijie
2009-12-16 00:53:16 +0800
8051be5e6 rmap: fix the comment for try_to_unmap_anon

Fix the comment for try_to_unmap_anon() with the new arguments.

Signed-off-by: Huang Shijie
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Huang Shijie
2009-12-16 00:53:16 +0800
6aceb53be mm/vmscan: change comment generic_file_write to __generic_file_aio_write

Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
cleanups") removed generic_file_write() in filemap. Change the comment in
vmscan pageout() to __generic_file_aio_write().

Signed-off-by: Vincent Li
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vincent Li
2009-12-16 00:53:16 +0800
d4906e1aa swap: rework map_swap_page() again

Seems that page_io.c doesn't really need to know that page_private(page)
is the swp_entry 'val'. Rework map_swap_page() to do what its name says
and map a page to a page offset in the swap space.

The only other caller of map_swap_page() is internal to mm/swapfile.c and
it does want to map a swap entry to the 'sector'. So rename
map_swap_page() to map_swap_entry(), make it 'static' and implement
map_swap_page() as a wrapper around that.

Signed-off-by: Lee Schermerhorn
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:16 +0800
7509765a2 swap_info: reorder its fields

Reorder (and comment) the fields of swap_info_struct, to make better
use of its cachelines: it's good for swap_duplicate() in particular
if unsigned int max and swap_map are near the start.

Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:16 +0800
aaa468653 swap_info: note SWAP_MAP_SHMEM

While we're fiddling with the swap_map values, let's assign a particular
value to shmem/tmpfs swap pages: their swap counts are never incremented,
and it helps swapoff's try_to_unuse() a little if it can immediately
distinguish those pages from process pages.

Since we've no use for SWAP_MAP_BAD | COUNT_CONTINUED,
we might as well use that 0xbf value for SWAP_MAP_SHMEM.
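
The arithmetic behind that 0xbf value can be checked in a few lines. This is
a minimal sketch, assuming SWAP_MAP_BAD is 0x3f and COUNT_CONTINUED is 0x80
(neither value is stated in this message; they are assumptions chosen to be
consistent with the 0xbf quoted above). The helper is_shmem_entry() is
hypothetical and only illustrates the "immediately distinguish" test.

```c
/* Assumed values: only the 0xbf result is quoted in the commit message. */
#define SWAP_MAP_BAD     0x3f  /* assumed marker for a bad swap page */
#define COUNT_CONTINUED  0x80  /* assumed flag: count continues elsewhere */
#define SWAP_MAP_SHMEM   (SWAP_MAP_BAD | COUNT_CONTINUED)

/* Hypothetical helper: returns 1 if a swap_map byte is the shmem marker. */
static int is_shmem_entry(unsigned char map_value)
{
	return map_value == SWAP_MAP_SHMEM;
}
```

Because the combination is otherwise meaningless, a single byte comparison
is enough to tell a shmem/tmpfs entry from a process page in this sketch.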

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:16 +0800
570a335b8 swap_info: swap count continuations

Swap is duplicated (reference count incremented by one) whenever the same
swap page is inserted into another mm (when forking finds a swap entry in
place of a pte, or when reclaim unmaps a pte to insert the swap entry).

swap_info_struct's vmalloc'ed swap_map is the array of these reference
counts: but what happens when the unsigned short (or unsigned char since
the preceding patch) is full? (and its high bit is kept for a cache flag)

We then lose track of it, never freeing, leaving it in use until swapoff:
at which point we _hope_ that a single pass will have found all instances,
assume there are no more, and will lose user data if we're wrong.

Swapping of KSM pages has not yet been enabled; but it is implemented,
and makes it very easy for a user to overflow the maximum swap count:
possible with ordinary process pages, but unlikely, even when pid_max
has been raised from PID_MAX_DEFAULT.

This patch implements swap count continuations: when the count overflows,
a continuation page is allocated and linked to the original vmalloc'ed
map page, and this is used to hold the continuation counts for that entry
and its neighbours. These continuation pages are seldom referenced:
the common paths all work on the original swap_map, only referring to
a continuation page when the low "digit" of a count is incremented or
decremented through SWAP_MAP_MAX.
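
The "digit" idea can be modelled in user space. The following is a
deliberately simplified sketch, not the kernel's data layout: it assumes
SWAP_MAP_MAX is 0x3e, keeps one low digit per entry, and folds every higher
digit into a single counter that stands in for the continuation page. All
names ending in _sketch are hypothetical.

```c
#define SWAP_MAP_MAX_SKETCH 0x3e  /* assumed maximum of the low "digit" */

struct swap_count_sketch {
	unsigned char low;   /* low digit, as kept in the main swap_map */
	unsigned long cont;  /* higher digits, standing in for a continuation page */
};

/* Take one more reference; carry into the continuation on overflow. */
static void dup_sketch(struct swap_count_sketch *c)
{
	if (c->low == SWAP_MAP_MAX_SKETCH) {
		c->low = 0;
		c->cont++;
	} else {
		c->low++;
	}
}

/* Drop one reference; borrow from the continuation when the low digit is 0. */
static void free_sketch(struct swap_count_sketch *c)
{
	if (c->low > 0) {
		c->low--;
	} else if (c->cont > 0) {
		c->low = SWAP_MAP_MAX_SKETCH;
		c->cont--;
	}
}

/* Full reference count represented by both parts. */
static unsigned long total_sketch(const struct swap_count_sketch *c)
{
	return c->cont * (SWAP_MAP_MAX_SKETCH + 1UL) + c->low;
}
```

Only the carry and borrow branches touch the continuation counter, which
mirrors the point made above: the common paths work on the low digit alone.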

Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
8d69aaee8 swap_info: swap_map of chars not shorts

Halve the vmalloc'ed swap_map array from unsigned shorts to unsigned
chars: it's still very unusual to reach a swap count of 126, and the
next patch allows it to be extended indefinitely.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
253d553ba swap_info: SWAP_HAS_CACHE cleanups

Though swap_count() is useful, I'm finding that swap_has_cache() and
encode_swapmap() obscure what happens in the swap_map entry, just at
those points where I need to understand it. Remove them, and pass
more usable "usage" values to scan_swap_map(), swap_entry_free() and
__swap_duplicate(), instead of the SWAP_MAP and SWAP_CACHE enum.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
73c34b6ac swap_info: miscellaneous minor cleanups

Move CONFIG_HIBERNATION's swapdev_block() into the main CONFIG_HIBERNATION
block, remove extraneous whitespace and return, fix typo in a comment.

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
9625a5f28 swap_info: include first_swap_extent

Make better use of the space by folding first swap_extent into its
swap_info_struct, instead of just the list_head: swap partitions need
only that one, and for others it's used as a circular list anyway.

[jirislaby@gmail.com: fix crash on double swapon]
Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Rik van Riel
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
efa90a981 swap_info: change to array of pointers

The swap_info_struct is only 76 or 104 bytes, but it does seem wrong
to reserve an array of about 30 of them in bss, when most people will
want only one. Change swap_info[] to an array of pointers.

That does need a "type" field in the structure: pack it as a char with
next type and short prio (aha, char is unsigned by default on PowerPC).
Use the (admittedly peculiar) name "type" throughout for this index.

/proc/swaps does not take swap_lock: I wouldn't want it to, but do take
care with barriers when adding a new item to the array (never removed).
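
The publish-then-read pattern described here can be sketched in user space
with C11 atomics. This is an illustration under assumptions, not the code in
mm/swapfile.c: the names ending in _sketch are hypothetical, release/acquire
ordering stands in for the kernel's barriers, and writers are assumed to be
serialised by a lock that is not shown.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_SWAPFILES_SKETCH 30  /* "about 30", per the message above */

struct swap_info_sketch {
	int type;  /* index of this entry in the pointer array */
	int prio;
};

static _Atomic(struct swap_info_sketch *) swap_info_ptrs_sketch[MAX_SWAPFILES_SKETCH];
static atomic_int nr_swapfiles_sketch;

/* Writer: initialise the item fully, then publish pointer and count. */
static struct swap_info_sketch *add_sketch(int prio)
{
	int type = atomic_load_explicit(&nr_swapfiles_sketch, memory_order_relaxed);
	struct swap_info_sketch *p;

	if (type >= MAX_SWAPFILES_SKETCH)
		return NULL;
	p = malloc(sizeof(*p));
	if (!p)
		return NULL;
	p->type = type;
	p->prio = prio;
	/* Release: the fields above are visible before the pointer is. */
	atomic_store_explicit(&swap_info_ptrs_sketch[type], p, memory_order_release);
	atomic_store_explicit(&nr_swapfiles_sketch, type + 1, memory_order_release);
	return p;
}

/* Lockless reader: the count bounds the index, acquire pairs with release. */
static struct swap_info_sketch *get_sketch(int type)
{
	if (type < 0 ||
	    type >= atomic_load_explicit(&nr_swapfiles_sketch, memory_order_acquire))
		return NULL;
	return atomic_load_explicit(&swap_info_ptrs_sketch[type], memory_order_acquire);
}
```

Since items are never removed from the array, a reader that sees a non-NULL
pointer can keep using it without taking the lock.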

Signed-off-by: Hugh Dickins
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:15 +0800
f29ad6a99 swap_info: private to swapfile.c

The swap_info_struct is mostly private to mm/swapfile.c, with only
one other in-tree user: get_swap_bio(). Adjust its interface to
map_swap_page(), so that we can then remove get_swap_info_struct().

But there is a popular user out-of-tree, TuxOnIce: so leave the
declaration of swap_info_struct in linux/swap.h.

Signed-off-by: Hugh Dickins
Cc: Nigel Cunningham
Cc: KAMEZAWA Hiroyuki
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2009-12-16 00:53:13 +0800
976d6dfbb vmalloc(): adjust gfp mask passed on nested vmalloc() invocation

- avoid wasting more precious resources (DMA or DMA32 pools), when
being called through vmalloc_32{,_user}()
- explicitly allow using high memory here even if the outer allocation
request doesn't allow it

Signed-off-by: Jan Beulich
Acked-by: Hugh Dickins
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Beulich
2009-12-16 00:53:13 +0800
bad44b5be mm: add gfp flags for NODEMASK_ALLOC slab allocations

Objects passed to NODEMASK_ALLOC() are relatively small in size and are
backed by slab caches that are not of large order, traditionally never
greater than PAGE_ALLOC_COSTLY_ORDER.

Thus, using GFP_KERNEL for these allocations on large machines when
CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
the allocation attempt, each time invoking either direct reclaim or the oom
killer.

This is of particular interest when using NODEMASK_ALLOC() from a
mempolicy context (either directly in mm/mempolicy.c or the mempolicy
constrained hugetlb allocations) since the oom killer always kills current
when allocations are constrained by mempolicies. So for all present use
cases in the kernel, current would end up being oom killed when direct
reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
current would have sacrificed itself upon returning.

This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
All current use cases either directly from hugetlb code or indirectly via
NODEMASK_SCRATCH() union __GFP_NORETRY to avoid direct reclaim and the oom
killer when the slab allocator needs to allocate additional pages.
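
The shape of such a macro can be sketched in user space. This is an
illustration under stated assumptions, not the kernel header: every name
containing SKETCH or sketch is hypothetical, the flag values are invented,
and malloc() stands in for kmalloc(), so the flags are accepted but cannot
be honoured.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_NODES_SHIFT 10          /* hypothetical "large machine" build */
#define SKETCH_MAX_NODES   (1 << SKETCH_NODES_SHIFT)
#define SKETCH_GFP_KERNEL  0x1u        /* invented flag values */
#define SKETCH_GFP_NORETRY 0x2u

typedef struct {
	unsigned char bits[SKETCH_MAX_NODES / 8];
} sketch_nodemask_t;

/* Stand-in for kmalloc(); the gfp flags are ignored in user space. */
static void *sketch_kmalloc(size_t size, unsigned int gfp_flags)
{
	(void)gfp_flags;
	return malloc(size);
}

#if SKETCH_NODES_SHIFT > 8
/* Large configuration: allocate dynamically, passing the caller's flags. */
#define SKETCH_NODEMASK_ALLOC(type, name, gfp_flags) \
	type *name = sketch_kmalloc(sizeof(*name), gfp_flags)
#define SKETCH_NODEMASK_FREE(name) free(name)
#else
/* Small configuration: use the stack; the flags parameter is a nop. */
#define SKETCH_NODEMASK_ALLOC(type, name, gfp_flags) \
	type _##name, *name = &_##name
#define SKETCH_NODEMASK_FREE(name) do { } while (0)
#endif

/* Returns 0 on success, -1 if the allocation failed. */
static int sketch_use_nodemask(void)
{
	SKETCH_NODEMASK_ALLOC(sketch_nodemask_t, mask,
			      SKETCH_GFP_KERNEL | SKETCH_GFP_NORETRY);

	if (!mask)
		return -1;  /* the caller must handle failure */
	memset(mask, 0, sizeof(*mask));
	SKETCH_NODEMASK_FREE(mask);
	return 0;
}
```

The NULL check in sketch_use_nodemask() is the kind of failure handling that
every caller needs once the allocation is allowed to fail.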

The side-effect of this change is that all current use cases of either
NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
when the allocation fails (never for CONFIG_NODES_SHIFT
Acked-by: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Andi Kleen
Cc: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-12-16 00:53:13 +0800
39da08cb0 hugetlb: offload per node attribute registrations

Offload the registration and unregistration of per node hstate sysfs
attributes to a worker thread rather than attempt the
allocation/attachment or detachment/freeing of the attributes in the
context of the memory hotplug handler.

I don't know that this is absolutely required, but the registration can
sleep in allocations and other mem hot plug handlers do it this way. If
it turns out this is NOT required, we can drop this patch.

N.B., Only tested build, boot, libhugetlbfs regression.
i.e., no memory hotplug testing.

Signed-off-by: Lee Schermerhorn
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:13 +0800
4faf8d950 hugetlb: handle memory hot-plug events

Register per node hstate attributes only for nodes with memory. As
suggested by David Rientjes.

With Memory Hotplug, memory can be added to a memoryless node and a node
with memory can become memoryless. Therefore, add a memory on/off-line
notifier callback to [un]register a node's attributes on transition
to/from memoryless state.

N.B., Only tested build, boot, libhugetlbfs regression.
i.e., no memory hotplug testing.

Signed-off-by: Lee Schermerhorn
Reviewed-by: Andi Kleen
Acked-by: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:13 +0800
8fe23e057 mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined

When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
there are no present pages left.

In such a situation, kswapd must also be stopped since it has nothing left
to do.

Signed-off-by: David Rientjes
Signed-off-by: Lee Schermerhorn
Cc: Christoph Lameter
Cc: Yasunori Goto
Cc: Mel Gorman
Cc: Rafael J. Wysocki
Cc: Rik van Riel
Cc: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Andi Kleen
Cc: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-12-16 00:53:13 +0800
9b5e5d0fd hugetlb: use only nodes with memory for huge pages

Register per node hstate sysfs attributes only for nodes with memory.
Global replacement of "all online nodes" with "all nodes with memory" in
mm/hugetlb.c. Suggested by David Rientjes.

A subsequent patch will handle adding/removing of per node hstate sysfs
attributes when nodes transition to/from memoryless state via memory
hotplug.

NOTE: this patch has not been tested with memoryless nodes.

Signed-off-by: Lee Schermerhorn
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Acked-by: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:13 +0800
267b4c281 hugetlb: update hugetlb documentation for NUMA controls

Update the kernel huge tlb documentation to describe the numa memory
policy based huge page management. Additionally, the patch includes a fair
amount of rework to improve consistency, eliminate duplication and set the
context for documenting the memory policy interaction.

Signed-off-by: Lee Schermerhorn
Acked-by: David Rientjes
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Lee Schermerhorn
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
9a3052306 hugetlb: add per node hstate attributes

Add the per huge page size control/query attributes to the per node
sysdevs:

/sys/devices/system/node/node/hugepages/hugepages-/
nr_hugepages - r/w
free_hugepages - r/o
surplus_hugepages - r/o

The patch attempts to re-use/share as much of the existing global hstate
attribute initialization and handling, and the "nodes_allowed" constraint
processing as possible.

Calling set_max_huge_pages() with no node indicates a change to global
hstate parameters. In this case, any non-default task mempolicy will be
used to generate the nodes_allowed mask. A valid node id indicates an
update to that node's hstate parameters, and the count argument specifies
the target count for the specified node. From this info, we compute the
target global count for the hstate and construct a nodes_allowed node mask
containing only the specified node.
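
The target-count computation in that paragraph can be sketched as follows.
This is a minimal model, not the function in mm/hugetlb.c: the struct, the
four-node limit, and the use of a plain bitmask (with 0 meaning "no per node
restriction, derive the mask elsewhere") are assumptions made for
illustration, and all names ending in _sketch are hypothetical.

```c
#define NUMA_NO_NODE_SKETCH (-1)
#define MAX_NODES_SKETCH    4

struct hstate_sketch {
	unsigned long nr_huge_pages;                        /* global count */
	unsigned long nr_huge_pages_node[MAX_NODES_SKETCH]; /* per node counts */
};

/*
 * Returns the global target for the pool. For a valid node id, "count" is
 * the target for that node only, so the pages held by all other nodes are
 * added back in, and *nodes_allowed is set to just that node.
 */
static unsigned long global_target_sketch(const struct hstate_sketch *h,
					  int nid, unsigned long count,
					  unsigned long *nodes_allowed)
{
	if (nid == NUMA_NO_NODE_SKETCH) {
		*nodes_allowed = 0;  /* no per node restriction in this sketch */
		return count;
	}
	*nodes_allowed = 1UL << nid;
	return count + h->nr_huge_pages - h->nr_huge_pages_node[nid];
}
```

For example, with 8 pages on each of nodes 0, 1 and 3 and none on node 2,
asking for 16 pages on node 2 gives a global target of 16 + 24 - 0 = 40.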

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./ ../ free_hugepages nr_hugepages surplus_hugepages

Starting from:
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 0
Node 2 HugePages_Free: 0
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
numactl -m 2 hugeadm --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 16
Node 2 HugePages_Free: 16
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
Node 1 HugePages_Total: 0
Node 1 HugePages_Free: 0
Node 1 HugePages_Surp: 0
Node 2 HugePages_Total: 8
Node 2 HugePages_Free: 8
Node 2 HugePages_Surp: 0
Node 3 HugePages_Total: 0
Node 3 HugePages_Free: 0
Node 3 HugePages_Surp: 0

Signed-off-by: Lee Schermerhorn
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
4e25b2576 hugetlb: add generic definition of NUMA_NO_NODE

Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
to generic header 'linux/numa.h' for use in generic code. NUMA_NO_NODE
replaces bare '-1' where it's used in this series to indicate "no node id
specified". Ultimately, it can be used to replace the -1 elsewhere where
it is used similarly.

Signed-off-by: Lee Schermerhorn
Acked-by: David Rientjes
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
06808b082 hugetlb: derive huge pages nodes allowed from task mempolicy

This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
is produced. This will cause the hugetlb subsystem to use
node_online_map as the "nodes_allowed". This preserves the
behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
a nodemask with the single preferred node will be produced.
"local" policy will NOT track any internode migrations of the
task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
will be used.
* Other than to inform the construction of the nodes_allowed node
mask, the actual mempolicy mode is ignored. That is, all modes
behave like interleave over the resulting nodes_allowed mask
with no "fallback".
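
The derivation rules listed above can be sketched as a small function. This
is an illustration under assumptions, not the kernel implementation: the
enum, the struct, the bitmask representation, and the convention that a
return value of 0 means "no mask produced, use the default" are all
hypothetical.

```c
#include <stddef.h>

enum mode_sketch {
	MODE_DEFAULT_SKETCH,
	MODE_PREFERRED_SKETCH,
	MODE_BIND_SKETCH,
	MODE_INTERLEAVE_SKETCH
};

struct mempolicy_sketch {
	enum mode_sketch mode;
	int preferred_node;   /* for "preferred"; negative means "local" */
	unsigned long nodes;  /* for "bind" and "interleave" */
};

/*
 * Returns 0 when no mask is produced (default policy: the caller falls
 * back to all online nodes), or 1 after filling *mask.
 */
static int nodes_allowed_sketch(const struct mempolicy_sketch *pol,
				int local_node, unsigned long *mask)
{
	if (!pol)
		return 0;

	switch (pol->mode) {
	case MODE_PREFERRED_SKETCH:
		/* Single node; "local" is resolved once, at call time. */
		*mask = 1UL << (pol->preferred_node < 0 ? local_node
							: pol->preferred_node);
		return 1;
	case MODE_BIND_SKETCH:
	case MODE_INTERLEAVE_SKETCH:
		*mask = pol->nodes;
		return 1;
	case MODE_DEFAULT_SKETCH:
	default:
		return 0;
	}
}
```

Note that the mode is used only to build the mask; nothing in the result
records whether the policy was "bind" or "interleave", which matches the
last rule in the list.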

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

Node 0 HugePages_Total: 0
Node 1 HugePages_Total: 0
Node 2 HugePages_Total: 0
Node 3 HugePages_Total: 0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

sysctl vm.nr_hugepages[_mempolicy]=32

yields:

Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 8
Node 3 HugePages_Total: 8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes. So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

Node 0 HugePages_Total: 8
Node 1 HugePages_Total: 8
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

Node 0 HugePages_Total: 4
Node 1 HugePages_Total: 4
Node 2 HugePages_Total: 16
Node 3 HugePages_Total: 8

The 8 huge pages freed were balanced over nodes 0 and 1.
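
The three examples above can be reproduced with a small round-robin model.
This is a sketch under assumptions, not the hugetlb code: it assumes four
nodes, a plain bitmask for the allowed nodes, and that every allocation on
an allowed node succeeds. All names containing sketch or SKETCH are
hypothetical.

```c
#define SKETCH_NR_NODES 4

/* Next allowed node after nid, wrapping at the end; allowed must be non-zero. */
static int sketch_next_allowed(int nid, unsigned int allowed)
{
	do {
		nid = (nid + 1) % SKETCH_NR_NODES;
	} while (!(allowed & (1u << nid)));
	return nid;
}

/* Grow or shrink the pool to "target" pages, touching only allowed nodes. */
static void sketch_set_pool(unsigned long pages[SKETCH_NR_NODES],
			    unsigned int allowed, unsigned long target)
{
	unsigned long total = 0, avail = 0;
	int nid;

	allowed &= (1u << SKETCH_NR_NODES) - 1;
	if (!allowed)
		return;
	for (nid = 0; nid < SKETCH_NR_NODES; nid++) {
		total += pages[nid];
		if (allowed & (1u << nid))
			avail += pages[nid];  /* pages that may be freed */
	}

	nid = SKETCH_NR_NODES - 1;  /* first step lands on the lowest allowed node */
	while (total < target) {
		nid = sketch_next_allowed(nid, allowed);
		pages[nid]++;
		total++;
	}
	while (total > target && avail > 0) {
		nid = sketch_next_allowed(nid, allowed);
		if (pages[nid] == 0)
			continue;  /* nothing to free here, try the next node */
		pages[nid]--;
		total--;
		avail--;
	}
}
```

Calling sketch_set_pool() with all four nodes and a target of 32, then with
node 2 only and a target of 40, then with nodes 0 and 1 and a target of 32,
walks through the same per node totals as the three listings above.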

[rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes
Signed-off-by: Lee Schermerhorn
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
c1e6c8d07 hugetlb: factor init_nodemask_of_node()

Factor init_nodemask_of_node() out of the nodemask_of_node() macro.

This will be used to populate the huge pages "nodes_allowed" nodemask for
a single node when basing nodes_allowed on a preferred/local mempolicy or
when a persistent huge page pool page count is modified via a per node
sysfs attribute.

Signed-off-by: Lee Schermerhorn
Acked-by: Mel Gorman
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Acked-by: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
6ae11b278 hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functions

In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions. For now, pass
NULL to indicate default behavior--i.e., use node_online_map. A
subsequent patch will derive a non-default mask from the controlling
task's numa mempolicy.

Note that this method of updating the global hstate nr_hugepages under the
constraint of a nodemask simplifies keeping the global state
consistent--especially the number of persistent and surplus pages relative
to reservations and overcommit limits. There are undoubtedly other ways
to do this, but this works for both interfaces: mempolicy and per node
attributes.

[rientjes@google.com: fix HIGHMEM compile error]
Signed-off-by: Lee Schermerhorn
Reviewed-by: Mel Gorman
Acked-by: David Rientjes
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Andi Kleen
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
9a76db099 hugetlb: rework hstate_next_node_* functions

Modify the hstate_next_node* functions to allow them to be called to
obtain the "start_nid". Then, whereas prior to this patch we
unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
we successfully allocated/freed a huge page on the node, now we only call
these functions on failure to alloc/free to advance to next allowed node.

Factor out the next_node_allowed() function to handle wrap at end of
node_online_map. In this version, the allowed nodes include all of the
online nodes.

Signed-off-by: Lee Schermerhorn
Reviewed-by: Mel Gorman
Acked-by: David Rientjes
Reviewed-by: Andi Kleen
Cc: KAMEZAWA Hiroyuki
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Andi Kleen
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2009-12-16 00:53:12 +0800
4e7b8a6ce nodemask: make NODEMASK_ALLOC more general

This is a series of patches to provide control over the location of the
allocation and freeing of persistent huge pages on a NUMA platform.
Please consider for merging into mmotm.

This series uses two mechanisms to constrain the nodes from which
persistent huge pages are allocated: 1) the task NUMA mempolicy of the
task modifying a new sysctl "nr_hugepages_mempolicy", based on a
suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
attributes have been added [in V4] to each node system device under:

/sys/devices/system/node/node[0-9]*/hugepages

The per node attributes allow direct assignment of a huge page count on a
specific node, regardless of the task's mempolicy or cpuset constraints.

This patch:

NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
It's perfectly reasonable to use this macro to allocate a nodemask_t,
which is anonymous, either dynamically or on the stack depending on
NODES_SHIFT.

Signed-off-by: David Rientjes
Signed-off-by: Lee Schermerhorn
Acked-by: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: Randy Dunlap
Cc: Nishanth Aravamudan
Cc: Andi Kleen
Cc: David Rientjes
Cc: Adam Litke
Cc: Andy Whitcroft
Cc: Eric Whitney
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2009-12-16 00:53:12 +0800
6d9c285a6 mm: move inc_zone_page_state(NR_ISOLATED) to just isolated place

Christoph pointed out that inc_zone_page_state(NR_ISOLATED) should be placed
right after isolate_page().

This patch does it.

Reviewed-by: Christoph Lameter
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-12-16 00:53:12 +0800
ee32398fd /dev/mem: remove redundant parameter from do_write_kmem()

Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Cc: Avi Kivity
Cc: Greg Kroah-Hartman
Cc: Johannes Berg
Cc: Marcelo Tosatti
Cc: Mark Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:12 +0800
80ad89a0c /dev/mem: remove the "written" variable in write_kmem()

Also rename "len" to "sz". No behavior change.

Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Cc: Avi Kivity
Cc: Greg Kroah-Hartman
Cc: Johannes Berg
Cc: Marcelo Tosatti
Cc: Mark Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:11 +0800
7fabaddd0 /dev/mem: make size_inside_page() logic straight

Also convert more size_inside_page() users.

Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Cc: Avi Kivity
Cc: Greg Kroah-Hartman
Cc: Johannes Berg
Cc: Marcelo Tosatti
Cc: Mark Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:11 +0800
fa29e97bb /dev/mem: cleanup unxlate_dev_mem_ptr() calls

No behaviour change.

[akpm@linux-foundation.org: cleanuplets]
[akpm@linux-foundation.org: remove unused `ret']
Signed-off-by: Wu Fengguang
Acked-by: Andi Kleen
Cc: Marcelo Tosatti
Cc: Greg Kroah-Hartman
Cc: Mark Brown
Cc: Johannes Berg
Cc: Avi Kivity
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:11 +0800
f222318e9 /dev/mem: introduce size_inside_page()

Introduce size_inside_page() to replace duplicate /dev/mem code.

Also apply it to /dev/kmem, whose alignment logic was buggy.
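
What such a helper computes can be sketched in user space: the number of
bytes that can be copied starting at a given address without crossing a page
boundary. This is a hedged sketch rather than the code in the patch; the
4096-byte page size and the _sketch names are assumptions.

```c
#define PAGE_SIZE_SKETCH 4096UL  /* assumed page size */

/*
 * Returns the largest chunk, no bigger than "size", that starts at
 * "start" and stays inside the page containing "start".
 */
static unsigned long size_inside_page_sketch(unsigned long start,
					     unsigned long size)
{
	/* Bytes remaining from "start" to the end of its page. */
	unsigned long sz = PAGE_SIZE_SKETCH - (start & (PAGE_SIZE_SKETCH - 1));

	return sz < size ? sz : size;
}
```

A read or write loop can call this once per iteration, so the alignment
arithmetic lives in a single place instead of being repeated per caller.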

Signed-off-by: Wu Fengguang
Acked-by: Andi Kleen
Cc: Marcelo Tosatti
Cc: Greg Kroah-Hartman
Cc: Mark Brown
Cc: Johannes Berg
Cc: Avi Kivity
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:11 +0800
4ea2f43f2 /dev/mem: remove redundant test on len

The len test in write_kmem() is always true, so can be reduced.

Signed-off-by: Wu Fengguang
Acked-by: Andi Kleen
Cc: Marcelo Tosatti
Cc: Greg Kroah-Hartman
Cc: Mark Brown
Cc: Johannes Berg
Cc: Avi Kivity
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-12-16 00:53:11 +0800
659ace584 mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()

On ia64, the following test program exits abnormally, because the glibc
thread library calls abort().

========================================================
(gdb) bt
#0 0xa000000000010620 in __kernel_syscall_via_break ()
#1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2 0x2000000000324090 in abort () from /lib/libc.so.6.1
#3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================

The cause is that glibc calls munmap() at thread exit time to free the
stack, and it assumes munmap() never fails. However, munmap() often has to
split a vma, and when the process already has many mappings that split
fails with -ENOMEM.

That is wrong, because unmapping a stack never increases the map count in
the end. The max map count is exceeded only temporarily, and an internal,
temporary excess should not cause ENOMEM.

This patch does it.

test_max_mapcount.c
==================================================================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

#define THREAD_NUM 30000
#define MAL_SIZE (8*1024*1024)

/* Allocates a large block so that the thread adds mappings, then waits. */
void *wait_thread(void *args)
{
	void *addr;

	addr = malloc(MAL_SIZE);
	sleep(10);

	return NULL;
}

/* Only waits, keeping its stack mapped for longer. */
void *wait_thread2(void *args)
{
	sleep(60);

	return NULL;
}

int main(int argc, char *argv[])
{
	int i;
	pthread_t thread[THREAD_NUM], th;
	int ret, count = 0;
	pthread_attr_t attr;

	ret = pthread_attr_init(&attr);
	if (ret) {
		perror("pthread_attr_init");
	}

	ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
	if (ret) {
		perror("pthread_attr_setdetachstate");
	}

	/* Create detached threads in pairs until THREAD_NUM pairs exist. */
	for (i = 0; i < THREAD_NUM; i++) {
		ret = pthread_create(&th, &attr, wait_thread, NULL);
		if (ret) {
			fprintf(stderr, "[%d] ", count);
			perror("pthread_create");
		} else {
			printf("[%d] create OK.\n", count);
		}
		count++;

		ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
		if (ret) {
			fprintf(stderr, "[%d] ", count);
			perror("pthread_create");
		} else {
			printf("[%d] create OK.\n", count);
		}
		count++;
	}

	sleep(3600);
	return 0;
}
==================================================================

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-12-16 00:53:11 +0800
bb86a7338 page-types: exit early when invoked with -d|--describe

On a system with a large amount of memory (256GB), invoking page-types can
take quite a long time, which is unreasonable considering the user only
wants a description of the flags:

# time ./page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

real 0m34.285s
user 0m1.966s
sys 0m32.313s

This is because we still walk the entire address range.

Exiting early seems like a reasonable solution:

# time ./page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

real 0m0.007s
user 0m0.001s
sys 0m0.005s

Signed-off-by: Alex Chiang
Cc: Andi Kleen
Cc: Haicheng Li
Acked-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex Chiang
2009-12-16 00:53:11 +0800
9fdcd886a page-types: whitespace alignment

Align the output when page-types -h is invoked.

Signed-off-by: Alex Chiang
Acked-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex Chiang
2009-12-16 00:53:11 +0800
dcfe730c6 page-types: learn to describe flags directly from command line

Teach page-types to describe page flags directly from the command line.

Why is this useful? For instance, if you're using memory hotplug and see
this in /var/log/messages:

kernel: removing from LRU failed 3836dd0/1/1e00000000000010

It would be nice to decode those page flags without staring at the source.

Example usage and output:

# Documentation/vm/page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

# Documentation/vm/page-types -d anon
0x0000000000001000 ____________a_____________________ anonymous

# Documentation/vm/page-types -d anon,0x10
0x0000000000001010 ____D_______a_____________________ dirty,anonymous
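
The decoding step can be sketched in a few lines. This is a minimal
illustration, not the page-types source: the table holds only the two flags
whose bit positions can be read off the output above (bit 4 for dirty, bit
12 for anonymous), and the names ending in _sketch are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct flag_sketch {
	int bit;
	const char *name;
};

/* Only the two flags visible in the example output above. */
static const struct flag_sketch flags_sketch[] = {
	{ 4, "dirty" },
	{ 12, "anonymous" },
};

/*
 * Writes a comma-separated list of the known flag names set in "flags"
 * into buf. len is the size of buf and must be at least 1.
 */
static void describe_sketch(uint64_t flags, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(flags_sketch) / sizeof(flags_sketch[0]); i++) {
		if (!(flags & (1ULL << flags_sketch[i].bit)))
			continue;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, flags_sketch[i].name, len - strlen(buf) - 1);
	}
}
```

Walking the table in bit order is what makes 0x1010 come out as
"dirty,anonymous" in this sketch, the same order as the example output.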

[achiang@hp.com: documentation]
Signed-off-by: Alex Chiang
Signed-off-by: Wu Fengguang
Cc: Andi Kleen
Cc: Haicheng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex Chiang
2009-12-16 00:53:11 +0800
f1327bf18 page-types: unsigned cannot be less than 0 in add_page()

If the variable is not signed, testing the read() return value for an
error in this function will not work.
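
The underlying problem is easy to show in isolation. This is a generic
sketch, not the code from add_page(): the helper names are hypothetical, and
a plain long stands in for the return type of read().

```c
#include <limits.h>

/* What an error return of -1 turns into once stored in an unsigned long. */
static unsigned long store_unsigned_sketch(long ret)
{
	return (unsigned long)ret;  /* -1 becomes ULONG_MAX */
}

/* With a signed variable the usual error test works as intended. */
static int read_failed_sketch(long ret)
{
	return ret < 0;
}
```

Once the value is unsigned, a "less than zero" comparison can never be
true, so the error path is dead code; keeping the variable signed restores
the check.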

Signed-off-by: Roel Kluin
Cc: Wu Fengguang
Cc: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2009-12-16 00:53:11 +0800
3428838d8 page-types: constify read only arrays

Signed-off-by: Tommi Rantala
Cc: Randy Dunlap
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tommi Rantala
2009-12-16 00:53:10 +0800