16 Dec, 2009

9 commits

  • If a user requests a hugepage pool resize but specifies a very large
    number, the machine can begin thrashing. In response, they might hit
    ctrl-c, but signals are ignored and the pool resize continues until it
    fails an allocation. This can take a considerable amount of time, so this
    patch aborts a pool resize if a signal is pending.
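
    A minimal sketch of the idea, assuming the resize loop in
    set_max_huge_pages(); the exact placement and surrounding locking in the
    real patch may differ:

        while (count > persistent_huge_pages(h)) {
            /*
             * Bail out if a signal (e.g. ctrl-c) is pending; otherwise a
             * huge resize can thrash the machine for a long time before
             * it finally fails an allocation.
             */
            if (signal_pending(current))
                break;
            if (!alloc_fresh_huge_page(h))
                break;
        }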

    Suggested by Dave Hansen.

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When the owner of a mapping fails COW because a child process is holding
    a reference, the child VMAs are walked and the page is unmapped. The
    i_mmap_lock is taken for the unmapping of the page but not for the walk
    of the prio_tree. In theory, that tree could be changing while the lock
    is not held. This patch takes the i_mmap_lock properly for the duration
    of the prio_tree walk.
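
    Schematically, the fix is to hold the lock across the walk itself -- a
    sketch using the prio_tree API of this era, not the literal diff:

        spin_lock(&mapping->i_mmap_lock);
        vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
            /* Do not unmap the page from the VMA that faulted */
            if (iter_vma == vma)
                continue;
            unmap_hugepage_range(iter_vma, address,
                                 address + huge_page_size(h), page);
        }
        spin_unlock(&mapping->i_mmap_lock);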

    [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • hugetlb_fault() takes the mm->page_table_lock spinlock then calls
    hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
    insufficient huge page pool it calls unmap_ref_private() with the
    mm->page_table_lock held. unmap_ref_private() then calls
    unmap_hugepage_range() which tries to acquire the mm->page_table_lock.

    [] print_circular_bug_tail+0x80/0x9f
    [] ? check_noncircular+0xb0/0xe8
    [] __lock_acquire+0x956/0xc0e
    [] lock_acquire+0xee/0x12e
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? unmap_hugepage_range+0x3e/0x84
    [] _spin_lock+0x40/0x89
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? alloc_huge_page+0x218/0x318
    [] unmap_hugepage_range+0x3e/0x84
    [] hugetlb_cow+0x1e2/0x3f4
    [] ? hugetlb_fault+0x453/0x4f6
    [] hugetlb_fault+0x480/0x4f6
    [] follow_hugetlb_page+0x116/0x2d9
    [] ? _spin_unlock_irq+0x3a/0x5c
    [] __get_user_pages+0x2a3/0x427
    [] get_user_pages+0x3e/0x54
    [] get_user_pages_fast+0x170/0x1b5
    [] dio_get_page+0x64/0x14a
    [] __blockdev_direct_IO+0x4b7/0xb31
    [] blkdev_direct_IO+0x58/0x6e
    [] ? blkdev_get_blocks+0x0/0xb8
    [] generic_file_aio_read+0xdd/0x528
    [] ? avc_has_perm+0x66/0x8c
    [] do_sync_read+0xf5/0x146
    [] ? autoremove_wake_function+0x0/0x5a
    [] ? security_file_permission+0x24/0x3a
    [] vfs_read+0xb5/0x126
    [] ? fget_light+0x5e/0xf8
    [] sys_read+0x54/0x8c
    [] system_call_fastpath+0x16/0x1b

    This can be fixed by dropping the mm->page_table_lock around the call to
    unmap_ref_private() if alloc_huge_page() fails; it's dropped right below
    in the normal path anyway. However, earlier in that function, it's also
    possible to call into the page allocator with the same spinlock held.

    What this patch does is drop the spinlock before the page allocator is
    potentially entered. Both the check for page allocation failure and the
    copy of the huge page can be done without the page_table_lock held. Even
    if the PTE changed while the spinlock was dropped, the consequence is
    only that a huge page is copied unnecessarily. This resolves both the
    double taking of the lock and sleeping with the spinlock held.
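
    In outline, the reworked hugetlb_cow() path looks roughly like this (a
    sketch under the names of this era, not the literal diff):

        /* Drop page_table_lock as the buddy allocator may be called */
        spin_unlock(&mm->page_table_lock);
        new_page = alloc_huge_page(vma, address, outside_reserve);

        if (IS_ERR(new_page)) {
            /* unmap_ref_private()/retry handling now runs unlocked */
            spin_lock(&mm->page_table_lock);
            return -PTR_ERR(new_page);
        }

        copy_huge_page(new_page, old_page, address, vma);

        /* Retake the lock before touching the page tables again */
        spin_lock(&mm->page_table_lock);
        ptep = huge_pte_offset(mm, address & huge_page_mask(h));
        if (likely(pte_same(huge_ptep_get(ptep), pte))) {
            /* install new_page; a changed PTE only costs a wasted copy */
        }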

    [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
    Signed-off-by: Larry Woodman
    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • Objects passed to NODEMASK_ALLOC() are relatively small in size and are
    backed by slab caches that are not of large order, traditionally never
    greater than PAGE_ALLOC_COSTLY_ORDER.

    Thus, using GFP_KERNEL for these allocations on large machines when
    CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
    the allocation attempt, each time invoking both direct reclaim and the
    oom killer.

    This is of particular interest when using NODEMASK_ALLOC() from a
    mempolicy context (either directly in mm/mempolicy.c or the mempolicy
    constrained hugetlb allocations) since the oom killer always kills current
    when allocations are constrained by mempolicies. So for all present use
    cases in the kernel, current would end up being oom killed when direct
    reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
    current would have sacrificed itself upon returning.

    This patch adds a gfp-flags parameter to NODEMASK_ALLOC() that is passed
    to kmalloc() when CONFIG_NODES_SHIFT > 8; the parameter is a no-op on
    other configurations. All current use cases, whether directly from
    hugetlb code or indirectly via NODEMASK_SCRATCH(), OR in __GFP_NORETRY to
    avoid direct reclaim and the oom killer when the slab allocator needs to
    allocate additional pages.
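
    The resulting macro shape is roughly the following (a sketch patterned on
    include/linux/nodemask.h; check the header for the exact definitions):

        #if NODES_SHIFT > 8 /* nodemask_t is too large for the stack */
        #define NODEMASK_ALLOC(type, name, gfp_flags)    \
                type *name = kmalloc(sizeof(*name), gfp_flags)
        #define NODEMASK_FREE(m)    kfree(m)
        #else
        #define NODEMASK_ALLOC(type, name, gfp_flags)    type _##name, *name = &_##name
        #define NODEMASK_FREE(m)    do {} while (0)
        #endif

        /* callers then pass, for example: */
        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);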

    The side-effect of this change is that all current use cases of either
    NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
    when the allocation fails (which can never happen for
    CONFIG_NODES_SHIFT <= 8).
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Register per node hstate sysfs attributes only for nodes with memory.
    Globally replace "all online nodes" with "all nodes with memory" in
    mm/hugetlb.c. Suggested by David Rientjes.

    A subsequent patch will handle adding/removing of per node hstate sysfs
    attributes when nodes transition to/from memoryless state via memory
    hotplug.

    NOTE: this patch has not been tested with memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/
    nr_hugepages - r/w
    free_huge_pages - r/o
    surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node mask
    containing only the specified node.
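
    In code, the shared update path ends up looking roughly like this (a
    simplified sketch; here nid < 0 stands for "no specific node", and the
    helpers shown are the ones introduced and used by this series):

        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);

        if (nid < 0) {
            /*
             * Global hstate attribute or sysctl: constrain by any
             * non-default task mempolicy.
             */
            if (!(nodes_allowed &&
                  init_nodemask_of_mempolicy(nodes_allowed))) {
                NODEMASK_FREE(nodes_allowed);
                nodes_allowed = &node_states[N_HIGH_MEMORY];
            }
        } else if (nodes_allowed) {
            /*
             * Per node attribute: adjust the count to a global target,
             * but restrict alloc/free to the specified node.
             */
            count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
            init_nodemask_of_node(nodes_allowed, nid);
        } else {
            nodes_allowed = &node_states[N_HIGH_MEMORY];
        }

        h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);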

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadm --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.
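
    The derivation is essentially the following helper (a sketch patterned on
    the init_nodemask_of_mempolicy() added in mm/mempolicy.c by this series;
    locking and corner cases are omitted):

        bool init_nodemask_of_mempolicy(nodemask_t *mask)
        {
            struct mempolicy *mempolicy = current->mempolicy;
            int nid;

            if (!mask || !mempolicy)
                return false;    /* default policy: use all online nodes */

            switch (mempolicy->mode) {
            case MPOL_PREFERRED:
                if (mempolicy->flags & MPOL_F_LOCAL)    /* explicit "local" */
                    nid = numa_node_id();
                else
                    nid = mempolicy->v.preferred_node;
                init_nodemask_of_node(mask, nid);
                break;
            case MPOL_BIND:
            case MPOL_INTERLEAVE:
                *mask = mempolicy->v.nodes;
                break;
            default:
                BUG();
            }
            return true;
        }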

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In preparation for constraining huge page allocation and freeing by the
    controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
    to the allocate, free and surplus adjustment functions. For now, pass
    NULL to indicate default behavior--i.e., use node_online_map. A
    subsequent patch will derive a non-default mask from the controlling
    task's numa mempolicy.

    Note that this method of updating the global hstate nr_hugepages under the
    constraint of a nodemask simplifies keeping the global state
    consistent--especially the number of persistent and surplus pages relative
    to reservations and overcommit limits. There are undoubtedly other ways
    to do this, but this works for both interfaces: mempolicy and per node
    attributes.

    [rientjes@google.com: fix HIGHMEM compile error]
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Modify the hstate_next_node* functions to allow them to be called to
    obtain the "start_nid". Then, whereas prior to this patch we
    unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
    we successfully allocated/freed a huge page on the node, now we only call
    these functions on failure to alloc/free, to advance to the next allowed
    node.

    Factor out the next_node_allowed() function to handle wrap at end of
    node_online_map. In this version, the allowed nodes include all of the
    online nodes.
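
    Roughly, the reworked helpers look like this (a sketch; field names are
    approximate):

        /* Common helper: advance to the next online node, wrapping at the end. */
        static int next_node_allowed(int nid)
        {
            nid = next_node(nid, node_online_map);
            if (nid == MAX_NUMNODES)
                nid = first_node(node_online_map);
            VM_BUG_ON(nid >= MAX_NUMNODES);
            return nid;
        }

        /*
         * Return the current "start" node and save the next one, so callers
         * only advance again on failure to allocate from this node.
         */
        static int hstate_next_node_to_alloc(struct hstate *h)
        {
            int nid = h->next_nid_to_alloc;

            h->next_nid_to_alloc = next_node_allowed(nid);
            return nid;
        }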

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

28 Sep, 2009

1 commit


24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

5 commits

  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I noticed that alloc_bootmem_huge_page() will only advance to the next
    node on failure to allocate a huge page, potentially filling nodes with
    huge-pages. I asked about this on linux-mm and linux-numa, cc'ing the
    usual huge page suspects.

    Mel Gorman responded:

    I strongly suspect that the same node being used until allocation
    failure instead of round-robin is an oversight and not deliberate
    at all. It appears to be a side-effect of a fix made way back in
    commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix
    hugepage allocation with memoryless nodes"]. Prior to that patch
    it looked like allocations would always round-robin even when
    allocation was successful.

    This patch--factored out of my "hugetlb mempolicy" series--moves the
    advance of the hstate next node from which to allocate up before the test
    for success of the attempted allocation.

    Note that alloc_bootmem_huge_page() is only used for order > MAX_ORDER
    huge pages.

    I'll post a separate patch for mainline/stable, as the above mentioned
    "balance freeing" series renamed the next node to alloc function.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Use the [modified] free_pool_huge_page() function to return unused
    surplus pages. This will help keep huge pages balanced across nodes
    between freeing of unused surplus pages and freeing of persistent huge
    pages [from set_max_huge_pages] by using the same node id "cursor". It
    also eliminates some code duplication.

    Signed-off-by: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Free huge pages from nodes in round-robin fashion in an attempt to keep
    [persistent, a.k.a. static] hugepages balanced across nodes.

    The new function free_pool_huge_page() is modeled on and performs roughly
    the inverse of alloc_fresh_huge_page(). It replaces dequeue_huge_page(),
    which now has no callers, so this patch removes it.

    Helper function hstate_next_node_to_free() uses new hstate member
    next_to_free_nid to distribute "frees" across all nodes with huge pages.
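
    A sketch of the new function (close to, but not necessarily identical
    with, the merged code):

        static int free_pool_huge_page(struct hstate *h)
        {
            int start_nid, nid, ret = 0;

            start_nid = hstate_next_node_to_free(h);
            nid = start_nid;

            do {
                if (!list_empty(&h->hugepage_freelists[nid])) {
                    struct page *page =
                        list_entry(h->hugepage_freelists[nid].next,
                                   struct page, lru);
                    /* return this page to the buddy allocator */
                    list_del(&page->lru);
                    h->free_huge_pages--;
                    h->free_huge_pages_node[nid]--;
                    update_and_free_page(h, page);
                    ret = 1;
                }
                nid = hstate_next_node_to_free(h);
            } while (!ret && nid != start_nid);

            return ret;
        }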

    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

10 Sep, 2009

1 commit


30 Jul, 2009

1 commit

  • As reported in Red Hat bz #509671, the i_blocks accounting for files on
    hugetlbfs goes wrong when doing something like:

    $ > foo
    $ date > foo
    date: write error: Invalid argument
    $ /usr/bin/stat foo
    File: `foo'
    Size: 0 Blocks: 18446744073709547520 IO Block: 2097152 regular
    ...

    This is because hugetlb_unreserve_pages() is unconditionally removing
    blocks_per_huge_page(h) on each call rather than using the freed amount.
    If there were 0 blocks, it goes negative, resulting in the above.

    This is a regression from commit a5516438959d90b071ff0a484ce4f3f523dc3152
    ("hugetlb: modular state for hugetlb page size")

    which did:

    - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
    + inode->i_blocks -= blocks_per_huge_page(h);

    so just put back the freed multiplier, and it's all happy again.
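
    In other words, hugetlb_unreserve_pages() should subtract something like
    the following (a sketch of the restored calculation):

        spin_lock(&inode->i_lock);
        inode->i_blocks -= (blocks_per_huge_page(h) * freed);
        spin_unlock(&inode->i_lock);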

    Signed-off-by: Eric Sandeen
    Acked-by: Andi Kleen
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

24 Jun, 2009

1 commit


17 Jun, 2009

3 commits

  • A series of patches to enhance the /proc/pagemap interface and to add a
    userspace executable which can be used to present the pagemap data.

    Export 10 more flags to end users (and more for kernel developers):

    11. KPF_MMAP (pseudo flag) memory mapped page
    12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
    13. KPF_SWAPCACHE page is in swap cache
    14. KPF_SWAPBACKED page is swap/RAM backed
    15. KPF_COMPOUND_HEAD (*)
    16. KPF_COMPOUND_TAIL (*)
    17. KPF_HUGE hugeTLB pages
    18. KPF_UNEVICTABLE page is in the unevictable LRU list
    19. KPF_HWPOISON hardware detected corruption
    20. KPF_NOPAGE (pseudo flag) no page frame at the address

    (*) For compound pages, exporting _both_ head/tail info enables
    users to tell where a compound page starts/ends, and its order.

    A simple demo of the page-types tool:

    # ./page-types -h
    page-types [options]
    -r|--raw Raw mode, for kernel developers
    -a|--addr addr-spec Walk a range of pages
    -b|--bits bits-spec Walk pages with specified bits
    -l|--list Show page details in ranges
    -L|--list-each Show page details one by one
    -N|--no-summary Don't show summary info
    -h|--help Show this usage message
    addr-spec:
    N one page at offset N (unit: pages)
    N+M pages range from N to N+M-1
    N,M pages range from N to M-1
    N, pages range from N to end
    ,M pages range from 0 to M
    bits-spec:
    bit1,bit2 (flags & (bit1|bit2)) != 0
    bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
    bit1,~bit2 (flags & (bit1|bit2)) == bit1
    =bit1,bit2 flags == (bit1|bit2)
    bit-names:
    locked error referenced uptodate
    dirty lru active slab
    writeback reclaim buddy mmap
    anonymous swapcache swapbacked compound_head
    compound_tail huge unevictable hwpoison
    nopage reserved(r) mlocked(r) mappedtodisk(r)
    private(r) private_2(r) owner_private(r) arch(r)
    uncached(r) readahead(o) slob_free(o) slub_frozen(o)
    slub_debug(o)
    (r) raw mode bits (o) overloaded bits

    # ./page-types
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 487369 1903 _________________________________
    0x0000000000000014 5 0 __R_D____________________________ referenced,dirty
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000000000024 34 0 __R__l___________________________ referenced,lru
    0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x0000000000000040 8344 32 ______A__________________________ active
    0x0000000000000060 1 0 _____lA__________________________ lru,active
    0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 503 1 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types -r
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 468002 1828 _________________________________
    0x0000000100000000 19102 74 _____________________r___________ reserved
    0x0000000000008000 41 0 _______________H_________________ compound_head
    0x0000000000010000 188 0 ________________T________________ compound_tail
    0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head
    0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private
    0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk
    0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead
    0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk
    0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private
    0x0000000800000040 8124 31 ______A_________________P________ active,private
    0x0000000000000040 219 0 ______A__________________________ active
    0x0000000800000060 1 0 _____lA_________________P________ lru,active,private
    0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk
    0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private
    0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk
    0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private
    0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private
    0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 538 2 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types --raw --list --no-summary --bits reserved
    offset count flags
    0 15 _____________________r___________
    31 4 _____________________r___________
    159 97 _____________________r___________
    4096 2067 _____________________r___________
    6752 2390 _____________________r___________
    9355 3 _____________________r___________
    9728 14526 _____________________r___________

    This patch:

    Introduce PageHuge(), which identifies huge/gigantic pages by their
    dedicated compound destructor functions.

    Also move prep_compound_gigantic_page() to hugetlb.c and make
    __free_pages_ok() non-static.
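
    The check itself is small; a sketch of the introduced PageHuge(), where
    the destructor comparison is the key idea:

        int PageHuge(struct page *page)
        {
            compound_page_dtor *dtor;

            if (!PageCompound(page))
                return 0;

            page = compound_head(page);
            dtor = get_compound_page_dtor(page);

            /* hugetlb pages use free_huge_page as their compound destructor */
            return dtor == free_huge_page;
        }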

    Signed-off-by: Wu Fengguang
    Cc: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Matt Mackall
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • num_online_nodes() is called in a number of places but most often by the
    page allocator when deciding whether the zonelist needs to be filtered
    based on cpusets or the zonelist cache. This is actually a heavy function
    and touches a number of cache lines.

    This patch stores the number of online nodes at boot time and updates the
    value when nodes get onlined and offlined. The value is then used in a
    number of important paths in place of num_online_nodes().
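
    The cached value is maintained where nodes change state (a sketch
    patterned on include/linux/nodemask.h):

        extern int nr_online_nodes;    /* updated on node online/offline */

        static inline void node_set_online(int nid)
        {
            node_set_state(nid, N_ONLINE);
            nr_online_nodes = num_node_state(N_ONLINE);
        }

        static inline void node_set_offline(int nid)
        {
            node_clear_state(nid, N_ONLINE);
            nr_online_nodes = num_node_state(N_ONLINE);
        }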

    [rientjes@google.com: do not override definition of node_set_online() with macro]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
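
    The new helper is essentially the following (a sketch patterned on
    include/linux/gfp.h):

        static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                          unsigned int order)
        {
            /* Unlike alloc_pages_node(), the caller guarantees a valid nid */
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }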

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 May, 2009

1 commit

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302

    hugetlbfs reserves huge pages but does not fault them at mmap() time to
    ensure that future faults succeed. The reservation behaviour differs
    depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
    For MAP_SHARED mappings, hugepages are reserved when mmap() is first
    called and are tracked based on information associated with the inode.
    Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE
    mappings track their reservations based on the VMA created as part of the
    mmap() operation. Each process mapping MAP_PRIVATE must make its own
    reservation.

    hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
    and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs,
    VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
    read-write. If a shared memory mapping was mapped shared-read-write for
    populating of data and mapped shared-read-only by other processes, then
    hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This
    causes processes to fail to map the file MAP_SHARED even though it should
    succeed as the reservation is there.

    This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
    when the intent of the code was to check whether the VMA was mapped
    MAP_SHARED or MAP_PRIVATE.
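
    Illustratively (a hypothetical helper; the actual patch is a direct flag
    substitution at each check in mm/hugetlb.c):

        /* Is this hugetlb VMA a MAP_SHARED mapping? */
        static inline int is_hugetlb_shared(struct vm_area_struct *vma)
        {
            /*
             * VM_SHARED is only set when a shared mapping is also opened
             * read-write; VM_MAYSHARE is set for every MAP_SHARED mapping,
             * which is what the reservation accounting cares about.
             */
            return !!(vma->vm_flags & VM_MAYSHARE);
        }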

    Signed-off-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2009

1 commit

  • chg is unsigned, so it cannot be less than 0.

    Also, since region_chg() returns long, let vma_needs_reservation() forward
    this to alloc_huge_page(). Store it as long as well; all callers cast it
    to long anyway.

    Signed-off-by: Roel Kluin
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     

12 Feb, 2009

1 commit

  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and to account for filesystem quota. If either fails, the fault
    fails. The impact is that quota is getting accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To help
    prevent this mistake from happening again, it improves the documentation
    of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared
    and private mappings using reservation counters that are checked and
    updated during mmap(). This ensures (within limits) that hugepages exist
    in the future when faults occur; otherwise it would be too easy for
    applications to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings.
    This breaks the accounting for both the core VM and hugetlbfs: it can
    trigger an OOM storm when hugepage pools are too small, and lockups and
    corrupted counters otherwise. This patch brings hugetlbfs more in line
    with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT from
    being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jan, 2009

4 commits

  • At this point we already know that 'addr' is not NULL, so get rid of the
    redundant 'if'. gcc probably eliminates it in an optimization pass anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    whereby a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.
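
    The size reported is derived from the hstate backing the VMA; a sketch of
    the vma_kernel_pagesize() helper added for this:

        unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
        {
            struct hstate *hstate;

            if (!is_vm_hugetlb_page(vma))
                return PAGE_SIZE;

            hstate = hstate_vma(vma);

            return 1UL << huge_page_shift(hstate);
        }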

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2008

1 commit

  • Oops. Part of the hugetlb private reservation code was not fully
    converted to use hstates.

    When a huge page must be unmapped from VMAs due to a failed COW,
    HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of
    the page size being used. This works if the VMA is using the default
    huge page size. Otherwise we might unmap too much, too little, or
    trigger a BUG_ON. Rare but serious -- fix it.

    Signed-off-by: Adam Litke
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

07 Nov, 2008

2 commits

  • As we can determine exactly when a gigantic page is in use we can optimise
    the common regular page cases by pulling out gigantic page initialisation
    into its own function. As gigantic pages are never released to buddy we
    do not need a destructor. This effectively reverts the previous change to
    the main buddy allocator. It also adds a paranoid check to ensure we
    never release gigantic pages from hugetlbfs to the main buddy.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • When working with hugepages, hugetlbfs assumes that those hugepages are
    smaller than MAX_ORDER. Specifically it assumes that the mem_map is
    contiguous and uses that to optimise access to the elements of the mem_map
    that represent the hugepage. Gigantic pages (such as 16GB pages on
    powerpc) by definition are of greater order than MAX_ORDER (larger than
    MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of
    the buddy allocator guarantees for the contiguity of the mem_map, which
    ensures that the mem_map is at least contiguous for maximally aligned
    areas of MAX_ORDER_NR_PAGES pages.

    This patch adds new mem_map accessors and iterator helpers which handle
    any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to
    implement gigantic page versions of copy_huge_page and clear_huge_page,
    and to allow follow_hugetlb_page() to handle gigantic pages.
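
    The accessor handles the potential discontiguity explicitly; a sketch of
    the mem_map_offset() helper (mem_map_next() follows the same pattern for
    iteration):

        /* Return the page idx pages after base, valid across MAX_ORDER gaps. */
        static struct page *mem_map_offset(struct page *base, int idx)
        {
            if (unlikely(idx >= MAX_ORDER_NR_PAGES))
                return pfn_to_page(page_to_pfn(base) + idx);
            return base + idx;
        }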

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

23 Oct, 2008

1 commit


20 Oct, 2008

3 commits

  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for coredumping and hugepages could not be core dumped.

    However, we have now implemented hugepage coredumping, so we should
    support the zero page for hugepages as well.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
    A normal page is checked as follows:

        static inline int use_zero_page(struct vm_area_struct *vma)
        {
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                return 0;

            return !vma->vm_ops || !vma->vm_ops->fault;
        }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.
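
    The hugepage check then reduces to the following (a sketch of the
    huge_zeropage_ok() test this patch introduces):

        static int huge_zeropage_ok(pte_t *ptep, int write, int shared)
        {
            if (!ptep || write || shared)
                return 0;
            else
                return huge_pte_none(huge_ptep_get(ptep));
        }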

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Oct, 2008

1 commit

  • The page fault path for normal pages, if the fault is neither a no-page
    fault nor a write-protect fault, will update the DIRTY and ACCESSED bits
    in the page table appropriately.

    The hugepage fault path, however, does not do this, handling only no-page
    or write-protect type faults. It assumes that either the ACCESSED and
    DIRTY bits are irrelevant for hugepages (usually true, since they are
    never swapped) or that they are handled by the arch code.

    This is inconvenient for some software-loaded TLB architectures, where the
    _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write)
    access to the page at the TLB miss. This could be worked around in the
    arch TLB miss code, but the TLB miss fast path is easier to keep simple
    if the hugetlb_fault() path handles this, as the normal page fault path
    does.
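
    The change amounts to doing, in hugetlb_fault(), roughly what the normal
    fault path does (a sketch; arch hooks shown by their generic names):

        entry = huge_ptep_get(ptep);
        entry = pte_mkyoung(entry);
        if (write_access)
            entry = pte_mkdirty(entry);
        /* set ACCESSED/DIRTY and flush the TLB entry if anything changed */
        if (huge_ptep_set_access_flags(vma, address, ptep, entry, write_access))
            update_mmu_cache(vma, address, entry);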

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

13 Aug, 2008

2 commits

  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc(), which can sleep with the
    mm->page_table_lock held. This is wrong and triggers a may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory,
    without actually making the change. The second phase actually commits the
    change. This patch makes use of this by checking the reservations before
    the page_table_lock is taken; triggering any necessary allocations. This
    may then be safely repeated within the locks without any allocations being
    required.
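
    The resulting fault-path pattern is roughly the following (a simplified
    sketch; helper names as in mm/hugetlb.c of this era):

        /* Phase 1: before taking page_table_lock; may kmalloc() and sleep */
        if (vma_needs_reservation(h, vma, address) < 0)
            return VM_FAULT_OOM;

        spin_lock(&mm->page_table_lock);
        /* fault handling under the lock: no region allocations needed here */
        spin_unlock(&mm->page_table_lock);

        /* Phase 2: commit the prepared region change; no allocation required */
        vma_commit_reservation(h, vma, address);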

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft