15 Jan, 2008

1 commit

  • In the error path of both shared and private hugetlb page allocation,
    the filesystem quota is never undone, leading to an fs quota leak. Fix
    both paths up.
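
    As an illustrative sketch only (not the actual diff; function names and
    locking are approximate), the shape of such a fix is to credit the quota
    back whenever the allocation itself fails:

        static struct page *alloc_huge_page_shared(struct vm_area_struct *vma,
                                                   unsigned long addr)
        {
                struct page *page;

                spin_lock(&hugetlb_lock);
                page = dequeue_huge_page(vma, addr);
                spin_unlock(&hugetlb_lock);
                if (!page)
                        /* undo the quota debited before the allocation */
                        hugetlb_put_quota(vma->vm_file->f_mapping, 1);
                return page;
        }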

    [akpm@linux-foundation.org: cleanup, micro-optimise]
    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

18 Dec, 2007

2 commits

  • This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
    Add hugetlb_dynamic_pool sysctl")

    Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
    sysctl is not needed, as its semantics can be expressed by 0 in the
    overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
    (pool enabled).

    (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • hugetlb: introduce nr_overcommit_hugepages sysctl

    While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
    became convinced that having a boolean sysctl was insufficient:

    1) To support per-node control of hugepages, I have previously submitted
    patches to add a sysfs attribute related to nr_hugepages. However, with
    a boolean global value and per-mount quota enforcement constraining the
    dynamic pool, adding corresponding control of the dynamic pool on a
    per-node basis seems inconsistent to me.

    2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
    mount points is, arguably, more arduous than it needs to be. Each quota
    would need to be set separately, and the sum would need to be monitored.

    To ease the administration, and to help make the way for per-node
    control of the static & dynamic hugepage pool, I added a separate
    sysctl, nr_overcommit_hugepages. This value serves as a high watermark
    for the overall hugepage pool, while nr_hugepages serves as a low
    watermark. The boolean sysctl can then be removed, as the condition

    nr_overcommit_hugepages > 0

    indicates the same administrative setting as

    hugetlb_dynamic_pool == 1

    Quotas still serve as local enforcement of the size of the pool on a
    per-mount basis.
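
    Roughly, the high-watermark check looks like the sketch below (variable
    and function names are approximate, not the actual patch): a surplus page
    may only be taken from the buddy allocator while the surplus count stays
    under nr_overcommit_hugepages.

        static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
                                                  unsigned long addr)
        {
                struct page *page;

                spin_lock(&hugetlb_lock);
                if (surplus_huge_pages >= nr_overcommit_huge_pages) {
                        spin_unlock(&hugetlb_lock);
                        return NULL;    /* dynamic pool full (or disabled) */
                }
                surplus_huge_pages++;   /* see caveat 1 below about the race */
                spin_unlock(&hugetlb_lock);

                page = alloc_pages(GFP_HIGHUSER | __GFP_COMP, HUGETLB_PAGE_ORDER);
                if (!page) {
                        spin_lock(&hugetlb_lock);
                        surplus_huge_pages--;   /* allocation failed, undo */
                        spin_unlock(&hugetlb_lock);
                }
                return page;
        }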

    A few caveats:

    1) There is a race whereby the global surplus huge page counter is
    incremented before a hugepage has been allocated. Another process could
    then try to grow the pool, fail to convert a surplus huge page to a normal
    huge page, and instead allocate a fresh huge page. I believe this is
    benign, as no memory is leaked (the actual pages are still tracked
    correctly) and the counters won't go out of sync.

    2) Shrinking the static pool while a surplus is in effect will allow the
    number of surplus huge pages to exceed the overcommit value. As long as
    this condition holds, however, no more surplus huge pages will be
    allowed on the system until one of the two sysctls is increased
    sufficiently, or the surplus huge pages go out of use and are freed.

    Successfully tested on x86_64 with the current libhugetlbfs snapshot,
    modified to use the new sysctl.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

11 Dec, 2007

1 commit

  • The follow_hugetlb_page() fix I posted (merged as git commit
    5b23dbe8173c212d6a326e35347b038705603d39) missed one case. If the pte is
    present, but not writable and write access is requested by the caller to
    get_user_pages(), the code will do the wrong thing. Rather than calling
    hugetlb_fault to make the pte writable, it notes the presence of the pte
    and continues.

    This simple one-liner makes sure we also fault on the pte for this case.
    Please apply.
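
    The shape of the change, as a sketch (not the verbatim diff): the loop in
    follow_hugetlb_page() must also fault when the pte is present but
    read-only and the caller asked for write access.

        pte = huge_pte_offset(mm, vaddr);
        /* fault if the pte is absent, or present but not writable while
         * write access was requested */
        if (!pte || pte_none(*pte) || (write && !pte_write(*pte)))
                ret = hugetlb_fault(mm, vma, vaddr, write);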

    Signed-off-by: Adam Litke
    Acked-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

15 Nov, 2007

8 commits

  • For administrative purposes, we want to query the actual block usage for
    a hugetlbfs file via fstat. Currently, hugetlbfs always returns 0. Fix
    that up, since the kernel already has all the information needed to track
    it properly.
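
    One plausible way to do this (a sketch; the exact hook points and the
    blocks_per_huge_page value are assumptions) is to keep inode->i_blocks in
    step with the huge pages charged to the file, so st_blocks is meaningful:

        /* when a huge page is charged to / uncharged from the file */
        spin_lock(&inode->i_lock);
        inode->i_blocks += blocks_per_huge_page;        /* or -= on free */
        spin_unlock(&inode->i_lock);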

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: Badari Pulavarty
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • return_unused_surplus_pages() can become static.

    Signed-off-by: Adrian Bunk
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
    to guarantee no problems will occur later when instantiating pages. If quotas
    are in force, page instantiation could fail due to a race with another process
    or an oversized (but approved) shared mapping.

    To prevent these scenarios, debit the quota for the full reservation amount up
    front and credit the unused quota when the reservation is released.
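
    In sketch form (function names approximate, not the literal patch), the
    reservation path debits the whole range up front and the release path
    credits back whatever was never used:

        int hugetlb_reserve_pages(struct inode *inode, long from, long to)
        {
                long chg = to - from;                   /* pages being reserved */

                if (hugetlb_get_quota(inode->i_mapping, chg))
                        return -ENOSPC;                 /* over quota: fail the mmap */
                return hugetlb_acct_memory(chg);
        }

        void hugetlb_unreserve_pages(struct inode *inode, long freed_resv)
        {
                /* credit back the unused part of the reservation */
                hugetlb_put_quota(inode->i_mapping, freed_resv);
                hugetlb_acct_memory(-freed_resv);
        }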

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
    allow bulk updating of the sbinfo->free_blocks counter. This will be used by
    the next patch in the series.
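
    A sketch of the new interface (the bodies are illustrative; only the
    added 'delta' parameter is the point):

        int hugetlb_get_quota(struct address_space *mapping, long delta)
        {
                int ret = 0;
                struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

                if (sbinfo->free_blocks > -1) {
                        spin_lock(&sbinfo->stat_lock);
                        if (sbinfo->free_blocks - delta >= 0)
                                sbinfo->free_blocks -= delta;   /* bulk debit */
                        else
                                ret = -ENOMEM;
                        spin_unlock(&sbinfo->stat_lock);
                }
                return ret;
        }

        void hugetlb_put_quota(struct address_space *mapping, long delta)
        {
                struct hugetlbfs_sb_info *sbinfo = HUGETLBFS_SB(mapping->host->i_sb);

                if (sbinfo->free_blocks > -1) {
                        spin_lock(&sbinfo->stat_lock);
                        sbinfo->free_blocks += delta;           /* bulk credit */
                        spin_unlock(&sbinfo->stat_lock);
                }
        }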

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
    seem out of place. The alloc/free API is unbalanced because we handle the
    hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
    Move the get inside alloc_huge_page to clean up this disparity.

    This patch has been kept apart from the previous patch because of the somewhat
    dodgy ERR_PTR() use herein. Moving the quota logic means that
    alloc_huge_page() has two failure modes. Quota failure must result in a
    SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR()
    doesn't like the small positive errnos we have in VM_FAULT_* so they must be
    negated before they are used.
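
    The ERR_PTR trick being described, as a sketch (not the exact diff):

        /* in alloc_huge_page(): distinguish quota failure from OOM */
        if (hugetlb_get_quota(mapping, 1))
                return ERR_PTR(-VM_FAULT_SIGBUS);       /* negate: ERR_PTR wants a negative */
        page = dequeue_huge_page(vma, addr);            /* or fall back to the buddy allocator */
        if (!page) {
                hugetlb_put_quota(mapping, 1);
                return ERR_PTR(-VM_FAULT_OOM);
        }

        /* at the call site */
        page = alloc_huge_page(vma, address);
        if (IS_ERR(page))
                return -PTR_ERR(page);                  /* back to a positive VM_FAULT_* code */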

    Does anyone take issue with the way I am using PTR_ERR? If so, what are your
    thoughts on how to clean this up (without needing an if,else if,else block at
    each alloc_huge_page() callsite)?

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
    mappings when that support was added. Currently, quota is debited at page
    instantiation and credited at file truncation. This approach works correctly
    for shared pages but is incomplete for private pages. In addition to
    hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
    this function does not respect quotas.

    Private huge pages are treated very much like normal, anonymous pages. They
    are not "backed" by the hugetlbfs file and are not stored in the mapping's
    radix tree. This means that private pages are invisible to
    truncate_hugepages() so that function will not credit the quota.

    This patch (based on a prototype provided by Ken Chen) moves quota crediting
    for all pages into free_huge_page(). page->private is used to store a pointer
    to the mapping to which this page belongs. This is used to credit quota on
    the appropriate hugetlbfs instance.
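
    A sketch of the mechanism (illustrative; the accessors are the standard
    page_private() helpers, the rest is approximate):

        /* at allocation time, remember which mapping the page was charged to */
        set_page_private(page, (unsigned long)vma->vm_file->f_mapping);

        static void free_huge_page(struct page *page)
        {
                struct address_space *mapping;

                mapping = (struct address_space *)page_private(page);
                set_page_private(page, 0);
                /* ... return the page to the pool or the buddy allocator ... */
                if (mapping)
                        hugetlb_put_quota(mapping, 1);  /* credit quota on the right fs */
        }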

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Hugetlbfs implements a quota system which can limit the amount of memory that
    can be used by the filesystem. Before allocating a new huge page for a file,
    the quota is checked and debited. The quota is then credited when truncating
    the file. I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
    mappings. Before detailing the problems and my proposed solutions, we should
    agree on a definition of quotas that properly addresses both private and
    shared pages. Since the purpose of quotas is to limit total memory
    consumption on a per-filesystem basis, I argue that all pages allocated by the
    fs (private and shared) should be charged against quota.

    Private Mappings
    ================

    The current code will debit quota for private pages sometimes, but will never
    credit it. At a minimum, this causes a leak in the quota accounting which
    renders the accounting essentially useless as it is. Shared pages have a one
    to one mapping with a hugetlbfs file and are easy to account by debiting on
    allocation and crediting on truncate. Private pages are anonymous in nature
    and have a many to one relationship with their hugetlbfs files (due to copy on
    write). Because private pages are not indexed by the mapping's radix tree,
    their quota cannot be credited at file truncation time. Crediting must be
    done when the page is unmapped and freed.

    Shared Pages
    ============

    I discovered an issue concerning the interaction between the MAP_SHARED
    reservation system and quotas. Since quota is not checked until page
    instantiation, an over-quota mmap/reservation will initially succeed. When
    instantiating the first over-quota page, the program will receive SIGBUS.
    This is inconsistent since the reservation is supposed to be a guarantee. The
    solution is to debit the full amount of quota at reservation time and credit
    the unused portion when the reservation is released.

    This patch series brings quotas back in line by making the following
    modifications:
    * Private pages
    - Debit quota in alloc_huge_page()
    - Credit quota in free_huge_page()
    * Shared pages
    - Debit quota for entire reservation at mmap time
    - Credit quota for instantiated pages in free_huge_page()
    - Credit quota for unused reservation at munmap time

    This patch:

    The shared page reservation and dynamic pool resizing features have made the
    allocation of private vs. shared huge pages quite different. By splitting
    out the private/shared-specific portions of the process into their own
    functions, readability is greatly improved. alloc_huge_page now calls the
    proper helper and performs common operations.
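
    A sketch of the resulting structure (not the verbatim patch; helper names
    follow the description above):

        static struct page *alloc_huge_page(struct vm_area_struct *vma,
                                            unsigned long addr)
        {
                struct page *page;

                if (vma->vm_flags & VM_MAYSHARE)
                        page = alloc_huge_page_shared(vma, addr);
                else
                        page = alloc_huge_page_private(vma, addr);

                if (page) {
                        set_page_refcounted(page);
                        set_page_private(page, (unsigned long)vma->vm_file->f_mapping);
                }
                return page;
        }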

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • When calling get_user_pages(), a write flag is passed in by the caller to
    indicate if write access is required on the faulted-in pages. Currently,
    follow_hugetlb_page() ignores this flag and always faults pages for
    read-only access. This can cause data corruption because a device driver
    that calls get_user_pages() with write set will not expect COW faults to
    occur on the returned pages.

    This patch passes the write flag down to follow_hugetlb_page() and makes
    sure hugetlb_fault() is called with the right write_access parameter.
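
    In outline (a sketch of the plumbing, not the exact diff):

        /* get_user_pages() passes its write flag through: */
        ret = follow_hugetlb_page(mm, vma, pages, vmas,
                                  &start, &len, i, write);

        /* ...and follow_hugetlb_page() forwards it when it must fault: */
        ret = hugetlb_fault(mm, vma, vaddr, write);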

    [ezk@cs.sunysb.edu: build fix]
    Signed-off-by: Adam Litke
    Reviewed-by: Ken Chen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

19 Oct, 2007

1 commit

  • Get rid of sparse-related warnings from places that use an integer as a
    NULL pointer.
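
    The typical change is trivial, for example:

        /* before: sparse warns "Using plain integer as NULL pointer" */
        struct vfsmount *mnt = 0;
        /* after */
        struct vfsmount *mnt = NULL;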

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

17 Oct, 2007

8 commits

  • When gather_surplus_pages() fails to allocate enough huge pages to satisfy
    the requested reservation, it frees what it did allocate back to the buddy
    allocator. put_page() should be called instead of update_and_free_page()
    to ensure that pool counters are updated as appropriate and the page's
    refcount is decremented.
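
    A sketch of the corrected cleanup loop (illustrative, list handling
    approximate):

        /* free over-allocated pages through the normal path so the hugetlb
         * pool counters are adjusted and the refcount is dropped */
        list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
                list_del(&page->lru);
                put_page(page);         /* not update_and_free_page() */
        }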

    Signed-off-by: Adam Litke
    Acked-by: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Anton found a problem with the hugetlb pool allocation when some nodes have
    no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee worked
    on versions that tried to fix it, but none were accepted. Christoph has
    created a set of patches which allow for GFP_THISNODE allocations to fail
    if the node has no memory.

    Currently, alloc_fresh_huge_page() returns NULL when it is not able to
    allocate a huge page on the current node, as specified by its custom
    interleave variable. The callers of this function, though, assume that a
    failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
    on the system, period. This might not be the case, for instance, if we have
    an uneven NUMA system, and we happen to try to allocate a hugepage on a
    node with less memory and fail, while there is still plenty of free memory
    on the other nodes.

    To correct this, make alloc_fresh_huge_page() search through all online
    nodes before deciding no hugepages can be allocated. Add a helper function
    for actually allocating the hugepage. Use a new global nid iterator to
    control which nid to allocate on.
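
    Schematically (names approximate), the allocation now walks all online
    nodes starting from the round-robin position instead of giving up after
    one node:

        static int alloc_fresh_huge_page(void)
        {
                struct page *page = NULL;
                int start_nid = hugetlb_next_nid;       /* global nid iterator */
                int nid = start_nid;

                do {
                        page = alloc_fresh_huge_page_node(nid); /* per-node helper */
                        nid = next_node(nid, node_online_map);
                        if (nid == MAX_NUMNODES)
                                nid = first_node(node_online_map);
                        hugetlb_next_nid = nid;
                } while (!page && nid != start_nid);    /* stop after one full pass */

                return page != NULL;
        }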

    Note: we expect particular semantics for __GFP_THISNODE, which are now
    enforced even for memoryless nodes. That is, there should be no
    fallback to other nodes. Therefore, we rely on the nid passed into
    alloc_pages_node() to be the nid the page comes from. If this is
    incorrect, accounting will break.

    Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
    memoryless nodes).

    Before on the ppc64 box:
    Trying to clear the hugetlb pool
    Done. 0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free: 25
    Node 1 HugePages_Free: 75
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. Initially 100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free: 50
    Node 1 HugePages_Free: 150
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. 200 free

    After:
    Trying to clear the hugetlb pool
    Done. 0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free: 50
    Node 1 HugePages_Free: 50
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. Initially 100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free: 100
    Node 1 HugePages_Free: 100
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. 200 free

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: William Lee Irwin III
    Cc: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
    are careful to keep enough pages around to satisfy reservations. But the
    calculation is flawed for the following scenario:

    Action                         Pool Counters (Total, Free, Resv)
    ======                         =================================
    Set pool to 1 page             1  1  0
    Map 1 page MAP_PRIVATE         1  1  0
    Touch the page to fault it in  1  0  0
    Set pool to 3 pages            3  2  0
    Map 2 pages MAP_SHARED         3  2  2
    Set pool to 2 pages            2  1  2
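
    One way to close the hole, sketched under the assumption that the fix
    clamps the shrink target: never shrink below the pages that are in use or
    still reserved.

        /* pages that must stay in the pool: in-use pages plus reservations */
        min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
        count = max(count, min_count);  /* never shrink past the floor */
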
    Acked-by: Ken Chen
    Cc: David Gibson
    Cc: Badari Pulavarty
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The maximum size of the huge page pool can be controlled using the overall
    size of the hugetlb filesystem (via its 'size' mount option). However, in
    the common case this will not be set, as the pool is traditionally fixed
    in size at boot time. In order to maintain the expected semantics, we need to
    prevent the pool expanding by default.

    This patch introduces a new sysctl controlling dynamic pool resizing. When
    this is enabled the pool will expand beyond its base size up to the size of
    the hugetlb filesystem. It is disabled by default.
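
    A sketch of the gate (illustrative, not the literal patch):

        int hugetlb_dynamic_pool;       /* /proc/sys/vm/hugetlb_dynamic_pool, 0 by default */

        static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
                                                  unsigned long addr)
        {
                if (!hugetlb_dynamic_pool)
                        return NULL;    /* pool growth disabled: keep old behaviour */
                return alloc_pages(htlb_alloc_mask | __GFP_COMP | __GFP_NOWARN,
                                   HUGETLB_PAGE_ORDER);
        }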

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Shared mappings require special handling because the huge pages needed to
    fully populate the VMA must be reserved at mmap time. If not enough pages are
    available when making the reservation, allocate all of the shortfall at once
    from the buddy allocator and add the pages directly to the hugetlb pool. If
    they cannot be allocated, then fail the mapping. The page surplus is
    accounted for in the same way as for private mappings; faulted surplus pages
    will be freed at unmap time. Reserved, surplus pages that have not been used
    must be freed separately when their reservation has been released.
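
    Schematically (helper name from the series; races and list handling are
    omitted, so this is only a sketch):

        static int gather_surplus_pages(int delta)
        {
                int needed = (resv_huge_pages + delta) - free_huge_pages;

                if (needed <= 0) {
                        resv_huge_pages += delta;       /* enough free pages already */
                        return 0;
                }
                /* allocate the whole shortfall from the buddy allocator and add
                 * the pages to the pool; if we cannot, the mmap() must fail */
                while (needed--) {
                        struct page *page = alloc_buddy_huge_page(NULL, 0);
                        if (!page)
                                return -ENOMEM;
                        enqueue_huge_page(page);
                }
                resv_huge_pages += delta;
                return 0;
        }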

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that
    the hugetlb pool will be exhausted or completely reserved when a hugepage is
    needed to satisfy a page fault. Before killing the process in this situation,
    try to allocate a hugepage directly from the buddy allocator.

    The explicitly configured pool size becomes a low watermark. When dynamically
    grown, the allocated huge pages are accounted as a surplus over the watermark.
    As huge pages are freed on a node, surplus pages are released to the buddy
    allocator so that the pool will shrink back to the watermark.

    Surplus accounting also allows for friendlier explicit pool resizing. When
    shrinking a pool that is fully in-use, increase the surplus so pages will be
    returned to the buddy allocator as soon as they are freed. When growing a
    pool that has a surplus, consume the surplus first and then allocate new
    pages.
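
    A sketch of the free path under surplus accounting (illustrative; counter
    names approximate):

        static void free_huge_page(struct page *page)
        {
                int nid = page_to_nid(page);

                spin_lock(&hugetlb_lock);
                if (surplus_huge_pages_node[nid]) {
                        /* over the watermark: give the page back to buddy */
                        update_and_free_page(page);
                        surplus_huge_pages--;
                        surplus_huge_pages_node[nid]--;
                } else {
                        enqueue_huge_page(page);        /* back to the hugetlb free pool */
                }
                spin_unlock(&hugetlb_lock);
        }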

    Signed-off-by: Adam Litke
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Dynamic huge page pool resizing.

    In most real-world scenarios, configuring the size of the hugetlb pool
    correctly is a difficult task. If too few pages are allocated to the pool,
    applications using MAP_SHARED may fail to mmap() a hugepage region and
    applications using MAP_PRIVATE may receive SIGBUS. Isolating too much memory
    in the hugetlb pool means it is not available for other uses, especially those
    programs not using huge pages.

    The obvious answer is to let the hugetlb pool grow and shrink in response to
    the runtime demand for huge pages. The work Mel Gorman has been doing to
    establish a memory zone for movable memory allocations makes dynamically
    resizing the hugetlb pool reliable within the limits of that zone. This patch
    series implements dynamic pool resizing for private and shared mappings while
    being careful to maintain existing semantics. Please reply with your comments
    and feedback; even just to say whether it would be a useful feature to you.
    Thanks.

    How it works
    ============

    Upon depletion of the hugetlb pool, rather than reporting an error immediately,
    first try and allocate the needed huge pages directly from the buddy allocator.
    Care must be taken to avoid unbounded growth of the hugetlb pool, so the
    hugetlb filesystem quota is used to limit overall pool size.

    The real work begins when we decide there is a shortage of huge pages. What
    happens next depends on whether the pages are for a private or shared mapping.
    Private mappings are straightforward. At fault time, if alloc_huge_page()
    fails, we allocate a page from the buddy allocator and increment the source
    node's surplus_huge_pages counter. When free_huge_page() is called for a page
    on a node with a surplus, the page is freed directly to the buddy allocator
    instead of the hugetlb pool.

    Because shared mappings require all of the pages to be reserved up front, some
    additional work must be done at mmap() to support them. We determine the
    reservation shortage and allocate the required number of pages all at once.
    These pages are then added to the hugetlb pool and marked reserved. Where that
    is not possible the mmap() will fail. As with private mappings, the
    appropriate surplus counters are updated. Since reserved huge pages won't
    necessarily be used by the process, we can't be sure that free_huge_page() will
    always be called to return surplus pages to the buddy allocator. To prevent
    the huge page pool from bloating, we must free unused surplus pages when their
    reservation has ended.
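
    A sketch of that cleanup when a reservation is released (names
    approximate):

        static void return_unused_surplus_pages(unsigned long unused_resv_pages)
        {
                unsigned long nr_pages = min(unused_resv_pages, surplus_huge_pages);

                while (nr_pages--) {
                        struct page *page = dequeue_huge_page(NULL, 0);
                        if (!page)
                                break;
                        /* reserved but never faulted: straight back to buddy */
                        update_and_free_page(page);
                }
        }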

    Controlling it
    ==============

    With the entire patch series applied, pool resizing is off by default so unless
    specific action is taken, the semantics are unchanged.

    To take advantage of the flexibility afforded by this patch series one must
    tolerate a change in semantics. To control hugetlb pool growth, the following
    techniques can be employed:

    * A sysctl tunable to enable/disable the feature entirely
    * The size= mount option for hugetlbfs filesystems to limit pool size

    Performance
    ===========

    When contiguous memory is readily available, it is expected that the cost of
    dynamically resizing the pool will be small. This series has been performance
    tested with 'stream' to measure this cost.

    Stream (http://www.cs.virginia.edu/stream/) was linked with libhugetlbfs to
    enable remapping of the text and data/bss segments into huge pages.

    Stream with small array
    -----------------------
    Baseline: nr_hugepages = 0, No libhugetlbfs segment remapping
    Preallocated: nr_hugepages = 5, Text and data/bss remapping
    Dynamic: nr_hugepages = 0, Text and data/bss remapping

                  Rate (MB/s)
    Function      Baseline     Preallocated   Dynamic
    Copy:         4695.6266    5942.8371      5982.2287
    Scale:        4451.5776    5017.1419      5658.7843
    Add:          5815.8849    7927.7827      8119.3552
    Triad:        5949.4144    8527.6492      8110.6903

    Stream with large array
    -----------------------
    Baseline: nr_hugepages = 0, No libhugetlbfs segment remapping
    Preallocated: nr_hugepages = 67, Text and data/bss remapping
    Dynamic: nr_hugepages = 0, Text and data/bss remapping

                  Rate (MB/s)
    Function      Baseline     Preallocated   Dynamic
    Copy:         2227.8281    2544.2732      2546.4947
    Scale:        2136.3208    2430.7294      2421.2074
    Add:          2773.1449    4004.0021      3999.4331
    Triad:        2748.4502    3777.0109      3773.4970

    * All numbers are averages taken from 10 consecutive runs with a maximum
    standard deviation of 1.3 percent noted.

    This patch:

    Simply move update_and_free_page() so that it can be reused later in this
    patch series. The implementation is not changed.

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Acked-by: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The current ia64 kernel flushes the icache by lazy_mmu_prot_update()
    *after* set_pte(). This is too late. This patch removes
    lazy_mmu_prot_update and adds a modified set_pte() for flushing if
    necessary.

    This patch flushes the icache of a page when:
        new pte has the exec bit
        && new pte has the present bit
        && new pte is a user page
        && (old *ptep is not present
            || new pte's pfn is not the same as old *ptep's pfn)
        && new pte's page does not have the Pg_arch_1 bit.
    Pg_arch_1 is set when a page is cache consistent.
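
    Expressed as code, the check is roughly the sketch below (using the
    sync_icache_dcache() and pte_user() helpers mentioned in this message;
    the exact form in arch/ia64 may differ):

        if (pte_exec(pteval) && pte_present(pteval) && pte_user(pteval) &&
            (!pte_present(old_pte) || pte_pfn(old_pte) != pte_pfn(pteval)) &&
            !test_bit(PG_arch_1, &pte_page(pteval)->flags))
                sync_icache_dcache(pteval);     /* flushes, then sets PG_arch_1 */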

    I think these condition checks are much easier to understand than
    considering "Where should sync_icache_dcache() be inserted?".

    pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as a
    clean-up. So, I added it back.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Luck, Tony"
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

01 Oct, 2007

1 commit

  • The virtual address space argument of clear_user_highpage is supposed to be
    the virtual address where the page being cleared will eventually be mapped.
    This allows architectures with virtually indexed caches a few clever
    tricks. That sort of trick falls over in painful ways if the virtual
    address argument is wrong.

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

20 Sep, 2007

1 commit

  • This patch proposes fixes to the reference counting of memory policy in the
    page allocation paths and in show_numa_map(). Extracted from my "Memory
    Policy Cleanups and Enhancements" series as stand-alone.

    Shared policy lookup [shmem] has always added a reference to the policy,
    but this was never unrefed after page allocation or after formatting the
    numa map data.

    Default system policy should not require additional ref counting, nor
    should the current task's task policy. However, show_numa_map() calls
    get_vma_policy() to examine what may be [likely is] another task's policy.
    The latter case needs protection against freeing of the policy.

    This patch adds a reference count to a mempolicy returned by
    get_vma_policy() when the policy is a vma policy or another task's
    mempolicy. Again, shared policy is already reference counted on lookup. A
    matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
    shared and vma policies, and in show_numa_map() for shared and another
    task's mempolicy. We can call __mpol_free() directly, saving an admittedly
    inexpensive inline NULL test, because we know we have a non-NULL policy.

    Handling policy ref counts for hugepages is a bit trickier.
    huge_zonelist() returns a zone list that might come from a shared or vma
    'BIND policy. In this case, we should hold the reference until after the
    huge page allocation in dequeue_hugepage(). The patch modifies
    huge_zonelist() to return a pointer to the mempolicy if it needs to be
    unref'd after allocation.
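
    A sketch of the huge page side (illustrative; dequeue_from_zonelist() is
    a hypothetical stand-in for the existing dequeue logic):

        struct mempolicy *mpol;
        struct zonelist *zonelist = huge_zonelist(vma, addr,
                                                  htlb_alloc_mask, &mpol);
        struct page *page = dequeue_from_zonelist(zonelist);

        if (mpol)                       /* only set for shared/vma 'BIND' policies */
                __mpol_free(mpol);      /* drop the reference taken by huge_zonelist() */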

    Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

                  w/o patch             w/ refcount patch
                  Avg       Std Devn    Avg       Std Devn
    Real:         100.59    0.38        100.63    0.43
    User:         1209.60   0.37        1209.91   0.31
    System:       81.52     0.42        81.64     0.34

    Signed-off-by: Lee Schermerhorn
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

23 Aug, 2007

1 commit

  • It seems a simple mistake was made when converting follow_hugetlb_page()
    over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
    83c54070ee1a2d05c89793884bea1a03f2851ed4).

    By using the wrong bitmask, hugetlb_fault() failures are not being
    recognized. This results in an infinite loop whenever follow_hugetlb_page
    is involved in a failed fault.

    Signed-off-by: Adam Litke
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

25 Jul, 2007

1 commit

  • dequeue_huge_page() has a serious memory leak upon hugetlb page
    allocation. The for loop continues allocating hugetlb pages out of all
    allowable zones, even though this function is supposed to dequeue one and
    only one page.

    Fix it by breaking out of the for loop once a hugetlb page is found.
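
    A sketch of the fix (the surrounding loop is approximate):

        for (z = zonelist->zones; *z; z++) {
                nid = zone_to_nid(*z);
                if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
                    !list_empty(&hugepage_freelists[nid])) {
                        page = list_entry(hugepage_freelists[nid].next,
                                          struct page, lru);
                        list_del(&page->lru);
                        free_huge_pages--;
                        free_huge_pages_node[nid]--;
                        break;          /* we only ever want one page */
                }
        }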

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

20 Jul, 2007

5 commits

  • Use the appropriate accessor function to set the compound page
    destructor function.
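
    That is, roughly, use the accessor rather than poking at the compound
    page fields directly (sketch):

        set_compound_page_dtor(page, free_huge_page);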

    Cc: William Irwin
    Signed-off-by: Akinobu Mita
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The fix to that race in alloc_fresh_huge_page() which could give an illegal
    node ID did not need nid_lock at all: the fix was to replace static int nid
    by static int prev_nid and do the work on local int nid. nid_lock did make
    sure that racers strictly roundrobin the nodes, but that's not something we
    need to enforce strictly. Kill nid_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mm/hugetlb.c: In function `dequeue_huge_page':
    mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function

    Cc: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch completes Linus's wish that the fault return codes be made into
    bit flags, which I agree makes everything nicer. This requires
    all handle_mm_fault callers to be modified (possibly the modifications
    should go further and do things like fault accounting in handle_mm_fault --
    however that would be for another patch).
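
    The typical call-site change looks like this (schematic):

        fault = handle_mm_fault(mm, vma, address, write_access);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }
        if (fault & VM_FAULT_MAJOR)
                current->maj_flt++;
        else
                current->min_flt++;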

    [akpm@linux-foundation.org: fix alpha build]
    [akpm@linux-foundation.org: fix s390 build]
    [akpm@linux-foundation.org: fix sparc build]
    [akpm@linux-foundation.org: fix sparc64 build]
    [akpm@linux-foundation.org: fix ia64 build]
    Signed-off-by: Nick Piggin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Matthew Wilcox
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: Kyle McMartin
    Acked-by: Haavard Skinnemoen
    Acked-by: Ralf Baechle
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    [ Still apparently needs some ARM and PPC loving - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change ->fault prototype. We now return an int, which contains
    VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
    FAULT_RET_ code tells the VM whether a page was found, whether it has been
    locked, and potentially other things. This is not quite the way Linus wanted
    it yet, but that's changed in the next patch (which requires changes to
    arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
    that a page is locked which requires filemap_nopage to go away (because we
    can no longer remain backward compatible without that flag), but we were
    going to do that anyway.

    struct fault_data is renamed to struct vm_fault as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to use
    without really good reason.

    The page is now returned via a page pointer in the vm_fault struct.
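
    Schematically, the new interface is (field layout approximate):

        struct vm_fault {
                unsigned int flags;             /* FAULT_FLAG_xxx */
                pgoff_t pgoff;                  /* logical page offset based on vma */
                void __user *virtual_address;   /* faulting virtual address */
                struct page *page;              /* ->fault handlers set this */
        };

        struct vm_operations_struct {
                int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
                /* ... */
        };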

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

2 commits

  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.
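
    A sketch of the mechanism (illustrative; the proc handler details are an
    assumption):

        int hugepages_treat_as_movable;         /* the new sysctl */
        static gfp_t htlb_alloc_mask = GFP_HIGHUSER;

        int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
                        struct file *file, void __user *buffer,
                        size_t *length, loff_t *ppos)
        {
                proc_dointvec(table, write, file, buffer, length, ppos);
                if (hugepages_treat_as_movable)
                        htlb_alloc_mask = GFP_HIGHUSER_MOVABLE; /* future pool growth */
                else
                        htlb_alloc_mask = GFP_HIGHUSER;
                return 0;
        }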

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jun, 2007

1 commit

  • Some changes done a while ago to avoid pounding on ptep_set_access_flags and
    update_mmu_cache in some race situations break sun4c which requires
    update_mmu_cache() to always be called on minor faults.

    This patch reworks ptep_set_access_flags() semantics, implementations and
    callers so that it's now responsible for returning whether an update is
    necessary or not (basically whether the PTE actually changed). This allows
    fixing the sparc implementation to always return 1 on sun4c.
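
    The caller-side pattern after the rework is roughly:

        changed = ptep_set_access_flags(vma, address, ptep, entry, write_access);
        if (changed)
                update_mmu_cache(vma, address, entry);  /* sun4c returns 1 always,
                                                           so it is never skipped */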

    [akpm@linux-foundation.org: fixes, cleanups]
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: David Miller
    Cc: Mark Fortescue
    Acked-by: William Lee Irwin III
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

10 May, 2007

2 commits

  • When cpuset is configured, it breaks the strict hugetlb page reservation,
    as the accounting is done on a global variable. Such a reservation is
    completely rubbish in the presence of cpusets because the reservation is
    not checked against page availability for the current cpuset. An
    application can still potentially be OOM'ed by the kernel due to a lack
    of free htlb pages in the cpuset that the task is in. Attempting to
    enforce strict accounting with cpusets is almost impossible (or too ugly)
    because cpusets are too fluid: tasks or memory nodes can be dynamically
    moved between cpusets.

    The change of semantics for shared hugetlb mapping with cpuset is
    undesirable. However, in order to preserve some of the semantics, we fall
    back to checking against current free page availability as a best attempt and
    hopefully to minimize the impact of changing semantics that cpuset has on
    hugetlb.

    Signed-off-by: Ken Chen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • The internal hugetlb resv_huge_pages variable can permanently leak a
    nonzero value in the error path of the hugetlb page fault handler when
    hugetlb pages are used in combination with cpusets. The leaked count can
    permanently trap N hugetlb pages in an unusable "reserved" state.

    Steps to reproduce the bug:

    (1) create two cpuset, user1 and user2
    (2) reserve 50 htlb pages in cpuset user1
    (3) attempt to shmget/shmat 50 htlb page inside cpuset user2
    (4) kernel oom the user process in step 3
    (5) ipcrm the shm segment

    At this point resv_huge_pages will have a count of 49, even though
    there are no active hugetlbfs files nor hugetlb shared memory segments
    in the system. The leak is permanent and there is no recovery method
    other than system reboot. The leaked count will hold up all future use
    of that many htlb pages in all cpusets.

    The culprit is that the error path of alloc_huge_page() did not
    properly undo the change it made to resv_huge_pages, causing
    inconsistent state.
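
    A sketch of the kind of fix needed (illustrative, not the actual diff):

        spin_lock(&hugetlb_lock);
        if (vma->vm_flags & VM_MAYSHARE)
                resv_huge_pages--;              /* consume part of the reservation */
        page = dequeue_huge_page(vma, addr);
        if (!page) {
                if (vma->vm_flags & VM_MAYSHARE)
                        resv_huge_pages++;      /* error path: undo the change */
                spin_unlock(&hugetlb_lock);
                return NULL;
        }
        spin_unlock(&hugetlb_lock);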

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Cc: Adam Litke
    Cc: Martin Bligh
    Acked-by: David Gibson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

10 Feb, 2007

1 commit

  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of the huge_pte when unmapping a hugepage range. This causes data
    corruption when drop_caches is used by a sysadmin. For example, an
    application creates a hugetlb file, modifies pages, then unmaps it. While
    the hugetlb file is still alive, along comes a sysadmin doing an
    "echo 3 > /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty
    if there are no active mappings. Later, when the application remaps the
    hugetlb file, all the data are gone, triggering a catastrophic failure in
    the application.

    Not only that, the internal resv_huge_pages count will also get all
    messed up. Fix it up by marking the page dirty appropriately.
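
    A sketch of the fix (illustrative):

        /* in __unmap_hugepage_range(), as each huge pte is torn down */
        pte = huge_ptep_get_and_clear(mm, address, ptep);
        page = pte_page(pte);
        if (pte_dirty(pte))
                set_page_dirty(page);   /* keep drop_caches from discarding data */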

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

14 Dec, 2006

1 commit

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to the function and to cow_user_page(), allowing
    the implementations of these functions to check for the VM_EXEC bit.

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.
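
    That is, the helpers now look roughly like this (sketch; the real
    cow_user_page() also handles the zero-page case):

        static inline void copy_user_highpage(struct page *to, struct page *from,
                                              unsigned long vaddr,
                                              struct vm_area_struct *vma)
        {
                /* architectures may check vma->vm_flags & VM_EXEC here */
        }

        static inline void cow_user_page(struct page *dst, struct page *src,
                                         unsigned long va, struct vm_area_struct *vma)
        {
                copy_user_highpage(dst, src, va, vma);
        }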

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto