Eric Lee / smarc-fsl-linux-kernel

29 Oct, 2006

1 commit

ebed4bfc8 [PATCH] hugetlb: fix absurd HugePages_Rsvd

If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my
preliminary i_size check before attempting to allocate the page (though this
only fixes the most obvious case: more work will be needed here).
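
A minimal userspace sketch of the wrap and of the kind of check described above. It is not the kernel code: struct pool_model and both fault helpers are hypothetical, and the model simply assumes the reserve count is an unsigned long.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of the reserve accounting; not kernel structures. */
struct pool_model {
    unsigned long resv_huge_pages; /* pages promised to mappings */
    unsigned long i_size_pages;    /* file size, in huge pages */
};

/* Unchecked fault: consumes a reservation even past end of file. */
static void fault_unchecked(struct pool_model *p, unsigned long idx)
{
    assert(p != NULL);
    (void)idx;                 /* the index is never compared with i_size */
    p->resv_huge_pages--;      /* wraps to ULONG_MAX when already 0 */
}

/* Checked fault: refuse before touching the count if idx is past i_size. */
static int fault_checked(struct pool_model *p, unsigned long idx)
{
    assert(p != NULL);
    if (idx >= p->i_size_pages)
        return -1;             /* stands in for a SIGBUS result */
    if (p->resv_huge_pages > 0)
        p->resv_huge_pages--;
    return 0;
}
```

With a truncated file (i_size_pages == 0) the unchecked path turns a reserve count of 0 into ULONG_MAX, which is the hugely "negative" value a reader of /proc/meminfo would see.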

Signed-off-by: Hugh Dickins
Cc: Adam Litke
Cc: David Gibson
Cc: "Chen, Kenneth W"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2006-10-29 02:30:53 +0800

12 Oct, 2006

1 commit

502717f4e [PATCH] hugetlb: fix linked list corruption in unmap_hugepage_range()

Commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops with
the libhugetlbfs test suite. The problem is that hugetlb pages can be shared
by multiple mappings. Multiple threads can fight over page->lru in the
unmap path and bad things happen. We now serialize __unmap_hugepage_range
to avoid concurrent linked list manipulation. Such serialization is also
needed for shared page table pages on hugetlb areas. This patch fixes
the bug and also serves as a prepatch for shared page tables.

Signed-off-by: Ken Chen
Cc: Hugh Dickins
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-10-12 02:14:15 +0800

04 Oct, 2006

1 commit

fe1668ae5 [PATCH] enforce proper tlb flush in unmap_hugepage_range

Hugh spotted that a hugetlb page is freed back to the global pool before
any TLB flush is performed in unmap_hugepage_range(). This potentially allows
threads to abuse a free-alloc race condition.

The generic tlb gather code is unsuitable for use by hugetlb, so I just open
coded a page gathering list and delayed put_page until the tlb flush is
performed.
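
The ordering can be sketched in userspace as follows. Everything here is an illustrative model: struct hpage, the tlb_flushed flag, and put_hpage() are hypothetical stand-ins, and the flag merely marks the point where the real code would flush the TLB.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a huge page with a reference count. */
struct hpage {
    int refcount;
    struct hpage *next;        /* link used only by the gather list */
};

static int tlb_flushed;        /* models "stale TLB entries are gone" */

static void put_hpage(struct hpage *pg)
{
    assert(tlb_flushed);       /* dropping the page earlier is the bug */
    pg->refcount--;
}

static void unmap_range(struct hpage *pages, size_t n)
{
    struct hpage *gather = NULL, *pg;
    size_t i;

    tlb_flushed = 0;
    for (i = 0; i < n; i++) {  /* "clear the pte", but defer the put */
        pages[i].next = gather;
        gather = &pages[i];
    }

    tlb_flushed = 1;           /* stands in for the TLB flush */

    while ((pg = gather) != NULL) {   /* only now release the pages */
        gather = pg->next;
        put_hpage(pg);
    }
}
```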

Cc: Hugh Dickins
Signed-off-by: Ken Chen
Acked-by: William Irwin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-10-04 22:55:12 +0800

26 Sep, 2006

2 commits

89fa30242 [PATCH] NUMA: Add zone_to_nid function

There are many places where we need to determine the node of a zone.
Currently we use a hard-to-read sequence of pointer dereferences.
Put that into an inline function and use it throughout the VM. Maybe we can find
a way to optimize the lookup in the future.
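
As a rough illustration of the idea, with cut-down stand-in structures rather than the kernel's real definitions (the field names zone_pgdat and node_id are assumed for this sketch):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins; only the fields this sketch needs. */
struct pglist_data { int node_id; };
struct zone { struct pglist_data *zone_pgdat; };

/* One readable name for the zone -> pgdat -> node id chain. */
static inline int zone_to_nid(const struct zone *zone)
{
    assert(zone != NULL && zone->zone_pgdat != NULL);
    return zone->zone_pgdat->node_id;
}
```

Callers then write zone_to_nid(z) instead of spelling out the dereference chain, which also leaves a single place to change if the lookup is optimized later.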

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-09-26 23:48:52 +0800
4415cc8df [PATCH] Hugepages: Use page_to_nid rather than traversing zone pointers

I found two locations in hugetlb.c where we chase pointers instead of using
page_to_nid(). page_to_nid() is more efficient and can get the node directly
from page flags.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-09-26 23:48:52 +0800

23 Jun, 2006

1 commit

a43a8c39b [PATCH] tightening hugetlb strict accounting

Current hugetlb strict accounting for shared mappings always assumes the mapping
starts at file offset zero and reserves pages between zero and the size of the
file. This assumption often reserves (or locks down) a lot more pages than
necessary if the application maps at a nonzero file offset. libhugetlbfs is
one example that requires proper reservation on a shared mapping that starts at
a nonzero offset.

This patch extends the reservation and hugetlb strict accounting to support
any arbitrary pair of (offset, len), resulting in a much more robust and
accurate scheme. More importantly, it won't lock down any hugetlb pages
outside the file mapping.
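
The difference in page counts can be shown with simple arithmetic. This is only a sketch: the 2 MB huge page size and both helper names are assumptions, the old scheme is modelled as reserving from offset zero to the end of the mapped range, and how the kernel actually tracks the ranges is not shown.

```c
#include <assert.h>

#define HPAGE_SHIFT 21UL                 /* assumed 2 MB huge pages */
#define HPAGE_SIZE  (1UL << HPAGE_SHIFT)

/* Old behaviour as described: reserve from offset 0 to the end of the range. */
static unsigned long pages_reserved_from_zero(unsigned long offset,
                                              unsigned long len)
{
    assert(len > 0);
    return (offset + len + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
}

/* New behaviour: reserve only the huge pages overlapped by [offset, offset+len). */
static unsigned long pages_reserved_for_range(unsigned long offset,
                                              unsigned long len)
{
    unsigned long first, end;

    assert(len > 0);
    first = offset >> HPAGE_SHIFT;
    end = (offset + len + HPAGE_SIZE - 1) >> HPAGE_SHIFT;
    return end - first;
}
```

For a 4-page mapping that starts 100 huge pages into the file, the first helper counts 104 pages and the second counts 4.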

Signed-off-by: Ken Chen
Acked-by: Adam Litke
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-06-23 22:42:48 +0800

01 Apr, 2006

2 commits

78c997a4b [PATCH] hugetlb: don't allow free hugetlb count fall below reserved count

With strict page reservation, I think the kernel should enforce that the number
of free hugetlb pages doesn't fall below the reserved count. Currently that is
possible in the sysctl path. Add a proper check in sysctl to disallow it.
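
One way such a check could look, as a userspace sketch only: struct hpool and shrink_pool() are hypothetical, and the real sysctl path may enforce the limit differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pool counters, loosely mirroring the /proc/meminfo fields. */
struct hpool {
    unsigned long total;   /* HugePages_Total */
    unsigned long free;    /* HugePages_Free  */
    unsigned long resv;    /* HugePages_Rsvd  */
};

/* Shrink the pool toward `target`, but stop as soon as freeing one more
   page would take the free count below the reserved count.
   Returns the resulting pool size. */
static unsigned long shrink_pool(struct hpool *p, unsigned long target)
{
    assert(p != NULL);
    assert(p->free <= p->total && p->resv <= p->free);

    while (p->total > target && p->free > p->resv) {
        p->free--;
        p->total--;
    }
    return p->total;
}
```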

Signed-off-by: Ken Chen
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-04-01 04:18:50 +0800
d6692183a [PATCH] fix extra page ref count in follow_hugetlb_page

git-commit: d5d4b0aa4e1430d73050babba999365593bdb9d2
"[PATCH] optimize follow_hugetlb_page" breaks mlock on hugepage areas.

I misinterpreted the pages argument and made get_page() unconditional. It
should only get a ref count when the "pages" argument is non-null.
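
The intended behaviour, as a small userspace model (struct page_model and follow_pages() are hypothetical; the refcount increment stands in for get_page()):

```c
#include <assert.h>
#include <stddef.h>

struct page_model { int refcount; };   /* stand-in for struct page */

/* Walk n pages. Take a reference and record the page only when the caller
   supplied an output array; with pages == NULL nothing is pinned. */
static size_t follow_pages(struct page_model *src, size_t n,
                           struct page_model **pages)
{
    size_t i;

    assert(src != NULL);
    for (i = 0; i < n; i++) {
        if (pages) {
            src[i].refcount++;         /* models get_page() */
            pages[i] = &src[i];
        }
    }
    return i;
}
```

With an unconditional increment, a caller passing NULL would leave every page with a reference that nobody ever drops.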

Credit goes to Adam Litke who spotted the bug.

Signed-off-by: Ken Chen
Acked-by: Adam Litke
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-04-01 04:18:49 +0800

22 Mar, 2006

9 commits

fdb7cc590 [PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fix

Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
assuming that nodes are numbered contiguously from 0 to num_online_nodes().
Once the hotplug folks get this far, that will be false.
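
To see why the assumption breaks, here is a userspace sketch with the online nodes held in a plain bitmask. MAX_NODES and all three helpers are made up for illustration and are not the kernel's nodemask API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8   /* illustrative limit */

static int count_online(unsigned mask)
{
    int n, c = 0;

    for (n = 0; n < MAX_NODES; n++)
        if (mask & (1u << n))
            c++;
    return c;
}

/* Bogus loop: visits ids 0..count-1 whether or not they are online. */
static int nodes_assuming_contiguous(unsigned mask, int *out)
{
    int n, k = 0, num = count_online(mask);

    assert(out != NULL);
    for (n = 0; n < num; n++)
        out[k++] = n;
    return k;
}

/* Fixed loop: visits only the ids whose bit is set in the online map. */
static int nodes_from_mask(unsigned mask, int *out)
{
    int n, k = 0;

    assert(out != NULL);
    for (n = 0; n < MAX_NODES; n++)
        if (mask & (1u << n))
            out[k++] = n;
    return k;
}
```

With nodes 0 and 2 online (mask 0x5), the first loop visits 0 and 1, so it touches an offline node and never reaches node 2.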

Signed-off-by: Paul Jackson
Acked-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Jackson
2006-03-22 23:54:06 +0800
d5d4b0aa4 [PATCH] optimize follow_hugetlb_page

follow_hugetlb_page() walks a range of user virtual addresses and then fills
in a list of struct page * into an array that is passed in from the argument
list. It also gets a reference count via get_page(). For a compound page,
get_page() actually traverses back to the head page via the page_private()
macro and then adds a reference count to the head page. Since we are doing a
virt to pte lookup, the kernel already has a struct page pointer to the head
page. So instead of traversing into the small unit page struct and then
following a link back to the head page, optimize that by incrementing the
reference count directly on the head page.

The benefit is that we don't take a cache miss on accessing the page struct for
the corresponding user address and, more importantly, we don't pollute the
cache with a "not very useful" round trip of pointer chasing. This adds a
moderate performance gain on an I/O intensive database transaction
workload.

Signed-off-by: Ken Chen
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen, Kenneth W
2006-03-22 23:54:04 +0800
27a85ef1b [PATCH] hugepage: Make {alloc,free}_huge_page() local

Originally, mm/hugetlb.c just handled the hugepage physical allocation path
and its {alloc,free}_huge_page() functions were used from the arch specific
hugepage code. These days those functions are only used within mm/hugetlb.c
itself. Therefore, this patch makes them static and removes their
prototypes from hugetlb.h. This requires a small rearrangement of code in
mm/hugetlb.c to avoid a forward declaration.

This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
POWER5).

Signed-off-by: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-03-22 23:54:03 +0800
b45b5bd65 [PATCH] hugepage: Strict page reservation for hugepage inodes

These days, hugepages are demand-allocated at first fault time. There's a
somewhat dubious (and racy) heuristic when making a new mmap() to check if
there are enough available hugepages to fully satisfy that mapping.

A particularly obvious case where the heuristic breaks down is where a
process maps its hugepages not as a single chunk, but as a bunch of
individually mmap()ed (or shmat()ed) blocks without touching and
instantiating the pages in between allocations. In this case the size of
each block is compared against the total number of available hugepages.
It's thus easy for the process to become overcommitted, because each block
mapping will succeed, although the total number of hugepages required by
all blocks exceeds the number available. In particular, this defeats such
a program which will detect a mapping failure and adjust its hugepage usage
downward accordingly.

The patch below addresses this problem, by strictly reserving a number of
physical hugepages for hugepage inodes which have been mapped, but not
instantiated. MAP_SHARED mappings are thus "safe" - they will fail on
mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
trigger an OOM. (Actually SHARED mappings can technically still OOM, but
only if the sysadmin explicitly reduces the hugepage pool between mapping
and instantiation)
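
A userspace sketch of the difference between the old heuristic and strict reservation. struct hpool_model and both map helpers are hypothetical and model only the accounting, not the mmap() path itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical pool state: free pages and pages already promised. */
struct hpool_model {
    unsigned long free;
    unsigned long resv;
};

/* Old heuristic: each block is compared against all free pages,
   ignoring what earlier, still untouched mappings were promised. */
static int map_heuristic(const struct hpool_model *p, unsigned long pages)
{
    assert(p != NULL);
    return pages <= p->free ? 0 : -ENOMEM;
}

/* Strict reservation: only unreserved free pages count, and a
   successful mapping records its claim. */
static int map_strict(struct hpool_model *p, unsigned long pages)
{
    assert(p != NULL && p->resv <= p->free);
    if (pages > p->free - p->resv)
        return -ENOMEM;
    p->resv += pages;
    return 0;
}
```

With 10 free pages and three 4-page blocks mapped back to back, the heuristic accepts all three (12 pages promised), while the strict version rejects the third at map time.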

This patch appears to address the problem at hand - it allows DB2 to start
correctly, for instance, which previously suffered the failure described
above.

This patch causes no regressions on the libhugetlbfs testsuite, and makes a
test (designed to catch this problem) pass which previously failed (ppc64,
POWER5).

Signed-off-by: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-03-22 23:54:03 +0800
3935baa9b [PATCH] hugepage: serialize hugepage allocation and instantiation

Currently, no lock or mutex is held between allocating a hugepage and
inserting it into the pagetables / page cache. When we do go to insert the
page into pagetables or page cache, we recheck and may free the newly
allocated hugepage. However, since the number of hugepages in the system
is strictly limited, and it's usual to want to use all of them, this can
still lead to spurious allocation failures.

For example, suppose two processes are both mapping (MAP_SHARED) the same
hugepage file, large enough to consume the entire available hugepage pool.
If they race instantiating the last page in the mapping, they will both
attempt to allocate the last available hugepage. One will fail, of course,
returning OOM from the fault and thus causing the process to be killed,
despite the fact that the entire mapping can, in fact, be instantiated.

The patch fixes this race by the simple method of adding a (sleeping) mutex
to serialize the hugepage fault path between allocation and insertion into
pagetables and/or page cache. It would be possible to avoid the
serialization by catching the allocation failures, waiting on some
condition, then rechecking to see if someone else has instantiated the page
for us. Given the likely frequency of hugepage instantiations, it seems
very doubtful it's worth the extra complexity.

This patch causes no regression on the libhugetlbfs testsuite, and one
test, which can trigger this race now passes where it previously failed.

Actually, the test still sometimes fails, though less often and only as a
shmat() failure, rather than processes getting OOM killed by the VM. The dodgy
heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage
space aren't protected by the new mutex, and it would be ugly to make them so,
so there's still a race there. Another patch to replace those tests with
something saner, for this reason as well as others, is coming...

Signed-off-by: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-03-22 23:54:03 +0800
79ac6ba40 [PATCH] hugepage: Small fixes to hugepage clear/copy path

Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
own functions for clarity. As we do so, we add some checks of need_resched
- we are, after all, copying megabytes of memory here. We also add
might_sleep() accordingly. We generally dropped locks around the clear and
copy already, but not everyone has PREEMPT enabled, so we should still be
checking explicitly.

For this to work, we need to remove the clear_huge_page() from
alloc_huge_page(), which is called with the page_table_lock held in the COW
path. We move the clear_huge_page() to just after the alloc_huge_page() in
the hugepage no-page path. In the COW path, the new page is about to be
copied over, so clearing it was just a waste of time anyway. So as a side
effect we also fix the fact that we held the page_table_lock for far too
long in this path by calling alloc_huge_page() under it.

It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).

Signed-off-by: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-03-22 23:54:03 +0800
8f860591f [PATCH] Enable mprotect on huge pages

2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
mprotect.

From: David Gibson

Remove a test from the mprotect() path which checks that the mprotect()ed
range on a hugepage VMA is hugepage aligned (yes, really, the sense of
is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

In fact, we don't need this test. If the given addresses match the
beginning/end of a hugepage VMA they must already be suitably aligned. If
they don't, then mprotect_fixup() will attempt to split the VMA. The very
first test in split_vma() will check for a badly aligned address on a
hugepage VMA and return -EINVAL if necessary.
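
The alignment test that split_vma() is said to perform can be sketched like this. The 2 MB huge page size is an assumption, and check_split_addr() is a hypothetical userspace helper rather than the kernel function.

```c
#include <assert.h>
#include <errno.h>

#define HPAGE_SIZE (1UL << 21)            /* assumed 2 MB huge pages */
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

/* Reject a split address that is not hugepage aligned when the VMA is a
   hugepage VMA; ordinary VMAs are not restricted by this test. */
static int check_split_addr(unsigned long addr, int is_hugepage_vma)
{
    assert(is_hugepage_vma == 0 || is_hugepage_vma == 1);
    if (is_hugepage_vma && (addr & ~HPAGE_MASK))
        return -EINVAL;
    return 0;
}
```

Because any address that would split a hugepage VMA passes through a test of this kind, a separate alignment check in the mprotect() path adds nothing.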

From: "Chen, Kenneth W"

On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The
identity of a hugetlb pte is lost when changing page protection via mprotect.
A page fault that occurs later will trigger a bug check in huge_pte_alloc().

The fix is to always make the new pte a hugetlb pte and also to clean up
legacy code where _PAGE_PRESENT is forced on in the pre-faulting days.

Signed-off-by: Zhang Yanmin
Cc: David Gibson
Cc: "David S. Miller"
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: William Lee Irwin III
Signed-off-by: Ken Chen
Signed-off-by: Nishanth Aravamudan
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zhang, Yanmin
2006-03-22 23:54:03 +0800
7835e98b2 [PATCH] remove set_page_count() outside mm/

set_page_count usage outside mm/ is limited to setting the refcount to 1.
Remove set_page_count from outside mm/, and replace those users with
init_page_count() and set_page_refcounted().

This allows more debug checking, and tighter control on how code is allowed
to play around with page->_count.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2006-03-22 23:54:02 +0800
a482289d4 [PATCH] hugepage allocator cleanup

Insert "fresh" huge pages into the hugepage allocator by the same means as
they are freed back into it. This reduces code size and allows
enqueue_huge_page to be inlined into the hugepage free fastpath.

Eliminate occurrences of hugepages on the free list with non-zero refcount.
This can allow stricter refcount checks in future. Also required for
lockless pagecache.

Signed-off-by: Nick Piggin

"This patch also eliminates a leak "cleaned up" by re-clobbering the
refcount on every allocation from the hugepage freelists. With respect to
the lockless pagecache, the crucial aspect is to eliminate unconditional
set_page_count() to 0 on pages with potentially nonzero refcounts, though
closer inspection suggests the assignments removed are entirely spurious."

Acked-by: William Irwin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2006-03-22 23:53:58 +0800

15 Feb, 2006

1 commit

41d78ba55 [PATCH] compound page: use page[1].lru

If a compound page has its own put_page_testzero destructor (the only current
example is free_huge_page), that is noted in page[1].mapping of the compound
page. But that's rather a poor place to keep it: functions which call
set_page_dirty_lock after get_user_pages (e.g. Infiniband's
__ib_umem_release) ought to be checking first, otherwise set_page_dirty is
liable to crash on what's not the address of a struct address_space.

And now I'm about to make that worse: it turns out that every compound page
needs a destructor, so we can no longer rely on hugetlb pages going their own
special way, to avoid further problems of page->mapping reuse. For example,
not many people know that: on 50% of i386 -Os builds, the first tail page of a
compound page purports to be PageAnon (when its destructor has an odd
address), which surprises page_add_file_rmap.
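
The PageAnon surprise comes from the low bit of page->mapping doubling as the anonymous-page flag. A userspace sketch, where struct page_model is a hypothetical stand-in for struct page and the addresses used with it are invented:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_MAPPING_ANON 1UL   /* low bit of ->mapping marks an anon page */

struct page_model { void *mapping; };   /* stand-in for struct page */

/* Returns nonzero when the mapping field has the anon bit set, which is
   also true for any odd value stored there for another purpose. */
static int page_looks_anon(const struct page_model *page)
{
    assert(page != NULL);
    return ((uintptr_t)page->mapping & PAGE_MAPPING_ANON) != 0;
}
```

A destructor pointer stored in the mapping field passes this test whenever the function happens to sit at an odd address, which unaligned -Os code makes likely about half the time.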

Keep the compound page destructor in page[1].lru.next instead. And to free up
the common pairing of mapping and index, also move compound page order from
index to lru.prev. Slab reuses page->lru too: but if we ever need slab to use
compound pages, it can easily stack its use above this.

(akpm: decoded version of the above: the tail pages of a compound page now
have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
caller to check that they're not compound pages before doing the dirty).

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2006-02-15 08:09:33 +0800

08 Feb, 2006

2 commits

0df420d8b [PATCH] hugetlbpage: return VM_FAULT_OOM on oom

Remove wrong and misleading comments.

Return VM_FAULT_OOM if the hugetlbpage fault handler cannot allocate a
page. do_no_page will end up doing do_exit(SIGKILL).

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-02-08 08:12:31 +0800
a2dfef694 [PATCH] Hugepages need clear_user_highpage() not clear_highpage()

When hugepages are newly allocated to a file in mm/hugetlb.c, we clear them
with a call to clear_highpage() on each of the subpages. We should be
using clear_user_highpage(): on powerpc, at least, clear_highpage() doesn't
correctly mark the page as icache dirty so if the page is executed shortly
after it's possible to get strange results.

Signed-off-by: David Gibson
Acked-by: William Lee Irwin III
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-02-08 08:12:31 +0800

06 Feb, 2006

1 commit

64b4a954b [PATCH] hugetlb: add comment explaining reasons for Bus Errors

I just spent some time researching a Bus Error. Turns out that the huge
page fault handler can return VM_FAULT_SIGBUS for various conditions where
no huge page is available.

Add a note explaining the reasoning in the source.

Signed-off-by: Christoph Lameter
Acked-by: William Lee Irwin III
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-02-06 03:06:53 +0800

09 Jan, 2006

1 commit

aea47ff36 [PATCH] mm: make hugepages obey cpusets.

See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2
http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2

Make hugepages obey cpusets.

Signed-off-by: Christoph Lameter
Acked-by: William Irwin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-01-09 12:12:43 +0800

07 Jan, 2006

7 commits

6bda666a0 [PATCH] hugepages: fold find_or_alloc_pages into huge_no_page()

The number of parameters for find_or_alloc_page increases significantly after
policy support is added to huge pages. Simplify the code by folding
find_or_alloc_huge_page() into hugetlb_no_page().

Adam Litke objected to this piece in an earlier patch but I think this is a
good simplification. Diffstat shows that we can get rid of almost half of the
lines of find_or_alloc_page(). If we can find no consensus then let's simply
drop this patch.

Signed-off-by: Christoph Lameter
Cc: Andi Kleen
Acked-by: William Lee Irwin III
Cc: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-01-07 00:33:23 +0800
5da7ca860 [PATCH] Add NUMA policy support for huge pages.

The huge_zonelist() function in the memory policy layer provides a list of
zones ordered by NUMA distance. The hugetlb layer will walk that list looking
for a zone that has available huge pages but is also in the nodeset of the
current cpuset.

This patch does not contain the folding of find_or_alloc_huge_page() that was
controversial in the earlier discussion.

Signed-off-by: Christoph Lameter
Cc: Andi Kleen
Acked-by: William Lee Irwin III
Cc: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-01-07 00:33:23 +0800
96df9333c [PATCH] mm: dequeue a huge page near to this node

This was discussed at
http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2

This patch changes the dequeueing to select a huge page near the node
executing instead of always beginning to check for free nodes from node 0.
This will result in a placement of the huge pages near the executing
processor improving performance.

The existing implementation can place the huge pages far away from the
executing processor causing significant degradation of performance. The
search starting from zero also means that the lower zones quickly run out
of memory. Selecting a huge page near the process distributes the huge
pages better.
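
A minimal sketch of nearest-first dequeueing. The per-node free counts and the distance-ordered node list are hypothetical inputs here; in the kernel the ordering would come from the node's zonelist.

```c
#include <assert.h>
#include <stddef.h>

/* Take one free huge page from the first node in `order` that has any.
   `order` lists node ids nearest-first for the executing CPU.
   Returns the node used, or -1 if every listed node is empty. */
static int dequeue_near(unsigned long *free_per_node, const int *order,
                        size_t n)
{
    size_t i;

    assert(free_per_node != NULL && order != NULL);
    for (i = 0; i < n; i++) {
        int nid = order[i];

        if (free_per_node[nid] > 0) {
            free_per_node[nid]--;
            return nid;
        }
    }
    return -1;
}
```

A search that always starts at node 0 is the special case where `order` is 0, 1, 2, ... for every caller, which is why the low nodes drain first.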

Signed-off-by: Christoph Lameter
Cc: William Lee Irwin III
Cc: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2006-01-07 00:33:23 +0800
1e8f889b1 [PATCH] Hugetlb: Copy on Write support

Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
supported. This helps us to safely use hugetlb pages in many more
applications. The patch makes the following changes. If needed, I also have
it broken out according to the following paragraphs.

1. Add a pair of functions to set/clear write access on huge ptes. The
writable check in make_huge_pte is moved out to the caller for use by COW
later.

2. Hugetlb copy-on-write requires special case handling in the following
situations:

- copy_hugetlb_page_range() - Copied pages must be write protected so
a COW fault will be triggered (if necessary) if those pages are written
to.

- find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
page cache. MAP_PRIVATE pages still need to be locked however.

3. Provide hugetlb_cow() and calls from hugetlb_fault() and
hugetlb_no_page() which handles the COW fault by making the actual copy.

4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps
will be allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED
mapping check.

Signed-off-by: David Gibson
Signed-off-by: Adam Litke
Cc: William Lee Irwin III
Cc: "Seth, Rohit"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2006-01-07 00:33:23 +0800
86e5216f8 [PATCH] Hugetlb: Reorganize hugetlb_fault to prepare for COW

This patch splits the "no_page()" type activity into its own function,
hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults
and delegates to the appropriate handler depending on the type of fault.
Right now we still have only hugetlb_no_page() but a later patch introduces a
COW fault.

Signed-off-by: David Gibson
Signed-off-by: Adam Litke
Cc: William Lee Irwin III
Cc: "Seth, Rohit"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2006-01-07 00:33:22 +0800
85ef47f74 [PATCH] Hugetlb: Rename find_lock_page to find_or_alloc_huge_page

find_lock_huge_page() isn't a great name, since it does extra things not
analogous to find_lock_page(). Rename it find_or_alloc_huge_page() which is
closer to the mark.

Signed-off-by: David Gibson
Signed-off-by: Adam Litke
Cc: William Lee Irwin III
Cc: "Seth, Rohit"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2006-01-07 00:33:22 +0800
f0916794f [PATCH] Hugetlb: Remove duplicate i_size check

cleanup

Signed-off-by: David Gibson
Signed-off-by: Adam Litke
Cc: William Lee Irwin III
Cc: "Seth, Rohit"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2006-01-07 00:33:22 +0800

23 Nov, 2005

1 commit

0bd0f9fb1 [PATCH] hugetlb: fix race in set_max_huge_pages for multiple updaters of nr_huge_pages

If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously
it is possible for the nr_huge_pages variable to become incorrect. There
is no locking in the set_max_huge_pages function around
alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers
to alloc_fresh_huge_page could race against each other as could a call to
alloc_fresh_huge_page and a call to update_and_free_page. This patch just
expands the area covered by the hugetlb_lock to cover the call into
alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section
is performance critical where more specific locking would be needed.

My reproducer was to run a couple copies of the following script
simultaneously

while [ true ]; do
    echo 1000 > /proc/sys/vm/nr_hugepages
    echo 500 > /proc/sys/vm/nr_hugepages
    echo 750 > /proc/sys/vm/nr_hugepages
    echo 100 > /proc/sys/vm/nr_hugepages
    echo 0 > /proc/sys/vm/nr_hugepages
done

and then watch /proc/meminfo and eventually you will see things like

HugePages_Total: 100
HugePages_Free: 109

After applying the patch all seemed well.

Signed-off-by: Eric Paris
Acked-by: William Irwin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric Paris
2005-11-23 01:13:43 +0800

07 Nov, 2005

2 commits

99697dc02 [PATCH] unexport hugetlb_total_pages

I didn't find any possible modular usage in the kernel.

Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2005-11-07 23:54:06 +0800
3c726f8de [PATCH] ppc64: support 64k pages

Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
base page size to 64K. The resulting kernel still boots on any
hardware. On current machines with 4K pages support only, the kernel
will maintain 16 "subpages" for each 64K page transparently.

Note that while real 64K capable HW has been tested, the current patch
will not enable it yet as such hardware is not released yet, and I'm
still verifying with the firmware architects the proper way to get the
information from the newer hypervisors.

Signed-off-by: Benjamin Herrenschmidt
Signed-off-by: Linus Torvalds

Benjamin Herrenschmidt
2005-11-07 08:56:47 +0800

30 Oct, 2005

5 commits

4c8872659 [PATCH] hugetlb: demand fault handler

Below is a patch to implement demand faulting for huge pages. The main
motivation for changing from prefaulting to demand faulting is so that huge
page memory areas can be allocated according to NUMA policy.

Thanks to consolidated hugetlb code, switching the behavior requires changing
only one fault handler. The bulk of the patch just moves the logic from
hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().

Signed-off-by: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2005-10-30 12:40:43 +0800
508034a32 [PATCH] mm: unmap_vmas with inner ptlock

Remove the page_table_lock from around the calls to unmap_vmas, and replace
the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
now safe to descend without page_table_lock.

Don't attempt fancy locking for hugepages, just take page_table_lock in
unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
does unmap_vmas have much use for its mm arg now.

The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
page_table_lock: if they're implemented at all, they typically come down to
flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
(which we already audited for the mprotect case).

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-30 12:40:41 +0800
c74df32c7 [PATCH] mm: ptd_alloc take ptlock

Second step in pushing down the page_table_lock. Remove the temporary
bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
to hold page_table_lock, whether it's on init_mm or a user mm; take
page_table_lock internally to check if a racing task already allocated.

Convert their callers from common code. But avoid coming back to change them
again later: instead of moving the spin_lock(&mm->page_table_lock) down,
switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
encapsulate the mapping+locking and unlocking+unmapping together, and in the
end may use alternatives to the mm page_table_lock itself.

These callers all hold mmap_sem (some exclusively, some not), so at no level
can a page table be whipped away from beneath them; and pte_alloc uses the
"atomic" pmd_present to test whether it needs to allocate. It appears that on
all arches we can safely descend without page_table_lock.

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-30 12:40:40 +0800
365e9c87a [PATCH] mm: update_hiwaters just in time

update_mem_hiwater has attracted various criticisms, in particular from those
concerned with mm scalability. Originally it was called whenever rss or
total_vm got raised. Then many of those callsites were replaced by a timer
tick call from account_system_time. Now Frank van Maarseveen reports that to
be found inadequate. How about this? Works for Frank.

Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
by 1): those are hot paths. Do the opposite, update only when about to lower
rss (usually by many), or just before final accounting in do_exit. Handle
mm->hiwater_vm in the same way, though it's much less of an issue. Demand
that whoever collects these hiwater statistics do the work of taking the
maximum with rss or total_vm.
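
The convention can be modelled in a few lines of userspace C. struct mm_model and the add/drop/peak helpers are hypothetical; only the name update_hiwater_rss comes from the text above, and it is written here as a function rather than a macro.

```c
#include <assert.h>
#include <stddef.h>

struct mm_model {                 /* illustrative subset of mm state */
    unsigned long rss;
    unsigned long hiwater_rss;
};

/* Called just before rss is lowered: remember the peak seen so far. */
static void update_hiwater_rss(struct mm_model *mm)
{
    if (mm->hiwater_rss < mm->rss)
        mm->hiwater_rss = mm->rss;
}

/* Raising rss is the hot path, so it does not touch hiwater_rss. */
static void add_rss(struct mm_model *mm, unsigned long pages)
{
    assert(mm != NULL);
    mm->rss += pages;
}

static void drop_rss(struct mm_model *mm, unsigned long pages)
{
    assert(mm != NULL && pages <= mm->rss);
    update_hiwater_rss(mm);
    mm->rss -= pages;
}

/* The collector's job: the true peak is max(hiwater_rss, current rss). */
static unsigned long peak_rss(const struct mm_model *mm)
{
    assert(mm != NULL);
    return mm->hiwater_rss > mm->rss ? mm->hiwater_rss : mm->rss;
}
```

Between updates hiwater_rss may lag behind rss, which is why the reader has to take the maximum itself.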

And there has been no collector of these hiwater statistics in the tree. The
new convention needs an example, so match Frank's usage by adding a VmPeak
line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
(High-Water-Mark or High-Water-Memory).

There was a particular anomaly during mremap move, that hiwater_vm might be
captured too high. A fleeting such anomaly remains, but it's quickly
corrected now, whereas before it would stick.

What locking? None: if the app is racy then these statistics will be racy,
it's not worth any overhead to make them exact. But whenever it suits,
hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
page_table_lock (for now) or with preemption disabled (later on): without
going to any trouble, minimize the time between reading current values and
updating, to minimize those occasions when a racing thread bumps a count up
and back down in between.

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-30 12:40:39 +0800
4294621f4 [PATCH] mm: rss = file_rss + anon_rss

I was lazy when we added anon_rss, and chose to change as few places as
possible. So currently each anonymous page has to be counted twice, in rss
and in anon_rss. Which won't be so good if those are atomic counts in some
configurations.

Change that around: keep file_rss and anon_rss separately, and add them
together (with get_mm_rss macro) when the total is needed - reading two
atomics is much cheaper than updating two atomics. And update anon_rss
upfront, typically in memory.c, not tucked away in page_add_anon_rmap.
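
In outline, and only as a userspace sketch (plain unsigned longs stand in for the possibly atomic counters, get_mm_rss is a function here rather than a macro, and the two add helpers are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

struct mm_counters {
    unsigned long file_rss;   /* file-backed pages */
    unsigned long anon_rss;   /* anonymous pages   */
};

/* Each page is counted once, in exactly one of the two counters. */
static void add_file_page(struct mm_counters *mm)
{
    assert(mm != NULL);
    mm->file_rss++;
}

static void add_anon_page(struct mm_counters *mm)
{
    assert(mm != NULL);
    mm->anon_rss++;
}

/* The total is computed on demand: two reads instead of a second update. */
static unsigned long get_mm_rss(const struct mm_counters *mm)
{
    assert(mm != NULL);
    return mm->file_rss + mm->anon_rss;
}
```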

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-30 12:40:38 +0800

21 Oct, 2005

1 commit

ac9b9c667 [PATCH] Fix handling spurious page fault for hugetlb region

This reverts commit 3359b54c8c07338f3a863d1109b42eebccdcf379 and
replaces it with a cleaner version that is purely based on page table
operations, so that the synchronization between inode size and hugetlb
mappings becomes moot.

Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-21 00:02:07 +0800

20 Oct, 2005

1 commit

1c59827d1 [PATCH] mm: hugetlb truncation fixes

hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
forgets that: crashes and misaccounting ensue.

copy_hugetlb_page_range better grab the src page_table_lock since we don't
want to guess what happens if concurrently truncated. unmap_hugepage_range
rss accounting must not assume the full range was mapped. follow_hugetlb_page
must guard with page_table_lock and be prepared to exit early.

Restyle copy_hugetlb_page_range with a for loop like the others there.

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2005-10-20 14:04:30 +0800

05 Sep, 2005

1 commit

7bf07f3d4 [PATCH] hugetlb: move stale pte check into huge_pte_alloc()

Initial Post (Wed, 17 Aug 2005)

This patch moves the
    if (! pte_none(*pte))
        hugetlb_clean_stale_pgtable(pte);
logic into huge_pte_alloc() so all of its callers can be immune to the bug
described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

> It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
> huge_pte_alloc() might return a pmd that points to a PTE page. It happens
> if the virtual address for hugetlb mmap is recycled from previously used
> normal page mmap. free_pgtables() might not scrub the pmd entry on
> munmap and hugetlb_prefault skips on any pmd presence regardless what type
> it is.

Unless I am missing something, it seems more correct to place the check inside
huge_pte_alloc() to prevent the same bug wherever a huge pte is allocated.
It also allows checking for this condition when lazily faulting huge pages
later in the series.

Signed-off-by: Adam Litke
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2005-09-05 15:05:46 +0800