12 Feb, 2009

1 commit

  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and to account for filesystem quota. If either fails, the fault
    fails. The impact is that quota ends up being accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To
    help prevent this mistake from happening again, it improves the
    documentation of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman
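
    A minimal sketch of the intended single-accounting flow at fault time, as
    described above. The helper names are illustrative, not the real
    mm/hugetlb.c API: quota should be charged either when the reservation is
    made or at fault time, never both.

    /* Hypothetical sketch -- helper names are illustrative, not the real API. */
    static struct page *fault_in_hugepage(struct vm_area_struct *vma,
                                          unsigned long addr)
    {
            int charged = 0;
            struct page *page;

            if (!vma_has_reserved_page(vma, addr)) {
                    /* No reserve: charge filesystem quota now, at fault time only. */
                    if (hugetlbfs_take_quota(vma->vm_file, 1))
                            return NULL;            /* quota exhausted -> fault fails */
                    charged = 1;
            }

            page = dequeue_or_alloc_hugepage(vma, addr);
            if (!page) {
                    if (charged)
                            hugetlbfs_release_quota(vma->vm_file, 1);
                    return NULL;                    /* allocation failed -> fault fails */
            }
            return page;
    }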
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it is too easy for applications to be
    SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs: it can trigger an
    OOM storm when hugepage pools are too small, and lockups and corrupted
    counters otherwise. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE, but prevents VM_ACCOUNT from being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
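
    A rough sketch of the flag policy described above for hugetlb-backed
    mappings. This is illustrative only (the real change sits in the mmap
    path and hugetlb_reserve_pages()); the helper below is made up.

    /* Illustrative only: how vm_flags could be chosen for a hugetlb mapping. */
    static unsigned long hugetlb_vm_flags(unsigned long vm_flags, int noreserve)
    {
            /*
             * hugetlbfs does its own reservation accounting in hugepage-sized
             * units, so the core VM must never account it: keep VM_ACCOUNT clear.
             */
            vm_flags &= ~VM_ACCOUNT;

            /*
             * The application may opt out of reservations (MAP_NORESERVE or
             * SHM_NORESERVE) at the risk of being killed later.
             */
            if (noreserve)
                    vm_flags |= VM_NORESERVE;

            return vm_flags;
    }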
     

07 Jan, 2009

4 commits

  • At this point we already know that 'addr' is not NULL, so get rid of the
    redundant 'if'. gcc probably eliminates it during optimization anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
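
    The pattern being cleaned up, shown on a made-up __weak hook (the function
    name is illustrative, not the one touched by the patch):

    /*
     * Before (illustrative):
     *         if (addr)
     *                 pr_info("addr=%p\n", addr);
     * After -- 'addr' is already known to be non-NULL here:
     */
    void __weak arch_report_addr(void *addr)
    {
            pr_info("addr=%p\n", addr);
    }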
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
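
    The warning refers to the pattern below; the function names are made up,
    only the shape of the fix matters:

    static void adjust_pool(struct hstate *h, int delta);  /* some void helper */

    /* sparse: "returning void-valued expression" */
    static void shrink_pool(struct hstate *h)
    {
            return adjust_pool(h, -1);
    }

    /* Preferred form: call the helper, then return plainly (or not at all). */
    static void shrink_pool_fixed(struct hstate *h)
    {
            adjust_pool(h, -1);
    }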
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    where a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
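
    A small userspace sketch for eyeballing the two fields: it simply echoes
    the relevant lines from /proc/self/smaps (the fields are of course only
    present on kernels carrying these patches).

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/smaps", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    /* e.g. "KernelPageSize:        4 kB" and "MMUPageSize: ..." */
                    if (!strncmp(line, "KernelPageSize:", 15) ||
                        !strncmp(line, "MMUPageSize:", 12))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }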
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2008

1 commit

  • Oops. Part of the hugetlb private reservation code was not fully
    converted to use hstates.

    When a huge page must be unmapped from VMAs due to a failed COW,
    HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of
    the page size being used. This works if the VMA is using the default
    huge page size. Otherwise we might unmap too much, too little, or
    trigger a BUG_ON. Rare but serious -- fix it.

    Signed-off-by: Adam Litke
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
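
    The shape of the fix, sketched (the exact call site is the failed-COW
    unmap path; surrounding code elided): derive the size from the VMA's
    hstate rather than hardcoding HPAGE_SIZE.

    struct hstate *h = hstate_vma(vma);

    /* Before: always stepped by the default huge page size. */
    /*   unmap_hugepage_range(iter_vma, address, address + HPAGE_SIZE, page); */

    /* After: honour the page size actually backing this VMA. */
    unmap_hugepage_range(iter_vma, address,
                         address + huge_page_size(h), page);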
     

07 Nov, 2008

2 commits

  • As we can determine exactly when a gigantic page is in use we can optimise
    the common regular page cases by pulling out gigantic page initialisation
    into its own function. As gigantic pages are never released to the buddy
    allocator we do not need a destructor. This effectively reverts the
    previous change to the main buddy allocator. It also adds a paranoid
    check to ensure we never release gigantic pages from hugetlbfs to the
    main buddy allocator.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
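
    Roughly the split being described, with the paranoid check on the free
    side (a sketch, not the literal patch):

    /* Initialisation: pick the path based on the page order. */
    if (h->order >= MAX_ORDER)
            prep_compound_gigantic_page(page, h->order);  /* tail pages set up by hand */
    else
            prep_compound_page(page, h->order);           /* common, buddy-sized case */

    /* Release side: gigantic pages must never reach the buddy allocator. */
    VM_BUG_ON(h->order >= MAX_ORDER);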
     
  • When working with hugepages, hugetlbfs assumes that those hugepages are
    smaller than MAX_ORDER. Specifically it assumes that the mem_map is
    contiguous and uses that to optimise access to the elements of the mem_map
    that represent the hugepage. Gigantic pages (such as 16GB pages on
    powerpc) by definition are of greater order than MAX_ORDER (larger than
    MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of
    the buddy allocator guarantee that the mem_map is contiguous within
    maximally aligned areas of MAX_ORDER_NR_PAGES pages.

    This patch adds new mem_map accessors and iterator helpers which handle
    any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to
    implement gigantic page versions of copy_huge_page and clear_huge_page,
    and to allow follow_hugetlb_page to handle gigantic pages.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
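
    A sketch of the kind of accessors added (similar helpers went into
    mm/internal.h): within a MAX_ORDER_NR_PAGES block plain pointer
    arithmetic is fine, across a boundary pfn_to_page() is used instead.

    /* Return the page 'offset' pages after base, tolerating discontiguity. */
    static inline struct page *mem_map_offset(struct page *base, int offset)
    {
            if (unlikely(offset >= MAX_ORDER_NR_PAGES))
                    return pfn_to_page(page_to_pfn(base) + offset);
            return base + offset;
    }

    /* Iterator helper: step to the next page, re-resolving at block boundaries. */
    static inline struct page *mem_map_next(struct page *iter,
                                            struct page *base, int offset)
    {
            if (unlikely((offset & (MAX_ORDER_NR_PAGES - 1)) == 0)) {
                    unsigned long pfn = page_to_pfn(base) + offset;

                    if (!pfn_valid(pfn))
                            return NULL;
                    return pfn_to_page(pfn);
            }
            return iter + 1;
    }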
     

23 Oct, 2008

1 commit


20 Oct, 2008

3 commits

  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for coredumping and hugepages can't be core dumped.

    However, we have now implemented hugepage coredumping. Therefore we should
    support the zero page for hugepages too.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
      A normal page is checked as follows:

        static inline int use_zero_page(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                        return 0;

                return !vma->vm_ops || !vma->vm_ops->fault;
        }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
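
    The core of the split, sketched in simplified form (the list names follow
    the patchset, the helper is condensed): the single active/inactive pair
    becomes a pair per page type, and the anon/file decision hangs off the
    page itself.

    enum lru_list {
            LRU_INACTIVE_ANON,
            LRU_ACTIVE_ANON,
            LRU_INACTIVE_FILE,
            LRU_ACTIVE_FILE,
            NR_LRU_LISTS,
    };

    /* Swap-backed pages (anon, tmpfs) go on the anon lists, the rest on file. */
    static inline enum lru_list page_lru(struct page *page, int active)
    {
            enum lru_list base = page_is_file_cache(page) ?
                                 LRU_INACTIVE_FILE : LRU_INACTIVE_ANON;

            return base + (active ? 1 : 0);
    }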
     

17 Oct, 2008

1 commit

  • The page fault path for normal pages, if the fault is neither a no-page
    fault nor a write-protect fault, will update the DIRTY and ACCESSED bits
    in the page table appropriately.

    The hugepage fault path, however, does not do this, handling only no-page
    or write-protect type faults. It assumes that either the ACCESSED and
    DIRTY bits are irrelevant for hugepages (usually true, since they are
    never swapped) or that they are handled by the arch code.

    This is inconvenient for some software-loaded TLB architectures, where the
    _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write)
    access to the page at the TLB miss. This could be worked around in the
    arch TLB miss code, but the TLB miss fast path is easier to keep simple
    if the hugetlb_fault() path handles this, as the normal page fault path
    does.

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
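
    The shape of the change, sketched: when the huge pte is already present
    and the fault is not a COW, just refresh the access/dirty bits, much as
    the normal fault path does (details and locking elided).

    /* Sketch: inside hugetlb_fault(), once ptep points at a present entry. */
    entry = huge_ptep_get(ptep);
    entry = pte_mkyoung(entry);                       /* mark ACCESSED */
    if (write_access)
            entry = pte_mkdirty(entry);               /* mark DIRTY on write faults */
    if (huge_ptep_set_access_flags(vma, address, ptep, entry, write_access))
            update_mmu_cache(vma, address, entry);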
     

13 Aug, 2008

3 commits

  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc(), which can sleep with the
    mm->page_table_lock held. This is wrong and triggers a may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory,
    without actually making the change. The second phase actually commits the
    change. This patch makes use of this by checking the reservations before
    the page_table_lock is taken, triggering any necessary allocations. The
    check can then be safely repeated within the lock without any allocations
    being required.

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
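
    The two-phase pattern described above, sketched around the fault path
    (the helper names follow the patch; the surrounding code is elided):

    /* Phase 1: before taking mm->page_table_lock.  May kmalloc(), may sleep. */
    if (vma_needs_reservation(h, vma, address) < 0)
            return VM_FAULT_OOM;

    spin_lock(&mm->page_table_lock);
    /* ... allocate and instantiate the huge page under the lock ... */

    /* Phase 2: commit the already-prepared region entry -- no allocation here. */
    vma_commit_reservation(h, vma, address);
    spin_unlock(&mm->page_table_lock);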
     
  • The s390 software large page emulation implements shared page tables by
    using page->index of the first tail page from a compound large page to
    store page table information. This is set up in arch_prepare_hugepage(),
    which is called from alloc_fresh_huge_page_node().

    A similar call to arch_prepare_hugepage() is missing for surplus large
    pages that are allocated in alloc_buddy_huge_page(), which breaks the
    software emulation mode for (surplus) large pages on s390. This patch
    adds the missing call to arch_prepare_hugepage(). It will have no effect
    on other architectures where arch_prepare_hugepage() is a nop.

    Also, use the correct order in the error path in alloc_fresh_huge_page_node().

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Acked-by: Nick Piggin
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
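
    A sketch of the missing call being added (error handling condensed);
    arch_prepare_hugepage() is a nop on architectures other than s390:

    /* In alloc_buddy_huge_page(), mirror what the fresh-page path already does. */
    page = alloc_pages(htlb_alloc_mask | __GFP_COMP | __GFP_NOWARN,
                       huge_page_order(h));
    if (page && arch_prepare_hugepage(page)) {
            __free_pages(page, huge_page_order(h));
            page = NULL;                              /* treat as allocation failure */
    }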
     

07 Aug, 2008

1 commit


02 Aug, 2008

2 commits

  • Some platforms decide whether they support huge pages at boot time. On
    these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is
    set to 0 when there is no such support.

    The patches to introduce multiple huge page support broke that, causing
    the kernel to crash at boot time on machines such as POWER3 which lack
    support for multiple page sizes.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
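
    One plausible form of the guard (illustrative; the actual fix lives in the
    hugetlb initialisation path): bail out of hugetlb setup when the platform
    reported no huge page support at boot.

    static int __init hugetlb_init(void)
    {
            /* HPAGE_SHIFT is 0 on platforms that found no hugepage support. */
            if (HPAGE_SHIFT == 0)
                    return 0;

            /* ... normal hstate setup and pool allocation ... */
            return 0;
    }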
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
    mm/hugetlb.c must #include
    video: Fix up hp6xx driver build regressions.
    sh: defconfig updates.
    sh: Kill off stray mach-rsk7203 reference.
    serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
    sh: Move out individual boards without mach groups.
    sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
    sh: Allow SH-3 and SH-5 to use common headers.
    sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
    sh/maple: clean maple bus code
    sh: More header path fixups for mach dir refactoring.
    sh: Move out the solution engine headers to arch/sh/include/mach-se/
    sh: I2C fix for AP325RXA and Migo-R
    sh: Shuffle the board directories in to mach groups.
    sh: dma-sh: Fix up dreamcast dma.h mach path.
    sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
    sh: Add ARCH_DEFCONFIG entries for sh and sh64.
    sh: Fix compile error of Solution Engine
    sh: Proper __put_user_asm() size mismatch fix.
    sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
    ...

    Linus Torvalds
     

30 Jul, 2008

1 commit

  • This patch fixes the following build error on sh caused by
    commit aa888a74977a8f2120ae9332376e179c39a6b07d
    (hugetlb: support larger than MAX_ORDER):

    ...
    CC mm/hugetlb.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    /home/bunk/linux/kernel-2.6/git/linux-2.6/mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
    make[2]: *** [mm/hugetlb.o] Error 1

    Reported-by: Adrian Bunk
    Signed-off-by: Adrian Bunk
    Signed-off-by: Paul Mundt

    Adrian Bunk
     

29 Jul, 2008

2 commits

  • This patch fixes the following build error on sh caused by commit
    aa888a74977a8f2120ae9332376e179c39a6b07d ("hugetlb: support larger than
    MAX_ORDER"):

    mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
    mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'

    Signed-off-by: Adrian Bunk
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case with KVM, where a secondary tlb miss will
    walk sptes in hardware and refill the secondary tlb transparently to
    software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids requiring the code that
    tears down the secondary MMU mappings to implement logic like tlb_gather
    in zap_page_range, which would need many IPIs to flush other cpu tlbs for
    each fixed number of sptes unmapped.

    To give an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and a secondary tlb like the shadow-pagetable
    layer of kvm), or a remote DMA in software like XPMEM (hence the need to
    schedule in XPMEM code to send the invalidate to the remote node, while
    there is no need to schedule in kvm/gru as it's an immediate event like
    invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track of whether the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter, increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used at driver startup, a failure can be gracefully handled. Here is
    an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
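
    For orientation, a simplified sketch of the callback set a secondary-MMU
    driver hooks into (see include/linux/mmu_notifier.h for the real
    structure):

    struct mmu_notifier_ops {
            void (*release)(struct mmu_notifier *mn, struct mm_struct *mm);
            int  (*clear_flush_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                                      unsigned long address);
            void (*invalidate_page)(struct mmu_notifier *mn, struct mm_struct *mm,
                                    unsigned long address);
            void (*invalidate_range_start)(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start, unsigned long end);
            void (*invalidate_range_end)(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start, unsigned long end);
    };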
     

27 Jul, 2008

1 commit

  • Fixes a build failure reported by Alan Cox:

    mm/hugetlb.c: In function `hugetlb_acct_memory': mm/hugetlb.c:1507:
    error: implicit declaration of function `cpuset_mems_nr'

    Also reverts Ingo's

    commit e44d1b2998d62a1f2f4d7eb17b56ba396535509f
    Author: Ingo Molnar
    Date: Fri Jul 25 12:57:41 2008 +0200

    mm/hugetlb.c: fix build failure with !CONFIG_SYSCTL

    which fixed the build error but added some unused-static-function warnings.

    Signed-off-by: Nishanth Aravamudan
    Cc: Alan Cox
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

26 Jul, 2008

1 commit

  • On !CONFIG_SYSCTL on x86 with latest -git I get:

    mm/hugetlb.c: In function 'decrement_hugepage_resv_vma':
    mm/hugetlb.c:83: error: 'reserve' undeclared (first use in this function)
    mm/hugetlb.c:83: error: (Each undeclared identifier is reported only once
    mm/hugetlb.c:83: error: for each function it appears in.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

25 Jul, 2008

15 commits

  • With shared reservations (and now also with private reservations), we reserve
    huge pages at mmap time. We also account for the mapping against fs quota to
    prevent a reservation from being preempted by quota exhaustion.

    When testing with the libhugetlbfs test suite, I found a problem with quota
    accounting. FS quota for allocated pages is handled correctly but we are not
    releasing quota for private pages that were reserved but never allocated. Do
    this in hugetlb_vm_op_close() at the same time as unused page reservations are
    released.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
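
    The shape of the fix, sketched (names follow the hugetlb code of the time;
    details elided): when the VMA goes away, return the unused reservation
    and the quota that was taken for it.

    /* End of hugetlb_vm_op_close() for a reservation-owning VMA. */
    reserve = (end - start) - region_count(&reservations->regions, start, end);
    if (reserve) {
            hugetlb_acct_memory(h, -reserve);                    /* give pages back */
            hugetlb_put_quota(vma->vm_file->f_mapping, reserve); /* and the quota  */
    }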
     
  • When removing a huge page from the hugepage pool for a fault the system checks
    to see if the mapping requires additional pages to be reserved, and if it does
    whether there are any unreserved pages remaining. If not, the allocation
    fails without even attempting to get a page. In order to determine whether to
    apply this check we call vma_has_private_reserves() which tells us if this vma
    is MAP_PRIVATE and is the owner. This incorrectly triggers the remaining
    reservation test for MAP_SHARED mappings which prevents allocation of the
    final page in the pool even though it is reserved for this mapping.

    In reality we only want to check this for MAP_PRIVATE mappings where the
    process is not the original mapper. Replace vma_has_private_reserves() with
    vma_has_reserves() which indicates whether further reserves are required, and
    update the caller.

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
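
    Roughly the replacement helper (a sketch close to the code of the time):
    shared mappings always have a reservation to draw on, private mappings
    only if this VMA is the original owner.

    /* Do reserved pages back this mapping? */
    static int vma_has_reserves(struct vm_area_struct *vma)
    {
            if (vma->vm_flags & VM_SHARED)
                    return 1;                 /* shared: reserved at mmap time */
            if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
                    return 1;                 /* private: owner holds the reserve */
            return 0;
    }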
     
  • Allow alloc_bootmem_huge_page() to be overridden by architectures that
    can't always use bootmem. This requires huge_boot_pages to be available
    for use by this function.

    This is required for powerpc 16G pages, which have to be reserved prior to
    boot time. The locations of these pages are indicated in the device tree.

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • Allow configurations with a default huge page size that is different from
    the traditional HPAGE_SIZE. The default huge page size is the one
    represented in the legacy /proc ABIs, SHM, and which is defaulted to when
    mounting hugetlbfs filesystems.

    This is implemented with a new kernel option default_hugepagesz=, which
    defaults to HPAGE_SIZE if not specified.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Straightforward extensions for huge pages located in the PUD instead of
    PMDs.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • - Reword sentence to clarify meaning with multiple options
    - Add support for using GB prefixes for the page size
    - Add extra printk to delayed > MAX_ORDER allocation code

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Make some infrastructure changes to allow boot-time allocation of
    different hugepage page sizes.

    - move all basic hstate initialisation into hugetlb_add_hstate
    - create a new function hugetlb_hstate_alloc_pages() to do the
    actual initial page allocations. Call this function early in
    order to allocate giant pages from bootmem.
    - Check for multiple hugepages= parameters

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
    not practical to enlarge MAX_ORDER to 1GB.

    Instead the 1GB pages are only allocated at boot with the bootmem
    allocator, via the hugepages=... option.

    These 1G bootmem pages are never freed. In theory it would be possible to
    implement that with some complications, but since it would be a one-way
    street (>= MAX_ORDER pages cannot be allocated later) I decided not to do
    so for now.

    The >= MAX_ORDER code is not ifdef'ed per architecture. It is not very
    big and the ifdef ugliness did not seem worth it.

    Known problems: /proc/meminfo and "free" do not display the memory
    allocated for gb pages in "Total". This is a little confusing for the
    user.

    Acked-by: Andrew Hastings
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Need this as a separate function for a future patch.

    No behaviour change.

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Provide new hugepages user APIs that are more suited to multiple hstates
    in sysfs. There is a new directory, /sys/kernel/hugepages. Underneath
    that directory there will be a directory per-supported hugepage size,
    e.g.:

    /sys/kernel/hugepages/hugepages-64kB
    /sys/kernel/hugepages/hugepages-16384kB
    /sys/kernel/hugepages/hugepages-16777216kB

    corresponding to 64k, 16m and 16g respectively. Within each
    hugepages-size directory there are a number of files, corresponding to the
    tracked counters in the hstate, e.g.:

    /sys/kernel/hugepages/hugepages-64kB/nr_hugepages
    /sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
    /sys/kernel/hugepages/hugepages-64kB/free_hugepages
    /sys/kernel/hugepages/hugepages-64kB/resv_hugepages
    /sys/kernel/hugepages/hugepages-64kB/surplus_hugepages

    Of these files, the first two are read-write and the latter three are
    read-only. The size of the hugepage being manipulated is trivially
    deducible from the enclosing directory and is always expressed in kB (to
    match meminfo).

    [dave@linux.vnet.ibm.com: fix build]
    [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
    [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Cc: Dave Hansen
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Add the ability to configure the hugetlb hstate used on a per mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
    the page size
    - This option causes the mount code to find the hstate corresponding to the
    specified size, and sets up a pointer to the hstate in the mount's
    superblock.
    - Change the hstate accessors to use this information rather than the
    global_hstate they were using (requires a slight change in mm/memory.c
    so we don't NULL deref in the error-unmap path -- see comments).

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
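
    A condensed sketch of the registration and iteration side (names follow
    the patch, details elided):

    struct hstate hstates[HUGE_MAX_HSTATE];
    static unsigned int max_hstate;                   /* number of registered sizes */

    #define default_hstate (hstates[0])               /* the boot-time default size */
    #define for_each_hstate(h) \
            for ((h) = hstates; (h) < &hstates[max_hstate]; (h)++)

    /* Architectures register an additional huge page size by its page order. */
    void __init hugetlb_add_hstate(unsigned order);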
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code which requires these
    fields; callers will do the right thing regardless of the exact hstate they
    are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
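
    A trimmed-down sketch of the structure and the accessors the rest of the
    code uses (field list abbreviated):

    /* Per-size hugetlb state. */
    struct hstate {
            unsigned int order;                 /* huge page size as a page order */
            unsigned long mask;
            unsigned long nr_huge_pages;
            unsigned long free_huge_pages;
            unsigned long resv_huge_pages;
            unsigned long surplus_huge_pages;
            /* ... per-node counts, overcommit limits, etc. ... */
    };

    static inline unsigned long huge_page_size(struct hstate *h)
    {
            return (unsigned long)PAGE_SIZE << h->order;
    }

    static inline unsigned int huge_page_order(struct hstate *h)
    {
            return h->order;
    }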
     
  • Needed to avoid code duplication in follow up patches.

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Hugh adds: vma_pagecache_offset() has a dangerously misleading name, since
    it's using hugepage units: rename it to vma_hugecache_offset().

    [apw@shadowen.org: restack onto fixed MAP_PRIVATE reservations]
    [akpm@linux-foundation.org: vma_split conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Nick Piggin
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
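
    For reference, the renamed helper computes the offset of an address
    within the mapping in huge-page units, not base pages (a sketch close to
    the real function):

    static pgoff_t vma_hugecache_offset(struct hstate *h,
                                        struct vm_area_struct *vma,
                                        unsigned long address)
    {
            return ((address - vma->vm_start) >> huge_page_shift(h)) +
                    (vma->vm_pgoff >> huge_page_order(h));
    }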