23 Oct, 2008

1 commit


27 Jul, 2008

1 commit

  • Remove the following warning when CONFIG_HUGETLB_PAGE is not set:

    ipc/shm.c: In function `shm_get_stat':
    ipc/shm.c:565: warning: unused variable `h'
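
    A minimal sketch of the usual shape of such a fix (hypothetical, not the
    actual diff; shp and mapping stand in for the surrounding shm code):
    confine the declaration to the block that uses it, so it compiles away
    when CONFIG_HUGETLB_PAGE is off.

    #ifdef CONFIG_HUGETLB_PAGE
            /* only declared where it is actually used */
            struct hstate *h = hstate_file(shp->shm_file);

            *rss += pages_per_huge_page(h) * mapping->nrpages;
    #endif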

    [akpm@linux-foundation.org: use tabs, not spaces]
    Signed-off-by: Andrea Righi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     

25 Jul, 2008

8 commits

  • Allow alloc_bootmem_huge_page() to be overridden by architectures that
    can't always use bootmem. This requires huge_boot_pages to be available
    for use by this function.

    This is required for powerpc 16G pages, which have to be reserved prior to
    boot time. The location of these pages is indicated in the device tree.
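
    One way to make such a hook overridable is the weak-symbol pattern; a
    hedged sketch (simplified, not the exact kernel code):

    /* generic default: grab the huge page from bootmem */
    int __attribute__((weak)) alloc_bootmem_huge_page(struct hstate *h)
    {
            /* ... allocate from bootmem and add to huge_boot_pages ... */
            return 1;
    }

    /* an architecture that cannot use bootmem (e.g. powerpc 16G pages
     * reserved via the device tree) supplies its own strong definition,
     * and the linker prefers it over the weak default */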

    Acked-by: Adam Litke
    Signed-off-by: Jon Tollefson
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jon Tollefson
     
  • Straightforward extensions for huge pages located in the PUD instead of
    the PMD.

    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Provide new hugepages user APIs that are more suited to multiple hstates
    in sysfs. There is a new directory, /sys/kernel/hugepages. Underneath
    that directory there will be a directory per-supported hugepage size,
    e.g.:

    /sys/kernel/hugepages/hugepages-64kB
    /sys/kernel/hugepages/hugepages-16384kB
    /sys/kernel/hugepages/hugepages-16777216kB

    corresponding to 64k, 16m and 16g respectively. Within each
    hugepages-size directory there are a number of files, corresponding to the
    tracked counters in the hstate, e.g.:

    /sys/kernel/hugepages/hugepages-64kB/nr_hugepages
    /sys/kernel/hugepages/hugepages-64kB/nr_overcommit_hugepages
    /sys/kernel/hugepages/hugepages-64kB/free_hugepages
    /sys/kernel/hugepages/hugepages-64kB/resv_hugepages
    /sys/kernel/hugepages/hugepages-64kB/surplus_hugepages

    Of these files, the first two are read-write and the latter three are
    read-only. The size of the hugepage being manipulated is trivially
    deducible from the enclosing directory and is always expressed in kB (to
    match meminfo).
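
    A userspace sketch of reading one of these counters (note the follow-up
    fix below moves the tree under /sys/kernel/mm; the 2048kB size is just
    an example):

    #include <stdio.h>

    int main(void)
    {
            const char *path =
                "/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages";
            FILE *f = fopen(path, "r");
            unsigned long nr;

            if (f && fscanf(f, "%lu", &nr) == 1)
                    printf("nr_hugepages = %lu\n", nr);
            if (f)
                    fclose(f);
            return 0;
    }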

    [dave@linux.vnet.ibm.com: fix build]
    [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
    [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Nick Piggin
    Cc: Dave Hansen
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • Add the ability to configure the hugetlb hstate used on a per mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
    the page size
    - This option causes the mount code to find the hstate corresponding to the
    specified size, and sets up a pointer to the hstate in the mount's
    superblock.
    - Change the hstate accessors to use this information rather than the
    global_hstate they were using (requires a slight change in mm/memory.c
    so we don't NULL deref in the error-unmap path -- see comments).
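
    A hedged usage sketch of the new pagesize= option (the mount point and
    size are examples; the kernel must support 2 MB huge pages):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* mount a hugetlbfs instance backed by 2 MB pages */
            if (mount("none", "/mnt/huge-2m", "hugetlbfs", 0,
                      "pagesize=2M") != 0) {
                    perror("mount");
                    return 1;
            }
            return 0;
    }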

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add basic support for more than one hstate in hugetlbfs. This is the key
    to supporting multiple hugetlbfs page sizes at once.

    - Rather than a single hstate, we now have an array, with an iterator
    - default_hstate continues to be the struct hstate which we use by default
    - Add functions for architectures to register new hstates

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around to the code which requires
    these fields; that code will do the right thing regardless of the exact
    hstate it is operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.
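
    A minimal sketch of the idea (abbreviated from the description above,
    not the exact kernel definition):

    /* per-size hugetlb state; eventually one instance per supported size */
    struct hstate {
            unsigned int order;            /* size = PAGE_SIZE << order */
            unsigned long nr_huge_pages;   /* pages currently in the pool */
            unsigned long free_huge_pages; /* pages not yet handed out */
    };

    static struct hstate default_hstate;

    static inline unsigned long huge_page_size(struct hstate *h)
    {
            return 1UL << (h->order + PAGE_SHIFT);
    }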

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • …n hugetlbfs will succeed

    After patch 2 in this series, a process that successfully calls mmap() for
    a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
    process calls fork(). At that point, the next write fault from the parent
    could fail due to COW if the child still has a reference.

    We only reserve pages for the parent but a copy must be made to avoid
    leaking data from the parent to the child after fork(). Reserves could be
    taken for both parent and child at fork time to guarantee faults but if
    the mapping is large it is highly likely we will not have sufficient pages
    for the reservation, and it is common to fork only to exec() immediately
    after. A failure here would be very undesirable.

    Note that the current behaviour of mainline with MAP_PRIVATE pages is
    pretty bad. The following situation is allowed to occur today.

    1. Process calls mmap(MAP_PRIVATE)
    2. Process calls mlock() to fault all pages and makes sure it succeeds
    3. Process forks()
    4. Process writes to MAP_PRIVATE mapping while child still exists
    5. If the COW fails at this point, the process gets SIGKILLed even though it
    had taken care to ensure the pages existed

    This patch improves the situation by guaranteeing the reliability of the
    process that successfully calls mmap(). When the parent performs COW, it
    will try to satisfy the allocation without using reserves. If that fails
    the parent will steal the page leaving any children without a page.
    Faults from the child after that point will result in failure. If the
    child COW happens first, an attempt will be made to allocate the page
    without reserves and the child will get SIGKILLed on failure.

    To summarise the new behaviour:

    1. If the original mapper performs COW on a private mapping with multiple
    references, it will attempt to allocate a hugepage from the pool or
    the buddy allocator without using the existing reserves. On failure, VMAs
    mapping the same area are traversed and the page being COW'd is unmapped
    where found. It will then steal the original page as the last mapper in
    the normal way.

    2. The VMAs the pages were unmapped from are flagged to note that pages
    with data no longer exist. Future no-page faults on those VMAs will
    terminate the process as otherwise it would appear that data was corrupted.
    A warning is printed to the console that this situation occurred.

    3. If the child performs COW first, it will attempt to satisfy the COW
    from the pool if there are enough pages or via the buddy allocator if
    overcommit is allowed and the buddy allocator can satisfy the request. If
    it fails, the child will be killed.

    If the pool is large enough, existing applications will not notice that
    the reserves were a factor. Existing applications that depend on reserves
    not being taken are unlikely to exist: for much of the history of
    hugetlbfs, pages were prefaulted at mmap() time, allocating the pages at
    that point or failing the mmap().
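
    An illustrative userspace sequence for the case being described (a
    sketch; the hugetlbfs path and 2 MB size are assumptions):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define HPAGE (2UL * 1024 * 1024)     /* assumed huge page size */

    int main(void)
    {
            int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0600);
            char *p = mmap(NULL, HPAGE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);

            if (fd < 0 || p == MAP_FAILED)
                    return 1;
            p[0] = 1;           /* parent instantiates the page */
            if (fork() == 0)
                    _exit(0);   /* child briefly holds a reference */
            p[1] = 2;           /* parent write may COW; with this patch the
                                   parent wins, stealing the page if no
                                   unreserved page can be allocated */
            wait(NULL);
            return 0;
    }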

    [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Acked-by: Adam Litke <agl@us.ibm.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: William Lee Irwin III <wli@holomorphy.com>
    Cc: Hugh Dickins <hugh@veritas.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
    a similar manner to the reservations taken for MAP_SHARED mappings. The
    reserve count is accounted both globally and on a per-VMA basis for
    private mappings. This guarantees that a process that successfully calls
    mmap() will successfully fault all pages in the future unless fork() is
    called.

    The characteristics of private mappings of hugetlbfs files after this
    patch are:

    1. The process calling mmap() is guaranteed that all future faults will
    succeed until it calls fork().
    2. On fork(), the parent may die due to SIGKILL on writes to the private
    mapping if enough pages are not available for the COW. For reasonably
    reliable behaviour in the face of a small huge page pool, children of
    hugepage-aware processes should not reference the mappings, such as
    might occur when fork()ing to exec().
    3. On fork(), the child VMAs inherit no reserves. Reads on pages already
    faulted by the parent will succeed. Successful writes will depend on enough
    huge pages being free in the pool.
    4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
    and at fault time otherwise.

    Before this patch, all reads or writes in the child potentially need page
    allocations that can later lead to the death of the parent. This applies
    to reads and writes of uninstantiated pages as well as COW. After the
    patch it is only a write to an instantiated page that causes problems.
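
    One way to observe the new reservations (sketch; assumes a mounted
    hugetlbfs and that HugePages_Rsvd in /proc/meminfo reflects them):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0600);

            /* a MAP_PRIVATE mapping now takes huge page reserves up front */
            if (mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0) == MAP_FAILED)
                    return 1;
            /* the reservation shows up before any page is touched */
            return system("grep HugePages_Rsvd /proc/meminfo");
    }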

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Apr, 2008

1 commit

  • This patch moves all architecture functions for hugetlb to architecture header
    files (include/asm-foo/hugetlb.h) and converts all macros to inline functions.
    It also removes (!) ARCH_HAS_HUGEPAGE_ONLY_RANGE,
    ARCH_HAS_HUGETLB_FREE_PGD_RANGE, ARCH_HAS_PREPARE_HUGEPAGE_RANGE,
    ARCH_HAS_SETCLEAR_HUGE_PTE and ARCH_HAS_HUGETLB_PREFAULT_HOOK.

    Getting rid of the ARCH_HAS_xxx #ifdef and macro fugliness should increase
    readability and maintainability, at the price of some code duplication. An
    asm-generic common part would have reduced the loc, but we would end up with
    new ARCH_HAS_xxx defines eventually.
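
    The shape of the conversion, as a generic sketch (not any specific
    arch's code):

    /* before, somewhere in asm-foo: an untyped macro */
    #define set_huge_pte_at(mm, addr, ptep, pte) set_pte_at(mm, addr, ptep, pte)

    /* after, in include/asm-foo/hugetlb.h: typed and type-checked */
    static inline void set_huge_pte_at(struct mm_struct *mm,
                                       unsigned long addr,
                                       pte_t *ptep, pte_t pte)
    {
            set_pte_at(mm, addr, ptep, pte);
    }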

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

14 Feb, 2008

1 commit

  • proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
    hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
    result, like in hugetlb_sysctl_handler(), then grab the lock to update
    nr_overcommit_huge_pages.
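
    The resulting pattern, sketched (argument lists elided, since they vary
    by kernel version):

    unsigned long tmp = nr_overcommit_huge_pages;

    /* let the sysctl helper copy to/from userspace into a local,
     * with no spinlock held */
    table->data = &tmp;
    table->maxlen = sizeof(unsigned long);
    proc_doulongvec_minmax(table, write, /* ... */);

    if (write) {
            /* publish the new value under the lock */
            spin_lock(&hugetlb_lock);
            nr_overcommit_huge_pages = tmp;
            spin_unlock(&hugetlb_lock);
    }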

    Signed-off-by: Nishanth Aravamudan
    Reported-by: Miles Lane
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

09 Feb, 2008

1 commit

  • When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
    proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
    require that all counter modifications occur under the hugetlb_lock. Add a
    callback into the hugetlb code similar to the one for nr_hugepages. Grab
    the lock around the manipulation of nr_overcommit_hugepages in
    proc_doulongvec_minmax().

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

18 Dec, 2007

2 commits

  • This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
    Add hugetlb_dynamic_pool sysctl")

    Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
    sysctl is not needed, as its semantics can be expressed by 0 in the
    overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
    (pool enabled).

    (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • hugetlb: introduce nr_overcommit_hugepages sysctl

    While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
    became convinced that having a boolean sysctl was insufficient:

    1) To support per-node control of hugepages, I have previously submitted
    patches to add a sysfs attribute related to nr_hugepages. However, with
    a boolean global value and per-mount quota enforcement constraining the
    dynamic pool, adding corresponding control of the dynamic pool on a
    per-node basis seems inconsistent to me.

    2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
    mount points is, arguably, more arduous than it needs to be. Each quota
    would need to be set separately, and the sum would need to be monitored.

    To ease the administration, and to help make the way for per-node
    control of the static & dynamic hugepage pool, I added a separate
    sysctl, nr_overcommit_hugepages. This value serves as a high watermark
    for the overall hugepage pool, while nr_hugepages serves as a low
    watermark. The boolean sysctl can then be removed, as the condition

    nr_overcommit_hugepages > 0

    indicates the same administrative setting as

    hugetlb_dynamic_pool == 1

    Quotas still serve as local enforcement of the size of the pool on a
    per-mount basis.
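
    A usage sketch of the two watermarks (values are examples):

    #include <stdio.h>

    int main(void)
    {
            FILE *lo = fopen("/proc/sys/vm/nr_hugepages", "w");
            FILE *hi = fopen("/proc/sys/vm/nr_overcommit_hugepages", "w");

            if (!lo || !hi)
                    return 1;
            fprintf(lo, "64\n");   /* static pool: low watermark */
            fprintf(hi, "32\n");   /* up to 32 more surplus pages on demand */
            fclose(lo);
            fclose(hi);
            return 0;
    }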

    A few caveats:

    1) There is a race whereby the global surplus huge page counter is
    incremented before a hugepage has been allocated. Another process could
    then try to grow the pool, fail to convert a surplus huge page to a
    normal huge page, and instead allocate a fresh huge page. I believe this is
    benign, as no memory is leaked (the actual pages are still tracked
    correctly) and the counters won't go out of sync.

    2) Shrinking the static pool while a surplus is in effect will allow the
    number of surplus huge pages to exceed the overcommit value. As long as
    this condition holds, however, no more surplus huge pages will be
    allowed on the system until one of the two sysctls is increased
    sufficiently, or the surplus huge pages go out of use and are freed.

    Successfully tested on x86_64 with the current libhugetlbfs snapshot,
    modified to use the new sysctl.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

15 Nov, 2007

3 commits

  • For administrative purposes, we want to query the actual block usage of a
    hugetlbfs file via fstat. Currently, hugetlbfs always returns 0. Fix that
    up, since the kernel already has all the information needed to track it
    properly.
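
    What this enables from userspace (sketch; the path is an example):

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct stat st;

            if (stat("/mnt/huge/file", &st) != 0)
                    return 1;
            /* st_blocks counts 512-byte units; no longer always zero */
            printf("bytes used: %lld\n", (long long)st.st_blocks * 512);
            return 0;
    }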

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: Badari Pulavarty
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
    allow bulk updating of the sbinfo->free_blocks counter. This will be used by
    the next patch in the series.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • When calling get_user_pages(), a write flag is passed in by the caller to
    indicate if write access is required on the faulted-in pages. Currently,
    follow_hugetlb_page() ignores this flag and always faults pages for
    read-only access. This can cause data corruption because a device driver
    that calls get_user_pages() with write set will not expect COW faults to
    occur on the returned pages.

    This patch passes the write flag down to follow_hugetlb_page() and makes
    sure hugetlb_fault() is called with the right write_access parameter.

    [ezk@cs.sunysb.edu: build fix]
    Signed-off-by: Adam Litke
    Reviewed-by: Ken Chen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

17 Oct, 2007

1 commit

  • The maximum size of the huge page pool can be controlled using the overall
    size of the hugetlb filesystem (via its 'size' mount option). However, in
    the common case this will not be set, as the pool is traditionally fixed
    in size at boot time. In order to maintain the expected semantics, we need to
    prevent the pool expanding by default.

    This patch introduces a new sysctl controlling dynamic pool resizing. When
    this is enabled the pool will expand beyond its base size up to the size of
    the hugetlb filesystem. It is disabled by default.

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

31 Aug, 2007

1 commit

  • For hugepage mappings, the file offset, like the address and size, needs to
    be aligned to the size of a hugepage.

    In commit 68589bc353037f233fe510ad9ff432338c95db66, the check for this was
    moved into prepare_hugepage_range() along with the address and size checks.
    But since BenH's rework of the get_unmapped_area() paths leading up to
    commit 4b1d89290b62bb2db476c94c82cf7442aab440c8, prepare_hugepage_range()
    is only called for MAP_FIXED mappings, not for other mappings. This means
    we're no longer ever checking for an aligned offset - I've confirmed that
    mmap() will (apparently) succeed with a misaligned offset on both powerpc
    and i386 at least.

    This patch restores the check, removing it from prepare_hugepage_range()
    and putting it back into hugetlbfs_file_mmap(). I'm putting it there,
    rather than in the get_unmapped_area() path, so it only needs to go in
    one place rather than separately in the half-dozen or so arch-specific
    implementations of hugetlb_get_unmapped_area().
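
    The restored check amounts to something like this sketch (vm_pgoff is
    measured in PAGE_SIZE units):

    /* reject file offsets that are not huge page aligned */
    if (vma->vm_pgoff & (~HPAGE_MASK >> PAGE_SHIFT))
            return -EINVAL;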

    Signed-off-by: David Gibson
    Cc: Adam Litke
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

30 Jul, 2007

1 commit

  • Remove fs.h from mm.h. For this,
    1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
    2) Add back fs.h or less bloated headers (err.h) to files that need it.

    As a result, on x86_64 allyesconfig, fs.h dependencies are cut from 3929
    files rebuilt down to 3444 (-12.3%).

    Cross-compile tested without regressions on my two usual configs and (sigh):

    alpha arm-mx1ads mips-bigsur powerpc-ebony
    alpha-allnoconfig arm-neponset mips-capcella powerpc-g5
    alpha-defconfig arm-netwinder mips-cobalt powerpc-holly
    alpha-up arm-netx mips-db1000 powerpc-iseries
    arm arm-ns9xxx mips-db1100 powerpc-linkstation
    arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200
    arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple
    arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2
    arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads
    arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb
    arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds
    arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb
    arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx
    arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp
    arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds
    arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds
    arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads
    arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds
    arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads
    arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds
    arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds
    arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn
    arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads
    arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads
    arm-ep93xx i386-up mips-pb1100 powerpc-pasemi
    arm-footbridge ia64 mips-pb1500 powerpc-pmac32
    arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64
    arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800
    arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3
    arm-h7202 ia64-gensparse mips-qemu powerpc-pseries
    arm-hackkit ia64-sim mips-rbhma4200 powerpc-up
    arm-integrator ia64-sn2 mips-rbhma4500 s390
    arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig
    arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig
    arm-iop33x ia64-zx1 mips-sead s390-up
    arm-ixp2000 m68k mips-tb0219 sparc
    arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig
    arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig
    arm-jornada720 m68k-atari mips-workpad sparc-up
    arm-kafa m68k-bvme6000 mips-wrppmc sparc64
    arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig
    arm-ks8695 m68k-mac parisc sparc64-defconfig
    arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up
    arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64
    arm-lpd7a400 m68k-q40 parisc-up x86_64
    arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig
    arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig
    arm-lusl7200 mips powerpc-celleb x86_64-up
    arm-mainstone mips-atlas powerpc-chrp32

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

18 Jul, 2007

1 commit

  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jun, 2007

1 commit

  • Some user space tools need to identify SYSV shared memory when examining
    /proc/<pid>/maps. To do so they look for a block device with major zero, a
    dentry named SYSV<shmid>, and having the minor of the internal sysv
    shared memory kernel mount.

    To help these tools, and to make it easier for people just browsing
    /proc/<pid>/maps, this patch modifies hugetlb sysv shared memory to use
    the SYSV<shmid> dentry naming convention.

    User space tools will still have to be aware that hugetlb sysv shared
    memory lives on a different internal kernel mount and so has a different
    block device minor number from the rest of sysv shared memory.

    Signed-off-by: Eric W. Biederman
    Cc: "Serge E. Hallyn"
    Cc: Albert Cahalan
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

08 May, 2007

1 commit


02 Mar, 2007

1 commit

  • This patch provides the following hugetlb-related fixes to the recent stacked
    shm files changes:
    - Update is_file_hugepages() so it will recognize hugetlb shm segments.
    - get_unmapped_area must be called with the nested file struct to handle
    the sfd->file->f_ops->get_unmapped_area == NULL case.
    - The fsync f_op must be wrapped since it is specified in the hugetlbfs
    f_ops.

    This is based on proposed fixes from Eric Biederman that were debugged and
    tested by me. Without it, attempting to use hugetlb shared memory segments
    on powerpc (and likely ia64) will kill your box.

    Signed-off-by: Adam Litke
    Cc: Eric Biederman
    Cc: Andrew Morton
    Acked-by: William Irwin
    Signed-off-by: Linus Torvalds

    Adam Litke
     

08 Dec, 2006

1 commit

  • Following up on the shared page table work done by Dave McCracken, this
    set of patches targets shared page tables for hugetlb memory only.

    The shared page table is particularly useful when a large number of
    independent processes share large shared memory segments. In the normal
    page case, the amount of memory saved from the processes' page tables is
    quite significant. For hugetlb, the saving on page table memory is not the primary
    objective (as hugetlb itself already cuts down page table overhead
    significantly), instead, the purpose of using shared page table on hugetlb is
    to allow faster TLB refill and smaller cache pollution upon TLB miss.

    With PT sharing, pte entries are shared among hundreds of processes; the
    cache consumed by all the page tables is smaller, and in return the
    application gets a much higher cache hit ratio. One other effect is that
    the hit rate of the hardware page walker finding ptes in cache will be
    higher, which helps to reduce TLB miss latency. These two effects
    contribute to higher application performance.

    Signed-off-by: Ken Chen
    Acked-by: Hugh Dickins
    Cc: Dave McCracken
    Cc: William Lee Irwin III
    Cc: "Luck, Tony"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Adam Litke
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

15 Nov, 2006

1 commit

  • (David:)

    If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
    because the given file offset is not hugepage aligned - then do_mmap_pgoff
    will go to the unmap_and_free_vma backout path.

    But at this stage the vma hasn't been marked as hugepage, and the backout path
    will call unmap_region() on it. That will eventually call down to the
    non-hugepage version of unmap_page_range(). On ppc64, at least, that will
    cause serious problems if there are any existing hugepage pagetable entries in
    the vicinity - for example if there are any other hugepage mappings under the
    same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
    entries. I suspect this will also cause bad problems on ia64, though I don't
    have a machine to test it on.

    (Hugh:)

    prepare_hugepage_range() should check file offset alignment when it checks
    virtual address and length, to stop MAP_FIXED with a bad huge offset from
    unmapping before it fails further down. PowerPC should apply the same
    prepare_hugepage_range alignment checks as ia64 and all the others do.

    Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
    is the check for too small a mapping); but even so, move up setting of
    VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
    hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
    when unwinding from error will go the non-huge way, which may cause bad
    behaviour on architectures (powerpc and ia64) which segregate their huge
    mappings into a separate region of the address space.

    Signed-off-by: Hugh Dickins
    Cc: "Luck, Tony"
    Cc: "David S. Miller"
    Acked-by: Adam Litke
    Acked-by: David Gibson
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Oct, 2006

1 commit

  • commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops
    with the libhugetlbfs test suite. The problem is that hugetlb pages can
    be shared by multiple mappings. Multiple threads can fight over page->lru
    in the unmap path and bad things happen. We now serialize
    __unmap_hugepage_range to avoid concurrent linked-list manipulation.
    Such serialization is also needed for shared page table pages on hugetlb
    areas. This patch fixes the bug and also serves as a preparatory patch
    for shared page tables.
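
    A sketch of the serialization (assuming the per-mapping i_mmap lock of
    the era is what is taken):

    /* funnel all unmappers through one lock so only one CPU manipulates
     * the shared pages' ->lru lists at a time */
    void unmap_hugepage_range(struct vm_area_struct *vma,
                              unsigned long start, unsigned long end)
    {
            spin_lock(&vma->vm_file->f_mapping->i_mmap_lock);
            __unmap_hugepage_range(vma, start, end);
            spin_unlock(&vma->vm_file->f_mapping->i_mmap_lock);
    }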

    Signed-off-by: Ken Chen
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

23 Jun, 2006

1 commit

  • Current hugetlb strict accounting for shared mappings always assumes the
    mapping starts at file offset zero and reserves pages between zero and
    the size of the file. This assumption often reserves (or locks down) a
    lot more pages than necessary if the application maps at a non-zero file
    offset. libhugetlbfs is one example that requires proper reservation on
    shared mappings that start at a non-zero offset.

    This patch extends the reservation and hugetlb strict accounting to
    support any arbitrary pair of (offset, len), resulting in a much more
    robust and accurate scheme. More importantly, it won't lock down any
    hugetlb pages outside the file mapping.
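
    The arithmetic at the heart of it, as a simplified sketch:

    /* huge pages needed to back an arbitrary (offset, len) window,
     * rounding the window out to huge page boundaries */
    static unsigned long hugepages_needed(unsigned long offset,
                                          unsigned long len,
                                          unsigned long hpage_size)
    {
            unsigned long first = offset / hpage_size;
            unsigned long last = (offset + len + hpage_size - 1) / hpage_size;

            return last - first;
    }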

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

29 Mar, 2006

1 commit

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (harder to accidentally write to
    shared data structures) and to reduce the false sharing of cachelines
    with things that get dirty in .data (while .rodata is nicely read-only
    and thus cache clean).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

22 Mar, 2006

6 commits

  • Quite a long time back, prepare_hugepage_range() replaced
    is_aligned_hugepage_range() as the callback from mm/mmap.c to arch code to
    verify if an address range is suitable for a hugepage mapping.
    is_aligned_hugepage_range() stuck around, but only to implement
    prepare_hugepage_range() on archs which didn't implement their own.

    Most archs (everything except ia64 and powerpc) used the same
    implementation of is_aligned_hugepage_range(). On powerpc, which
    implements its own prepare_hugepage_range(), the custom version was never
    used.

    In addition, "is_aligned_hugepage_range()" was a bad name, because it
    suggests it returns true iff the given range is a good hugepage range,
    whereas in fact it returns 0-or-error (so the sense is reversed).

    This patch cleans up by abolishing is_aligned_hugepage_range(). Instead
    prepare_hugepage_range() is defined directly. Most archs use the default
    version, which simply checks the given region is aligned to the size of a
    hugepage. ia64 and powerpc define custom versions. The ia64 one simply
    checks that the range is in the correct address space region in addition to
    being suitably aligned. The powerpc version (just as previously) checks
    for suitable addresses, and if necessary performs low-level MMU frobbing to
    set up new areas for use by hugepages.
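
    The default version reduces to an alignment check, roughly:

    /* a range is usable only if address and length are both multiples
     * of the huge page size */
    int prepare_hugepage_range(unsigned long addr, unsigned long len)
    {
            if (len & ~HPAGE_MASK)
                    return -EINVAL;
            if (addr & ~HPAGE_MASK)
                    return -EINVAL;
            return 0;
    }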

    No libhugetlbfs testsuite regressions on ppc64 (POWER5 LPAR).

    Signed-off-by: David Gibson
    Signed-off-by: Zhang Yanmin
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • The optional hugepage callback, hugetlb_free_pgd_range(), is presently
    implemented non-trivially only on ia64 (but I plan to add one for powerpc
    shortly). It has its own prototype for the function in asm-ia64/pgtable.h.
    However, since the function is called from generic code, it makes sense
    for its prototype to be in the generic hugetlb.h header file, as the
    prototypes of the other arch callbacks already are
    (prepare_hugepage_range(), set_huge_pte_at(), etc.). This patch makes it so.

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • free_pgtables() has special logic to call hugetlb_free_pgd_range() instead
    of the normal free_pgd_range() on hugepage VMAs. However, the test it uses
    to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized
    range at the start of the vma. is_hugepage_only_range() will return true
    if the given range has any intersection with a hugepage address region, and
    in this case the given region need not be hugepage aligned. So, for
    example, this test can return true if called on, say, a 4k VMA immediately
    preceding a (nicely aligned) hugepage VMA.

    At present we get away with this because the powerpc version of
    hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the
    only other arch with a non-trivial is_hugepage_only_range()) we get away
    with it for a different reason; the hugepage area is not contiguous with
    the rest of the user address space, and VMAs are not permitted in between,
    so the test can't return a false positive there.

    Nonetheless this should be fixed. We do that in the patch below by
    replacing the is_hugepage_only_range() test with an explicit test of the
    VMA using is_vm_hugetlb_page().

    This in turn changes behaviour for platforms where is_hugepage_only_range()
    returns false always (everything except powerpc and ia64). We address this
    by ensuring that hugetlb_free_pgd_range() is defined to be identical to
    free_pgd_range() (instead of a no-op) on everything except ia64. Even so,
    it will prevent some otherwise possible coalescing of calls down to
    free_pgd_range(). Since this only happens for hugepage VMAs, removing this
    small optimization seems unlikely to cause any trouble.
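
    In outline, the fixed test in the free_pgtables() loop looks like this
    sketch (argument details elided):

    /* decide per VMA, rather than probing the address range */
    if (is_vm_hugetlb_page(vma))
            hugetlb_free_pgd_range(tlb, addr, vma->vm_end, floor, ceiling);
    else
            free_pgd_range(tlb, addr, vma->vm_end, floor, ceiling);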

    This patch causes no regressions on the libhugetlbfs testsuite - ppc64
    POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Originally, mm/hugetlb.c just handled the hugepage physical allocation path
    and its {alloc,free}_huge_page() functions were used from the arch specific
    hugepage code. These days those functions are only used within mm/hugetlb.c
    itself. Therefore, this patch makes them static and removes their
    prototypes from hugetlb.h. This requires a small rearrangement of code in
    mm/hugetlb.c to avoid a forward declaration.

    This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • These days, hugepages are demand-allocated at first fault time. There's a
    somewhat dubious (and racy) heuristic when making a new mmap() to check if
    there are enough available hugepages to fully satisfy that mapping.

    A particularly obvious case where the heuristic breaks down is where a
    process maps its hugepages not as a single chunk, but as a bunch of
    individually mmap()ed (or shmat()ed) blocks without touching and
    instantiating the pages in between allocations. In this case the size of
    each block is compared against the total number of available hugepages.
    It's thus easy for the process to become overcommitted, because each block
    mapping will succeed, although the total number of hugepages required by
    all blocks exceeds the number available. In particular, this defeats such
    a program which will detect a mapping failure and adjust its hugepage usage
    downward accordingly.

    The patch below addresses this problem, by strictly reserving a number of
    physical hugepages for hugepage inodes which have been mapped, but not
    instantiated. MAP_SHARED mappings are thus "safe" - they will fail on
    mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
    trigger an OOM. (Actually SHARED mappings can technically still OOM, but
    only if the sysadmin explicitly reduces the hugepage pool between mapping
    and instantiation)

    This patch appears to address the problem at hand - it allows DB2 to start
    correctly, for instance, which previously suffered the failure described
    above.

    This patch causes no regressions on the libhugetlbfs testsuite, and makes a
    test (designed to catch this problem) pass which previously failed (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • 2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
    mprotect.

    From: David Gibson

    Remove a test from the mprotect() path which checks that the mprotect()ed
    range on a hugepage VMA is hugepage aligned (yes, really, the sense of
    is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

    In fact, we don't need this test. If the given addresses match the
    beginning/end of a hugepage VMA they must already be suitably aligned. If
    they don't, then mprotect_fixup() will attempt to split the VMA. The very
    first test in split_vma() will check for a badly aligned address on a
    hugepage VMA and return -EINVAL if necessary.

    From: "Chen, Kenneth W"

    On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The
    identity of a hugetlb pte is lost when changing page protection via
    mprotect. A page fault that occurs later will trigger a bug check in
    huge_pte_alloc().

    The fix is to always make the new pte a hugetlb pte, and also to clean up
    legacy code where _PAGE_PRESENT was forced on in the pre-faulting days.
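
    The essence of the fix, sketched:

    /* re-mark the pte as huge when rewriting its protection bits, so
     * _PAGE_PSE survives even when the new prot sets _PAGE_PROTNONE */
    pte = pte_mkhuge(pte_modify(old_pte, newprot));
    set_huge_pte_at(mm, addr, ptep, pte);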

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Ken Chen
    Signed-off-by: Nishanth Aravamudan
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

07 Jan, 2006

1 commit

  • The huge_zonelist() function in the memory policy layer provides a list of
    zones ordered by NUMA distance. The hugetlb layer will walk that list looking
    for a zone that has available huge pages but is also in the nodeset of the
    current cpuset.

    This patch does not contain the folding of find_or_alloc_huge_page() that was
    controversial in the earlier discussion.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

14 Nov, 2005

1 commit

  • The address based work estimate for unmapping (for lockbreak) is and always
    was horribly inefficient for sparse mappings. The problem is most simply
    explained with an example:

    If we find a pgd is clear, we still have to call into unmap_page_range
    PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in
    order to progress the working address to the next pgd.

    The fundamental way to solve the problem is to keep track of the end
    address we've processed and pass it back to the higher layers.

    From: Nick Piggin

    Modification to completely get away from address based work estimate
    and instead use an abstract count, with a very small cost for empty
    entries as opposed to present pages.

    On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB
    of virtual address space takes 1.69s; with the following patch applied,
    this operation can be done 1000 times in less than 0.01s

    From: Andrew Morton

    With CONFIG_HUGETLB_PAGE=n:

    mm/memory.c: In function `unmap_vmas':
    mm/memory.c:779: warning: division by zero

    Due to

    zap_work -= (end - start) /
    (HPAGE_SIZE / PAGE_SIZE);

    So make the dummy HPAGE_SIZE non-zero
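
    i.e., something along these lines in the !CONFIG_HUGETLB_PAGE stubs
    (a sketch):

    /* dummy for !CONFIG_HUGETLB_PAGE builds: non-zero, so expressions
     * like (HPAGE_SIZE / PAGE_SIZE) cannot divide by zero */
    #define HPAGE_SIZE PAGE_SIZE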

    Signed-off-by: Robin Holt
    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

30 Oct, 2005

1 commit

  • Remove the page_table_lock from around the calls to unmap_vmas, and replace
    the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
    now safe to descend without page_table_lock.

    Don't attempt fancy locking for hugepages, just take page_table_lock in
    unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
    zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
    does unmap_vmas have much use for its mm arg now.

    The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
    page_table_lock: if they're implemented at all, they typically come down to
    flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
    (which we already audited for the mprotect case).
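
    For reference, the locked iteration pattern this introduces (sketch):

    spinlock_t *ptl;
    /* map the pte page and take its lock in one step... */
    pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

    do {
            /* ... zap one entry ... */
    } while (pte++, addr += PAGE_SIZE, addr != end);

    /* ...and release both together */
    pte_unmap_unlock(pte - 1, ptl);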

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Oct, 2005

1 commit