Eric Lee / linux-smarc-t335x-v3.2

22 Apr, 2009

1 commit

c12ddba09 hugetlbfs: return negative error code for bad mount option ... Browse Code »

This fixes the following BUG:

# mount -o size=MM -t hugetlbfs none /huge
hugetlbfs: Bad value 'MM' for mount option 'size=MM'
------------[ cut here ]------------
kernel BUG at fs/super.c:996!

Due to

BUG_ON(!mnt->mnt_sb);

in vfs_kern_mount().

Also, remove unused #include

Cc: William Irwin
Cc:
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2009-04-22 04:41:48 +0800

01 Apr, 2009

2 commits

2584e5173 mm: reintroduce and deprecate rlimit based access for SHM_HUGETLB ... Browse Code »

Allow non root users with sufficient mlock rlimits to be able to allocate
hugetlb backed shm for now. Deprecate this though. This is being
deprecated because the mlock based rlimit checks for SHM_HUGETLB is not
consistent with mmap based huge page allocations.

Signed-off-by: Ravikiran Thirumalai
Reviewed-by: Mel Gorman
Cc: William Lee Irwin III
Cc: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ravikiran G Thirumalai
2009-04-01 23:59:12 +0800
8a0bdec19 mm: fix SHM_HUGETLB to work with users in hugetlb_shm_group ... Browse Code »

Fix hugetlb subsystem so that non root users belonging to
hugetlb_shm_group can actually allocate hugetlb backed shm.

Currently non root users cannot even map one large page using SHM_HUGETLB
when they belong to the gid in /proc/sys/vm/hugetlb_shm_group. This is
because allocation size is verified against RLIMIT_MEMLOCK resource limit
even if the user belongs to hugetlb_shm_group.

This patch
1. Fixes hugetlb subsystem so that users with CAP_IPC_LOCK and users
belonging to hugetlb_shm_group don't need to be restricted with
RLIMIT_MEMLOCK resource limits
2. This patch also disables mlock based rlimit checking (which will
be reinstated and marked deprecated in a subsequent patch).

Signed-off-by: Ravikiran Thirumalai
Reviewed-by: Mel Gorman
Cc: William Lee Irwin III
Cc: Adam Litke
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ravikiran G Thirumalai
2009-04-01 23:59:12 +0800

11 Feb, 2009

1 commit

5a6fe1259 Do not account for the address space used by hugetlbfs using VM_ACCOUNT ... Browse Code »

When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.

Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.

As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.

With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

Signed-off-by: Mel Gorman
Signed-off-by: Linus Torvalds

Mel Gorman
2009-02-11 02:48:42 +0800

07 Jan, 2009

1 commit

91bf189c3 hugetlb: unsigned ret cannot be negative ... Browse Code »

unsigned long ret cannot be negative, but ret can get -EFAULT.

Signed-off-by: Roel Kluin
Cc: Hugh Dickins
Cc: Christoph Lameter
Cc: Adam Litke
Cc: David Gibson
Cc: Ken Chen
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Roel Kluin
2009-01-07 07:59:08 +0800

06 Jan, 2009

1 commit

56ff5efad zero i_uid/i_gid on inode allocation ... Browse Code »

... and don't bother in callers. Don't bother with zeroing i_blocks,
while we are at it - it's already been zeroed.

i_mode is not worth the effort; it has no common default value.

Signed-off-by: Al Viro

Al Viro
2009-01-06 00:54:28 +0800

14 Nov, 2008

3 commits

86a264abe CRED: Wrap current->cred and a few other accessors ... Browse Code »

Wrap current->cred and a few other accessors to hide their actual
implementation.

Signed-off-by: David Howells
Acked-by: James Morris
Acked-by: Serge Hallyn
Signed-off-by: James Morris

David Howells
2008-11-14 07:39:18 +0800
b6dff3ec5 CRED: Separate task security context from task_struct ... Browse Code »

Separate the task security context from task_struct. At this point, the
security data is temporarily embedded in the task_struct with two pointers
pointing to it.

Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
entry.S via asm-offsets.

With comment fixes Signed-off-by: Marc Dionne

Signed-off-by: David Howells
Acked-by: James Morris
Acked-by: Serge Hallyn
Signed-off-by: James Morris

David Howells
2008-11-14 07:39:16 +0800
77c70de15 CRED: Wrap task credential accesses in the hugetlbfs filesystem ... Browse Code »

Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.

Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.

Signed-off-by: David Howells
Reviewed-by: James Morris
Acked-by: Serge Hallyn
Cc: William Irwin
Signed-off-by: James Morris

David Howells
2008-11-14 07:38:56 +0800

14 Oct, 2008

1 commit

a447c0932 vfs: Use const for kernel parser table ... Browse Code »

This is a much better version of a previous patch to make the parser
tables constant. Rather than changing the typedef, we put the "const" in
all the various places where its required, allowing the __initconst
exception for nfsroot which was the cause of the previous trouble.

This was posted for review some time ago and I believe its been in -mm
since then.

Signed-off-by: Steven Whitehouse
Cc: Alexander Viro
Signed-off-by: Linus Torvalds

Steven Whitehouse
2008-10-14 01:10:37 +0800

27 Jul, 2008

1 commit

51cc50685 SL*B: drop kmem cache argument from constructor ... Browse Code »

Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.

Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c

This is flag day, yes.

Signed-off-by: Alexey Dobriyan
Acked-by: Pekka Enberg
Acked-by: Christoph Lameter
Cc: Jon Tollefson
Cc: Nick Piggin
Cc: Matt Mackall
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2008-07-27 03:00:07 +0800

25 Jul, 2008

4 commits

a137e1cc6 hugetlbfs: per mount huge page sizes ... Browse Code »

Add the ability to configure the hugetlb hstate used on a per mount basis.

- Add a new pagesize= option to the hugetlbfs mount that allows setting
the page size
- This option causes the mount code to find the hstate corresponding to the
specified size, and sets up a pointer to the hstate in the mount's
superblock.
- Change the hstate accessors to use this information rather than the
global_hstate they were using (requires a slight change in mm/memory.c
so we don't NULL deref in the error-unmap path -- see comments).

[np: take hstate out of hugetlbfs inode and vma->vm_private_data]

Acked-by: Adam Litke
Acked-by: Nishanth Aravamudan
Signed-off-by: Andi Kleen
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2008-07-25 01:47:17 +0800
a55164389 hugetlb: modular state for hugetlb page size ... Browse Code »

The goal of this patchset is to support multiple hugetlb page sizes. This
is achieved by introducing a new struct hstate structure, which
encapsulates the important hugetlb state and constants (eg. huge page
size, number of huge pages currently allocated, etc).

The hstate structure is then passed around the code which requires these
fields, they will do the right thing regardless of the exact hstate they
are operating on.

This patch adds the hstate structure, with a single global instance of it
(default_hstate), and does the basic work of converting hugetlb to use the
hstate.

Future patches will add more hstate structures to allow for different
hugetlbfs mounts to have different page sizes.

[akpm@linux-foundation.org: coding-style fixes]
Acked-by: Adam Litke
Acked-by: Nishanth Aravamudan
Signed-off-by: Andi Kleen
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2008-07-25 01:47:17 +0800
04f2cbe35 hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) o… ... Browse Code »

…n hugetlbfs will succeed

After patch 2 in this series, a process that successfully calls mmap() for
a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
process calls fork(). At that point, the next write fault from the parent
could fail due to COW if the child still has a reference.

We only reserve pages for the parent but a copy must be made to avoid
leaking data from the parent to the child after fork(). Reserves could be
taken for both parent and child at fork time to guarantee faults but if
the mapping is large it is highly likely we will not have sufficient pages
for the reservation, and it is common to fork only to exec() immediatly
after. A failure here would be very undesirable.

Note that the current behaviour of mainline with MAP_PRIVATE pages is
pretty bad. The following situation is allowed to occur today.

1. Process calls mmap(MAP_PRIVATE)
2. Process calls mlock() to fault all pages and makes sure it succeeds
3. Process forks()
4. Process writes to MAP_PRIVATE mapping while child still exists
5. If the COW fails at this point, the process gets SIGKILLed even though it
had taken care to ensure the pages existed

This patch improves the situation by guaranteeing the reliability of the
process that successfully calls mmap(). When the parent performs COW, it
will try to satisfy the allocation without using reserves. If that fails
the parent will steal the page leaving any children without a page.
Faults from the child after that point will result in failure. If the
child COW happens first, an attempt will be made to allocate the page
without reserves and the child will get SIGKILLed on failure.

To summarise the new behaviour:

1. If the original mapper performs COW on a private mapping with multiple
references, it will attempt to allocate a hugepage from the pool or
the buddy allocator without using the existing reserves. On fail, VMAs
mapping the same area are traversed and the page being COW'd is unmapped
where found. It will then steal the original page as the last mapper in
the normal way.

2. The VMAs the pages were unmapped from are flagged to note that pages
with data no longer exist. Future no-page faults on those VMAs will
terminate the process as otherwise it would appear that data was corrupted.
A warning is printed to the console that this situation occured.

2. If the child performs COW first, it will attempt to satisfy the COW
from the pool if there are enough pages or via the buddy allocator if
overcommit is allowed and the buddy allocator can satisfy the request. If
it fails, the child will be killed.

If the pool is large enough, existing applications will not notice that
the reserves were a factor. Existing applications depending on the
no-reserves been set are unlikely to exist as for much of the history of
hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
point or failing the mmap().

[npiggin@suse.de: fix CONFIG_HUGETLB=n build]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mel Gorman
2008-07-25 01:47:16 +0800
a1e78772d hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() ... Browse Code »

This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
a similar manner to the reservations taken for MAP_SHARED mappings. The
reserve count is accounted both globally and on a per-VMA basis for
private mappings. This guarantees that a process that successfully calls
mmap() will successfully fault all pages in the future unless fork() is
called.

The characteristics of private mappings of hugetlbfs files behaviour after
this patch are;

1. The process calling mmap() is guaranteed to succeed all future faults until
it forks().
2. On fork(), the parent may die due to SIGKILL on writes to the private
mapping if enough pages are not available for the COW. For reasonably
reliable behaviour in the face of a small huge page pool, children of
hugepage-aware processes should not reference the mappings; such as
might occur when fork()ing to exec().
3. On fork(), the child VMAs inherit no reserves. Reads on pages already
faulted by the parent will succeed. Successful writes will depend on enough
huge pages being free in the pool.
4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
and at fault time otherwise.

Before this patch, all reads or writes in the child potentially needs page
allocations that can later lead to the death of the parent. This applies
to reads and writes of uninstantiated pages as well as COW. After the
patch it is only a write to an instantiated page that causes problems.

Signed-off-by: Mel Gorman
Acked-by: Adam Litke
Cc: Andy Whitcroft
Cc: William Lee Irwin III
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2008-07-25 01:47:16 +0800

30 Apr, 2008

1 commit

e4ad08fe6 mm: bdi: add separate writeback accounting capability ... Browse Code »

Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
set, then don't update the per-bdi writeback stats from
test_set_page_writeback() and test_clear_page_writeback().

Misc cleanups:

- convert bdi_cap_writeback_dirty() and friends to static inline functions
- create a flag that includes all three dirty/writeback related flags,
since almst all users will want to have them toghether

Signed-off-by: Miklos Szeredi
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-04-30 23:29:50 +0800

28 Apr, 2008

2 commits

71fe804b6 mempolicy: use struct mempolicy pointer in shmem_sb_info ... Browse Code »

This patch replaces the mempolicy mode, mode_flags, and nodemask in the
shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
inode.c and simplifies the interfaces.

mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
argument that causes the input nodemask to be stored in the w.user_nodemask of
the created mempolicy for use when the mempolicy is installed in a tmpfs inode
shared policy tree. At that time, any cpuset contextualization is applied to
the original input nodemask. This preserves the previous behavior where the
input nodemask was stored in the superblock. We can think of the returned
mempolicy as "context free".

Because mpol_parse_str() is now calling mpol_new(), we can remove from
mpol_to_str() the semantic checks that mpol_new() already performs.

Add 'no_context' parameter to mpol_to_str() to specify that it should format
the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

Change mpol_shared_policy_init() to take a pointer to a "context free" struct
mempolicy and to create a new, "contextualized" mempolicy using the mode,
mode_flags and user_nodemask from the input mempolicy.

Note: we know that the mempolicy passed to mpol_to_str() or
mpol_shared_policy_init() from a tmpfs superblock is "context free". This
is currently the only instance thereof. However, if we found more uses for
this concept, and introduced any ambiguity as to whether a mempolicy was
context free or not, we could add another internal mode flag to identify
context free mempolicies. Then, we could remove the 'no_context' argument
from mpol_to_str().

Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
if one exists, to pass to mpol_shared_policy_init(). We must add the
reference under the sb stat_lock to prevent races with replacement of the mpol
by remount. This reference is removed in mpol_shared_policy_init().

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: yet another build fix]
Signed-off-by: Lee Schermerhorn
Cc: Christoph Lameter
Cc: David Rientjes
Cc: Mel Gorman
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2008-04-28 23:58:25 +0800
028fec414 mempolicy: support optional mode flags ... Browse Code »

With the evolution of mempolicies, it is necessary to support mempolicy mode
flags that specify how the policy shall behave in certain circumstances. The
most immediate need for mode flag support is to suppress remapping the
nodemask of a policy at the time of rebind.

Both the mempolicy mode and flags are passed by the user in the 'int policy'
formal of either the set_mempolicy() or mbind() syscall. A new constant,
MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
passed as part of this int. Mempolicies that include illegal flags as part of
their policy are rejected as invalid.

An additional member to struct mempolicy is added to support the mode flags:

struct mempolicy {
...
unsigned short policy;
unsigned short flags;
}

The splitting of the 'int' actual passed by the user is done in
sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
there are additional flags, and storing it in the new 'flags' member of struct
mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
the 'policy' member of the struct and all current users of pol->policy remain
unchanged.

The union of the policy mode and optional mode flags is passed back to the
user in get_mempolicy().

This combination of mode and flags within the same actual does not break
userspace code that relies on get_mempolicy(&policy, ...) and either

switch (policy) {
case MPOL_BIND:
...
case MPOL_INTERLEAVE:
...
};

statements or

if (policy == MPOL_INTERLEAVE) {
...
}

statements. Such applications would need to use optional mode flags when
calling set_mempolicy() or mbind() for these previously implemented statements
to stop working. If an application does start using optional mode flags, it
will need to mask the optional flags off the policy in switch and conditional
statements that only test mode.

An additional member is also added to struct shmem_sb_info to store the
optional mode flags.

[hugh@veritas.com: shmem mpol: fix build warning]
Cc: Paul Jackson
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: Andi Kleen
Signed-off-by: David Rientjes
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2008-04-28 23:58:19 +0800

19 Mar, 2008

1 commit

b4d232e65 [PATCH] double iput() on failure exit in hugetlb ... Browse Code »

once we'd done d_instantiate(), we should only do dput().

Signed-off-by: Al Viro

Al Viro
2008-03-19 18:55:01 +0800

09 Feb, 2008

1 commit

10f19a86a mount options: fix hugetlbfs ... Browse Code »

Add a .show_options super operation to hugetlbfs.

Use generic_show_options() and save the complete option string in
hugetlbfs_fill_super().

Signed-off-by: Miklos Szeredi
Cc: Adam Litke
Cc: Badari Pulavarty
Cc: Ken Chen
Cc: William Lee Irwin III
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miklos Szeredi
2008-02-09 01:22:40 +0800

06 Feb, 2008

1 commit

75897d60a hugetlb: allow sticky directory mount option ... Browse Code »

Allow sticky directory mount option for hugetlbfs. This allows admin
to create a shared hugetlbfs mount point for multiple users, while
prevent accidental file deletion that users may step on each other.
It is similiar to default tmpfs mount option, or typical option used
on /tmp.

Signed-off-by: Ken Chen
Cc: Badari Pulavarty
Cc: Adam Litke
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ken Chen
2008-02-06 01:44:14 +0800

15 Nov, 2007

2 commits

9a119c056 hugetlb: allow bulk updating in hugetlb_*_quota() ... Browse Code »

Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
allow bulk updating of the sbinfo->free_blocks counter. This will be used by
the next patch in the series.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800
c79fb75e5 hugetlb: fix quota management for private mappings ... Browse Code »

The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
mappings when that support was added. Currently, quota is debited at page
instantiation and credited at file truncation. This approach works correctly
for shared pages but is incomplete for private pages. In addition to
hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
this function does not respect quotas.

Private huge pages are treated very much like normal, anonymous pages. They
are not "backed" by the hugetlbfs file and are not stored in the mapping's
radix tree. This means that private pages are invisible to
truncate_hugepages() so that function will not credit the quota.

This patch (based on a prototype provided by Ken Chen) moves quota crediting
for all pages into free_huge_page(). page->private is used to store a pointer
to the mapping to which this page belongs. This is used to credit quota on
the appropriate hugetlbfs instance.

Signed-off-by: Adam Litke
Cc: Ken Chen
Cc: Ken Chen
Cc: Andy Whitcroft
Cc: Dave Hansen
Cc: David Gibson
Cc: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adam Litke
2007-11-15 10:45:40 +0800

17 Oct, 2007

7 commits

ce8d2cdf3 r/o bind mounts: filesystem helpers for custom 'struct file's ... Browse Code »

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem. In the
process of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount.

This has a number of uses. It allows chroots to have parts of filesystems
writable. It will be useful for containers in the future because users may
have root inside a container, but should not be allowed to write to
somefilesystems. This also replaces patches that vserver has had out of the
tree for several years.

It allows security enhancement by making sure that parts of your filesystem
read-only (such as when you don't trust your FTP server), when you don't want
to have entire new filesystems mounted, or when you want atime selectively
updated. I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular bind and a r/o
bind mount of it. It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail, like creating a
file on the r/o mount.

This patch:

Some filesystems forego the vfs and may_open() and create their own 'struct
file's.

This patch creates a couple of helper functions which can be used by these
filesystems, and will provide a unified place which the r/o bind mount code
may patch.

Also, rename an existing, static-scope init_file() to a less generic name.

Signed-off-by: Dave Hansen
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Hansen
2007-10-17 23:43:04 +0800
1c0eeaf56 introduce I_SYNC ... Browse Code »

I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

[bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
Signed-off-by: Joern Engel
Cc: Dave Kleikamp
Cc: David Chinner
Cc: Anton Altaparmakov
Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Adrian Bunk
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joern Engel
2007-10-17 23:43:02 +0800
4ba9b9d0b Slab API: remove useless ctor parameter and reorder parameters ... Browse Code »

Slab constructors currently have a flags parameter that is never used. And
the order of the arguments is opposite to other slab functions. The object
pointer is placed before the kmem_cache pointer.

Convert

ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-10-17 23:42:45 +0800
e0bf68dde mm: bdi init hooks ... Browse Code »

provide BDI constructor/destructor hooks

[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Peter Zijlstra
2007-10-17 23:42:45 +0800
e63e1e5a6 hugetlbfs read() support ... Browse Code »

Support for reading from hugetlbfs files. libhugetlbfs lets application
text/data to be placed in large pages. When we do that, oprofile doesn't
work - since libbfd tries to read from it.

This code is very similar to what do_generic_mapping_read() does, but I
can't use it since it has PAGE_CACHE_SIZE assumptions.

[akpm@linux-foundation.org: cleanups, fix leak]
[bunk@stusta.de: make hugetlbfs_read() static]
Signed-off-by: Badari Pulavarty
Acked-by: William Irwin
Tested-by: Nishanth Aravamudan
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Badari Pulavarty
2007-10-17 00:43:03 +0800
7aa91e104 hugetlb: allow extending ftruncate on hugetlbfs ... Browse Code »

For historical reason, expanding ftruncate that increases file size on
hugetlbfs is not allowed due to pages were pre-faulted and lack of fault
handler. Now that we have demand faulting on hugetlb since 2.6.15, there
is no reason to hold back that limitation.

This will make hugetlbfs behave more like a normal fs. I'm writing a user
level code that uses hugetlbfs but will fall back to tmpfs if there are no
hugetlb page available in the system. Having hugetlbfs specific ftruncate
behavior is a bit quirky and I would like to remove that artificial
limitation.

Signed-off-by:
Acked-by: Wiliam Irwin
Cc: Adam Litke
Cc: David Gibson
Cc: Nishanth Aravamudan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ken Chen
2007-10-17 00:43:02 +0800
800d15a53 implement simple fs aops ... Browse Code »

Implement new aops for some of the simpler filesystems.

Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nick Piggin
2007-10-17 00:42:55 +0800

31 Aug, 2007

1 commit

dec4ad86c hugepage: fix broken check for offset alignment in hugepage mappings ... Browse Code »

For hugepage mappings, the file offset, like the address and size, needs to
be aligned to the size of a hugepage.

In commit 68589bc353037f233fe510ad9ff432338c95db66, the check for this was
moved into prepare_hugepage_range() along with the address and size checks.
But since BenH's rework of the get_unmapped_area() paths leading up to
commit 4b1d89290b62bb2db476c94c82cf7442aab440c8, prepare_hugepage_range()
is only called for MAP_FIXED mappings, not for other mappings. This means
we're no longer ever checking for an aligned offset - I've confirmed that
mmap() will (apparently) succeed with a misaligned offset on both powerpc
and i386 at least.

This patch restores the check, removing it from prepare_hugepage_range()
and putting it back into hugetlbfs_file_mmap(). I'm putting it there,
rather than in the get_unmapped_area() path so it only needs to go in one
place, than separately in the half-dozen or so arch-specific
implementations of hugetlb_get_unmapped_area().

Signed-off-by: David Gibson
Cc: Adam Litke
Cc: Andi Kleen
Cc: "David S. Miller"
Cc: Benjamin Herrenschmidt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Gibson
2007-08-31 16:42:23 +0800

20 Jul, 2007

1 commit

20c2df83d mm: Remove slab destructors from kmem_cache_create(). ... Browse Code »

Slab destructors were no longer supported after Christoph's
c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt

Paul Mundt
2007-07-20 09:11:58 +0800

17 Jul, 2007

2 commits

b4c07bce7 hugetlbfs: handle empty options string ... Browse Code »

I was seeing a null pointer deref in fs/super.c:vfs_kern_mount().
Some file system get_sb() handler was returning NULL mnt_sb with
a non-negative return value. I also noticed a "hugetlbfs: Bad
mount option:" message in the log.

Turns out that hugetlbfs_parse_options() was not checking for an
empty option string after call to strsep(). On failure,
hugetlbfs_parse_options() returns 1. hugetlbfs_fill_super() just
passed this return code back up the call stack where
vfs_kern_mount() missed the error and proceeded with a NULL mnt_sb.

Apparently introduced by patch:
hugetlbfs-use-lib-parser-fix-docs.patch

The problem was exposed by this line in my fstab:

none /huge hugetlbfs defaults 0 0

It can also be demonstrated by invoking mount of hugetlbfs
directly with no options or a bogus option.

This patch:

1) adds the check for empty option to hugetlbfs_parse_options(),
2) enhances the error message to bracket any unrecognized
option with quotes ,
3) modifies hugetlbfs_parse_options() to return -EINVAL on any
unrecognized option,
4) adds a BUG_ON() to vfs_kern_mount() to catch any get_sb()
handler that returns a NULL mnt->mnt_sb with a return value
>= 0.

Signed-off-by: Lee Schermerhorn
Acked-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lee Schermerhorn
2007-07-17 00:05:46 +0800
e73a75fa7 hugetlbfs: use lib/parser, fix docs ... Browse Code »

Use lib/parser.c to parse hugetlbfs mount options. Correct docs in
hugetlbpage.txt.

old size of hugetlbfs_fill_super: 675 bytes
new size of hugetlbfs_fill_super: 686 bytes
(hugetlbfs_parse_options() is inlined)

Signed-off-by: Randy Dunlap
Cc: Hugh Dickins
Cc: David Gibson
Cc: Adam Litke
Acked-by: William Lee Irwin III
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2007-07-17 00:05:46 +0800

17 Jun, 2007

1 commit

9d66586f7 shm: fix the filename of hugetlb sysv shared memory ... Browse Code »

Some user space tools need to identify SYSV shared memory when examining
/proc//maps. To do so they look for a block device with major zero, a
dentry named SYSV, and having the minor of the internal sysv
shared memory kernel mount.

To help these tools and to make it easier for people just browsing
/proc//maps this patch modifies hugetlb sysv shared memory to use the
SYSV dentry naming convention.

User space tools will still have to be aware that hugetlb sysv shared
memory lives on a different internal kernel mount and so has a different
block device minor number from the rest of sysv shared memory.

Signed-off-by: Eric W. Biederman
Cc: "Serge E. Hallyn"
Cc: Albert Cahalan
Cc: Badari Pulavarty
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eric W. Biederman
2007-06-17 04:16:16 +0800

17 May, 2007

1 commit

a35afb830 Remove SLAB_CTOR_CONSTRUCTOR ... Browse Code »

SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter
Cc: David Howells
Cc: Jens Axboe
Cc: Steven French
Cc: Michael Halcrow
Cc: OGAWA Hirofumi
Cc: Miklos Szeredi
Cc: Steven Whitehouse
Cc: Roman Zippel
Cc: David Woodhouse
Cc: Dave Kleikamp
Cc: Trond Myklebust
Cc: "J. Bruce Fields"
Cc: Anton Altaparmakov
Cc: Mark Fasheh
Cc: Paul Mackerras
Cc: Christoph Hellwig
Cc: Jan Kara
Cc: David Chinner
Cc: "David S. Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-05-17 20:23:04 +0800

08 May, 2007

4 commits

5bc98594d hugetlbfs: add NULL check in hugetlb_zero_setup() ... Browse Code »

If hugetlbfs module_init() fails, hugetlbfs_vfsmount is not initialized and
shmget() with SHM_HUGETLB flag will cause NULL pointer dereference.

Signed-off-by: Akinobu Mita
Acked-by: William Irwin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2007-05-08 03:12:57 +0800
50953fe9e slab allocators: Remove SLAB_DEBUG_INITIAL flag ... Browse Code »

I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again? The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free. That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on. If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code. But there is no such code
in the kernel. I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e. add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches. Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support. Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-05-08 03:12:57 +0800
036e08568 get_unmapped_area handles MAP_FIXED in hugetlbfs ... Browse Code »

Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling
prepare_hugepage_range()

Signed-off-by: Benjamin Herrenschmidt
Acked-by: William Irwin
Cc: Paul Mackerras
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Russell King
Cc: David Howells
Cc: Andi Kleen
Cc: "Luck, Tony"
Cc: Kyle McMartin
Cc: Grant Grundler
Cc: Matthew Wilcox
Cc: "David S. Miller"
Cc: Adam Litke
Cc: David Gibson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Benjamin Herrenschmidt
2007-05-08 03:12:57 +0800
d85f33855 Make page->private usable in compound pages ... Browse Code »

If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages. Right now we have to be careful f.e.
if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field. This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct. I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads. This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim. Compound pages cannot currently
be reclaimed. Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter
Signed-off-by: Nishanth Aravamudan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2007-05-08 03:12:53 +0800