22 Apr, 2009

1 commit

  • This fixes the following BUG:

    # mount -o size=MM -t hugetlbfs none /huge
    hugetlbfs: Bad value 'MM' for mount option 'size=MM'
    ------------[ cut here ]------------
    kernel BUG at fs/super.c:996!

    Due to

    BUG_ON(!mnt->mnt_sb);

    in vfs_kern_mount().

    Also, remove unused #include

    Cc: William Irwin
    Cc:
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

01 Apr, 2009

2 commits

  • Allow non root users with sufficient mlock rlimits to be able to allocate
    hugetlb backed shm for now. Deprecate this though. This is being
    deprecated because the mlock based rlimit checks for SHM_HUGETLB are not
    consistent with mmap based huge page allocations.

    Signed-off-by: Ravikiran Thirumalai
    Reviewed-by: Mel Gorman
    Cc: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
     
  • Fix hugetlb subsystem so that non root users belonging to
    hugetlb_shm_group can actually allocate hugetlb backed shm.

    Currently non root users cannot even map one large page using SHM_HUGETLB
    when they belong to the gid in /proc/sys/vm/hugetlb_shm_group. This is
    because allocation size is verified against RLIMIT_MEMLOCK resource limit
    even if the user belongs to hugetlb_shm_group.

    This patch
    1. Fixes hugetlb subsystem so that users with CAP_IPC_LOCK and users
    belonging to hugetlb_shm_group don't need to be restricted with
    RLIMIT_MEMLOCK resource limits
    2. Disables mlock based rlimit checking (which will
    be reinstated and marked deprecated in a subsequent patch).

    Signed-off-by: Ravikiran Thirumalai
    Reviewed-by: Mel Gorman
    Cc: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ravikiran G Thirumalai
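
    The access rule described in this commit can be sketched as a small
    predicate. This is an illustrative model only: struct shm_caller and
    sketch_can_do_hugetlb_shm() are hypothetical names, not kernel code, and
    the two booleans stand in for the real capability and group checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the caller's relevant privileges. */
struct shm_caller {
    bool has_cap_ipc_lock;      /* caller holds CAP_IPC_LOCK */
    bool in_hugetlb_shm_group;  /* caller belongs to the hugetlb_shm_group gid */
};

/*
 * Returns true when the caller may create a SHM_HUGETLB segment.
 * Neither path consults RLIMIT_MEMLOCK, matching the commit text.
 */
bool sketch_can_do_hugetlb_shm(const struct shm_caller *c)
{
    assert(c);
    return c->has_cap_ipc_lock || c->in_hugetlb_shm_group;
}
```

    With this patch alone a caller with neither privilege is refused; the
    follow-up patch reinstates the deprecated mlock rlimit path.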
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it is too easy for applications to be
    SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs, and can trigger an
    OOM storm when hugepage pools are too small, or lockups and corrupted
    counters otherwise. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jan, 2009

1 commit

  • unsigned long ret cannot be negative, but ret can get -EFAULT.

    Signed-off-by: Roel Kluin
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
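
    The class of bug fixed here can be reproduced in a few lines of user
    space C. The sketch below is not the patched kernel function; the helper
    names are hypothetical, and the signed cast assumes the usual
    two's-complement conversion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mimics the buggy pattern: a negative errno stored in an unsigned long. */
unsigned long sketch_store_error(void)
{
    unsigned long ret = (unsigned long)-EFAULT;
    assert(ret > 0);   /* the value wraps to a large positive number */
    return ret;
}

/* Checking through a signed type makes the error visible again. */
bool sketch_is_error(unsigned long ret)
{
    return (long)ret < 0;
}
```

    A test such as "ret < 0" on the unsigned variable itself can never be
    true, which is why the error went unnoticed.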
     

06 Jan, 2009

1 commit


14 Nov, 2008

3 commits

  • Wrap current->cred and a few other accessors to hide their actual
    implementation.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Wrap access to task credentials so that they can be separated more easily from
    the task_struct during the introduction of COW creds.

    Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().

    Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
    sense to use RCU directly rather than a convenient wrapper; these will be
    addressed by later patches.

    Signed-off-by: David Howells
    Reviewed-by: James Morris
    Acked-by: Serge Hallyn
    Cc: William Irwin
    Signed-off-by: James Morris

    David Howells
     

14 Oct, 2008

1 commit

  • This is a much better version of a previous patch to make the parser
    tables constant. Rather than changing the typedef, we put the "const" in
    all the various places where it's required, allowing the __initconst
    exception for nfsroot which was the cause of the previous trouble.

    This was posted for review some time ago and I believe it's been in -mm
    since then.

    Signed-off-by: Steven Whitehouse
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Steven Whitehouse
     

27 Jul, 2008

1 commit

  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexers. Nobody uses this "feature", nor does anybody use
    the passed kmem cache in a non-trivial way, so pass only a pointer to the object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

25 Jul, 2008

4 commits

  • Add the ability to configure the hugetlb hstate used on a per mount basis.

    - Add a new pagesize= option to the hugetlbfs mount that allows setting
    the page size
    - This option causes the mount code to find the hstate corresponding to the
    specified size, and sets up a pointer to the hstate in the mount's
    superblock.
    - Change the hstate accessors to use this information rather than the
    global_hstate they were using (requires a slight change in mm/memory.c
    so we don't NULL deref in the error-unmap path -- see comments).

    [np: take hstate out of hugetlbfs inode and vma->vm_private_data]

    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
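
    The lookup performed for a pagesize= mount option can be pictured as a
    search over the known huge page sizes. This is a minimal sketch under
    assumptions: struct sketch_hstate, the table contents and
    sketch_size_to_hstate() are hypothetical, and real sizes depend on the
    architecture.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-size state; the real struct hstate has more fields. */
struct sketch_hstate {
    unsigned long page_size;   /* huge page size in bytes */
};

/* Illustrative table only. */
static struct sketch_hstate sketch_hstates[] = {
    { 2UL << 20 },   /* 2 MiB */
    { 1UL << 30 },   /* 1 GiB */
};

/* Returns the entry matching size, or NULL so the mount can be rejected. */
struct sketch_hstate *sketch_size_to_hstate(unsigned long size)
{
    size_t n = sizeof(sketch_hstates) / sizeof(sketch_hstates[0]);

    for (size_t i = 0; i < n; i++)
        if (sketch_hstates[i].page_size == size)
            return &sketch_hstates[i];
    return NULL;
}
```

    The pointer found this way is what the commit describes as being stored
    in the mount's superblock.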
     
  • The goal of this patchset is to support multiple hugetlb page sizes. This
    is achieved by introducing a new struct hstate structure, which
    encapsulates the important hugetlb state and constants (eg. huge page
    size, number of huge pages currently allocated, etc).

    The hstate structure is then passed around the code which requires these
    fields, they will do the right thing regardless of the exact hstate they
    are operating on.

    This patch adds the hstate structure, with a single global instance of it
    (default_hstate), and does the basic work of converting hugetlb to use the
    hstate.

    Future patches will add more hstate structures to allow for different
    hugetlbfs mounts to have different page sizes.

    [akpm@linux-foundation.org: coding-style fixes]
    Acked-by: Adam Litke
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
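
    As a rough illustration of the idea, the state that used to be global
    can be bundled into one structure and read through accessors. The field
    names, SKETCH_BASE_PAGE_SHIFT and the accessors below are assumptions
    made for this sketch, not the kernel's definitions.

```c
#include <assert.h>

#define SKETCH_BASE_PAGE_SHIFT 12   /* assumes 4 KiB base pages */

/* Hypothetical bundle of per-size hugetlb state. */
struct sketch_hstate {
    unsigned int order;            /* huge page = base page << order */
    unsigned long nr_huge_pages;   /* huge pages currently allocated */
};

/* Code that takes an hstate no longer depends on one global page size. */
unsigned long sketch_huge_page_size(const struct sketch_hstate *h)
{
    assert(h);
    return 1UL << (SKETCH_BASE_PAGE_SHIFT + h->order);
}

unsigned long sketch_pages_per_huge_page(const struct sketch_hstate *h)
{
    assert(h);
    return 1UL << h->order;
}
```

    Passing the structure explicitly is what later allows several instances
    with different sizes to coexist.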
     
  • …n hugetlbfs will succeed

    After patch 2 in this series, a process that successfully calls mmap() for
    a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
    process calls fork(). At that point, the next write fault from the parent
    could fail due to COW if the child still has a reference.

    We only reserve pages for the parent but a copy must be made to avoid
    leaking data from the parent to the child after fork(). Reserves could be
    taken for both parent and child at fork time to guarantee faults but if
    the mapping is large it is highly likely we will not have sufficient pages
    for the reservation, and it is common to fork only to exec() immediately
    after. A failure here would be very undesirable.

    Note that the current behaviour of mainline with MAP_PRIVATE pages is
    pretty bad. The following situation is allowed to occur today.

    1. Process calls mmap(MAP_PRIVATE)
    2. Process calls mlock() to fault all pages and makes sure it succeeds
    3. Process forks()
    4. Process writes to MAP_PRIVATE mapping while child still exists
    5. If the COW fails at this point, the process gets SIGKILLed even though it
    had taken care to ensure the pages existed

    This patch improves the situation by guaranteeing the reliability of the
    process that successfully calls mmap(). When the parent performs COW, it
    will try to satisfy the allocation without using reserves. If that fails
    the parent will steal the page leaving any children without a page.
    Faults from the child after that point will result in failure. If the
    child COW happens first, an attempt will be made to allocate the page
    without reserves and the child will get SIGKILLed on failure.

    To summarise the new behaviour:

    1. If the original mapper performs COW on a private mapping with multiple
    references, it will attempt to allocate a hugepage from the pool or
    the buddy allocator without using the existing reserves. On fail, VMAs
    mapping the same area are traversed and the page being COW'd is unmapped
    where found. It will then steal the original page as the last mapper in
    the normal way.

    2. The VMAs the pages were unmapped from are flagged to note that pages
    with data no longer exist. Future no-page faults on those VMAs will
    terminate the process as otherwise it would appear that data was corrupted.
    A warning is printed to the console that this situation occurred.

    3. If the child performs COW first, it will attempt to satisfy the COW
    from the pool if there are enough pages or via the buddy allocator if
    overcommit is allowed and the buddy allocator can satisfy the request. If
    it fails, the child will be killed.

    If the pool is large enough, existing applications will not notice that
    the reserves were a factor. Existing applications depending on
    no-reserves being set are unlikely to exist as for much of the history of
    hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
    point or failing the mmap().

    [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Acked-by: Adam Litke <agl@us.ibm.com>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Cc: William Lee Irwin III <wli@holomorphy.com>
    Cc: Hugh Dickins <hugh@veritas.com>
    Cc: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
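
    The summary above reduces to a small decision table. The sketch below
    models only the outcome of a COW fault; the enum and function names are
    hypothetical, and the allocation attempt (pool or buddy allocator) is
    collapsed into a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_cow_result {
    SKETCH_COW_COPIED,          /* new huge page allocated, data copied */
    SKETCH_COW_STOLE_ORIGINAL,  /* page unmapped elsewhere, original reused */
    SKETCH_COW_KILLED           /* faulting process is terminated */
};

/*
 * is_original_mapper: the process that called mmap() and owns the reserve.
 * alloc_succeeded:    an allocation without using reserves worked.
 */
enum sketch_cow_result sketch_cow(bool is_original_mapper, bool alloc_succeeded)
{
    if (alloc_succeeded)
        return SKETCH_COW_COPIED;
    return is_original_mapper ? SKETCH_COW_STOLE_ORIGINAL : SKETCH_COW_KILLED;
}
```

    In the stolen-page case, later faults by the other mappers on that
    range are the ones that fail.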
     
  • This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
    a similar manner to the reservations taken for MAP_SHARED mappings. The
    reserve count is accounted both globally and on a per-VMA basis for
    private mappings. This guarantees that a process that successfully calls
    mmap() will successfully fault all pages in the future unless fork() is
    called.

    The characteristics of private mappings of hugetlbfs files after this
    patch are:

    1. The process calling mmap() is guaranteed to succeed all future faults until
    it forks().
    2. On fork(), the parent may die due to SIGKILL on writes to the private
    mapping if enough pages are not available for the COW. For reasonably
    reliable behaviour in the face of a small huge page pool, children of
    hugepage-aware processes should not reference the mappings; such as
    might occur when fork()ing to exec().
    3. On fork(), the child VMAs inherit no reserves. Reads on pages already
    faulted by the parent will succeed. Successful writes will depend on enough
    huge pages being free in the pool.
    4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
    and at fault time otherwise.

    Before this patch, all reads or writes in the child potentially need page
    allocations that can later lead to the death of the parent. This applies
    to reads and writes of uninstantiated pages as well as COW. After the
    patch it is only a write to an instantiated page that causes problems.

    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
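
    A toy model of the reservation accounting may help. It assumes a single
    pool with no overcommit and ignores quotas; struct sketch_pool and the
    three functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>

struct sketch_pool {
    unsigned long free_pages;   /* huge pages currently free */
    unsigned long resv_pages;   /* free pages already promised to mappers */
};

/* mmap() time: fail up front if the pages cannot be promised. */
int sketch_reserve(struct sketch_pool *p, unsigned long npages)
{
    assert(p && p->resv_pages <= p->free_pages);
    if (p->free_pages - p->resv_pages < npages)
        return -ENOMEM;
    p->resv_pages += npages;
    return 0;
}

/* Fault by the reserving process: consumes one promised page. */
void sketch_fault_with_reserve(struct sketch_pool *p)
{
    assert(p && p->resv_pages > 0 && p->free_pages > 0);
    p->resv_pages--;
    p->free_pages--;
}

/* Fault without a reserve (e.g. a child): needs an unpromised page. */
int sketch_fault_without_reserve(struct sketch_pool *p)
{
    assert(p && p->resv_pages <= p->free_pages);
    if (p->free_pages == p->resv_pages)
        return -ENOMEM;
    p->free_pages--;
    return 0;
}
```

    The reserving process can always fault, while a process without
    reserves succeeds only when unpromised pages remain.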
     

30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
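
    The flag scheme can be sketched as follows. The SKETCH_ names and bit
    values are assumptions chosen for illustration and are not copied from
    the kernel headers; only the idea of a separate writeback-accounting bit,
    a combined mask, and static inline predicates comes from the commit text.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CAP_NO_ACCT_DIRTY 0x01u   /* illustrative bit values */
#define SKETCH_CAP_NO_WRITEBACK  0x02u
#define SKETCH_CAP_NO_ACCT_WB    0x04u

/* One flag covering all three dirty/writeback related bits. */
#define SKETCH_CAP_NO_ACCT_AND_WRITEBACK \
    (SKETCH_CAP_NO_ACCT_DIRTY | SKETCH_CAP_NO_WRITEBACK | SKETCH_CAP_NO_ACCT_WB)

struct sketch_bdi {
    unsigned int capabilities;
};

/* True when per-bdi writeback statistics should be updated. */
static inline bool sketch_cap_account_writeback(const struct sketch_bdi *bdi)
{
    assert(bdi);
    return !(bdi->capabilities & SKETCH_CAP_NO_ACCT_WB);
}

/* True when dirty pages on this bdi are written back at all. */
static inline bool sketch_cap_writeback_dirty(const struct sketch_bdi *bdi)
{
    assert(bdi);
    return !(bdi->capabilities & SKETCH_CAP_NO_WRITEBACK);
}
```

    A memory-backed filesystem would typically set the combined mask so
    that every predicate reports false.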
     

28 Apr, 2008

2 commits

  • This patch replaces the mempolicy mode, mode_flags, and nodemask in the
    shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
    This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
    inode.c and simplifies the interfaces.

    mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
    pointer arg, a struct mempolicy pointer on success. For MPOL_DEFAULT, the
    returned pointer is NULL. Further, mpol_parse_str() now takes a 'no_context'
    argument that causes the input nodemask to be stored in the w.user_nodemask of
    the created mempolicy for use when the mempolicy is installed in a tmpfs inode
    shared policy tree. At that time, any cpuset contextualization is applied to
    the original input nodemask. This preserves the previous behavior where the
    input nodemask was stored in the superblock. We can think of the returned
    mempolicy as "context free".

    Because mpol_parse_str() is now calling mpol_new(), we can remove from
    mpol_to_str() the semantic checks that mpol_new() already performs.

    Add 'no_context' parameter to mpol_to_str() to specify that it should format
    the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.

    Change mpol_shared_policy_init() to take a pointer to a "context free" struct
    mempolicy and to create a new, "contextualized" mempolicy using the mode,
    mode_flags and user_nodemask from the input mempolicy.

    Note: we know that the mempolicy passed to mpol_to_str() or
    mpol_shared_policy_init() from a tmpfs superblock is "context free". This
    is currently the only instance thereof. However, if we found more uses for
    this concept, and introduced any ambiguity as to whether a mempolicy was
    context free or not, we could add another internal mode flag to identify
    context free mempolicies. Then, we could remove the 'no_context' argument
    from mpol_to_str().

    Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
    if one exists, to pass to mpol_shared_policy_init(). We must add the
    reference under the sb stat_lock to prevent races with replacement of the mpol
    by remount. This reference is removed in mpol_shared_policy_init().

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: another build fix]
    [akpm@linux-foundation.org: yet another build fix]
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • With the evolution of mempolicies, it is necessary to support mempolicy mode
    flags that specify how the policy shall behave in certain circumstances. The
    most immediate need for mode flag support is to suppress remapping the
    nodemask of a policy at the time of rebind.

    Both the mempolicy mode and flags are passed by the user in the 'int policy'
    formal of either the set_mempolicy() or mbind() syscall. A new constant,
    MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
    passed as part of this int. Mempolicies that include illegal flags as part of
    their policy are rejected as invalid.

    An additional member to struct mempolicy is added to support the mode flags:

    struct mempolicy {
    ...
    unsigned short policy;
    unsigned short flags;
    }

    The splitting of the 'int' actual passed by the user is done in
    sys_set_mempolicy() and sys_mbind() for their respective syscalls. This is
    done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
    there are additional flags, and storing it in the new 'flags' member of struct
    mempolicy. The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
    the 'policy' member of the struct and all current users of pol->policy remain
    unchanged.

    The union of the policy mode and optional mode flags is passed back to the
    user in get_mempolicy().

    This combination of mode and flags within the same actual does not break
    userspace code that relies on get_mempolicy(&policy, ...) and either

    switch (policy) {
    case MPOL_BIND:
    ...
    case MPOL_INTERLEAVE:
    ...
    };

    statements or

    if (policy == MPOL_INTERLEAVE) {
    ...
    }

    statements. Such applications would need to use optional mode flags when
    calling set_mempolicy() or mbind() for these previously implemented statements
    to stop working. If an application does start using optional mode flags, it
    will need to mask the optional flags off the policy in switch and conditional
    statements that only test mode.

    An additional member is also added to struct shmem_sb_info to store the
    optional mode flags.

    [hugh@veritas.com: shmem mpol: fix build warning]
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Signed-off-by: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
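
    The split of the user's int into mode and flags can be sketched like
    this. The SK_ constants are placeholders: the real mode numbering and the
    actual optional flag bits are not given in the text above, so a single
    made-up flag stands in for MPOL_MODE_FLAGS.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder modes; SK_MPOL_MAX is one past the last valid mode. */
enum { SK_MPOL_DEFAULT, SK_MPOL_PREFERRED, SK_MPOL_BIND, SK_MPOL_INTERLEAVE,
       SK_MPOL_MAX };

#define SK_MPOL_F_EXAMPLE  (1 << 15)            /* hypothetical optional flag */
#define SK_MPOL_MODE_FLAGS (SK_MPOL_F_EXAMPLE)  /* union of legal flags */

/* Splits policy into mode and flags; unknown bits end up in the mode and
 * make it invalid, so the call is rejected with -EINVAL. */
int sketch_split_policy(int policy, unsigned short *mode, unsigned short *flags)
{
    assert(mode && flags);
    int m = policy & ~SK_MPOL_MODE_FLAGS;

    if (m < 0 || m >= SK_MPOL_MAX)
        return -EINVAL;
    *mode = (unsigned short)m;
    *flags = (unsigned short)(policy & SK_MPOL_MODE_FLAGS);
    return 0;
}
```

    User space that starts passing optional flags would apply the same mask
    before comparing the value returned by get_mempolicy() against a mode.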
     

19 Mar, 2008

1 commit


09 Feb, 2008

1 commit

  • Add a .show_options super operation to hugetlbfs.

    Use generic_show_options() and save the complete option string in
    hugetlbfs_fill_super().

    Signed-off-by: Miklos Szeredi
    Cc: Adam Litke
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: William Lee Irwin III
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

06 Feb, 2008

1 commit

  • Allow sticky directory mount option for hugetlbfs. This allows an admin
    to create a shared hugetlbfs mount point for multiple users, while
    preventing users from accidentally deleting each other's files.
    It is similar to the default tmpfs mount option, or the typical option used
    on /tmp.

    Signed-off-by: Ken Chen
    Cc: Badari Pulavarty
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

15 Nov, 2007

2 commits

  • Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
    allow bulk updating of the sbinfo->free_blocks counter. This will be used by
    the next patch in the series.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
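
    A bulk quota update with a delta parameter might look like the sketch
    below. It is a simplified model: locking is omitted, the sketch_ names
    are hypothetical, and treating a negative free_blocks as "no limit" is
    an assumption of this example.

```c
#include <assert.h>
#include <errno.h>

struct sketch_sbinfo {
    long free_blocks;   /* remaining huge pages; negative means no limit */
};

/* Debit delta pages at once, or fail without changing the counter. */
int sketch_get_quota(struct sketch_sbinfo *sb, long delta)
{
    assert(sb && delta >= 0);
    if (sb->free_blocks < 0)
        return 0;               /* unlimited mount */
    if (sb->free_blocks < delta)
        return -ENOMEM;
    sb->free_blocks -= delta;
    return 0;
}

/* Credit delta pages back in one step. */
void sketch_put_quota(struct sketch_sbinfo *sb, long delta)
{
    assert(sb && delta >= 0);
    if (sb->free_blocks >= 0)
        sb->free_blocks += delta;
}
```

    Taking the whole delta in one call avoids a partially debited counter
    when a multi-page request cannot be satisfied.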
     
  • The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
    mappings when that support was added. Currently, quota is debited at page
    instantiation and credited at file truncation. This approach works correctly
    for shared pages but is incomplete for private pages. In addition to
    hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
    this function does not respect quotas.

    Private huge pages are treated very much like normal, anonymous pages. They
    are not "backed" by the hugetlbfs file and are not stored in the mapping's
    radix tree. This means that private pages are invisible to
    truncate_hugepages() so that function will not credit the quota.

    This patch (based on a prototype provided by Ken Chen) moves quota crediting
    for all pages into free_huge_page(). page->private is used to store a pointer
    to the mapping to which this page belongs. This is used to credit quota on
    the appropriate hugetlbfs instance.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

17 Oct, 2007

7 commits

  • Why do we need r/o bind mounts?

    This feature allows a read-only view into a read-write filesystem. In the
    process of doing that, it also provides infrastructure for keeping track of
    the number of writers to any given mount.

    This has a number of uses. It allows chroots to have parts of filesystems
    writable. It will be useful for containers in the future because users may
    have root inside a container, but should not be allowed to write to
    some filesystems. This also replaces patches that vserver has had out of the
    tree for several years.

    It allows security enhancement by making sure that parts of your filesystem
    are read-only (such as when you don't trust your FTP server), when you don't want
    to have entire new filesystems mounted, or when you want atime selectively
    updated. I've been using the following script to test that the feature is
    working as desired. It takes a directory and makes a regular bind and a r/o
    bind mount of it. It then performs some normal filesystem operations on the
    three directories, including ones that are expected to fail, like creating a
    file on the r/o mount.

    This patch:

    Some filesystems forego the vfs and may_open() and create their own 'struct
    file's.

    This patch creates a couple of helper functions which can be used by these
    filesystems, and will provide a unified place which the r/o bind mount code
    may patch.

    Also, rename an existing, static-scope init_file() to a less generic name.

    Signed-off-by: Dave Hansen
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • I_LOCK was used for several unrelated purposes, which caused deadlock
    situations in certain filesystems as a side effect. One of the purposes
    now uses the new I_SYNC bit.

    Also document the various bits and change their order from historical to
    logical.

    [bunk@stusta.de: make fs/inode.c:wake_up_inode() static]
    Signed-off-by: Joern Engel
    Cc: Dave Kleikamp
    Cc: David Chinner
    Cc: Anton Altaparmakov
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joern Engel
     
  • Slab constructors currently have a flags parameter that is never used. And
    the order of the arguments is opposite to other slab functions. The object
    pointer is placed before the kmem_cache pointer.

    Convert

    ctor(void *object, struct kmem_cache *s, unsigned long flags)

    to

    ctor(struct kmem_cache *s, void *object)

    throughout the kernel

    [akpm@linux-foundation.org: coupla fixes]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • provide BDI constructor/destructor hooks

    [akpm@linux-foundation.org: compile fix]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Support for reading from hugetlbfs files. libhugetlbfs lets application
    text/data be placed in large pages. When we do that, oprofile doesn't
    work, since libbfd tries to read from it.

    This code is very similar to what do_generic_mapping_read() does, but I
    can't use it since it has PAGE_CACHE_SIZE assumptions.

    [akpm@linux-foundation.org: cleanups, fix leak]
    [bunk@stusta.de: make hugetlbfs_read() static]
    Signed-off-by: Badari Pulavarty
    Acked-by: William Irwin
    Tested-by: Nishanth Aravamudan
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Badari Pulavarty
     
  • For historical reasons, an expanding ftruncate that increases file size on
    hugetlbfs is not allowed, because pages were pre-faulted and there was no
    fault handler. Now that we have had demand faulting on hugetlb since 2.6.15,
    there is no reason to keep that limitation.

    This will make hugetlbfs behave more like a normal fs. I'm writing
    user-level code that uses hugetlbfs but will fall back to tmpfs if there are
    no hugetlb pages available in the system. Having hugetlbfs specific ftruncate
    behavior is a bit quirky and I would like to remove that artificial
    limitation.

    Signed-off-by:
    Acked-by: William Irwin
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • Implement new aops for some of the simpler filesystems.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

31 Aug, 2007

1 commit

  • For hugepage mappings, the file offset, like the address and size, needs to
    be aligned to the size of a hugepage.

    In commit 68589bc353037f233fe510ad9ff432338c95db66, the check for this was
    moved into prepare_hugepage_range() along with the address and size checks.
    But since BenH's rework of the get_unmapped_area() paths leading up to
    commit 4b1d89290b62bb2db476c94c82cf7442aab440c8, prepare_hugepage_range()
    is only called for MAP_FIXED mappings, not for other mappings. This means
    we're no longer ever checking for an aligned offset - I've confirmed that
    mmap() will (apparently) succeed with a misaligned offset on both powerpc
    and i386 at least.

    This patch restores the check, removing it from prepare_hugepage_range()
    and putting it back into hugetlbfs_file_mmap(). I'm putting it there,
    rather than in the get_unmapped_area() path, so it only needs to go in one
    place rather than separately in the half-dozen or so arch-specific
    implementations of hugetlb_get_unmapped_area().

    Signed-off-by: David Gibson
    Cc: Adam Litke
    Cc: Andi Kleen
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
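
    The restored check amounts to a mask test on the file offset. The
    helper below is a user space sketch with a hypothetical name; it assumes
    the huge page size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when offset is a multiple of the huge page size. */
bool sketch_offset_is_aligned(unsigned long long offset,
                              unsigned long hpage_size)
{
    /* The mask trick is only valid for power-of-two sizes. */
    assert(hpage_size != 0 && (hpage_size & (hpage_size - 1)) == 0);
    return (offset & (hpage_size - 1)) == 0;
}
```

    An mmap() request whose offset fails this test would be refused rather
    than appearing to succeed.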
     

20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

17 Jul, 2007

2 commits

  • I was seeing a null pointer deref in fs/super.c:vfs_kern_mount().
    Some file system get_sb() handler was returning NULL mnt_sb with
    a non-negative return value. I also noticed a "hugetlbfs: Bad
    mount option:" message in the log.

    Turns out that hugetlbfs_parse_options() was not checking for an
    empty option string after call to strsep(). On failure,
    hugetlbfs_parse_options() returns 1. hugetlbfs_fill_super() just
    passed this return code back up the call stack where
    vfs_kern_mount() missed the error and proceeded with a NULL mnt_sb.

    Apparently introduced by patch:
    hugetlbfs-use-lib-parser-fix-docs.patch

    The problem was exposed by this line in my fstab:

    none /huge hugetlbfs defaults 0 0

    It can also be demonstrated by invoking mount of hugetlbfs
    directly with no options or a bogus option.

    This patch:

    1) adds the check for empty option to hugetlbfs_parse_options(),
    2) enhances the error message to bracket any unrecognized
    option with quotes,
    3) modifies hugetlbfs_parse_options() to return -EINVAL on any
    unrecognized option,
    4) adds a BUG_ON() to vfs_kern_mount() to catch any get_sb()
    handler that returns a NULL mnt->mnt_sb with a return value
    >= 0.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
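
    The parsing rules listed above can be sketched in portable C. This is
    not the kernel function: the kernel code uses strsep() and lib/parser.c,
    whereas this example splits with strchr() to stay within ISO C, and
    sketch_parse_options() and sketch_known_key() are hypothetical names.
    Values are not validated here, only option names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Returns 1 if the token starts with a recognised "key=" and has a value. */
static int sketch_known_key(const char *tok, size_t len)
{
    static const char *const keys[] = {
        "uid=", "gid=", "mode=", "size=", "nr_inodes="
    };

    for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
        size_t klen = strlen(keys[i]);
        if (len > klen && strncmp(tok, keys[i], klen) == 0)
            return 1;
    }
    return 0;
}

/* Returns 0 on success and -EINVAL (never a positive value) on error. */
int sketch_parse_options(const char *options)
{
    if (!options)
        return 0;

    const char *p = options;
    while (*p) {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        /* Empty tokens are skipped instead of being reported as bad. */
        if (len != 0 && !sketch_known_key(p, len))
            return -EINVAL;
        p += len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

    Returning a negative errno matters because a caller that tests only for
    values below zero would treat a return of 1 as success.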
     
  • Use lib/parser.c to parse hugetlbfs mount options. Correct docs in
    hugetlbpage.txt.

    old size of hugetlbfs_fill_super: 675 bytes
    new size of hugetlbfs_fill_super: 686 bytes
    (hugetlbfs_parse_options() is inlined)

    Signed-off-by: Randy Dunlap
    Cc: Hugh Dickins
    Cc: David Gibson
    Cc: Adam Litke
    Acked-by: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Jun, 2007

1 commit

  • Some user space tools need to identify SYSV shared memory when examining
    /proc//maps. To do so they look for a block device with major zero, a
    dentry named SYSV, and having the minor of the internal sysv
    shared memory kernel mount.

    To help these tools and to make it easier for people just browsing
    /proc//maps this patch modifies hugetlb sysv shared memory to use the
    SYSV dentry naming convention.

    User space tools will still have to be aware that hugetlb sysv shared
    memory lives on a different internal kernel mount and so has a different
    block device minor number from the rest of sysv shared memory.

    Signed-off-by: Eric W. Biederman
    Cc: "Serge E. Hallyn"
    Cc: Albert Cahalan
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

17 May, 2007

1 commit

  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 May, 2007

4 commits

  • If hugetlbfs module_init() fails, hugetlbfs_vfsmount is not initialized and
    shmget() with the SHM_HUGETLB flag will cause a NULL pointer dereference.

    Signed-off-by: Akinobu Mita
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that manipulates
    the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG, otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of, difficult to understand and there are easier ways to accomplish the
    same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling
    prepare_hugepage_range()

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • If we add a new flag so that we can distinguish between the first page and the
    tail pages then we can avoid using page->private in the first page.
    page->private == page for the first page, so there is no real information in
    there.

    Freeing up page->private makes the use of compound pages more transparent.
    They become more usable like real pages. Right now we have to be careful, e.g.,
    if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
    can then no longer use the private field. This is one of the issues that
    cause us not to support debugging for page size slabs in SLAB.

    Having page->private available for SLUB would allow more meta information in
    the page struct. I can probably avoid the 16 bit ints that I have in there
    right now.

    Also if page->private is available then a compound page may be equipped with
    buffer heads. This may clear the way for filesystems to support larger
    blocks than page size.

    We add PageTail as an alias of PageReclaim. Compound pages cannot currently
    be reclaimed. Because of the alias one needs to check PageCompound first.

    The RFC for this approach was discussed at
    http://marc.info/?t=117574302800001&r=1&w=2

    [nacc@us.ibm.com: fix hugetlbfs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
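
    The ordering constraint in the commit above (check PageCompound before PageTail, because the tail flag shares a bit with the reclaim flag) can be sketched as follows. The flag values, struct page_sketch and is_tail_page() are hypothetical stand-ins, not the kernel's definitions.

```c
/* Hypothetical flag bits; the tail flag deliberately aliases the reclaim bit. */
enum {
    PG_COMPOUND = 1 << 0,
    PG_RECLAIM  = 1 << 1,
    PG_TAIL     = PG_RECLAIM
};

struct page_sketch {
    unsigned int flags;
};

/* The shared bit means "tail" only when the page is part of a compound
 * page, so the compound test has to come first. */
int is_tail_page(const struct page_sketch *page)
{
    if (!(page->flags & PG_COMPOUND))
        return 0;                /* here the bit would mean "reclaim" */
    return (page->flags & PG_TAIL) != 0;
}
```

    A page that is not compound but has the shared bit set is reported as not being a tail page, which is the case the alias would otherwise get wrong.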