20 Jul, 2007

1 commit

  • Slab destructors were no longer supported after Christoph's
    c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been
    BUGs for both slab and slub, and slob never supported them
    either.

    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).

    Signed-off-by: Paul Mundt

    Paul Mundt
     

17 Jul, 2007

2 commits

  • I was seeing a null pointer deref in fs/super.c:vfs_kern_mount().
    Some file system get_sb() handler was returning NULL mnt_sb with
    a non-negative return value. I also noticed a "hugetlbfs: Bad
    mount option:" message in the log.

    It turns out that hugetlbfs_parse_options() was not checking for an
    empty option string after the call to strsep(). On failure,
    hugetlbfs_parse_options() returns 1. hugetlbfs_fill_super() just
    passed this return code back up the call stack, where
    vfs_kern_mount() missed the error and proceeded with a NULL mnt_sb.

    Apparently introduced by patch:
    hugetlbfs-use-lib-parser-fix-docs.patch

    The problem was exposed by this line in my fstab:

    none /huge hugetlbfs defaults 0 0

    It can also be demonstrated by invoking mount of hugetlbfs
    directly with no options or a bogus option.

    This patch:

    1) adds the check for empty option to hugetlbfs_parse_options(),
    2) enhances the error message to bracket any unrecognized
    option with quotes,
    3) modifies hugetlbfs_parse_options() to return -EINVAL on any
    unrecognized option,
    4) adds a BUG_ON() to vfs_kern_mount() to catch any get_sb()
    handler that returns a NULL mnt->mnt_sb with a return value
    >= 0.

    Signed-off-by: Lee Schermerhorn
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Use lib/parser.c to parse hugetlbfs mount options. Correct docs in
    hugetlbpage.txt.

    old size of hugetlbfs_fill_super: 675 bytes
    new size of hugetlbfs_fill_super: 686 bytes
    (hugetlbfs_parse_options() is inlined)

    Signed-off-by: Randy Dunlap
    Cc: Hugh Dickins
    Cc: David Gibson
    Cc: Adam Litke
    Acked-by: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
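
The empty-option check described in the 17 Jul, 2007 fix can be sketched in user space. This is a minimal model, not the kernel code: next_token() is a portable stand-in for strsep() with a single delimiter, and the option table and the name parse_options_sketch() are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Portable stand-in for strsep() with one delimiter character: returns the
 * next token (possibly empty) and advances *cursor, or returns NULL once
 * the input is exhausted.
 */
static char *next_token(char **cursor, char delim)
{
    char *start;
    char *end;

    assert(cursor != NULL);
    start = *cursor;
    if (start == NULL)
        return NULL;
    end = strchr(start, delim);
    if (end != NULL) {
        *end = '\0';
        *cursor = end + 1;
    } else {
        *cursor = NULL;
    }
    return start;
}

/* Illustrative option table: prefixes only, values are not validated. */
static int is_known_option(const char *tok)
{
    static const char *const known[] = {
        "size=", "nr_inodes=", "mode=", "uid=", "gid="
    };
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
        if (strncmp(tok, known[i], strlen(known[i])) == 0)
            return 1;
    return 0;
}

/* Returns 0 on success and -EINVAL on any unrecognized option. */
int parse_options_sketch(char *options)
{
    char *tok;

    while ((tok = next_token(&options, ',')) != NULL) {
        if (*tok == '\0')
            continue;           /* skip empty options instead of failing */
        if (!is_known_option(tok))
            return -EINVAL;     /* negative errno, so callers see an error */
    }
    return 0;
}
```

Returning a negative errno rather than 1 matters here because the callers only treat negative values as failure.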
     

17 Jun, 2007

1 commit

  • Some user space tools need to identify SYSV shared memory when examining
    /proc/<pid>/maps. To do so they look for a block device with major zero, a
    dentry named SYSV, and having the minor of the internal sysv
    shared memory kernel mount.

    To help these tools and to make it easier for people just browsing
    /proc/<pid>/maps, this patch modifies hugetlb sysv shared memory to use the
    SYSV dentry naming convention.

    User space tools will still have to be aware that hugetlb sysv shared
    memory lives on a different internal kernel mount and so has a different
    block device minor number from the rest of sysv shared memory.

    Signed-off-by: Eric W. Biederman
    Cc: "Serge E. Hallyn"
    Cc: Albert Cahalan
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

17 May, 2007

1 commit

  • SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

    Signed-off-by: Christoph Lameter
    Cc: David Howells
    Cc: Jens Axboe
    Cc: Steven French
    Cc: Michael Halcrow
    Cc: OGAWA Hirofumi
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Roman Zippel
    Cc: David Woodhouse
    Cc: Dave Kleikamp
    Cc: Trond Myklebust
    Cc: "J. Bruce Fields"
    Cc: Anton Altaparmakov
    Cc: Mark Fasheh
    Cc: Paul Mackerras
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: David Chinner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

08 May, 2007

5 commits

  • If hugetlbfs module_init() fails, hugetlbfs_vfsmount is not initialized and
    shmget() with the SHM_HUGETLB flag will cause a NULL pointer dereference.

    Signed-off-by: Akinobu Mita
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • I have never seen a use of SLAB_DEBUG_INITIAL. It is only supported by
    SLAB.

    I think its purpose was to have a callback after an object has been freed
    to verify that the state is the constructor state again? The callback is
    performed before each freeing of an object.

    I would think that it is much easier to check the object state manually
    before the free. That also places the check near the code that
    manipulates the object.

    Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
    compiled with SLAB debugging on. If there were code in a constructor
    handling SLAB_DEBUG_INITIAL then it would have to be conditional on
    SLAB_DEBUG; otherwise it would just be dead code. But there is no such code
    in the kernel. I think SLAB_DEBUG_INITIAL is too problematic to make real
    use of and difficult to understand, and there are easier ways to accomplish
    the same effect (i.e. add debug code before kfree).

    There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
    clear in fs inode caches. Remove the pointless checks (they would even be
    pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.

    This is the last slab flag that SLUB did not support. Remove the check for
    unimplemented flags from SLUB.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Generic hugetlb_get_unmapped_area() now handles MAP_FIXED by just calling
    prepare_hugepage_range()

    Signed-off-by: Benjamin Herrenschmidt
    Acked-by: William Irwin
    Cc: Paul Mackerras
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: David Howells
    Cc: Andi Kleen
    Cc: "Luck, Tony"
    Cc: Kyle McMartin
    Cc: Grant Grundler
    Cc: Matthew Wilcox
    Cc: "David S. Miller"
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • If we add a new flag so that we can distinguish between the first page and the
    tail pages then we can avoid using page->private in the first page.
    page->private == page for the first page, so there is no real information in
    there.

    Freeing up page->private makes the use of compound pages more transparent.
    They become more usable like real pages. Right now we have to be careful, e.g.
    if we are going beyond PAGE_SIZE allocations in the slab on i386, because we
    can then no longer use the private field. This is one of the issues that
    cause us not to support debugging for page size slabs in SLAB.

    Having page->private available for SLUB would allow more meta information in
    the page struct. I can probably avoid the 16 bit ints that I have in there
    right now.

    Also if page->private is available then a compound page may be equipped with
    buffer heads. This may free up the way for filesystems to support larger
    blocks than page size.

    We add PageTail as an alias of PageReclaim. Compound pages cannot currently
    be reclaimed. Because of the alias one needs to check PageCompound first.

    The RFC for this approach was discussed at
    http://marc.info/?t=117574302800001&r=1&w=2

    [nacc@us.ibm.com: fix hugetlbfs]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Add a proper prototype for hugetlb_get_unmapped_area() in
    include/linux/hugetlb.h.

    Signed-off-by: Adrian Bunk
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     

13 Feb, 2007

2 commits

  • This patch is inspired by Arjan's "Patch series to mark struct
    file_operations and struct inode_operations const".

    Compile tested with gcc & sparse.

    Signed-off-by: Josef 'Jeff' Sipek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef 'Jeff' Sipek
     
  • Many struct inode_operations in the kernel can be "const". Marking them const
    moves these to the .rodata section, which avoids false sharing with potential
    dirty data. In addition it'll catch accidental writes at compile time to
    these shared resources.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

10 Feb, 2007

1 commit

  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of the huge_pte when unmapping a hugepage range. This causes data
    corruption when the sysadmin uses drop_caches. For example, an application
    creates a hugetlb file, modifies pages, then unmaps it. While the
    hugetlb file is still alive, the sysadmin comes along and runs "echo 3 >
    /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty if
    there are no active mappings. When the application later maps the hugetlb
    file again, all its data are gone, which is catastrophic for the
    application.

    Not only that, the internal resv_huge_pages count will also get all messed
    up. Fix it up by marking page dirty appropriately.

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

22 Dec, 2006

1 commit

  • They were horribly easy to mis-use because of their tempting naming, and
    they also did way more than any users of them generally wanted them to
    do.

    A dirty page can become clean under two circumstances:

    (a) when we write it out. We have "clear_page_dirty_for_io()" for
    this, and that function remains unchanged.

    In the "for IO" case it is not sufficient to just clear the dirty
    bit, you also have to mark the page as being under writeback etc.

    (b) when we actually remove a page due to it becoming inaccessible to
    users, notably because it was truncate()'d away or the file (or
    metadata) no longer exists, and we thus want to cancel any
    outstanding dirty state.

    For the (b) case, we now introduce "cancel_dirty_page()", which only
    touches the page state itself, and verifies that the page is not mapped
    (since cancelling writes on a mapped page would be actively wrong as it
    is still accessible to users).

    Some filesystems need to be fixed up for this: CIFS, FUSE, JFS,
    ReiserFS, XFS all use the old confusing functions, and will be fixed
    separately in subsequent commits (with some of them just removing the
    offending logic, and others using clear_page_dirty_for_io()).

    This was confirmed by Martin Michlmayr to fix the apt database
    corruption on ARM.

    Cc: Martin Michlmayr
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Cc: Arjan van de Ven
    Cc: Andrei Popa
    Cc: Andrew Morton
    Cc: Dave Kleikamp
    Cc: Gordon Farquharson
    Cc: Martin Schwidefsky
    Cc: Trond Myklebust
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Dec, 2006

1 commit


08 Dec, 2006

2 commits

  • Replace all uses of kmem_cache_t with struct kmem_cache.

    The patch was generated using the following script:

    #!/bin/sh
    #
    # Replace one string by another in all the kernel sources.
    #

    set -e

    for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do
    quilt add $file
    sed -e "1,\$s/$1/$2/g" $file >/tmp/$$
    mv /tmp/$$ $file
    quilt refresh
    done

    The script was run like this

    sh replace kmem_cache_t "struct kmem_cache"

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • SLAB_KERNEL is an alias of GFP_KERNEL.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

15 Nov, 2006

1 commit

  • (David:)

    If hugetlbfs_file_mmap() returns a failure to do_mmap_pgoff() - for example,
    because the given file offset is not hugepage aligned - then do_mmap_pgoff
    will go to the unmap_and_free_vma backout path.

    But at this stage the vma hasn't been marked as hugepage, and the backout path
    will call unmap_region() on it. That will eventually call down to the
    non-hugepage version of unmap_page_range(). On ppc64, at least, that will
    cause serious problems if there are any existing hugepage pagetable entries in
    the vicinity - for example if there are any other hugepage mappings under the
    same PUD. unmap_page_range() will trigger a bad_pud() on the hugepage pud
    entries. I suspect this will also cause bad problems on ia64, though I don't
    have a machine to test it on.

    (Hugh:)

    prepare_hugepage_range() should check file offset alignment when it checks
    virtual address and length, to stop MAP_FIXED with a bad huge offset from
    unmapping before it fails further down. PowerPC should apply the same
    prepare_hugepage_range alignment checks as ia64 and all the others do.

    Then none of the alignment checks in hugetlbfs_file_mmap are required (nor
    is the check for too small a mapping); but even so, move up setting of
    VM_HUGETLB and add a comment to warn of what David Gibson discovered - if
    hugetlbfs_file_mmap fails before setting it, do_mmap_pgoff's unmap_region
    when unwinding from error will go the non-huge way, which may cause bad
    behaviour on architectures (powerpc and ia64) which segregate their huge
    mappings into a separate region of the address space.

    Signed-off-by: Hugh Dickins
    Cc: "Luck, Tony"
    Cc: "David S. Miller"
    Acked-by: Adam Litke
    Acked-by: David Gibson
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
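
The alignment rule Hugh describes for prepare_hugepage_range() amounts to rejecting any request whose address, length or file offset is not a multiple of the huge page size. A minimal sketch, assuming a 2 MiB huge page and using byte offsets; the SK_ constants and the function name are illustrative, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SK_HPAGE_SHIFT 21                              /* assumed: 2 MiB */
#define SK_HPAGE_SIZE  ((uint64_t)1 << SK_HPAGE_SHIFT)
#define SK_HPAGE_MASK  (~(SK_HPAGE_SIZE - 1))

/*
 * Check all three quantities up front, so a MAP_FIXED request with a bad
 * file offset is refused before anything has been unmapped.
 */
int prepare_hugepage_range_sketch(uint64_t addr, uint64_t len,
                                  uint64_t file_offset)
{
    if (file_offset & ~SK_HPAGE_MASK)
        return -EINVAL;
    if (len & ~SK_HPAGE_MASK)
        return -EINVAL;
    if (addr & ~SK_HPAGE_MASK)
        return -EINVAL;
    return 0;
}
```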
     

29 Oct, 2006

2 commits

  • hugetlb_vmtruncate_list was misconverted to prio_tree: its prio_tree is in
    units of PAGE_SIZE (PAGE_CACHE_SIZE) like any other, not HPAGE_SIZE (whereas
    its radix_tree is kept in units of HPAGE_SIZE, otherwise slots would be
    absurdly sparse).

    At first I thought the error benign, just calling __unmap_hugepage_range on
    more vmas than necessary; but on 32-bit machines, when the prio_tree is
    searched correctly, it happens to ensure the v_offset calculation won't
    overflow. As it stood, when truncating at or beyond 4GB, it was liable to
    discard pages COWed from lower offsets; or even to clear pmd entries of
    preceding vmas, triggering exit_mmap's BUG_ON(nr_ptes).

    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: David Gibson
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • On 32-bit machines, mount -t hugetlbfs -o size=4G gave a 0GB filesystem,
    size=5G gave a 1GB filesystem etc: there's no point in masking size with
    HPAGE_MASK just before shifting its lower bits away, and since HPAGE_MASK is a
    UL, that removed all the higher bits of the unsigned long long size.

    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: David Gibson
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
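
The size= truncation in the second 29 Oct, 2006 commit is ordinary integer-width arithmetic and can be reproduced outside the kernel. In this sketch uint32_t stands in for a 32-bit unsigned long, and the 4 MiB huge page size (shift of 22) is an assumption; the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define SK_HPAGE_SHIFT 22   /* assumed: 4 MiB huge pages */

/* uint32_t plays the role of "unsigned long" on a 32-bit machine. */
static const uint32_t sk_hpage_mask32 =
    ~((UINT32_C(1) << SK_HPAGE_SHIFT) - 1);

/*
 * Old behaviour: the 32-bit mask is zero-extended to 64 bits before the
 * AND, so every bit of size above bit 31 is cleared.
 */
uint64_t pages_masked_sketch(uint64_t size)
{
    return (size & sk_hpage_mask32) >> SK_HPAGE_SHIFT;
}

/* Fixed behaviour: the shift already discards the low bits. */
uint64_t pages_unmasked_sketch(uint64_t size)
{
    return size >> SK_HPAGE_SHIFT;
}
```

With these assumptions a 4 GiB request yields zero pages through the masked path, and a 5 GiB request yields the same page count as 1 GiB, matching the symptoms in the commit message.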
     

12 Oct, 2006

1 commit

  • commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops with
    the libhugetlbfs test suite. The problem is that hugetlb pages can be shared
    by multiple mappings. Multiple threads can fight over page->lru in the
    unmap path and bad things happen. We now serialize __unmap_hugepage_range
    to avoid concurrent linked list manipulation. Such serialization is also
    needed for shared page table pages in a hugetlb area. This patch fixes
    the bug and also serves as a prepatch for shared page tables.

    Signed-off-by: Ken Chen
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

01 Oct, 2006

1 commit


30 Sep, 2006

1 commit


27 Sep, 2006

1 commit

  • This eliminates the i_blksize field from struct inode. Filesystems that want
    to provide a per-inode st_blksize can do so by providing their own getattr
    routine instead of using the generic_fillattr() function.

    Note that some filesystems were providing pretty much random (and incorrect)
    values for i_blksize.

    [bunk@stusta.de: cleanup]
    [akpm@osdl.org: generic_fillattr() fix]
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

11 Jul, 2006

1 commit

  • Sometimes applications need the calls below to succeed even though
    "/mnt/hugepages/file1" doesn't exist.

    fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
    *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);

    For regular pages (or files) the calls above work, but for huge pages
    they fail because hugetlbfs_file_mmap fails if
    (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).

    This capability for huge pages is useful on ia64 when a process wants to
    protect one area in region 4 so that other threads cannot read or write
    it. A famous JVM (Java Virtual Machine) implementation on IA64 needs the
    capability.

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: Hugh Dickins
    [ Expand-on-mmap semantics again... this time matching normal fs's. wli ]
    Acked-by: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     

29 Jun, 2006

1 commit


23 Jun, 2006

3 commits

  • The current hugetlb strict accounting for shared mappings always assumes the
    mapping starts at file offset zero and reserves pages between zero and the
    size of the file. This assumption often reserves (or locks down) many more
    pages than necessary if the application maps at a non-zero file offset.
    libhugetlbfs is one example that requires proper reservation for shared
    mappings that start at a non-zero offset.

    This patch extends the reservation and hugetlb strict accounting to support
    any arbitrary pair of (offset, len), resulting in a much more robust and
    accurate scheme. More importantly, it won't lock down any hugetlb pages
    outside the file mapping.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Give the statfs superblock operation a dentry pointer rather than a superblock
    pointer.

    This complements the get_sb() patch. That reduced the significance of
    sb->s_root, allowing NFS to place a fake root there. However, NFS does
    require a dentry to use as a target for the statfs operation. This permits
    the root in the vfsmount to be used instead.

    linux/mount.h has been added where necessary to make allyesconfig build
    successfully.

    Interest has also been expressed for use with the FUSE and XFS filesystems.

    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Extend the get_sb() filesystem operation to take an extra argument that
    permits the VFS to pass in the target vfsmount that defines the mountpoint.

    The filesystem is then required to manually set the superblock and root dentry
    pointers. For most filesystems, this should be done with simple_set_mnt()
    which will set the superblock pointer and then set the root dentry to the
    superblock's s_root (as per the old default behaviour).

    The get_sb() op now returns an integer as there's now no need to return the
    superblock pointer.

    This patch permits a superblock to be implicitly shared amongst several mount
    points, such as can be done with NFS to avoid potential inode aliasing. In
    such a case, simple_set_mnt() would not be called, and instead the mnt_root
    and mnt_sb would be set directly.

    The patch also makes the following changes:

    (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount
    pointer argument and return an integer, so most filesystems have to change
    very little.

    (*) If one of the convenience functions is not used, then get_sb() should
    normally call simple_set_mnt() to instantiate the vfsmount. This will
    always return 0, and so can be tail-called from get_sb().

    (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the
    dcache upon superblock destruction rather than shrink_dcache_anon().

    This is required because the superblock may now have multiple trees that
    aren't actually bound to s_root, but that still need to be cleaned up. The
    currently called functions assume that the whole tree is rooted at s_root,
    and that anonymous dentries are not the roots of trees which results in
    dentries being left unculled.

    However, with the way NFS superblock sharing is currently set to be
    implemented, these assumptions are violated: the root of the filesystem is
    simply a dummy dentry and inode (the real inode for '/' may well be
    inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries
    with child trees.

    [*] Anonymous until discovered from another tree.

    (*) The documentation has been adjusted, including the additional bit of
    changing ext2_* into foo_* in the documentation.

    [akpm@osdl.org: convert ipath_fs, do other stuff]
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Cc: Nathan Scott
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
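
The get_sb() contract described in the last commit above (the filesystem fills in the vfsmount and returns an integer) can be modelled with a few stand-in types. Everything prefixed sk_ is hypothetical; the sketch leaves out locking and reference counting and only shows the default behaviour the message attributes to simple_set_mnt().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for this sketch; not the kernel structures. */
struct sk_dentry   { const char *name; };
struct sk_super    { struct sk_dentry *s_root; };
struct sk_vfsmount { struct sk_super *mnt_sb; struct sk_dentry *mnt_root; };

/*
 * Default behaviour as described: point the mount at the superblock and
 * use the superblock's s_root as the mount root. Always returns 0.
 */
int sk_simple_set_mnt(struct sk_vfsmount *mnt, struct sk_super *sb)
{
    assert(mnt != NULL && sb != NULL);
    mnt->mnt_sb = sb;
    mnt->mnt_root = sb->s_root;
    return 0;
}

/* Hypothetical filesystem get_sb(): sets up sb, then tail-calls the helper. */
int sk_get_sb(struct sk_super *sb, struct sk_vfsmount *mnt)
{
    return sk_simple_set_mnt(mnt, sb);
}
```

A filesystem that shares one superblock among several mount points would skip the helper and assign mnt_sb and mnt_root itself, as the message notes for NFS.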
     

29 Mar, 2006

1 commit

  • This is a conversion to make the various file_operations structs in fs/
    const. Basically a regexp job, with a few manual fixups.

    The goal is both to increase correctness (it is harder to accidentally write
    to shared data structures) and to reduce the false sharing of cachelines
    with things that get dirty in .data (while .rodata is nicely read-only and
    thus cache clean).

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

22 Mar, 2006

2 commits

  • Implementation of hugetlbfs_counter() is functionally equivalent to
    atomic_inc_return(). Use the simpler atomic form.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • These days, hugepages are demand-allocated at first fault time. There's a
    somewhat dubious (and racy) heuristic when making a new mmap() to check if
    there are enough available hugepages to fully satisfy that mapping.

    A particularly obvious case where the heuristic breaks down is where a
    process maps its hugepages not as a single chunk, but as a bunch of
    individually mmap()ed (or shmat()ed) blocks without touching and
    instantiating the pages in between allocations. In this case the size of
    each block is compared against the total number of available hugepages.
    It's thus easy for the process to become overcommitted, because each block
    mapping will succeed, although the total number of hugepages required by
    all blocks exceeds the number available. In particular, this defeats any
    program that detects a mapping failure and adjusts its hugepage usage
    downward accordingly.

    The patch below addresses this problem by strictly reserving a number of
    physical hugepages for hugepage inodes which have been mapped but not
    instantiated. MAP_SHARED mappings are thus "safe": they will fail on
    mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
    trigger an OOM. (Actually SHARED mappings can technically still OOM, but
    only if the sysadmin explicitly reduces the hugepage pool between mapping
    and instantiation.)

    This patch appears to address the problem at hand - it allows DB2 to start
    correctly, for instance, which previously suffered the failure described
    above.

    This patch causes no regressions on the libhugetlbfs testsuite, and makes a
    test (designed to catch this problem) pass which previously failed (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

02 Feb, 2006

1 commit

  • 2.6.15's hugepage faulting introduced huge_pages_needed accounting into
    hugetlbfs: to count how many pages are already in cache, for spot check on
    how far a new mapping may be allowed to extend the file. But it's muddled:
    each hugepage found covers HPAGE_SIZE, not PAGE_SIZE. Once pages were
    already in cache, it would overshoot, wrap its hugepages count backwards,
    and so fail a harmless repeat mapping with -ENOMEM. Fixes the problem
    found by Don Dupuis.

    Signed-off-by: Hugh Dickins
    Acked-By: Adam Litke
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy and nodes for
    that policy where they would prefer allocations.

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

12 Jan, 2006

1 commit


10 Jan, 2006

1 commit


07 Jan, 2006

1 commit

  • Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
    supported. This helps us to safely use hugetlb pages in many more
    applications. The patch makes the following changes. If needed, I also have
    it broken out according to the following paragraphs.

    1. Add a pair of functions to set/clear write access on huge ptes. The
    writable check in make_huge_pte is moved out to the caller for use by COW
    later.

    2. Hugetlb copy-on-write requires special case handling in the following
    situations:

    - copy_hugetlb_page_range() - Copied pages must be write protected so
    a COW fault will be triggered (if necessary) if those pages are written
    to.

    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
    page cache. MAP_PRIVATE pages still need to be locked however.

    3. Provide hugetlb_cow() and calls from hugetlb_fault() and
    hugetlb_no_page() which handles the COW fault by making the actual copy.

    4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps
    will be allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED
    mapping check.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

23 Nov, 2005

1 commit

  • Currently, if a hugetlbfs is mounted without limits (the default), statfs()
    will return -1 for max/free/used blocks. This does not appear to be in
    line with normal convention: simple_statfs() and shmem_statfs() both return
    0 in similar cases. Worse, it confuses the translation logic in
    put_compat_statfs(), causing it to return -EOVERFLOW on such a mount.

    This patch alters hugetlbfs_statfs() to return 0 for max/free/used blocks
    on a mount without limits. Note that we need the test in the patch below,
    rather than just using 0 in the sbinfo structure, because the -1 marked in
    the free blocks field is used internally to tell the

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
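
The statfs change above can be sketched as a small translation step: the internal "no limit" marker stays in the superblock info, and only the values reported to user space become 0. The struct layouts and sk_ names below are assumptions for illustration, and treating a negative free_blocks as the marker follows the commit text rather than the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mount info: -1 in free_blocks marks "no limit". */
struct sk_sbinfo {
    long max_blocks;
    long free_blocks;
};

/* Hypothetical subset of the statfs result. */
struct sk_statfs {
    long f_blocks;
    long f_bfree;
    long f_bavail;
};

/* Report 0 for an unlimited mount instead of leaking the internal -1. */
void sk_statfs_fill(const struct sk_sbinfo *sb, struct sk_statfs *buf)
{
    assert(buf != NULL);
    buf->f_blocks = 0;
    buf->f_bfree = 0;
    buf->f_bavail = 0;
    if (sb != NULL && sb->free_blocks >= 0) {
        buf->f_blocks = sb->max_blocks;
        buf->f_bfree = sb->free_blocks;
        buf->f_bavail = sb->free_blocks;
    }
}
```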
     

09 Nov, 2005

1 commit


30 Oct, 2005

1 commit

  • Basic overcommit checking for hugetlbfs_file_mmap() based on an implementation
    used with demand faulting in SLES9.

    Since demand faulting can't guarantee the availability of pages at mmap
    time, this patch implements a basic sanity check to ensure that the number
    of huge pages required to satisfy the mmap are currently available.
    Despite the obvious race, I think it is a good start on doing proper
    accounting. I'd like to work towards an accounting system that mimics the
    semantics of normal pages (especially for the MAP_PRIVATE/COW case). That
    work is underway and builds on what this patch starts.

    Huge page shared memory segments are simpler and still maintain their
    commit on shmget semantics.

    Signed-off-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke