15 Jan, 2006

1 commit

  • Anything that writes into a tmpfs filesystem is liable to disproportionately
    decrease the available memory on a particular node. Since there's no telling
    what sort of application (e.g. dd/cp/cat) might be dropping large files
    there, this lets the admin choose the appropriate default behavior for their
    site's situation.

    Introduce a tmpfs mount option which allows specifying a memory policy and
    a second option to specify the nodelist for that policy. With the default
    policy, tmpfs will behave as it does today. This patch adds support for
    preferred, bind, and interleave policies.

    The default policy will cause pages to be added to tmpfs files on the node
    which is doing the writing. Some jobs expect a single process to create
    and manage the tmpfs files. This results in a node which has a
    significantly reduced number of free pages.

    With this patch, the administrator can specify the policy, and the nodes for
    that policy, on which they would prefer allocations (see the mount sketch
    after this entry).

    This patch was originally written by Brent Casavant and Hugh Dickins. I
    added support for the bind and preferred policies and the mpol_nodelist
    mount option.

    Signed-off-by: Brent Casavant
    Signed-off-by: Hugh Dickins
    Signed-off-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
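
    A minimal usage sketch under stated assumptions: the mpol= spelling of the
    policy option, the /mnt/scratch mount point, and the node range are all
    illustrative here; the commit names the mpol_nodelist option but not the
    exact syntax of the policy option itself.

        /* Hypothetical: mount tmpfs with an interleave policy over nodes 0-3
         * so writes no longer pile pages onto the writing node alone. */
        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            if (mount("tmpfs", "/mnt/scratch", "tmpfs", 0,
                      "size=1g,mpol=interleave,mpol_nodelist=0-3") != 0) {
                perror("mount");   /* needs CAP_SYS_ADMIN and NUMA hardware */
                return 1;
            }
            return 0;
        }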

07 Jan, 2006

1 commit

  • Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
    supported. This helps us to safely use hugetlb pages in many more
    applications. The patch makes the following changes. If needed, I also have
    it broken out according to the following paragraphs.

    1. Add a pair of functions to set/clear write access on huge ptes. The
    writable check in make_huge_pte is moved out to the caller for use by COW
    later.

    2. Hugetlb copy-on-write requires special case handling in the following
    situations:

    - copy_hugetlb_page_range() - Copied pages must be write protected so
    a COW fault will be triggered (if necessary) if those pages are written
    to.

    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
    page cache. MAP_PRIVATE pages still need to be locked however.

    3. Provide hugetlb_cow(), called from hugetlb_fault() and
    hugetlb_no_page(), which handles the COW fault by making the actual copy.

    4. Remove the check in hugetlbfs_file_mmap() so that MAP_PRIVATE mmaps
    will be allowed. Make MAP_HUGETLB exempt from the deprecated VM_RESERVED
    mapping check. (A small userspace demonstration follows this entry.)

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
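
    A minimal demonstration sketch of the new MAP_PRIVATE semantics, assuming
    hugetlbfs is mounted at /mnt/huge and 2 MB huge pages are available; both
    the path and the page size are assumptions for illustration.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumed huge page size */

        int main(void)
        {
            int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0600);
            if (fd < 0) { perror("open"); return 1; }

            /* MAP_PRIVATE on a hugetlbfs file is what this patch enables. */
            char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }

            p[0] = 'A';
            if (fork() == 0) {
                p[0] = 'B';     /* COW fault: the child gets its own copy */
                _exit(0);
            }
            wait(NULL);
            printf("parent sees '%c'\n", p[0]);   /* still 'A' */
            return 0;
        }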

23 Nov, 2005

1 commit

  • Currently, if a hugetlbfs is mounted without limits (the default), statfs()
    will return -1 for max/free/used blocks. This does not appear to be in
    line with normal convention: simple_statfs() and shmem_statfs() both return
    0 in similar cases. Worse, it confuses the translation logic in
    put_compat_statfs(), causing it to return -EOVERFLOW on such a mount.

    This patch alters hugetlbfs_statfs() to return 0 for max/free/used blocks
    on a mount without limits. Note that we need the test in the patch below,
    rather than just using 0 in the sbinfo structure, because the -1 marked in
    the free blocks field is used internally to tell the rest of hugetlbfs
    that the mount has no limits. (A short userspace illustration follows this
    entry.)

    Signed-off-by: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
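
    A short illustration of what the fix means for userspace, assuming a
    hugetlbfs mount at /mnt/huge (the path is an assumption):

        #include <stdio.h>
        #include <sys/vfs.h>

        int main(void)
        {
            struct statfs sb;

            if (statfs("/mnt/huge", &sb) != 0) { perror("statfs"); return 1; }

            /* On an unlimited mount these printed -1 before the fix,
             * tripping -EOVERFLOW in put_compat_statfs(); now they read 0. */
            printf("blocks=%ld free=%ld\n", (long)sb.f_blocks, (long)sb.f_bfree);
            return 0;
        }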

30 Oct, 2005

7 commits

  • Basic overcommit checking for hugetlb_file_map() based on an implementation
    used with demand faulting in SLES9.

    Since demand faulting can't guarantee the availability of pages at mmap
    time, this patch implements a basic sanity check to ensure that the number
    of huge pages required to satisfy the mmap is currently available (a
    simplified model follows this entry). Despite the obvious race, I think it
    is a good start on doing proper accounting. I'd like to work towards an
    accounting system that mimics the semantics of normal pages (especially
    for the MAP_PRIVATE/COW case). That work is underway and builds on what
    this patch starts.

    Huge page shared memory segments are simpler and still maintain their
    commit on shmget semantics.

    Signed-off-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
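
    A simplified, self-contained model of the check; the counter, the helper
    name, and the 2 MB page size are stand-ins for the kernel internals, not
    the patch's literal code.

        #include <errno.h>
        #include <stdio.h>

        #define HPAGE_SHIFT 21                     /* assumed 2 MB huge pages */
        static unsigned long free_huge_pages = 4;  /* stand-in pool counter */

        /* Return 0 if enough huge pages exist right now to back an mmap of
         * 'len' bytes, -ENOMEM otherwise. Deliberately racy: pages are
         * counted, not reserved, exactly as the commit message concedes. */
        static int hugetlb_overcommit_check(unsigned long len)
        {
            unsigned long needed =
                (len + (1UL << HPAGE_SHIFT) - 1) >> HPAGE_SHIFT;
            return needed <= free_huge_pages ? 0 : -ENOMEM;
        }

        int main(void)
        {
            printf("8MB:  %d\n", hugetlb_overcommit_check(8UL << 20));  /* 0 */
            printf("16MB: %d\n", hugetlb_overcommit_check(16UL << 20)); /* -12 */
            return 0;
        }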
  • Below is a patch to implement demand faulting for huge pages. The main
    motivation for changing from prefaulting to demand faulting is so that huge
    page memory areas can be allocated according to NUMA policy.

    Thanks to consolidated hugetlb code, switching the behavior requires changing
    only one fault handler. The bulk of the patch just moves the logic from
    hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page() (a toy
    model of the fault-time lookup follows this entry).

    Signed-off-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
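
    A toy, self-contained model of the fault-time lookup described above; the
    structure and names are illustrative stand-ins, not the patch's code.

        #include <stdio.h>
        #include <stdlib.h>

        #define NPAGES 8

        struct huge_file { void *cache[NPAGES]; };  /* stand-in page cache */

        static void *fault_in_huge_page(struct huge_file *f, unsigned long idx)
        {
            if (!f->cache[idx]) {
                /* Prefaulting allocated every page at mmap() time; demand
                 * faulting defers allocation to first touch, so the page can
                 * follow the touching task's NUMA policy. */
                f->cache[idx] = calloc(1, 2UL << 20);  /* assumed 2 MB page */
                printf("faulted in page %lu\n", idx);
            }
            return f->cache[idx];
        }

        int main(void)
        {
            struct huge_file f = { { 0 } };
            fault_in_huge_page(&f, 3);   /* first touch allocates */
            fault_in_huge_page(&f, 3);   /* second touch hits the cache */
            return 0;
        }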
  • Reformat hugetlbfs_forget_inode and add the missing but harmless
    write_inode_now call. It looks the same as generic_forget_inode now except
    for the call to truncate_hugepages instead of truncate_inode_pages.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
  • hugetlbfs_do_delete_inode is the same as generic_delete_inode now, so remove
    it in favour of the latter.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Make hugetlbfs_delete_inode look the same as generic_delete_inode, fixing a
    bunch of missing updates to it at the same time. Rename it to
    hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that
    implements ->delete_inode.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode. This keeps
    the code simpler, fixes a leak where a failing inode allocation wouldn't
    decrement the counter, and moves hugetlbfs_delete_inode and
    hugetlbfs_forget_inode closer to their generic counterparts. (The pairing
    is modeled after this entry.)

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
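
    A self-contained model of why the pairing fixes the leak: the decrement
    lives in the one routine every successfully allocated inode must pass
    through on its way out. Names are stand-ins for the hugetlbfs internals.

        #include <stdio.h>
        #include <stdlib.h>

        static long inodes_in_flight;   /* stand-in for the sbinfo counter */

        static void *alloc_inode_model(void)
        {
            void *inode = malloc(128);
            if (inode)
                inodes_in_flight++;     /* count only inodes that exist */
            return inode;               /* a failed alloc never bumped it */
        }

        static void destroy_inode_model(void *inode)
        {
            inodes_in_flight--;         /* every live inode ends up here */
            free(inode);
        }

        int main(void)
        {
            void *i = alloc_inode_model();
            if (i)
                destroy_inode_model(i);
            printf("in flight: %ld\n", inodes_in_flight);   /* prints 0 */
            return 0;
        }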
  • Remove the page_table_lock from around the calls to unmap_vmas, and replace
    the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
    now safe to descend without page_table_lock.

    Don't attempt fancy locking for hugepages, just take page_table_lock in
    unmap_hugepage_range. This makes zap_hugepage_range, and the hugetlb test in
    zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
    does unmap_vmas have much use for its mm arg now. (The resulting per-table
    locking pattern is sketched after this entry.)

    The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
    page_table_lock: if they're implemented at all, they typically come down to
    flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
    (which we already audited for the mprotect case).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
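
    A sketch of the per-page-table locking shape this change moves to, not the
    patch's literal diff; the walk body is abbreviated and the surrounding
    variables (mm, pmd, addr, end) are assumed from context.

        /* Inside zap_pte_range(): lock just the page table being walked
         * via pte_offset_map_lock() instead of holding page_table_lock
         * around the whole unmap. */
        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        do {
            /* ... clear the pte and free the backing page ... */
        } while (pte++, addr += PAGE_SIZE, addr != end);
        pte_unmap_unlock(pte - 1, ptl);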

22 Jun, 2005

1 commit

  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change new areas were always
    searched from the base address on.

    So now new small areas are cluttering holes of all sizes
    throughout the whole mmap-able region, whereas before small requests
    tended to fill holes near the base, leaving holes far from the base
    large and available for larger requests.

    2) the free_area_cache is also set to the location of the last
    munmap-ed area, so in scenarios where we allocate e.g. five regions of
    1K each and then free regions 4, 2, 3 in that order, the next request
    for 1K will be placed in the position of the old region 3, whereas
    before we appended it to the still active region 1, placing it at the
    location of the old region 2. Before we had one free region of 2K; now
    we only get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache
    descriptor, cached_hole_size, that contains the largest known hole size
    below the current free_area_cache. If a new request comes in, its size is
    compared against the cached_hole_size and, if the request can be filled
    with a hole below free_area_cache, the search is started from the base
    instead (a simplified model of this decision follows this entry).

    The results look promising: whereas 2.6.12-rc4 fragments quickly (my
    earlier-posted leakme.c test program terminates after 50000+ iterations
    with 96 distinct and fragmented maps in /proc/self/maps), it performs
    nicely (as expected) with thread creation: Ingo's test_str02 with 20000
    threads requires 0.7s system time.

    Taking out Ingo's patch (un-patch available per request) by basically
    deleting all mentions of free_area_cache from the kernel and starting the
    search for new memory always at the respective bases, we observe: leakme
    terminates successfully with 11 distinct, hardly fragmented areas in
    /proc/self/maps, but thread creation is grindingly slow: 30+s(!) system
    time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) the appended patch works fine with leakme: it ends with
    only 7 distinct areas in /proc/self/maps and also thread creation seems
    sufficiently fast with 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
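
    A simplified, self-contained model of the search-start decision; the
    constants and names here are illustrative, not the kernel's
    arch_get_unmapped_area().

        #include <stdio.h>

        #define TASK_UNMAPPED_BASE 0x10000000UL

        static unsigned long free_area_cache = 0x40000000UL; /* resume point */
        static unsigned long cached_hole_size; /* largest hole seen below it */

        static unsigned long pick_search_start(unsigned long len)
        {
            if (len <= cached_hole_size) {
                /* A big-enough hole may exist near the base: restart there
                 * and let the walk rebuild the hole-size cache. */
                cached_hole_size = 0;
                return TASK_UNMAPPED_BASE;
            }
            /* Fast path: continue where the last search left off. */
            return free_area_cache;
        }

        int main(void)
        {
            cached_hole_size = 1UL << 20;   /* pretend a 1 MB hole was seen */
            printf("64K request starts at %#lx\n",
                   pick_search_start(64UL << 10));   /* from the base */
            printf("4M request starts at %#lx\n",
                   pick_search_start(4UL << 20));    /* from the cache */
            return 0;
        }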

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds