22 Jun, 2005

2 commits

  • The topdown changes in 2.6.12-rc1 can cause large allocations with a large
    stack limit to fail, despite there being space available. The address
    mmap_base - len is only valid when len <= mmap_base, yet nothing in the
    topdown allocator checks this; it is only (now) caught at a higher level,
    which causes the allocation to simply fail. The following change restores
    the fallback to the bottom-up path, which allows large allocations with a
    large stack limit to potentially still succeed.

    Signed-off-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
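
    A minimal, self-contained sketch of the fallback described in the entry
    above, using a toy page bitmap rather than the kernel's real vma
    structures; the names (topdown_alloc, bottomup_alloc, get_area,
    MMAP_BASE) and sizes are illustrative assumptions, not the actual
    mm/mmap.c code:

        #include <stdio.h>
        #include <string.h>

        /* Toy page-granular address space.  used[i] != 0 means page i is
         * taken.  MMAP_BASE is low here, as it would be with a very large
         * stack limit under the 2.6.12-rc1 topdown layout. */
        #define NPAGES    256
        #define MMAP_BASE  64       /* topdown search may not go above this */

        static unsigned char used[NPAGES];

        /* Find len free pages ending at or below MMAP_BASE, scanning down. */
        static int topdown_alloc(int len)
        {
            for (int end = MMAP_BASE; end - len >= 0; end--) {
                int ok = 1;
                for (int i = end - len; i < end; i++)
                    if (used[i]) { ok = 0; break; }
                if (ok) {
                    memset(used + end - len, 1, len);
                    return end - len;
                }
            }
            return -1;              /* no room below MMAP_BASE */
        }

        /* Find len free pages anywhere, scanning up from the bottom. */
        static int bottomup_alloc(int len)
        {
            for (int start = 0; start + len <= NPAGES; start++) {
                int ok = 1;
                for (int i = start; i < start + len; i++)
                    if (used[i]) { ok = 0; break; }
                if (ok) {
                    memset(used + start, 1, len);
                    return start;
                }
            }
            return -1;
        }

        static int get_area(int len)
        {
            int addr = topdown_alloc(len);
            if (addr < 0)           /* the restored fallback path */
                addr = bottomup_alloc(len);
            return addr;
        }

        int main(void)
        {
            /* 128 pages cannot fit below MMAP_BASE (64 pages), but the
             * fallback still finds room starting from the base. */
            printf("allocated at page %d\n", get_area(128));
            return 0;
        }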
     
  • Ingo recently introduced a great speedup for allocating new mmaps using the
    free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
    causes huge performance increases in thread creation.

    The downside of this patch is that it does lead to fragmentation in the
    mmap-ed areas (visible via /proc/self/maps), such that some applications
    that work fine under 2.4 kernels quickly run out of memory on any 2.6
    kernel.

    The problem is twofold:

    1) the free_area_cache is used to continue a search for memory where
    the last search ended. Before the change, new areas were always
    searched for starting from the base address.

    So now new small areas clutter holes of all sizes throughout the whole
    mmap-able region, whereas before, small areas tended to fill the holes
    near the base, leaving the holes far from the base large and available
    for larger requests.

    2) the free_area_cache is also set to the location of the last
    munmap-ed area, so in a scenario where we allocate, say, five regions
    of 1K each and then free regions 4, 2, 3 in that order, the next
    request for 1K will be placed at the position of the old region 3,
    whereas before it was appended to the still-active region 1 and placed
    at the location of the old region 2. Before, we had one free region of
    2K; now we get two free regions of 1K -> fragmentation.

    The patch addresses these issues by introducing yet another cache
    descriptor, cached_hole_size, which records the largest known hole size
    below the current free_area_cache. When a new request comes in, its size
    is compared against cached_hole_size; if the request could be satisfied
    by a hole below free_area_cache, the search is started from the base
    instead.

    The results look promising: 2.6.12-rc4 fragments quickly, and my (earlier
    posted) leakme.c test program terminates after 50000+ iterations with 96
    distinct and fragmented maps in /proc/self/maps, yet it performs nicely
    (as expected) with thread creation: Ingo's test_str02 with 20000 threads
    requires 0.7s of system time.

    Taking out Ingo's patch (un-patch available on request) by basically
    deleting all mentions of free_area_cache from the kernel and always
    starting the search for new memory at the respective bases, we observe:
    leakme terminates successfully with 11 distinct, hardly fragmented areas
    in /proc/self/maps, but thread creation is grindingly slow: 30+s(!) of
    system time for Ingo's test_str02 with 20000 threads.

    Now - drumroll ;-) - the appended patch works fine with leakme: it ends
    with only 7 distinct areas in /proc/self/maps, and thread creation also
    remains sufficiently fast at 0.71s for 20000 threads.

    Signed-off-by: Wolfgang Wander
    Credit-to: "Richard Purdie"
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar (partly)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wolfgang Wander
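
    A toy, self-contained sketch of the cached_hole_size decision described
    in the entry above; the structures and names (struct region,
    struct toy_mm, toy_get_unmapped_area) are illustrative assumptions, not
    the kernel's vma-based arch_get_unmapped_area:

        struct region { unsigned long start, end; struct region *next; };

        struct toy_mm {
            unsigned long base, top;         /* bounds of the mmap-able area */
            unsigned long free_area_cache;   /* where the last search ended  */
            unsigned long cached_hole_size;  /* largest known hole below it  */
            struct region *regions;          /* sorted list of taken regions */
        };

        static unsigned long toy_get_unmapped_area(struct toy_mm *mm,
                                                   unsigned long len)
        {
            unsigned long addr;

            /* A hole big enough is known to exist below the cache: restart
             * from the base so the request fills a low hole instead of
             * fragmenting the space above free_area_cache.  Otherwise
             * resume where the previous search left off. */
            if (len <= mm->cached_hole_size) {
                mm->cached_hole_size = 0;
                addr = mm->base;
            } else {
                addr = mm->free_area_cache;
            }

            for (struct region *r = mm->regions; ; r = r->next) {
                unsigned long hole_end = r ? r->start : mm->top;

                if (r && r->end <= addr)
                    continue;                /* region entirely below addr */
                if (addr + len <= hole_end) {
                    mm->free_area_cache = addr + len;
                    return addr;             /* request fits in this hole  */
                }
                /* Hole too small: remember the biggest one skipped over. */
                if (hole_end > addr && hole_end - addr > mm->cached_hole_size)
                    mm->cached_hole_size = hole_end - addr;
                if (!r)
                    return 0;                /* ran out of address space   */
                addr = r->end;               /* continue above this region */
            }
        }

    A fresh toy_mm would start with free_area_cache equal to base and
    cached_hole_size zero; the real patch also adjusts both fields on munmap,
    which this sketch leaves out.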
     

20 May, 2005

1 commit


19 May, 2005

1 commit

  • Prevent the topdown allocator from allocating mmap areas all the way
    down to address zero.

    We still allow a MAP_FIXED mapping of page 0 (needed for various things,
    ranging from Wine and DOSEMU to people who want to allow speculative
    loads off a NULL pointer).

    Tested by Chris Wright.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
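
    A sketch of just the new lower bound, assuming a 4K page size; the
    function name topdown_candidate is illustrative, not the kernel's
    arch_get_unmapped_area_topdown:

        #define PAGE_SIZE 4096UL

        /* Highest acceptable candidate at or below `limit` for a request of
         * `len` bytes, or 0 on failure.  The floor at PAGE_SIZE is the point
         * of the change: the automatic top-down search never hands out page
         * 0; an explicit MAP_FIXED request at address 0 bypasses this search
         * and is still honoured. */
        static unsigned long topdown_candidate(unsigned long limit,
                                               unsigned long len)
        {
            if (limit < len || limit - len < PAGE_SIZE)
                return 0;            /* would wrap, or would map page 0 */
            return limit - len;      /* lowest byte stays >= PAGE_SIZE  */
        }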
     

01 May, 2005

2 commits

  • Always use page counts when doing RLIMIT_MEMLOCK checking to avoid possible
    overflow.

    Signed-off-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
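
    An illustrative form of a page-unit check (assumed names and a 4K page
    size, not the kernel's actual code): converting a large page count to
    bytes can overflow an unsigned long on 32-bit, so both sides of the
    comparison stay in pages.

        #define PAGE_SHIFT 12

        static int memlock_ok(unsigned long locked_vm_pages, /* already locked */
                              unsigned long request_bytes,   /* new request    */
                              unsigned long rlim_memlock_bytes)
        {
            /* A real check would round the request up to whole pages. */
            unsigned long locked = locked_vm_pages +
                                   (request_bytes >> PAGE_SHIFT);
            unsigned long limit  = rlim_memlock_bytes >> PAGE_SHIFT;

            return locked <= limit;  /* compared in pages, far from wrap */
        }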
     
  • Address bug #4508: there's potential for wraparound in the various places
    where we perform RLIMIT_AS checking.

    (I'm a bit worried about acct_stack_growth(). Are we sure that vma->vm_mm is
    always equal to current->mm? If not, then we're comparing some other
    process's total_vm with the calling process's rlimits).

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
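
    A hedged sketch of one overflow-safe form of the check (assumed names and
    page size; not the kernel's code): doing the comparison in page units
    keeps both sides far below ULONG_MAX, whereas adding a byte length to a
    byte total can wrap around and slip past the limit.

        #define PAGE_SHIFT 12

        static int within_as_limit(unsigned long total_vm_pages,
                                   unsigned long grow_pages,
                                   unsigned long rlim_as_bytes)
        {
            unsigned long limit_pages = rlim_as_bytes >> PAGE_SHIFT;

            /* A fully wrap-proof variant also rearranges the test:
             * grow_pages <= limit_pages &&
             * total_vm_pages <= limit_pages - grow_pages. */
            return total_vm_pages + grow_pages <= limit_pages;
        }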
     

20 Apr, 2005

5 commits

  • Once all the MMU architectures define FIRST_USER_ADDRESS, remove hack from
    mmap.c which derived it from FIRST_USER_PGD_NR.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
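
    The removed interim hack converted the first user pgd index into an
    address; it was of roughly this shape (a reconstruction, not a verbatim
    quote of mmap.c):

        #ifndef FIRST_USER_ADDRESS
        #define FIRST_USER_ADDRESS   (FIRST_USER_PGD_NR * PGDIR_SIZE)
        #endif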
     
  • The patches to free_pgtables by vma left problems on any architectures which
    leave some user address page table entries unencapsulated by vma. Andi has
    fixed the 32-bit vDSO on x86_64 to use a vma. Now fix arm (and arm26), whose
    first PAGE_SIZE is reserved (perhaps) for machine vectors.

    Our calls to free_pgtables must not touch that area, and exit_mmap's
    BUG_ON(nr_ptes) must allow that arm's get_pgd_slow may (or may not) have
    allocated an extra page table, which its free_pgd_slow would free later.

    FIRST_USER_PGD_NR has misled me and others: until all the arches define
    FIRST_USER_ADDRESS instead, a hack in mmap.c derives one from t'other.
    This patch fixes the bugs; the remaining patches just clean it up.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While dabbling here in mmap.c, clean up mysterious "mpnt"s to "vma"s.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's only one usage of MM_VM_SIZE(mm) left, and it's a troublesome macro
    because mm doesn't contain the (32-bit emulation?) info needed. But it too is
    only needed because we ignore the end from the vma list.

    We could make flush_pgtables return that end, or unmap_vmas. Choose the
    latter, since it's a natural fit with unmap_mapping_range_vma needing to
    know its restart addr. This does make more than a minimal change, but if
    unmap_vmas had returned the end before, this is how we'd have done it,
    rather than storing the break_addr in zap_details.

    unmap_vmas used to return the count of vmas scanned, but that's just
    debug info which hasn't been useful in a while; and if we want the
    map_count-is-0-on-exit check back, it can easily come from the final
    remove_vm_struct loop.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    Recent woes with some arches needing their own pgd_addr_end macro; the
    4-level clear_page_range regression since 2.6.10's clear_page_tables; and
    clear_page_range's long-standing, well-known inefficiency in searching
    throughout the higher-level page tables for those few entries to clear
    and free: all can be blamed on ignoring the list of vmas when we free
    page tables.

    Replace exit_mmap's clear_page_range of the total user address space by
    free_pgtables operating on the mm's vma list; unmap_region uses it in the
    same way, giving a floor and ceiling beyond which it may not free tables.
    This brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt
    is enabled, in which case latency fixes spoil unmap_vmas throughput).

    Beware: the do_mmap_pgoff driver failure case must now use unmap_region
    instead of zap_page_range, since a page table might have been allocated, and
    can only be freed while it is touched by some vma.

    Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
    from the clear_page_range levels. (Most of free_pgtables' old code was
    actually for a non-existent case, prev not properly set up, dating from before
    hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since
    we might want to add latency lockdrops later; but no attempt is made to
    do so yet - going by vma should itself reduce latency.

    But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
    examination: put that off until a later patch of the series.

    What of x86_64's 32bit vdso page, which __map_syscall32 maps outside any
    vma?

    And the range to sparc64's flush_tlb_pgtables? It's less clear to me now
    that we need to do more than is done here - every PMD_SIZE ever occupied
    will be flushed; do we really have to flush every PGDIR_SIZE ever
    partially occupied? A shame to complicate it unnecessarily.

    Special thanks to David Miller for time spent repairing my ceilings.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
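
    A toy sketch of the vma-driven walk described in the entry above, with a
    printf stand-in for the real table-freeing step; the names (struct
    toy_vma, toy_free_pgtables, free_pgd_range_stub) are illustrative
    assumptions, not the kernel's code:

        #include <stdio.h>

        struct toy_vma { unsigned long start, end; struct toy_vma *next; };

        /* Stand-in for freeing the page tables backing [addr, end): here it
         * only reports the range and the floor/ceiling bounds it would be
         * required to respect. */
        static void free_pgd_range_stub(unsigned long addr, unsigned long end,
                                        unsigned long floor,
                                        unsigned long ceiling)
        {
            printf("free tables for [%#lx, %#lx), floor %#lx, ceiling %#lx\n",
                   addr, end, floor, ceiling);
        }

        /* The walk is driven by the vma list, so only the few higher-level
         * entries that actually back mappings are visited; floor and ceiling
         * bound what may be freed, keeping the walk away from ranges it must
         * not touch (e.g. arm's reserved first page, or a neighbouring vma
         * outside the region being unmapped). */
        static void toy_free_pgtables(struct toy_vma *vma,
                                      unsigned long floor,
                                      unsigned long ceiling)
        {
            while (vma) {
                struct toy_vma *next = vma->next;
                free_pgd_range_stub(vma->start, vma->end,
                                    floor, next ? next->start : ceiling);
                vma = next;
            }
        }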
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds