01 May, 2005

14 commits

  • This patch changes calls to synchronize_kernel(), deprecated in the earlier
    "Deprecate synchronize_kernel, GPL replacement" patch, to instead call the
    new synchronize_rcu() and synchronize_sched() APIs.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
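
    A minimal sketch of the updated write-side pattern, assuming a hypothetical
    RCU-protected pointer (struct foo, global_foo and update_foo are
    illustrative, not from the patch):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            int data;
    };
    static struct foo *global_foo;          /* readers use rcu_read_lock() */

    static void update_foo(struct foo *new)
    {
            struct foo *old = global_foo;

            rcu_assign_pointer(global_foo, new);
            synchronize_rcu();      /* where synchronize_kernel() used to be */
            kfree(old);             /* no reader can still see 'old' */
    }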
     
  • Remove PAGE_BUG - replace it with BUG and BUG_ON.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Replace a number of memory barriers with smp_ variants. This means we won't
    take the unnecessary hit on UP machines.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
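
    A small illustrative producer/consumer pair showing the point (the flag and
    data variables are made up for the example; on UP builds the smp_ variants
    compile down to a compiler barrier):

    #include <linux/kernel.h>
    #include <asm/system.h>         /* mb(), smp_wmb(), smp_rmb() in this era */

    static int data;
    static int flag;

    static void producer(int value)
    {
            data = value;
            smp_wmb();              /* order data before flag on SMP only;
                                       a plain wmb() would also cost on UP */
            flag = 1;
    }

    static void consumer(void)
    {
            if (flag) {
                    smp_rmb();      /* pairs with the smp_wmb() above */
                    printk(KERN_DEBUG "data=%d\n", data);
            }
    }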
     
  • The patch makes the following function calls available to allocate memory
    on a specific node without changing the basic operation of the slab
    allocator:

    kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node);
    kmalloc_node(size_t size, unsigned int flags, int node);

    in a similar way to the existing node-blind functions:

    kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags);
    kmalloc(size, flags);

    kmem_cache_alloc_node was changed to pass flags and the node information
    through the existing layers of the slab allocator (which led to some minor
    rearrangements). The functions at the lowest layer (kmem_getpages,
    cache_grow) are already node aware. Also __alloc_percpu can call
    kmalloc_node now.

    Performance measurements (using the pageset localization patch) yield:

    w/o patches:
    Tasks   jobs/min  jti  jobs/min/task    real      cpu
        1     484.27  100       484.2736   12.02     1.97  Wed Mar 30 20:50:43 2005
      100   25170.83   91       251.7083   23.12   150.10  Wed Mar 30 20:51:06 2005
      200   34601.66   84       173.0083   33.64   294.14  Wed Mar 30 20:51:40 2005
      300   37154.47   86       123.8482   46.99   436.56  Wed Mar 30 20:52:28 2005
      400   39839.82   80        99.5995   58.43   580.46  Wed Mar 30 20:53:27 2005
      500   40036.32   79        80.0726   72.68   728.60  Wed Mar 30 20:54:40 2005
      600   44074.21   79        73.4570   79.23   872.10  Wed Mar 30 20:55:59 2005
      700   44016.60   78        62.8809   92.56  1015.84  Wed Mar 30 20:57:32 2005
      800   40411.05   80        50.5138  115.22  1161.13  Wed Mar 30 20:59:28 2005
      900   42298.56   79        46.9984  123.83  1303.42  Wed Mar 30 21:01:33 2005
     1000   40955.05   80        40.9551  142.11  1441.92  Wed Mar 30 21:03:55 2005

    with pageset localization and slab API patches:
    Tasks   jobs/min  jti  jobs/min/task    real      cpu
        1     484.19  100       484.1930   12.02     1.98  Wed Mar 30 21:10:18 2005
      100   27428.25   92       274.2825   21.22   149.79  Wed Mar 30 21:10:40 2005
      200   37228.94   86       186.1447   31.27   293.49  Wed Mar 30 21:11:12 2005
      300   41725.42   85       139.0847   41.84   434.10  Wed Mar 30 21:11:54 2005
      400   43032.22   82       107.5805   54.10   582.06  Wed Mar 30 21:12:48 2005
      500   42211.23   83        84.4225   68.94   722.61  Wed Mar 30 21:13:58 2005
      600   40084.49   82        66.8075   87.12   873.11  Wed Mar 30 21:15:25 2005
      700   44169.30   79        63.0990   92.24  1008.77  Wed Mar 30 21:16:58 2005
      800   43097.94   79        53.8724  108.03  1155.88  Wed Mar 30 21:18:47 2005
      900   41846.75   79        46.4964  125.17  1303.38  Wed Mar 30 21:20:52 2005
     1000   40247.85   79        40.2478  144.60  1442.21  Wed Mar 30 21:23:17 2005

    Signed-off-by: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
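
    A brief usage sketch of the new node-aware calls (buffer size, cache and
    helper names are illustrative only):

    #include <linux/slab.h>

    /* Place a scratch buffer on the node whose CPUs will touch it most,
       instead of wherever the calling CPU happens to be running. */
    static void *alloc_scratch_on(int node)
    {
            return kmalloc_node(1024, GFP_KERNEL, node);    /* kfree() as usual */
    }

    /* The slab-cache variant works the same way: */
    static void *alloc_obj_on(kmem_cache_t *cachep, int node)
    {
            return kmem_cache_alloc_node(cachep, GFP_KERNEL, node);
    }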
     
  • The smp_mb() is needed because sync_page() doesn't hold PG_locked while it
    accesses page_mapping(page). The comment in the patch (the entire patch is
    the addition of this comment) tries to explain further how and why smp_mb()
    is used.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
     
  • Always use page counts when doing RLIMIT_MEMLOCK checking to avoid possible
    overflow.

    Signed-off-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
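
    A sketch of a page-based check, loosely modelled on the mlock path of this
    era (the helper name is made up and the field names should be treated as
    illustrative):

    #include <linux/mm.h>
    #include <linux/sched.h>

    static int memlock_would_exceed(struct mm_struct *mm, size_t len)
    {
            unsigned long locked, limit;

            /* Compare page counts: adding a byte length to a byte total could
               wrap an unsigned long; page counts cannot realistically do so. */
            locked = mm->locked_vm + (PAGE_ALIGN(len) >> PAGE_SHIFT);
            limit  = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;

            return locked > limit;
    }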
     
  • This patch counts the number of pages used for bounce buffers and shows the
    count in /proc/vmstat.

    Currently, the number of bounce pages is not counted anywhere, so when many
    bounce pages are in use they can look like leaked pages, and it is hard for
    a user to gauge how heavily bounce buffers are being used. It is therefore
    useful to show the number of bounce pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Use the new __GFP_NOMEMALLOC to simplify the previous handling of
    PF_MEMALLOC.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
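
    Roughly what a caller looks like with the new flag (the helper is
    hypothetical; mempool's allocator is the real user in this series):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Allocate on a memory-freeing path, but never dip into the emergency
       reserves even if the task happens to be running with PF_MEMALLOC. */
    static struct page *alloc_cleanup_page(void)
    {
            return alloc_page(GFP_NOIO | __GFP_NOMEMALLOC);
    }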
     
  • Mempool is pretty clever. Looks too clever for its own good :) It
    shouldn't really know so much about page reclaim internals.

    - don't guess about what effective page reclaim might involve.

    - don't randomly flush out all dirty data if some unlikely thing
    happens (alloc returns NULL). page reclaim can (sort of :P) handle
    it.

    I think the main motivation is trying to avoid pool->lock at all costs.
    However the first allocation is attempted with __GFP_WAIT cleared, so it
    will be 'can_try_harder' if it hits the page allocator. So if allocation
    still fails, then we can probably afford to hit the pool->lock - and what's
    the alternative? Try page reclaim and hit zone->lru_lock?

    A nice upshot is that we don't need to do any fancy memory barriers or do
    (intentionally) racy access to pool-> fields outside the lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Mempools have 2 problems.

    The first is that mempool_alloc can possibly get stuck in __alloc_pages when
    it should instead fail and take an element from its reserved pool.

    The second is that it will happily eat the emergency PF_MEMALLOC reserves
    instead of going to its reserved pool.

    Fix the first by passing __GFP_NORETRY in the allocation calls in
    mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag which
    directs the page allocator not to allocate from the reserve pool.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
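
    For context, a minimal sketch of how a driver of this era sets up and uses
    a mempool that benefits from these fixes (cache name, element size and
    reserve count are made up):

    #include <linux/mempool.h>
    #include <linux/slab.h>

    static kmem_cache_t *io_cache;
    static mempool_t *io_pool;

    static int io_pool_init(void)
    {
            io_cache = kmem_cache_create("io_cache_demo", 256, 0, 0, NULL, NULL);
            if (!io_cache)
                    return -ENOMEM;
            /* Keep 4 elements in reserve for when the page allocator fails. */
            io_pool = mempool_create(4, mempool_alloc_slab, mempool_free_slab,
                                     io_cache);
            return io_pool ? 0 : -ENOMEM;
    }

    static void *io_get(void)
    {
            /* With the fixes above, a failed GFP_NOIO attempt falls back to
               the reserved elements promptly instead of retrying forever or
               eating the PF_MEMALLOC emergency reserves. */
            return mempool_alloc(io_pool, GFP_NOIO);
    }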
     
  • Jack Steiner reported that this fixed his problem (bad colouring):
    "The patches fix both problems that I found - bad
    coloring & excessive pages in pagesets."

    In most workloads this is not likely to be such a pronounced problem,
    however it should help corner cases. And avoiding powers of 2 in these
    types of memory operations is always a good idea.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • mm/rmap.c:page_referenced_one() and mm/rmap.c:try_to_unmap_one() contain
    identical code that

    - takes mm->page_table_lock;

    - drills through page tables;

    - checks that correct pte is reached.

    Coalesce this into page_check_address()

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
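
    A rough sketch of what the shared helper factors out; the real
    page_check_address() in mm/rmap.c differs in detail, so take this as an
    outline of the walk rather than its exact code:

    #include <linux/mm.h>
    #include <asm/pgtable.h>

    static pte_t *check_address_sketch(struct page *page, struct mm_struct *mm,
                                       unsigned long address)
    {
            pgd_t *pgd;
            pud_t *pud;
            pmd_t *pmd;
            pte_t *pte;

            spin_lock(&mm->page_table_lock);
            pgd = pgd_offset(mm, address);
            if (!pgd_present(*pgd))
                    goto out;
            pud = pud_offset(pgd, address);
            if (!pud_present(*pud))
                    goto out;
            pmd = pmd_offset(pud, address);
            if (!pmd_present(*pmd))
                    goto out;
            pte = pte_offset_map(pmd, address);
            /* Check that we really reached the pte mapping 'page'. */
            if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte))
                    return pte;     /* caller does pte_unmap() and unlocks */
            pte_unmap(pte);
    out:
            spin_unlock(&mm->page_table_lock);
            return NULL;
    }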
     
  • Address bug #4508: there's potential for wraparound in the various places
    where we perform RLIMIT_AS checking.

    (I'm a bit worried about acct_stack_growth(). Are we sure that vma->vm_mm is
    always equal to current->mm? If not, then we're comparing some other
    process's total_vm with the calling process's rlimits).

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • Anton Altaparmakov points out:

    - It calls fault_in_pages_readable(), which is completely bogus if @nr_segs >
    1. It needs to be replaced by a to-be-written
    "fault_in_pages_readable_iovec()".

    - It increments @buf even in the iovec case thus @buf can point to random
    memory really quickly (in the iovec case) and then it calls
    fault_in_pages_readable() on this random memory.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
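
    A purely hypothetical sketch of the "fault_in_pages_readable_iovec()" that
    the note says still needs to be written; it simply walks the segments
    instead of assuming one contiguous user buffer:

    #include <linux/kernel.h>
    #include <linux/pagemap.h>
    #include <linux/uio.h>

    static void fault_in_pages_readable_iovec(const struct iovec *iov,
                                              size_t bytes,
                                              unsigned long nr_segs)
    {
            unsigned long seg;

            for (seg = 0; seg < nr_segs && bytes; seg++) {
                    size_t len = min(bytes, (size_t)iov[seg].iov_len);

                    fault_in_pages_readable(iov[seg].iov_base, len);
                    bytes -= len;
            }
    }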
     

25 Apr, 2005

1 commit

  • zonelist_policy() forgot to mask non-zone bits from gfp when comparing
    zone number with policy_zone.

    ACKed-by: Andi Kleen
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
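
    Illustratively, the comparison needs to look at only the zone bits of the
    gfp mask; the exact code in zonelist_policy() differs, and policy_zone is
    really a global in mm/mempolicy.c:

    #include <linux/gfp.h>

    static int policy_applies(unsigned int gfp, int policy_zone)
    {
            /* Without the mask, flag bits such as __GFP_WAIT make the gfp
               value look like an absurdly high zone number. */
            return (gfp & GFP_ZONEMASK) >= policy_zone;
    }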
     

20 Apr, 2005

7 commits

  • Once all the MMU architectures define FIRST_USER_ADDRESS, remove hack from
    mmap.c which derived it from FIRST_USER_PGD_NR.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove use of FIRST_USER_PGD_NR from sys_mincore: it's inconsistent (no other
    syscall refers to it), unnecessary (sys_mincore loops over vmas further down)
    and incorrect (misses user addresses in ARM's first pgd).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The patches to free_pgtables by vma left problems on any architectures which
    leave some user address page table entries unencapsulated by vma. Andi has
    fixed the 32-bit vDSO on x86_64 to use a vma. Now fix arm (and arm26), whose
    first PAGE_SIZE is reserved (perhaps) for machine vectors.

    Our calls to free_pgtables must not touch that area, and exit_mmap's
    BUG_ON(nr_ptes) must allow that arm's get_pgd_slow may (or may not) have
    allocated an extra page table, which its free_pgd_slow would free later.

    FIRST_USER_PGD_NR has misled me and others: until all the arches define
    FIRST_USER_ADDRESS instead, a hack in mmap.c derives one from t'other. This
    patch fixes the bugs; the remaining patches just clean it up.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While dabbling here in mmap.c, clean up mysterious "mpnt"s to "vma"s.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ia64 and ppc64 had hugetlb_free_pgtables functions which were no longer being
    called, and it wasn't obvious what to do about them.

    The ppc64 case turns out to be easy: the associated tables are noted elsewhere
    and freed later, safe to either skip its hugetlb areas or go through the
    motions of freeing nothing. Since ia64 does need a special case, restore to
    ppc64 the special case of skipping them.

    The ia64 hugetlb case has been broken since pgd_addr_end went in, though it
    probably appeared to work okay if you just had one such area; in fact it's
    been broken much longer if you consider a long munmap spanning from another
    region into the hugetlb region.

    In the ia64 hugetlb region, more virtual address bits are available than in
    the other regions, yet the page tables are structured the same way: the page
    at the bottom is larger. Here we need to scale down each addr before passing
    it to the standard free_pgd_range. Was about to write a hugely_scaled_down
    macro, but found htlbpage_to_page already exists for just this purpose. Fixed
    off-by-one in ia64 is_hugepage_only_range.

    Uninline free_pgd_range to make it available to ia64. Make sure the
    vma-gathering loop in free_pgtables cannot join a hugepage_only_range to any
    other (safe to join huges? probably but don't bother).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's only one usage of MM_VM_SIZE(mm) left, and it's a troublesome macro
    because mm doesn't contain the (32-bit emulation?) info needed. But it too is
    only needed because we ignore the end from the vma list.

    We could make flush_pgtables return that end, or unmap_vmas. Choose the
    latter, since it's a natural fit with unmap_mapping_range_vma needing to know
    its restart addr. This does make more than minimal change, but if unmap_vmas
    had returned the end before, this is how we'd have done it, rather than
    storing the break_addr in zap_details.

    unmap_vmas used to return count of vmas scanned, but that's just debug which
    hasn't been useful in a while; and if we want the map_count 0 on exit check
    back, it can easily come from the final remove_vm_struct loop.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
    clear_page_range regression since 2.6.10's clear_page_tables; and its
    long-standing well-known inefficiency in searching throughout the higher-level
    page tables for those few entries to clear and free: all can be blamed on
    ignoring the list of vmas when we free page tables.

    Replace exit_mmap's clear_page_range of the total user address space by
    free_pgtables operating on the mm's vma list; unmap_region uses it in the same
    way, giving floor and ceiling beyond which it may not free tables. This
    brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
    in which case latency fixes spoil unmap_vmas throughput).

    Beware: the do_mmap_pgoff driver failure case must now use unmap_region
    instead of zap_page_range, since a page table might have been allocated, and
    can only be freed while it is touched by some vma.

    Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
    from the clear_page_range levels. (Most of free_pgtables' old code was
    actually for a non-existent case, prev not properly set up, dating from before
    hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
    might want to add latency lockdrops later; but no attempt to do so yet, going
    by vma should itself reduce latency.

    But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
    examination: put that off until a later patch of the series.

    What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?

    And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
    we need to do more than is done here - every PMD_SIZE ever occupied will be
    flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
    A shame to complicate it unnecessarily.

    Special thanks to David Miller for time spent repairing my ceilings.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
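
    A sketch of the bounded call described above (signature as described in the
    text; the real mm/mmap.c differs in detail). The floor comes from the
    previous vma's end and the ceiling from the next vma's start, so page
    tables still covered by a neighbour are never freed; 0 stands in for "no
    ceiling":

    #include <linux/mm.h>

    static void unmap_region_tail_sketch(struct mmu_gather **tlb,
                                         struct vm_area_struct *vma,
                                         struct vm_area_struct *prev,
                                         struct vm_area_struct *next)
    {
            free_pgtables(tlb, vma,
                          prev ? prev->vm_end : FIRST_USER_ADDRESS,
                          next ? next->vm_start : 0);
    }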
     

17 Apr, 2005

4 commits

  • We only call pageout() for dirty pages, so this test is redundant.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so
    make the magical value of -17 in /proc/<pid>/oom_adj defeat the oom-killer
    altogether.

    (akpm: we still need to document oom_adj and friends in
    Documentation/filesystems/proc.txt!)

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We will return NULL from filemap_getpage when a page does not exist in the
    page cache and MAP_NONBLOCK is specified, here:

    page = find_get_page(mapping, pgoff);
    if (!page) {
            if (nonblock)
                    return NULL;
            goto no_cached_page;
    }

    But we forget to do so when the page in the cache is not uptodate. The
    following could result in a blocking call:

    /*
     * Ok, found a page in the page cache, now we need to check
     * that it's up-to-date.
     */
    if (!PageUptodate(page))
            goto page_not_uptodate;

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
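
    One plausible shape of the fix, following the style of the snippets quoted
    above (the actual change in filemap_getpage() may differ slightly):

    if (!PageUptodate(page)) {
            if (nonblock) {
                    page_cache_release(page);
                    return NULL;
            }
            goto page_not_uptodate;
    }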
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds