25 May, 2005

1 commit

  • try_to_unmap_cluster() does:
        for (pte = pte_offset_map(pmd, address);
                address < end; pte++, address += PAGE_SIZE) {
            ...
        }

        pte_unmap(pte);

    It may take a little staring to notice, but pte can actually fall off the
    end of the pte page in this iteration, which makes life difficult for
    kmap_atomic() and the users not expecting it to BUG(). Of course, we're
    somewhat lucky in that arithmetic elsewhere in the function guarantees that
    at least one iteration is made, lest this force larger rearrangements to be
    made. This issue and patch also apply to non-mm mainline and, with trivial
    adjustments, to at least two related kernels.

    Discovered during internal testing at Oracle.

    Signed-off-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
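
The hazard described in this commit can be modelled in user space. The sketch below is not the kernel code: `struct walk_result`, `walk_cluster()`, `in_page()` and `PTRS_PER_PAGE` are hypothetical names, and a plain array stands in for the mapped pte page. It shows that after the loop the cursor sits one slot past the last entry visited, so a pointer that is still inside the page is `cursor - 1`, which is only valid because at least one iteration ran.

```c
#include <assert.h>
#include <stddef.h>

#define PTRS_PER_PAGE 8   /* hypothetical; real pte pages hold far more entries */

struct walk_result {
    size_t visited;       /* number of slots the loop touched */
    const int *cursor;    /* where the loop variable ended up */
    const int *last;      /* last slot actually visited (cursor - 1) */
};

/* Visit slots [first, first + count) of one "pte page", like the for loop above. */
static struct walk_result walk_cluster(const int *page, size_t first, size_t count)
{
    struct walk_result r = {0, NULL, NULL};
    const int *pte = page + first;
    size_t i;

    /* The model relies on at least one iteration, as the commit notes. */
    assert(count >= 1 && first + count <= PTRS_PER_PAGE);

    for (i = 0; i < count; pte++, i++)
        r.visited++;

    r.cursor = pte;       /* may equal page + PTRS_PER_PAGE: one past the page */
    r.last = pte - 1;     /* always inside the page when count >= 1 */
    return r;
}

/* True if p points at a slot inside the page. */
static int in_page(const int *page, const int *p)
{
    return p >= page && p < page + PTRS_PER_PAGE;
}
```

Walking the last four slots of the page leaves `cursor` outside the page while `last` is still inside it, which is the distinction an unmap routine keyed on the page address cares about.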
     

22 May, 2005

1 commit

  • I came across the following problem while running ltp-aiodio testcases from
    ltp-full-20050405 on linux-2.6.12-rc3-mm3. I tried running the tests with
    EXT3 as well as JFS filesystems.

    One or two fsx-linux testcases were hung after some time. These testcases
    were hanging at wait_for_all_aios().

    Debugging shows that there were some iocbs which were not getting completed
    even though the last retry for those returned -EIOCBQUEUED. Also all such
    pending iocbs represented READ operation.

    Further debugging revealed that all such iocbs hit EOF in the DIO layer.
    To be more precise, the "pos" from which they were trying to read was
    greater than the "size" of the file. So the generic_file_direct_IO
    returned 0.

    This happens rarely as there is already a check in
    __generic_file_aio_read(), for whether "pos" < "size" before calling direct
    IO routine.

    > size = i_size_read(inode);
    > if (pos < size) {
    >         retval = generic_file_direct_IO(READ, iocb,
    >                                         iov, pos, nr_segs);

    But for READ, we are taking the inode->i_sem only in the DIO layer. So it
    is possible that some other process can change the size of the file before
    we take the i_sem. In such a case (when "pos" > "size"), the
    __generic_file_aio_read() would return -EIOCBQUEUED even though there were
    no I/O requests submitted by the DIO layer. This would cause the AIO layer
    to expect aio_complete() for the iocb, which does not happen. And thus the
    test hangs forever, waiting for an I/O completion, where there are no
    requests submitted at all.

    The following patch makes __generic_file_aio_read() return 0 (instead of
    returning -EIOCBQUEUED), on getting 0 from generic_file_direct_IO(), so
    that the AIO layer does the aio_complete().

    Testing:

    I have tested the patch on an SMP machine (with 2 Pentium 4 (HT)) running
    linux-2.6.12-rc3-mm3. I ran the ltp-aiodio testcases and none of the
    fsx-linux tests hung. Also the aio-stress tests ran without any problem.

    Signed-off-by: Suzuki K P
    Signed-off-by: Suparna Bhattacharya
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suparna Bhattacharya
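
The return-value change described in this commit can be sketched as a small user-space model. This is not the kernel function: `old_read_result()`, `new_read_result()` and `MODEL_EIOCBQUEUED` are hypothetical names, and the model assumes the pre-patch code turned any non-negative direct-I/O result into -EIOCBQUEUED for an asynchronous request.

```c
#include <assert.h>

#define MODEL_EIOCBQUEUED 529  /* stand-in for the kernel's EIOCBQUEUED */

/*
 * Model of the decision made after the direct-I/O call for a read.
 * dio_ret is what the DIO layer returned: >0 bytes, 0 for EOF, <0 for an error.
 * is_sync is nonzero for a synchronous request.
 */
static long old_read_result(long dio_ret, int is_sync)
{
    if (dio_ret >= 0 && !is_sync)
        return -MODEL_EIOCBQUEUED;   /* also taken when nothing was submitted */
    return dio_ret;
}

static long new_read_result(long dio_ret, int is_sync)
{
    if (dio_ret > 0 && !is_sync)
        return -MODEL_EIOCBQUEUED;   /* only when I/O is really in flight */
    return dio_ret;                  /* 0 at EOF: the caller can complete the iocb */
}
```

With a direct-I/O result of 0, the old model reports the request as queued although no completion will ever arrive, while the new model returns 0 so the AIO layer can complete the iocb itself.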
     

21 May, 2005

1 commit

  • Caused oopses again. Also fix potential mismatch in checking if
    change_page_attr was needed.

    To do it without races I needed to change mm/vmalloc.c to export a
    __remove_vm_area that does not take vmlist lock.

    Noticed by Terence Ripperda and based on a patch of his.

    Signed-off-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

20 May, 2005

1 commit


19 May, 2005

1 commit

  • Prevent the topdown allocator from allocating mmap areas all the way
    down to address zero.

    We still allow a MAP_FIXED mapping of page 0 (needed for various things,
    ranging from Wine and DOSEMU to people who want to allow speculative
    loads off a NULL pointer).

    Tested by Chris Wright.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 May, 2005

5 commits


06 May, 2005

1 commit


04 May, 2005

1 commit


01 May, 2005

16 commits

  • Some KernelDoc descriptions are updated to match the current code.
    No code changes.

    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Waitz
     
  • I have recompiled the Linux kernel 2.6.11.5 documentation for me and our
    university students again. The documentation could be extended to more
    sources which are equipped with structured comments for recent 2.6 kernels. I
    have tried to proceed with that task. I have done that several times since
    2.6.0 and it gets boring to do the same changes again and again. The Linux
    kernel compiles after the changes for i386 and ARM targets. I have added
    references to some more files into the kernel-api book, and I have added some
    section names as well. So please check that the changes do not break
    something and that the categories are not too skewed.

    I have changed kernel-doc to accept "fastcall" and "asmlinkage" words reserved
    by kernel convention. Most of the other changes are modifications in the
    comments to make kernel-doc happy, accept some parameters description and do
    not bail out on errors. Changed to @pid in the description, moved some
    #ifdef before comments to correct function to comments bindings, etc.

    You can see result of the modified documentation build at
    http://cmp.felk.cvut.cz/~pisa/linux/lkdb-2.6.11.tar.gz

    Some more sources are ready to be included into kernel-doc generated
    documentation. Sources have been added into kernel-api for now. Some more
    section names added and probably some more chaos introduced as result of quick
    cleanup work.

    Signed-off-by: Pavel Pisa
    Signed-off-by: Martin Waitz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Pisa
     
  • This patch changes calls to synchronize_kernel(), deprecated in the earlier
    "Deprecate synchronize_kernel, GPL replacement" patch, to instead call the new
    synchronize_rcu() and synchronize_sched() APIs.

    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     
  • Remove PAGE_BUG - replace it with BUG and BUG_ON.

    Signed-off-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Mackall
     
  • Replace a number of memory barriers with smp_ variants. This means we won't
    take the unnecessary hit on UP machines.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • The patch makes the following function calls available to allocate memory
    on a specific node without changing the basic operation of the slab
    allocator:

    kmem_cache_alloc_node(kmem_cache_t *cachep, unsigned int flags, int node);
    kmalloc_node(size_t size, unsigned int flags, int node);

    in a similar way to the existing node-blind functions:

    kmem_cache_alloc(kmem_cache_t *cachep, unsigned int flags);
    kmalloc(size, flags);

    kmem_cache_alloc_node was changed to pass flags and the node information
    through the existing layers of the slab allocator (which led to some minor
    rearrangements). The functions at the lowest layer (kmem_getpages,
    cache_grow) are already node aware. Also __alloc_percpu can call
    kmalloc_node now.

    Performance measurements (using the pageset localization patch) yield:

    w/o patches:
    Tasks    jobs/min  jti  jobs/min/task    real      cpu
        1      484.27  100       484.2736   12.02     1.97  Wed Mar 30 20:50:43 2005
      100    25170.83   91       251.7083   23.12   150.10  Wed Mar 30 20:51:06 2005
      200    34601.66   84       173.0083   33.64   294.14  Wed Mar 30 20:51:40 2005
      300    37154.47   86       123.8482   46.99   436.56  Wed Mar 30 20:52:28 2005
      400    39839.82   80        99.5995   58.43   580.46  Wed Mar 30 20:53:27 2005
      500    40036.32   79        80.0726   72.68   728.60  Wed Mar 30 20:54:40 2005
      600    44074.21   79        73.4570   79.23   872.10  Wed Mar 30 20:55:59 2005
      700    44016.60   78        62.8809   92.56  1015.84  Wed Mar 30 20:57:32 2005
      800    40411.05   80        50.5138  115.22  1161.13  Wed Mar 30 20:59:28 2005
      900    42298.56   79        46.9984  123.83  1303.42  Wed Mar 30 21:01:33 2005
     1000    40955.05   80        40.9551  142.11  1441.92  Wed Mar 30 21:03:55 2005

    with pageset localization and slab API patches:
    Tasks    jobs/min  jti  jobs/min/task    real      cpu
        1      484.19  100       484.1930   12.02     1.98  Wed Mar 30 21:10:18 2005
      100    27428.25   92       274.2825   21.22   149.79  Wed Mar 30 21:10:40 2005
      200    37228.94   86       186.1447   31.27   293.49  Wed Mar 30 21:11:12 2005
      300    41725.42   85       139.0847   41.84   434.10  Wed Mar 30 21:11:54 2005
      400    43032.22   82       107.5805   54.10   582.06  Wed Mar 30 21:12:48 2005
      500    42211.23   83        84.4225   68.94   722.61  Wed Mar 30 21:13:58 2005
      600    40084.49   82        66.8075   87.12   873.11  Wed Mar 30 21:15:25 2005
      700    44169.30   79        63.0990   92.24  1008.77  Wed Mar 30 21:16:58 2005
      800    43097.94   79        53.8724  108.03  1155.88  Wed Mar 30 21:18:47 2005
      900    41846.75   79        46.4964  125.17  1303.38  Wed Mar 30 21:20:52 2005
     1000    40247.85   79        40.2478  144.60  1442.21  Wed Mar 30 21:23:17 2005

    Signed-off-by: Christoph Lameter
    Signed-off-by: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The smp_mb() is because sync_page() doesn't have PG_locked while it accesses
    page_mapping(page). The comments in the patch (the entire patch is the
    addition of this comment) try to explain further how and why smp_mb() is
    used.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    William Lee Irwin III
     
  • Always use page counts when doing RLIMIT_MEMLOCK checking to avoid possible
    overflow.

    Signed-off-by: Chris Wright
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wright
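
The overflow this commit guards against can be illustrated with a user-space sketch. It is a model under stated assumptions, not the kernel check: `over_limit_bytes()`, `over_limit_pages()` and `MODEL_PAGE_SHIFT` are hypothetical names, 4 KiB pages are assumed, and `uint32_t` stands in for a 32-bit `unsigned long`.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* Byte-based check: converting pages to bytes can wrap in 32 bits. */
static int over_limit_bytes(uint32_t locked_pages, uint32_t limit_bytes)
{
    uint32_t locked_bytes = locked_pages << MODEL_PAGE_SHIFT;  /* may wrap */
    return locked_bytes > limit_bytes;
}

/* Page-based check: shift the limit down instead, which cannot overflow. */
static int over_limit_pages(uint32_t locked_pages, uint32_t limit_bytes)
{
    uint32_t limit_pages = limit_bytes >> MODEL_PAGE_SHIFT;
    return locked_pages > limit_pages;
}
```

For a request of 0x100001 pages against a 64 KiB limit, the byte count wraps to a single page and the byte-based check passes it, whereas the page-based comparison rejects it.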
     
  • This is a patch for counting the number of pages for bounce buffers. It's
    shown in /proc/vmstat.

    Currently, the number of bounce pages is not counted anywhere. So, if
    there are many bounce pages, it seems that there are leaked pages. And
    it's difficult for a user to imagine the usage of bounce pages. So, it's
    meaningful to show the number of bounce pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Use the new __GFP_NOMEMALLOC to simplify the previous handling of
    PF_MEMALLOC.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Mempool is pretty clever. Looks too clever for its own good :) It
    shouldn't really know so much about page reclaim internals.

    - don't guess about what effective page reclaim might involve.

    - don't randomly flush out all dirty data if some unlikely thing
      happens (alloc returns NULL). page reclaim can (sort of :P) handle
      it.

    I think the main motivation is trying to avoid pool->lock at all costs.
    However the first allocation is attempted with __GFP_WAIT cleared, so it
    will be 'can_try_harder' if it hits the page allocator. So if allocation
    still fails, then we can probably afford to hit the pool->lock - and what's
    the alternative? Try page reclaim and hit zone->lru_lock?

    A nice upshot is that we don't need to do any fancy memory barriers or do
    (intentionally) racy access to pool-> fields outside the lock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Mempools have 2 problems.

    The first is that mempool_alloc can possibly get stuck in __alloc_pages
    when it should opt to fail and take an element from its reserved pool.

    The second is that it will happily eat emergency PF_MEMALLOC reserves
    instead of going to its reserved pool.

    Fix the first by passing __GFP_NORETRY in the allocation calls in
    mempool_alloc. Fix the second by introducing a __GFP_MEMPOOL flag which
    directs the page allocator not to allocate from the reserve pool.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
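
The intended allocation order (try the underlying allocator once without retrying, then fall back to the pool's own reserve) can be sketched in user space. Everything here is a hypothetical model: `struct model_pool`, `model_pool_alloc()`, `alloc_fn`, `always_fail()` and `POOL_MIN` are not kernel names, and the function pointer stands in for an allocator called in a "fail fast" mode.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN 2   /* hypothetical reserve size */

struct model_pool {
    void *reserved[POOL_MIN];  /* preallocated elements */
    int curr_nr;               /* how many reserved elements remain */
};

/* Stand-in for the underlying allocator: returns NULL promptly on failure. */
typedef void *(*alloc_fn)(void);

/* Try the allocator once; on failure fall back to the pool's own reserve. */
static void *model_pool_alloc(struct model_pool *pool, alloc_fn try_alloc)
{
    void *element = try_alloc();

    if (element)
        return element;
    if (pool->curr_nr > 0)
        return pool->reserved[--pool->curr_nr];
    return NULL;  /* this model gives up; a blocking caller could wait instead */
}

/* Allocator that always fails, to exercise the fallback path. */
static void *always_fail(void)
{
    return NULL;
}
```

When the allocator fails, the model hands out reserved elements in last-in, first-out order and only returns NULL once the reserve is empty.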
     
  • Jack Steiner reported this to have fixed his problem (bad colouring):
    "The patches fix both problems that I found - bad
    coloring & excessive pages in pagesets."

    In most workloads this is not likely to be such a pronounced problem,
    however it should help corner cases. And avoiding powers of 2 in these
    types of memory operations is always a good idea.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • mm/rmap.c:page_referenced_one() and mm/rmap.c:try_to_unmap_one() contain
    identical code that

    - takes mm->page_table_lock;

    - drills through page tables;

    - checks that correct pte is reached.

    Coalesce this into page_check_address().

    Signed-off-by: Nikita Danilov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikita Danilov
     
  • Address bug #4508: there's potential for wraparound in the various places
    where we perform RLIMIT_AS checking.

    (I'm a bit worried about acct_stack_growth(). Are we sure that vma->vm_mm is
    always equal to current->mm? If not, then we're comparing some other
    process's total_vm with the calling process's rlimits).

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • Anton Altaparmakov points out:

    - It calls fault_in_pages_readable(), which is completely bogus if
      @nr_segs > 1. It needs to be replaced by a to-be-written
      "fault_in_pages_readable_iovec()".

    - It increments @buf even in the iovec case, thus @buf can point to random
      memory really quickly (in the iovec case), and then it calls
      fault_in_pages_readable() on this random memory.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
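
The second point, advancing through an iovec segment by segment instead of bumping a single buffer pointer, can be sketched in user space. This is a hedged illustration, not the kernel fix: `struct iov_cursor`, `cursor_addr()` and `cursor_advance()` are hypothetical names, and only the POSIX `struct iovec` is real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

struct iov_cursor {
    const struct iovec *iov;  /* current segment */
    size_t nr_segs;           /* segments left, including the current one */
    size_t offset;            /* bytes already consumed in the current segment */
};

/* Address of the next unconsumed byte, or NULL when the vector is exhausted. */
static const char *cursor_addr(const struct iov_cursor *c)
{
    if (c->nr_segs == 0)
        return NULL;
    return (const char *)c->iov->iov_base + c->offset;
}

/* Advance by `bytes`, stepping into later segments rather than running past one. */
static void cursor_advance(struct iov_cursor *c, size_t bytes)
{
    while (c->nr_segs > 0) {
        size_t left = c->iov->iov_len - c->offset;

        if (bytes < left) {
            c->offset += bytes;
            return;
        }
        /* Current segment fully consumed: move on to the next one. */
        bytes -= left;
        c->iov++;
        c->nr_segs--;
        c->offset = 0;
    }
}
```

With two 4-byte segments, advancing by 3 and then by 2 lands one byte into the second buffer, whereas a plain `buf += 5` would point past the end of the first.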
     

25 Apr, 2005

1 commit

  • zonelist_policy() forgot to mask non-zone bits from gfp when comparing
    zone number with policy_zone.

    ACKed-by: Andi Kleen
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
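
Why the mask matters can be shown with a small model. The bit layout below is invented for illustration (`MODEL_ZONEMASK`, `MODEL_ZONE_DMA`, `MODEL_FLAG_WAIT` and both helper functions are hypothetical); the only assumption taken from the commit is that the zone number lives in some low bits of gfp and that other bits carry unrelated modifiers.

```c
#include <assert.h>

/* Hypothetical layout: low two bits select the zone, higher bits are modifiers. */
#define MODEL_ZONEMASK   0x03u
#define MODEL_ZONE_DMA   0x01u
#define MODEL_FLAG_WAIT  0x10u   /* a non-zone modifier bit */

/* Unmasked comparison: modifier bits inflate the value being compared. */
static int zone_ok_unmasked(unsigned int gfp, unsigned int policy_zone)
{
    return gfp >= policy_zone;
}

/* Masked comparison: only the zone bits take part. */
static int zone_ok_masked(unsigned int gfp, unsigned int policy_zone)
{
    return (gfp & MODEL_ZONEMASK) >= policy_zone;
}
```

A request for zone 1 with a modifier bit set compares as 0x11 in the unmasked version and wrongly passes a `policy_zone` of 2; the masked version compares 1 against 2 and does not.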
     

20 Apr, 2005

7 commits

  • Once all the MMU architectures define FIRST_USER_ADDRESS, remove hack from
    mmap.c which derived it from FIRST_USER_PGD_NR.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove use of FIRST_USER_PGD_NR from sys_mincore: it's inconsistent (no other
    syscall refers to it), unnecessary (sys_mincore loops over vmas further down)
    and incorrect (misses user addresses in ARM's first pgd).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The patches to free_pgtables by vma left problems on any architectures which
    leave some user address page table entries unencapsulated by vma. Andi has
    fixed the 32-bit vDSO on x86_64 to use a vma. Now fix arm (and arm26), whose
    first PAGE_SIZE is reserved (perhaps) for machine vectors.

    Our calls to free_pgtables must not touch that area, and exit_mmap's
    BUG_ON(nr_ptes) must allow that arm's get_pgd_slow may (or may not) have
    allocated an extra page table, which its free_pgd_slow would free later.

    FIRST_USER_PGD_NR has misled me and others: until all the arches define
    FIRST_USER_ADDRESS instead, a hack in mmap.c to derive one from t'other. This
    patch fixes the bugs, the remaining patches just clean it up.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While dabbling here in mmap.c, clean up mysterious "mpnt"s to "vma"s.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • ia64 and ppc64 had hugetlb_free_pgtables functions which were no longer being
    called, and it wasn't obvious what to do about them.

    The ppc64 case turns out to be easy: the associated tables are noted elsewhere
    and freed later, safe to either skip its hugetlb areas or go through the
    motions of freeing nothing. Since ia64 does need a special case, restore to
    ppc64 the special case of skipping them.

    The ia64 hugetlb case has been broken since pgd_addr_end went in, though it
    probably appeared to work okay if you just had one such area; in fact it's
    been broken much longer if you consider a long munmap spanning from another
    region into the hugetlb region.

    In the ia64 hugetlb region, more virtual address bits are available than in
    the other regions, yet the page tables are structured the same way: the page
    at the bottom is larger. Here we need to scale down each addr before passing
    it to the standard free_pgd_range. Was about to write a hugely_scaled_down
    macro, but found htlbpage_to_page already exists for just this purpose. Fixed
    off-by-one in ia64 is_hugepage_only_range.

    Uninline free_pgd_range to make it available to ia64. Make sure the
    vma-gathering loop in free_pgtables cannot join a hugepage_only_range to any
    other (safe to join huges? probably but don't bother).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's only one usage of MM_VM_SIZE(mm) left, and it's a troublesome macro
    because mm doesn't contain the (32-bit emulation?) info needed. But it too is
    only needed because we ignore the end from the vma list.

    We could make flush_pgtables return that end, or unmap_vmas. Choose the
    latter, since it's a natural fit with unmap_mapping_range_vma needing to know
    its restart addr. This does make more than minimal change, but if unmap_vmas
    had returned the end before, this is how we'd have done it, rather than
    storing the break_addr in zap_details.

    unmap_vmas used to return count of vmas scanned, but that's just debug which
    hasn't been useful in a while; and if we want the map_count 0 on exit check
    back, it can easily come from the final remove_vm_struct loop.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Recent woes with some arches needing their own pgd_addr_end macro; and 4-level
    clear_page_range regression since 2.6.10's clear_page_tables; and its
    long-standing well-known inefficiency in searching throughout the higher-level
    page tables for those few entries to clear and free: all can be blamed on
    ignoring the list of vmas when we free page tables.

    Replace exit_mmap's clear_page_range of the total user address space by
    free_pgtables operating on the mm's vma list; unmap_region use it in the same
    way, giving floor and ceiling beyond which it may not free tables. This
    brings lmbench fork/exec/sh numbers back to 2.6.10 (unless preempt is enabled,
    in which case latency fixes spoil unmap_vmas throughput).

    Beware: the do_mmap_pgoff driver failure case must now use unmap_region
    instead of zap_page_range, since a page table might have been allocated, and
    can only be freed while it is touched by some vma.

    Move free_pgtables from mmap.c to memory.c, where its lower levels are adapted
    from the clear_page_range levels. (Most of free_pgtables' old code was
    actually for a non-existent case, prev not properly set up, dating from before
    hch gave us split_vma.) Pass mmu_gather** in the public interfaces, since we
    might want to add latency lockdrops later; but no attempt to do so yet, going
    by vma should itself reduce latency.

    But what if is_hugepage_only_range? Those ia64 and ppc64 cases need careful
    examination: put that off until a later patch of the series.

    What of x86_64's 32bit vdso page __map_syscall32 maps outside any vma?

    And the range to sparc64's flush_tlb_pgtables? It's less clear to me now that
    we need to do more than is done here - every PMD_SIZE ever occupied will be
    flushed, do we really have to flush every PGDIR_SIZE ever partially occupied?
    A shame to complicate it unnecessarily.

    Special thanks to David Miller for time spent repairing my ceilings.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Apr, 2005

4 commits

  • We only call pageout() for dirty pages, so this test is redundant.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • iscsi/lvm2/multipath needs guaranteed protection from the oom-killer, so
    make the magical value of -17 in /proc//oom_adj defeat the oom-killer
    altogether.

    (akpm: we still need to document oom_adj and friends in
    Documentation/filesystems/proc.txt!)

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
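
The effect of the magic value can be modelled as a victim-selection loop that skips exempt tasks outright. This is a user-space sketch with hypothetical names (`struct model_task`, `select_victim()`, `MODEL_OOM_DISABLE`, and a precomputed `points` score); only the value -17 comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_OOM_DISABLE (-17)   /* the value named in the commit message */

struct model_task {
    int oom_adj;          /* per-task adjustment, as written to oom_adj */
    unsigned long points; /* hypothetical precomputed badness score */
};

/* Return the index of the task to kill, or -1 if every task is exempt. */
static int select_victim(const struct model_task *tasks, size_t n)
{
    int victim = -1;
    unsigned long worst = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tasks[i].oom_adj == MODEL_OOM_DISABLE)
            continue;               /* never considered, whatever its score */
        if (victim < 0 || tasks[i].points > worst) {
            worst = tasks[i].points;
            victim = (int)i;
        }
    }
    return victim;
}
```

In this model a task with the highest score is still passed over when its adjustment is -17, and the selection falls to the worst of the remaining tasks.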
     
  • We will return NULL from filemap_getpage when a page does not exist in the
    page cache and MAP_NONBLOCK is specified, here:

        page = find_get_page(mapping, pgoff);
        if (!page) {
            if (nonblock)
                return NULL;
            goto no_cached_page;
        }

    But we forget to do so when the page in the cache is not uptodate. The
    following could result in a blocking call:

        /*
         * Ok, found a page in the page cache, now we need to check
         * that it's up-to-date.
         */
        if (!PageUptodate(page))
            goto page_not_uptodate;

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Moyer
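
The missing non-blocking case can be sketched as a user-space model. It is an illustration under assumptions, not the actual patch: `struct model_page` and `model_getpage()` are hypothetical, the `would_block` flag replaces the kernel's goto targets, and the reference drop reflects the assumption that the lookup took a reference on the page.

```c
#include <assert.h>
#include <stddef.h>

struct model_page {
    int uptodate;   /* contents valid? */
    int refcount;   /* references held on the page */
};

/*
 * Model of a lookup that honours a non-blocking request. `cached` is the page
 * found in the cache (or NULL). `*would_block` is set when the caller would
 * have to start or wait for I/O.
 */
static struct model_page *model_getpage(struct model_page *cached, int nonblock,
                                        int *would_block)
{
    *would_block = 0;

    if (!cached) {
        if (nonblock)
            return NULL;
        *would_block = 1;       /* the page has to be read in */
        return NULL;
    }

    cached->refcount++;         /* the lookup takes a reference */

    if (!cached->uptodate) {
        if (nonblock) {
            cached->refcount--; /* drop the reference before bailing out */
            return NULL;
        }
        *would_block = 1;       /* wait for the page to become up to date */
    }
    return cached;
}
```

A non-blocking lookup of a cached but stale page now returns NULL without blocking and without leaking a reference, matching the behaviour already present for the not-in-cache case.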
     
  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds