13 Dec, 2005

2 commits

  • Nick Piggin points out that a few drivers play games with VM_IO (why?
    who knows..) and thus a pfn-remapped area may not have that bit set even
    if remap_pfn_range() set it originally.

    So make it explicit in get_user_pages() that we don't follow VM_PFNMAP
    pages, since pretty much by definition they do not have a "struct page"
    associated with them.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
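
    A minimal sketch of the resulting test in get_user_pages(), reconstructed
    from the description above (the exact surrounding code may differ):

        if (!vma || (vma->vm_flags & (VM_IO | VM_PFNMAP))
                        || !(vm_flags & vma->vm_flags))
                return i ? : -EFAULT;   /* refuse to follow pfn-remapped areas */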
     
  • We hit the BUG_ON() in __alloc_bootmem_core() when there is no free page
    available in the first node's memory. In the case of kdump on PPC64
    (Power 4 machine), the captured kernel uses two memory regions - memory
    for TCE tables (tce-base and tce-size, at the top of RAM and reserved) and
    the captured kernel's own memory region (crashk_base and crashk_size).
    Since that memory is reserved in the first node, __alloc_bootmem_core()
    should return and let the search move on to the next node (pgdat).

    Currently, find_next_zero_bit() returns the n-th bit (eidx) when there is
    no free page. Then test_bit() fails, because init_bootmem_core() set 0xff
    only for the actual size, even though bdata->node_bootmem_map is rounded
    up to one page. We hit the BUG_ON() after failing to enter the second
    "for" loop.

    Signed-off-by: Haren Myneni
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haren Myneni
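
    The guard described above sits in __alloc_bootmem_core()'s scan loop;
    roughly (a reconstructed sketch, not the literal diff):

        i = find_next_zero_bit(bdata->node_bootmem_map, eidx, i);
        if (i >= eidx)
                break;  /* no free page in this node: fail, try the next pgdat */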
     

12 Dec, 2005

3 commits

  • The VM layer (for historical reasons) turns a read-only shared mmap into
    a private-like mapping with the VM_MAYWRITE bit clear. Thus checking
    just VM_SHARED isn't actually sufficient.

    So use a trivial helper function for the cases where we wanted to inquire
    if a mapping was COW-like or not.

    Moo!

    Signed-off-by: Linus Torvalds

    Linus Torvalds
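
    The helper in question is essentially this - a mapping is COW-like when it
    may be written to but is not shared:

        static inline int is_cow_mapping(unsigned long flags)
        {
                return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
        }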
     
  • With the previous commit, we can handle arbitrary shared re-mappings
    even without this complexity (the special support for incomplete PFN
    mappings), and since the only known private mappings are for strange
    users of /dev/mem (which never create an incomplete one), there seems
    to be no reason to keep supporting it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • A shared mapping doesn't cause COW-pages, so we don't need to worry
    about the whole vm_pgoff logic to decide if a PFN-remapped page has
    gone through COW or not.

    This makes it possible to entirely avoid the special "partial remapping"
    logic for the common case.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Dec, 2005

1 commit

  • This is what a lot of drivers will actually want to use to insert
    individual pages into a user VMA. It doesn't have the old PageReserved
    restrictions of remap_pfn_range(), and it doesn't complain about partial
    remappings.

    The page you insert needs to be a nice clean kernel allocation, so you
    can't insert arbitrary page mappings with this, but that's not what
    people want.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
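
    For illustration, a hypothetical driver's mmap() method built on it might
    look like this (mydrv_mmap and mydrv_page are invented names; the page is
    assumed to come from a plain alloc_page(GFP_KERNEL)):

        static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
        {
                /* insert one clean kernel page at the start of the user mapping */
                return vm_insert_page(vma, vma->vm_start, mydrv_page);
        }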
     

29 Nov, 2005

6 commits

  • The system call gate area handling called vm_normal_page() with the
    wrong vma (which was always NULL, and caused an oops).

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • With Andrew Morton

    The slab scanning code tries to balance the scanning rate of slabs versus the
    scanning rate of LRU pages. To do this, it retains state concerning how many
    slabs have been scanned - if a particular slab shrinker didn't scan enough
    objects, we remember that for next time, and scan more objects on the next
    pass.

    The problem with this is that with (say) a huge number of GFP_NOIO
    direct-reclaim attempts, the number of objects to be scanned when we
    finally get a GFP_KERNEL request can be huge, because some shrinker
    handlers just bail out if !__GFP_FS.

    So the patch clamps the number of objects to be scanned to twice the
    total number of objects in the slab cache.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
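
    The clamp itself is roughly this, in shrink_slab() (a sketch; variable
    names may differ in the actual patch):

        if (shrinker->nr > max_pass * 2)
                shrinker->nr = max_pass * 2;    /* never queue more than 2x the cache */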
     
  • Some users (hi Zwane) have seen a problem when running a workload that
    eats nearly all of physical memory - the system does an OOM kill, even
    when there is still a lot of swap free.

    The problem appears to be a very big task that is holding the swap
    token, and the VM has a very hard time finding any other page in the
    system that is swappable.

    Instead of ignoring the swap token when sc->priority reaches 0, we could
    simply take the swap token away from the memory hog and make sure we
    don't give it back to the memory hog for a few seconds.

    This patch resolves the problem Zwane ran into.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
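
    The idea, very roughly (a hedged sketch - disable_swap_token and the exact
    trigger condition are reconstructed from memory and may differ): on the
    last reclaim pass, revoke the token instead of merely ignoring it.

        if (!priority)
                disable_swap_token();   /* put_swap_token(swap_token_mm): the hog loses it */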
     
  • I believe this patch is required to fix breakage in the asynch reclaim
    watermark logic introduced by this patch:

    http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7fb1d9fca5c6e3b06773b69165a73f3fb786b8ee

    Just some background of the watermark logic in case it isn't clear...
    Basically what we have is this:

    --- pages_high
    |
    | (a)
    |
    --- pages_low
    |
    | (b)
    |
    --- pages_min
    |
    | (c)
    |
    --- 0

    Now when pages_low is reached, we want to kick asynch reclaim, which gives us
    an interval of "b" before we must start synch reclaim, and gives kswapd an
    interval of "a" before it need go back to sleep.

    When pages_min is reached, normal allocators must enter synch reclaim, but
    PF_MEMALLOC, ALLOC_HARDER, and ALLOC_HIGH (ie. atomic allocations, recursive
    allocations, etc.) get access to varying amounts of the reserve "c".

    Signed-off-by: Nick Piggin
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
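
    Schematically, the allocator's use of these watermarks looks like this
    (a simplified sketch, not the literal mm/page_alloc.c code; the label
    try_direct_reclaim is invented):

        /* below pages_low: kick kswapd, asynch reclaim runs during interval (b) */
        if (!zone_watermark_ok(z, order, z->pages_low, classzone_idx, alloc_flags))
                wakeup_kswapd(z, order);

        /* below pages_min: normal allocators must do synch (direct) reclaim;
         * PF_MEMALLOC / ALLOC_HARDER / ALLOC_HIGH may dip into reserve (c) */
        if (!zone_watermark_ok(z, order, z->pages_min, classzone_idx, alloc_flags))
                goto try_direct_reclaim;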
     
  • LD .tmp_vmlinux1
    mm/built-in.o(.text+0x100d6): In function `copy_page_range':
    : undefined reference to `__pud_alloc'
    mm/built-in.o(.text+0x1010b): In function `copy_page_range':
    : undefined reference to `__pmd_alloc'
    mm/built-in.o(.text+0x11ef4): In function `__handle_mm_fault':
    : undefined reference to `__pud_alloc'
    fs/built-in.o(.text+0xc930): In function `install_arg_page':
    : undefined reference to `__pud_alloc'
    make: *** [.tmp_vmlinux1] Error 1

    Those missing references in mm/memory.c arise from this code in
    include/linux/mm.h, combined with the fact that __PGTABLE_PMD_FOLDED and
    __PGTABLE_PUD_FOLDED are both set and __ARCH_HAS_4LEVEL_HACK is not:

    /*
     * The following ifdef needed to get the 4level-fixup.h header to work.
     * Remove it when 4level-fixup.h has been removed.
     */
    #if defined(CONFIG_MMU) && !defined(__ARCH_HAS_4LEVEL_HACK)
    static inline pud_t *pud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
    {
            return (unlikely(pgd_none(*pgd)) && __pud_alloc(mm, pgd, address)) ?
                    NULL : pud_offset(pgd, address);
    }

    static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
    {
            return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address)) ?
                    NULL : pmd_offset(pud, address);
    }
    #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */

    With my configuration the pgd_none and pud_none routines are inlines
    returning a constant 0. Apparently the old compiler avoids generating
    calls to __pud_alloc and __pmd_alloc but still lists them as undefined
    references in the module's symbol table.

    I don't know which change caused this problem. I think it was added
    somewhere between 2.6.14 and 2.6.15-rc1, because I remember building
    several 2.6.14-rc kernels without difficulty. However I can't point to an
    individual culprit.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Stern
     
  • This replaces the (in my opinion horrible) VM_UNPAGED logic with very
    explicit support for a "remapped page range" aka VM_PFNMAP. It allows a
    VM area to contain an arbitrary range of page table entries that the VM
    never touches, and never considers to be normal pages.

    Any user of "remap_pfn_range()" automatically gets this new
    functionality, and doesn't even have to mark the pages reserved or
    indeed mark them any other way. It just works. As a side effect, doing
    mmap() on /dev/mem works for arbitrary ranges.

    Sparc update from David in the next commit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
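
    The core of the new test, simplified from the vm_normal_page() logic this
    commit introduces (a sketch): a pte that still matches the vma's linear
    pfn layout has no struct page behind it.

        if (vma->vm_flags & VM_PFNMAP) {
                unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
                if (pfn == vma->vm_pgoff + off)
                        return NULL;    /* raw pfn mapping: no struct page, hands off */
        }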
     

24 Nov, 2005

2 commits

  • Fix a 32 bit integer overflow in invalidate_inode_pages2_range.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Drokin
     
  • Closer attention to the arithmetic shows that neither ppc64 nor sparc really
    uses one page for multiple page tables: how on earth could they, while
    pte_alloc_one returns just a struct page pointer, with no offset?

    Well, arm26 manages it by returning a pte_t pointer cast to a struct page
    pointer, harumph, then compensating in its pmd_populate. But arm26 is never
    SMP, so it's not a problem for split ptlock either.

    And the PA-RISC situation has been recently improved: CONFIG_PA20 works
    without the 16-byte alignment which inflated its spinlock_t. But the current
    union of spinlock_t with private does make the 7xxx struct page significantly
    larger, even without debug, so disable its split ptlock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
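
    The disabling happens in mm/Kconfig, alongside the existing ARM exception
    (reconstructed; the surrounding defaults may differ slightly):

        config SPLIT_PTLOCK_CPUS
                int
                default "4096" if ARM && !CPU_CACHE_VIPT
                default "4096" if PARISC && !PA20
                default "4"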
     

23 Nov, 2005

10 commits

  • If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously
    it is possible for the nr_huge_pages variable to become incorrect. There
    is no locking in the set_max_huge_pages function around
    alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers
    to alloc_fresh_huge_page could race against each other as could a call to
    alloc_fresh_huge_page and a call to update_and_free_page. This patch just
    expands the area covered by the hugetlb_lock to cover the call into
    alloc_fresh_huge_page. It's hard to argue that a sysctl path is
    performance-critical enough to need more fine-grained locking.

    My reproducer was to run a couple of copies of the following script
    simultaneously:

    while [ true ]; do
            echo 1000 > /proc/sys/vm/nr_hugepages
            echo 500 > /proc/sys/vm/nr_hugepages
            echo 750 > /proc/sys/vm/nr_hugepages
            echo 100 > /proc/sys/vm/nr_hugepages
            echo 0 > /proc/sys/vm/nr_hugepages
    done

    and then watch /proc/meminfo and eventually you will see things like

    HugePages_Total: 100
    HugePages_Free: 109

    After applying the patch all seemed well.

    Signed-off-by: Eric Paris
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • It used to be the case that PG_reserved pages were silently never freed, but
    in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should
    work through such cases as they appear, fixing the code; but for now it's
    safer to issue the message without freeing the page, leaving PG_reserved set.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
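
    The shape of the change in the free path, as a hedged sketch (the real
    code routes this through free_pages_check(); details may differ):

        /* complain, but refuse to free a PG_reserved page */
        if (unlikely(PageReserved(page))) {
                bad_page(__FUNCTION__, page);   /* "Bad page state", PG_reserved kept */
                return;                         /* leak it rather than free it */
        }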
     
  • It's strange enough to be looking out for anonymous pages in VM_UNPAGED
    areas, so let's not insert the ZERO_PAGE there - though whether it would matter will
    depend on what we decide about ZERO_PAGE refcounting.

    But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED
    area, do_no_page should never be: just BUG_ON.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area,
    zap_pte_range needs to free them, do_wp_page needs to COW them: just like
    ordinary pages, not like the unpaged.

    But recognizing them is a little subtle: because PageReserved is no longer a
    condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the
    distro permits, and whether it's advisable on this or that architecture, is
    another matter). So if we can see a PageAnon, it may not be ours to mess with
    (or may be ours from elsewhere in the address space). I suspect there's an
    entertaining insoluble self-referential problem here, but the page_is_anon
    function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will
    always be an odd choice.

    While updating the comment on page_address_in_vma, I noticed a potential
    NULL dereference in a path we don't actually take, and fixed it.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
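
    The helper reads roughly like this (a sketch; the exact conditions are
    reconstructed from memory):

        static inline int page_is_anon(struct page *page,
                                struct vm_area_struct *vma, unsigned long addr)
        {
                return page && PageAnon(page) && page_mapped(page) &&
                        page_address_in_vma(page, vma) == addr;
        }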
     
  • Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do
    Copy-On-Write without touching the VM_UNPAGED's page counts - but this is
    incomplete, because the anonymous page it inserts will itself need to be
    handled, here and in other functions - next patch.

    We still don't copy the page if the pfn is invalid, because the
    copy_user_highpage interface does not allow it. But that's not been a problem
    in the past: can be added in later if the need arises.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's one peculiar use of VM_RESERVED which the previous patch left behind:
    because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout
    cursor, but should never meet VM_RESERVED vmas, it was a way of extending
    VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose.
    But that's an empty set - they don't have the populate function required. So
    just throw away those VM_RESERVED tests.

    But one more interesting test in rmap.c has to go too: try_to_unmap_one
    will want to swap out an anonymous page from a VM_RESERVED or VM_UNPAGED
    area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few
    drivers set VM_RESERVED on areas which are then populated by nopage. The
    PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free pages in
    zap_pte_range, without changing those drivers not to set it: so their pages
    just leak away.

    Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core,
    to flag the special areas where the ptes may have no struct page, or if they
    have then it's not to be touched. Replace most instances of VM_RESERVED in
    core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and
    sparc64 io_remap_pfn_range.

    Revert the addition of VM_RESERVED to the powerpc vdso; it's not needed
    there. Is it needed anywhere? It still governs the mm->reserved_vm
    statistic, and special
    vmas not to be merged, and areas not to be core dumped; but could probably be
    eliminated later (the drivers are probably specifying it because in 2.4 it
    kept swapout off the vma, but in 2.6 we work from the LRU, which these pages
    don't get on).

    Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
    purpose whatsoever, and should be removed from drivers when we clean up.

    Signed-off-by: Hugh Dickins
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
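
    In include/linux/mm.h this amounts to roughly (reconstructed):

        #define VM_SHM          0x00000000      /* Means nothing: delete it later */
        #define VM_UNPAGED      0x00000008      /* Pages managed without map counts */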
     
  • It looks like snd_xxx is not the only nopage to be using PageReserved as a way
    of holding a high-order page together: which no longer works, but is masked by
    our failure to free from VM_RESERVED areas. We cannot fix that bug without
    first substituting another way to hold the high-order page together, while
    farming out the 0-order pages from within it.

    That's just what PageCompound is designed for, but it's been kept under
    CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-of-line
    put_page), doesn't slow down what most needs to be fast (already using
    hugetlb), and unifies the way we handle high-order pages.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
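
    The now-unconditional handling gives put_page() roughly this shape (a
    reconstructed sketch; details such as where the compound destructor is
    stored may differ):

        void put_page(struct page *page)
        {
                if (unlikely(PageCompound(page))) {
                        page = (struct page *)page_private(page);  /* head page */
                        if (put_page_testzero(page)) {
                                void (*dtor)(struct page *);
                                dtor = (void (*)(struct page *))page[1].mapping;
                                (*dtor)(page);
                        }
                        return;
                }
                if (put_page_testzero(page))
                        __page_cache_release(page);
        }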
     
  • The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you
    tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED, and failed
    with -EACCES: because do_wp_page lacks the refinement to COW pages in those
    areas, nor do we expect to find anonymous pages in them; and it seemed just
    bloat to add code for handling such a peculiar case. But immediately it
    caused vbetool and ddcprobe (using lrmi) to fail.

    So revert the "deprecated" messages, letting mmap and mprotect succeed. But
    leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've
    added the code to do it right: so this particular patch is only good if the
    app doesn't really need to write to that private area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the areas
    flagged VM_RESERVED in place of PageReserved. That is correct in theory - we
    ought not to interfere with struct pages in such a reserved area; but in
    practice it broke BTTV for one.

    So revert to prohibiting only on VM_IO: if someone gets into trouble with
    get_user_pages on VM_RESERVED, it'll just be a "don't do that".

    You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the first
    place, but now's not the time for breaking drivers without notice.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
