23 Nov, 2005

10 commits

  • If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously,
    it is possible for the nr_huge_pages variable to become incorrect. There
    is no locking in the set_max_huge_pages function around
    alloc_fresh_huge_page, which is able to update nr_huge_pages. Two callers
    to alloc_fresh_huge_page could race against each other as could a call to
    alloc_fresh_huge_page and a call to update_and_free_page. This patch just
    expands the area covered by the hugetlb_lock to cover the call into
    alloc_fresh_huge_page. A sysctl path is hardly performance critical, so I
    don't see a need for more fine-grained locking there.

    My reproducer was to run a couple of copies of the following script
    simultaneously:

    while [ true ]; do
            echo 1000 > /proc/sys/vm/nr_hugepages
            echo 500 > /proc/sys/vm/nr_hugepages
            echo 750 > /proc/sys/vm/nr_hugepages
            echo 100 > /proc/sys/vm/nr_hugepages
            echo 0 > /proc/sys/vm/nr_hugepages
    done

    and then watch /proc/meminfo and eventually you will see things like

    HugePages_Total: 100
    HugePages_Free: 109

    After applying the patch all seemed well.
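
    Roughly, the shape of the fix is the following (a sketch only, built from
    the function names mentioned above rather than the actual diff; the
    allocation flags and the per-node bookkeeping are illustrative). The point
    is simply that nr_huge_pages is only ever changed with hugetlb_lock held:

    static int alloc_fresh_huge_page(void)
    {
            struct page *page;

            page = alloc_pages(GFP_HIGHUSER | __GFP_COMP | __GFP_NOWARN,
                               HUGETLB_PAGE_ORDER);
            if (!page)
                    return 0;

            spin_lock(&hugetlb_lock);   /* lock now covers the counter update */
            nr_huge_pages++;
            /* ... per-node count and free-list insertion, also under the lock ... */
            spin_unlock(&hugetlb_lock);
            return 1;
    }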

    Signed-off-by: Eric Paris
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • It used to be the case that PG_reserved pages were silently never freed, but
    in 2.6.15-rc1 they may be freed with a "Bad page state" message. We should
    work through such cases as they appear, fixing the code; but for now it's
    safer to issue the message without freeing the page, leaving PG_reserved set.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's strange enough to be looking out for anonymous pages in VM_UNPAGED areas,
    so let's not insert the ZERO_PAGE there - though whether it would matter will
    depend on what we decide about ZERO_PAGE refcounting.

    But whereas do_anonymous_page may (exceptionally) be called on a VM_UNPAGED
    area, do_no_page should never be: just BUG_ON.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED area,
    zap_pte_range needs to free them, do_wp_page needs to COW them: just like
    ordinary pages, not like the unpaged.

    But recognizing them is a little subtle: because PageReserved is no longer a
    condition for remap_pfn_range, we can now mmap all of /dev/mem (whether the
    distro permits, and whether it's advisable on this or that architecture, is
    another matter). So if we can see a PageAnon, it may not be ours to mess with
    (or may be ours from elsewhere in the address space). I suspect there's an
    entertaining insoluble self-referential problem here, but the page_is_anon
    function does a good practical job, and MAP_PRIVATE PROT_WRITE VM_UNPAGED will
    always be an odd choice.

    While updating the comment on page_address_in_vma, I noticed a potential NULL
    dereference in a path we don't actually take, and fixed it.
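
    The kind of test this needs looks roughly like the following (a sketch
    only; the helper name and the exact conditions in the real page_is_anon
    differ, but page_address_in_vma is the function mentioned above):

    static inline int page_is_anon_sketch(struct page *page,
                                          struct vm_area_struct *vma,
                                          unsigned long addr)
    {
            /* Only treat it as one of "our" anonymous COWed pages if there
             * is a struct page at all, it is PageAnon, and it really is
             * mapped at this address in this vma. */
            return page && PageAnon(page) &&
                    page_address_in_vma(page, vma) == addr;
    }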

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let it do
    Copy-On-Write without touching the VM_UNPAGED area's page counts - but this is
    incomplete, because the anonymous page it inserts will itself need to be
    handled, here and in other functions - next patch.

    We still don't copy the page if the pfn is invalid, because the
    copy_user_highpage interface does not allow it. But that has not been a problem
    in the past; support can be added later if the need arises.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • There's one peculiar use of VM_RESERVED which the previous patch left behind:
    because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data as a swapout
    cursor, but should never meet VM_RESERVED vmas, it was a way of extending
    VM_NONLINEAR to VM_RESERVED vmas using vm_private_data for some other purpose.
    But that's an empty set - they don't have the populate function required. So
    just throw away those VM_RESERVED tests.

    But one more interesting test in rmap.c has to go too: try_to_unmap_one will want
    to swap out an anonymous page from a VM_RESERVED or VM_UNPAGED area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Although we tend to associate VM_RESERVED with remap_pfn_range, quite a few
    drivers set VM_RESERVED on areas which are then populated by nopage. The
    PageReserved removal in 2.6.15-rc1 changed zap_pte_range not to free pages in
    VM_RESERVED areas, without changing those drivers not to set the flag: so their
    pages just leak away.

    Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the core,
    to flag the special areas where the ptes may have no struct page, or if they
    have then it's not to be touched. Replace most instances of VM_RESERVED in
    core mm by VM_UNPAGED. Force it on in remap_pfn_range, and the sparc and
    sparc64 io_remap_pfn_range.

    Revert the addition of VM_RESERVED to the powerpc vdso; it's not needed there. Is it
    needed anywhere? It still governs the mm->reserved_vm statistic, and special
    vmas not to be merged, and areas not to be core dumped; but could probably be
    eliminated later (the drivers are probably specifying it because in 2.4 it
    kept swapout off the vma, but in 2.6 we work from the LRU, which these pages
    don't get on).

    Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
    purpose whatsoever, and should be removed from drivers when we clean up.
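
    In include/linux/mm.h terms the change looks something like this (the bit
    values shown are illustrative, not the real ones; the point is only that
    VM_UNPAGED takes over the slot VM_SHM used to occupy and VM_SHM itself
    becomes a no-op):

    /* VM_UNPAGED reuses the old VM_SHM slot: ptes in such areas may have
     * no struct page, or one that must not be touched. */
    #define VM_SHM          0x00000000  /* means nothing: remove from drivers later */
    #define VM_UNPAGED      0x00100000  /* illustrative value, not the real bit */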

    Signed-off-by: Hugh Dickins
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It looks like snd_xxx is not the only nopage to be using PageReserved as a way
    of holding a high-order page together: which no longer works, but is masked by
    our failure to free from VM_RESERVED areas. We cannot fix that bug without
    first substituting another way to hold the high-order page together, while
    farming out the 0-order pages from within it.

    That's just what PageCompound is designed for, but it's been kept under
    CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-of-line
    put_page), doesn't slow down what most needs to be fast (already using
    hugetlb), and unifies the way we handle high-order pages.
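
    The pattern this enables looks roughly like this (an illustration under
    the assumption that get_page/put_page on a tail page operate on the
    compound head; not code from the patch):

    static int compound_buffer_demo(void)
    {
            /* 2^4 = 16 contiguous pages kept together as one compound page,
             * instead of marking them PageReserved to hold them together. */
            struct page *buf = alloc_pages(GFP_KERNEL | __GFP_COMP, 4);
            struct page *tail;

            if (!buf)
                    return -ENOMEM;

            tail = buf + 3;         /* e.g. what a nopage handler would hand out */
            get_page(tail);         /* pins the whole compound via its head page */
            put_page(tail);

            __free_pages(buf, 4);   /* released as one high-order unit */
            return 0;
    }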

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message when you
    tried to mmap or mprotect MAP_PRIVATE PROT_WRITE on a VM_RESERVED area, and failed
    with -EACCES: because do_wp_page lacks the refinement to COW pages in those
    areas, nor do we expect to find anonymous pages in them; and it seemed just
    bloat to add code for handling such a peculiar case. But immediately it
    caused vbetool and ddcprobe (using lrmi) to fail.

    So revert the "deprecated" messages, letting mmap and mprotect succeed. But
    leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place until we've
    added the code to do it right: so this particular patch is only good if the
    app doesn't really need to write to that private area.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the areas
    flagged VM_RESERVED in place of PageReserved. That is correct in theory - we
    ought not to interfere with struct pages in such a reserved area; but in
    practice it broke BTTV for one.

    So revert to prohibiting only on VM_IO: if someone gets into trouble with
    get_user_pages on VM_RESERVED, it'll just be a "don't do that".

    You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the first
    place, but now's not the time for breaking drivers without notice.
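
    In get_user_pages() terms the test goes back to roughly this shape
    (a sketch, not the literal diff; i, vma and vm_flags are the function's
    existing locals):

            if (!vma || (vma->vm_flags & VM_IO)
                            || !(vm_flags & vma->vm_flags))
                    return i ? i : -EFAULT;   /* VM_RESERVED alone no longer blocks it */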

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

19 Nov, 2005

2 commits


18 Nov, 2005

2 commits


15 Nov, 2005

4 commits

  • Linus Torvalds
     
  • This was introduced for x86-64 at some point to save memory
    in struct page, but has been obsolete for some time. Just
    remove it.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add a new 4GB GFP_DMA32 zone between the GFP_DMA and GFP_NORMAL zones.

    As a bit of historical background: when the x86-64 port
    was originally designed, we had some discussion about whether we should
    use a 16MB DMA zone like i386, a 4GB DMA zone like IA64, or
    both. Having both was ruled out at the time because it was early in the
    2.4 days, when the VM was still quite shaky and had bad trouble even
    dealing with one DMA zone. We settled on the 16MB DMA zone mainly
    because we worried about older soundcards and the floppy.

    But this has always caused problems since then, because
    device drivers had trouble getting enough DMA-able memory. These days
    the VM works much better and the wide use of NUMA has proven
    it can deal with many zones successfully.

    So this patch adds both zones.

    This helps drivers that need a lot of memory below 4GB because
    their hardware cannot address more (graphics drivers - proprietary
    and free ones, video frame buffer drivers, sound drivers etc.).
    Previously they could only use IOMMU+16MB GFP_DMA, which
    was not enough memory.

    Another common problem is that hardware which has full memory
    addressing for >4GB still lacks it for some control structures in memory
    (like transmit rings or other metadata). Such drivers tended to allocate memory
    from the 16MB GFP_DMA zone or the IOMMU/swiotlb using pci_alloc_consistent,
    but that can tie up a lot of precious 16MB GFP_DMA/IOMMU/swiotlb memory
    (even on AMD systems the IOMMU tends to be quite small), especially if you have
    many devices. With the new zone, pci_alloc_consistent can just put
    this stuff into memory below 4GB, which works better.

    One remaining argument was whether the zone should be 4GB or 2GB. The main
    motivation for 2GB would be an unnamed, not-so-unpopular hardware
    RAID controller (mostly found in older machines from a particular four-letter
    company) that has a strange 2GB restriction in firmware. But
    that one works OK with swiotlb/IOMMU anyway, so it doesn't really
    need GFP_DMA32. I chose 4GB to be compatible with IA64 and because
    it seems to be the most common restriction.

    The new zone is so far added only for x86-64.

    For other architectures that don't set up this
    new zone, nothing changes. Architectures can set a compatibility
    define, CONFIG_DMA_IS_DMA32, in Kconfig that will define GFP_DMA32
    as GFP_DMA. Otherwise it's a nop, because on 32-bit architectures
    it's normally not needed: GFP_NORMAL (=0) is already DMA-able
    enough.

    One problem is still that GFP_DMA means different things on different
    architectures. e.g. some drivers used to have #ifdef ia64 use GFP_DMA
    (trusting it to be 4GB) #elif __x86_64__ (use other hacks like
    the swiotlb because 16MB is not enough) ... . This was quite
    ugly and is now obsolete.

    These should now be converted to use GFP_DMA32 unconditionally (I haven't done
    that yet), or better still, to use only pci_alloc_consistent/dma_alloc_coherent,
    which will use GFP_DMA32 transparently.
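
    Driver-side, the intended usage looks something like this (a sketch; the
    helper names, device and sizes are made up for the example):

    #include <linux/pci.h>
    #include <linux/gfp.h>

    /* Preferred: let the DMA API place the buffer; with this patch
     * pci_alloc_consistent can draw from the new sub-4GB zone instead of
     * the tiny 16MB GFP_DMA pool or the IOMMU/swiotlb. */
    static void *alloc_ring_dma(struct pci_dev *pdev, size_t bytes,
                                dma_addr_t *dma_handle)
    {
            return pci_alloc_consistent(pdev, bytes, dma_handle);
    }

    /* Or ask the page allocator for memory below 4GB explicitly. */
    static void *alloc_ring_below_4g(size_t bytes)
    {
            return (void *)__get_free_pages(GFP_KERNEL | GFP_DMA32,
                                            get_order(bytes));
    }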

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Nov, 2005

6 commits

  • The slab allocator never uses alloc_pages since kmem_getpages() is always
    called with a valid nodeid. Remove the branch and the code from
    kmem_getpages().
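
    What remains is, in outline (a sketch of the simplified path, assuming
    the usual slab internals; field and helper names may differ slightly):

    static void *kmem_getpages_sketch(kmem_cache_t *cachep, gfp_t flags,
                                      int nodeid)
    {
            struct page *page;

            /* nodeid is always valid here, so alloc_pages_node() is used
             * unconditionally; the old alloc_pages() fallback branch is gone. */
            page = alloc_pages_node(nodeid, flags, cachep->gfporder);
            if (!page)
                    return NULL;
            /* ... slab accounting and page flag setup ... */
            return page_address(page);
    }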

    Signed-off-by: Christoph Lameter
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This patch converts object cache page mapping macros to static inline
    functions to make them more explicit and readable.

    Signed-off-by: Pekka Enberg
    Cc: Manfred Spraul
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • The pages_high - pages_low and pages_low - pages_min deltas are the async
    reclaim watermarks. As such, they should be in the same ratios in highmem
    zones as in any other zone. It is the pages_min - 0 delta which is the
    PF_MEMALLOC reserve, and this is the region that isn't very useful for
    highmem.

    This patch ensures highmem systems have similar characteristics as non highmem
    ones with the same amount of memory, and also that highmem zones get similar
    reclaim pressures to other zones.
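
    The resulting per-zone watermark layout is roughly the following (a
    sketch; the function name, the cap and the exact scaling factors are
    illustrative, not the real code):

    static void set_zone_watermarks_sketch(struct zone *zone, unsigned long tmp)
    {
            if (is_highmem(zone))
                    /* the PF_MEMALLOC reserve is of little use in highmem,
                     * so keep pages_min small there ... */
                    zone->pages_min = min(tmp, 128UL);
            else
                    zone->pages_min = tmp;

            /* ... but give highmem the same kswapd deltas as any other
             * zone, so async reclaim pressure is comparable. */
            zone->pages_low  = zone->pages_min + tmp / 4;
            zone->pages_high = zone->pages_min + tmp / 2;
    }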

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Clean up of __alloc_pages.

    Restore the previous behaviour, plus further cleanups: introduce 'alloc_flags'
    and remove the last of should_reclaim_zone.

    Signed-off-by: Rohit Seth
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rohit Seth
     
  • The address-based work estimate for unmapping (for lockbreak) is, and always
    was, horribly inefficient for sparse mappings. The problem is most simply
    explained with an example:

    If we find a pgd is clear, we still have to call into unmap_page_range
    PGDIR_SIZE / ZAP_BLOCK_SIZE times, each time checking the clear pgd, in
    order to progress the working address to the next pgd.

    The fundamental way to solve the problem is to keep track of the end
    address we've processed and pass it back to the higher layers.

    From: Nick Piggin

    Modification to completely get away from address based work estimate
    and instead use an abstract count, with a very small cost for empty
    entries as opposed to present pages.

    On 2.6.14-git2, ppc64, and CONFIG_PREEMPT=y, mapping and unmapping 1TB
    of virtual address space takes 1.69s; with the following patch applied,
    this operation can be done 1000 times in less than 0.01s.

    From: Andrew Morton

    With CONFIG_HUGETLB_PAGE=n:

    mm/memory.c: In function `unmap_vmas':
    mm/memory.c:779: warning: division by zero

    Due to

    zap_work -= (end - start) /
                    (HPAGE_SIZE / PAGE_SIZE);

    So make the dummy HPAGE_SIZE non-zero.
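
    With the abstract work count, the pmd-level walk looks roughly like this
    (a sketch; the charging granularity and helper usage are illustrative,
    not the literal patch):

    static unsigned long zap_pmd_range_sketch(pud_t *pud, unsigned long addr,
                                              unsigned long end, long *zap_work)
    {
            pmd_t *pmd = pmd_offset(pud, addr);
            unsigned long next;

            do {
                    next = pmd_addr_end(addr, end);
                    if (pmd_none_or_clear_bad(pmd)) {
                            (*zap_work)--;          /* empty entries are nearly free */
                            continue;
                    }
                    /* present ranges are charged per page, so the caller's
                     * lockbreak check fires after a bounded amount of work */
                    *zap_work -= (next - addr) / PAGE_SIZE;
                    /* ... zap the present ptes in [addr, next) here ... */
            } while (pmd++, addr = next, addr != end && *zap_work > 0);

            return addr;
    }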

    Signed-off-by: Robin Holt
    Signed-off-by: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • In __alloc_pages():

    if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
            /* go through the zonelist yet again, ignoring mins */
            for (i = 0; zones[i] != NULL; i++) {
                    struct zone *z = zones[i];

                    page = buffered_rmqueue(z, order, gfp_mask);
                    if (page) {
                            zone_statistics(zonelist, z);
                            goto got_pg;
                    }
            }
            goto nopage;                    <<<< HERE!!! FAIL...
    }

    kswapd (which has the PF_MEMALLOC flag set) can fail to allocate memory even
    when it allocates with the __GFP_NOFAIL flag.
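
    The fix, roughly, is to keep retrying the no-watermark pass instead of
    giving up when the caller said failure is not an option (a sketch in the
    style of the fragment above; blk_congestion_wait is the throttling
    helper of this era, and the exact placement may differ):

    nofail_alloc:
            /* go through the zonelist yet again, ignoring mins */
            for (i = 0; zones[i] != NULL; i++) {
                    page = buffered_rmqueue(zones[i], order, gfp_mask);
                    if (page)
                            goto got_pg;
            }
            if (gfp_mask & __GFP_NOFAIL) {
                    /* the caller must not see failure: wait for writeback
                     * congestion to ease and try again */
                    blk_congestion_wait(WRITE, HZ/50);
                    goto nofail_alloc;
            }
            goto nopage;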

    Signed-Off-By: Pavel Emelianov
    Signed-Off-By: Denis Lunev
    Signed-Off-By: Kirill Korotaev
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     

12 Nov, 2005

1 commit


11 Nov, 2005

1 commit


08 Nov, 2005

1 commit


07 Nov, 2005

13 commits