23 Nov, 2005

1 commit

  • If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously
    it is possible for the nr_huge_pages variable to become incorrect. There
    is no locking in the set_max_huge_pages function around
    alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers
    to alloc_fresh_huge_page could race against each other as could a call to
    alloc_fresh_huge_page and a call to update_and_free_page. This patch just
    expands the area covered by the hugetlb_lock to cover the call into
    alloc_fresh_huge_page (the resulting locking pattern is sketched after
    this entry). A sysctl path is hardly performance critical, so the more
    fine-grained locking that might otherwise be worth considering does not
    seem necessary here.

    My reproducer was to run a couple copies of the following script
    simultaneously

    while true; do
        echo 1000 > /proc/sys/vm/nr_hugepages
        echo 500 > /proc/sys/vm/nr_hugepages
        echo 750 > /proc/sys/vm/nr_hugepages
        echo 100 > /proc/sys/vm/nr_hugepages
        echo 0 > /proc/sys/vm/nr_hugepages
    done

    and then watch /proc/meminfo and eventually you will see things like

    HugePages_Total: 100
    HugePages_Free: 109

    After applying the patch all seemed well.

    Signed-off-by: Eric Paris
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
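
    A minimal C sketch of the invariant this fix enforces, namely that every
    update to nr_huge_pages happens under hugetlb_lock; this is illustrative
    rather than the literal patch, with the allocation flags and node
    rotation simplified:

        static DEFINE_SPINLOCK(hugetlb_lock);   /* guards nr_huge_pages, freelists */
        static unsigned long nr_huge_pages;

        static struct page *alloc_fresh_huge_page(void)
        {
                struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_COMP,
                                                HUGETLB_PAGE_ORDER);
                if (page) {
                        spin_lock(&hugetlb_lock);       /* serialize counter update */
                        nr_huge_pages++;
                        spin_unlock(&hugetlb_lock);
                }
                return page;
        }

    With update_and_free_page decrementing the counter under the same lock,
    concurrent writers to /proc/sys/vm/nr_hugepages can no longer leave
    HugePages_Free larger than HugePages_Total.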
     

07 Nov, 2005

2 commits

  • I didn't find any possible modular usage in the kernel.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
    base page size to 64K. The resulting kernel still boots on any
    hardware. On current machines with 4K pages support only, the kernel
    will maintain 16 "subpages" for each 64K page transparently.

    Note that while real 64K capable HW has been tested, the current patch
    does not enable it yet, as such hardware has not been released, and I'm
    still verifying with the firmware architects the proper way to get the
    information from the newer hypervisors.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
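
    Where the 16 subpages come from, as a trivial C arithmetic check; the
    shift names here are illustrative assumptions, not necessarily the
    kernel's:

        #define PAGE_SHIFT      16      /* 64K kernel base page */
        #define HW_PAGE_SHIFT   12      /* 4K page, all current HW supports */

        /* Each 64K kernel page is backed by 1 << (16 - 12) = 16 HW pages. */
        #define SUBPAGES_PER_PAGE       (1UL << (PAGE_SHIFT - HW_PAGE_SHIFT))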
     

30 Oct, 2005

5 commits

  • Below is a patch to implement demand faulting for huge pages. The main
    motivation for changing from prefaulting to demand faulting is so that huge
    page memory areas can be allocated according to NUMA policy.

    Thanks to consolidated hugetlb code, switching the behavior requires changing
    only one fault handler. The bulk of the patch just moves the logic from
    hugetlb_prefault() to hugetlb_pte_fault() and find_get_huge_page().

    Signed-off-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
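
    A hypothetical C sketch of the demand-fault flow; hugetlb_pte_fault and
    find_get_huge_page are named in the message, while make_huge_pte and
    set_huge_pte_at are assumed helpers, and the body is illustrative rather
    than the patch itself:

        static int hugetlb_pte_fault(struct mm_struct *mm,
                                     struct vm_area_struct *vma,
                                     unsigned long address, int write_access)
        {
                pte_t *pte = huge_pte_alloc(mm, address & HPAGE_MASK);
                unsigned long idx;
                struct page *page;

                if (!pte)
                        return VM_FAULT_OOM;
                if (!pte_none(*pte))            /* another thread won the race */
                        return VM_FAULT_MINOR;

                /* Allocate (or find) the backing page only now, at first
                 * touch, so the faulting task's NUMA policy can apply. */
                idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
                        + (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
                page = find_get_huge_page(vma->vm_file->f_mapping, idx);
                if (!page)
                        return VM_FAULT_SIGBUS;

                set_huge_pte_at(mm, address, pte, make_huge_pte(vma, page));
                return VM_FAULT_MINOR;
        }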
     
  • Remove the page_table_lock from around the calls to unmap_vmas, and replace
    the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
    now safe to descend without page_table_lock.

    Don't attempt fancy locking for hugepages, just take page_table_lock in
    unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
    zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
    does unmap_vmas have much use for its mm arg now.

    The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
    page_table_lock: if they're implemented at all, they typically come down to
    flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
    (which we already audited for the mprotect case).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
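
    The new locking shape in zap_pte_range, in sketch form; the real loop
    does much more per PTE, this shows only the lock choreography:

        static void zap_pte_range(struct mm_struct *mm, pmd_t *pmd,
                                  unsigned long addr, unsigned long end)
        {
                spinlock_t *ptl;
                pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

                do {
                        pte_t ptent = *pte;
                        if (pte_none(ptent))
                                continue;
                        /* ... clear the pte, update rss, free the page ... */
                } while (pte++, addr += PAGE_SIZE, addr != end);

                pte_unmap_unlock(pte - 1, ptl);
        }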
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
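
    A sketch of the internalized race check, close in spirit to the
    description above but simplified:

        /* Callers no longer hold page_table_lock; take it here just long
         * enough to see whether a racing task already filled the pmd. */
        int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
        {
                struct page *new = pte_alloc_one(mm, address);
                if (!new)
                        return -ENOMEM;

                spin_lock(&mm->page_table_lock);
                if (pmd_present(*pmd))          /* a racing task beat us to it */
                        pte_free(new);
                else
                        pmd_populate(mm, pmd, new);
                spin_unlock(&mm->page_table_lock);
                return 0;
        }

    and the caller-side pattern the new macros encapsulate:

        spinlock_t *ptl;
        pte_t *pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return -ENOMEM;
        /* ... operate on *pte ... */
        pte_unmap_unlock(pte, ptl);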
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be found inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
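
    The shape of the new macros, reconstructed from the description above (a
    sketch, not verbatim kernel code):

        #define update_hiwater_rss(mm) do {                     \
                unsigned long _rss = get_mm_rss(mm);            \
                if ((mm)->hiwater_rss < _rss)                   \
                        (mm)->hiwater_rss = _rss;               \
        } while (0)

        #define update_hiwater_vm(mm) do {                      \
                if ((mm)->hiwater_vm < (mm)->total_vm)          \
                        (mm)->hiwater_vm = (mm)->total_vm;      \
        } while (0)

    Callers invoke these only just before lowering rss or total_vm, so the
    hot paths that raise the counts stay untouched.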
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
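
    The totalling macro described above, as a sketch:

        /* Reading two counters on demand is cheaper than updating a third
         * counter on every mapping event. */
        #define get_mm_rss(mm) \
                (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))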
     

20 Oct, 2005

1 commit

  • hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
    forgets that: crashes and misaccounting ensue.

    copy_hugetlb_page_range had better grab the src page_table_lock, since we
    don't want to guess what happens if it is concurrently truncated.
    unmap_hugepage_range's rss accounting must not assume the full range was
    mapped. follow_hugetlb_page must guard with page_table_lock and be
    prepared to exit early.

    Restyle copy_hugetlb_page_range with a for loop like the others there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
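
    A condensed C sketch of the locking now taken in copy_hugetlb_page_range,
    with error paths and rss accounting trimmed:

        int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                                    struct vm_area_struct *vma)
        {
                unsigned long addr;

                for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
                        pte_t *dst_pte = huge_pte_alloc(dst, addr);
                        pte_t *src_pte;

                        if (!dst_pte)
                                return -ENOMEM;
                        /* Hold the src lock while peeking at the source pte,
                         * so a concurrent truncation cannot free the page
                         * under us; pte_none means it was never (or is no
                         * longer) mapped, so just skip it. */
                        spin_lock(&src->page_table_lock);
                        src_pte = huge_pte_offset(src, addr);
                        if (src_pte && !pte_none(*src_pte)) {
                                get_page(pte_page(*src_pte));
                                set_huge_pte_at(dst, addr, dst_pte, *src_pte);
                        }
                        spin_unlock(&src->page_table_lock);
                }
                return 0;
        }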
     

05 Sep, 2005

1 commit

  • Initial Post (Wed, 17 Aug 2005)

    This patch moves the

        if (!pte_none(*pte))
                hugetlb_clean_stale_pgtable(pte);

    logic into huge_pte_alloc() so all of its callers can be immune to the bug
    described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

    > It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
    > huge_pte_alloc() might return a pmd that points to a PTE page. It happens
    > if the virtual address for hugetlb mmap is recycled from previously used
    > normal page mmap. free_pgtables() might not scrub the pmd entry on
    > munmap and hugetlb_prefault skips on any pmd presence regardless what type
    > it is.

    Unless I am missing something, it seems more correct to place the check
    inside huge_pte_alloc() to prevent the same bug wherever a huge pte is
    allocated.
    It also allows checking for this condition when lazily faulting huge pages
    later in the series.

    Signed-off-by: Adam Litke
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
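
    A simplified, i386-flavoured sketch of the check in its new home (the
    page-table walk is abbreviated, and this is illustrative rather than the
    patch itself):

        pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
        {
                pgd_t *pgd = pgd_offset(mm, addr);
                pud_t *pud = pud_alloc(mm, pgd, addr);
                pmd_t *pmd = pud ? pmd_alloc(mm, pud, addr) : NULL;
                pte_t *pte = (pte_t *)pmd;      /* on i386 the huge pte is the pmd */

                /* A present but non-huge entry is a stale PTE page left by a
                 * recycled normal mapping: scrub it before reuse. */
                if (pte && !pte_none(*pte) && !pte_huge(*pte))
                        hugetlb_clean_stale_pgtable(pte);
                return pte;
        }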
     

06 Aug, 2005

1 commit

  • This patch fixes a crash in the hugepage code. unmap_hugepage_area() was
    assuming that (due to prefault) PTEs must exist for all the area in
    question. However, this may not be the case, if mmap() encounters an error
    before the prefault and calls unmap_region() to clean up any partial
    mapping.

    Depending on the hugepage configuration, this crash can be triggered by an
    unprivileged user.

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
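
    The guard this fix implies, in sketch form (assumed shape, not the
    verbatim patch):

        void unmap_hugepage_area(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
        {
                struct mm_struct *mm = vma->vm_mm;
                unsigned long addr;

                for (addr = start; addr < end; addr += HPAGE_SIZE) {
                        pte_t *ptep = huge_pte_offset(mm, addr);
                        /* mmap() may have failed before the prefault, leaving
                         * holes: skip them instead of dereferencing. */
                        if (!ptep || pte_none(*ptep))
                                continue;
                        /* ... clear the pte, put the page, adjust rss ... */
                }
                flush_tlb_range(vma, start, end);
        }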
     

22 Jun, 2005

1 commit

    A lot of the code in arch/*/mm/hugetlbpage.c is quite similar. This patch
    attempts to consolidate a lot of the code across the arches, putting the
    combined version in mm/hugetlb.c. There are a couple of uglyish hacks in
    order to convert all the hugepage arches, but the result is a very large
    reduction in the total amount of code. It also means things like hugepage
    lazy allocation could be implemented in one place, instead of six.

    Tested, at least a little, on ppc64, i386 and x86_64.

    Notes:
    - this patch changes the meaning of set_huge_pte() to be more
      analogous to set_pte()
    - does SH4 need a special huge_ptep_get_and_clear()??

    Acked-by: William Lee Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds