29 Oct, 2006

1 commit

  • If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
    area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my
    preliminary i_size check before attempting to allocate the page (though this
    only fixes the most obvious case: more work will be needed here).

    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: David Gibson
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Oct, 2006

1 commit

  • commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes kernel to oops with
    libhugetlbfs test suite. The problem is that hugetlb pages can be shared
    by multiple mappings. Multiple threads can fight over page->lru in the
    unmap path and bad things happen. We now serialize __unmap_hugepage_range
    to void concurrent linked list manipulation. Such serialization is also
    needed for shared page table page on hugetlb area. This patch will fixed
    the bug and also serve as a prepatch for shared page table.

    Signed-off-by: Ken Chen
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

04 Oct, 2006

1 commit

  • Spotted by Hugh that hugetlb page is free'ed back to global pool before
    performing any TLB flush in unmap_hugepage_range(). This potentially allow
    threads to abuse free-alloc race condition.

    The generic tlb gather code is unsuitable to use by hugetlb, I just open
    coded a page gathering list and delayed put_page until tlb flush is
    performed.

    Cc: Hugh Dickins
    Signed-off-by: Ken Chen
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

26 Sep, 2006

2 commits


23 Jun, 2006

1 commit

  • Current hugetlb strict accounting for shared mapping always assume mapping
    starts at zero file offset and reserves pages between zero and size of the
    file. This assumption often reserves (or lock down) a lot more pages then
    necessary if application maps at none zero file offset. libhugetlbfs is
    one example that requires proper reservation on shared mapping starts at
    none zero offset.

    This patch extends the reservation and hugetlb strict accounting to support
    any arbitrary pair of (offset, len), resulting a much more robust and
    accurate scheme. More importantly, it won't lock down any hugetlb pages
    outside file mapping.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

01 Apr, 2006

2 commits

  • With strict page reservation, I think kernel should enforce number of free
    hugetlb page don't fall below reserved count. Currently it is possible in
    the sysctl path. Add proper check in sysctl to disallow that.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • git-commit: d5d4b0aa4e1430d73050babba999365593bdb9d2
    "[PATCH] optimize follow_hugetlb_page" breaks mlock on hugepage areas.

    I mis-interpret pages argument and made get_page() unconditional. It
    should only get a ref count when "pages" argument is non-null.

    Credit goes to Adam Litke who spotted the bug.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

22 Mar, 2006

9 commits

  • Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
    assuming that nodes are numbered contiguously from 0 to num_online_nodes().
    Once the hotplug folks get this far, that will be false.

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • follow_hugetlb_page() walks a range of user virtual address and then fills
    in list of struct page * into an array that is passed from the argument
    list. It also gets a reference count via get_page(). For compound page,
    get_page() actually traverse back to head page via page_private() macro and
    then adds a reference count to the head page. Since we are doing a virt to
    pte look up, kernel already has a struct page pointer into the head page.
    So instead of traverse into the small unit page struct and then follow a
    link back to the head page, optimize that with incrementing the reference
    count directly on the head page.

    The benefit is that we don't take a cache miss on accessing page struct for
    the corresponding user address and more importantly, not to pollute the
    cache with a "not very useful" round trip of pointer chasing. This adds a
    moderate performance gain on an I/O intensive database transaction
    workload.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Originally, mm/hugetlb.c just handled the hugepage physical allocation path
    and its {alloc,free}_huge_page() functions were used from the arch specific
    hugepage code. These days those functions are only used with mm/hugetlb.c
    itself. Therefore, this patch makes them static and removes their
    prototypes from hugetlb.h. This requires a small rearrangement of code in
    mm/hugetlb.c to avoid a forward declaration.

    This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • These days, hugepages are demand-allocated at first fault time. There's a
    somewhat dubious (and racy) heuristic when making a new mmap() to check if
    there are enough available hugepages to fully satisfy that mapping.

    A particularly obvious case where the heuristic breaks down is where a
    process maps its hugepages not as a single chunk, but as a bunch of
    individually mmap()ed (or shmat()ed) blocks without touching and
    instantiating the pages in between allocations. In this case the size of
    each block is compared against the total number of available hugepages.
    It's thus easy for the process to become overcommitted, because each block
    mapping will succeed, although the total number of hugepages required by
    all blocks exceeds the number available. In particular, this defeats such
    a program which will detect a mapping failure and adjust its hugepage usage
    downward accordingly.

    The patch below addresses this problem, by strictly reserving a number of
    physical hugepages for hugepage inodes which have been mapped, but not
    instatiated. MAP_SHARED mappings are thus "safe" - they will fail on
    mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
    trigger an OOM. (Actually SHARED mappings can technically still OOM, but
    only if the sysadmin explicitly reduces the hugepage pool between mapping
    and instantiation)

    This patch appears to address the problem at hand - it allows DB2 to start
    correctly, for instance, which previously suffered the failure described
    above.

    This patch causes no regressions on the libhugetblfs testsuite, and makes a
    test (designed to catch this problem) pass which previously failed (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Currently, no lock or mutex is held between allocating a hugepage and
    inserting it into the pagetables / page cache. When we do go to insert the
    page into pagetables or page cache, we recheck and may free the newly
    allocated hugepage. However, since the number of hugepages in the system
    is strictly limited, and it's usualy to want to use all of them, this can
    still lead to spurious allocation failures.

    For example, suppose two processes are both mapping (MAP_SHARED) the same
    hugepage file, large enough to consume the entire available hugepage pool.
    If they race instantiating the last page in the mapping, they will both
    attempt to allocate the last available hugepage. One will fail, of course,
    returning OOM from the fault and thus causing the process to be killed,
    despite the fact that the entire mapping can, in fact, be instantiated.

    The patch fixes this race by the simple method of adding a (sleeping) mutex
    to serialize the hugepage fault path between allocation and insertion into
    pagetables and/or page cache. It would be possible to avoid the
    serialization by catching the allocation failures, waiting on some
    condition, then rechecking to see if someone else has instantiated the page
    for us. Given the likely frequency of hugepage instantiations, it seems
    very doubtful it's worth the extra complexity.

    This patch causes no regression on the libhugetlbfs testsuite, and one
    test, which can trigger this race now passes where it previously failed.

    Actually, the test still sometimes fails, though less often and only as a
    shmat() failure, rather processes getting OOM killed by the VM. The dodgy
    heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage
    space aren't protected by the new mutex, and would be ugly to do so, so
    there's still a race there. Another patch to replace those tests with
    something saner for this reason as well as others coming...

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
    own functions for clarity. As we do so, we add some checks of need_resched
    - we are, after all copying megabytes of memory here. We also add
    might_sleep() accordingly. We generally dropped locks around the clear and
    copy, already but not everyone has PREEMPT enabled, so we should still be
    checking explicitly.

    For this to work, we need to remove the clear_huge_page() from
    alloc_huge_page(), which is called with the page_table_lock held in the COW
    path. We move the clear_huge_page() to just after the alloc_huge_page() in
    the hugepage no-page path. In the COW path, the new page is about to be
    copied over, so clearing it was just a waste of time anyway. So as a side
    effect we also fix the fact that we held the page_table_lock for far too
    long in this path by calling alloc_huge_page() under it.

    It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • 2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb
    mprotect.

    From: David Gibson

    Remove a test from the mprotect() path which checks that the mprotect()ed
    range on a hugepage VMA is hugepage aligned (yes, really, the sense of
    is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

    In fact, we don't need this test. If the given addresses match the
    beginning/end of a hugepage VMA they must already be suitably aligned. If
    they don't, then mprotect_fixup() will attempt to split the VMA. The very
    first test in split_vma() will check for a badly aligned address on a
    hugepage VMA and return -EINVAL if necessary.

    From: "Chen, Kenneth W"

    On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The
    identify of hugetlb pte is lost when changing page protection via mprotect.
    A page fault occurs later will trigger a bug check in huge_pte_alloc().

    The fix is to always make new pte a hugetlb pte and also to clean up
    legacy code where _PAGE_PRESENT is forced on in the pre-faulting day.

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Ken Chen
    Signed-off-by: Nishanth Aravamudan
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     
  • set_page_count usage outside mm/ is limited to setting the refcount to 1.
    Remove set_page_count from outside mm/, and replace those users with
    init_page_count() and set_page_refcounted().

    This allows more debug checking, and tighter control on how code is allowed
    to play around with page->_count.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Insert "fresh" huge pages into the hugepage allocator by the same means as
    they are freed back into it. This reduces code size and allows
    enqueue_huge_page to be inlined into the hugepage free fastpath.

    Eliminate occurances of hugepages on the free list with non-zero refcount.
    This can allow stricter refcount checks in future. Also required for
    lockless pagecache.

    Signed-off-by: Nick Piggin

    "This patch also eliminates a leak "cleaned up" by re-clobbering the
    refcount on every allocation from the hugepage freelists. With respect to
    the lockless pagecache, the crucial aspect is to eliminate unconditional
    set_page_count() to 0 on pages with potentially nonzero refcounts, though
    closer inspection suggests the assignments removed are entirely spurious."

    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

15 Feb, 2006

1 commit

  • If a compound page has its own put_page_testzero destructor (the only current
    example is free_huge_page), that is noted in page[1].mapping of the compound
    page. But that's rather a poor place to keep it: functions which call
    set_page_dirty_lock after get_user_pages (e.g. Infiniband's
    __ib_umem_release) ought to be checking first, otherwise set_page_dirty is
    liable to crash on what's not the address of a struct address_space.

    And now I'm about to make that worse: it turns out that every compound page
    needs a destructor, so we can no longer rely on hugetlb pages going their own
    special way, to avoid further problems of page->mapping reuse. For example,
    not many people know that: on 50% of i386 -Os builds, the first tail page of a
    compound page purports to be PageAnon (when its destructor has an odd
    address), which surprises page_add_file_rmap.

    Keep the compound page destructor in page[1].lru.next instead. And to free up
    the common pairing of mapping and index, also move compound page order from
    index to lru.prev. Slab reuses page->lru too: but if we ever need slab to use
    compound pages, it can easily stack its use above this.

    (akpm: decoded version of the above: the tail pages of a compound page now
    have ->mapping==NULL, so there's no need for the set_page_dirty[_lock]()
    caller to check that they're not compund pages before doing the dirty).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Feb, 2006

2 commits

  • Remove wrong and misleading comments.

    Return VM_FAULT_OOM if the hugetlbpage fault handler cannot allocate a
    page. do_no_page will end up doing do_exit(SIGKILL).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When hugepages are newly allocated to a file in mm/hugetlb.c, we clear them
    with a call to clear_highpage() on each of the subpages. We should be
    using clear_user_highpage(): on powerpc, at least, clear_highpage() doesn't
    correctly mark the page as icache dirty so if the page is executed shortly
    after it's possible to get strange results.

    Signed-off-by: David Gibson
    Acked-by: William Lee Irwin III
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

06 Feb, 2006

1 commit

  • I just spent some time researching a Bus Error. Turns out that the huge
    page fault handler can return VM_FAULT_SIGBUS for various conditions where
    no huge page is available.

    Add a note explaining the reasoning in the source.

    Signed-off-by: Christoph Lameter
    Acked-by: William Lee Irwin III
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

09 Jan, 2006

1 commit

  • See http://marc.theaimsgroup.com/?l=linux-kernel&m=113167000201265&w=2
    http://marc.theaimsgroup.com/?l=linux-mm&m=113167267527312&w=2

    Make hugepages obey cpusets.

    Signed-off-by: Christoph Lameter
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

07 Jan, 2006

7 commits

  • The number of parameters for find_or_alloc_page increases significantly after
    policy support is added to huge pages. Simplify the code by folding
    find_or_alloc_huge_page() into hugetlb_no_page().

    Adam Litke objected to this piece in an earlier patch but I think this is a
    good simplification. Diffstat shows that we can get rid of almost half of the
    lines of find_or_alloc_page(). If we can find no consensus then lets simply
    drop this patch.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The huge_zonelist() function in the memory policy layer provides an list of
    zones ordered by NUMA distance. The hugetlb layer will walk that list looking
    for a zone that has available huge pages but is also in the nodeset of the
    current cpuset.

    This patch does not contain the folding of find_or_alloc_huge_page() that was
    controversial in the earlier discussion.

    Signed-off-by: Christoph Lameter
    Cc: Andi Kleen
    Acked-by: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This was discussed at
    http://marc.theaimsgroup.com/?l=linux-kernel&m=113166526217117&w=2

    This patch changes the dequeueing to select a huge page near the node
    executing instead of always beginning to check for free nodes from node 0.
    This will result in a placement of the huge pages near the executing
    processor improving performance.

    The existing implementation can place the huge pages far away from the
    executing processor causing significant degradation of performance. The
    search starting from zero also means that the lower zones quickly run out
    of memory. Selecting a huge page near the process distributed the huge
    pages better.

    Signed-off-by: Christoph Lameter
    Cc: William Lee Irwin III
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Implement copy-on-write support for hugetlb mappings so MAP_PRIVATE can be
    supported. This helps us to safely use hugetlb pages in many more
    applications. The patch makes the following changes. If needed, I also have
    it broken out according to the following paragraphs.

    1. Add a pair of functions to set/clear write access on huge ptes. The
    writable check in make_huge_pte is moved out to the caller for use by COW
    later.

    2. Hugetlb copy-on-write requires special case handling in the following
    situations:

    - copy_hugetlb_page_range() - Copied pages must be write protected so
    a COW fault will be triggered (if necessary) if those pages are written
    to.

    - find_or_alloc_huge_page() - Only MAP_SHARED pages are added to the
    page cache. MAP_PRIVATE pages still need to be locked however.

    3. Provide hugetlb_cow() and calls from hugetlb_fault() and
    hugetlb_no_page() which handles the COW fault by making the actual copy.

    4. Remove the check in hugetlbfs_file_map() so that MAP_PRIVATE mmaps
    will be allowed. Make MAP_HUGETLB exempt from the depricated VM_RESERVED
    mapping check.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • This patch splits the "no_page()" type activity into its own function,
    hugetlb_no_page(). hugetlb_fault() becomes the entry point for hugetlb faults
    and delegates to the appropriate handler depending on the type of fault.
    Right now we still have only hugetlb_no_page() but a later patch introduces a
    COW fault.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • find_lock_huge_page() isn't a great name, since it does extra things not
    analagous to find_lock_page(). Rename it find_or_alloc_huge_page() which is
    closer to the mark.

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • cleanup

    Signed-off-by: David Gibson
    Signed-off-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: "Seth, Rohit"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

23 Nov, 2005

1 commit

  • If there are multiple updaters to /proc/sys/vm/nr_hugepages simultaneously
    it is possible for the nr_huge_pages variable to become incorrect. There
    is no locking in the set_max_huge_pages function around
    alloc_fresh_huge_page which is able to update nr_huge_pages. Two callers
    to alloc_fresh_huge_page could race against each other as could a call to
    alloc_fresh_huge_page and a call to update_and_free_page. This patch just
    expands the area covered by the hugetlb_lock to cover the call into
    alloc_fresh_huge_page. I'm not sure how we could say that a sysctl section
    is performance critical where more specific locking would be needed.

    My reproducer was to run a couple copies of the following script
    simultaneously

    while [ true ]; do
    echo 1000 > /proc/sys/vm/nr_hugepages
    echo 500 > /proc/sys/vm/nr_hugepages
    echo 750 > /proc/sys/vm/nr_hugepages
    echo 100 > /proc/sys/vm/nr_hugepages
    echo 0 > /proc/sys/vm/nr_hugepages
    done

    and then watch /proc/meminfo and eventually you will see things like

    HugePages_Total: 100
    HugePages_Free: 109

    After applying the patch all seemed well.

    Signed-off-by: Eric Paris
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Paris
     

07 Nov, 2005

2 commits

  • I didn't find any possible modular usage in the kernel.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
    base page size to 64K. The resulting kernel still boots on any
    hardware. On current machines with 4K pages support only, the kernel
    will maintain 16 "subpages" for each 64K page transparently.

    Note that while real 64K capable HW has been tested, the current patch
    will not enable it yet as such hardware is not released yet, and I'm
    still verifying with the firmware architects the proper to get the
    information from the newer hypervisors.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

30 Oct, 2005

5 commits

  • Below is a patch to implement demand faulting for huge pages. The main
    motivation for changing from prefaulting to demand faulting is so that huge
    page memory areas can be allocated according to NUMA policy.

    Thanks to consolidated hugetlb code, switching the behavior requires changing
    only one fault handler. The bulk of the patch just moves the logic from
    hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page().

    Signed-off-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Remove the page_table_lock from around the calls to unmap_vmas, and replace
    the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are
    now safe to descend without page_table_lock.

    Don't attempt fancy locking for hugepages, just take page_table_lock in
    unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in
    zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor
    does unmap_vmas have much use for its mm arg now.

    The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without
    page_table_lock: if they're implemented at all, they typically come down to
    flush_cache_range (usually done outside page_table_lock) and flush_tlb_range
    (which we already audited for the mprotect case).

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • update_mem_hiwater has attracted various criticisms, in particular from those
    concerned with mm scalability. Originally it was called whenever rss or
    total_vm got raised. Then many of those callsites were replaced by a timer
    tick call from account_system_time. Now Frank van Maarseveen reports that to
    be found inadequate. How about this? Works for Frank.

    Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros
    update_hiwater_rss and update_hiwater_vm. Don't attempt to keep
    mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually
    by 1): those are hot paths. Do the opposite, update only when about to lower
    rss (usually by many), or just before final accounting in do_exit. Handle
    mm->hiwater_vm in the same way, though it's much less of an issue. Demand
    that whoever collects these hiwater statistics do the work of taking the
    maximum with rss or total_vm.

    And there has been no collector of these hiwater statistics in the tree. The
    new convention needs an example, so match Frank's usage by adding a VmPeak
    line above VmSize to /proc//status, and also a VmHWM line above VmRSS
    (High-Water-Mark or High-Water-Memory).

    There was a particular anomaly during mremap move, that hiwater_vm might be
    captured too high. A fleeting such anomaly remains, but it's quickly
    corrected now, whereas before it would stick.

    What locking? None: if the app is racy then these statistics will be racy,
    it's not worth any overhead to make them exact. But whenever it suits,
    hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under
    page_table_lock (for now) or with preemption disabled (later on): without
    going to any trouble, minimize the time between reading current values and
    updating, to minimize those occasions when a racing thread bumps a count up
    and back down in between.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I was lazy when we added anon_rss, and chose to change as few places as
    possible. So currently each anonymous page has to be counted twice, in rss
    and in anon_rss. Which won't be so good if those are atomic counts in some
    configurations.

    Change that around: keep file_rss and anon_rss separately, and add them
    together (with get_mm_rss macro) when the total is needed - reading two
    atomics is much cheaper than updating two atomics. And update anon_rss
    upfront, typically in memory.c, not tucked away in page_add_anon_rmap.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Oct, 2005

1 commit


20 Oct, 2005

1 commit

  • hugetlbfs allows truncation of its files (should it?), but hugetlb.c often
    forgets that: crashes and misaccounting ensue.

    copy_hugetlb_page_range better grab the src page_table_lock since we don't
    want to guess what happens if concurrently truncated. unmap_hugepage_range
    rss accounting must not assume the full range was mapped. follow_hugetlb_page
    must guard with page_table_lock and be prepared to exit early.

    Restyle copy_hugetlb_page_range with a for loop like the others there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Sep, 2005

1 commit

  • Initial Post (Wed, 17 Aug 2005)

    This patch moves the
    if (! pte_none(*pte))
    hugetlb_clean_stale_pgtable(pte);
    logic into huge_pte_alloc() so all of its callers can be immune to the bug
    described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246

    > It turns out there is a bug in hugetlb_prefault(): with 3 level page table,
    > huge_pte_alloc() might return a pmd that points to a PTE page. It happens
    > if the virtual address for hugetlb mmap is recycled from previously used
    > normal page mmap. free_pgtables() might not scrub the pmd entry on
    > munmap and hugetlb_prefault skips on any pmd presence regardless what type
    > it is.

    Unless I am missing something, it seems more correct to place the check inside
    huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated.
    It also allows checking for this condition when lazily faulting huge pages
    later in the series.

    Signed-off-by: Adam Litke
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke