01 Oct, 2007

1 commit

  • The virtual address space argument of clear_user_highpage is supposed to be
    the virtual address where the page being cleared will eventually be mapped.
    This allows architectures with virtually indexed caches to play a few
    clever tricks. That sort of trick falls over in painful ways if the
    virtual address argument is wrong.
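
    A minimal sketch of the contract (not the actual diff): the caller must
    pass the user virtual address at which the page will be mapped, never a
    kernel address.

        #include <linux/highmem.h>
        #include <linux/mm.h>

        /*
         * Illustrative only: the second argument to clear_user_highpage()
         * is the user virtual address where `page` will be mapped, so that
         * virtually indexed caches can pick the right alias/colour.
         */
        static struct page *alloc_zeroed_page_for(struct vm_area_struct *vma,
                                                  unsigned long address)
        {
                struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, address);

                if (page)
                        clear_user_highpage(page, address); /* not a kernel address */
                return page;
        }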

    Signed-off-by: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralf Baechle
     

20 Sep, 2007

1 commit

  • This patch proposes fixes to the reference counting of memory policy in the
    page allocation paths and in show_numa_map(). Extracted from my "Memory
    Policy Cleanups and Enhancements" series as a stand-alone patch.

    Shared policy lookup [shmem] has always added a reference to the policy,
    but this was never unrefed after page allocation or after formatting the
    numa map data.

    Default system policy should not require additional ref counting, nor
    should the current task's task policy. However, show_numa_map() calls
    get_vma_policy() to examine what may be [likely is] another task's policy.
    The latter case needs protection against freeing of the policy.

    This patch adds a reference count to a mempolicy returned by
    get_vma_policy() when the policy is a vma policy or another task's
    mempolicy. Again, shared policy is already reference counted on lookup. A
    matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
    shared and vma policies, and in show_numa_map() for shared and another
    task's mempolicy. We can call __mpol_free() directly, saving an admittedly
    inexpensive inline NULL test, because we know we have a non-NULL policy.

    Handling policy ref counts for hugepages is a bit trickier.
    huge_zonelist() returns a zone list that might come from a shared or vma
    'BIND' policy. In this case, we should hold the reference until after the
    huge page allocation in dequeue_huge_page(). The patch modifies
    huge_zonelist() to return a pointer to the mempolicy if it needs to be
    unref'd after allocation.
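
    A sketch of the ref/unref pattern this describes, assuming the
    mm/mempolicy.c internals named here (get_vma_policy(), zonelist_policy(),
    default_policy); it is not the literal alloc_page_vma() from the patch.

        #include <linux/mempolicy.h>
        #include <linux/gfp.h>
        #include <linux/sched.h>

        static struct page *alloc_page_for_vma(gfp_t gfp,
                                               struct vm_area_struct *vma,
                                               unsigned long addr)
        {
                /* takes a ref for shared and vma policies */
                struct mempolicy *pol = get_vma_policy(current, vma, addr);
                struct zonelist *zl = zonelist_policy(gfp, pol);
                struct page *page = __alloc_pages(gfp, 0, zl);

                /* drop the ref we know we hold; pol is never NULL here */
                if (unlikely(pol != &default_policy && pol != current->mempolicy))
                        __mpol_free(pol);
                return page;
        }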

    Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:

                 w/o patch            w/ refcount patch
                 Avg      Std Devn    Avg      Std Devn
    Real:        100.59   0.38        100.63   0.43
    User:        1209.60  0.37        1209.91  0.31
    System:      81.52    0.42        81.64    0.34

    Signed-off-by: Lee Schermerhorn
    Acked-by: Andi Kleen
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

23 Aug, 2007

1 commit

  • It seems a simple mistake was made when converting follow_hugetlb_page()
    over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit
    83c54070ee1a2d05c89793884bea1a03f2851ed4).

    By using the wrong bitmask, hugetlb_fault() failures are not being
    recognized. This results in an infinite loop whenever follow_hugetlb_page
    is involved in a failed fault.
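
    The kind of check the fix amounts to, sketched here with the
    VM_FAULT_ERROR mask (the exact flow inside follow_hugetlb_page() is
    elided):

        #include <linux/mm.h>
        #include <linux/hugetlb.h>
        #include <linux/errno.h>

        /*
         * Test the error mask (OOM/SIGBUS) rather than a single flag such
         * as VM_FAULT_MAJOR; otherwise a failed hugetlb_fault() looks like
         * success and the surrounding loop never terminates.
         */
        static int fault_in_huge_addr(struct mm_struct *mm,
                                      struct vm_area_struct *vma,
                                      unsigned long vaddr)
        {
                int ret = hugetlb_fault(mm, vma, vaddr, 0);

                if (ret & VM_FAULT_ERROR)
                        return -EFAULT; /* caller stops walking the range */
                return 0;               /* page instantiated; retry lookup */
        }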

    Signed-off-by: Adam Litke
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

25 Jul, 2007

1 commit

  • dequeue_huge_page() has a serious memory leak upon hugetlb page
    allocation. The for loop continues allocating hugetlb pages out of
    all allowable zones, even though this function is only supposed to
    dequeue one, and only one, page.

    Fix it by breaking out of the for loop once a hugetlb page is found.
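
    A sketch of the fixed loop shape (names as in that era's mm/hugetlb.c;
    the cpuset and nodemask checks are simplified away):

        #include <linux/hugetlb.h>
        #include <linux/mmzone.h>
        #include <linux/list.h>

        static struct page *dequeue_one_huge_page(struct zonelist *zonelist)
        {
                struct zone **z;
                struct page *page = NULL;

                for (z = zonelist->zones; *z; z++) {
                        int nid = zone_to_nid(*z);

                        if (list_empty(&hugepage_freelists[nid]))
                                continue;
                        page = list_entry(hugepage_freelists[nid].next,
                                          struct page, lru);
                        list_del(&page->lru);
                        free_huge_pages--;
                        free_huge_pages_node[nid]--;
                        break;  /* the fix: take one page, then stop */
                }
                return page;
        }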

    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

20 Jul, 2007

5 commits

  • Use appropriate accessor function to set compound page destructor
    function.

    Cc: William Irwin
    Signed-off-by: Akinobu Mita
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The fix to that race in alloc_fresh_huge_page() which could give an illegal
    node ID did not need nid_lock at all: the fix was to replace static int nid
    by static int prev_nid and do the work on a local int nid. nid_lock did make
    sure that racers strictly round-robin the nodes, but that's not something we
    need to enforce strictly. Kill nid_lock.
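
    The lockless round-robin choice, sketched: a racy read of prev_nid at
    worst repeats a node, which is harmless.

        #include <linux/nodemask.h>

        static int pick_next_huge_page_node(void)
        {
                static int prev_nid;
                int nid = next_node(prev_nid, node_online_map);

                if (nid == MAX_NUMNODES)
                        nid = first_node(node_online_map);
                prev_nid = nid;         /* unlocked update is acceptable */
                return nid;
        }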

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mm/hugetlb.c: In function `dequeue_huge_page':
    mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function

    Cc: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • This patch completes Linus's wish that the fault return codes be made into
    bit flags, which I agree makes everything nicer. This requires all
    handle_mm_fault callers to be modified (possibly the modifications should
    go further and do things like fault accounting in handle_mm_fault --
    however that would be for another patch).
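
    Roughly the pattern the converted callers follow (an illustrative arch
    fault-handler fragment, not any particular architecture's code):

        #include <linux/mm.h>
        #include <linux/sched.h>

        static void handle_user_fault(struct mm_struct *mm,
                                      struct vm_area_struct *vma,
                                      unsigned long address, int write)
        {
                int fault = handle_mm_fault(mm, vma, address, write);

                if (unlikely(fault & VM_FAULT_ERROR)) {
                        if (fault & VM_FAULT_OOM)
                                goto out_of_memory;
                        else if (fault & VM_FAULT_SIGBUS)
                                goto do_sigbus;
                        BUG();
                }
                if (fault & VM_FAULT_MAJOR)
                        current->maj_flt++;
                else
                        current->min_flt++;
                return;

        out_of_memory:
                /* arch-specific OOM handling elided */
                return;
        do_sigbus:
                /* arch-specific SIGBUS delivery elided */
                return;
        }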

    [akpm@linux-foundation.org: fix alpha build]
    [akpm@linux-foundation.org: fix s390 build]
    [akpm@linux-foundation.org: fix sparc build]
    [akpm@linux-foundation.org: fix sparc64 build]
    [akpm@linux-foundation.org: fix ia64 build]
    Signed-off-by: Nick Piggin
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Bryan Wu
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Matthew Wilcox
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Chris Zankel
    Acked-by: Kyle McMartin
    Acked-by: Haavard Skinnemoen
    Acked-by: Ralf Baechle
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    [ Still apparently needs some ARM and PPC loving - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change ->fault prototype. We now return an int, which contains a
    VM_FAULT_xxx code in the low byte and FAULT_RET_xxx code in the next byte.
    The FAULT_RET_ code tells the VM whether a page was found, whether it has
    been locked, and potentially other things. This is not quite the way Linus
    wanted it yet, but that's changed in the next patch (which requires changes
    to arch code).

    This means we no longer set VM_CAN_INVALIDATE in the vma to indicate that
    a page is locked, which requires filemap_nopage to go away (because we can
    no longer remain backward compatible without that flag); but we were going
    to do that anyway.

    struct fault_data is renamed to struct vm_fault as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to use
    without really good reason.

    The page is now returned via a page pointer in the vm_fault struct.
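
    A sketch of a driver ->fault handler against the new prototype;
    my_lookup_page() is a hypothetical helper standing in for however the
    driver finds its backing page.

        #include <linux/mm.h>

        static struct page *my_lookup_page(struct vm_area_struct *vma, pgoff_t pgoff);

        static int my_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
        {
                struct page *page = my_lookup_page(vma, vmf->pgoff);

                if (!page)
                        return VM_FAULT_SIGBUS; /* VM_FAULT_xxx in the low byte */

                get_page(page);
                vmf->page = page;       /* page handed back via struct vm_fault */
                return 0;               /* success; any FAULT_RET_xxx bits would
                                           be OR'd into the next byte */
        }

        static struct vm_operations_struct my_vm_ops = {
                .fault  = my_fault,
        };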

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Jul, 2007

2 commits

  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
    as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
    can be used to satisfy hugepage allocations even when the system has been
    running a long time. This allows an administrator to resize the hugepage pool
    at runtime depending on the size of ZONE_MOVABLE.

    This patch adds a new sysctl called hugepages_treat_as_movable. When a
    non-zero value is written to it, future allocations for the huge page pool
    will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
    introduce additional external fragmentation of note as huge pages are always
    the largest contiguous block we care about.
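
    The mechanism is small: the sysctl handler flips the gfp mask used when
    new huge pages are allocated for the pool. Sketched below (close to, but
    not necessarily identical to, the patch); an administrator would then
    write 1 to /proc/sys/vm/hugepages_treat_as_movable before growing the
    pool.

        #include <linux/gfp.h>
        #include <linux/sysctl.h>

        int hugepages_treat_as_movable;
        static gfp_t htlb_alloc_mask = GFP_HIGHUSER;

        int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
                                          struct file *file, void __user *buffer,
                                          size_t *length, loff_t *ppos)
        {
                proc_dointvec(table, write, file, buffer, length, ppos);
                if (hugepages_treat_as_movable)
                        htlb_alloc_mask = GFP_HIGHUSER_MOVABLE;
                else
                        htlb_alloc_mask = GFP_HIGHUSER;
                return 0;
        }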

    [akpm@linux-foundation.org: various fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Jul, 2007

2 commits


17 Jun, 2007

1 commit

  • Some changes done a while ago to avoid pounding on ptep_set_access_flags and
    update_mmu_cache in some race situations break sun4c, which requires
    update_mmu_cache() to always be called on minor faults.

    This patch reworks ptep_set_access_flags() semantics, implementations and
    callers so that it is now responsible for returning whether an update is
    necessary or not (basically whether the PTE actually changed). This allows
    fixing the sparc implementation to always return 1 on sun4c.
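
    The reworked generic semantics, sketched (asm-generic flavour; sun4c's
    implementation simply always returns 1):

        #include <linux/mm.h>
        #include <asm/pgtable.h>
        #include <asm/tlbflush.h>

        static int generic_ptep_set_access_flags(struct vm_area_struct *vma,
                                                 unsigned long address,
                                                 pte_t *ptep, pte_t entry)
        {
                int changed = !pte_same(*ptep, entry);

                if (changed) {
                        set_pte_at(vma->vm_mm, address, ptep, entry);
                        flush_tlb_page(vma, address);
                }
                return changed; /* caller skips update_mmu_cache() work when 0 */
        }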

    [akpm@linux-foundation.org: fixes, cleanups]
    Signed-off-by: Benjamin Herrenschmidt
    Cc: Hugh Dickins
    Cc: David Miller
    Cc: Mark Fortescue
    Acked-by: William Lee Irwin III
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

10 May, 2007

2 commits

  • When cpuset is configured, it breaks the strict hugetlb page reservation as
    the accounting is done on a global variable. Such reservation is
    completely rubbish in the presence of cpuset because the reservation is not
    checked against page availability for the current cpuset. An application
    can still potentially be OOM'ed by the kernel due to a lack of free htlb
    pages in the cpuset the task is in. Attempting to enforce strict accounting
    with cpuset is almost impossible (or too ugly) because cpuset is so fluid
    that tasks or memory nodes can be dynamically moved between cpusets.

    The change of semantics for shared hugetlb mapping with cpuset is
    undesirable. However, in order to preserve some of the semantics, we fall
    back to check against current free page availability as a best attempt and
    hopefully to minimize the impact of changing semantics that cpuset has on
    hugetlb.

    Signed-off-by: Ken Chen
    Cc: Paul Jackson
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • The internal hugetlb resv_huge_pages variable can permanently leak a nonzero
    value in the error path of the hugetlb page fault handler when hugetlb pages
    are used in combination with cpusets. The leaked count can permanently trap
    N hugetlb pages in an unusable "reserved" state.

    Steps to reproduce the bug:

    (1) create two cpusets, user1 and user2
    (2) reserve 50 htlb pages in cpuset user1
    (3) attempt to shmget/shmat 50 htlb pages inside cpuset user2
    (4) the kernel OOM-kills the user process from step 3
    (5) ipcrm the shm segment

    At this point resv_huge_pages will have a count of 49, even though
    there is no active hugetlbfs file nor hugetlb shared memory segment
    in the system. The leak is permanent and there is no recovery method
    other than a system reboot. The leaked count will hold up all future
    use of that many htlb pages in all cpusets.

    The culprit is that the error path of alloc_huge_page() did not
    properly undo the change it made to resv_huge_pages, causing
    inconsistent state.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Cc: Adam Litke
    Cc: Martin Bligh
    Acked-by: David Gibson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

10 Feb, 2007

1 commit

  • __unmap_hugepage_range() is buggy in that it does not preserve the dirty
    state of the huge_pte when unmapping a hugepage range. It causes data
    corruption when drop_caches is used by the sysadmin. For example, an
    application creates a hugetlb file, modifies pages, then unmaps it. While
    the hugetlb file is still alive, along comes the sysadmin doing "echo 3 >
    /proc/sys/vm/drop_caches".

    drop_pagecache_sb() will happily free all pages that aren't marked dirty if
    there is no active mapping. Later, when the application remaps the hugetlb
    file, all the data is gone, triggering a catastrophic flip over for the
    application.

    Not only that, the internal resv_huge_pages count will also get all messed
    up. Fix it up by marking the page dirty appropriately.
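
    Schematically, the fix inside the unmap loop carries the hardware dirty
    bit over to the struct page before the mapping is dropped (helper name
    and exact placement are illustrative):

        #include <linux/mm.h>
        #include <linux/hugetlb.h>
        #include <linux/list.h>

        static void unmap_one_huge_pte(struct mm_struct *mm, unsigned long address,
                                       pte_t *ptep, struct list_head *page_list)
        {
                pte_t pte = huge_ptep_get_and_clear(mm, address, ptep);
                struct page *page = pte_page(pte);

                if (pte_dirty(pte))
                        set_page_dirty(page);    /* preserve the dirty state */
                list_add(&page->lru, page_list); /* put_page() after TLB flush */
        }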

    Signed-off-by: Ken Chen
    Cc: "Nish Aravamudan"
    Cc: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc:
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

14 Dec, 2006

2 commits

  • To allow a more effective copy_user_highpage() on certain architectures,
    a vma argument is added to the function and cow_user_page() allowing
    the implementation of these functions to check for the VM_EXEC bit.
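
    The resulting interface, schematically:

        #include <linux/highmem.h>
        #include <linux/mm.h>

        /*
         * Callers now hand the destination vma down, so an architecture's
         * copy_user_highpage() can look at vma->vm_flags (e.g. VM_EXEC)
         * when deciding how to copy and flush.
         */
        static void cow_copy(struct page *dst, struct page *src,
                             unsigned long address, struct vm_area_struct *vma)
        {
                copy_user_highpage(dst, src, address, vma);
        }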

    The main part of this patch was originally written by Ralf Baechle;
    Atsushi Nemoto did the debugging.

    Signed-off-by: Atsushi Nemoto
    Signed-off-by: Ralf Baechle
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
  • Elaborate the API for calling cpuset_zone_allowed(), so that users have to
    explicitly choose between the two variants:

    cpuset_zone_allowed_hardwall()
    cpuset_zone_allowed_softwall()

    Until now, whether or not you got the hardwall flavor depended solely on
    whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
    argument.

    If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
    version.

    Unfortunately, this meant that users would end up with the softwall version
    without thinking about it. Since only the softwall version might sleep,
    this led to bugs with possible sleeping in interrupt context on more than
    one occasion.

    The hardwall version requires that the current task's mems_allowed allows
    the node of the specified zone (or that you're in interrupt, or that
    __GFP_THISNODE is set, or that you're on a one-cpuset system.)

    The softwall version, depending on the gfp_mask, might allow a node if it
    was allowed in the nearest enclosing cpuset marked mem_exclusive (which
    requires taking the cpuset lock 'callback_mutex' to evaluate.)

    This patch removes the cpuset_zone_allowed() call, and forces the caller to
    explicitly choose between the hardwall and the softwall case.

    If the caller wants the gfp_mask to determine this choice, they should (1)
    be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
    cpuset_zone_allowed_softwall() routine.

    This adds another 100 or 200 bytes to the kernel text space, due to the few
    lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
    routines. It should save a few instructions executed for the calls that
    turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
    set (before the call) then check (within the call) the __GFP_HARDWALL flag.

    For the most critical call, from get_page_from_freelist(), the same
    instructions are executed as before -- the old cpuset_zone_allowed()
    routine it used to call is the same code as the
    cpuset_zone_allowed_softwall() routine that it calls now.

    Not a perfect win, but it seems worth it, to reduce the chance of hitting a
    sleeping-with-irqs-off complaint again.
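
    How a caller chooses explicitly, sketched:

        #include <linux/cpuset.h>
        #include <linux/gfp.h>
        #include <linux/mmzone.h>

        /*
         * The softwall variant may take callback_mutex and hence may sleep,
         * so it is only safe where sleeping is allowed (or __GFP_HARDWALL
         * is set).
         */
        static int zone_ok_for_alloc(struct zone *z, gfp_t gfp_mask, int atomic)
        {
                if (atomic)
                        return cpuset_zone_allowed_hardwall(z, gfp_mask);
                return cpuset_zone_allowed_softwall(z, gfp_mask); /* may sleep */
        }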

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     

08 Dec, 2006

4 commits

  • Currently we use the lru head link of the second page of a compound page
    to hold its destructor. This was ok when it was purely an internal
    implementation detail. However, hugetlbfs overrides this destructor,
    violating the layering. Abstract this out as explicit calls, and also
    introduce a type for the callback function, allowing them to be type
    checked. For each callback we pre-declare the function, causing a type
    error on definition rather than on use elsewhere.
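
    The abstraction, roughly as it lands in include/linux/mm.h (the
    destructor is still stashed in the second page's lru link, but callers
    no longer know that):

        typedef void compound_page_dtor(struct page *);

        static inline void set_compound_page_dtor(struct page *page,
                                                  compound_page_dtor *dtor)
        {
                page[1].lru.next = (void *)dtor;
        }

        static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
        {
                return (compound_page_dtor *)page[1].lru.next;
        }

    hugetlbfs then installs its own destructor type-safely with
    set_compound_page_dtor(page, free_huge_page).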

    [akpm@osdl.org: cleanups]
    Signed-off-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Imprecise RSS accounting is an irritating ill effect of pt sharing. After
    consulting with several VM experts, I tried various methods to solve the
    problem: (1) iterate through all mm_structs that share the PT and increment
    the count; (2) keep an RSS count in the page table structure and then sum
    them up at reporting time. None of the above methods yields a satisfactory
    implementation.

    Since process RSS accounting is purely informational, I propose we don't
    count it at all for hugetlb pages. rlimit has such a field, though there is
    absolutely no enforcement on limiting that resource. One other method is to
    account all RSS at hugetlb mmap time regardless of whether the pages are
    faulted or not. I opt for the simplicity of no accounting at all.

    Hugetlb pages are special: they are reserved up front in a global
    reservation pool and are not reclaimable. From a physical memory resource
    point of view, they are already consumed regardless of whether there are
    users using them.

    If the concern is that RSS can be used to control resource allocation, we
    can already specify a hugetlbfs size limit and the sysadmin can enforce it
    at mount time. Combined with the two points mentioned above, I fail to see
    anything that would be adversely affected by this patch.

    Signed-off-by: Ken Chen
    Acked-by: Hugh Dickins
    Cc: Dave McCracken
    Cc: William Lee Irwin III
    Cc: "Luck, Tony"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Adam Litke
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Following up on the shared page table work done by Dave McCracken, this
    set of patches targets shared page tables for hugetlb memory only.

    Shared page tables are particularly useful in the situation of a large
    number of independent processes sharing large shared memory segments. In
    the normal page case, the amount of memory saved from process page tables
    is quite significant. For hugetlb, the saving on page table memory is not
    the primary objective (as hugetlb itself already cuts down page table
    overhead significantly); instead, the purpose of using shared page tables
    for hugetlb is to allow faster TLB refill and smaller cache pollution upon
    TLB miss.

    With PT sharing, pte entries are shared among hundreds of processes, so the
    cache consumed by all the page tables is smaller and, in return, the
    application gets a much higher cache hit ratio. One other effect is that
    the hit ratio of the hardware page walker finding a pte in cache will be
    higher, and this helps to reduce TLB miss latency. These two effects
    contribute to higher application performance.

    Signed-off-by: Ken Chen
    Acked-by: Hugh Dickins
    Cc: Dave McCracken
    Cc: William Lee Irwin III
    Cc: "Luck, Tony"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Gibson
    Cc: Adam Litke
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Signed-off-by: Ken Chen
    Cc: David Gibson
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

29 Oct, 2006

1 commit

  • If you truncated an mmap'ed hugetlbfs file, then faulted on the truncated
    area, /proc/meminfo's HugePages_Rsvd wrapped hugely "negative". Reinstate my
    preliminary i_size check before attempting to allocate the page (though this
    only fixes the most obvious case: more work will be needed here).

    Signed-off-by: Hugh Dickins
    Cc: Adam Litke
    Cc: David Gibson
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Oct, 2006

1 commit

  • commit fe1668ae5bf0145014c71797febd9ad5670d5d05 causes the kernel to oops
    with the libhugetlbfs test suite. The problem is that hugetlb pages can be
    shared by multiple mappings. Multiple threads can fight over page->lru in
    the unmap path and bad things happen. We now serialize
    __unmap_hugepage_range to avoid concurrent linked list manipulation. Such
    serialization is also needed for shared page table pages on hugetlb areas.
    This patch fixes the bug and also serves as a preparatory patch for shared
    page tables.

    Signed-off-by: Ken Chen
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

04 Oct, 2006

1 commit

  • Hugh spotted that a hugetlb page is freed back to the global pool before
    any TLB flush is performed in unmap_hugepage_range(). This potentially
    allows threads to abuse a free-alloc race condition.

    The generic tlb gather code is unsuitable for use by hugetlb, so I just
    open coded a page gathering list and delayed put_page until the tlb flush
    is performed.
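
    The shape of the open-coded gather, schematically (the function name here
    is illustrative):

        #include <linux/mm.h>
        #include <linux/list.h>
        #include <asm/tlbflush.h>

        static void unmap_huge_range_sketch(struct vm_area_struct *vma,
                                            unsigned long start, unsigned long end)
        {
                LIST_HEAD(page_list);
                struct page *page, *tmp;

                /*
                 * ... walk the ptes, huge_ptep_get_and_clear() each one, and
                 * list_add(&page->lru, &page_list) for every mapped page ...
                 */

                flush_tlb_range(vma, start, end);
                list_for_each_entry_safe(page, tmp, &page_list, lru) {
                        list_del(&page->lru);
                        put_page(page); /* only now may the page reach the pool */
                }
        }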

    Cc: Hugh Dickins
    Signed-off-by: Ken Chen
    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

26 Sep, 2006

2 commits


23 Jun, 2006

1 commit

  • The current hugetlb strict accounting for shared mappings always assumes the
    mapping starts at zero file offset and reserves pages between zero and the
    size of the file. This assumption often reserves (or locks down) a lot more
    pages than necessary if the application maps at a non-zero file offset.
    libhugetlbfs is one example that requires proper reservation for shared
    mappings that start at a non-zero offset.

    This patch extends the reservation and hugetlb strict accounting to support
    any arbitrary pair of (offset, len), resulting in a much more robust and
    accurate scheme. More importantly, it won't lock down any hugetlb pages
    outside the file mapping.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

01 Apr, 2006

2 commits

  • With strict page reservation, I think the kernel should enforce that the
    number of free hugetlb pages does not fall below the reserved count.
    Currently that is possible via the sysctl path. Add a proper check in the
    sysctl to disallow it.

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • git-commit: d5d4b0aa4e1430d73050babba999365593bdb9d2
    "[PATCH] optimize follow_hugetlb_page" breaks mlock on hugepage areas.

    I misinterpreted the pages argument and made get_page() unconditional. It
    should only take a reference count when the "pages" argument is non-null.

    Credit goes to Adam Litke who spotted the bug.
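
    The corrected logic, captured in a small (hypothetical) helper:

        #include <linux/mm.h>

        static void record_followed_page(struct page **pages,
                                         struct vm_area_struct **vmas,
                                         struct page *page,
                                         struct vm_area_struct *vma, int i)
        {
                if (pages) {
                        get_page(page); /* take a ref only if we hand it out */
                        pages[i] = page;
                }
                if (vmas)
                        vmas[i] = vma;
        }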

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

22 Mar, 2006

9 commits

  • Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
    assuming that nodes are numbered contiguously from 0 to num_online_nodes().
    Once the hotplug folks get this far, that will be false.

    Signed-off-by: Paul Jackson
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • follow_hugetlb_page() walks a range of user virtual addresses and then fills
    a list of struct page pointers into an array that is passed via the
    argument list. It also gets a reference count via get_page(). For a
    compound page, get_page() actually traverses back to the head page via the
    page_private() macro and then adds a reference count to the head page.
    Since we are doing a virt-to-pte lookup, the kernel already has a struct
    page pointer to the head page. So instead of descending into the small
    unit page struct and then following a link back to the head page, optimize
    that by incrementing the reference count directly on the head page.

    The benefit is that we don't take a cache miss on accessing the page struct
    for the corresponding user address and, more importantly, we don't pollute
    the cache with a "not very useful" round trip of pointer chasing. This adds
    a moderate performance gain on an I/O intensive database transaction
    workload.
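
    Schematically (variable names as in that era's follow_hugetlb_page()):

        page = pte_page(*pte);                /* head page of the huge page */
        if (pages) {
                get_page(page);               /* refcount taken on the head */
                pages[i] = page + pfn_offset; /* still return the exact page */
        }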

    Signed-off-by: Ken Chen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Originally, mm/hugetlb.c just handled the hugepage physical allocation path
    and its {alloc,free}_huge_page() functions were used from the arch specific
    hugepage code. These days those functions are only used within mm/hugetlb.c
    itself. Therefore, this patch makes them static and removes their
    prototypes from hugetlb.h. This requires a small rearrangement of code in
    mm/hugetlb.c to avoid a forward declaration.

    This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • These days, hugepages are demand-allocated at first fault time. There's a
    somewhat dubious (and racy) heuristic when making a new mmap() to check if
    there are enough available hugepages to fully satisfy that mapping.

    A particularly obvious case where the heuristic breaks down is where a
    process maps its hugepages not as a single chunk, but as a bunch of
    individually mmap()ed (or shmat()ed) blocks without touching and
    instantiating the pages in between allocations. In this case the size of
    each block is compared against the total number of available hugepages.
    It's thus easy for the process to become overcommitted, because each block
    mapping will succeed, although the total number of hugepages required by
    all blocks exceeds the number available. In particular, this defeats a
    program which would detect a mapping failure and adjust its hugepage usage
    downward accordingly.

    The patch below addresses this problem by strictly reserving a number of
    physical hugepages for hugepage inodes which have been mapped, but not
    instantiated. MAP_SHARED mappings are thus "safe" - they will fail on
    mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still
    trigger an OOM. (Actually SHARED mappings can technically still OOM, but
    only if the sysadmin explicitly reduces the hugepage pool between mapping
    and instantiation.)

    This patch appears to address the problem at hand - it allows DB2 to start
    correctly, for instance, which previously suffered the failure described
    above.

    This patch causes no regressions on the libhugetlbfs testsuite, and makes a
    test (designed to catch this problem) pass which previously failed (ppc64,
    POWER5).

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Currently, no lock or mutex is held between allocating a hugepage and
    inserting it into the pagetables / page cache. When we do go to insert the
    page into pagetables or page cache, we recheck and may free the newly
    allocated hugepage. However, since the number of hugepages in the system
    is strictly limited, and it's usual to want to use all of them, this can
    still lead to spurious allocation failures.

    For example, suppose two processes are both mapping (MAP_SHARED) the same
    hugepage file, large enough to consume the entire available hugepage pool.
    If they race instantiating the last page in the mapping, they will both
    attempt to allocate the last available hugepage. One will fail, of course,
    returning OOM from the fault and thus causing the process to be killed,
    despite the fact that the entire mapping can, in fact, be instantiated.

    The patch fixes this race by the simple method of adding a (sleeping) mutex
    to serialize the hugepage fault path between allocation and insertion into
    pagetables and/or page cache. It would be possible to avoid the
    serialization by catching the allocation failures, waiting on some
    condition, then rechecking to see if someone else has instantiated the page
    for us. Given the likely frequency of hugepage instantiations, it seems
    very doubtful it's worth the extra complexity.

    This patch causes no regression on the libhugetlbfs testsuite, and one
    test, which can trigger this race, now passes where it previously failed.

    Actually, the test still sometimes fails, though less often and only as a
    shmat() failure, rather than processes getting OOM killed by the VM. The
    dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough
    hugepage space aren't protected by the new mutex, and it would be ugly to
    do so, so there's still a race there. Another patch to replace those tests
    with something saner, for this reason as well as others, is coming...
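
    The serialization, schematically (the mutex name matches what the kernel
    ended up with; the wrapper function around it is illustrative):

        #include <linux/mutex.h>
        #include <linux/mm.h>

        static DEFINE_MUTEX(hugetlb_instantiation_mutex);

        static int hugetlb_fault_serialized(struct mm_struct *mm,
                                            struct vm_area_struct *vma,
                                            unsigned long address, int write)
        {
                int ret;

                mutex_lock(&hugetlb_instantiation_mutex);
                /*
                 * ... look up the page cache, alloc_huge_page() on a miss,
                 * then insert the pte / add to the page cache, re-checking
                 * under the mutex ...
                 */
                ret = 0;
                mutex_unlock(&hugetlb_instantiation_mutex);
                return ret;
        }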

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
    own functions for clarity. As we do so, we add some checks of need_resched
    - we are, after all, copying megabytes of memory here. We also add
    might_sleep() accordingly. We generally dropped locks around the clear and
    copy already, but not everyone has PREEMPT enabled, so we should still be
    checking explicitly.

    For this to work, we need to remove the clear_huge_page() from
    alloc_huge_page(), which is called with the page_table_lock held in the COW
    path. We move the clear_huge_page() to just after the alloc_huge_page() in
    the hugepage no-page path. In the COW path, the new page is about to be
    copied over, so clearing it was just a waste of time anyway. So as a side
    effect we also fix the fact that we held the page_table_lock for far too
    long in this path by calling alloc_huge_page() under it.

    It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).
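
    Roughly the helper this factors out (the copy-side helper is analogous):

        #include <linux/mm.h>
        #include <linux/highmem.h>
        #include <linux/sched.h>
        #include <asm/page.h>

        static void clear_huge_page(struct page *page, unsigned long addr)
        {
                int i;

                might_sleep();
                for (i = 0; i < HPAGE_SIZE / PAGE_SIZE; i++) {
                        cond_resched();  /* we're touching megabytes of memory */
                        clear_user_highpage(page + i, addr + i * PAGE_SIZE);
                }
        }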

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     
  • 2.6.16-rc3 uses hugetlb on-demand paging, but it doesn't support hugetlb
    mprotect.

    From: David Gibson

    Remove a test from the mprotect() path which checks that the mprotect()ed
    range on a hugepage VMA is hugepage aligned (yes, really, the sense of
    is_aligned_hugepage_range() is the opposite of what you'd guess :-/).

    In fact, we don't need this test. If the given addresses match the
    beginning/end of a hugepage VMA they must already be suitably aligned. If
    they don't, then mprotect_fixup() will attempt to split the VMA. The very
    first test in split_vma() will check for a badly aligned address on a
    hugepage VMA and return -EINVAL if necessary.

    From: "Chen, Kenneth W"

    On i386 and x86-64, the pte flag _PAGE_PSE collides with _PAGE_PROTNONE.
    The identity of a hugetlb pte is lost when changing page protection via
    mprotect. A page fault that occurs later will trigger a bug check in
    huge_pte_alloc().

    The fix is to always make the new pte a hugetlb pte and also to clean up
    legacy code where _PAGE_PRESENT was forced on in the pre-demand-faulting
    days.
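
    Schematically, the protection change rebuilds the pte as a huge pte so the
    hugepage bit is not lost to the _PAGE_PROTNONE encoding (the helper name
    is illustrative):

        #include <linux/mm.h>
        #include <linux/hugetlb.h>

        static void change_one_huge_pte(struct mm_struct *mm, unsigned long address,
                                        pte_t *ptep, pgprot_t newprot)
        {
                pte_t pte = huge_ptep_get_and_clear(mm, address, ptep);

                pte = pte_mkhuge(pte_modify(pte, newprot));
                set_huge_pte_at(mm, address, ptep, pte);
        }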

    Signed-off-by: Zhang Yanmin
    Cc: David Gibson
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: William Lee Irwin III
    Signed-off-by: Ken Chen
    Signed-off-by: Nishanth Aravamudan
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang, Yanmin
     
  • set_page_count usage outside mm/ is limited to setting the refcount to 1.
    Remove set_page_count from outside mm/, and replace those users with
    init_page_count() and set_page_refcounted().

    This allows more debug checking, and tighter control on how code is allowed
    to play around with page->_count.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Insert "fresh" huge pages into the hugepage allocator by the same means as
    they are freed back into it. This reduces code size and allows
    enqueue_huge_page to be inlined into the hugepage free fastpath.

    Eliminate occurrences of hugepages on the free list with non-zero refcount.
    This can allow stricter refcount checks in future. Also required for
    lockless pagecache.

    Signed-off-by: Nick Piggin

    "This patch also eliminates a leak "cleaned up" by re-clobbering the
    refcount on every allocation from the hugepage freelists. With respect to
    the lockless pagecache, the crucial aspect is to eliminate unconditional
    set_page_count() to 0 on pages with potentially nonzero refcounts, though
    closer inspection suggests the assignments removed are entirely spurious."

    Acked-by: William Irwin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin