29 Apr, 2008

2 commits

  • Add __GFP_REPEAT to hugepage allocations, so that userspace does not need to
    put pressure on the VM by repeatedly echoing into /proc/sys/vm/nr_hugepages
    to grow the pool. With the previous patch allowing large-order __GFP_REPEAT
    attempts to loop for a bit (as opposed to indefinitely), this increases the
    likelihood of getting hugepages when the system experiences (or recently
    experienced) load.

    Mel tested the patchset on an x86_32 laptop. With the patches, it was easier
    to use the proc interface to grow the hugepage pool. The following is the
    output of a script that grows the pool as much as possible running on
    2.6.25-rc9.

    Allocating hugepages test
    -------------------------
    Disabling OOM Killer for current test process
    Starting page count: 0
    Attempt 1: 57 pages Progress made with 57 pages
    Attempt 2: 73 pages Progress made with 16 pages
    Attempt 3: 74 pages Progress made with 1 pages
    Attempt 4: 75 pages Progress made with 1 pages
    Attempt 5: 77 pages Progress made with 2 pages

    77 pages was the most it allocated but it took 5 attempts from userspace
    to get it. With the 3 patches in this series applied,

    Allocating hugepages test
    -------------------------
    Disabling OOM Killer for current test process
    Starting page count: 0
    Attempt 1: 75 pages Progress made with 75 pages
    Attempt 2: 76 pages Progress made with 1 pages
    Attempt 3: 79 pages Progress made with 3 pages

    And 79 pages was the most it got. Your patches were able to allocate the
    bulk of possible pages on the first attempt.

    Signed-off-by: Nishanth Aravamudan
    Cc: Andy Whitcroft
    Tested-by: Mel Gorman
    Cc: Dave Hansen
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
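
    Editorial sketch (not part of the patch): the point is simply that the gfp
    mask used when growing the pool from the buddy allocator gains __GFP_REPEAT,
    so the allocator retries for a while instead of failing at the first sign of
    fragmentation. Names and flags below are simplified; the real code in
    mm/hugetlb.c also handles node interleaving, accounting and page prep.

        /* Sketch of a pool-growing allocation with the added __GFP_REPEAT. */
        static struct page *grow_pool_page_sketch(int nid)
        {
                return alloc_pages_node(nid,
                                GFP_HIGHUSER_MOVABLE | __GFP_COMP |
                                __GFP_REPEAT | __GFP_NOWARN,
                                HUGETLB_PAGE_ORDER);
        }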
     
  • mm/hugetlb.c:207:11: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Signed-off-by: Linus Torvalds

    Harvey Harrison
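
    For context, the class of warning being fixed (a generic example, not the
    actual hunk):

        struct page *page = 0;      /* sparse: Using plain integer as NULL pointer */
        struct page *fixed = NULL;  /* the fix: use NULL for pointer values */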
     

28 Apr, 2008

9 commits

  • Huge ptes have a special type on s390 and cannot be handled with the standard
    pte functions in certain cases, e.g. because of a different location of the
    invalid bit. This patch adds some new architecture-specific functions to
    hugetlb common code, as a prerequisite for the s390 large page support.

    This won't affect other architectures in functionality, but I need to add some
    new dummy inline functions to the headers.

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • A cow break on a hugetlbfs page with page_count > 1 will set a new pte with
    set_huge_pte_at(), w/o any tlb flush operation. The old pte will remain in
    the tlb and subsequent write access to the page will result in a page fault
    loop, for as long as it may take until the tlb is flushed from somewhere else.
    This patch introduces an architecture-specific huge_ptep_clear_flush()
    function, which is called before the set_huge_pte_at() in hugetlb_cow().

    ATTENTION: This is just a nop on all architectures for now, the s390
    implementation will come with our large page patch later. Other architectures
    should define their own huge_ptep_clear_flush() if needed.

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Cc: Paul Mundt
    Cc: "Luck, Tony"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
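
    Editorial sketch of the no-op default described above; architectures that
    need a real TLB flush (s390, later) override it. Header placement and the
    exact prototype are simplified here.

        /* Generic fallback: nothing to flush unless the arch says otherwise. */
        static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
                                                 unsigned long addr, pte_t *ptep)
        {
        }

        /* hugetlb_cow() then does, roughly:
         *      huge_ptep_clear_flush(vma, address, ptep);
         *      set_huge_pte_at(mm, address, ptep, new_pte);
         */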
     
  • After further discussion with Christoph Lameter, it has become clear that my
    earlier attempts to clean up the mempolicy reference counting were a bit of
    overkill in some areas, resulting in superfluous ref/unref in what are usually
    fast paths. In other areas, further inspection reveals that I botched the
    unref for interleave policies.

    A separate patch, suitable for upstream/stable trees, fixes up the known
    errors in the previous attempt to fix reference counting.

    This patch reworks the memory policy reference counting and, one hopes,
    simplifies the code. Maybe I'll get it right this time.

    See the update to the numa_memory_policy.txt document for a discussion of
    memory policy reference counting that motivates this patch.

    Summary:

    Lookup of a mempolicy, based on (vma, address), need only add a reference for
    a shared policy, and we need only unref the policy when finished for shared
    policies. So, this patch backs out all of the unneeded extra reference
    counting added by my previous attempt. It then unrefs only shared policies
    when we're finished with them, using the mpol_cond_put() [conditional put]
    helper function introduced by this patch.

    Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
    containing just the policy. read_swap_cache_async() can call alloc_page_vma()
    multiple times, so we can't let alloc_page_vma() unref the shared policy in
    this case. To avoid this, we make a copy of any non-null shared policy and
    remove the MPOL_F_SHARED flag from the copy. This copy occurs before reading
    a page [or multiple pages] from swap, so the overhead should not be an issue
    here.

    I introduced a new static inline function "mpol_cond_copy()" to copy the
    shared policy to an on-stack policy and remove the flags that would require a
    conditional free. The current implementation of mpol_cond_copy() assumes that
    the struct mempolicy contains no pointers to dynamically allocated structures
    that must be duplicated or reference counted during copy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
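
    Editorial sketch of the conditional-put idea: only a policy looked up from a
    shared object carries an extra reference, marked by MPOL_F_SHARED, so the put
    is a no-op in the common unshared case. Field and helper names are simplified.

        /* Drop the reference only if the lookup actually took one. */
        static inline void mpol_cond_put_sketch(struct mempolicy *pol)
        {
                if (pol && (pol->flags & MPOL_F_SHARED))
                        __mpol_put(pol);    /* reference taken at lookup time */
        }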
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Allocating huge pages directly from the buddy allocator is not guaranteed to
    succeed. Success depends on several factors (such as the amount of physical
    memory available and the level of fragmentation). With the addition of
    dynamic hugetlb pool resizing, allocations can occur much more frequently.
    For these reasons it is desirable to keep track of huge page allocation
    successes and failures.

    Add two new vmstat entries to track huge page allocations that succeed and
    fail. The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
    being enabled.

    [akpm@linux-foundation.org: reduced ifdeffery]
    Signed-off-by: Adam Litke
    Signed-off-by: Eric Munson
    Tested-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
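
    Editorial sketch of how the two counters are bumped around the buddy
    allocation; the event names follow the changelog's intent and may not match
    the merged identifiers exactly.

        page = alloc_pages_node(nid, gfp_mask, HUGETLB_PAGE_ORDER);
        if (page)
                count_vm_event(HTLB_BUDDY_PGALLOC);        /* success */
        else
                count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);   /* failure */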
     
  • To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
    pages, scan the page list in two parts. First, transfer the needed pages to
    the hugetlb pool. Then drop the lock and free the remaining pages back to the
    buddy allocator.

    In the common case there are zero excess pages and no lock operations are
    required.

    Thanks Mel Gorman for this improvement.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
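
    Editorial sketch of the two-pass pattern: absorb what the pool needs while
    holding hugetlb_lock, then free the remainder to the buddy allocator with the
    lock dropped. Variable and list names are illustrative.

        spin_lock(&hugetlb_lock);
        list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
                if (needed > 0) {
                        list_move(&page->lru, &hugepage_freelist);  /* keep */
                        needed--;
                }
        }
        spin_unlock(&hugetlb_lock);

        /* Whatever is still on surplus_list is excess; no lock needed now. */
        list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
                list_del(&page->lru);
                put_page(page);         /* back to the buddy allocator */
        }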
     
  • The MPOL_BIND policy creates a zonelist that is used for allocations
    controlled by that mempolicy. As the per-node zonelist is already being
    filtered based on a zone id, this patch adds a version of __alloc_pages() that
    takes a nodemask for further filtering. This eliminates the need for
    MPOL_BIND to create a custom zonelist.

    A positive benefit of this is that allocations using MPOL_BIND now use the
    local node's distance-ordered zonelist instead of a custom node-id-ordered
    zonelist. I.e., pages will be allocated from the closest allowed node with
    available memory.

    [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
    [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
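
    Editorial sketch of the interface shape: a nodemask-aware allocation entry
    point filters the ordinary node-local zonelist, so MPOL_BIND no longer needs
    a custom zonelist of its own. The exact prototype and the policy field name
    (pol->v.nodes, standing for the policy's allowed-node mask) may differ from
    the merged series.

        /* Allocate from the local node's distance-ordered zonelist, but only
         * from zones whose node is set in the policy's nodemask. */
        page = __alloc_pages_nodemask(gfp, order,
                                      node_zonelist(numa_node_id(), gfp),
                                      &pol->v.nodes);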
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if it was found that accessing zone->node is
    significant which may be the case on workloads where nodemasks are heavily
    used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
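
    Editorial sketch of the structure and accessors described above; names follow
    the changelog and are simplified.

        struct zoneref {
                struct zone *zone;      /* the zone itself */
                int zone_idx;           /* cached zone_idx(zone) */
        };

        static inline int zonelist_zone_idx(struct zoneref *zref)
        {
                return zref->zone_idx;
        }

        static inline int zonelist_node_idx(struct zoneref *zref)
        {
                return zone_to_nid(zref->zone);
        }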
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced; it iterates
    through each zone allowed by the GFP flags in the selected zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
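
    Editorial sketch of using the iterator; it visits only zones at or below the
    highest zone type permitted by the gfp mask.

        struct zoneref *z;
        struct zone *zone;
        enum zone_type high_zoneidx = gfp_zone(gfp_mask);

        for_each_zone_zonelist(zone, z, node_zonelist(nid, gfp_mask), high_zoneidx) {
                /* consider 'zone' for allocation, reclaim, statistics, ... */
        }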
     

27 Mar, 2008

2 commits

  • Running the counters testcase from libhugetlbfs results in the following on
    2.6.25-rc5 and 2.6.25-rc5-mm1:

    BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
    NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
    REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
    MSR: 8000000000009032 CR: 48008448 XER: 20000000
    TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
    GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
    GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
    GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
    GPR12: 0000000028000442 c000000000770080
    NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
    LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
    Call Trace:
    [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
    [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
    [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
    [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
    [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
    [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
    [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
    [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
    [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
    [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
    [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
    [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
    [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
    [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
    Instruction dump:
    ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
    60000000 38800010 38a00000 7c601b78 2f800010 409d0008 38000010

    This was tracked down to a potential livelock in
    return_unused_surplus_pages(). In the case where we have surplus
    pages on some node, but no free pages on the same node, we may never
    break out of the loop. To avoid this livelock, terminate the search if
    we iterate a number of times equal to the number of online nodes without
    freeing a page.

    Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
    the patch.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
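
    Editorial sketch of the bail-out: give up once a full pass over the online
    nodes frees nothing. The two helpers named below are stand-ins for the
    surrounding hugetlb code.

        unsigned long remaining_iterations = num_online_nodes();

        while (nr_pages && remaining_iterations) {
                nid = hugetlb_next_nid_sketch(nid);          /* stand-in */
                if (free_one_surplus_page_sketch(nid)) {     /* stand-in */
                        nr_pages--;
                        remaining_iterations = num_online_nodes();  /* progress */
                } else {
                        remaining_iterations--;              /* nothing freed here */
                }
        }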
     
  • Currently we show the surplus hugetlb pool state in /proc/meminfo, but
    not in the per-node meminfo files, even though we track the information
    on a per-node basis. Printing it there can help track down dynamic pool
    bugs including the one in the follow-on patch.

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

11 Mar, 2008

1 commit

  • Free pages in the hugetlb pool are free and as such have a reference count of
    zero. Regular allocations into the pool from the buddy are "freed" into the
    pool which results in their page_count dropping to zero. However, surplus
    pages can be directly utilized by the caller without first being freed to the
    pool. Therefore, a call to put_page_testzero() is in order so that such a
    page will be handed to the caller with a correct count.

    This has not affected end users because the bad page count is reset before the
    page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when
    the page count is validated.

    Thanks go to Mel for first spotting this issue and providing an initial fix.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

05 Mar, 2008

2 commits

  • Adam Litke noticed that currently we grow the hugepage pool independent of any
    cpuset the running process may be in, but when shrinking the pool, the cpuset
    is checked. This leads to inconsistency when shrinking the pool in a
    restricted cpuset -- an administrator may have been able to grow the pool on a
    node restricted by a containing cpuset, but they cannot shrink it there.

    There are two options: either prevent growing of the pool outside of the
    cpuset or allow shrinking outside of the cpuset. From previous discussions
    on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
    should not be restricted by cpusets. So allow shrinking the pool by removing
    pages from nodes outside of current's cpuset.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Irwin
    Cc: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Paul Jackson
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • A hugetlb reservation may be inadequately backed in the event of racing
    allocations and frees when utilizing surplus huge pages. Consider the
    following series of events in processes A and B:

    A) Allocates some surplus pages to satisfy a reservation
    B) Frees some huge pages
    A) A notices the extra free pages and drops hugetlb_lock to free some of
    its surplus pages back to the buddy allocator.
    B) Allocates some huge pages
    A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()

    Avoid this by committing the reservation after pages have been allocated but
    before dropping the lock to free excess pages. For parity, release the
    reservation in return_unused_surplus_pages().

    This patch also corrects the cpuset_mems_nr() error path in
    hugetlb_acct_memory(). If the cpuset check fails, uncommit the
    reservation, but also be sure to return any surplus huge pages that may
    have been allocated to back the failed reservation.

    Thanks to Andy Whitcroft for discovering this.

    Signed-off-by: Adam Litke
    Cc: Mel Gorman
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: William Lee Irwin III
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

24 Feb, 2008

1 commit

  • When we free a page via free_huge_page() and detect that we are in surplus,
    the page will be returned to the buddy allocator. After this we no longer own
    the page.

    However, at the end of free_huge_page() we clear out our mapping pointer from
    page private. Even where the page is not a surplus page, we free the page to
    the hugepage pool, drop the pool locks and then clear page private. In
    either case the page may have been reallocated. BAD.

    Make sure we clear out page private before we free the page.

    Signed-off-by: Andy Whitcroft
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
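
    Editorial sketch of the ordering fix: capture and clear the private mapping
    pointer while we still own the page, before it can be handed back and
    reallocated.

        /* Before: page returned to the pool/buddy first, private cleared after
         * -- by then someone else may already own the page. */

        /* After: */
        mapping = (struct address_space *)page_private(page);
        set_page_private(page, 0);
        /* ... only now return the page to the pool or the buddy allocator ... */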
     

14 Feb, 2008

1 commit

  • proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
    hold hugetlb_lock over the call. Use a dummy variable to store the sysctl
    result, like in hugetlb_sysctl_handler(), then grab the lock to update
    nr_overcommit_huge_pages.

    Signed-off-by: Nishanth Aravamudan
    Reported-by: Miles Lane
    Cc: Adam Litke
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

09 Feb, 2008

1 commit

  • When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
    proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
    require that all counter modifications occur under the hugetlb_lock. Add a
    callback into the hugetlb code similar to the one for nr_hugepages. Grab
    the lock around the manipulation of nr_overcommit_hugepages in
    proc_doulongvec_minmax().

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

06 Feb, 2008

1 commit

  • After running SetPageUptodate, preceding stores to the page contents to
    actually bring it uptodate may not be ordered with the store to set the
    page uptodate.

    Therefore, another CPU which checks PageUptodate is true, then reads the
    page contents can get stale data.

    Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
    PageUptodate.

    Many places that test PageUptodate do so with the page locked, and this
    would be enough to ensure memory ordering in those places if
    SetPageUptodate were only called while the page is locked. Unfortunately
    that is not always the case for some filesystems, but it could be an idea
    for the future.

    Also bring the handling of anonymous page uptodateness in line with that of
    file backed page management, by marking anon pages as uptodate when they
    _are_ uptodate, rather than when our implementation requires that they be
    marked as such. Doing so allows us to get rid of the smp_wmb's in the page
    copying functions, which were especially added for anonymous pages for an
    analogous memory ordering problem. Both file and anonymous pages are
    handled with the same barriers.

    FAQ:
    Q. Why not do this in flush_dcache_page?
    A. Firstly, flush_dcache_page handles only one side (the smp side) of the
    ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
    memory barriers in a completely unrelated function is nasty; at least in the
    PageUptodate macros, they are located together with (half) the operations
    involved in the ordering. Thirdly, the smp_wmb is only required when first
    bringing the page uptodate, whereas flush_dcache_page should be called each time
    it is written to through the kernel mapping. It is logically the wrong place to
    put it.

    Q. Why does this increase my text size / reduce my performance / etc.
    A. Because it is adding the necessary instructions to eliminate the data-race.

    Q. Can it be improved?
    A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
    run under the page lock, we could avoid the smp_rmb places where PageUptodate
    is queried under the page lock. Requires audit of all filesystems and at least
    some would need reworking. That's great you're interested, I'm eagerly awaiting
    your patches.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
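
    Editorial sketch of the pairing described above (conceptual; the real change
    lives inside the PageUptodate/SetPageUptodate helpers):

        /* Writer: publish the contents before the uptodate bit. */
        memcpy(page_address(page), src, PAGE_SIZE);   /* bring page uptodate */
        smp_wmb();
        SetPageUptodate(page);

        /* Reader: having seen the bit, also see the contents behind it. */
        if (PageUptodate(page)) {
                smp_rmb();
                /* ... safe to read the page contents here ... */
        }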
     

25 Jan, 2008

1 commit

  • The shared page table code for hugetlb memory on x86 and x86_64
    is causing a leak. When a user of hugepages exits using this code
    the system leaks some of the hugepages.

    -------------------------------------------------------
    Part of /proc/meminfo just before database startup:
    HugePages_Total: 5500
    HugePages_Free: 5500
    HugePages_Rsvd: 0
    Hugepagesize: 2048 kB

    Just before shutdown:
    HugePages_Total: 5500
    HugePages_Free: 4475
    HugePages_Rsvd: 0
    Hugepagesize: 2048 kB

    After shutdown:
    HugePages_Total: 5500
    HugePages_Free: 4988
    HugePages_Rsvd: 0
    Hugepagesize: 2048 kB
    ----------------------------------------------------------

    The problem occurs during a fork, in copy_hugetlb_page_range(). It
    locates the dst_pte using huge_pte_alloc(). Since huge_pte_alloc() calls
    huge_pmd_share(), it will share the pmd page if it can, yet the main loop in
    copy_hugetlb_page_range() does a get_page() on every hugepage. This is a
    violation of the shared hugepmd pagetable protocol and creates additional
    references to the hugepages, causing a leak when the unmap of the VMA
    occurs. We can skip the entire replication of the ptes when the hugepage
    pagetables are shared. The attached patch skips copying the ptes and the
    get_page() calls if the hugetlbpage pagetable is shared.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Larry Woodman
    Signed-off-by: Adam Litke
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     

15 Jan, 2008

1 commit

  • In the error path of both shared and private hugetlb page allocation,
    the file system quota is never undone, leading to fs quota leak. Fix
    them up.

    [akpm@linux-foundation.org: cleanup, micro-optimise]
    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     

18 Dec, 2007

2 commits

  • This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
    Add hugetlb_dynamic_pool sysctl")

    Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
    sysctl is not needed, as its semantics can be expressed by 0 in the
    overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
    (pool enabled).

    (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • hugetlb: introduce nr_overcommit_hugepages sysctl

    While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
    became convinced that having a boolean sysctl was insufficient:

    1) To support per-node control of hugepages, I have previously submitted
    patches to add a sysfs attribute related to nr_hugepages. However, with
    a boolean global value and per-mount quota enforcement constraining the
    dynamic pool, adding corresponding control of the dynamic pool on a
    per-node basis seems inconsistent to me.

    2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
    mount points is, arguably, more arduous than it needs to be. Each quota
    would need to be set separately, and the sum would need to be monitored.

    To ease the administration, and to help make the way for per-node
    control of the static & dynamic hugepage pool, I added a separate
    sysctl, nr_overcommit_hugepages. This value serves as a high watermark
    for the overall hugepage pool, while nr_hugepages serves as a low
    watermark. The boolean sysctl can then be removed, as the condition

    nr_overcommit_hugepages > 0

    indicates the same administrative setting as

    hugetlb_dynamic_pool == 1

    Quotas still serve as local enforcement of the size of the pool on a
    per-mount basis.

    A few caveats:

    1) There is a race whereby the global surplus huge page counter is
    incremented before a hugepage has been allocated. Another process could then
    try to grow the pool, and fail to convert a surplus huge page to a normal
    huge page and instead allocate a fresh huge page. I believe this is
    benign, as no memory is leaked (the actual pages are still tracked
    correctly) and the counters won't go out of sync.

    2) Shrinking the static pool while a surplus is in effect will allow the
    number of surplus huge pages to exceed the overcommit value. As long as
    this condition holds, however, no more surplus huge pages will be
    allowed on the system until one of the two sysctls is increased
    sufficiently, or the surplus huge pages go out of use and are freed.

    Successfully tested on x86_64 with the current libhugetlbfs snapshot,
    modified to use the new sysctl.

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Adam Litke
    Cc: William Lee Irwin III
    Cc: Dave Hansen
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     

11 Dec, 2007

1 commit

  • The follow_hugetlb_page() fix I posted (merged as git commit
    5b23dbe8173c212d6a326e35347b038705603d39) missed one case. If the pte is
    present, but not writable and write access is requested by the caller to
    get_user_pages(), the code will do the wrong thing. Rather than calling
    hugetlb_fault to make the pte writable, it notes the presence of the pte
    and continues.

    This simple one-liner makes sure we also fault on the pte for this case.
    Please apply.

    Signed-off-by: Adam Litke
    Acked-by: Dave Kleikamp
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

15 Nov, 2007

8 commits

  • For administrative purpose, we want to query actual block usage for
    hugetlbfs file via fstat. Currently, hugetlbfs always returns 0. Fix that
    up since kernel already has all the information to track it properly.

    Signed-off-by: Ken Chen
    Acked-by: Adam Litke
    Cc: Badari Pulavarty
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Chen
     
  • return_unused_surplus_pages() can become static.

    Signed-off-by: Adrian Bunk
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
    to guarantee no problems will occur later when instantiating pages. If quotas
    are in force, page instantiation could fail due to a race with another process
    or an oversized (but approved) shared mapping.

    To prevent these scenarios, debit the quota for the full reservation amount up
    front and credit the unused quota when the reservation is released.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
    allow bulk updating of the sbinfo->free_blocks counter. This will be used by
    the next patch in the series.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
    seem out of place. The alloc/free API is unbalanced because we handle the
    hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
    Move the get inside alloc_huge_page to clean up this disparity.

    This patch has been kept apart from the previous patch because of the somewhat
    dodgy ERR_PTR() use herein. Moving the quota logic means that
    alloc_huge_page() has two failure modes. Quota failure must result in a
    SIGBUS while a standard allocation failure is OOM. Unfortunately, ERR_PTR()
    doesn't like the small positive errnos we have in VM_FAULT_* so they must be
    negated before they are used.

    Does anyone take issue with the way I am using PTR_ERR? If so, what are your
    thoughts on how to clean this up (without needing an if/else if/else block at
    each alloc_huge_page() callsite)?

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
    mappings when that support was added. Currently, quota is debited at page
    instantiation and credited at file truncation. This approach works correctly
    for shared pages but is incomplete for private pages. In addition to
    hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
    this function does not respect quotas.

    Private huge pages are treated very much like normal, anonymous pages. They
    are not "backed" by the hugetlbfs file and are not stored in the mapping's
    radix tree. This means that private pages are invisible to
    truncate_hugepages() so that function will not credit the quota.

    This patch (based on a prototype provided by Ken Chen) moves quota crediting
    for all pages into free_huge_page(). page->private is used to store a pointer
    to the mapping to which this page belongs. This is used to credit quota on
    the appropriate hugetlbfs instance.

    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
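
    Editorial sketch of the bookkeeping: the owning hugetlbfs mapping is stashed
    in page->private at allocation time so free_huge_page() can credit quota back
    to the right filesystem instance. The two-argument form of
    hugetlb_put_quota() is from the related "delta" patch in this series.

        /* alloc_huge_page(): remember which mapping this page belongs to. */
        set_page_private(page, (unsigned long)vma->vm_file->f_mapping);

        /* free_huge_page(): credit the quota and forget the owner. */
        mapping = (struct address_space *)page_private(page);
        set_page_private(page, 0);
        if (mapping)
                hugetlb_put_quota(mapping, 1);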
     
  • Hugetlbfs implements a quota system which can limit the amount of memory that
    can be used by the filesystem. Before allocating a new huge page for a file,
    the quota is checked and debited. The quota is then credited when truncating
    the file. I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
    mappings. Before detailing the problems and my proposed solutions, we should
    agree on a definition of quotas that properly addresses both private and
    shared pages. Since the purpose of quotas is to limit total memory
    consumption on a per-filesystem basis, I argue that all pages allocated by the
    fs (private and shared) should be charged against quota.

    Private Mappings
    ================

    The current code will debit quota for private pages sometimes, but will never
    credit it. At a minimum, this causes a leak in the quota accounting which
    renders the accounting essentially useless as it is. Shared pages have a one
    to one mapping with a hugetlbfs file and are easy to account by debiting on
    allocation and crediting on truncate. Private pages are anonymous in nature
    and have a many to one relationship with their hugetlbfs files (due to copy on
    write). Because private pages are not indexed by the mapping's radix tree,
    their quota cannot be credited at file truncation time. Crediting must be
    done when the page is unmapped and freed.

    Shared Pages
    ============

    I discovered an issue concerning the interaction between the MAP_SHARED
    reservation system and quotas. Since quota is not checked until page
    instantiation, an over-quota mmap/reservation will initially succeed. When
    instantiating the first over-quota page, the program will receive SIGBUS.
    This is inconsistent since the reservation is supposed to be a guarantee. The
    solution is to debit the full amount of quota at reservation time and credit
    the unused portion when the reservation is released.

    This patch series brings quotas back in line by making the following
    modifications:
    * Private pages
    - Debit quota in alloc_huge_page()
    - Credit quota in free_huge_page()
    * Shared pages
    - Debit quota for entire reservation at mmap time
    - Credit quota for instantiated pages in free_huge_page()
    - Credit quota for unused reservation at munmap time

    This patch:

    The shared page reservation and dynamic pool resizing features have made the
    allocation of private vs. shared huge pages quite different. By splitting
    out the private/shared-specific portions of the process into their own
    functions, readability is greatly improved. alloc_huge_page now calls the
    proper helper and performs common operations.

    [akpm@linux-foundation.org: coding-style cleanups]
    Signed-off-by: Adam Litke
    Cc: Ken Chen
    Cc: Andy Whitcroft
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • When calling get_user_pages(), a write flag is passed in by the caller to
    indicate if write access is required on the faulted-in pages. Currently,
    follow_hugetlb_page() ignores this flag and always faults pages for
    read-only access. This can cause data corruption because a device driver
    that calls get_user_pages() with write set will not expect COW faults to
    occur on the returned pages.

    This patch passes the write flag down to follow_hugetlb_page() and makes
    sure hugetlb_fault() is called with the right write_access parameter.

    [ezk@cs.sunysb.edu: build fix]
    Signed-off-by: Adam Litke
    Reviewed-by: Ken Chen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

20 Oct, 2007

1 commit


19 Oct, 2007

1 commit

  • Get rid of sparse related warnings from places that use integer as NULL
    pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

17 Oct, 2007

5 commits

  • When gather_surplus_pages() fails to allocate enough huge pages to satisfy
    the requested reservation, it frees what it did allocate back to the buddy
    allocator. put_page() should be called instead of update_and_free_page()
    to ensure that pool counters are updated as appropriate and the page's
    refcount is decremented.

    Signed-off-by: Adam Litke
    Acked-by: Dave Hansen
    Cc: David Gibson
    Cc: William Lee Irwin III
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Anton found a problem with the hugetlb pool allocation when some nodes have
    no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2). Lee worked
    on versions that tried to fix it, but none were accepted. Christoph has
    created a set of patches which allow for GFP_THISNODE allocations to fail
    if the node has no memory.

    Currently, alloc_fresh_huge_page() returns NULL when it is not able to
    allocate a huge page on the current node, as specified by its custom
    interleave variable. The callers of this function, though, assume that a
    failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
    on the system, period. This might not be the case, for instance, if we have
    an uneven NUMA system, and we happen to try to allocate a hugepage on a
    node with less memory and fail, while there is still plenty of free memory
    on the other nodes.

    To correct this, make alloc_fresh_huge_page() search through all online
    nodes before deciding no hugepages can be allocated. Add a helper function
    for actually allocating the hugepage. Use a new global nid iterator to
    control which nid to allocate on.

    Note: we expect particular semantics for __GFP_THISNODE, which are now
    enforced even for memoryless nodes. That is, there should be no
    fallback to other nodes. Therefore, we rely on the nid passed into
    alloc_pages_node() to be the nid the page comes from. If this is
    incorrect, accounting will break.

    Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
    memoryless nodes).

    Before on the ppc64 box:
    Trying to clear the hugetlb pool
    Done. 0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free: 25
    Node 1 HugePages_Free: 75
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. Initially 100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free: 50
    Node 1 HugePages_Free: 150
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. 200 free

    After:
    Trying to clear the hugetlb pool
    Done. 0 free
    Trying to resize the pool to 100
    Node 0 HugePages_Free: 50
    Node 1 HugePages_Free: 50
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. Initially 100 free
    Trying to resize the pool to 200
    Node 0 HugePages_Free: 100
    Node 1 HugePages_Free: 100
    Node 2 HugePages_Free: 0
    Node 3 HugePages_Free: 0
    Done. 200 free

    Signed-off-by: Nishanth Aravamudan
    Acked-by: Christoph Lameter
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Badari Pulavarty
    Cc: Ken Chen
    Cc: William Lee Irwin III
    Cc: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
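
    Editorial sketch of the round-robin walk: remember where the last attempt
    stopped and try each online node at most once per call. The per-node helper
    named below is a stand-in for the __GFP_THISNODE allocation added by the
    patch.

        static int hugetlb_next_nid;            /* persists across calls */

        static struct page *alloc_fresh_huge_page_sketch(void)
        {
                struct page *page = NULL;
                int start_nid = hugetlb_next_nid;
                int nid = start_nid;

                do {
                        page = alloc_one_huge_page_on_node(nid);  /* stand-in */
                        nid = next_node(nid, node_online_map);
                        if (nid == MAX_NUMNODES)
                                nid = first_node(node_online_map);
                        hugetlb_next_nid = nid;
                } while (!page && nid != start_nid);

                return page;
        }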
     
  • When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
    are careful to keep enough pages around to satisfy reservations. But the
    calculation is flawed for the following scenario:

    Action                          Pool Counters (Total, Free, Resv)
    ======                          =================================
    Set pool to 1 page              1  1  0
    Map 1 page MAP_PRIVATE          1  1  0
    Touch the page to fault it in   1  0  0
    Set pool to 3 pages             3  2  0
    Map 2 pages MAP_SHARED          3  2  2
    Set pool to 2 pages             2  1  2
    Acked-by: Ken Chen
    Cc: David Gibson
    Cc: Badari Pulavarty
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • The maximum size of the huge page pool can be controlled using the overall
    size of the hugetlb filesystem (via its 'size' mount option). However, in the
    common case this will not be set, as the pool is traditionally fixed in
    size at boot time. In order to maintain the expected semantics, we need to
    prevent the pool expanding by default.

    This patch introduces a new sysctl controlling dynamic pool resizing. When
    this is enabled the pool will expand beyond its base size up to the size of
    the hugetlb filesystem. It is disabled by default.

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     
  • Shared mappings require special handling because the huge pages needed to
    fully populate the VMA must be reserved at mmap time. If not enough pages are
    available when making the reservation, allocate all of the shortfall at once
    from the buddy allocator and add the pages directly to the hugetlb pool. If
    they cannot be allocated, then fail the mapping. The page surplus is
    accounted for in the same way as for private mappings; faulted surplus pages
    will be freed at unmap time. Reserved, surplus pages that have not been used
    must be freed separately when their reservation has been released.

    Signed-off-by: Adam Litke
    Acked-by: Andy Whitcroft
    Acked-by: Dave McCracken
    Cc: William Irwin
    Cc: David Gibson
    Cc: Ken Chen
    Cc: Badari Pulavarty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke