16 Dec, 2009

9 commits

  • If a user requests a hugepage pool resize but specifies a very large
    number, the machine can begin thrashing. In response, they might hit
    ctrl-c, but signals are ignored and the pool resize continues until it
    fails an allocation. This can take a considerable amount of time, so this
    patch aborts a pool resize if a signal is pending.
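
    A minimal sketch of the idea, assuming the resize loop in
    set_max_huge_pages(); the exact placement and surrounding locking in the
    real patch may differ:

        while (count > persistent_huge_pages(h)) {
            /*
             * Bail out if a signal (e.g. ctrl-c) is pending; otherwise a
             * huge resize can thrash the machine for a long time before
             * it finally fails an allocation.
             */
            if (signal_pending(current))
                break;
            if (!alloc_fresh_huge_page(h))
                break;
        }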

    Suggested by Dave Hansen.

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When the owner of a mapping fails COW because a child process is holding
    a reference, the child VMAs are walked and the page is unmapped. The
    i_mmap_lock is taken for the unmapping of the page but not for the walk
    of the prio_tree. In theory, that tree could be changing while the lock
    is not held. This patch takes the i_mmap_lock properly for the duration
    of the prio_tree walk.
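
    Schematically, the fix is to hold the lock across the walk itself -- a
    sketch using the prio_tree API of this era, not the literal diff:

        spin_lock(&mapping->i_mmap_lock);
        vma_prio_tree_foreach(iter_vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
            /* Do not unmap the page from the VMA that faulted */
            if (iter_vma == vma)
                continue;
            unmap_hugepage_range(iter_vma, address,
                                 address + huge_page_size(h), page);
        }
        spin_unlock(&mapping->i_mmap_lock);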

    [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • hugetlb_fault() takes the mm->page_table_lock spinlock then calls
    hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
    insufficient huge page pool it calls unmap_ref_private() with the
    mm->page_table_lock held. unmap_ref_private() then calls
    unmap_hugepage_range() which tries to acquire the mm->page_table_lock.

    [] print_circular_bug_tail+0x80/0x9f
    [] ? check_noncircular+0xb0/0xe8
    [] __lock_acquire+0x956/0xc0e
    [] lock_acquire+0xee/0x12e
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? unmap_hugepage_range+0x3e/0x84
    [] _spin_lock+0x40/0x89
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? alloc_huge_page+0x218/0x318
    [] unmap_hugepage_range+0x3e/0x84
    [] hugetlb_cow+0x1e2/0x3f4
    [] ? hugetlb_fault+0x453/0x4f6
    [] hugetlb_fault+0x480/0x4f6
    [] follow_hugetlb_page+0x116/0x2d9
    [] ? _spin_unlock_irq+0x3a/0x5c
    [] __get_user_pages+0x2a3/0x427
    [] get_user_pages+0x3e/0x54
    [] get_user_pages_fast+0x170/0x1b5
    [] dio_get_page+0x64/0x14a
    [] __blockdev_direct_IO+0x4b7/0xb31
    [] blkdev_direct_IO+0x58/0x6e
    [] ? blkdev_get_blocks+0x0/0xb8
    [] generic_file_aio_read+0xdd/0x528
    [] ? avc_has_perm+0x66/0x8c
    [] do_sync_read+0xf5/0x146
    [] ? autoremove_wake_function+0x0/0x5a
    [] ? security_file_permission+0x24/0x3a
    [] vfs_read+0xb5/0x126
    [] ? fget_light+0x5e/0xf8
    [] sys_read+0x54/0x8c
    [] system_call_fastpath+0x16/0x1b

    This can be fixed by dropping the mm->page_table_lock around the call to
    unmap_ref_private() if alloc_huge_page() fails; it's dropped right below
    in the normal path anyway. However, earlier in that function, it's also
    possible to call into the page allocator with the same spinlock held.

    What this patch does is drop the spinlock before the page allocator is
    potentially entered. Both the check for page allocation failure and the
    copy of the huge page can be done without the page_table_lock held. Even
    if the PTE changed while the spinlock was dropped, the consequence is
    only that a huge page is copied unnecessarily. This resolves both the
    double taking of the lock and sleeping with the spinlock held.
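
    In outline, the reworked hugetlb_cow() path looks roughly like this (a
    sketch under the names of this era, not the literal diff):

        /* Drop page_table_lock as the buddy allocator may be called */
        spin_unlock(&mm->page_table_lock);
        new_page = alloc_huge_page(vma, address, outside_reserve);

        if (IS_ERR(new_page)) {
            /* unmap_ref_private()/retry handling now runs unlocked */
            spin_lock(&mm->page_table_lock);
            return -PTR_ERR(new_page);
        }

        copy_huge_page(new_page, old_page, address, vma);

        /* Retake the lock before touching the page tables again */
        spin_lock(&mm->page_table_lock);
        ptep = huge_pte_offset(mm, address & huge_page_mask(h));
        if (likely(pte_same(huge_ptep_get(ptep), pte))) {
            /* install new_page; a changed PTE only costs a wasted copy */
        }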

    [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
    Signed-off-by: Larry Woodman
    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • Objects passed to NODEMASK_ALLOC() are relatively small in size and are
    backed by slab caches that are not of large order, traditionally never
    greater than PAGE_ALLOC_COSTLY_ORDER.

    Thus, using GFP_KERNEL for these allocations on large machines when
    CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly in
    the allocation attempt, each time invoking both direct reclaim and the
    oom killer.

    This is of particular interest when using NODEMASK_ALLOC() from a
    mempolicy context (either directly in mm/mempolicy.c or the mempolicy
    constrained hugetlb allocations) since the oom killer always kills current
    when allocations are constrained by mempolicies. So for all present use
    cases in the kernel, current would end up being oom killed when direct
    reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
    current would have sacrificed itself upon returning.

    This patch adds a gfp-flags parameter to NODEMASK_ALLOC() that is passed
    to kmalloc() when CONFIG_NODES_SHIFT > 8; the parameter is a no-op on
    other configurations. All current use cases, whether directly from
    hugetlb code or indirectly via NODEMASK_SCRATCH(), OR in __GFP_NORETRY to
    avoid direct reclaim and the oom killer when the slab allocator needs to
    allocate additional pages.
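
    The resulting macro shape is roughly the following (a sketch patterned on
    include/linux/nodemask.h; check the header for the exact definitions):

        #if NODES_SHIFT > 8 /* nodemask_t is too large for the stack */
        #define NODEMASK_ALLOC(type, name, gfp_flags)    \
                type *name = kmalloc(sizeof(*name), gfp_flags)
        #define NODEMASK_FREE(m)    kfree(m)
        #else
        #define NODEMASK_ALLOC(type, name, gfp_flags)    type _##name, *name = &_##name
        #define NODEMASK_FREE(m)    do {} while (0)
        #endif

        /* callers then pass, for example: */
        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);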

    The side-effect of this change is that all current use cases of either
    NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM handling
    when the allocation fails (which can never happen for
    CONFIG_NODES_SHIFT <= 8).
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Register per node hstate sysfs attributes only for nodes with memory.
    Globally replace "all online nodes" with "all nodes with memory" in
    mm/hugetlb.c. Suggested by David Rientjes.

    A subsequent patch will handle adding/removing of per node hstate sysfs
    attributes when nodes transition to/from memoryless state via memory
    hotplug.

    NOTE: this patch has not been tested with memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node<nid>/hugepages/hugepages-<size>/
    nr_hugepages - r/w
    free_huge_pages - r/o
    surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node mask
    containing only the specified node.
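
    In code, the shared update path ends up looking roughly like this (a
    simplified sketch; here nid < 0 stands for "no specific node", and the
    helpers shown are the ones introduced and used by this series):

        NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);

        if (nid < 0) {
            /*
             * Global hstate attribute or sysctl: constrain by any
             * non-default task mempolicy.
             */
            if (!(nodes_allowed &&
                  init_nodemask_of_mempolicy(nodes_allowed))) {
                NODEMASK_FREE(nodes_allowed);
                nodes_allowed = &node_states[N_HIGH_MEMORY];
            }
        } else if (nodes_allowed) {
            /*
             * Per node attribute: adjust the count to a global target,
             * but restrict alloc/free to the specified node.
             */
            count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
            init_nodemask_of_node(nodes_allowed, nid);
        } else {
            nodes_allowed = &node_states[N_HIGH_MEMORY];
        }

        h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);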

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadm --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.
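
    The derivation is essentially the following helper (a sketch patterned on
    the init_nodemask_of_mempolicy() added in mm/mempolicy.c by this series;
    locking and corner cases are omitted):

        bool init_nodemask_of_mempolicy(nodemask_t *mask)
        {
            struct mempolicy *mempolicy = current->mempolicy;
            int nid;

            if (!mask || !mempolicy)
                return false;    /* default policy: use all online nodes */

            switch (mempolicy->mode) {
            case MPOL_PREFERRED:
                if (mempolicy->flags & MPOL_F_LOCAL)    /* explicit "local" */
                    nid = numa_node_id();
                else
                    nid = mempolicy->v.preferred_node;
                init_nodemask_of_node(mask, nid);
                break;
            case MPOL_BIND:
            case MPOL_INTERLEAVE:
                *mask = mempolicy->v.nodes;
                break;
            default:
                BUG();
            }
            return true;
        }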

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In preparation for constraining huge page allocation and freeing by the
    controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
    to the allocate, free and surplus adjustment functions. For now, pass
    NULL to indicate default behavior--i.e., use node_online_map. A
    subsequent patch will derive a non-default mask from the controlling
    task's numa mempolicy.

    Note that this method of updating the global hstate nr_hugepages under the
    constraint of a nodemask simplifies keeping the global state
    consistent--especially the number of persistent and surplus pages relative
    to reservations and overcommit limits. There are undoubtedly other ways
    to do this, but this works for both interfaces: mempolicy and per node
    attributes.

    [rientjes@google.com: fix HIGHMEM compile error]
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Modify the hstate_next_node* functions to allow them to be called to
    obtain the "start_nid". Then, whereas prior to this patch we
    unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
    we successfully allocated/freed a huge page on the node, now we only call
    these functions on failure to alloc/free, to advance to the next allowed
    node.

    Factor out the next_node_allowed() function to handle wrap at end of
    node_online_map. In this version, the allowed nodes include all of the
    online nodes.
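
    Roughly, the reworked helpers look like this (a sketch; field names are
    approximate):

        /* Common helper: advance to the next online node, wrapping at the end. */
        static int next_node_allowed(int nid)
        {
            nid = next_node(nid, node_online_map);
            if (nid == MAX_NUMNODES)
                nid = first_node(node_online_map);
            VM_BUG_ON(nid >= MAX_NUMNODES);
            return nid;
        }

        /*
         * Return the current "start" node and save the next one, so callers
         * only advance again on failure to allocate from this node.
         */
        static int hstate_next_node_to_alloc(struct hstate *h)
        {
            int nid = h->next_nid_to_alloc;

            h->next_nid_to_alloc = next_node_allowed(nid);
            return nid;
        }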

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

28 Sep, 2009

1 commit


24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

5 commits

  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I noticed that alloc_bootmem_huge_page() will only advance to the next
    node on failure to allocate a huge page, potentially filling nodes with
    huge-pages. I asked about this on linux-mm and linux-numa, cc'ing the
    usual huge page suspects.

    Mel Gorman responded:

    I strongly suspect that the same node being used until allocation
    failure instead of round-robin is an oversight and not deliberate
    at all. It appears to be a side-effect of a fix made way back in
    commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix
    hugepage allocation with memoryless nodes"]. Prior to that patch
    it looked like allocations would always round-robin even when
    allocation was successful.

    This patch--factored out of my "hugetlb mempolicy" series--moves the
    advance of the hstate next node from which to allocate up before the test
    for success of the attempted allocation.

    Note that alloc_bootmem_huge_page() is only used for order > MAX_ORDER
    huge pages.

    I'll post a separate patch for mainline/stable, as the above mentioned
    "balance freeing" series renamed the next node to alloc function.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Use the [modified] free_pool_huge_page() function to return unused
    surplus pages. This will help keep huge pages balanced across nodes
    between freeing of unused surplus pages and freeing of persistent huge
    pages [from set_max_huge_pages] by using the same node id "cursor". It
    also eliminates some code duplication.

    Signed-off-by: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Free huge pages from nodes in round-robin fashion in an attempt to keep
    [persistent, a.k.a. static] hugepages balanced across nodes.

    The new function free_pool_huge_page() is modeled on and performs roughly
    the inverse of alloc_fresh_huge_page(). It replaces dequeue_huge_page(),
    which now has no callers, so this patch removes it.

    Helper function hstate_next_node_to_free() uses new hstate member
    next_to_free_nid to distribute "frees" across all nodes with huge pages.
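
    A sketch of the new function (close to, but not necessarily identical
    with, the merged code):

        static int free_pool_huge_page(struct hstate *h)
        {
            int start_nid, nid, ret = 0;

            start_nid = hstate_next_node_to_free(h);
            nid = start_nid;

            do {
                if (!list_empty(&h->hugepage_freelists[nid])) {
                    struct page *page =
                        list_entry(h->hugepage_freelists[nid].next,
                                   struct page, lru);
                    /* return this page to the buddy allocator */
                    list_del(&page->lru);
                    h->free_huge_pages--;
                    h->free_huge_pages_node[nid]--;
                    update_and_free_page(h, page);
                    ret = 1;
                }
                nid = hstate_next_node_to_free(h);
            } while (!ret && nid != start_nid);

            return ret;
        }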

    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

10 Sep, 2009

1 commit


30 Jul, 2009

1 commit

  • As reported in Red Hat bz #509671, the i_blocks accounting for files on
    hugetlbfs goes wrong when doing something like:

    $ > foo
    $ date > foo
    date: write error: Invalid argument
    $ /usr/bin/stat foo
    File: `foo'
    Size: 0 Blocks: 18446744073709547520 IO Block: 2097152 regular
    ...

    This is because hugetlb_unreserve_pages() is unconditionally removing
    blocks_per_huge_page(h) on each call rather than using the freed amount.
    If there were 0 blocks, it goes negative, resulting in the above.

    This is a regression from commit a5516438959d90b071ff0a484ce4f3f523dc3152
    ("hugetlb: modular state for hugetlb page size")

    which did:

    - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
    + inode->i_blocks -= blocks_per_huge_page(h);

    so just put back the freed multiplier, and it's all happy again.
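
    In other words, hugetlb_unreserve_pages() should subtract something like
    the following (a sketch of the restored calculation):

        spin_lock(&inode->i_lock);
        inode->i_blocks -= (blocks_per_huge_page(h) * freed);
        spin_unlock(&inode->i_lock);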

    Signed-off-by: Eric Sandeen
    Acked-by: Andi Kleen
    Cc: William Lee Irwin III
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

24 Jun, 2009

1 commit


17 Jun, 2009

3 commits

  • A series of patches to enhance the /proc/pagemap interface and to add a
    userspace executable which can be used to present the pagemap data.

    Export 10 more flags to end users (and more for kernel developers):

    11. KPF_MMAP (pseudo flag) memory mapped page
    12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
    13. KPF_SWAPCACHE page is in swap cache
    14. KPF_SWAPBACKED page is swap/RAM backed
    15. KPF_COMPOUND_HEAD (*)
    16. KPF_COMPOUND_TAIL (*)
    17. KPF_HUGE hugeTLB pages
    18. KPF_UNEVICTABLE page is in the unevictable LRU list
    19. KPF_HWPOISON hardware detected corruption
    20. KPF_NOPAGE (pseudo flag) no page frame at the address

    (*) For compound pages, exporting _both_ head/tail info enables
    users to tell where a compound page starts/ends, and its order.

    A simple demo of the page-types tool:

    # ./page-types -h
    page-types [options]
    -r|--raw Raw mode, for kernel developers
    -a|--addr addr-spec Walk a range of pages
    -b|--bits bits-spec Walk pages with specified bits
    -l|--list Show page details in ranges
    -L|--list-each Show page details one by one
    -N|--no-summary Don't show summary info
    -h|--help Show this usage message
    addr-spec:
    N one page at offset N (unit: pages)
    N+M pages range from N to N+M-1
    N,M pages range from N to M-1
    N, pages range from N to end
    ,M pages range from 0 to M
    bits-spec:
    bit1,bit2 (flags & (bit1|bit2)) != 0
    bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
    bit1,~bit2 (flags & (bit1|bit2)) == bit1
    =bit1,bit2 flags == (bit1|bit2)
    bit-names:
    locked error referenced uptodate
    dirty lru active slab
    writeback reclaim buddy mmap
    anonymous swapcache swapbacked compound_head
    compound_tail huge unevictable hwpoison
    nopage reserved(r) mlocked(r) mappedtodisk(r)
    private(r) private_2(r) owner_private(r) arch(r)
    uncached(r) readahead(o) slob_free(o) slub_frozen(o)
    slub_debug(o)
    (r) raw mode bits (o) overloaded bits

    # ./page-types
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 487369 1903 _________________________________
    0x0000000000000014 5 0 __R_D____________________________ referenced,dirty
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000000000024 34 0 __R__l___________________________ referenced,lru
    0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x0000000000000040 8344 32 ______A__________________________ active
    0x0000000000000060 1 0 _____lA__________________________ lru,active
    0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 503 1 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types -r
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 468002 1828 _________________________________
    0x0000000100000000 19102 74 _____________________r___________ reserved
    0x0000000000008000 41 0 _______________H_________________ compound_head
    0x0000000000010000 188 0 ________________T________________ compound_tail
    0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head
    0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private
    0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk
    0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead
    0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk
    0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private
    0x0000000800000040 8124 31 ______A_________________P________ active,private
    0x0000000000000040 219 0 ______A__________________________ active
    0x0000000800000060 1 0 _____lA_________________P________ lru,active,private
    0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk
    0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private
    0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk
    0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private
    0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private
    0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 538 2 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types --raw --list --no-summary --bits reserved
    offset count flags
    0 15 _____________________r___________
    31 4 _____________________r___________
    159 97 _____________________r___________
    4096 2067 _____________________r___________
    6752 2390 _____________________r___________
    9355 3 _____________________r___________
    9728 14526 _____________________r___________

    This patch:

    Introduce PageHuge(), which identifies huge/gigantic pages by their
    dedicated compound destructor functions.

    Also move prep_compound_gigantic_page() to hugetlb.c and make
    __free_pages_ok() non-static.
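
    The check itself is small; a sketch of the introduced PageHuge(), where
    the destructor comparison is the key idea:

        int PageHuge(struct page *page)
        {
            compound_page_dtor *dtor;

            if (!PageCompound(page))
                return 0;

            page = compound_head(page);
            dtor = get_compound_page_dtor(page);

            /* hugetlb pages use free_huge_page as their compound destructor */
            return dtor == free_huge_page;
        }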

    Signed-off-by: Wu Fengguang
    Cc: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Matt Mackall
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • num_online_nodes() is called in a number of places but most often by the
    page allocator when deciding whether the zonelist needs to be filtered
    based on cpusets or the zonelist cache. This is actually a heavy function
    and touches a number of cache lines.

    This patch stores the number of online nodes at boot time and updates the
    value when nodes get onlined and offlined. The value is then used in a
    number of important paths in place of num_online_nodes().
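
    The cached value is maintained where nodes change state (a sketch
    patterned on include/linux/nodemask.h):

        extern int nr_online_nodes;    /* updated on node online/offline */

        static inline void node_set_online(int nid)
        {
            node_set_state(nid, N_ONLINE);
            nr_online_nodes = num_node_state(N_ONLINE);
        }

        static inline void node_set_offline(int nid)
        {
            node_clear_state(nid, N_ONLINE);
            nr_online_nodes = num_node_state(N_ONLINE);
        }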

    [rientjes@google.com: do not override definition of node_set_online() with macro]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
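
    The new helper is essentially the following (a sketch patterned on
    include/linux/gfp.h):

        static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                          unsigned int order)
        {
            /* Unlike alloc_pages_node(), the caller guarantees a valid nid */
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
        }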

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 May, 2009

1 commit

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302

    hugetlbfs reserves huge pages but does not fault them at mmap() time to
    ensure that future faults succeed. The reservation behaviour differs
    depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
    For MAP_SHARED mappings, hugepages are reserved when mmap() is first
    called and are tracked based on information associated with the inode.
    Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE
    mappings track their reservations based on the VMA created as part of the
    mmap() operation. Each process mapping MAP_PRIVATE must make its own
    reservation.

    hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
    and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs,
    VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
    read-write. If a shared memory mapping was mapped shared-read-write for
    populating of data and mapped shared-read-only by other processes, then
    hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This
    causes processes to fail to map the file MAP_SHARED even though it should
    succeed as the reservation is there.

    This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
    when the intent of the code was to check whether the VMA was mapped
    MAP_SHARED or MAP_PRIVATE.
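
    Illustratively (a hypothetical helper; the actual patch is a direct flag
    substitution at each check in mm/hugetlb.c):

        /* Is this hugetlb VMA a MAP_SHARED mapping? */
        static inline int is_hugetlb_shared(struct vm_area_struct *vma)
        {
            /*
             * VM_SHARED is only set when a shared mapping is also opened
             * read-write; VM_MAYSHARE is set for every MAP_SHARED mapping,
             * which is what the reservation accounting cares about.
             */
            return !!(vma->vm_flags & VM_MAYSHARE);
        }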

    Signed-off-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc: Eric B Munson
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2009

1 commit

  • chg is unsigned, so it cannot be less than 0.

    Also, since region_chg() returns long, let vma_needs_reservation() forward
    this to alloc_huge_page(). Store it as long as well; all callers cast it
    to long anyway.

    Signed-off-by: Roel Kluin
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     

12 Feb, 2009

1 commit

  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and to account for filesystem quota. If either fails, the fault
    fails. The impact is that quota is getting accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To help
    prevent this mistake from happening again, it improves the documentation
    of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared
    and private mappings using reservation counters that are checked and
    updated during mmap(). This ensures (within limits) that hugepages exist
    in the future when faults occur; otherwise it would be too easy for
    applications to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings.
    This breaks the accounting for both the core VM and hugetlbfs: it can
    trigger an OOM storm when hugepage pools are too small, and lockups and
    corrupted counters otherwise. This patch brings hugetlbfs more in line
    with how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT from
    being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Jan, 2009

4 commits

  • At this point we already know that 'addr' is not NULL, so get rid of the
    redundant 'if'. gcc probably eliminates it in an optimization pass anyway.

    [akpm@linux-foundation.org: use __weak, too]
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Fix the following sparse warnings:

    mm/hugetlb.c:375:3: warning: returning void-valued expression
    mm/hugetlb.c:408:3: warning: returning void-valued expression

    Signed-off-by: Hannes Eder
    Acked-by: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hannes Eder
     
  • The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the
    kernel to back a VMA. This matches the size used by the MMU in the
    majority of cases. However, one counter-example occurs on PPC64 kernels,
    whereby a kernel using 64K as a base pagesize may still use 4K pages for
    the MMU on older processors. To distinguish the two, this patch reports
    MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps.

    Signed-off-by: Mel Gorman
    Cc: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • It is useful to verify a hugepage-aware application is using the expected
    pagesizes for its memory regions. This patch creates an entry called
    KernelPageSize in /proc/pid/smaps that is the size of page used by the
    kernel to back a VMA. The entry is not called PageSize as it is possible
    the MMU uses a different size. This extension should not break any sensible
    parser that skips lines containing unrecognised information.
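
    The size reported is derived from the hstate backing the VMA; a sketch of
    the vma_kernel_pagesize() helper added for this:

        unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
        {
            struct hstate *hstate;

            if (!is_vm_hugetlb_page(vma))
                return PAGE_SIZE;

            hstate = hstate_vma(vma);

            return 1UL << huge_page_shift(hstate);
        }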

    Signed-off-by: Mel Gorman
    Acked-by: "KOSAKI Motohiro"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2008

1 commit

  • Oops. Part of the hugetlb private reservation code was not fully
    converted to use hstates.

    When a huge page must be unmapped from VMAs due to a failed COW,
    HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of
    the page size being used. This works if the VMA is using the default
    huge page size. Otherwise we might unmap too much, too little, or
    trigger a BUG_ON. Rare but serious -- fix it.

    Signed-off-by: Adam Litke
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adam Litke
     

07 Nov, 2008

2 commits

  • As we can determine exactly when a gigantic page is in use we can optimise
    the common regular page cases by pulling out gigantic page initialisation
    into its own function. As gigantic pages are never released to buddy we
    do not need a destructor. This effectively reverts the previous change to
    the main buddy allocator. It also adds a paranoid check to ensure we
    never release gigantic pages from hugetlbfs to the main buddy.

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • When working with hugepages, hugetlbfs assumes that those hugepages are
    smaller than MAX_ORDER. Specifically it assumes that the mem_map is
    contiguous and uses that to optimise access to the elements of the mem_map
    that represent the hugepage. Gigantic pages (such as 16GB pages on
    powerpc) by definition are of greater order than MAX_ORDER (larger than
    MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of
    the buddy allocator guarantees for the contiguity of the mem_map, which
    ensures that the mem_map is at least contiguous for maximally aligned
    areas of MAX_ORDER_NR_PAGES pages.

    This patch adds new mem_map accessors and iterator helpers which handle
    any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to
    implement gigantic page versions of copy_huge_page and clear_huge_page,
    and to allow follow_hugetlb_page() to handle gigantic pages.
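
    The accessor handles the potential discontiguity explicitly; a sketch of
    the mem_map_offset() helper (mem_map_next() follows the same pattern for
    iteration):

        /* Return the page idx pages after base, valid across MAX_ORDER gaps. */
        static struct page *mem_map_offset(struct page *base, int idx)
        {
            if (unlikely(idx >= MAX_ORDER_NR_PAGES))
                return pfn_to_page(page_to_pfn(base) + idx);
            return base + idx;
        }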

    Signed-off-by: Andy Whitcroft
    Cc: Jon Tollefson
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: [2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     

23 Oct, 2008

1 commit


20 Oct, 2008

3 commits

  • Presently hugepages don't use the zero page at all, because the zero page
    is only used for coredumping and hugepages could not be core dumped.

    However, we have now implemented hugepage coredumping, so we should
    support the zero page for hugepages as well.

    Implementation note:

    o Why do we only check VM_SHARED for the zero page?
    A normal page is checked as follows:

        static inline int use_zero_page(struct vm_area_struct *vma)
        {
            if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                return 0;

            return !vma->vm_ops || !vma->vm_ops->fault;
        }

    First, hugepages are never mlock()ed. We aren't concerned with VM_LOCKED.

    Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
    doesn't have any file backing. Thus ops->fault checking is meaningless.

    o Why don't we use the zero page if !pte?

    !pte indicates that the {pud, pmd} doesn't exist or that some error
    happened, so we shouldn't return the zero page if any error occurred.
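
    The hugepage check then reduces to the following (a sketch of the
    huge_zeropage_ok() test this patch introduces):

        static int huge_zeropage_ok(pte_t *ptep, int write, int shared)
        {
            if (!ptep || write || shared)
                return 0;
            else
                return huge_pte_none(huge_ptep_get(ptep));
        }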

    Signed-off-by: KOSAKI Motohiro
    Cc: Adam Litke
    Cc: Hugh Dickins
    Cc: Kawai Hidehiro
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
    mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
    mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
    mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Acked-by: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Oct, 2008

1 commit

  • The page fault path for normal pages, if the fault is neither a no-page
    fault nor a write-protect fault, will update the DIRTY and ACCESSED bits
    in the page table appropriately.

    The hugepage fault path, however, does not do this, handling only no-page
    or write-protect type faults. It assumes that either the ACCESSED and
    DIRTY bits are irrelevant for hugepages (usually true, since they are
    never swapped) or that they are handled by the arch code.

    This is inconvenient for some software-loaded TLB architectures, where the
    _PAGE_ACCESSED (_PAGE_DIRTY) bits need to be set to enable read (write)
    access to the page at the TLB miss. This could be worked around in the
    arch TLB miss code, but the TLB miss fast path is easier to keep simple
    if the hugetlb_fault() path handles this, as the normal page fault path
    does.
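
    The change amounts to doing, in hugetlb_fault(), roughly what the normal
    fault path does (a sketch; arch hooks shown by their generic names):

        entry = huge_ptep_get(ptep);
        entry = pte_mkyoung(entry);
        if (write_access)
            entry = pte_mkdirty(entry);
        /* set ACCESSED/DIRTY and flush the TLB entry if anything changed */
        if (huge_ptep_set_access_flags(vma, address, ptep, entry, write_access))
            update_mmu_cache(vma, address, entry);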

    Signed-off-by: David Gibson
    Cc: William Lee Irwin III
    Cc: Hugh Dickins
    Cc: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Gibson
     

13 Aug, 2008

2 commits

  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc(), which can sleep with the
    mm->page_table_lock held. This is wrong and triggers a may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory,
    without actually making the change. The second phase actually commits the
    change. This patch makes use of this by checking the reservations before
    the page_table_lock is taken; triggering any necessary allocations. This
    may then be safely repeated within the locks without any allocations being
    required.
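
    The resulting fault-path pattern is roughly the following (a simplified
    sketch; helper names as in mm/hugetlb.c of this era):

        /* Phase 1: before taking page_table_lock; may kmalloc() and sleep */
        if (vma_needs_reservation(h, vma, address) < 0)
            return VM_FAULT_OOM;

        spin_lock(&mm->page_table_lock);
        /* fault handling under the lock: no region allocations needed here */
        spin_unlock(&mm->page_table_lock);

        /* Phase 2: commit the prepared region change; no allocation required */
        vma_commit_reservation(h, vma, address);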

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft