13 Aug, 2010

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6:
    hugetlb: add missing unlock in avoidcopy path in hugetlb_cow()
    hwpoison: rename CONFIG
    HWPOISON, hugetlb: support hwpoison injection for hugepage
    HWPOISON, hugetlb: detect hwpoison in hugetlb code
    HWPOISON, hugetlb: isolate corrupted hugepage
    HWPOISON, hugetlb: maintain mce_bad_pages in handling hugepage error
    HWPOISON, hugetlb: set/clear PG_hwpoison bits on hugepage
    HWPOISON, hugetlb: enable error handling path for hugepage
    hugetlb, rmap: add reverse mapping for hugepage
    hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h

    Fix up trivial conflicts in mm/memory-failure.c

    Linus Torvalds
     

11 Aug, 2010

5 commits

  • This patch fixes a possible deadlock in hugepage lock_page()
    by adding the missing unlock_page().

    libhugetlbfs test will hit this bug when the next patch in this
    patchset ("hugetlb, HWPOISON: move PG_HWPoison bit check") is applied.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch enables hwpoison injection through the debug/hwpoison
    interfaces, with which we can test memory error handling for free or
    reserved hugepages (which cannot be tested by the madvise() injector).

    [AK: Export PageHuge too for the injection module]
    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch makes it possible to block access to a hwpoisoned hugepage
    and also to block unmapping of it.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • If the error hugepage is not in use, we can fully recover from the
    error by dequeuing it from the free list, so return RECOVERY.
    Otherwise, whether we can recover depends on the user processes,
    so return DELAYED.

    Dependency:
    "HWPOISON, hugetlb: enable error handling path for hugepage"

    Signed-off-by: Naoya Horiguchi
    Cc: Andrew Morton
    Acked-by: Fengguang Wu
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     
  • This patch adds a reverse mapping feature for hugepages by introducing
    a mapcount for shared/private-mapped hugepages and an anon_vma for
    private-mapped hugepages.

    While hugepages are not currently swappable, reverse mapping is useful
    for the memory error handler.

    Without this patch, the memory error handler cannot identify the
    processes using a bad hugepage, nor unmap it from them. That is:
    - for a shared hugepage:
      we can collect the processes using it through the pagecache,
      but cannot unmap it because of the lack of a mapcount.
    - for a privately mapped hugepage:
      we can neither collect the processes nor unmap it.
    This patch solves both problems.

    This patch includes the bug fix from commit 23be7468e8, so it reverts
    that commit.

    Dependency:
    "hugetlb: move definition of is_vm_hugetlb_page() to hugepage_inline.h"

    ChangeLog since May 24:
    - create hugetlb_inline.h and move is_vm_hugetlb_page() into it.
    - move the functions setting up anon_vma for hugepages into mm/rmap.c.

    ChangeLog since May 13:
    - rebased to 2.6.34
    - fix logic error (in the case where private and shared mappings coexist)
    - move is_vm_hugetlb_page() into include/linux/mm.h so it can be used
      from linear_page_index()
    - define and use linear_hugepage_index() instead of compound_order()
    - use page_move_anon_rmap() in hugetlb_cow()
    - copy the exclusive switch of __set_page_anon_rmap() into its hugepage
      counterpart
    - revert commit 23be7468 completely

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Acked-by: Fengguang Wu
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

1 commit

  • When a copy-on-write occurs, we take one of two paths in handle_mm_fault:
    through handle_pte_fault for normal pages, or through hugetlb_fault for
    huge pages.

    In the normal page case, we eventually get to do_wp_page and call the
    mmu notifiers via ptep_clear_flush_notify. There is no callout to the
    mmu notifiers in the huge page case. This patch fixes that.

    Signed-off-by: Doug Doan
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Doan
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. Even so, during the update the allocator
    may find that there is no node from which it can allocate memory.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). We use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes only after the read-side task finishes its
    current memory allocation.
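
    The ordering is easy to model with plain bitmasks; a minimal sketch of
    the idea (not kernel code, and the synchronization with in-flight
    readers is elided):

    #include <stdio.h>

    int main(void)
    {
            unsigned old_mems = 0x2;   /* node 1 only */
            unsigned new_mems = 0x1;   /* node 0 only */
            unsigned mask = old_mems;

            mask |= new_mems;          /* step 1: set new bits -> {0,1} */
            printf("during rebind: %#x (never empty)\n", mask);
            /* ... wait until no reader is mid-allocation ... */
            mask &= new_mems;          /* step 2: clear old bits -> {0} */
            printf("after rebind:  %#x\n", mask);
            return 0;
    }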

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

12 May, 2010

1 commit

  • Ordinarily, applications using hugetlbfs create mappings with reserves.
    For shared mappings, the pages are reserved before mmap() returns
    success; for private mappings, the caller process is guaranteed its
    pages, and a child process that cannot get the pages is killed with
    SIGBUS.

    An application that uses MAP_NORESERVE gets no reservations and mmap()
    will always succeed, at the risk that pages will not be available at
    fault time. This might be used, for example, on very large sparse
    mappings where the developer is confident the necessary huge pages
    exist to satisfy all faults even though the whole mapping cannot be
    backed by huge pages. Unfortunately, if an allocation does fail,
    VM_FAULT_OOM is returned to the fault handler, which proceeds to
    trigger the OOM killer. This is unhelpful.

    Even without hugetlbfs mounted, a user using mmap() can trivially
    trigger the OOM killer because VM_FAULT_OOM is returned (will provide
    example program if desired - it's a whopping 24 lines long). It could
    be considered a DoS available to an unprivileged user.

    This patch alters hugetlbfs to kill a process with SIGBUS when huge
    pages are not available for a MAP_NORESERVE mapping, instead of
    triggering the OOM killer.

    This change affects hugetlb_cow() as well. I feel there is a failure
    case in there, but I didn't create one. It would need a fairly specific
    target in terms of the faulting application and the hugepage pool size.
    The hugetlb_no_page() path is much easier to hit, but both might as
    well be closed.
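
    A minimal illustration of the failure mode (not the author's 24-line
    reproducer; assumes MAP_HUGETLB support and a huge page pool smaller
    than the mapping):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (1024UL << 20)   /* 1 GB of huge-page address space */

    int main(void)
    {
            char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                           MAP_NORESERVE, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* mmap() succeeded with no reservation; touching the pages
             * eventually exhausts the pool. Before the patch: OOM killer.
             * After it: this process gets SIGBUS. */
            memset(p, 0, LEN);
            return 0;
    }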

    Signed-off-by: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

25 Apr, 2010

1 commit

  • If a futex key happens to be located within a huge page mapped
    MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a
    page->mapping that will never exist.

    See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details
    about the problem.

    This patch poisons page->mapping of hugepages mapped MAP_PRIVATE with a
    value that includes PAGE_MAPPING_ANON. This is enough for futex to make
    progress, and because of PAGE_MAPPING_ANON the poisoned value is never
    dereferenced or otherwise used by futex. No other part of the VM should
    be dereferencing the page->mapping of a hugetlbfs page, as its page
    cache is not on the LRU.

    This patch fixes the problem with the test case described in the bugzilla.
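
    The trigger is easy to sketch from userspace (illustrative, not the
    bugzilla test case; assumes MAP_HUGETLB support and a configured huge
    page pool):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    int main(void)
    {
            /* A futex word inside a MAP_PRIVATE huge page. */
            int *f = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };

            if (f == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            *f = 0;
            /* Pre-fix: get_futex_key() spun forever on page->mapping.
             * Post-fix: the wait simply times out after one second. */
            syscall(SYS_futex, f, FUTEX_WAIT, 0, &ts, NULL, 0);
            puts("futex wait returned");
            return 0;
    }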

    [akpm@linux-foundation.org: mel cant spel]
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra
    Acked-by: Darren Hart
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
      only the necessary includes are there, i.e. gfp.h if only gfp is
      used, slab.h if slab is used.

    * When the script inserts a new include, it looks at the include
      blocks and tries to place the new include so that its order conforms
      to its surroundings. It's put in the include block which contains
      core kernel includes, in the same order that the rest are ordered -
      alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
      doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
      because the file doesn't have a fitting include block), it prints
      an error message indicating which .h file needs to be added to the
      file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while for others adding it to an
    implementation .h or embedding .c file was more appropriate. This
    step added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

02 Mar, 2010

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits)
    ARM: Eliminate decompressor -Dstatic= PIC hack
    ARM: 5958/1: ARM: U300: fix inverted clk round rate
    ARM: 5956/1: misplaced parentheses
    ARM: 5955/1: ep93xx: move timer defines into core.c and document
    ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c
    ARM: 5953/1: ep93xx: fix broken build of clock.c
    ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig
    ARM: 5949/1: NUC900 add gpio virtual memory map
    ARM: 5948/1: Enable timer0 to time4 clock support for nuc910
    ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk
    ARM: make_coherent(): fix problems with highpte, part 2
    MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
    ARM: 5945/1: ep93xx: include correct irq.h in core.c
    ARM: 5933/1: amba-pl011: support hardware flow control
    ARM: 5930/1: Add PKMAP area description to memory.txt.
    ARM: 5929/1: Add checks to detect overlap of memory regions.
    ARM: 5928/1: Change type of VMALLOC_END to unsigned long.
    ARM: 5927/1: Make delimiters of DMA area globally visibly.
    ARM: 5926/1: Add "Virtual kernel memory..." printout.
    ARM: 5920/1: OMAP4: Enable L2 Cache
    ...

    Fix up trivial conflict in arch/arm/mach-mx25/clock.c

    Linus Torvalds
     

21 Feb, 2010

1 commit

  • On VIVT ARM, when we have multiple shared mappings of the same file
    in the same MM, we need to ensure that we have coherency across all
    copies. We do this via make_coherent() by making the pages
    uncacheable.

    This used to work fine, until we allowed highmem with highpte - we
    now have a page table which is mapped as required, and is not available
    for modification via update_mmu_cache().

    Ralf Baechle suggested getting rid of the PTE value passed to
    update_mmu_cache():

    On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
    to construct a pointer to the pte again. Passing a pte_t * is much
    more elegant. Maybe we might even replace the pte argument with the
    pte_t?

    Ben Herrenschmidt would also like the pte pointer for PowerPC:

    Passing the ptep in there is exactly what I want. I want that
    -instead- of the PTE value, because I have issue on some ppc cases,
    for I$/D$ coherency, where set_pte_at() may decide to mask out the
    _PAGE_EXEC.

    So, pass in the mapped page table pointer into update_mmu_cache(), and
    remove the PTE value, updating all implementations and call sites to
    suit.

    Includes a fix from Stephen Rothwell:

    sparc: fix fallout from update_mmu_cache API change

    Signed-off-by: Stephen Rothwell

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Russell King

    Russell King
     

03 Feb, 2010

1 commit

  • hugetlb_sysfs_add_hstate is called by hugetlb_register_node directly
    during init and also indirectly via sysfs after init.

    This patch removes the __init tag from hugetlb_sysfs_add_hstate.

    Signed-off-by: Jeff Mahoney
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Mahoney
     

12 Jan, 2010

1 commit


16 Dec, 2009

9 commits

  • If a user asks for a hugepage pool resize but specifies a large number,
    the machine can begin thrashing. In response, they might hit ctrl-c,
    but signals are ignored and the pool resize continues until it fails
    an allocation. This can take a considerable amount of time, so this
    patch aborts a pool resize if a signal is pending.

    Suggested by Dave Hansen.
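
    In the kernel this is a signal_pending()-style check inside the resize
    loop; the same pattern in a runnable userspace model (illustrative
    only):

    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t stop;

    static void on_int(int sig)
    {
            (void)sig;
            stop = 1;
    }

    int main(void)
    {
            signal(SIGINT, on_int);
            for (long long i = 0; i < (1LL << 40); i++) {
                    if (stop) {   /* kernel: a pending-signal check here */
                            puts("pool resize aborted by signal");
                            break;
                    }
                    /* ... try to allocate one more huge page ... */
            }
            return 0;
    }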

    Signed-off-by: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When the owner of a mapping fails COW because a child process is holding
    a reference, the child VMAs are walked and the page is unmapped. The
    i_mmap_lock is taken for the unmapping of the page but not the walking of
    the prio_tree. In theory, that tree could be changing if the lock is not
    held. This patch takes the i_mmap_lock properly for the duration of the
    prio_tree walk.

    [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
    Signed-off-by: Mel Gorman
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • hugetlb_fault() takes the mm->page_table_lock spinlock then calls
    hugetlb_cow(). If the alloc_huge_page() in hugetlb_cow() fails due to an
    insufficient huge page pool it calls unmap_ref_private() with the
    mm->page_table_lock held. unmap_ref_private() then calls
    unmap_hugepage_range() which tries to acquire the mm->page_table_lock.

    [] print_circular_bug_tail+0x80/0x9f
    [] ? check_noncircular+0xb0/0xe8
    [] __lock_acquire+0x956/0xc0e
    [] lock_acquire+0xee/0x12e
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? unmap_hugepage_range+0x3e/0x84
    [] _spin_lock+0x40/0x89
    [] ? unmap_hugepage_range+0x3e/0x84
    [] ? alloc_huge_page+0x218/0x318
    [] unmap_hugepage_range+0x3e/0x84
    [] hugetlb_cow+0x1e2/0x3f4
    [] ? hugetlb_fault+0x453/0x4f6
    [] hugetlb_fault+0x480/0x4f6
    [] follow_hugetlb_page+0x116/0x2d9
    [] ? _spin_unlock_irq+0x3a/0x5c
    [] __get_user_pages+0x2a3/0x427
    [] get_user_pages+0x3e/0x54
    [] get_user_pages_fast+0x170/0x1b5
    [] dio_get_page+0x64/0x14a
    [] __blockdev_direct_IO+0x4b7/0xb31
    [] blkdev_direct_IO+0x58/0x6e
    [] ? blkdev_get_blocks+0x0/0xb8
    [] generic_file_aio_read+0xdd/0x528
    [] ? avc_has_perm+0x66/0x8c
    [] do_sync_read+0xf5/0x146
    [] ? autoremove_wake_function+0x0/0x5a
    [] ? security_file_permission+0x24/0x3a
    [] vfs_read+0xb5/0x126
    [] ? fget_light+0x5e/0xf8
    [] sys_read+0x54/0x8c
    [] system_call_fastpath+0x16/0x1b

    This can be fixed by dropping the mm->page_table_lock around the call
    to unmap_ref_private(); it's dropped right below in the normal path
    anyway. However, earlier in that function, it's also possible to call
    into the page allocator with the same spinlock held.

    What this patch does is drop the spinlock before the page allocator is
    potentially entered. The check for page allocation failure can be made
    without the page_table_lock, as can the copy of the huge page. Even if
    the PTE changed while the spinlock was released, the consequence is
    only that a huge page is copied unnecessarily. This resolves both the
    double taking of the lock and sleeping with the spinlock held.
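
    The resulting shape, modeled in userspace (a pthread mutex stands in
    for mm->page_table_lock; illustrative only):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;  /* page_table_lock */

    static void cow_path(void)
    {
            void *new_page;

            pthread_mutex_lock(&ptl);
            /* ... quick checks under the lock ... */
            pthread_mutex_unlock(&ptl);      /* drop before anything that can
                                                sleep or retake the lock     */
            new_page = malloc(4096);         /* allocate + copy, unlocked    */
            pthread_mutex_lock(&ptl);        /* retake and revalidate; worst
                                                case the copy was wasted     */
            pthread_mutex_unlock(&ptl);
            free(new_page);
    }

    int main(void)
    {
            cow_path();
            return 0;
    }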

    [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
    Signed-off-by: Larry Woodman
    Signed-off-by: Mel Gorman
    Acked-by: Adam Litke
    Cc: Andy Whitcroft
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: David Gibson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Larry Woodman
     
  • Objects passed to NODEMASK_ALLOC() are relatively small in size and are
    backed by slab caches that are not of large order, traditionally never
    greater than PAGE_ALLOC_COSTLY_ORDER.

    Thus, using GFP_KERNEL for these allocations on large machines when
    CONFIG_NODES_SHIFT > 8 will cause the page allocator to loop endlessly
    in the allocation attempt, each time invoking both direct reclaim and
    the oom killer.

    This is of particular interest when using NODEMASK_ALLOC() from a
    mempolicy context (either directly in mm/mempolicy.c or the mempolicy
    constrained hugetlb allocations) since the oom killer always kills current
    when allocations are constrained by mempolicies. So for all present use
    cases in the kernel, current would end up being oom killed when direct
    reclaim fails. That would allow the NODEMASK_ALLOC() to succeed but
    current would have sacrificed itself upon returning.

    This patch adds gfp flags to NODEMASK_ALLOC() to pass to kmalloc() on
    CONFIG_NODES_SHIFT > 8; this parameter is a nop on other configurations.
    All current use cases, either directly from hugetlb code or indirectly
    via NODEMASK_SCRATCH(), use __GFP_NORETRY to avoid direct reclaim and
    the oom killer when the slab allocator needs to allocate additional
    pages.

    The side-effect of this change is that all current use cases of either
    NODEMASK_ALLOC() or NODEMASK_SCRATCH() need appropriate -ENOMEM
    handling when the allocation fails (never for CONFIG_NODES_SHIFT <= 8,
    where the nodemask lives on the stack).
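
    A sketch of what a call site looks like after the change (consistent
    with the description above, not a verbatim hunk):

    /* NODEMASK_ALLOC() now takes gfp flags; the hugetlb-side callers pass
     * __GFP_NORETRY so a failed slab-page allocation returns NULL instead
     * of looping into direct reclaim and the oom killer. */
    NODEMASK_ALLOC(nodemask_t, nodes_allowed, GFP_KERNEL | __GFP_NORETRY);
    if (!nodes_allowed)
            return -ENOMEM;   /* new: callers must handle failure */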
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Register per node hstate sysfs attributes only for nodes with memory.
    Globally replace "all online nodes" with "all nodes with memory" in
    mm/hugetlb.c. Suggested by David Rientjes.

    A subsequent patch will handle adding/removing of per node hstate sysfs
    attributes when nodes transition to/from memoryless state via memory
    hotplug.

    NOTE: this patch has not been tested with memoryless nodes.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Add the per huge page size control/query attributes to the per node
    sysdevs:

    /sys/devices/system/node/node<NID>/hugepages/hugepages-<size>/
        nr_hugepages       - r/w
        free_huge_pages    - r/o
        surplus_huge_pages - r/o

    The patch attempts to re-use/share as much of the existing global hstate
    attribute initialization and handling, and the "nodes_allowed" constraint
    processing as possible.

    Calling set_max_huge_pages() with no node indicates a change to global
    hstate parameters. In this case, any non-default task mempolicy will be
    used to generate the nodes_allowed mask. A valid node id indicates an
    update to that node's hstate parameters, and the count argument specifies
    the target count for the specified node. From this info, we compute the
    target global count for the hstate and construct a nodes_allowed node
    mask containing only the specified node.

    Setting the node specific nr_hugepages via the per node attribute
    effectively ignores any task mempolicy or cpuset constraints.

    With this patch:

    (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
    ./ ../ free_hugepages nr_hugepages surplus_hugepages

    Starting from:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 0
    Node 2 HugePages_Free: 0
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 0

    Allocate 16 persistent huge pages on node 2:
    (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

    [Note that this is equivalent to:
    numactl -m 2 hugeadmin --pool-pages-min 2M:+16
    ]

    Yields:
    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 16
    Node 2 HugePages_Free: 16
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0
    vm.nr_hugepages = 16

    Global controls work as expected--reduce pool to 8 persistent huge pages:
    (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

    Node 0 HugePages_Total: 0
    Node 0 HugePages_Free: 0
    Node 0 HugePages_Surp: 0
    Node 1 HugePages_Total: 0
    Node 1 HugePages_Free: 0
    Node 1 HugePages_Surp: 0
    Node 2 HugePages_Total: 8
    Node 2 HugePages_Free: 8
    Node 2 HugePages_Surp: 0
    Node 3 HugePages_Total: 0
    Node 3 HugePages_Free: 0
    Node 3 HugePages_Surp: 0

    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This patch derives a "nodes_allowed" node mask from the numa mempolicy of
    the task modifying the number of persistent huge pages to control the
    allocation, freeing and adjusting of surplus huge pages when the pool page
    count is modified via the new sysctl or sysfs attribute
    "nr_hugepages_mempolicy". The nodes_allowed mask is derived as follows:

    * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
    is produced. This will cause the hugetlb subsystem to use
    node_online_map as the "nodes_allowed". This preserves the
    behavior before this patch.
    * For "preferred" mempolicy, including explicit local allocation,
    a nodemask with the single preferred node will be produced.
    "local" policy will NOT track any internode migrations of the
    task adjusting nr_hugepages.
    * For "bind" and "interleave" policy, the mempolicy's nodemask
    will be used.
    * Other than to inform the construction of the nodes_allowed node
    mask, the actual mempolicy mode is ignored. That is, all modes
    behave like interleave over the resulting nodes_allowed mask
    with no "fallback".

    See the updated documentation [next patch] for more information
    about the implications of this patch.
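
    The derivation is small enough to model in full; a self-contained
    sketch with illustrative names (masks are plain bitmasks here):

    enum mpol_mode { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE };

    /* Returns the nodes_allowed mask per the rules above. */
    static unsigned nodes_allowed(enum mpol_mode mode, unsigned policy_nodes,
                                  unsigned online_nodes, int preferred_nid)
    {
            switch (mode) {
            case MPOL_DEFAULT:                  /* NULL pointer upstream:  */
                    return online_nodes;        /* use node_online_map     */
            case MPOL_PREFERRED:                /* incl. local allocation  */
                    return 1u << preferred_nid; /* single-node mask        */
            case MPOL_BIND:
            case MPOL_INTERLEAVE:
                    return policy_nodes;        /* the policy's nodemask   */
            }
            return online_nodes;
    }

    int main(void)
    {
            return nodes_allowed(MPOL_BIND, 0x5, 0xf, 0) != 0x5;
    }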

    Examples:

    Starting with:

    Node 0 HugePages_Total: 0
    Node 1 HugePages_Total: 0
    Node 2 HugePages_Total: 0
    Node 3 HugePages_Total: 0

    Default behavior [with or without this patch] balances persistent
    hugepage allocation across nodes [with sufficient contiguous memory]:

    sysctl vm.nr_hugepages[_mempolicy]=32

    yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 8
    Node 3 HugePages_Total: 8

    Of course, we only have nr_hugepages_mempolicy with the patch,
    but with default mempolicy, nr_hugepages_mempolicy behaves the
    same as nr_hugepages.

    Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
    '--membind' because it allows multiple nodes to be specified
    and it's easy to type]--we can allocate huge pages on
    individual nodes or sets of nodes. So, starting from the
    condition above, with 8 huge pages per node, add 8 more to
    node 2 using:

    numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

    This yields:

    Node 0 HugePages_Total: 8
    Node 1 HugePages_Total: 8
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The incremental 8 huge pages were restricted to node 2 by the
    specified mempolicy.

    Similarly, we can use mempolicy to free persistent huge pages
    from specified nodes:

    numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

    yields:

    Node 0 HugePages_Total: 4
    Node 1 HugePages_Total: 4
    Node 2 HugePages_Total: 16
    Node 3 HugePages_Total: 8

    The 8 huge pages freed were balanced over nodes 0 and 1.

    [rientjes@google.com: accommodate reworked NODEMASK_ALLOC]
    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • In preparation for constraining huge page allocation and freeing by the
    controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
    to the allocate, free and surplus adjustment functions. For now, pass
    NULL to indicate default behavior--i.e., use node_online_map. A
    subsequent patch will derive a non-default mask from the controlling
    task's numa mempolicy.

    Note that this method of updating the global hstate nr_hugepages under the
    constraint of a nodemask simplifies keeping the global state
    consistent--especially the number of persistent and surplus pages relative
    to reservations and overcommit limits. There are undoubtedly other ways
    to do this, but this works for both interfaces: mempolicy and per node
    attributes.

    [rientjes@google.com: fix HIGHMEM compile error]
    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Modify the hstate_next_node* functions to allow them to be called to
    obtain the "start_nid". Then, whereas prior to this patch we
    unconditionally called hstate_next_node_to_{alloc|free}() whether or
    not we successfully allocated/freed a huge page on the node, now we
    only call these functions on failure to alloc/free, in order to
    advance to the next allowed node.

    Factor out the next_node_allowed() function to handle wrap at end of
    node_online_map. In this version, the allowed nodes include all of the
    online nodes.
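
    A self-contained model of the reworked cursor (illustrative names; the
    real code iterates node_online_map rather than a dense 0..N range):

    #include <stdio.h>

    #define NR_NODES 4

    static int next_nid;   /* models h->next_nid_to_alloc */

    /* Return the current node and advance the cursor, wrapping at the
     * end of the node map. */
    static int hstate_next_node(void)
    {
            int nid = next_nid;

            next_nid = (nid + 1) % NR_NODES;
            return nid;
    }

    /* Obtain start_nid first; advance further only after a failed attempt. */
    static int alloc_fresh(int (*try_alloc)(int))
    {
            int start = hstate_next_node();
            int nid = start;

            do {
                    if (try_alloc(nid))
                            return nid;
                    nid = hstate_next_node();
            } while (nid != start);
            return -1;   /* every allowed node failed */
    }

    static int node_has_memory(int nid)
    {
            return nid == 2;   /* toy predicate */
    }

    int main(void)
    {
            printf("allocated on node %d\n", alloc_fresh(node_has_memory));
            return 0;
    }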

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

28 Sep, 2009

1 commit


24 Sep, 2009

1 commit

  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

22 Sep, 2009

5 commits

  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I noticed that alloc_bootmem_huge_page() will only advance to the next
    node on failure to allocate a huge page, potentially filling nodes with
    huge-pages. I asked about this on linux-mm and linux-numa, cc'ing the
    usual huge page suspects.

    Mel Gorman responded:

    I strongly suspect that the same node being used until allocation
    failure instead of round-robin is an oversight and not deliberate
    at all. It appears to be a side-effect of a fix made way back in
    commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix
    hugepage allocation with memoryless nodes"]. Prior to that patch
    it looked like allocations would always round-robin even when
    allocation was successful.

    This patch--factored out of my "hugetlb mempolicy" series--moves the
    advance of the hstate next node from which to allocate up before the test
    for success of the attempted allocation.

    Note that alloc_bootmem_huge_page() is only used for order >= MAX_ORDER
    huge pages.

    I'll post a separate patch for mainline/stable, as the above mentioned
    "balance freeing" series renamed the next node to alloc function.

    Signed-off-by: Lee Schermerhorn
    Reviewed-by: Mel Gorman
    Reviewed-by: Andy Whitcroft
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Use the [modified] free_pool_huge_page() function to return unused
    surplus pages. This will help keep huge pages balanced across nodes
    between freeing of unused surplus pages and freeing of persistent huge
    pages [from set_max_huge_pages] by using the same node id "cursor". It
    also eliminates some code duplication.

    Signed-off-by: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Acked-by: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Free huge pages from nodes in round-robin fashion in an attempt to keep
    [persistent, a.k.a. static] hugepages balanced across nodes.

    New function free_pool_huge_page() is modeled on and performs roughly the
    inverse of alloc_fresh_huge_page(). Replaces dequeue_huge_page() which
    now has no callers, so this patch removes it.

    Helper function hstate_next_node_to_free() uses new hstate member
    next_to_free_nid to distribute "frees" across all nodes with huge pages.

    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Acked-by: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

10 Sep, 2009

1 commit


30 Jul, 2009

1 commit

  • As reported in Red Hat bz #509671, the i_blocks accounting for files on
    hugetlbfs goes wrong when doing something like:

    $ > foo
    $ date > foo
    date: write error: Invalid argument
    $ /usr/bin/stat foo
    File: `foo'
    Size: 0 Blocks: 18446744073709547520 IO Block: 2097152 regular
    ...

    This is because hugetlb_unreserve_pages() is unconditionally removing
    blocks_per_huge_page(h) on each call rather than using the freed amount.
    If there were 0 blocks, it goes negative, resulting in the above.

    This is a regression from commit a5516438959d90b071ff0a484ce4f3f523dc3152
    ("hugetlb: modular state for hugetlb page size")

    which did:

    - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed;
    + inode->i_blocks -= blocks_per_huge_page(h);

    so just put back the freed multiplier, and it's all happy again.
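
    The bogus Blocks value in the stat output above is plain unsigned
    wraparound; with 2 MB huge pages and 512-byte stat blocks,
    blocks_per_huge_page(h) is 4096, and 0 - 4096 wraps to 2^64 - 4096:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long i_blocks = 0;

            i_blocks -= 2 * 1024 * 1024 / 512;   /* blocks_per_huge_page(h) */
            printf("%llu\n", i_blocks);          /* 18446744073709547520    */
            return 0;
    }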

    Signed-off-by: Eric Sandeen
    Acked-by: Andi Kleen
    Cc: William Lee Irwin III
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     

24 Jun, 2009

1 commit


17 Jun, 2009

3 commits

  • A series of patches to enhance the /proc/pagemap interface and to add a
    userspace executable which can be used to present the pagemap data.

    Export 10 more flags to end users (and more for kernel developers):

    11. KPF_MMAP (pseudo flag) memory mapped page
    12. KPF_ANON (pseudo flag) memory mapped page (anonymous)
    13. KPF_SWAPCACHE page is in swap cache
    14. KPF_SWAPBACKED page is swap/RAM backed
    15. KPF_COMPOUND_HEAD (*)
    16. KPF_COMPOUND_TAIL (*)
    17. KPF_HUGE hugeTLB pages
    18. KPF_UNEVICTABLE page is in the unevictable LRU list
    19. KPF_HWPOISON hardware detected corruption
    20. KPF_NOPAGE (pseudo flag) no page frame at the address

    (*) For compound pages, exporting _both_ head/tail info enables
    users to tell where a compound page starts/ends, and its order.

    A simple demo of the page-types tool:

    # ./page-types -h
    page-types [options]
    -r|--raw Raw mode, for kernel developers
    -a|--addr addr-spec Walk a range of pages
    -b|--bits bits-spec Walk pages with specified bits
    -l|--list Show page details in ranges
    -L|--list-each Show page details one by one
    -N|--no-summary Don't show summary info
    -h|--help Show this usage message
    addr-spec:
    N one page at offset N (unit: pages)
    N+M pages range from N to N+M-1
    N,M pages range from N to M-1
    N, pages range from N to end
    ,M pages range from 0 to M
    bits-spec:
    bit1,bit2 (flags & (bit1|bit2)) != 0
    bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1
    bit1,~bit2 (flags & (bit1|bit2)) == bit1
    =bit1,bit2 flags == (bit1|bit2)
    bit-names:
    locked error referenced uptodate
    dirty lru active slab
    writeback reclaim buddy mmap
    anonymous swapcache swapbacked compound_head
    compound_tail huge unevictable hwpoison
    nopage reserved(r) mlocked(r) mappedtodisk(r)
    private(r) private_2(r) owner_private(r) arch(r)
    uncached(r) readahead(o) slob_free(o) slub_frozen(o)
    slub_debug(o)
    (r) raw mode bits (o) overloaded bits

    # ./page-types
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 487369 1903 _________________________________
    0x0000000000000014 5 0 __R_D____________________________ referenced,dirty
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000000000024 34 0 __R__l___________________________ referenced,lru
    0x0000000000000028 3838 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 48 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x000000000000002c 6478 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x0000000000000040 8344 32 ______A__________________________ active
    0x0000000000000060 1 0 _____lA__________________________ lru,active
    0x0000000000000068 348 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x000000000000006c 988 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 503 1 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 30 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types -r
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000000 468002 1828 _________________________________
    0x0000000100000000 19102 74 _____________________r___________ reserved
    0x0000000000008000 41 0 _______________H_________________ compound_head
    0x0000000000010000 188 0 ________________T________________ compound_tail
    0x0000000000008014 1 0 __R_D__________H_________________ referenced,dirty,compound_head
    0x0000000000010014 4 0 __R_D___________T________________ referenced,dirty,compound_tail
    0x0000000000000020 1 0 _____l___________________________ lru
    0x0000000800000024 34 0 __R__l__________________P________ referenced,lru,private
    0x0000000000000028 3794 14 ___U_l___________________________ uptodate,lru
    0x0001000000000028 46 0 ___U_l_______________________I___ uptodate,lru,readahead
    0x0000000400000028 44 0 ___U_l_________________d_________ uptodate,lru,mappedtodisk
    0x0001000400000028 2 0 ___U_l_________________d_____I___ uptodate,lru,mappedtodisk,readahead
    0x000000000000002c 6434 25 __RU_l___________________________ referenced,uptodate,lru
    0x000100000000002c 47 0 __RU_l_______________________I___ referenced,uptodate,lru,readahead
    0x000000040000002c 14 0 __RU_l_________________d_________ referenced,uptodate,lru,mappedtodisk
    0x000000080000002c 30 0 __RU_l__________________P________ referenced,uptodate,lru,private
    0x0000000800000040 8124 31 ______A_________________P________ active,private
    0x0000000000000040 219 0 ______A__________________________ active
    0x0000000800000060 1 0 _____lA_________________P________ lru,active,private
    0x0000000000000068 322 1 ___U_lA__________________________ uptodate,lru,active
    0x0001000000000068 12 0 ___U_lA______________________I___ uptodate,lru,active,readahead
    0x0000000400000068 13 0 ___U_lA________________d_________ uptodate,lru,active,mappedtodisk
    0x0000000800000068 12 0 ___U_lA_________________P________ uptodate,lru,active,private
    0x000000000000006c 977 3 __RU_lA__________________________ referenced,uptodate,lru,active
    0x000100000000006c 48 0 __RU_lA______________________I___ referenced,uptodate,lru,active,readahead
    0x000000040000006c 5 0 __RU_lA________________d_________ referenced,uptodate,lru,active,mappedtodisk
    0x000000080000006c 3 0 __RU_lA_________________P________ referenced,uptodate,lru,active,private
    0x0000000c0000006c 3 0 __RU_lA________________dP________ referenced,uptodate,lru,active,mappedtodisk,private
    0x0000000c00000068 1 0 ___U_lA________________dP________ uptodate,lru,active,mappedtodisk,private
    0x0000000000004078 1 0 ___UDlA_______b__________________ uptodate,dirty,lru,active,swapbacked
    0x000000000000407c 34 0 __RUDlA_______b__________________ referenced,uptodate,dirty,lru,active,swapbacked
    0x0000000000000400 538 2 __________B______________________ buddy
    0x0000000000000804 1 0 __R________M_____________________ referenced,mmap
    0x0000000000000828 1029 4 ___U_l_____M_____________________ uptodate,lru,mmap
    0x0001000000000828 43 0 ___U_l_____M_________________I___ uptodate,lru,mmap,readahead
    0x000000000000082c 382 1 __RU_l_____M_____________________ referenced,uptodate,lru,mmap
    0x000100000000082c 12 0 __RU_l_____M_________________I___ referenced,uptodate,lru,mmap,readahead
    0x0000000000000868 192 0 ___U_lA____M_____________________ uptodate,lru,active,mmap
    0x0001000000000868 12 0 ___U_lA____M_________________I___ uptodate,lru,active,mmap,readahead
    0x000000000000086c 800 3 __RU_lA____M_____________________ referenced,uptodate,lru,active,mmap
    0x000100000000086c 31 0 __RU_lA____M_________________I___ referenced,uptodate,lru,active,mmap,readahead
    0x0000000000004878 2 0 ___UDlA____M__b__________________ uptodate,dirty,lru,active,mmap,swapbacked
    0x0000000000001000 492 1 ____________a____________________ anonymous
    0x0000000000005008 2 0 ___U________a_b__________________ uptodate,anonymous,swapbacked
    0x0000000000005808 4 0 ___U_______Ma_b__________________ uptodate,mmap,anonymous,swapbacked
    0x000000000000580c 1 0 __RU_______Ma_b__________________ referenced,uptodate,mmap,anonymous,swapbacked
    0x0000000000005868 2839 11 ___U_lA____Ma_b__________________ uptodate,lru,active,mmap,anonymous,swapbacked
    0x000000000000586c 29 0 __RU_lA____Ma_b__________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
    total 513968 2007

    # ./page-types --raw --list --no-summary --bits reserved
    offset count flags
    0 15 _____________________r___________
    31 4 _____________________r___________
    159 97 _____________________r___________
    4096 2067 _____________________r___________
    6752 2390 _____________________r___________
    9355 3 _____________________r___________
    9728 14526 _____________________r___________

    This patch:

    Introduce PageHuge(), which identifies huge/gigantic pages by their
    dedicated compound destructor functions.

    Also move prep_compound_gigantic_page() to hugetlb.c and make
    __free_pages_ok() non-static.
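
    Once the series is in, the new KPF_HUGE bit (17 in the list above) can
    be read back per PFN from /proc/kpageflags; a minimal reader (needs
    root):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            unsigned long pfn;
            uint64_t flags;
            FILE *f;

            if (argc != 2)
                    return 1;
            pfn = strtoul(argv[1], NULL, 0);
            f = fopen("/proc/kpageflags", "rb");
            if (!f) {
                    perror("kpageflags");
                    return 1;
            }
            fseek(f, pfn * sizeof(flags), SEEK_SET);
            if (fread(&flags, sizeof(flags), 1, f) != 1) {
                    fclose(f);
                    return 1;
            }
            printf("pfn %lu: %shuge\n", pfn, (flags >> 17) & 1 ? "" : "not ");
            fclose(f);
            return 0;
    }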

    Signed-off-by: Wu Fengguang
    Cc: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Matt Mackall
    Cc: Alexey Dobriyan
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • num_online_nodes() is called in a number of places but most often by the
    page allocator when deciding whether the zonelist needs to be filtered
    based on cpusets or the zonelist cache. This is actually a heavy function
    and touches a number of cache lines.

    This patch stores the number of online nodes at boot time and updates the
    value when nodes get onlined and offlined. The value is then used in a
    number of important paths in place of num_online_nodes().

    [rientjes@google.com: do not override definition of node_set_online() with macro]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Pekka Enberg
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Callers of alloc_pages_node() can optionally specify -1 as a node to mean
    "allocate from the current node". However, a number of the callers in
    fast paths know for a fact their node is valid. To avoid a comparison and
    branch, this patch adds alloc_pages_exact_node() that only checks the nid
    with VM_BUG_ON(). Callers that know their node is valid are then
    converted.
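
    Roughly, the new helper looks like this (a sketch consistent with the
    description; not necessarily the exact upstream definition):

    /* Like alloc_pages_node(), minus the "nid == -1 means current node"
     * check: the caller vouches that nid is a real node, and VM_BUG_ON()
     * compiles away unless CONFIG_DEBUG_VM is set. */
    static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
    {
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
    }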

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Pekka Enberg
    Acked-by: Paul Mundt [for the SLOB NUMA bits]
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Dave Hansen
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 May, 2009

1 commit

  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302

    hugetlbfs reserves huge pages but does not fault them at mmap() time to
    ensure that future faults succeed. The reservation behaviour differs
    depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE.
    For MAP_SHARED mappings, hugepages are reserved when mmap() is first
    called and are tracked based on information associated with the inode.
    Other processes mapping MAP_SHARED use the same reservation.
    MAP_PRIVATE mappings track their reservations based on the VMA created
    as part of the mmap() operation. Each process mapping MAP_PRIVATE must
    make its own reservation.

    hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag
    and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs,
    VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened
    read-write. If a shared memory mapping was mapped shared-read-write for
    populating of data and mapped shared-read-only by other processes, then
    hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This
    causes processes to fail to map the file MAP_SHARED even though it should
    succeed as the reservation is there.

    This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE
    when the intent of the code was to check whether the VMA was mapped
    MAP_SHARED or MAP_PRIVATE.
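
    In flag terms (a self-contained model using the flag values from
    include/linux/mm.h; in the kernel the test is simply
    vma->vm_flags & VM_MAYSHARE):

    #include <stdio.h>

    #define VM_SHARED    0x00000008UL   /* MAP_SHARED and file opened R/W */
    #define VM_MAYSHARE  0x00000080UL   /* mapping was created MAP_SHARED */

    int main(void)
    {
            /* A hugetlbfs file mapped MAP_SHARED through a read-only fd: */
            unsigned long vm_flags = VM_MAYSHARE;   /* VM_SHARED not set */

            printf("VM_SHARED test:   %s\n", vm_flags & VM_SHARED
                   ? "shared" : "treated as private (the bug)");
            printf("VM_MAYSHARE test: %s\n", vm_flags & VM_MAYSHARE
                   ? "shared (correct)" : "private");
            return 0;
    }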

    Signed-off-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc:
    Cc: Lee Schermerhorn
    Cc: KOSAKI Motohiro
    Cc:
    Cc: Eric B Munson
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Apr, 2009

1 commit

  • chg is unsigned, so it cannot be less than 0.

    Also, since region_chg returns long, let vma_needs_reservation()
    forward this to alloc_huge_page(). Store it as a long as well; all
    callers cast it to long anyway.

    Signed-off-by: Roel Kluin
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     

12 Feb, 2009

1 commit

  • Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
    in line with the core VM by obeying VM_NORESERVE and not reserving
    hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
    are specified. However, it is still taking filesystem quota
    unconditionally.

    At fault time, if there are no reserves, an attempt is made to allocate
    the page and account for filesystem quota. If either fails, the fault
    fails. The impact is that quota is getting accounted for twice. This
    patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To
    help prevent this mistake happening again, it improves the
    documentation of hugetlb_reserve_pages().

    Reported-by: Andy Whitcroft
    Signed-off-by: Mel Gorman
    Acked-by: Andy Whitcroft
    Signed-off-by: Linus Torvalds

    Mel Gorman